Without getting into the details yet: it aims to make web data collection a little bit easier for non-devs. I'll soon have an MVP and will start pitching to investors: aiming for an open-source business model (after a few months of stealth development) and eventually a typical SaaS offering for extra functionality.
At this point I'm trying to consolidate and counter the steel-man counter-arguments I should expect from investors. The most obvious one: as one can imagine, the product is not magic and, after a certain point, it does require some manual work from the customer, so this is an aspect I should prepare for.
I have done some preliminary analysis of the space of potential competitors (think import.io, Apify, Zyte/ScrapingHub, etc.) and described opportunities for differentiation. What I'm afraid of is getting sidetracked in a discussion of "um, this is web scraping and it's hard to make a business on top of it".
I understand that there's not much context now and one could easily say "well yeah, anything could be possible with a good team, product...", but I'm reaching out to the HN community to gather some considerations, mental models and pointers I may not think of myself at this point.
Monopolies, lobbying and protectionism got in the way of keeping the web truly machine readable. There's tremendous value in restoring some of it.
It is an arms race, since many people don’t want you to scrape. We tried hard to respect robots.txt, but we still got angry cease-and-desist emails from people who’d malformed or misconfigured the file.
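For what it's worth, checking robots.txt before each fetch is cheap to do. A minimal sketch using Python's standard `urllib.robotparser`; the bot name, URLs and sample file are made up for illustration, and a real crawler would download the file from the target site instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this per site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# "example-bot" is an assumed user-agent name.
allowed = parser.can_fetch("example-bot", "https://example.com/products/1")
blocked = parser.can_fetch("example-bot", "https://example.com/private/report")
delay = parser.crawl_delay("example-bot")  # seconds to wait between requests
```

This does nothing about a malformed or misconfigured file, which is exactly the case that produced the angry emails, so logging what the parser decided for each site is probably worth the effort.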
You will have a scale problem: it’s a lot of data. You’ll have parsing problems: live HTML is about the dirtiest data set I’ve ever seen. Refresh rate can be a major competitive advantage: how often can you scrape, store, diff, and report? These days you’ll need first-class JavaScript execution to catch dynamic content.
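The "scrape, store, diff, and report" loop can be sketched roughly as below. This assumes each scrape has already been reduced to plain text; the function names and sample data are illustrative only, and a production system would store fingerprints in a database, not compare strings in memory:

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of a snapshot, cheap to store and compare."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(old: str, new: str) -> list[str]:
    """Return added (+) and removed (-) lines, or [] if nothing changed."""
    if fingerprint(old) == fingerprint(new):
        return []  # fast path: identical content, skip the line diff
    delta = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    # Drop the two file-header lines and the "@@" hunk markers.
    return [line for line in delta[2:] if not line.startswith("@@")]


# Hypothetical snapshots of the same page on two consecutive scrapes.
yesterday = "Widget A: $10\nWidget B: $20"
today = "Widget A: $10\nWidget B: $25\nWidget C: $5"

changes = diff_snapshots(yesterday, today)
```

Hashing first keeps the common "nothing changed" case cheap, which matters once the refresh rate goes up.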
But the biggest problem isn’t the scraping tech, it’s the use case: what use cases are you going to offer your early users? You don’t mention this in your post, and it will non-trivially affect what you scrape and how you report it. I’d encourage you to find users who have business problems that can be solved by paying money for scraping. Otherwise you’ll be another interesting open source tool that no one’s figured out how to monetize. Do this _before_ you talk to investors or take their money.
* What features are common among my competitors?
* What features are unique?
* Who are target customers and users? Is there any overlap, or do some competitors target unique market segments?
This last question ties in to a discussion I was having with a friend recently. In B2B sales, your customers are businesses, but your users are people in those businesses with certain roles and responsibilities. Understanding the difference is key, because you will often need to develop your sales and marketing strategies based on the business/customer profile, but your UX will depend on the needs of the users within those businesses.
In my opinion you are more likely to be successful if you can get an initial foothold in a market by identifying a specific target of customers and users, solving their use case very well, developing a moat, and then growing out from that foothold to provide a wider set of options. Web scraping is just a tool. You need to find businesses who can gain value from scraping or from scraped data. Are there businesses who, for whatever reason, would not be able to adopt one of your competitors' products, or would find that adoption difficult? Maybe you could specialise in scraping a particular kind of data, or providing a full-stack solution for companies with limited in-house technical expertise (like some kind of consulting, you hop on a call with the client, they tell you what they want to scrape, and you set up a hosted solution which provides a SQL or Excel interface to the data).
In short, successful product development is all about understanding customer and user pain and needs. If you can find pains or needs which are a common theme for a particular demographic of companies and roles, you can work with those people to understand their problems and make a product which is very valuable to them.
Instead, think about what people want to do with the data. For example, if you are going to scrape diamond prices, don’t try to sell that feed. Set up a website with a UI so people can research diamond prices, and get alerts when specific thresholds are met or items come in stock. Monetize with ads.
Recently, there has been a boom of "anti-bot" services. These are essentially SaaS businesses that "protect" websites from being scraped by automated software. As you onboard the first customer who wants to extract data from a bot-protected website, you are going to run into an unlimited waterfall of stupid troubles. Your bots will be blocked, will consume excessive amounts of data, and will kill your CPU/GPU performance.
I have shared some highlights on how to bypass these recently on HN [1], but it is sadly only the tip of the iceberg. On the other hand, since the post was featured on HN I have been contacted by more than 50 companies and individuals whose business operating model is based solely on data extraction/automated scraping. These are (in my opinion) successful companies, and two of them are part of YC.
There are a handful of companies doing very well with models similar to what you’re describing. I can’t mention specific customers, but I see some of them doing very large scraping volume through our network.
It’s an industry where having a good product is more important than the amount you’re spending on marketing. If developers are happy with your product they’ll take it with them to future companies/projects and share it with colleagues.
It can be a cat-and-mouse development cycle where the sites you target break your functionality and you’ll have customers who will want fixes to be implemented ASAP because they rely on your tools to make money.
I don’t know what you’re building exactly, but keep in mind there’s a good chance that you’ll need to commit to long-term, continuous, rapid development cycles if you want to retain customers.
Best of luck!
I wrote the scraping code. Had a list of sites and macros for extracting quotes, updated every day to every customer. If one quit working (the site attempted to prevent scraping) the app would use another and give a notice back to me. I'd tweak the macro for that site, and we'd be back scraping it the next day.
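The failover logic was roughly of this shape. This is a reconstruction in Python for illustration, not the original code; the source names and extractor functions are hypothetical:

```python
from typing import Callable, Optional

# An extractor returns a quote, returns None, or raises when the site changed.
Extractor = Callable[[], Optional[float]]


def first_working_quote(
    sources: dict[str, Extractor],
) -> tuple[Optional[float], list[str]]:
    """Try each source in order; return the first quote and the names that failed."""
    failed: list[str] = []
    for name, extract in sources.items():
        try:
            quote = extract()
        except Exception:
            quote = None  # treat any scraping error as "this source is down"
        if quote is not None:
            return quote, failed
        failed.append(name)
    return None, failed


def broken_site() -> Optional[float]:
    # Simulates a site that changed its layout to block scraping.
    raise ValueError("layout changed")


def backup_site() -> Optional[float]:
    return 101.5


quote, needs_fixing = first_working_quote(
    {"site_a": broken_site, "site_b": backup_site}
)
```

The list of failed sources is what would go back to whoever maintains the macros, so the broken one can be tweaked before the next run.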
We eventually hired a finance student (Josh Hatwich, now a fellow at Adobe) to parse a Comstock satellite feed we put on the roof. That ended the era of scraping at StockPoint.
1. Changes in data structures. If some site randomly decides to alter the format of the JSON/XML objects for their frontend API, it may break your scraper and anything that relies on that scraper’s output.
2. Security controls like rate limiting, CAPTCHAs, IP blacklisting, auth systems.
3. HTML which is rendered via complicated client-side JavaScript blobs or WebSockets. You’ll need a headless browser, driven by something like Selenium, and some site-specific parsing logic.
4. Legal issues.
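For point 1, one cheap safeguard is to check every payload against the fields you expect before anything downstream consumes it. A minimal sketch, assuming a hypothetical product record with `id`, `name` and `price` fields (the schema is invented for the example):

```python
import json

# Hypothetical schema for one scraped record: field name -> expected type.
EXPECTED_FIELDS = {"id": int, "name": str, "price": float}


def schema_problems(payload: str) -> list[str]:
    """Return a list of human-readable mismatches; empty means the shape is as expected."""
    record = json.loads(payload)
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# The site renamed "name" to "title" and started sending price as a string.
ok = schema_problems('{"id": 1, "name": "Widget", "price": 9.5}')
drifted = schema_problems('{"id": 1, "title": "Widget", "price": "9.50"}')
```

Failing loudly at this step is usually better than letting a silently changed field propagate into whatever relies on the scraper's output.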
"But what about Google?" Google is worth 100 billion dollars and can play by completely different rules than scraping startups.
I think you're missing a step, which is where _you_ answer if web scraping can be a viable business model. You're attempting to convince VCs with logic (and a few assumptions), but there's an easier (or harder) way to do it... convince them by making money.
Most VCs aren't ideologues, and don't have an opinion about business models. They will be convinced if you simply show them you're making money. It's not their job to decide if an idea can make money or not; that's your job as a founder.
I applied to YC twice. The first time we spent the 10 minutes talking about if it could make money or not, and never got anywhere good. We got rejected. The second time we were making money, the conversation was smoother, and we got in. It's so much easier to be able to replace "I believe" with "our customers believe". It changes the conversation completely. You don't need to be making billions of dollars; just enough to show that people want what you're making!
tl;dr You're trying to convince VCs when you need to be convincing customers!
(For the record, a lot of what I said here is very money-driven and that's not how I build my company. However, in the context of VCs, which are purely financially driven, it's how you should be thinking about it.)
Good luck, and let me know if I can help! My email is in my profile if you want to talk!
The more specialized you can get, the better your chances of success, imo. Generic web scrapers are a dime a dozen.
The service itself will always be in flux because of how freeform hypertext is as a schema. So many other comments here reflect that better than I could.
The fact is that any chunk of data you're handing over to clients still needs to be handled by their team and in my experience reality often falls short of expectations. If you can somehow deliver them something cleaner (or even something that can help them reach conclusions faster), then you have a product with a high value prop.
They crawled the data, but also had a services component to do something with the data. Like, they had contracts with pharma companies to search for indications that a page was selling counterfeit drugs.
I'm not sure of the exact details of why they didn't make it.
Also, I'm thinking about Recorded Future (https://www.recordedfuture.com). They do something like this -- again, the mechanics of scraping, and a services component for analysis.
The value is in the decisions that can be made based on the data being sold rather than in the method you use to extract it. If you focus on the value of the data and who needs it, you’ll likely find a viable business model.
Many, many websites contain legal language that forbids automatic data collection/scraping. How can a business be built in such a case?
Perhaps OP's tool only scrapes a select few sites that don't prohibit scraping, but that seems like the exception, not the norm.
But I think there is a huge opportunity in scraping data and then doing something interesting with it. Google is the most obvious example of this type of company. But, for example, certain CRM companies are more about data scraping than working with user-provided data.
For a while I ran product for a social monitoring company and our traditional user base was brands, agencies, etc. who would use our giant database of public content to do market research, etc. At various points we would get inbound requests from someone with a unique ask - I recall:
- a military historian working on a government grant who wanted to analyze the social media activities of various militias in a particular part of the world
- several pharma companies looking for adverse drug reaction reports online
- hedge funds looking for deep sentiment trends in particular areas for perception of certain businesses
- some company looking to find properties where women made announcements that they were pregnant.
And then there’s always the requests for X but in Y language/country. “There’s a Twitter like service in Bangladesh, can you get that data?”.
All of these people had money to spend and specific interests - we couldn’t help most of them as the economics didn’t work out in terms of building a scalable business, but if you can find a niche and run things lean, there’s a real long tail of opportunity there.
Indexed & web-scraped data is worth Y.
A searchable index of web-scraped data is worth X^Y.
I think you'll have a lot more luck finding 2-3 initial customers before you try to raise money. It's always easier to explain what your product does, and who the target market is, in terms of actual customers instead of hypotheticals.
Remember, the goal is to build a business. If you fall in love with the code, it's too easy to build something that you enjoy working on but that has limited commercial value.
> after a certain point it does require some manual work from the customer
Once you gain traction, you can become a platform, an intermediary between customers and engineers who will fine-tune scraping to what the customer needs. This could either be some sort of "Solution Engineer" that the company hires, or you could open it up to outside developers who get paid per integration (either by you or by the customer, or both). There's a solution to every problem.
As far as the business itself, I think you could be on to something. Of course, ideas are cheap and it's the execution that counts, but here's how I'd think about it: with scraping, every website on the web has an API. Before, only 0.1% of websites had an API.
And certainly wouldn't hurt to change the "scraping" word – such an ugly word.
Please consider helping us support the EFF's actions. Outside the obvious vested interests of our business, I truly believe scraping the web is a force for good and progress. And the EFF's work in ensuring web scraping stays a legal practice in the United States has been outstanding. [1]
[1] https://www.eff.org/deeplinks/2021/07/eff-ninth-circuit-rece...
https://www.screendaily.com/features/how-uk-data-company-app...
Applaudience’s algorithms trawl through every exhibitor website, looking at every showtime of every film, and track the auditorium layout as each seat flips from available (unsold) to unavailable (sold).
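The seat-flip idea boils down to comparing two snapshots of the same showtime. A rough sketch of that comparison, purely illustrative and not Applaudience's actual method; the seat IDs and the dict-of-booleans representation are assumptions:

```python
def newly_sold(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Seats that were available in `before` and are unavailable in `after`."""
    return {
        seat
        for seat, available in before.items()
        if available and after.get(seat) is False
    }


# Hypothetical seat maps for one showtime: True means available (unsold).
snapshot_10am = {"A1": True, "A2": True, "A3": False, "B1": True}
snapshot_11am = {"A1": False, "A2": True, "A3": False, "B1": False}

sold_in_last_hour = newly_sold(snapshot_10am, snapshot_11am)
tickets_sold = len(sold_in_last_hour)
```

Seats missing from the later snapshot are ignored here, on the assumption that a missing seat means a scraping gap and not a sale.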
>the product it's not magic and, after a certain point it does require some manual work from the customer, hence this is an aspect I should prepare for.
Can you make it magic or maybe develop end to end solutions for your first "customers" using your product? That sounds like the schlep you need to do.
Sounds promising! Find yourself a customer or two!
If you really want to go the open source route, just focus on that and then see if people pick it up and use it. Then you'd offer the SaaS.
Consider, apart from a tool for general purpose scraping, what information a specialized scraper might obtain for a valuable but underserved industry that can profit from the data.
I have a buddy that scrapes data specifically for the tanker/shipping industry, for example.
General purpose scraping will involve a lot of competition and a bit of an arms race. Niche scraping lets you fly a little below the radar.
Search engines deliberately wiped out personal websites, blogs, small news organisations etc, and spammers drowned out the remaining real user generated content from the www. Social media ate the forums and closed the doors.
Websites now aren't websites but businesses, and they don't like people snooping around.
Make sure your pricing is clear so the profit calculation for the customer is transparent.
You then have a tangible product line you can pitch to investors regardless of whether they can appreciate the more abstract solution/platform.
And I could imagine this being a paid pro feature down the line.
You’ll likely have to do more manual data cleaning than you expect, and get some amount of pushback from the sources you’re crawling (depending how commercially valuable the data is).
Are there a couple thousand people who would pay for a SaaS offering? Then it's a business. The real goal would be identifying a hair on fire problem that you are in a unique position to solve. That's always the problem, and it has nothing to do with web scraping in particular.
(I was CTO of hiQ (technically still am I think))
Contact me if you want to chat about scraping: danbmil99 at gmail.
Re: URL Database for Brand Safety & Contextual Targeting
Would think this is a legal concern...
Makes the founder a hefty sum.
They are a profitable startup and have an SDK for scaling web-scraping bots across many business fields.
I'm an advisor/investor in Datallog. You can connect with the founder at https://www.linkedin.com/in/joelder-maragno-arcaro/