HACKER Q&A
📣 fforflo

Can web scraping be the basis of a viable business model?


I'm a data engineer at heart, and I never did or enjoyed front-end work. Having said that, I was always happy to code and evolve crawlers and web scrapers. Now I've taken some time off from work and gigs and I'm working on a side project I've been hacking on for some time.

Without getting into the details yet: it aims to make web data collection a little bit easier for non-devs. I'll soon have an MVP and will start pitching to investors: aiming for an open-source business model (after a few months of stealth development) and eventually a typical SaaS offering for extra functionality.

At this point I'm trying to consolidate and counter the steel-man counter-arguments I should expect from investors. The most obvious one: as one can imagine, the product is not magic and, after a certain point it does require some manual work from the customer, hence this is an aspect I should prepare for.

I have done some preliminary analysis of the space of potential competitors (think import.io, Apify, Zyte/ScrapingHub, etc.) and described opportunities for differentiation. What I'm afraid of is getting sidetracked in a discussion of "um, this is web scraping and it's hard to make a business on top of it".

I understand that there's not much context now and one could easily say "well yeah, anything could be possible with a good team, product...", but I'm reaching out to the HN community to gather some considerations, mental models and pointers I may not think of myself at this point.


  👤 apienx Accepted Answer ✓
Google, a trillion dollar company, is essentially the world's largest web scraper. So...yes! You'll almost certainly find a way to monetize that.

Monopolies, lobbying and protectionism got in the way of keeping the web truly machine readable. There's tremendous value in restoring some of it.


👤 sonofhans
I worked at AboutUs.org for a while, and that’s what we did. Good news: it was fun and rewarding. In many ways it felt like a satisfying old-skool problem: scrape, find edge case, patch it, scrape again. We were scraping 100 million domains once a week with a team of six engineers, one UX (me), and Ward Fucking Cunningham as wiki expert. Ward in particular was great at prototyping solutions.

It is an arms race, since many people don’t want you to scrape. We tried hard to respect robots.txt, but we still got angry cease-and-desist emails from people who’d malformed or misconfigured the file.
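For anyone wondering what "respecting robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body, the `examplebot` user agent and the URLs are made up for illustration; a real crawler would fetch the file from each site and would also have to decide what to do when the file is malformed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download this per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "examplebot", "https://example.com/public/page"))
print(may_fetch(parser, "examplebot", "https://example.com/private/page"))
print(parser.crawl_delay("examplebot"))
```

Checking the rules before every request is cheap, and honoring `Crawl-delay` where it is given is one of the easier ways to stay on the polite side of the arms race.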

You will have a scale problem: it’s a lot of data. You’ll have parsing problems: live HTML is about the dirtiest data set I’ve ever seen. Refresh rate can be a major competitive advantage: how often can you scrape, store, diff, and report? These days you’ll need first-class JavaScript execution to catch dynamic content.
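The "scrape, store, diff, and report" loop can be sketched roughly as below. This is only an illustration, assuming each crawl is stored as a mapping from URL to a content fingerprint; the function names and the sample data are hypothetical, and a production system would diff extracted fields rather than whole pages.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable fingerprint of a page body, so snapshots store hashes, not HTML."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two crawls, each a mapping of URL -> fingerprint."""
    added = sorted(url for url in current if url not in previous)
    removed = sorted(url for url in previous if url not in current)
    changed = sorted(
        url for url in current
        if url in previous and current[url] != previous[url]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical crawl results from two consecutive runs.
last_week = {"https://example.com/a": fingerprint("old"), "https://example.com/b": fingerprint("same")}
this_week = {"https://example.com/a": fingerprint("new"), "https://example.com/b": fingerprint("same"),
             "https://example.com/c": fingerprint("fresh")}
print(diff_snapshots(last_week, this_week))
```

The shorter the interval between two such snapshots, the fresher the report, which is where the refresh-rate advantage comes from.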

But the biggest problem isn’t the scraping tech, it’s the use case: what use cases are you going to afford your early users? You don’t mention this in your post, and it will non-trivially affect what you scrape and how you report it. I’d encourage you to find users who have business problems that can be solved by paying money for scraping. Otherwise you’ll be another interesting open source tool that no one’s figured out how to monetize. Do this _before_ you talk to investors or take their money.


👤 wcerfgba
You've already identified some of your competitors. You should go into more detail and try to answer:

* What features are common among my competitors?

* What features are unique?

* Who are target customers and users? Is there any overlap, or do some competitors target unique market segments?

This last question ties in to a discussion I was having with a friend recently. In B2B sales, your customers are businesses, but your users are people in those businesses with certain roles and responsibilities. Understanding the difference is key, because you will often need to develop your sales and marketing strategies based on the business/customer profile, but your UX will depend on the needs of the users within those businesses.

In my opinion you are more likely to be successful if you can get an initial foothold in a market by identifying a specific target of customers and users, solving their use case very well, developing a moat, and then growing out from that foothold to provide a wider set of options. Web scraping is just a tool. You need to find businesses who can gain value from scraping or from scraped data. Are there businesses who, for whatever reason, would not be able to adopt one of your competitors' products, or would find that adoption difficult? Maybe you could specialise in scraping a particular kind of data, or providing a full-stack solution for companies with limited in-house technical expertise (like some kind of consulting, you hop on a call with the client, they tell you what they want to scrape, and you set up a hosted solution which provides a SQL or Excel interface to the data).

In short, successful product development is all about understanding customer and user pain and needs. If you can find pains or needs which are a common theme for a particular demographic of companies and roles, you can work with those people to understand their problems and make a product which is very valuable to them.


👤 edmundsauto
Yes, although I would encourage you to think about something higher on the value chain than raw data feeds. Those exist and have become an increasingly difficult market to compete in. You can buy a custom feed for like $250/mo.

Instead, think about what people want to do with the data. For example, if you are going to scrape diamond prices, don’t try to sell that feed. Set up a website with a UI so people can research diamond prices, and get alerts when specific thresholds are met or items come in stock. Monetize with ads.


👤 proszkinasenne2
Sure, it can be! Also, as some people have already pointed out, this is often a gray area where people go beyond violating ToS. Some good examples are privacy violations (scraping personal data), credential stuffing, etc.

Recently, there has been a boom of "anti-bot" services. These are essentially SaaS businesses that "protect" websites from being scraped by automated software. As you onboard the first customer who wants to extract data from a bot-protected website, you are going to run into an unlimited waterfall of stupid troubles. Your bots will be blocked, will consume excessive amounts of data, and will kill your CPU/GPU performance.

I have shared some highlights on how to bypass these recently on HN [1], but it is sadly only the tip of the iceberg. On the other hand, since the post was featured on HN I have been contacted by more than 50 companies and individuals whose business operating model is based solely on data extraction/automated scraping. These are (in my opinion) successful companies, and two of them are part of YC.

[1] https://news.ycombinator.com/item?id=29060272


👤 rdbell
I co-founded https://packetstream.io

There are a handful of companies doing very well with models similar to what you’re describing. I can’t mention specific customers, but I see some of them doing very large scraping volume through our network.

It’s an industry where having a good product is more important than the amount you’re spending on marketing. If developers are happy with your product they’ll take it with them to future companies/projects and share it with colleagues.

It can be a cat-and-mouse development cycle where the sites you target break your functionality, and you’ll have customers who will want fixes implemented ASAP because they rely on your tools to make money.

I don’t know what you’re building exactly, but keep in mind there’s a good chance that you’ll need to commit to long-term, continuous, rapid development cycles if you want to retain customers.

Best of luck!


👤 JoeAltmaier
Did it back in the old days, scraping stock quotes to build a database for display by our Java app and web services. Called NetProphet, it would do a score of trend lines etc as overlays.

I wrote the scraping code. Had a list of sites and macros for extracting quotes, updated every day to every customer. If one quit working (the site attempted to prevent scraping) the app would use another and give a notice back to me. I'd tweak the macro for that site, and we'd be back scraping it the next day.
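The fallback behavior described here (try one site, move on to the next if it stops working, and report which ones failed) could look something like the sketch below. It is not the original NetProphet code; the extractor functions, site names and prices are invented for illustration.

```python
from typing import Callable, Optional

# An extractor takes a ticker symbol and returns a price, or None when the
# site's layout no longer matches what the extractor expects.
Extractor = Callable[[str], Optional[float]]


def first_working_quote(
    symbol: str, sources: list[tuple[str, Extractor]]
) -> tuple[Optional[float], list[str]]:
    """Try each source in order; return the first price plus the names that failed."""
    failed: list[str] = []
    for name, extract in sources:
        try:
            price = extract(symbol)
        except Exception:
            # Treat any extractor error as "this site changed, try the next one".
            price = None
        if price is not None:
            return price, failed
        failed.append(name)
    return None, failed


# Hypothetical extractors standing in for per-site scraping macros.
def broken_site(symbol: str) -> Optional[float]:
    raise ValueError("layout changed")


def working_site(symbol: str) -> Optional[float]:
    return {"ACME": 12.5}.get(symbol)


price, failed = first_working_quote("ACME", [("site-a", broken_site), ("site-b", working_site)])
print(price, failed)
```

The list of failed names is the part that matters operationally: it is the notice that tells someone which extractor needs tweaking.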

We eventually hired a finance student (Josh Hatwich, now a fellow at Adobe) to parse a Comstock satellite feed we put on the roof. That ended the era of scraping at StockPoint.


👤 bdcravens
Yes, but scraping is a small part of the overall puzzle. As developers, we overestimate how valuable tools are (as opposed to solutions). I think the better opportunity is not to be another scraping-as-a-service provider, but to niche down to a solution that uses your scraping technology.

👤 rank0
Some major issues from my experience web scraping:

1. Changes in data structures. If some site randomly decides to alter the format of the JSON/XML objects in its frontend API, it may break your scraper and anything that relies on that scraper’s output.

2. Security controls like rate limiting, CAPTCHAs, IP blacklisting, auth systems.

3. HTML which is rendered via complicated client-side JavaScript blobs or WebSockets. You’ll need a headless browser driven by something like Selenium, plus some site-specific parsing logic.

4. Legal issues.
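For point 1, one way to avoid silently passing bad data downstream is to check every scraped record against the shape you expect and fail loudly when it drifts. A minimal sketch, where the field names and types are purely hypothetical:

```python
# Hypothetical shape of one record from a site's frontend API.
REQUIRED_FIELDS = {"id": int, "title": str, "price": float}


def schema_problems(record: dict, required: dict = REQUIRED_FIELDS) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field, expected in required.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


good = {"id": 1, "title": "Widget", "price": 9.99}
drifted = {"id": 1, "name": "Widget", "price": "9.99"}  # field renamed, price now a string
print(schema_problems(good))
print(schema_problems(drifted))
```

Running a check like this on each batch means a format change shows up as an alert on your side before it shows up as a complaint from whoever consumes the output.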


👤 bjourne
Web scraping is a legal gray area in many or most jurisdictions. In some jurisdictions, depending on the ToS of the website itself, scraping it might be illegal. In others, republishing the scraped information in any form might be illegal. In others still, you might not be allowed to use the scraped data for any commercial purpose.

"But what about Google?" Google is worth 100 billion dollars and can play by completely different rules than scraping startups.


👤 max002
From what I know, that's how Skyscanner started. Not sure if it's very popular in the US, but it is in Europe. Now they're paying for data or have deals for it, but they started by scraping the hell out of airline sites.

👤 gkoberger
Hey! Hopefully this comment comes across as helpful rather than hurtful. I'm also the founder of a developer tool, and it's hard to raise money for! (I know you said yours isn't strictly a developer tool, but you mentioned open source so I think it's fair to assume it falls in this category.)

I think you're missing a step, which is where _you_ answer if web scraping can be a viable business model. You're attempting to convince VCs with logic (and a few assumptions), but there's an easier (or harder) way to do it... convince them by making money.

Most VCs aren't ideologues, and don't have an opinion about business models. They will be convinced if you simply show them you're making money. It's not their job to decide if an idea can make money or not; that's your job as a founder.

I applied to YC twice. The first time we spent the 10 minutes talking about if it could make money or not, and never got anywhere good. We got rejected. The second time we were making money, the conversation was smoother, and we got in. It's so much easier to be able to replace "I believe" with "our customers believe". It changes the conversation completely. You don't need to be making billions of dollars; just enough to show that people want what you're making!

tl;dr You're trying to convince VCs when you need to be convincing customers!

(For the record, a lot of what I said here is very money-driven and that's not how I build my company. However, in the context of VCs, which are purely financially driven, it's how you should be thinking about it.)

Good luck, and let me know if I can help! My email is in my profile if you want to talk!


👤 tommoor
ClearBit and Plaid haven't been mentioned yet; both are examples of multi-million/billion dollar businesses built on the back of scraping.

The more specialized you can get, the better your chances of success imo. Generic web scrapers are a dime a dozen.


👤 ChikkaChiChi
Scraping services on their own are a viable business product, but the power to assign metadata and contextualization is where the unicorn lies.

The service itself will always be in flux because of how freeform hypertext is as a schema. So many other comments here reflect that better than I could.

The fact is that any chunk of data you're handing over to clients still needs to be handled by their team and in my experience reality often falls short of expectations. If you can somehow deliver them something cleaner (or even something that can help them reach conclusions faster), then you have a product with a high value prop.


👤 xedarius
I went for an interview once at a hedge fund. There were a surprising number of questions about web scraping. I very much got the feeling it was an active and ongoing problem. So yes, I do think there’s a business in there.

👤 deanebarker
There was a company in my city that did something like this. They didn't survive.

They crawled the data, but also had a services component to do something with the data. Like, they had contracts with pharma companies to search for indications that a page was selling counterfeit drugs.

I'm not sure of the exact details of why they didn't make it.

Also, I'm thinking about Recorded Future (https://www.recordedfuture.com). They do something like this -- again, the mechanics of scraping, and a services component for analysis.


👤 tixocloud
There are many data providers serving the financial services/capital markets sector that provide everything from raw data feeds to insights based off of web-scraped data. What will be important for you is to identify a good niche, understand which part of the data value chain your customer needs, and deliver the data in a way that fits their workflow.

The value is in the decisions that can be made based on the data being sold rather than the method by which you extract it. If you focus on the value of the data and who needs it, you’ll likely find a viable business model.


👤 adinosaur123
I don't wish to hijack this thread, but I've been pondering a similar question. I've been working on a product that requires a very large amount of data that, as far as I can tell, can only be gathered by scraping (real estate data - even data vendors like estated.com don't have stuff like sales data).

Many, many websites contain legal language that forbids automatic data collection/scraping. How can a business be built in such a case?

Perhaps OP's tool only scrapes a select few sites that don't prohibit scraping, but that seems like the exception, not the norm.


👤 jamesmishra
There is a lot of competition when it comes to building a pure-play data scraping company. There are also various regulatory concerns about scraping various types of data--PII like phone numbers, biometric data like images, or data about concert ticket prices.

But I think there is a huge opportunity in scraping data and then doing something interesting with it. Google is the most obvious example of this type of company. But, for example, certain CRM companies are more about data scraping than working with user-provided data.


👤 mattzito
There is always value in having a store of data that is unique or differentiated in some way. The trick is figuring out what that is and who might be interested.

For a while I ran product for a social monitoring company and our traditional user base was brands, agencies, etc. who would use our giant database of public content to do market research, etc. At various points we would get inbound requests from someone with a unique ask - I recall:

- a military historian working on a government grant who wanted to analyze the social media activities of various militias in a particular part of the world

- several pharma companies looking for adverse drug reaction reports online

- hedge funds looking for deep sentiment trends in particular areas for perception of certain businesses

- some company looking to find properties where women made announcements that they were pregnant.

And then there’s always the requests for X but in Y language/country. “There’s a Twitter like service in Bangladesh, can you get that data?”.

All of these people had money to spend and specific interests - we couldn’t help most of them as the economics didn’t work out in terms of building a scalable business, but if you can find a niche and run things lean, there’s a real long tail of opportunity there.


👤 arinze11
Hi, I am the CEO of http://webautomation.io, the largest marketplace for no-code web scrapers. I can tell you firsthand that the market you are going after is very big and getting bigger every day. We get thousands of businesses and people signing up every month. So my advice is not to worry, as I am sure you will find a viable business with your proposed product.

👤 MR4D
Web-scraped data is worth X.

Indexed & web-scraped data is worth Y.

A searchable index of web-scraped data is worth X^Y.


👤 gwbas1c
Are you falling in love with the code?

I think you'll have a lot more luck finding 2-3 initial customers before you try to raise money. It's always easier to explain what your product does, and who the target market is, in terms of actual customers; instead of hypotheticals.

Remember, the goal is to build a business. If you fall in love with the code, it's too easy to build something you enjoy working on, but has limited commercial value.


👤 thruway516
Never mind business, a web-scraping command-line utility as comprehensive and easy to use as say curl would be something. I would even pay for that.

👤 max002
My only advice would be to have a backup plan (redundancy), i.e. design a basic version that works and doesn't get blocked as a bot, then design another one. This will save you from situations like the ones described elsewhere in this thread, where your original method stops working but your client wants data now (because they pay for it). And be nice: don't take data you don't need. Keep it easy on servers.
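"Keep it easy on servers" can be as simple as enforcing a minimum gap between requests to the same host. The sketch below is one way to do that; the `HostThrottle` class and its parameters are made up for illustration, and the injectable `clock` and `sleep` arguments are there only so the behavior can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: dict[str, float] = {}  # host -> time of the last permitted request

    def wait(self, url: str) -> float:
        """Block until the URL's host may be hit again; return the seconds waited."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self._min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


# Usage: call wait() right before each request.
throttle = HostThrottle(min_interval=2.0)
throttle.wait("https://example.com/page-1")  # first hit on this host: no delay
```

Throttling per host rather than globally keeps a large crawl moving while still spacing out the load on any single site.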

👤 erichocean
Scraping isn't the problem, it's getting access to the pages to scrape at high-enough volume, at low-enough cost.

👤 shtopointo
And to answer your question, because my other post was a question itself:

> after a certain point it does require some manual work from the customer

Once you gain traction, you can become a platform, intermediary between customers and engineers that will fine tune scraping to what the customer needs. This could either be some sort of "Solution Engineer" that the company hires, or open it up to outside developers that get paid per integration (either by you or by the customer, or both). There's a solution to every problem.

As far as the business itself, I think you could be on to something. Of course, ideas are cheap and it's the execution that counts, but here's how I'd think about it: with scraping, every website on the web has an API. Before, only 0.1% of websites had an API.

And certainly wouldn't hurt to change the "scraping" word – such an ugly word.


👤 test0account
Is Google not just one massive web scraper?

👤 hartator
I am the CEO of https://serpapi.com.

Please consider helping us support the EFF's actions. Outside the obvious vested interests of our business, I truly believe scraping the web is a force for good and progress. And the EFF's work in ensuring web scraping stays a legal practice in the United States has been outstanding. [1]

[1] https://www.eff.org/deeplinks/2021/07/eff-ninth-circuit-rece...


👤 mukundesh
Not sure if this company survived the pandemic, but check out Applaudience; it was crawling seat-level data from event websites.

https://www.screendaily.com/features/how-uk-data-company-app...

Applaudience’s algorithms trawl through every exhibitor website, looking at every showtime of every film, and tracks the auditorium layout as each seat flips from available (unsold) to unavailable (sold).


👤 sjg007
It is valuable. There's a lot of Robotic Process Automation, competitive analysis etc...

>the product is not magic and, after a certain point it does require some manual work from the customer, hence this is an aspect I should prepare for.

Can you make it magic or maybe develop end to end solutions for your first "customers" using your product? That sounds like the schlep you need to do.

Sounds promising! Find yourself a customer or two!

If you really want to go the open source route, just focus on that and then see if people pick it up and use it. Then you'd offer the SaaS.


👤 14
I saw a project here on HN that was basically a webscraper for non developers that aimed to make it easy to scrape data from various sites. What some warned the project did not do is warn you that you could potentially be banned from some services by using bots and scrapers. I forget the project name but once someone warned it could potentially have your facebook or whatever banned I decided not to try it out. I would warn users be very careful about the TOS for each site you decide to scrape if you are logged in as a user.

👤 katzgrau
Ultimately people want information from scraping, because to a lot of businesses, the information is what's valuable.

Consider, apart from a tool for general purpose scraping, what information a specialized scraper might obtain for a valuable but underserved industry that can profit from the data.

I have a buddy that scrapes data specifically for the tanker/shipping industry, for example.

General purpose scraping will involve a lot of competition and a bit of an arms race. Niche scraping lets you fly a little below the radar.


👤 subpixel
Pitch your investors the VPN business. That's what web scraping is at scale - a series of networking techniques that allow users to do what they want without being blocked.

👤 mateo1
A good thing to ask yourself is whether there's anything left out there that's both accessible (legally and technically) and worth scraping at the same time.

Search engines deliberately wiped out personal websites, blogs, small news organisations etc, and spammers drowned out the remaining real user generated content from the www. Social media ate the forums and closed the doors.

Websites now aren't websites but businesses and they don't like people snooping around


👤 notduncansmith
It will help to recognize the key use-cases and provide lots of support out of the box like pre-built scrapers for price comparison, social media mentions (or other analysis), whatever you find that people will pay for.

Make sure your pricing is clear so the profit calculation for the customer is transparent.

You then have a tangible product line you can pitch to investors regardless of whether they can appreciate the more abstract solution/platform.


👤 wombatpm
If your potential customers are willing to pay to scrape data, why aren't they willing to pay for the data from the source directly? Is it not available, or is it considered exclusive or proprietary? I'm thinking about the lawsuits around deep linking and TicketMaster. Web scraping at scale is a never-ending arms race because designs evolve or the host is actively trying to thwart you.

👤 markl42
Maybe only kinda sorta relevant, but I interned at RefME for a summer - basically a Zotero competitor. There was a lot of value (for users) in scraping web pages to autofill the author, title, date, etc. so it could generate the references, something I started to work on (but then they got bought out, so I'm not sure what ended up happening).

And I could've imagined this being a paid pro feature down the line.
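To give a feel for the autofill idea, here is a rough sketch that pulls title, author and date out of a page's `<meta>` tags using only the standard library. It is not how RefME did it; the tag names in the mapping are common conventions that many (not all) sites follow, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class CitationParser(HTMLParser):
    """Collect title/author/date from <meta> tags, keeping the first value seen."""

    # Meta names/properties that some sites use; real coverage varies a lot.
    FIELD_BY_META = {
        "citation_title": "title",
        "og:title": "title",
        "citation_author": "author",
        "author": "author",
        "citation_publication_date": "date",
        "article:published_time": "date",
    }

    def __init__(self):
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("name") or attr.get("property") or "").lower()
        field = self.FIELD_BY_META.get(key)
        content = attr.get("content")
        if field and content and field not in self.fields:
            self.fields[field] = content.strip()


def extract_citation(html: str) -> dict[str, str]:
    parser = CitationParser()
    parser.feed(html)
    return parser.fields


SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="An Example Paper">
<meta name="author" content="Ada Example">
<meta property="article:published_time" content="2021-11-01">
</head><body></body></html>
"""
print(extract_citation(SAMPLE_HTML))
```

Pages without such tags are where the real work starts, which is probably why this kind of feature is plausible as a paid tier.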


👤 nojs
I cofounded a company based on scraping academic publications. We ended up getting a lot of traffic (millions of pageviews per month) because we had good SEO, but it’s not necessarily a defensible business model by itself.

You’ll likely have to do more manual data cleaning than you expect, and get some amount of pushback from the sources you’re crawling (depending how commercially valuable the data is).


👤 hooande
Web scraping can be a viable business. It depends on what you're scraping and who your customers are.

Are there a couple thousand people who would pay for a SaaS offering? Then it's a business. The real goal would be identifying a hair on fire problem that you are in a unique position to solve. That's always the problem, and it has nothing to do with web scraping in particular.


👤 webdoodle
I did so for years, scraping university press releases, obscure government data (like the Federal Register), jail/prison rosters, Reddit posts/comments/users, etc. Companies like LexisNexis have been doing it since long before the digital world, offering it as a clipping service. If you can find a niche, it can be a regular subscription income stream.

👤 danbmil99
We tried https://techcrunch.com/2021/06/14/supreme-court-revives-link...

(I was CTO of hiQ (technically still am I think))

contact me if you want to chat about scraping danbmil99 at gmail


👤 suifbwish
I’m curious how you deal with JavaScript that will load other pages including other JavaScript documents that cannot be loaded until the first set of JavaScript is executed. I’ve played with the chromium web driver a few times but it seems to be tricky to implement in a completely headless environment.

👤 arasx
If you can apply the scraping to a use case and sell that use case, you have a better chance at a viable business model. Examples that come to mind: BuiltWith (scrapes sites and publishes the list of technologies they use), Ahrefs (scrapes sites and finds outgoing/incoming links), etc.

👤 orev
If you’re scraping someone else’s data, do you know what the copyright status is? Have you made deals with the original sources that permit you to use their data? How will you deal with lawsuits and the constant blocking of your scrapers?

👤 bootcat
I think yes, web scraping can give a viable product. Now, with the ability to scrape React/JS pages and the availability of 24GB free-tier cloud machines and transformer models, I think it should be possible for at least the next 5 years!

👤 kkoppenhaver
Don't know if this gets to the heart of your question, but I was surprised to not have seen https://www.scrapingbee.com/ mentioned here yet.

👤 mrleinad
An online law editorial I used to work at around 10 years ago basically scraped free available content from different sources around the web, repackaged it, and sold it online via subscription model. They're still around.

👤 itake
You might want to look at web scraping for data scientists. I am trying to build an ML model for NSFW text detection in multiple languages and I am not looking forward to scraping p*rn and YouTube websites for comments.

👤 Supermancho
Web scraping on demand: https://zvelo.com/ in the service of adtech, ofc.

Re: URL Database for Brand Safety & Contextual Targeting


👤 nomdep
Instead of competing head-on with ScrapingHub/etc, you could ask yourself “why do companies doing X pay them?” and sell a specific product for that niche.

👤 ethor
In my opinion you should start with defining your value proposition and target market, i.e. how do I create value for customers, and who are my customers?

👤 shtopointo
Isn't one big issue that any website that has data worth scraping has ToS that disallow scraping? e.g. craigslist?

Would think this is a legal concern...


👤 bitshaker
It is for openexchangerates.org

Makes the founder a hefty sum.


👤 eruci
Yes! Everybody scrapes everybody. The difference lies in what they do with the data afterwards.

👤 asimjalis
Another example is Zillow which scrapes public records. Public records have some advantages.

👤 kccqzy
Of course. See Yodlee and Plaid.

👤 launchiterate
Yes if you point it at a well defined target market and create a solution.

👤 authed
Scraping user content or recipes should be fine.

👤 martini333
Yes.

👤 gtsop
Point them to the disgusting but successful practices of Clearview AI. That should do it.

👤 dradicchi
Please, see https://www.datallog.com/

They are a profitable startup and have an SDK solution for scaling web scraping bots across many business fields.

I'm an advisor/investor in Datallog. You can connect with the founder at https://www.linkedin.com/in/joelder-maragno-arcaro/