HACKER Q&A
📣 asim-shrestha

Would an easier way to scrape 100s of websites be useful to you?


In the process of building AI agents, we've found that what we built could eventually be good at dynamically scraping data across a variety of websites (tens to hundreds of different sites at a time).

Our understanding is that existing web scraping tools are bad at this because you need to write custom scraping configurations per site. Not only that, but when a site changes its styling, it might completely break your automation. With agents, however, you can provide a high-level natural-language description of the data you'd like from a website or class of websites, and the agent system will handle the details of traversing a page and fetching the data automatically.
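To make the contrast concrete, here is a minimal sketch of what the per-site approach looks like versus the agent-style approach. The site names, selectors, and the single-instruction API are all hypothetical, purely for illustration:

```python
# Hypothetical sketch: traditional scrapers need one hand-written
# selector config per site. A CSS class rename on any site silently
# breaks that site's entry.
SITE_CONFIGS = {
    "example-store.com": {"title": "h1.item-name", "price": ".product-price"},
    "other-shop.net":    {"title": ".listing-title", "price": "span#cost"},
    # ...one entry per site, maintained by hand.
}

# The agent-based alternative replaces every entry above with a single
# natural-language instruction (hypothetical interface, not a real API):
INSTRUCTION = "From each product page, extract the title and current price."

def config_for(site: str) -> dict:
    """Look up the hand-written selectors for a site, if any exist."""
    return SITE_CONFIGS.get(site, {})
```

The maintenance burden grows with the number of entries in `SITE_CONFIGS`, which is exactly the cost an agent-driven scraper would aim to eliminate.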

We’re curious how useful this might be for people. If you’ve experienced issues that this might solve or have already explored the space, we'd love to hear from you!


  👤 PaulHoule Accepted Answer ✓
My take is that the "idea people" usually underestimate how easy it is to make user interfaces and overestimate the difficulty of scraping, API "integrations", etc. (It's what makes Zapier such a successful racket.)

👤 shopvaccer
What kind of websites? You mean like social media sites that are obfuscated to prevent scraping? I suppose it would have to be quite reliable.

I don't know how relevant this is, but I was thinking that you could probably use some sort of AI to enhance OCR and convert written documents into some sort of semantic form like HTML or LaTeX. That would let you scrape information from books, and written books still hold a lot of untapped knowledge.

It seems like much of the demand for web scraping is to create datasets for ML training. And now you are using AI for the scraping itself, so it is sort of a self-improving cycle.


👤 DamonHD
Scraping hundreds of sites seems mostly unethical.

My sites are periodically on the wrong end of scrapers, greedy by design or in error, occasionally needing to be manually blocked or even threatened with legal action.

Just because something can be done doesn't mean it should be. It also doesn't mean that you should make it easier, any more than offering 'better' spam engines...


👤 LinuxBender
> Would an easier way to scrape 100s of websites be useful to you?

Not to me, but I would be curious whether you found a way to mimic a real human browsing a site, aside from headless Chrome. Do your TCP packets and HTTPS requests look indistinguishable from those of real people?