Our understanding is that existing web scraping tools are bad at this because you need to write custom scraping configurations per site. Not only that, but when a site changes its styling, it can completely break your automation. With agents, however, you can provide a high-level natural language description of the data you'd like from a website or class of websites, and the agent system handles the details of traversing the page and fetching the data automatically.
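To make that concrete, here's a minimal sketch of the shape of the idea (not our production code): render the page, then let a model do the extraction from a plain-English instruction instead of per-site selectors. Playwright and the OpenAI client are assumptions for illustration only; any rendering layer or model backend would work.

```python
# Sketch only: natural-language instruction in, structured data out.
# A styling change only matters if it removes the data itself.
import json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def scrape(url: str, instruction: str) -> dict:
    # Fetch the fully rendered DOM (after JS runs) rather than raw HTML.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()

    # The "scraping configuration" is just the instruction string.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Extract the requested data from the HTML and reply with JSON only."},
            {"role": "user",
             "content": f"Instruction: {instruction}\n\nHTML:\n{html[:100_000]}"},
        ],
    )
    # Assumes the model returns bare JSON; in practice you'd enforce a JSON
    # response format and validate against a schema.
    return json.loads(response.choices[0].message.content)

# scrape("https://example.com/jobs", "job title, company, and salary for each listing")
```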
We're curious how useful this might be for people. If you've experienced issues that this might solve or have already explored the space, we'd love to hear from you!
I don't know how relevant this is, but you could probably use AI to enhance OCR and convert scanned documents into a semantic form like HTML or LaTeX. That would let you scrape information from books, and printed books still hold a lot of untapped knowledge.
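A rough sketch of what that pipeline could look like, assuming Tesseract (via pytesseract) for the raw OCR pass and an LLM to recover the structure; both are stand-ins for whatever OCR engine and model you'd actually use:

```python
# OCR gives noisy plain text; a model pass turns it into semantic markup.
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()

def page_to_latex(image_path: str) -> str:
    # Plain OCR: character recognition only, most structure is lost here.
    raw_text = pytesseract.image_to_string(Image.open(image_path))

    # The model reconstructs headings, paragraphs, footnotes, etc. as LaTeX.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Convert this OCR output of a book page into clean LaTeX, "
                        "fixing obvious OCR errors and recovering the structure."},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content

# print(page_to_latex("scanned_page.png"))
```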
It seems like much of the demand for web scraping is to create datasets for ML training, and now you're using AI to do the scraping. So it's sort of a self-improving cycle.
My sites are periodically on the wrong end of scrapers, greedy by design or in error, occasionally to the point where I've had to manually block them or even threaten legal action.
Just because something can be done doesn't mean it should be. It also doesn't mean you should make it easier, any more than offering 'better' spam engines...
Not to me, but I would be curious whether you've found a way to mimic a real human browsing a site beyond headless Chrome. Do your TCP packets and HTTPS requests look indistinguishable from a real person's?
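(For reference on what that would take: the usual giveaways sit below the HTML level, in the TLS ClientHello / JA3 fingerprint, header order, and HTTP/2 settings. One common workaround is a client that replays a real browser's handshake; the sketch below assumes the curl_cffi library, and the exact impersonation target name depends on the library version.)

```python
# Sends the request with Chrome's TLS fingerprint and header ordering,
# so at the packet level it resembles that browser version rather than
# a stock Python HTTP client.
from curl_cffi import requests

resp = requests.get("https://example.com", impersonate="chrome110")
print(resp.status_code, resp.headers.get("server"))
```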