Have there been any innovations in crawling or data extraction?

Question

I'm currently working on a project which involves crawling data from a large collection of public sites. I need to figure out how to crawl these sites and parse the data that I need.
In the past I've used tools like Selenium and Beautiful Soup to crawl and extract data. A lot of the tedious work is in manually figuring out how to crawl sites and extract data from pages, which is particularly annoying when you have numerous different sites to manage.
My knowledge is a bit outdated. Are there any tools today that make this kind of work easier? Anything that can automatically figure out how to crawl a site or automatically extract data from a page?

PaulHoule · Accepted Answer

In principle you could feed HTML into a transformer model and train it to do tasks such as classification (is this an "open access" scientific paper?) or extraction.
In practice, current models have a limited attention window, so you can't feed a large HTML document into them. Chunking the document into smaller pieces is not attractive because you really want the start tag and the end tag of an element in the same window.
Long-range transformers are rapidly improving and somebody will soon "crack the code" and come out with something that will make the job easy even if it is slow and somewhat expensive.
Other approaches such as graph neural networks and systems based on classical machine learning are also in development so you will see some pretty awesome tools coming online.
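
To make the chunking concern above concrete, here is a minimal sketch of one workaround: split on element boundaries instead of at arbitrary offsets, so every chunk carries both its start tag and its end tag. It uses only Python's standard `html.parser`. The names `Node`, `TreeBuilder`, `chunk_elements` and `split_html` are made up for illustration, and the character budget is an assumed stand-in for a real token count. It does not solve the larger problem, since a chunk still loses the context of its ancestors.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: a tag, its attributes, and children."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = list(attrs)
        self.children = []  # mix of Node objects and plain text strings

    def to_html(self):
        attrs = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in self.attrs
        )
        if self.tag in VOID_TAGS:
            return f"<{self.tag}{attrs}>"
        # Note: text is re-escaped, which is wrong for <script>/<style>
        # bodies; this sketch ignores that case.
        inner = "".join(
            escape(c, quote=False) if isinstance(c, str) else c.to_html()
            for c in self.children
        )
        return f"<{self.tag}{attrs}>{inner}</{self.tag}>"


class TreeBuilder(HTMLParser):
    """Builds a rough element tree; comments and doctypes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data)


def chunk_elements(node, max_chars):
    """Yield whole elements that fit the budget, descending into big ones."""
    for child in node.children:
        if isinstance(child, str):
            yield escape(child, quote=False)
            continue
        html_text = child.to_html()
        if len(html_text) <= max_chars or not child.children:
            yield html_text  # fits, or is a leaf that cannot be split
        else:
            yield from chunk_elements(child, max_chars)


def split_html(html_text, max_chars):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return list(chunk_elements(builder.root, max_chars))


doc = ("<ul><li><a href='/a'>First paper</a></li>"
       "<li><a href='/b'>Second paper</a></li></ul>")
chunks = split_html(doc, max_chars=50)
```

With a budget of 50 characters the `<ul>` is too large, so each `<li>` comes out as its own chunk with its tags intact. With a larger budget the whole list stays together.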

caotic123 · Answer

Yeah, you could always use a language model to handle the parsing. I think that is the state of the art in crawling today, but you still have to worry about proxies and rendering pages. The only API I know of that can truly scrape with zero code is datalambda; you can contact them at contact@datalambda.tech

throwawayadvsec · Answer

There are a few services in beta that offer to scrape sites more or less automatically using LLMs.
What I do when I get lazy is copy some HTML, send it to ChatGPT, and ask it to generate a Puppeteer script. It usually works with a bit of tweaking.
If you have a budget, I've done a lot of freelance jobs of this type.
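
The copy-and-paste workflow above can be made a little less manual. The following is only a rough sketch under assumptions: it strips `<script>`/`<style>` bodies and comments from a snippet so it fits a chat window more easily, then builds a prompt asking for a Puppeteer script. `SnippetCleaner`, `clean_html`, `build_prompt` and the prompt wording are all hypothetical, and actually sending the prompt to a model is left out.

```python
import re
from html import escape
from html.parser import HTMLParser


class SnippetCleaner(HTMLParser):
    """Re-emits HTML while dropping script/style content and comments."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif not self.skip_depth:
            # Keep the start tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as-is, with no end tag.
        if tag not in self.SKIP and not self.skip_depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))

    # Comments are dropped: HTMLParser's default handle_comment does nothing.


def clean_html(html_text):
    cleaner = SnippetCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    # Collapse runs of whitespace to keep the snippet short.
    return re.sub(r"\s+", " ", "".join(cleaner.parts)).strip()


def build_prompt(html_text, fields, max_chars=6000):
    # max_chars is an assumed budget; truncation may cut a tag in half,
    # so prefer pasting a small, relevant fragment in the first place.
    snippet = clean_html(html_text)[:max_chars]
    return (
        "Write a Puppeteer script that opens the page and extracts these "
        f"fields: {', '.join(fields)}.\n"
        "Here is a sample of the page's HTML:\n"
        f"{snippet}\n"
    )


raw = ("<div class='item'><script>track();</script><!-- ad -->"
       "<h2>Title</h2>\n  <span class=\"price\">$5</span></div>")
prompt = build_prompt(raw, ["title", "price"])
```

As in the manual version, expect to tweak whatever script comes back, especially the selectors.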