In the past I've used tools like Selenium and Beautiful Soup to crawl and extract data. A lot of the tedious work is in manually figuring out how to crawl each site and extract data from its pages, which is particularly annoying when you have numerous different sites to manage.
My knowledge is a bit outdated. Are there any tools today that make this kind of work easier? Anything that can automatically figure out how to crawl a site or automatically extract data from a page?
In practice, current models have a limited attention window, so you can't feed a large HTML document into them in one pass. Chunking the document into smaller pieces is unattractive because you really want an element's start tag and end tag to land in the same window.
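To make the chunking constraint concrete, here is a minimal sketch (using only Python's stdlib `html.parser`; the class name is illustrative) of splitting a document at balanced top-level element boundaries, so each chunk always contains matching start and end tags. It ignores void elements like a bare `<br>`, which would need special handling in real HTML:

```python
from html.parser import HTMLParser

class BalancedChunker(HTMLParser):
    """Split HTML into chunks at top-level element boundaries,
    so every chunk has balanced start/end tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # current nesting depth
        self.chunks = []    # completed balanced chunks
        self.buf = []       # pieces of the chunk being built

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.depth -= 1
        self.buf.append(f"</{tag}>")
        if self.depth == 0:  # a top-level element just closed
            self.chunks.append("".join(self.buf))
            self.buf = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buf.append(data)

chunker = BalancedChunker()
chunker.feed("<div><p>one</p></div><div><p>two</p></div>")
print(chunker.chunks)
# → ['<div><p>one</p></div>', '<div><p>two</p></div>']
```

Each chunk can then be fed to a model independently without truncating an element mid-way; the catch is that a single element larger than the window still can't be split this way.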
Long-range transformers are improving rapidly, and somebody will soon "crack the code" and come out with something that makes the job easy, even if it is slow and somewhat expensive.
Other approaches, such as graph neural networks and systems based on classical machine learning, are also in development, so you will see some pretty awesome tools coming online.
What I do when I get lazy is copy some HTML, send it to ChatGPT, and ask it to generate a Puppeteer script. It usually works with a bit of tweaking.
If you have a budget, I've done a lot of freelance work of this type.