How to Web Scrape in 2020?

Question

Are there particular libraries or scraping-as-a-service UIs you would recommend? I'm particularly interested in a restaurant review website run by a company that has become increasingly detestable over the years.

marcell · Accepted Answer

Scrapy for Python is pretty good, check it out.
In most cases getting banned is the big issue. The bigger the site, the more advanced its bot detection is. You can use luminato.io to get residential and mobile IPs, but it's pricey.
Some sites will also obfuscate the DOM, i.e., strip or randomize class names and IDs, which complicates the data extraction.
http://scrapinghub.com/ has a paid "do it for me" service, which may be an option depending on your budget.
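On the obfuscated DOM point: one workaround is to select by tag structure instead of class names or IDs. Here is a rough sketch using only Python's standard `html.parser`. The `PathTextExtractor` class, the `extract` helper, and the sample markup are all made up for illustration; a real page will need a different path and probably more care.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextExtractor(HTMLParser):
    """Collect text found under a tag path, ignoring class names and IDs."""

    def __init__(self, path):
        super().__init__()
        self.path = list(path)  # e.g. ["li", "p"]
        self.stack = []         # tags that are currently open
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates some unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack[-len(self.path):] == self.path:
            self.results.append(text)


def extract(html, path):
    """Return all text nodes whose enclosing tags end with `path`."""
    parser = PathTextExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup with meaningless, randomized class names.
SAMPLE = (
    '<ul><li class="x9f"><p class="a1">Great noodles</p></li>'
    '<li class="k2"><p class="zz">Cold food</p></li></ul>'
)

if __name__ == "__main__":
    print(extract(SAMPLE, ["li", "p"]))
```

This breaks as soon as the site changes its layout, but so does every other selector strategy; the structure tends to change less often than generated class names.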

mtmail · Answer

Related, from 2 months ago: "Ask HN: What's state of the art for screen scraping these days?" (https://news.ycombinator.com/item?id=22148803), where https://simplescraper.io/ was recommended.

krageon · Answer

For avoiding bans, having a large IPv6 range can help (e.g. one you might get with a VPS at a proper hosting company). As for grabbing the content itself, I've used a lot of frameworks, but I usually end up back at some combination of simple string search and regex.
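To make the "string search and regex" approach concrete, here is a minimal sketch with Python's `re` module. The markup in `SAMPLE_HTML`, the pattern, and the `extract_reviews` name are assumptions for illustration only; the pattern has to be written against whatever the target page actually serves.

```python
import re

# Hypothetical markup; a real page will differ.
SAMPLE_HTML = """
<div><span>Rating: 4.5</span><p>Great noodles, slow service.</p></div>
<div><span>Rating: 2</span><p>Cold food.</p></div>
"""

# A rating, then the body of the <p> that follows it (non-greedy).
REVIEW_RE = re.compile(r"Rating:\s*([\d.]+)</span>\s*<p>(.*?)</p>", re.DOTALL)


def extract_reviews(html):
    """Return (rating, text) tuples found in an HTML string."""
    # Cheap string search first; skip the regex when the marker is absent.
    if "Rating:" not in html:
        return []
    return [(float(rating), text.strip()) for rating, text in REVIEW_RE.findall(html)]


if __name__ == "__main__":
    for rating, text in extract_reviews(SAMPLE_HTML):
        print(rating, text)
```

Regex over HTML is brittle in general, but for one known page layout it is often less work than wiring up a full parser.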

jamil7 · Answer

Depending on what you're scraping, you might run into a fair few JS-only websites that are a pain to scrape. On top of all the things mentioned here, you will need to run those pages through a headless browser like Puppeteer. For these sites you may be able to reverse engineer their APIs and scrape those rather than the pages themselves.
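If the site's frontend loads its data from a JSON endpoint (visible in the browser's network tab), scraping that endpoint can look roughly like the sketch below, using only the standard library. The payload shape (`{"reviews": [{"rating": ..., "text": ...}]}`), the User-Agent string, and both function names are assumptions; the real endpoint and its fields have to be found by inspecting the site.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode its JSON body."""
    req = Request(
        url,
        headers={
            "Accept": "application/json",
            "User-Agent": "example-scraper/0.1",  # placeholder value
        },
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def reviews_from_payload(payload):
    """Flatten a hypothetical review payload into (rating, text) tuples."""
    return [
        (item.get("rating"), item.get("text", ""))
        for item in payload.get("reviews", [])
    ]


if __name__ == "__main__":
    # Offline demo with a hand-written payload; no request is made here.
    demo = json.loads('{"reviews": [{"rating": 5, "text": "ok"}]}')
    print(reviews_from_payload(demo))
```

When this works it is usually faster and more stable than driving a headless browser, since the JSON tends to change less often than the rendered markup.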

ariosto · Answer

Python + Beautiful Soup