HACKER Q&A
📣 alephnan

How to Web Scrape in 2020?


Are there particular libraries or scraping-as-a-service UIs you would recommend?

I'm particularly interested in a restaurant review website run by a company that has become increasingly detestable over the years.


  👤 marcell Accepted Answer ✓
Scrapy for Python is pretty good, check it out.

In most cases getting banned is the big issue. The bigger the site, the more advanced its bot detection is. You can use luminato.io to get residential and mobile IPs, but it's pricey.

Some sites will also obfuscate the DOM, i.e. removing class names and IDs, which complicates the data extraction.
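
One way to cope with obfuscated class names is to ignore attributes entirely and anchor on visible text instead. A rough sketch using only the standard library's `html.parser`; the sample markup, the "Rating:" label, and the helper names are made up for illustration, and a real site will need its own anchors:

```python
from html.parser import HTMLParser

# Hypothetical markup with meaningless, randomized class names.
OBFUSCATED = (
    '<div class="x9f"><span class="k2">Rating:</span>'
    '<span class="p7q">4.5</span></div>'
    '<div class="m3"><span class="zz1">Rating:</span>'
    '<span class="b8">2</span></div>'
)


class TextChunkParser(HTMLParser):
    """Collect non-empty text chunks in document order, ignoring attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def values_after_label(html, label):
    """Return the text chunk that follows each occurrence of `label`."""
    parser = TextChunkParser()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    return [chunks[i + 1] for i, c in enumerate(chunks[:-1]) if c == label]


ratings = values_after_label(OBFUSCATED, "Rating:")
```

This only survives obfuscation as long as the visible labels stay stable, so treat it as a starting point rather than a fix.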

http://scrapinghub.com/ has a paid "do it for me" service, which may be an option depending on your budget.


👤 mtmail
Related from 2 months ago: "Ask HN: What's state of the art for screen scraping these days?" https://news.ycombinator.com/item?id=22148803 where https://simplescraper.io/ was recommended

👤 krageon
For avoiding bans, having a large IPv6 range can help (e.g. one you might get with a VPS at a proper hosting company). As for grabbing the content itself, I've used a lot of frameworks but I usually end up back at some combination of simple string search and regex.
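
For anyone wondering what "simple string search and regex" can look like in practice, here is a minimal sketch in Python. The HTML fragment, the `<span>` markers, and the "N stars" pattern are invented for the example; real pages will need their own markers:

```python
import re

# Hypothetical HTML fragment; real markup will differ per site.
SAMPLE_HTML = """
<div class="a1"><span>Great tacos</span> <b>4.5 stars</b></div>
<div class="zq9"><span>Slow service</span> <b>2 stars</b></div>
"""

# A number (optionally with a decimal part) followed by "star" or "stars".
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s*stars?", re.IGNORECASE)


def extract_between(text, start, end):
    """Return every substring found between the `start` and `end` markers."""
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        results.append(text[i:j])
        pos = j + len(end)
    return results


def extract_ratings(text):
    """Return all star ratings in the text as floats."""
    return [float(m) for m in RATING_RE.findall(text)]


comments = extract_between(SAMPLE_HTML, "<span>", "</span>")
ratings = extract_ratings(SAMPLE_HTML)
```

It is brittle against markup changes, but there is very little to break and no parser to fight with.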

👤 jamil7
Depending on what you're scraping you might run into a fair few JS-only websites that are a pain to scrape. On top of all the things mentioned here you will need to run pages through a headless browser like Puppeteer. For these sites you may be able to reverse engineer their APIs and scrape those rather than the pages themselves.
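
If you do find a JSON endpoint behind the page, consuming it is usually just a pagination loop. A sketch under heavy assumptions: the `reviews` and `has_more` fields and the page-number scheme are hypothetical, and `fetch_page` stands in for whatever HTTP call you end up making (here it is faked so the example runs offline):

```python
import json


def iter_reviews(fetch_page, max_pages=100):
    """Yield review dicts from a paginated JSON endpoint.

    `fetch_page(page)` is a caller-supplied function that returns the raw
    JSON text for one page. Field names below are assumptions.
    """
    for page in range(1, max_pages + 1):
        payload = json.loads(fetch_page(page))
        reviews = payload.get("reviews", [])
        if not reviews:
            break
        yield from reviews
        if not payload.get("has_more", False):
            break


# Fake responses standing in for a real (reverse-engineered) endpoint.
FAKE_PAGES = {
    1: {"reviews": [{"id": 1, "rating": 5}, {"id": 2, "rating": 3}], "has_more": True},
    2: {"reviews": [{"id": 3, "rating": 4}], "has_more": False},
}


def fake_fetch(page):
    """Return canned JSON text for the requested page."""
    return json.dumps(FAKE_PAGES.get(page, {"reviews": [], "has_more": False}))


all_reviews = list(iter_reviews(fake_fetch))
```

Passing the fetch function in keeps the parsing logic testable without hitting the site, and the `max_pages` cap guards against an endpoint that never reports the last page.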

👤 ariosto
Python + Beautiful Soup