HACKER Q&A
📣 imnotreallynew

What is the best tech stack nowadays for mass scraping?


I’m in the design phase for a new project that will involve quite a bit of web scraping, formatting that data as appropriate, and saving it to a DB.

The sources to be scraped are quite varied.

I’ve got a little bit of experience with Node/Express and Ruby/Rails. I’m more than happy to pick up Go or Python/Django or Elixir or something else if those are more appropriate. I think my hesitation with going back to Node is that I slightly prefer statically typed languages, but I’m happy to use the best tool for the job.

My concern is computing/bandwidth costs as the various scrapers will be running and alternating quite frequently.

I’m hoping you all could give some recommendations for a stack that makes it easy to run mass scheduled web scraping jobs with little overhead in order to reduce server costs. Thanks!


  👤 awesomegoat_com Accepted Answer ✓
I have built my web scraping system ( https://awesomegoat.com ) on Ruby on Rails. And while I spent this Christmas break exploring Elixir/Phoenix, I am so far staying with Ruby on Rails.

While it seems I could have built a slightly more (CPU & memory) efficient system in Elixir, I am afraid the development of new features would be a bit slower, and my time is more precious than the machine's.

Also, CPU & memory are likely not the constraints in a scraping exercise. What you will likely find later on is that you get blocked by Cloudflare in week 2, and a superb backend won't make a difference.


👤 accrual
Maybe TypeScript for a typed, familiar, and easy-to-read/write language, with either an internal scheduler or an external one (e.g. cron). Keep per-website rules/scraping hints as JSON on disk or in a database, unless it's supposed to be dynamic or one ruleset to rule them all; see the sketch below for what that might look like. If you need to go faster, you could retrieve the data (wget/curl/some lib) and pass it to a binary (C/Rust) for processing into the database at core speed.
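To make the per-site-rules idea concrete, here is a minimal TypeScript sketch, assuming Node 18+ (built-in fetch) and cheerio for HTML parsing. The rules.json file name, the SiteRule shape, and the console.log stand-in for a DB insert are illustrative assumptions, not a fixed design.

    // Minimal sketch: per-site rules loaded from a JSON file, fetched with the
    // built-in fetch (Node 18+) and parsed with cheerio. File name, rule shape,
    // and the console.log "save" are assumptions for illustration only.
    import { readFile } from "node:fs/promises";
    import * as cheerio from "cheerio";

    interface SiteRule {
      name: string;                      // e.g. "example-blog"
      url: string;                       // page to fetch
      selectors: Record<string, string>; // field name -> CSS selector
    }

    async function scrapeSite(rule: SiteRule): Promise<Record<string, string>> {
      const res = await fetch(rule.url, { headers: { "User-Agent": "my-scraper/0.1" } });
      if (!res.ok) throw new Error(`${rule.name}: HTTP ${res.status}`);
      const $ = cheerio.load(await res.text());

      const row: Record<string, string> = {};
      for (const [field, selector] of Object.entries(rule.selectors)) {
        row[field] = $(selector).first().text().trim();
      }
      return row;
    }

    async function main() {
      // rules.json: [{ "name": "...", "url": "...", "selectors": { "title": "h1" } }, ...]
      const rules: SiteRule[] = JSON.parse(await readFile("rules.json", "utf8"));
      for (const rule of rules) {
        try {
          const row = await scrapeSite(rule);
          console.log(rule.name, row); // replace with a DB insert in a real setup
        } catch (err) {
          console.error(`failed: ${rule.name}`, err);
        }
      }
    }

    main();

Run something like this from external cron (e.g. every 30 minutes) rather than keeping a long-lived process, which keeps the compute overhead close to zero between runs.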