HACKER Q&A
📣 EdwardDiego

Processing multiple GBs of data on local machine



I have about 100GB of Log4j-formatted logs to search for a particular needle in the haystack, and I'm looking for a decent way to process those files locally without breaking out Spark on EMR, etc.

I recall a few blog posts on this subject, but my search-fu is letting me down. Is this ringing any bells for anyone?

Thanks in advance :)


  👤 viraptor Accepted Answer ✓
I know of one related post, but I can't remember how to find it either. (I did find a response to a response to it: https://news.ycombinator.com/item?id=8920194)

However, if you're just after a specific needle... why not grep? With 100GB on a local machine you'll either need to do pre-processing/filtering or you'll be IO-limited, so instead of heavy tools, why not start with the basics?
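In that spirit, here's a minimal stdlib-only sketch of a streaming scan: it reads the file line by line, so memory stays constant no matter how big the log is. The file name and pattern are made up for illustration.

```python
import re

def scan(path, pattern):
    """Stream a log file line by line, yielding (line number, line) for matches.

    Memory use is constant regardless of file size, since only one line
    is held at a time.
    """
    needle = re.compile(pattern)
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if needle.search(line):
                yield lineno, line.rstrip("\n")

# Usage (hypothetical file and pattern):
# for lineno, line in scan("app.log", r"OutOfMemoryError"):
#     print(lineno, line)
```

At this point the bottleneck is disk read speed, which is the point of viraptor's comment: a fancier tool won't make the disk faster.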


👤 Andys
ripgrep? It runs at over 1 GB/s on my machine.

👤 VirusNewbie
Spark runs locally and should exploit all your cores. I'd use Spark rather than writing my own multithreaded system. If you don't like Spark, there's also Apache Beam.

👤 speedgoose
You could also load the logs into SQLite after a little parsing.
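A stdlib-only sketch of that approach: parse each line into columns once, load them into SQLite, then query repeatedly with SQL. The layout regex below is a guess at a typical Log4j pattern ("date time [LEVEL] message") and would need adjusting to the actual layout.

```python
import re
import sqlite3

# Guessed Log4j layout: "date time [LEVEL] message" -- adjust to your pattern.
LINE = re.compile(r"^(\S+ \S+) \[(\w+)\] (.*)$")

def load_logs(db_path, lines):
    """Parse log lines into a SQLite table; unparseable lines are skipped."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, level TEXT, msg TEXT)")
    rows = (m.groups() for m in map(LINE.match, lines) if m)
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

# Usage (hypothetical file name):
# with open("app.log") as f:
#     conn = load_logs("logs.db", f)
# conn.execute("SELECT * FROM logs WHERE level = 'ERROR'").fetchall()
```

The upfront parse costs one pass over the 100GB, but after that every follow-up question is an indexed SQL query instead of another full scan.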