HACKER Q&A
📣 EdwardDiego

Processing multiple GBs of data on local machine



I have about 100GB of Log4j-formatted logs to search for a particular needle in the haystack, and I'm looking for a decent way to process those files locally without breaking out Spark on EMR, etc.

I recall a few blog posts on this subject, but my search-fu is letting me down. Is this ringing any bells for anyone?

Thanks in advance :)


  👤 viraptor Accepted Answer ✓
I know of one related post, but I can't remember how to find it either. (I did find a response to a response to it: https://news.ycombinator.com/item?id=8920194)

However, if you're just after a specific needle... why not grep? With 100GB on a local machine you'll either need to do pre-processing/filtering or you'll be IO-limited, so instead of heavy tools, why not start with the basics?
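In that spirit, here's a minimal stdlib-only sketch of a streaming scan: it reads the file line by line, so memory stays constant no matter how big the log is. The file name and pattern are made up for illustration.

```python
import re

def scan(path, pattern):
    """Stream a log file line by line, yielding (line number, line) for matches.

    Memory use is constant regardless of file size, since only one line
    is held at a time.
    """
    needle = re.compile(pattern)
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if needle.search(line):
                yield lineno, line.rstrip("\n")

# Usage (hypothetical file and pattern):
# for lineno, line in scan("app.log", r"OutOfMemoryError"):
#     print(lineno, line)
```

At this point the bottleneck is disk read speed, which is the point of viraptor's comment: a fancier tool won't make the disk faster.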


👤 Andys
ripgrep? It runs at over 1 GB/s on my machine.

👤 VirusNewbie
Spark runs locally and should exploit all your cores. I'd use Spark rather than writing my own multithreaded system. If you don't like Spark, there's also Apache Beam.

👤 speedgoose
You could also load the logs into SQLite after a little parsing.
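A stdlib-only sketch of that approach: parse each line into columns once, load them into SQLite, then query repeatedly with SQL. The layout regex below is a guess at a typical Log4j pattern ("date time [LEVEL] message") and would need adjusting to the actual layout.

```python
import re
import sqlite3

# Guessed Log4j layout: "date time [LEVEL] message" -- adjust to your pattern.
LINE = re.compile(r"^(\S+ \S+) \[(\w+)\] (.*)$")

def load_logs(db_path, lines):
    """Parse log lines into a SQLite table; unparseable lines are skipped."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, level TEXT, msg TEXT)")
    rows = (m.groups() for m in map(LINE.match, lines) if m)
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

# Usage (hypothetical file name):
# with open("app.log") as f:
#     conn = load_logs("logs.db", f)
# conn.execute("SELECT * FROM logs WHERE level = 'ERROR'").fetchall()
```

The upfront parse costs one pass over the 100GB, but after that every follow-up question is an indexed SQL query instead of another full scan.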