HACKER Q&A
📣 freediver

What would be the fastest way to grep Common Crawl?


The most recent Common Crawl includes 80TB of WET (extracted text) files from the latest web crawl.

Assuming you have the files locally, what would be the fastest way to "fgrep" (string search) through them?

Testing with ripgrep on my 10-core iMac Pro, I get about 6 seconds to grep through a 20GB file. That works out to about 5 minutes per TB, or almost 7 hours for the full Common Crawl. What setup would I need to do it in <100ms? :)
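
For reference, the rough arithmetic behind those numbers, as a quick Python sketch using the 20GB-in-6-seconds measurement above and the 80TB total:

  # Back-of-the-envelope estimate from the ripgrep measurement above.
  gb_per_sec = 20 / 6                    # ~3.3 GB/s on the 10-core iMac Pro
  secs_per_tb = 1000 / gb_per_sec        # ~300 s, i.e. ~5 minutes per TB
  total_hours = 80 * secs_per_tb / 3600  # ~6.7 hours for 80TB of WET files
  print(secs_per_tb / 60, total_hours)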


  👤 burntsushi Accepted Answer ✓
ripgrep author here.

When you want to search TBs of data, you have to ask: what do you need to do? If you only need to search the data set once for a single query, then ripgrep (or similar) is probably your shortest path in terms of end-to-end time to get your results.
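
For the one-off case, here's a minimal sketch of driving ripgrep over the WET files from Python, assuming they're already decompressed under a local directory (the path, query and worker count are placeholders):

  import subprocess
  from concurrent.futures import ThreadPoolExecutor
  from pathlib import Path

  WET_DIR = Path("/data/wet")   # placeholder: directory of decompressed WET files
  QUERY = "example.com"         # placeholder query string

  def rg(path):
      # -F = fixed-string (fgrep-style) search; flat output with filenames.
      result = subprocess.run(
          ["rg", "-F", "--no-heading", "--with-filename", QUERY, str(path)],
          capture_output=True, text=True)
      return result.stdout

  # Each rg process handles one file; a small pool keeps the cores busy
  # without thrashing the disk.
  with ThreadPoolExecutor(max_workers=4) as pool:
      for out in pool.map(rg, sorted(WET_DIR.glob("*.wet"))):
          if out:
              print(out, end="")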

If, however, you want to run many queries against an unchanging or growing data set at this scale, then I think you probably want to find some way to index it beforehand. Tools like SOLR or Elasticsearch are fairly standard and shouldn't cost you too much in terms of learning how to use them. You can perhaps go faster than Elasticsearch/SOLR/Lucene if there is something special about your problem that permits you to design a custom indexing strategy. But that requires knowing more about your search goals and also costs more to develop (probably).
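
As a rough sketch of the indexing route, assuming the 8.x Elasticsearch Python client and a node running locally (the index name, field names and record parser are made up for illustration):

  from elasticsearch import Elasticsearch
  from elasticsearch.helpers import bulk

  es = Elasticsearch("http://localhost:9200")

  def wet_records():
      # Placeholder: yield (url, text) pairs parsed out of the WET files
      # with whatever WARC/WET parser you prefer.
      yield ("http://example.com/", "some extracted page text ...")

  # One-time (slow) indexing pass over the data set.
  bulk(es, ({"_index": "commoncrawl",
             "_source": {"url": url, "content": text}}
            for url, text in wet_records()))

  # After that, each query is an index lookup instead of an 80TB scan.
  resp = es.search(index="commoncrawl",
                   query={"match_phrase": {"content": "some extracted"}})
  print(resp["hits"]["total"])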

Otherwise, if you're willing to pay, then spin up a bunch of machines with lots of RAM and partition the data set so that each partition fits in the memory of a single machine. Then parallelize your search. (This is probably a middle ground between grepping the whole thing on a single machine and using something like full-text indexing.)
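
Here's a toy single-machine sketch of that idea: each worker holds one partition fully in memory and counts matches with a plain substring scan. In practice each partition would live on its own machine, be loaded once, and answer many queries over the network (the path and needle are placeholders):

  from concurrent.futures import ProcessPoolExecutor
  from pathlib import Path

  PARTITIONS = sorted(Path("/data/wet").glob("*.wet"))  # placeholder partitions
  NEEDLE = b"example.com"                               # placeholder query

  def search_partition(path):
      # Toy stand-in for "one partition per machine": read the whole
      # partition into RAM, then scan it with a plain substring search.
      data = path.read_bytes()
      count, start = 0, 0
      while (idx := data.find(NEEDLE, start)) != -1:
          count += 1
          start = idx + 1
      return path.name, count

  if __name__ == "__main__":
      with ProcessPoolExecutor() as pool:
          for name, count in pool.map(search_partition, PARTITIONS):
              print(name, count)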


👤 freediver
Update: splitting the files into 100MB chunks increases search speed by 4x, so it's about 1.5 seconds to grep through 20GB. Smaller chunk sizes do not yield further improvement.
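
Roughly what I mean by splitting, as a sketch (paths and chunk size are illustrative); ripgrep's parallelism is per file, so many ~100MB files keep all cores busy:

  from pathlib import Path

  # Split one large WET file into ~100MB pieces so ripgrep can parallelize
  # across them. Paths are placeholders.
  CHUNK = 100 * 1024 * 1024
  src = Path("/data/wet/big.wet")
  out_dir = Path("/data/wet/chunks")
  out_dir.mkdir(parents=True, exist_ok=True)

  with src.open("rb") as f:
      i = 0
      while chunk := f.read(CHUNK):
          # Caveat: a byte-offset split can cut a line in half, so a match
          # that straddles a chunk boundary would be missed.
          (out_dir / f"{src.stem}.{i:05d}").write_bytes(chunk)
          i += 1

  # Then:  rg -F "needle" /data/wet/chunks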