HACKER Q&A
📣 brundolf

Tool for finding repeated chunks of text across files?


I'm hoping to automatically find potential copypasta across a codebase (not just individual lines, but sequences of lines). I realize this is an exponential problem, though N won't be crazy-large so it should be tractable.

Anybody know of something like this?


  👤 grok22 Accepted Answer ✓

👤 PaulHoule
Make tuples of (line_content, file, line_number), sort them by line_content, and throw out all the singletons. Then sort what's left by (file, line_number) and you will get contiguous stretches of lines that are duplicated.
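
A minimal Python sketch of the steps above, not a finished tool. The function name `duplicated_runs`, the `files` dict shape (file name mapped to a list of lines), and the `min_run` cutoff are all hypothetical. Stripping whitespace and skipping blank lines are extra assumptions, so a run will break at a blank line. Note that this only flags stretches where every line occurs somewhere else too; it does not prove the whole stretch was copied from a single place, so treat the output as candidates to eyeball.

```python
from itertools import groupby


def duplicated_runs(files, min_run=2):
    """files: dict of file name -> list of lines (hypothetical input shape).

    Returns (file, first_line_number, last_line_number) tuples for stretches
    in which every line also occurs somewhere else in the input.
    """
    # Step 1: tuples of (line_content, file, line_number), 1-based numbering.
    tuples = [
        (line.strip(), name, number)
        for name, lines in files.items()
        for number, line in enumerate(lines, start=1)
        if line.strip()  # assumption: ignore blank lines
    ]

    # Step 2: sort by line_content and throw out the singletons.
    tuples.sort(key=lambda t: t[0])
    repeated = []
    for _, group in groupby(tuples, key=lambda t: t[0]):
        group = list(group)
        if len(group) > 1:
            repeated.extend(group)

    # Step 3: sort by (file, line_number) and merge consecutive line numbers.
    repeated.sort(key=lambda t: (t[1], t[2]))
    runs = []
    for _, name, number in repeated:
        if runs and runs[-1][0] == name and runs[-1][2] == number - 1:
            runs[-1][2] = number  # extend the current stretch
        else:
            runs.append([name, number, number])  # start a new stretch

    # Keep only stretches at least min_run lines long.
    return [tuple(r) for r in runs if r[2] - r[1] + 1 >= min_run]
```

To try it on real files you could build `files` with something like `{str(p): p.read_text().splitlines() for p in pathlib.Path(".").rglob("*.py")}` and raise `min_run` until the noise from common one-liners goes away.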