Tool for finding repeated chunks of text across files?

Question

I'm hoping to automatically find potential copypasta across a codebase (not just individual lines, but sequences of lines). I realize this is an exponential problem, though N won't be crazy-large so it should be tractable. Anybody know of something like this?

grok22 · Accepted Answer

PMD's Copy/Paste Detector (CPD): https://pmd.github.io/latest/pmd_userdocs_cpd.html

PaulHoule · Answer

Make tuples of (line_content, file, line_number), sort them by line_content, and throw out all the singletons (lines whose content appears only once). Then sort what's left by (file, line_number) and you will get contiguous stretches of lines that are duplicated.
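A rough Python sketch of the sort-and-filter idea above, in case it helps. The function name `duplicated_runs`, the `min_run` threshold, and the choice to compare lines with surrounding whitespace stripped are all my own assumptions for illustration, not part of the answer as written.

```python
from itertools import groupby


def duplicated_runs(files, min_run=2):
    """files: dict mapping file name -> list of line strings (hypothetical input shape).

    Returns a list of (file, first_line_number, last_line_number) for each
    contiguous stretch of at least min_run lines whose content also appears
    somewhere else.
    """
    # Step 1: build (line_content, file, line_number) tuples.
    # Assumption: whitespace-insensitive comparison via strip().
    tuples = [
        (line.strip(), name, number)
        for name, lines in files.items()
        for number, line in enumerate(lines, start=1)
    ]

    # Step 2: sort by content and throw out the singletons.
    tuples.sort(key=lambda t: t[0])
    repeated = []
    for _, group in groupby(tuples, key=lambda t: t[0]):
        group = list(group)
        if len(group) > 1:
            repeated.extend(group)

    # Step 3: sort by (file, line_number) and collect contiguous stretches.
    repeated.sort(key=lambda t: (t[1], t[2]))
    runs = []
    start = prev = None
    for _, name, number in repeated:
        if prev is not None and prev[0] == name and number == prev[1] + 1:
            prev = (name, number)  # still inside the current stretch
            continue
        if start is not None and prev[1] - start[1] + 1 >= min_run:
            runs.append((start[0], start[1], prev[1]))
        start = prev = (name, number)
    if start is not None and prev[1] - start[1] + 1 >= min_run:
        runs.append((start[0], start[1], prev[1]))
    return runs


if __name__ == "__main__":
    sample = {
        "a.py": ["import os", "x = 1", "y = 2", "print(x + y)"],
        "b.py": ["import sys", "x = 1", "y = 2", "z = 3"],
    }
    for name, first, last in duplicated_runs(sample):
        print(f"{name}: lines {first}-{last}")
```

Note that a stretch found this way only means every line in it occurs somewhere else, not that the whole sequence is repeated together, so treat the output as candidates to eyeball. Blank lines and one-character lines like a lone closing brace are duplicated everywhere, so you'd probably want to filter those out or raise `min_run`.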