HACKER Q&A
📣 tamesquo572

How do you deduplicate fuzzy files?


I am faced with the task of designing a system in which users can upload a large number of files of various types that are similar but not identical.

I would like to deduplicate parts of the files, but I am wondering whether there are better (existing) solutions than simply searching the files for large identical blocks, storing those once, and referencing them.


  👤 DemocracyFTW2 Accepted Answer ✓
You want content-addressed storage; this works with rolling content hashes that identify common blocks of data. `rsync` uses that technique to minimize the number of bytes that have to be transferred. https://github.com/qarmin/czkawka is a GUI app and CLI tool that finds identical files in general and similar images in particular.
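A minimal sketch of that idea in Python, assuming a Gear-style rolling hash with illustrative chunk-size parameters (this does not mirror rsync's or czkawka's internals): files are cut at content-defined boundaries, and each chunk is stored once under its SHA-256 digest.

```
import hashlib
import random

# One random 64-bit value per byte value drives the rolling hash.
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1              # ~8 KiB average chunk size (illustrative)
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def chunks(data: bytes):
    """Split data at content-defined boundaries using a rolling hash."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        # Simplified boundary test: cut where the hash's low bits are zero.
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def store(data: bytes, chunk_store: dict) -> list:
    """Store each chunk under its SHA-256 digest; return the file's 'recipe'."""
    recipe = []
    for c in chunks(data):
        digest = hashlib.sha256(c).hexdigest()
        chunk_store.setdefault(digest, c)   # identical chunks are stored only once
        recipe.append(digest)
    return recipe

def restore(recipe: list, chunk_store: dict) -> bytes:
    """Reassemble a file from its recipe of chunk digests."""
    return b"".join(chunk_store[d] for d in recipe)
```

Because boundaries depend only on local content, two uploads that differ by a small edit produce recipes that share most of their digests, so only the changed chunks take new space.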

The task is much simpler if you only want to find bit-identical whole files rather than parts of files. In that case, you can just run a tool like `sha1sum` over each file and record the hash digest in a database: identical files will always have the same hash, and non-identical files will, with overwhelming probability, have different ones.
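A sketch of that whole-file approach (using SHA-256 here rather than `sha1sum`; the directory walk and block size are illustrative):

```
import hashlib
import os
from collections import defaultdict

def file_digest(path: str) -> str:
    """Hash a file in streaming fashion so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def find_identical_files(root: str) -> dict:
    """Group files under `root` by content hash; groups with >1 entry are duplicates."""
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```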


👤 IcePic
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm and https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial can be used to find common substrings and are used by deduplicating programs.
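A minimal Rabin-Karp style sketch (window size, base, and modulus are illustrative): it hashes every fixed-size window of one byte string in O(1) per step and reports windows that also occur in another.

```
WINDOW = 64
BASE = 257
MOD = (1 << 61) - 1

def window_hashes(data: bytes):
    """Yield (offset, hash) for every WINDOW-byte window, updated incrementally."""
    if len(data) < WINDOW:
        return
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    yield 0, h
    top = pow(BASE, WINDOW - 1, MOD)
    for i in range(WINDOW, len(data)):
        h = (h - data[i - WINDOW] * top) % MOD   # drop the byte leaving the window
        h = (h * BASE + data[i]) % MOD           # add the byte entering the window
        yield i - WINDOW + 1, h

def common_windows(a: bytes, b: bytes):
    """Offsets in `b` whose WINDOW-byte content also appears in `a`."""
    seen = {h: off for off, h in window_hashes(a)}
    for off, h in window_hashes(b):
        # Byte comparison guards against hash collisions.
        if h in seen and a[seen[h]:seen[h] + WINDOW] == b[off:off + WINDOW]:
            yield seen[h], off
```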

👤 solardev
I should just note that you might have to account for how data recovery would work in such a system: if shared blocks are lost and only the references to them remain, every file that referenced those blocks is damaged.