HACKER Q&A
📣 tamesquo572

How do you deduplicate fuzzy files?


I am faced with the task of designing a system in which users can upload a large number of files of various types that are similar but not identical.

I would like to deduplicate parts of the files, but I am wondering whether there are better (existing) solutions than simply searching the files for large identical blocks, storing those once, and referencing them.


  👤 DemocracyFTW2 Accepted Answer ✓
You want content-addressed storage; this works with rolling content hashes that identify common blocks of data. `rsync` uses that technique to minimize the number of bytes that have to be transferred. https://github.com/qarmin/czkawka is a GUI app and CLI tool that finds identical files in general and similar images in particular.
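A minimal sketch of that idea in Python, assuming a Gear-style rolling hash with illustrative chunk-size parameters (this does not mirror rsync's or czkawka's internals): files are cut at content-defined boundaries, and each chunk is stored once under its SHA-256 digest.

```
import hashlib
import random

# One random 64-bit value per byte value drives the rolling hash.
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1              # ~8 KiB average chunk size (illustrative)
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def chunks(data: bytes):
    """Split data at content-defined boundaries using a rolling hash."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        # Simplified boundary test: cut where the hash's low bits are zero.
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def store(data: bytes, chunk_store: dict) -> list:
    """Store each chunk under its SHA-256 digest; return the file's 'recipe'."""
    recipe = []
    for c in chunks(data):
        digest = hashlib.sha256(c).hexdigest()
        chunk_store.setdefault(digest, c)   # identical chunks are stored only once
        recipe.append(digest)
    return recipe

def restore(recipe: list, chunk_store: dict) -> bytes:
    """Reassemble a file from its recipe of chunk digests."""
    return b"".join(chunk_store[d] for d in recipe)
```

Because boundaries depend only on local content, two uploads that differ by a small edit produce recipes that share most of their digests, so only the changed chunks take new space.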

The task is much simpler if you only want to find bit-identical whole files rather than parts of files. In that case, you can just run a tool like `sha1sum` over each file and record the hash digest in a database: identical files will always have the same hash, and non-identical files will, with overwhelming probability, have different ones.
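A sketch of that whole-file approach (using SHA-256 here rather than `sha1sum`; the directory walk and block size are illustrative):

```
import hashlib
import os
from collections import defaultdict

def file_digest(path: str) -> str:
    """Hash a file in streaming fashion so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def find_identical_files(root: str) -> dict:
    """Group files under `root` by content hash; groups with >1 entry are duplicates."""
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```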


👤 IcePic
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm and https://en.wikipedia.org/wiki/Rolling_hash#Cyclic_polynomial can be used to find common substrings and are used by deduplicating programs.
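A minimal Rabin-Karp style sketch (window size, base, and modulus are illustrative): it hashes every fixed-size window of one byte string in O(1) per step and reports windows that also occur in another.

```
WINDOW = 64
BASE = 257
MOD = (1 << 61) - 1

def window_hashes(data: bytes):
    """Yield (offset, hash) for every WINDOW-byte window, updated incrementally."""
    if len(data) < WINDOW:
        return
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    yield 0, h
    top = pow(BASE, WINDOW - 1, MOD)
    for i in range(WINDOW, len(data)):
        h = (h - data[i - WINDOW] * top) % MOD   # drop the byte leaving the window
        h = (h * BASE + data[i]) % MOD           # add the byte entering the window
        yield i - WINDOW + 1, h

def common_windows(a: bytes, b: bytes):
    """Offsets in `b` whose WINDOW-byte content also appears in `a`."""
    seen = {h: off for off, h in window_hashes(a)}
    for off, h in window_hashes(b):
        # Byte comparison guards against hash collisions.
        if h in seen and a[seen[h]:seen[h] + WINDOW] == b[off:off + WINDOW]:
            yield seen[h], off
```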

👤 solardev
I should just note that you might have to account for how data recovery would work in such a system: if shared blocks are lost and only the references to them remain, every file that referenced those blocks is damaged.