I would like to deduplicate parts of files, but I am wondering whether there are better (existing) solutions than simply searching for large identical blocks across the files, extracting them, and replacing each occurrence with a reference.
The task is much simpler if you only want to find bit-identical whole files rather than parts of files. In that case, you can run a tool like `sha1sum` over each file and record the hash digest in a database: identical files always produce the same hash, and non-identical files will, with high probability, produce different hashes.
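As a minimal sketch of that whole-file approach (the function names and the chunked-reading strategy here are my own choices, not part of any existing tool), you can hash each file and group paths by digest:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-1 of a file, read in chunks so large files need not fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Group all files under `root` by digest; return only groups with duplicates."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Comparing file sizes first (and hashing only size-collisions) would avoid hashing files that cannot possibly be duplicates, but the hash-everything version above is the simplest correct baseline.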