HACKER Q&A
📣 highwaylights

Best tool for de-duplication of local files?


I’m currently helping a friend restore recovered photos and videos from multiple backup dumps, from multiple cloud providers, spanning multiple time periods.

There is quite a bit of overlap, with many instances of the same file having been backed up in different places.

As it stands, I’ve collated the dumps into local folders on a fairly fast NVMe drive, and I’m trying to think of the best way to merge the 3-4 local folder trees representing the cloud dumps into a single output folder, skipping binary duplicates by hash.

Ideally I’d also be able to detect duplicates where some files have been compressed (and so won’t match on hash), but I have a feeling that’s a really optimistic goal for any kind of automation.

Can anyone help with a suggestion?


👤 Someone (Accepted Answer ✓)
I would first remove exact duplicates: generate a database of hashes for every single file and automatically discard any file whose hash you have already seen.
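
Something along these lines is what I mean; a rough, untested sketch using only the Python standard library (the naming scheme for the merged folder is just a placeholder):

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def merge_unique(source_roots, dest_root):
    """Copy the first file seen for each hash; later binary duplicates are skipped."""
    seen = {}  # sha256 -> every original path that had this exact content
    dest = Path(dest_root)
    dest.mkdir(parents=True, exist_ok=True)
    for root in source_roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            digest = sha256_of(path)
            if digest not in seen:
                # first time this content shows up: copy it into the merged folder;
                # the hash prefix avoids name clashes between the different dumps
                shutil.copy2(path, dest / f"{digest[:12]}_{path.name}")
            seen.setdefault(digest, []).append(path)
    return seen
```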

Once you have removed the easy duplicates, use perceptual hashing (https://en.wikipedia.org/wiki/Perceptual_hashing) to find likely copies. Then either inspect the candidates manually to check whether they really are variants of the same image, or just assume they are. Once you have a set of (assumed) duplicates, either manually pick the best one or simply keep the largest file (that works reasonably well for .jpg files, but may discard the better image if, say, an original JPEG 2000 file was converted to a smaller .png).
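
For the perceptual-hashing step, here is a rough sketch using the third-party Pillow and ImageHash packages (pip install pillow imagehash); the distance threshold is a guess, the function names are my own, and videos would need a different approach (e.g. hashing extracted frames):

```python
from pathlib import Path
from PIL import Image
import imagehash

def perceptual_groups(image_paths, max_distance=5):
    """Group images whose perceptual hashes are within max_distance bits."""
    hashes = []
    for path in image_paths:
        try:
            with Image.open(path) as img:
                hashes.append((path, imagehash.phash(img)))
        except OSError:
            continue  # skip anything Pillow cannot read (videos, corrupt files)

    groups = []
    for path, h in hashes:
        for group in groups:
            if h - group[0][1] <= max_distance:  # Hamming distance between hashes
                group.append((path, h))
                break
        else:
            groups.append([(path, h)])
    return [[p for p, _ in g] for g in groups if len(g) > 1]

def keep_largest(group):
    """The heuristic above: among assumed duplicates, keep the biggest file."""
    return max(group, key=lambda p: Path(p).stat().st_size)
```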

I would also keep the original paths, format, and size for every disambiguated file. That may come in handy later and shouldn’t add much overhead.
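
For that bookkeeping, something as simple as a CSV manifest would do; this assumes the hash-to-original-paths mapping from the earlier sketch, and the column layout is just my own suggestion:

```python
import csv

def write_manifest(seen, manifest_path="manifest.csv"):
    """seen: sha256 -> list of original Paths that contained that exact content."""
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sha256", "original_path", "size_bytes", "extension"])
        for digest, paths in seen.items():
            for p in paths:
                writer.writerow([digest, str(p), p.stat().st_size, p.suffix])
```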

Edit: if you think you can trust file creation/modification dates, you can also use them to find the likely original version of a file.
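
A trivial way to apply that, assuming the earliest modification time marks the original (cloud clients often rewrite mtimes on download, so treat it as a hint only):

```python
import os

def likely_original(paths):
    """Pick the path with the earliest modification time from a duplicate group."""
    return min(paths, key=os.path.getmtime)
```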


👤 AlexITC
I had a similar issue with photos/videos and ended up building a CLI app to organize everything: https://github.com/wiringbits/my-photo-timeline. It has worked relatively well for my use case, though there are still corner cases it doesn’t cover; for example, it only compares exact hashes, so it won’t catch the same photo at a different resolution.

I understand that https://photostructure.com/ has a far more sophisticated dedup algorithm, which could be worth a try.


👤 97-109-107
Perhaps [qarmin/czkawka: Multi-functional app to find duplicates, empty folders, similar images etc.](https://github.com/qarmin/czkawka) and its alternatives mentioned in the readme?

👤 bradknowles
There are many different applications in this space, but you could start with the list at https://alternativeto.net/software/remo-duplicate-photos-rem...

👤 kodachi
I recommend rmlint; I’ve had great experiences with it. Just run rmlint /your/dir and you get an executable script to remove all of the duplicates it found.

https://rmlint.readthedocs.io/en/latest/


👤 aquajet
I haven't tried this out personally but have used some of his other tools: https://github.com/anishathalye/periscope