HACKER Q&A
📣 JKCalhoun

Identify duplicate files in my data hoard?


Backing up my documents and files to portable hard drives has been my strategy for decades now. I have so much data, copied from one machine after another and from one drive to another, that I have multiple copies of many files squirreled away in different locations on my backup drives (who knows how that happens; organization strategies changing over time?).

There has to be a tool (script?) that can scan a volume and, using filenames and file sizes, come up with a list of possible duplicate files.

I don't think I would trust it to auto-cleanup (auto-delete duplicates) but at least with the duplicate paths laid out I could go through and begin the slow process of pruning until I have a final, canonical (and singular) backup of my files.

(Bonus if I can point it at iCloud for the same purpose.)


  👤 tony-allan Accepted Answer ✓
If you're handy with Python, a script that computes the MD5 of each file and saves it to a SQLite database isn't that hard to write. It can then identify identical files irrespective of file name.
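
For what it's worth, here is a minimal sketch of that approach (the root path and database filename are placeholders): walk the tree, MD5 each file into SQLite, then group by hash.

  # Rough sketch: hash every file under a root, record path/size/MD5 in
  # SQLite, then group by hash to find files with identical content.
  # The root path and database filename below are placeholders.
  import hashlib
  import os
  import sqlite3

  def md5_of(path, chunk_size=1 << 20):
      """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  def index_tree(root, db_path="hashes.db"):
      con = sqlite3.connect(db_path)
      con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, size INTEGER, md5 TEXT)")
      for dirpath, _, filenames in os.walk(root):
          for name in filenames:
              full = os.path.join(dirpath, name)
              try:
                  con.execute("INSERT INTO files VALUES (?, ?, ?)",
                              (full, os.path.getsize(full), md5_of(full)))
              except OSError:
                  pass  # unreadable file, broken symlink, etc.
      con.commit()
      return con

  def duplicates(con):
      """Yield (md5, [paths]) for every hash that appears more than once."""
      rows = con.execute("SELECT md5, GROUP_CONCAT(path, char(10)) FROM files "
                         "GROUP BY md5 HAVING COUNT(*) > 1")
      for digest, paths in rows:
          yield digest, paths.split("\n")

  con = index_tree("/Volumes/Backup")  # placeholder mount point
  for digest, paths in duplicates(con):
      print(digest)
      for p in paths:
          print("   " + p)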

👤 nullrouten
Duplicates aren't always bad… some files naturally exist in many places, and removing them from some of those places makes that directory/app incomplete.

If you do want to save space by storing one copy of the bits/blocks, and still retain an index of all the original locations, you can store all your backups on a ZFS filesystem with dedup turned on (this uses memory and has performance implications).

Or back everything up with restic:

https://github.com/restic/restic

restic stores files encrypted in a tree keyed by their hash, so it naturally stores one copy of a file with as many references to it as needed. It has lookup and list functions that can tell you what's duplicated.

To simply find and report dupes to be dealt with manually, you could quite easily MD5/SHA-1 your entire file tree, store the output in a text file, and pipe it through sort, awk, and uniq to see which hashes occupy multiple lines. This is labor intensive, though; I just let my backup tools “compress” by saving one copy of each hash, and then it doesn't matter as much (in my opinion).
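
For example, a rough Python equivalent of that report, grouping by size first so only the size collisions get hashed (the root path is a placeholder):

  # Group files by size first (cheap), then hash only the size collisions.
  import hashlib
  import os
  from collections import defaultdict

  def sha1_of(path, chunk_size=1 << 20):
      h = hashlib.sha1()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  by_size = defaultdict(list)
  for dirpath, _, names in os.walk("/mnt/backup"):  # placeholder root
      for name in names:
          p = os.path.join(dirpath, name)
          try:
              if os.path.isfile(p):
                  by_size[os.path.getsize(p)].append(p)
          except OSError:
              pass  # permission errors, vanished files, etc.

  for size, paths in by_size.items():
      if len(paths) < 2:
          continue                      # a unique size can't be a duplicate
      by_hash = defaultdict(list)
      for p in paths:
          by_hash[sha1_of(p)].append(p)
      for digest, dupes in by_hash.items():
          if len(dupes) > 1:
              print(size, digest, dupes)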

If it's pictures or some other specific file type that you want to focus on the most, I'd pick an app that's intended for cataloging those. Example: Adobe Lightroom shows me my duplicate pics, and I can deal with those easily there.


👤 luzifer42
DupeGuru is an interesting tool to find duplicates.

It's fast and flexible.

It can even search for similar files (binary, music and pictures).

https://github.com/arsenetar/dupeguru


👤 fxde
There are two tools I regularly use on Linux:

  fdupes
  rmlint
They don't always give the same results. I also ran into problems scanning an SMB share, but I would say it's worth giving them a try.

👤 Liru
Czkawka worked pretty well for me.

https://github.com/qarmin/czkawka


👤 Raziarazzi
If you're signed in to the OneDrive sync app on your computer, you can access your OneDrive using File Explorer. You can also access your folders from any device by using the OneDrive mobile app.

👤 groffee
Do you really want to delete duplicate files?

If one of your drives gets bricked or accidentally formatted, there's a chance you'll still have the files backed up somewhere else.


👤 abdullin
I can’t recommend Borg Backup enough (OSS).

It does deduplication at the chunk level.

This handles both duplicate files and large binaries that change slowly over time.
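
As a toy illustration of the idea only (Borg's real chunker is content-defined, not fixed-size): identical chunks are stored once, and each file becomes a list of chunk references.

  # Toy illustration of chunk-level dedup (not Borg's actual algorithm).
  import hashlib

  CHUNK_SIZE = 4 * 1024 * 1024          # fixed 4 MiB chunks, for the sketch only
  store = {}                            # chunk hash -> chunk bytes

  def backup(path):
      """Return the list of chunk hashes that reconstructs the file."""
      refs = []
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
              digest = hashlib.sha256(chunk).hexdigest()
              store.setdefault(digest, chunk)   # stored once, referenced many times
              refs.append(digest)
      return refs

  # Two identical files share every chunk, and a large file that only changed
  # near the end still shares most of its chunks, so the store barely grows.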


👤 2Gkashmiri
Uh... I need something like this on Android. My photos in folders have gotten out of hand.

👤 johng
rdfind is amazing for this. You can install it on Windows in a Linux shell, and it already works on the Mac.