HACKER Q&A
📣 JKCalhoun

Identify duplicate files in my data hoard?


Backing up my documents and files to portable hard drives has been my strategy for decades now. I have so much data, copied from one machine after another and from one drive to another, that I have multiple copies of many files squirreled away in different locations on my backup drives (who knows how that happens; organization strategies changing over time?).

There has to be a tool (script?) that can scan a volume and, using filenames and file sizes, come up with a list of possible duplicate files.

I don't think I would trust it to auto-cleanup (auto-delete duplicates) but at least with the duplicate paths laid out I could go through and begin the slow process of pruning until I have a final, canonical (and singular) backup of my files.

(Bonus if I can point it at iCloud for the same purpose.)


  👤 tony-allan Accepted Answer ✓
If you're handy with Python, a script that computes the MD5 of each file and saves it to a SQLite database isn't that hard to write. It can then identify identical files irrespective of file name.
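
For what it's worth, here is a minimal sketch of that approach (the root path and database filename are placeholders): walk the tree, MD5 each file into SQLite, then group by hash.

  # Rough sketch: hash every file under a root, record path/size/MD5 in
  # SQLite, then group by hash to find files with identical content.
  # The root path and database filename below are placeholders.
  import hashlib
  import os
  import sqlite3

  def md5_of(path, chunk_size=1 << 20):
      """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  def index_tree(root, db_path="hashes.db"):
      con = sqlite3.connect(db_path)
      con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, size INTEGER, md5 TEXT)")
      for dirpath, _, filenames in os.walk(root):
          for name in filenames:
              full = os.path.join(dirpath, name)
              try:
                  con.execute("INSERT INTO files VALUES (?, ?, ?)",
                              (full, os.path.getsize(full), md5_of(full)))
              except OSError:
                  pass  # unreadable file, broken symlink, etc.
      con.commit()
      return con

  def duplicates(con):
      """Yield (md5, [paths]) for every hash that appears more than once."""
      rows = con.execute("SELECT md5, GROUP_CONCAT(path, char(10)) FROM files "
                         "GROUP BY md5 HAVING COUNT(*) > 1")
      for digest, paths in rows:
          yield digest, paths.split("\n")

  con = index_tree("/Volumes/Backup")  # placeholder mount point
  for digest, paths in duplicates(con):
      print(digest)
      for p in paths:
          print("   " + p)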

👤 nullrouten
Duplicates aren't always bad… some files naturally exist in many places, and removing them from some of those places makes that directory/app incomplete.

If you do want to save space by storing one copy of the bits/blocks, and still retain an index of all the original locations, you can store all your backups on a ZFS filesystem with dedup turned on (this uses memory and has performance implications).

Or back everything up with restic:

https://github.com/restic/restic

restic stores files encrypted in a tree keyed by their hash, so it naturally stores one copy of a file with as many references to it as needed. It has lookup and list functions that can tell you what's duplicated.

To simply find and report dupes to be dealt with manually, you could quite easily MD5/SHA-1 your entire file tree, store the output in a text file, and pipe it through sort, awk, and uniq to see which hashes occupy multiple lines. This is labor intensive, though; I just let my backup tools “compress” by saving one copy of each hash, and then it doesn't matter as much (in my opinion).
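
For example, a rough Python equivalent of that report, grouping by size first so only the size collisions get hashed (the root path is a placeholder):

  # Group files by size first (cheap), then hash only the size collisions.
  import hashlib
  import os
  from collections import defaultdict

  def sha1_of(path, chunk_size=1 << 20):
      h = hashlib.sha1()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  by_size = defaultdict(list)
  for dirpath, _, names in os.walk("/mnt/backup"):  # placeholder root
      for name in names:
          p = os.path.join(dirpath, name)
          try:
              if os.path.isfile(p):
                  by_size[os.path.getsize(p)].append(p)
          except OSError:
              pass  # permission errors, vanished files, etc.

  for size, paths in by_size.items():
      if len(paths) < 2:
          continue                      # a unique size can't be a duplicate
      by_hash = defaultdict(list)
      for p in paths:
          by_hash[sha1_of(p)].append(p)
      for digest, dupes in by_hash.items():
          if len(dupes) > 1:
              print(size, digest, dupes)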

If it's pictures or some other specific file type that you want to focus on the most, I'd pick an app that's intended for cataloging those. Example: Adobe Lightroom shows me my duplicate pics, and I can deal with those easily there.


👤 luzifer42
DupeGuru is an interesting tool to find duplicates.

It's fast and flexible.

It can even search for similar files (binary, music and pictures).

https://github.com/arsenetar/dupeguru


👤 fxde
There are two tools I regularly use on Linux:

  fdupes
  rmlint
They don't always give the same results. I also ran into problems scanning an SMB share, but I would say it's worth giving them a try.

👤 Liru
Czkawka worked pretty well for me.

https://github.com/qarmin/czkawka


👤 Raziarazzi
If you're signed in to the OneDrive sync app on your computer, you can access your OneDrive using File Explorer. You can also access your folders from any device by using the OneDrive mobile app.

👤 groffee
Do you really want to delete duplicate files?

If one of your drives gets bricked or accidentally formatted, there's a chance you'll still have the files backed up somewhere else.


👤 abdullin
I can’t recommend Borg Backup enough (OSS).

It does deduplication at the chunk level.

This handles both duplicate files and large binaries that change slowly over time.
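
As a toy illustration of the idea only (Borg's real chunker is content-defined, not fixed-size): identical chunks are stored once, and each file becomes a list of chunk references.

  # Toy illustration of chunk-level dedup (not Borg's actual algorithm).
  import hashlib

  CHUNK_SIZE = 4 * 1024 * 1024          # fixed 4 MiB chunks, for the sketch only
  store = {}                            # chunk hash -> chunk bytes

  def backup(path):
      """Return the list of chunk hashes that reconstructs the file."""
      refs = []
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
              digest = hashlib.sha256(chunk).hexdigest()
              store.setdefault(digest, chunk)   # stored once, referenced many times
              refs.append(digest)
      return refs

  # Two identical files share every chunk, and a large file that only changed
  # near the end still shares most of its chunks, so the store barely grows.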


👤 2Gkashmiri
Uh... I need something like this on Android. My photos in folders have gotten out of hand.

👤 johng
rdfind is amazing for this. You can install it on Windows in a Linux shell, and it already works on the Mac.