HACKER Q&A
📣 mothsonasloth

How do you organise and de-duplicate photos from old storage?


I have reached a point in my life where I have 7 HDDs (ranging from 60GB to 500GB) filled with family photos and other stuff.

They have been gathering dust in my "bits and junk box".

I have started copying them over to a newer 2TB external HDD.

The problem is that there is definitely duplication, and junk files are coming through as well (macOS hidden files, thumbnails generated by image software, etc.).

Any advice on how to decant all these photos, videos and other media to one storage location?

I tried copying everything over and then running dupeGuru. However, it didn't seem smart enough to tell the difference between a live version of a song and the studio recording of the same song, so I wasn't confident in its de-duplication capabilities.

So HN, how do you go about solving this problem?


  👤 notemaker Accepted Answer ✓
For myself and my family, I wrote my own tool [1] that runs every day on an "input folder". A quick Google search for "github photo organizer" shows that a lot of others have done the same :)

It organizes all traversed photos by date (extracted from exif or from filename), and puts them in a "failed" folder if it can't parse the date.

If any photos get the same name, they are either deduped because they are exact duplicates, or are marked as conflicts (e.g. A.jpg and A_conflict1.jpg) if they are different.
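
To make the dedupe-or-conflict rule concrete, here is a minimal sketch of the idea. It is not the code from the linked repo: the function names, the YYYY/MM folder layout and the filename date pattern are assumptions for illustration, and EXIF parsing is left out to keep it standard-library only.

```python
import filecmp
import re
import shutil
from pathlib import Path

# Assumed filename convention: an 8-digit date such as IMG_20190704_101500.jpg
DATE_IN_NAME = re.compile(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)")


def target_dir(photo: Path, library: Path) -> Path:
    """Return library/YYYY/MM for a dated filename, else library/failed."""
    match = DATE_IN_NAME.search(photo.name)
    if match is None:
        return library / "failed"
    year, month, day = match.groups()
    if not (1 <= int(month) <= 12 and 1 <= int(day) <= 31):
        return library / "failed"
    return library / year / month


def place(photo: Path, library: Path) -> Path:
    """Move a photo into the library; dedupe exact copies, rename conflicts."""
    folder = target_dir(photo, library)
    folder.mkdir(parents=True, exist_ok=True)
    dest = folder / photo.name
    counter = 0
    while dest.exists():
        # shallow=False forces a byte-for-byte comparison of the contents
        if filecmp.cmp(photo, dest, shallow=False):
            photo.unlink()  # exact duplicate: drop the incoming copy
            return dest
        counter += 1
        dest = folder / f"{photo.stem}_conflict{counter}{photo.suffix}"
    shutil.move(str(photo), str(dest))
    return dest
```

Note that the duplicate branch deletes the incoming file, so try it on a copy of your data first.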

Last time I used it for a large input it took 3h for 200GB, though I suspect network latency was the main bottleneck.

It's around 300 lines of Python, so verify the code for yourself if you want to use it! You will probably also need to fork it if you don't intend to run it on a Synology NAS.

However, as I mentioned last time I pitched this, elodie [2] might be more suitable for others than my little hack. Haven't used it though!

1: https://github.com/johan-andersson01/photo_organizer

2: https://github.com/jmathai/elodie


👤 brudgers
Problem? What problem?

If I have two copies of an image it is easier to find and less likely to be lost. Storage is cheap. Time is finite. And anyone who judges me for having duplicate photos has bigger problems than me.

Over the years I've learned that I can make mistakes when deleting and I've had much deeper regrets over things I've lost than irritation over multiple copies of the same thing. Maybe because when I am looking for something I usually stop searching when I find the first copy.

Sure, in a business environment with multiple team members and staff turnover, duplicates and haphazard disorganization are often problematic. But that's not my use case. It's not costing me money beyond a couple of pennies per gigabyte.

YMMV. Good luck.


👤 yellow_lead
rmlint[1] works extremely well for duplicated files (including photos)

[1] https://github.com/sahib/rmlint
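
If you want a feel for what exact-duplicate detection involves before trusting a tool with your files, here is a rough sketch in Python. To be clear, this is not how rmlint is implemented; it is just the common "group by size, then hash" approach, and the function names are made up for illustration. It only reports groups of identical files and deletes nothing.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root that have identical contents."""
    # Cheap first pass: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Expensive second pass: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Because this compares contents byte for byte (via the hash), it will never confuse a live recording with a studio one, but it also won't catch resized or re-encoded copies of the same photo.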


👤 mceachen
For photos and videos, you can try PhotoStructure. I wrote it because I was in exactly your situation: a pile of hard drives and no coherent organization. It won't handle deduping other file types, though.

https://photostructure.com/about/introducing-photostructure/