There is quite a bit of overlap, and many instances of the same file have been backed up in different places.
As it stands, I've collated the dumps into local folders on a fairly fast NVMe drive, and I'm trying to think of the best way to merge the 3-4 local folder trees representing the cloud dumps into a single output folder, skipping binary duplicates by hash.
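To make that concrete, here's roughly the kind of thing I've been sketching in Python. The source and destination paths are invented, and I'm assuming a SHA-256 match is enough to call two files identical, so treat it as a starting point rather than a finished tool:

    import hashlib
    import shutil
    from pathlib import Path

    SOURCES = [Path("/data/dump-a"), Path("/data/dump-b"), Path("/data/dump-c")]  # made-up paths
    DEST = Path("/data/merged")

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        # Stream the file so large videos don't have to fit in memory.
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def merge() -> None:
        seen: dict[str, Path] = {}  # digest -> first file copied with that content
        DEST.mkdir(parents=True, exist_ok=True)
        for root in SOURCES:
            for src in root.rglob("*"):
                if not src.is_file():
                    continue
                digest = sha256_of(src)
                if digest in seen:
                    continue  # exact binary duplicate, skip it
                seen[digest] = src
                target = DEST / src.relative_to(root)
                if target.exists():
                    # Same relative path but different content: disambiguate with a hash prefix.
                    target = target.with_name(f"{target.stem}.{digest[:8]}{target.suffix}")
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, target)  # copy2 preserves the original timestamps

    if __name__ == "__main__":
        merge()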
Ideally I'd also be able to detect duplicates where some files have been compressed (so they won't match on hash), but I have a feeling that's a really optimistic goal for any kind of automation.
Can anyone help with a suggestion?
Once you have removed the easy duplicates, use perceptual hashing (https://en.wikipedia.org/wiki/Perceptual_hashing) to find likely copies. Then either inspect the duplicates manually to check whether they really are variants of the same file, or just assume they are. When you have the set of (assumed) duplicates, either manually pick the best one or just keep the largest file (that works reasonably well for .jpg files, but may discard better images if, say, an original JPEG 2000 file was converted to a smaller .png).
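A minimal sketch of that step, assuming a Python workflow with the Pillow and imagehash libraries; the folder path and the distance threshold are placeholders you'd want to tune on a sample of your own images:

    from pathlib import Path
    from PIL import Image
    import imagehash

    MERGED = Path("/data/merged")  # hypothetical: output of the exact-hash merge
    THRESHOLD = 5                  # max Hamming distance to treat two images as variants

    seen: list[tuple[imagehash.ImageHash, Path]] = []
    likely_dupes: list[tuple[Path, Path]] = []

    for path in MERGED.rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
            continue
        try:
            h = imagehash.phash(Image.open(path))
        except OSError:
            continue  # unreadable or not actually an image
        for other_hash, other_path in seen:
            if h - other_hash <= THRESHOLD:  # ImageHash subtraction is Hamming distance
                likely_dupes.append((path, other_path))
        seen.append((h, path))

    for a, b in likely_dupes:
        print(f"likely duplicates: {a} <-> {b}")

The pairwise comparison is quadratic, which is fine for tens of thousands of images; beyond that you'd want to bucket the hashes or use something like a BK-tree.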
I would also keep all the original paths, plus format and size, for every disambiguated file. That may come in handy later, and shouldn't add much.
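For example, something as simple as a JSON-lines manifest would do; the field names here are just a suggestion:

    import json
    from pathlib import Path

    def record(manifest: Path, kept: Path, originals: list[Path]) -> None:
        # One line per kept file: where it ended up, everywhere it came from,
        # plus format and size for later sanity checks.
        entry = {
            "kept": str(kept),
            "originals": [str(p) for p in originals],
            "format": kept.suffix.lstrip(".").lower(),
            "size_bytes": kept.stat().st_size,
        }
        with manifest.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")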
Edit: if you think you can trust file creation/modification dates, you can also use them to find the likely original version of a file.
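Once the duplicates are grouped, picking the likely original that way is the easy part (the grouping is the hard part), e.g.:

    from pathlib import Path

    def likely_original(group: list[Path]) -> Path:
        # Assume the copy with the earliest modification time is the original.
        return min(group, key=lambda p: p.stat().st_mtime)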
I understand that https://photostructure.com/ has a far more sophisticated dedup algorithm, which could be worth a try.