HACKER Q&A
📣 b20000

Tool to find identical file subtrees scattered over disks


Over the years I have made backups of backups, over multiple disks, moving to different computers, and the result is now that I have multiple identical subtrees of files in several places, without a good way to detect which trees are a subset of other trees. I have been searching for a long time for "really smart software" that can help organize archives of work files and detect identical subtrees and subtrees that are older versions of newer trees etc. Does anyone know of commercial/open source tools that help with this?


  👤 hayst4ck Accepted Answer ✓
Identical subtrees sounds like a pretty reasonable interview question, probably an hour or two of work (more because you want to actually do it without hand-waving!).

Create a tree data structure that mirrors the file system and hash each file in place in the tree. Then build a second tree, bottom-up, where each parent's hash is computed by sorting its children's hashes and hashing the result.

Any two directories that end up with the same hash should have identical contents.
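
A rough sketch of that idea in Python, in case it helps. The function names (hash_file, hash_tree, find_duplicate_dirs) are made up for illustration, and a few choices are my own assumptions rather than anything from the question: SHA-256 as the hash, symlinks skipped, empty directories not reported, and file names left out of the hash so that renamed copies still match.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_tree(root, by_hash):
    """Hash `root` bottom-up and record each directory's hash in `by_hash`."""
    child_hashes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_symlink():
                continue  # assumption: symlinks are ignored entirely
            if entry.is_dir():
                # Prefix keeps a directory from colliding with a file hash.
                child_hashes.append("d:" + hash_tree(entry.path, by_hash))
            elif entry.is_file():
                child_hashes.append("f:" + hash_file(entry.path))
    # Sorting makes the result independent of directory listing order.
    child_hashes.sort()
    digest = hashlib.sha256("\n".join(child_hashes).encode()).hexdigest()
    if child_hashes:
        # Assumption: empty directories would all match each other, so skip them.
        by_hash[digest].append(root)
    return digest


def find_duplicate_dirs(root):
    """Return {hash: [paths]} for directories whose contents are identical."""
    by_hash = defaultdict(list)
    hash_tree(root, by_hash)
    return {h: sorted(p) for h, p in by_hash.items() if len(p) > 1}
```

Calling find_duplicate_dirs on the root of the search gives one group of paths per set of identical directories. Subdirectories of duplicated trees show up as their own groups too, so you would probably want to report only the topmost match in each case.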

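I actually wouldn't be surprised if just straight:

  # Hash every file; NUL-delimited so paths containing spaces survive
  find "$ROOT_OF_SEARCH" -type f -print0 | xargs -0 shasum | sort > file_hashes
  # Count how often each hash occurs; the most duplicated ones end up last
  awk '{print $1}' file_hashes | sort | uniq -c | sort -n > hash_freq
  # View both lists side by side
  vim -O file_hashes hash_freq

was good enough.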

> subtrees that are older versions of newer trees etc

This is more complicated. How would you concretely define the relationship between older and newer versions?


👤 pyinstallwoes