HACKER Q&A
📣 sandreas

Does hashing only part of a file make sense as a unique checksum?


Hey HN,

currently I'm running into a performance issue with my little side project `tonehub`[1]. It's a small web API with a background indexer task for audio files, still in a pretty early state.

Introduction: Sometimes I move an audio file to another directory because its metadata changed. This results in losing all of the file's non-metadata history (playback count, current position, playlists, etc.).

To overcome this, I implemented hashing via xxhash over only the audio part of the file, skipping the metadata. If a file being indexed is not found at its location in the database, the indexer hashes it and looks the hash up; if there is a unique match, it updates only the location, keeping the history and relations.
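
A rough sketch of the idea (Python with the `xxhash` package, not my actual code; `audio_offset` would come from the container parser):

```python
import xxhash

def audio_hash(path: str, audio_offset: int, chunk_size: int = 1 << 20) -> str:
    """Hash everything after the metadata region, so tag edits don't change the hash."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        f.seek(audio_offset)  # skip the metadata part of the container
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```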

Now to my problem: it's too slow. I have many audiobook files in m4b format, most of them bigger than 200 MB, and hashing a file that size takes long enough that indexing a whole library feels too slow in my opinion.

So I thought about the following alternatives to improve that:

  - Hashing only a fixed-length part of the file (e.g. 5 MB around the midpoint, because intros and outros are often the same)
  
  - Hashing a percentage-sized part (e.g. 5% of the audio data size)
  
  - Combining one of these "partial" hashes with a size check (e.g. hash=0815471 + size=8340485 bytes), because a simultaneous hash and size collision should be even less likely (see the sketch below)
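
For illustration, alternatives 1 and 3 combined might look roughly like this (again just a Python sketch; the 5 MB window is an arbitrary placeholder and `audio_offset` is assumed to be known):

```python
import os
import xxhash

WINDOW = 5 * 1024 * 1024  # 5 MB, placeholder value

def partial_key(path: str, audio_offset: int) -> tuple[str, int]:
    """Hash a window around the midpoint of the audio data and pair it with the audio size."""
    audio_size = os.path.getsize(path) - audio_offset
    start = audio_offset + max(0, audio_size // 2 - WINDOW // 2)
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        f.seek(start)
        h.update(f.read(min(WINDOW, audio_size)))
    return h.hexdigest(), audio_size  # look files up by (hash, audio_size)
```
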
It feels like that still won't be reliable enough, but I'm not sure. So I ask HN:

  Would one of these alternatives be enough to avoid collisions? 

  If so, what would be a "sufficient" part of the file and which alternative is the best?
Thank you

[1] https://github.com/sandreas/tonehub


  👤 compressedgas Accepted Answer ✓
I would recommend not moving around files that are under program control. Have a means to do so through the program.

If the program must support files being moved underneath it, use an extended attribute to mark the files with a UUID or equivalent such as NTFS's OBJECTID. If that is not possible, use a combination of the file's fileid or inode number and its mtime and size to match up the missing files with files found elsewhere.
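
As a sketch (Python, Linux-only `os.setxattr`/`os.getxattr`; the attribute name is just an example):

```python
import os
import uuid

ATTR = "user.tonehub.id"  # example attribute name

def ensure_file_id(path: str) -> str:
    """Return the file's stored UUID, creating one if it doesn't have one yet."""
    try:
        return os.getxattr(path, ATTR).decode()
    except OSError:
        file_id = str(uuid.uuid4())
        os.setxattr(path, ATTR, file_id.encode())
        return file_id

def fallback_key(path: str) -> tuple[int, float, int]:
    """Fallback when xattrs aren't available: inode + mtime + size."""
    st = os.stat(path)
    return st.st_ino, st.st_mtime, st.st_size
```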


👤 beardyw
No checksum is unique. Your problem is to do with the probability of collision. My guess is that hashing a very small part of the file would suffice as long as it contains actual audio (not silence), maybe just a few seconds' worth. If you are concerned that you may have different edits of the same audio, you could try skipping through the file and taking a small sample every n seconds.
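
Translated from seconds to bytes (since the byte rate depends on the encoding), a sketch could look like this; the block and step sizes are arbitrary and would need tuning:

```python
import os
import xxhash

def sampled_hash(path: str, audio_offset: int,
                 block: int = 64 * 1024, step: int = 16 * 1024 * 1024) -> str:
    """Hash a small block every `step` bytes of the audio data instead of everything."""
    size = os.path.getsize(path)
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        pos = audio_offset
        while pos < size:
            f.seek(pos)
            h.update(f.read(block))
            pos += step
    return h.hexdigest()
```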

👤 h2odragon
In this instance, it seems to me that a hash that only samples the input data would be just as useful as one that runs over the whole set. Sample every 128th byte, for example, to build your hash.
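
For example (Python sketch; it still reads the whole file, it just pushes far fewer bytes through the hash function):

```python
import xxhash

def strided_hash(path: str, stride: int = 128, chunk: int = 1 << 20) -> str:
    """Feed only every `stride`-th byte of the file into the hash."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            h.update(data[::stride])
    return h.hexdigest()
```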

Or you could put your metadata into your audio files so it sticks with them when they move.