Currently, I'm having a performance issue with my little side project `tonehub`[1]. It's a small Web API with a background indexer task for audio files, still in a pretty early state.
Introduction: Sometimes I move an audio file to another directory because its metadata changed. This results in losing all of the file's non-metadata history (playback count, current position, playlists, etc.).
To overcome this, I implemented hashing via xxhash over only the audio part of the file, skipping the metadata. When the indexer encounters a file whose location is not in the database, it hashes the file, looks the hash up, and if there is a unique match, it updates only the location, keeping the history and relations intact.
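To illustrate what "hashing only the audio part" means for m4b/mp4, here is a rough sketch in Python (not the project's actual code; the minimal box walk below also ignores edge cases like size-0 boxes that run to end of file): the audio samples live in the top-level `mdat` box, so only that payload gets hashed, while `moov` (which holds the tags) and everything else is skipped.

```python
import struct
import xxhash  # pip install xxhash

def audio_hash(path: str) -> str:
    """Hash only the payload of top-level 'mdat' boxes, skipping metadata."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            payload = size - 8
            if size == 1:  # 64-bit "largesize" follows the 8-byte header
                size = struct.unpack(">Q", f.read(8))[0]
                payload = size - 16
            if box_type == b"mdat":
                # stream the audio payload through the hash in 1 MiB chunks
                remaining = payload
                while remaining > 0:
                    chunk = f.read(min(remaining, 1 << 20))
                    if not chunk:
                        break
                    h.update(chunk)
                    remaining -= len(chunk)
            else:
                f.seek(payload, 1)  # skip non-audio boxes (ftyp, moov, ...)
    return h.hexdigest()
```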
Now to my problem: it's too slow. I have many audiobook files in m4b format, most of them bigger than 200MB, and hashing a file of that size takes long enough that indexing a whole library feels too slow in my opinion.
So I thought about the following alternatives to improve this:
- Hashing only a fixed-length part of the file (e.g. 5MB around the midpoint, because intros and outros are often the same)
- Hashing a percentage-based part (e.g. 5% of the audio data size)
- Combining one of these "partial" hashes with a size check (e.g. hash=0815471 + size=8340485 bytes, because a simultaneous hash and size collision may be less likely; see the sketch after this list)
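For what it's worth, alternatives 1 and 3 combined could look roughly like this (same illustrative Python as above; `audio_offset` and `audio_size` are assumed to come from the `mdat` scan shown earlier, and the 5 MiB window is just the number from my example):

```python
import xxhash  # pip install xxhash

WINDOW = 5 * 1024 * 1024  # fixed 5 MiB window, per alternative 1

def partial_fingerprint(path: str, audio_offset: int,
                        audio_size: int) -> tuple[str, int]:
    """Hash a fixed window centred on the midpoint of the audio payload
    and pair it with the payload size, per alternative 3."""
    start = audio_offset + max(0, audio_size // 2 - WINDOW // 2)
    remaining = min(WINDOW, audio_size)
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        f.seek(start)
        while remaining > 0:
            chunk = f.read(min(remaining, 1 << 20))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest(), audio_size
```

The lookup key would then be the (hash, size) pair, so two files would only "collide" if both values match.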
It feels like that won't be reliable. So I ask HN: would one of these alternatives be enough to avoid collisions?
If so, what would be a "sufficient" part of the file to hash, and which alternative is best?
Thank you!

[1] https://github.com/sandreas/tonehub
If the program must support files being moved underneath it, use an extended attribute to mark the files with a UUID or equivalent such as NTFS's OBJECTID. If that is not possible, use a combination of the file's fileid or inode number and its mtime and size to match up the missing files with files found elsewhere.
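On Linux, for example, that could look roughly like this (Python sketch; `user.tonehub.id` is a made-up attribute name, and the fallback uses the inode/mtime/size combination):

```python
import os
import uuid

XATTR = "user.tonehub.id"  # made-up attribute name

def stable_file_id(path: str) -> str:
    """Return a UUID stored in an extended attribute, creating it on
    first sight; fall back to inode + mtime + size where xattrs are
    unsupported (e.g. some network or FAT filesystems)."""
    try:
        return os.getxattr(path, XATTR).decode()
    except OSError:
        pass
    try:
        file_id = str(uuid.uuid4())
        os.setxattr(path, XATTR, file_id.encode())
        return file_id
    except OSError:
        st = os.stat(path)
        return f"{st.st_ino}:{st.st_mtime_ns}:{st.st_size}"
```

Note that xattrs survive renames and moves within the same filesystem, but some copy tools drop them unless told to preserve them (e.g. `cp --preserve=xattr`).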
Or you could put your metadata into your audio files, so it sticks with them when they move.
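For m4b that could be a custom freeform atom, e.g. with mutagen (sketch; the `----:com.tonehub:id` key is made up):

```python
from mutagen.mp4 import MP4, MP4FreeForm  # pip install mutagen

def embed_id(path: str, file_id: str) -> None:
    """Write an application-specific ID into a freeform MP4 atom
    so it travels with the file wherever it is moved or copied."""
    audio = MP4(path)
    audio["----:com.tonehub:id"] = [MP4FreeForm(file_id.encode("utf-8"))]
    audio.save()
```

The trade-off is that writing a tag rewrites part of the file, though not the `mdat` payload, so an audio-only hash like the one above would stay stable.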