HACKER Q&A
📣 logicallee

What compression doesn't re-include the same file multiple times?


As a test, I copied a 91 megabyte Creative Commons mp3 file to a test folder's /1, /2 and /3 subfolders.

I then tried compressing the three directories into one compressed file. I tried various utilities. I tested every setting of WinRAR (including maximum dictionary size) and most settings of 7zip, as well as tar -czf from the Git Bash prompt.

Only two attempts came out at around 91 megabytes: 7zip outputting .7z with a very large dictionary size (3840 MB), and 7zip outputting .wim, which is some kind of Windows packaging. Every other try by every program included the same file 3 times, pushing the resulting zip or .tar.gz file to 275 MB, even on maximum settings.

Other than 7zip on such large settings, which compression formats would handle this case? Does any open compression format support it?


  👤 elmerfud Accepted Answer ✓
Compression and deduplication are different things, although there is some overlap. The theory behind each is quite different. The only compressor I know of that has a deduplication option built in is rar 5 and higher.
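
One way to see the difference in practice is a small Python sketch comparing two standard-library codecs on repeated data. DEFLATE (the algorithm behind zip and gzip) only looks back 32 KiB for repeats, so a second copy of a large file is out of its reach, while LZMA with its larger dictionary can still refer back to the first copy. The function name `compressed_sizes` and the 200 KB random payload are assumptions made for illustration, not something from the original test.

```python
import lzma
import os
import zlib


def compressed_sizes(payload: bytes, copies: int = 3) -> dict:
    """Compress `copies` back-to-back copies of `payload` with two codecs."""
    data = payload * copies
    return {
        "raw": len(data),
        # DEFLATE: 32 KiB window, cannot see a repeat 200 KB back.
        "zlib": len(zlib.compress(data, 9)),
        # LZMA default preset: 8 MiB dictionary, repeat is within reach.
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    # Random bytes stand in for already-compressed data such as an mp3.
    print(compressed_sizes(os.urandom(200_000)))
```

With a 91 MB file the same reasoning applies at a larger scale: the dictionary has to span the distance between two copies before the compressor can notice the repeat, which fits the observation that only a very large 7zip dictionary helped.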

👤 defrost
You could try (depending on file system) running a {hard|soft} link dedup program (one that replaces duplicate files with hardlinks or symlinks to a single data file), and then archive your tree with tar | zip using a --symlinks flag.

WARNING: I've read about this but never used it, despite storing and archiving for backup and data retention for decades.

I prefer to keep file trees 'clean' (no extraneous dupes) and fully store them as they are.
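
The hard-link route described above could be sketched in Python roughly as follows. This is a minimal illustration, not a tested dedup tool: `file_digest`, `hardlink_duplicates` and `archive` are hypothetical helper names, and the sketch assumes every file sits on one filesystem that supports hard links. It relies on the standard `tarfile` module storing a second hard link to the same inode as a link entry instead of a second copy of the data.

```python
import hashlib
import os
import tarfile


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical files under root with hard links.

    Returns the number of files replaced. This MODIFIES the tree.
    """
    first_seen = {}
    replaced = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            original = first_seen.setdefault(file_digest(path), path)
            if original != path and not os.path.samefile(original, path):
                # Link under a temporary name, then swap it into place.
                tmp = path + ".dedup-tmp"
                os.link(original, tmp)
                os.replace(tmp, path)
                replaced += 1
    return replaced


def archive(root, out_path):
    """Write root to a .tar.gz; hard links become link entries."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(root, arcname=os.path.basename(os.path.normpath(root)))
```

The trade-off is that linked files share one copy of the data, so editing one of them later changes all of them, which is one reason to prefer leaving the tree as it is.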


👤 DemocracyFTW2
The keywords to search for include "content-addressed storage" and "rolling hash". The latter is used by rsync to avoid having to send repeating byte sequences.
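
To make "rolling hash" concrete, here is a small Python sketch in the style of rsync's weak checksum. It is a simplified illustration, not rsync's actual code: the names `weak_checksum`, `roll` and `find_known_blocks` are made up for this example, and a real tool would confirm candidates with a strong hash rather than a direct byte comparison. The point is that sliding the window by one byte updates the checksum in constant time instead of rehashing the whole window.

```python
M = 1 << 16


def weak_checksum(block: bytes):
    """Two 16-bit sums over a block: a plain sum and a position-weighted sum."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one position: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_known_blocks(known: bytes, data: bytes, n: int):
    """Return (offset_in_data, offset_in_known) for each window of data
    that equals an aligned n-byte block of known."""
    index = {}
    for off in range(0, len(known) - n + 1, n):
        index.setdefault(weak_checksum(known[off:off + n]), []).append(off)
    matches = []
    if len(data) < n:
        return matches
    a, b = weak_checksum(data[:n])
    pos = 0
    while True:
        for off in index.get((a, b), ()):
            # The weak checksum can collide, so confirm byte for byte.
            if known[off:off + n] == data[pos:pos + n]:
                matches.append((pos, off))
                break
        if pos + n >= len(data):
            break
        a, b = roll(a, b, data[pos], data[pos + n], n)
        pos += 1
    return matches
```

A content-addressed store takes the complementary step: once a block is recognized as already known, only its hash is recorded instead of the bytes.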

👤 mharig
I am working on a Python progrämmchen (little program) that uses a block-based approach to store files in a SQLite DB. I guess I can push an alpha version to GitHub within the next 2 weeks.
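
For readers wondering what a block-based store in SQLite might look like, here is a minimal sketch. It is not the program mentioned above: the schema, the 64 KiB fixed block size, and the function names `open_store`, `put` and `get` are all assumptions made for illustration. Each block is keyed by its SHA-256, so storing the same file under three paths keeps only one copy of each block.

```python
import hashlib
import sqlite3

BLOCK_SIZE = 64 * 1024  # assumed fixed block size


def open_store(db_path=":memory:"):
    """Open (or create) the store. Hypothetical schema for illustration."""
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE IF NOT EXISTS blocks (
            hash TEXT PRIMARY KEY,
            data BLOB NOT NULL
        );
        CREATE TABLE IF NOT EXISTS files (
            path TEXT NOT NULL,
            seq  INTEGER NOT NULL,
            hash TEXT NOT NULL REFERENCES blocks(hash),
            PRIMARY KEY (path, seq)
        );
    """)
    return con


def put(con, path, content: bytes):
    """Store content under path, reusing blocks that already exist."""
    with con:  # one transaction per file
        con.execute("DELETE FROM files WHERE path = ?", (path,))
        for seq, start in enumerate(range(0, len(content), BLOCK_SIZE)):
            block = content[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            con.execute(
                "INSERT OR IGNORE INTO blocks VALUES (?, ?)", (digest, block)
            )
            con.execute(
                "INSERT INTO files VALUES (?, ?, ?)", (path, seq, digest)
            )


def get(con, path) -> bytes:
    """Reassemble a file from its blocks (empty bytes if path is unknown)."""
    rows = con.execute(
        "SELECT b.data FROM files f JOIN blocks b ON b.hash = f.hash "
        "WHERE f.path = ? ORDER BY f.seq",
        (path,),
    )
    return b"".join(row[0] for row in rows)
```

Fixed-size blocks only deduplicate data that lines up on block boundaries, which is enough for whole-file copies like the three mp3s in the question.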

👤 pestatije
You could consider ln as a compression tool/format...