Think of gperf, or the Unicode tables, which change every year, and where it's not clear which strategy should be used, e.g. to find the script property of a codepoint, or to check whether it's a valid identifier: a three-way array lookup, a perfect hash, or a binary range search. For perfect hashes I already wrote a tool (phash), but for the other data containers and strategies I'm missing the cost models. gperf, for example, is limited, and I routinely create faster hashes with better strategies. I even made memcmp 2000x faster by knowing the data in advance.
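For concreteness, the binary-range-search variant is the easiest of the three to sketch and to attach a cost model to (roughly log2(n) branches, one cache line touched per probe). A minimal sketch in C, with a made-up excerpt of a script table; a real table would be generated from Unicode's Scripts.txt, and the names and ranges here are illustrative only:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical excerpt of a script-property table: sorted,
 * non-overlapping codepoint ranges. */
typedef struct { uint32_t lo, hi; const char *script; } range_t;

static const range_t scripts[] = {
    {0x0041, 0x005A, "Latin"},
    {0x0061, 0x007A, "Latin"},
    {0x0391, 0x03A1, "Greek"},
    {0x03B1, 0x03C9, "Greek"},
    {0x0400, 0x0484, "Cyrillic"},
    {0x4E00, 0x9FFF, "Han"},
};

/* Binary range search: O(log n) probes, no memory beyond the range
 * table itself. The trade-off vs. a multi-stage array lookup is less
 * space and fewer tables to regenerate, but more branches per query. */
static const char *script_of(uint32_t cp)
{
    size_t lo = 0, hi = sizeof scripts / sizeof scripts[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < scripts[mid].lo)      hi = mid;
        else if (cp > scripts[mid].hi) lo = mid + 1;
        else                           return scripts[mid].script;
    }
    return "Unknown";
}

int main(void)
{
    printf("%s\n", script_of(0x03B1)); /* Greek */
    printf("%s\n", script_of(0x4E2D)); /* Han */
    return 0;
}
```

Whether this beats a two- or three-stage table or a perfect hash depends on table size, query distribution, and cache behavior, which is exactly the kind of cost model I'm asking about.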
Does anybody know such a thing?
Tools like the ones listed at https://awesomeopensource.com/projects/binary-analysis, such as radare2 or bap.
Since these are normally used in dynamic, research-oriented workflows, you may find the performance inadequate, and they are more focused on extracting security-relevant data. But the underlying libraries, the toolchains, and generally the way these tools are built for cracking open arbitrarily packed and compressed executable files will likely be a useful learning resource, even if none of it is directly usable for your specific problem.