HACKER Q&A
📣 rurban

Does anyone know of a data structure compiler?


I need a compiler that creates optimized data structures for fast searches over large amounts of known data. Code compilers are commonplace; data compilers should be too.

Think of gperf, or the Unicode tables, which change every year, where it's not clear which strategy to use, e.g. to find the script property of a codepoint, or to check whether it's a valid identifier character: a three-way array lookup, a perfect hash, or a binary range search. For perfect hashes I already wrote a tool (phash), but for other data containers and strategies I'm missing the cost models. E.g. gperf is limited, and I routinely create faster hashes with better strategies. I even made a memcmp 2000x faster by exploiting knowledge of the data.
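To make the strategy question concrete, here is a minimal sketch of the binary-range-search option for a script-property lookup. The ranges and script ids below are invented toy data, not the real Unicode tables; a data compiler would emit a table like this (and pick this layout only when its cost model favors it over a multi-stage array or a perfect hash).

```c
#include <stdint.h>
#include <stddef.h>

/* Toy table: sorted, non-overlapping codepoint ranges mapped to a
 * script id (0 = Latin, 1 = Greek, 2 = Cyrillic). Invented data. */
typedef struct { uint32_t lo, hi; int script; } range_t;

static const range_t ranges[] = {
    {0x0041, 0x005A, 0}, /* A-Z              */
    {0x0061, 0x007A, 0}, /* a-z              */
    {0x0391, 0x03A9, 1}, /* Greek capitals   */
    {0x0410, 0x044F, 2}, /* Cyrillic letters */
};

/* Binary search over the range table; -1 if cp falls in no range.
 * O(log n) with n ranges, vs O(1) for a (bigger) direct table. */
static int script_of(uint32_t cp) {
    size_t lo = 0, hi = sizeof(ranges) / sizeof(ranges[0]);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].lo)      hi = mid;
        else if (cp > ranges[mid].hi) lo = mid + 1;
        else return ranges[mid].script;
    }
    return -1;
}

/* e.g. script_of('Z') -> 0, script_of(0x0394) -> 1 (Greek Delta),
 *      script_of(0x2603) -> -1 (snowman, not in the toy table)     */
```

The trade-off a cost model would have to capture: ranges are tiny and cache-friendly for sparse properties, but every lookup pays the log-n branches.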

Does anybody know such a thing?
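For contrast, the multi-stage array strategy mentioned above looks roughly like this. Again the data is a made-up toy property (ASCII digit or not), not a real Unicode table; the point is the shape: identical blocks share one second-stage entry, which is what compresses the real tables.

```c
#include <stdint.h>

/* Two-stage table sketch with invented data: codepoints are split into
 * 16-codepoint blocks; stage1 maps a block number to a block index in
 * stage2, and stage2 stores the per-codepoint property values. */
enum { BLOCK = 16 };

/* Property: 1 if the codepoint is an ASCII decimal digit, else 0. */
static const uint8_t stage1[8] = { 0, 0, 0, 1, 0, 0, 0, 0 }; /* blocks 0x00..0x7F */
static const uint8_t stage2[2][BLOCK] = {
    { 0 },                                   /* all-zero block, shared 7 times */
    { 1,1,1,1,1,1,1,1,1,1, 0,0,0,0,0,0 },   /* block 0x30..0x3F: '0'-'9'      */
};

/* Two dependent loads, no branches over the data: O(1) per lookup. */
static int is_digit_prop(uint32_t cp) {
    if (cp >= 0x80) return 0;                /* outside the toy table */
    return stage2[stage1[cp / BLOCK]][cp % BLOCK];
}
```

A data compiler's job would be exactly the part elided here: choosing the block size, deduplicating blocks, and deciding when this beats ranges or a perfect hash for the given key distribution.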


  👤 techdragon Accepted Answer ✓
I’d be looking at the tools used by security researchers to analyse binary files. They seem the most appropriate if you really want a “general purpose” tool for efficiently extracting chunks of structured data from arbitrary structured data files.

Tools like the ones here https://awesomeopensource.com/projects/binary-analysis such as radare2, or bap.

But since these are normally used for dynamic, research-oriented workflows, you may find the performance inadequate, and they are more focused on extracting security-relevant data. Still, the underlying libraries, the toolchains, and generally the way these tools are built for cracking open arbitrarily packed and compressed executable files will likely be a useful learning resource, even if none of it is directly usable for your specific problem.


👤 rurban
Since nobody had an idea, I started creating one myself.

https://github.com/rurban/optdata/blob/master/optdata


👤 icsa
Have you tried the cmph (C Minimal Perfect Hashing) library?