One approach is to convert the structure to JSON (or some binary format), serialize it, and save it to a file. But if you are dealing with terabytes of data, or just a large amount of data, your program then needs to load the entire thing back into RAM first.
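Just to make the limitation concrete, here is a minimal sketch of that first approach (the file name and the toy structure are made up):

    import json

    # Toy stand-in for the real structure (a nested dict acting as a tree).
    tree = {"feature": 0, "threshold": 0.5,
            "left": {"leaf": True, "value": 1},
            "right": {"leaf": True, "value": 0}}

    # Serialize the whole thing to a single JSON file.
    with open("tree.json", "w") as f:
        json.dump(tree, f)

    # To use it again you must parse the entire file back into memory,
    # which is exactly what breaks down at TB scale.
    with open("tree.json") as f:
        tree = json.load(f)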
So a second approach, I think, could be something like regular databases: on-disk. Are there on-disk data structures? Can you make any data structure exist on disk?
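As one small illustration of the on-disk idea, Python's standard-library shelve module gives you a dict-like object whose entries live in a file and are fetched per key rather than all at once (a sketch of the concept, not a recommendation for TB-scale data):

    import shelve

    # Build a persistent, dict-like structure on disk.
    with shelve.open("nodes.db") as db:
        for i in range(1000):
            db[str(i)] = {"feature": i % 10, "threshold": i / 1000.0}

    # Later (possibly in another process), look up individual entries
    # without loading the whole file into memory.
    with shelve.open("nodes.db") as db:
        node = db["42"]
        print(node)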
Suppose I have a huge decision tree, terabytes of data. How can I use it without loading the entire thing into memory?
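One standard trick (this is a sketch of the general technique, with a made-up record layout) is to store the nodes as fixed-size records in a file and seek to each child, so a single prediction only reads the handful of nodes on the root-to-leaf path:

    import struct

    # Hypothetical fixed-size node record: feature index, threshold,
    # left-child file offset, right-child file offset.
    # left == right == -1 marks a leaf whose "threshold" field holds the prediction.
    NODE = struct.Struct("<i d q q")

    def predict(path, x):
        """Follow one root-to-leaf path, reading only the nodes it touches."""
        with open(path, "rb") as f:
            offset = 0                      # root record sits at the start of the file
            while True:
                f.seek(offset)
                feature, threshold, left, right = NODE.unpack(f.read(NODE.size))
                if left == -1:
                    return threshold        # leaf: threshold field reused as the value
                offset = left if x[feature] <= threshold else right

    # Tiny example file: a root splitting on feature 0 at 0.5, with two leaves.
    size = NODE.size
    with open("tree.bin", "wb") as f:
        f.write(NODE.pack(0, 0.5, size, 2 * size))   # root
        f.write(NODE.pack(0, 1.0, -1, -1))           # left leaf  -> 1.0
        f.write(NODE.pack(0, 0.0, -1, -1))           # right leaf -> 0.0

    print(predict("tree.bin", [0.3]))   # 1.0
    print(predict("tree.bin", [0.9]))   # 0.0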
Terabytes of data for a decision tree? I've heard of deep learning models with billions of parameters, but I find it surprising that you would need anywhere near that much storage for a decision tree.
EDIT: Moreover, if you do need that much storage, you are almost certainly going to need multiple machines to process that data reasonably quickly, so you would need a way to partition the dataset across nodes rather than storing everything in a single file. The company probably best known for developing that kind of architecture is Google, and I think there is a fair amount of information available online about their architecture. That might be a good place to get started.
EDIT2: And how would you create such a large decision tree in the first place? Are you sure that's the best type of model for your problem? It seems to me it would likely massively overfit just about any dataset you trained it on.
Now that I think of it, I did use ObjectStore[0] at one time. It did what it claimed to do, but performance wasn't that great.
Nothing wrong with having a multi-GB HashMap or something that you have to load into RAM at startup. It just depends on what you're doing. It's certainly cheaper than recomputing that HashMap.
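For what it's worth, a minimal sketch of that pattern in Python (cache the expensive result once, then just load it at startup; the file name and build function are made up):

    import os
    import pickle

    CACHE = "lookup.pkl"

    def expensive_build():
        # Stand-in for whatever computation produced the map in the first place.
        return {i: i * i for i in range(1_000_000)}

    # At startup: load the precomputed map if it exists, otherwise build and save it.
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            lookup = pickle.load(f)
    else:
        lookup = expensive_build()
        with open(CACHE, "wb") as f:
            pickle.dump(lookup, f)

    print(lookup[12345])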