People who write databases obviously have solutions to this problem - for example, DuckDB is based on a number of papers that have come out over the years on this topic.
If I wanted to design, e.g., a tree structure intended to be read/written from a disk, are there general principles or patterns that have been developed to take advantage of locality of reference, minimize random reads, or decrease the overhead of writes, that I could familiarize myself with?
What is the CLRS for disk?
Like, this is the kind of bullshit people used to do research on: https://tlb.org/docs/usenixw95.pdf (I'm an author). This paper (and 100s of others) exist only because sequential reads & writes on disks were much faster than random access, because you had to (a) move the head, and (b) wait for the sector you're interested in to come around. At 7200 RPM, that's up to 8.3 milliseconds (4.2 on average), so it was worth burning 10000s of CPU instructions to try to sort read requests into some order that might avoid waiting for another rotation. Many popular disk controllers couldn't read sequential sectors, so it was better to sort write requests so that they hit every second sector or something. Madness.
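The request-sorting trick described above is basically the classic "elevator" (SCAN) scheduler: sweep the head in one direction serving everything in its path, then sweep back. A toy sketch, with made-up sector numbers:

```python
# Toy sketch of elevator (SCAN) request ordering: serve all pending
# sectors ahead of the head in ascending order, then sweep back down
# through the rest. Sector numbers here are invented for illustration.

def elevator_order(requests, head):
    """Return requests in the order a single head sweep would serve them."""
    ahead = sorted(r for r in requests if r >= head)
    behind = sorted((r for r in requests if r < head), reverse=True)
    return ahead + behind

print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# → [65, 67, 98, 122, 124, 183, 37, 14]
```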
Anyway, today there are only 3 kinds of secondary storage worth caring about: flash (sector granularity, but no seeks), cloud (like S3), and archive (like Glacier).
But if you're working with some old-timey environment and really need to know about optimizing for spinning disks, most of the ideas were published in the late 80s and 90s. This book https://www.amazon.co.uk/Design-Implementation-Operating-Add... (McKusick et al) is excellent, and describes a filesystem that works well on a wide variety of workloads. Or see the references in the Blackwell paper above.
If you really want that then I think a good keyword to search for is "external memory algorithms" (or "secondary storage"), e.g. this 2001 survey:
External Memory Algorithms and Data Structures: Dealing with Massive Data
https://users.cs.duke.edu/~reif/courses/alglectures/vitter.p...
I think there is at least one other survey out there which would have references. Of course, not all these algorithms have actually been tried in production :) which is what I tend to look for.
And this is pre-cloud, which changes things a lot for "massive data"
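The canonical starter example from that literature is external merge sort: sort memory-sized runs, then do a k-way merge, so every pass over "disk" is sequential. A minimal sketch, where plain Python lists stand in for on-disk run files:

```python
# Minimal external merge sort sketch, the textbook "external memory"
# algorithm from surveys like Vitter's. Phase 1 sorts memory-sized runs;
# phase 2 merges them with a heap, reading each run sequentially.
# In-memory lists stand in for run files on disk.
import heapq

def external_sort(items, memory_limit):
    # Phase 1: produce sorted runs no larger than available memory.
    runs = [sorted(items[i:i + memory_limit])
            for i in range(0, len(items), memory_limit)]
    # Phase 2: k-way merge of the runs (one sequential pass per run).
    return list(heapq.merge(*runs))

print(external_sort([9, 1, 7, 3, 8, 2, 6, 4, 5], memory_limit=3))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```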
Once you read that, I'd suggest reading the source of a simple embedded key-value database. I wouldn't bother with RDBMSs, as they are complex beasts and contain way more than you need. BoltDB is a good project to read the source of: https://github.com/boltdb/bolt. The whole thing is <10k lines of code and is a full-blown production-grade system with ACID semantics, so it packs a lot into those 10k and isn't merely a toy.
- http://blog.adnansiddiqi.me/gocache-lru-cache-implementation...
- http://blog.adnansiddiqi.me/fehrist-document-indexing-librar...
If you want to dive deeper on the subject of persistence, the book Operating Systems: Three Easy Pieces has a section devoted to it.
Google the two CMU DB courses, advanced and intro; they have the required material and references to understand. It's a case where practice > theory.
For example, to reduce random reads in a B+tree (the data structure used for indexes), you leave room in the node for the index data to grow, so your DBMS doesn't need to allocate a new page immediately (that new page would be a random, non-sequential access on a later read). Google "index fragmentation" to find out more.
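The "leave room to grow" idea can be sketched like this, assuming an invented page size and a 75% bulk-load fill factor (real DBMSs call this the fill factor or PCTFREE):

```python
# Hypothetical sketch of leaving slack in B+tree leaf pages: bulk-load
# pages only partially full, so later inserts land in the existing page
# instead of immediately forcing a split (and a new, randomly-placed page).
import bisect

PAGE_CAPACITY = 8        # max keys per page (made-up number)
FILL_FACTOR = 0.75       # bulk-load pages only 75% full

def bulk_load(sorted_keys):
    per_page = int(PAGE_CAPACITY * FILL_FACTOR)   # leave 25% slack
    return [sorted_keys[i:i + per_page]
            for i in range(0, len(sorted_keys), per_page)]

def insert(pages, key):
    # Find the last page whose first key <= key, then insert there.
    idx = max((i for i, p in enumerate(pages) if p[0] <= key), default=0)
    bisect.insort(pages[idx], key)
    if len(pages[idx]) > PAGE_CAPACITY:           # slack exhausted: split
        page = pages[idx]
        mid = len(page) // 2
        pages[idx:idx + 1] = [page[:mid], page[mid:]]

pages = bulk_load(list(range(0, 60, 5)))  # 12 keys -> 2 pages of 6
insert(pages, 7)                          # fits in page 0's slack: no split
```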
Btw, because nobody has said it explicitly yet: you should start at BTrees [1]. As you guessed, the database folks were the primary "we have to deal with spinning disk and it really matters to be fast" people.
There's also the Log-Structured < Noun > universe of things which 'tlb is pointing you at. Roughly, you turn random writes into appends (to a "log"), often followed by periodic compaction to make reads reasonable (since in-place updates to a "file" are now appended at the end of the log, you've suddenly made reads worse).
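The append-then-compact cycle can be shown in a few lines. A rough sketch, with an in-memory list standing in for the on-disk log:

```python
# Rough sketch of the log-structured idea: every write is an append
# (sequential I/O), reads scan backwards for the newest version, and
# periodic compaction rewrites the log keeping only live entries.

log = []  # list of (key, value) appends, standing in for an on-disk log

def put(key, value):
    log.append((key, value))      # random write turned into an append

def get(key):
    for k, v in reversed(log):    # newest entry wins; this backwards scan
        if k == key:              # is why reads get worse until compaction
            return v
    return None

def compact():
    latest = dict(log)            # last write per key wins
    log[:] = list(latest.items())

put("a", 1); put("b", 2); put("a", 3)
assert get("a") == 3
compact()                          # log shrinks from 3 entries to 2
```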
Edit: add link:
https://www.manning.com/books/algorithms-and-data-structures...
https://www.amazon.com/File-Structures-2nd-Michael-Folk/dp/0...
It was very good, describing B-trees and their variants.
I’d recommend reading up on WAFL filesystem design. ARIES. WAL. The real race horses of this space today are on HPC file systems. Check out the Top 500 file systems. The new Chinese file systems have some impressive metrics, beating the Intel ones significantly. I believe the Chinese ones are taking advantage of RocksDB (LSM).
At a higher level, the outer diameter of an HDD is significantly faster than the inner diameter for sequential reads/writes. The LBA addressing starts at the outer diameter. Typically, systems will maintain access metrics on their data, and relocate the hotter data to the outer diameter and colder data to the inner.
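That placement heuristic reduces to: sort extents by heat and hand the hottest ones the lowest LBAs. A sketch with invented extent names and access counts:

```python
# Sketch of hot/cold placement: order data extents by access count so the
# hottest get the lowest LBAs (the faster outer diameter). Extent names
# and counts below are made up for illustration.

def place_by_heat(access_counts):
    """Return extent names hottest-first, i.e. outer diameter first."""
    return [name for name, _ in
            sorted(access_counts.items(), key=lambda kv: -kv[1])]

layout = place_by_heat({"logs": 900, "archive": 3, "index": 450})
print(layout)  # → ['logs', 'index', 'archive']
```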
While accessing the drive and optimizing for reads, an adaptive prefetch algorithm will maintain sequential disk access patterns without wasting the head's time dwelling over data that will be discarded.
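One common form of adaptive prefetch is a readahead window that grows while accesses stay sequential and collapses on a random access. A toy sketch, with arbitrary window sizes:

```python
# Toy adaptive prefetch: double the readahead window while the access
# pattern stays sequential, reset it on a random access. The window
# bounds are arbitrary; real controllers tune these per workload.

class Prefetcher:
    MIN_WINDOW, MAX_WINDOW = 4, 64

    def __init__(self):
        self.window = self.MIN_WINDOW
        self.next_expected = None

    def access(self, block):
        if block == self.next_expected:
            self.window = min(self.window * 2, self.MAX_WINDOW)  # sequential: grow
        else:
            self.window = self.MIN_WINDOW                        # random: reset
        self.next_expected = block + 1
        return self.window   # how many blocks to read ahead after this one

p = Prefetcher()
print([p.access(b) for b in (10, 11, 12, 13)])  # → [4, 8, 16, 32]
```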
If you have a battery-backed write-back cache, you now have the luxury of optimizing writes to the disk. To maintain optimal disk write performance, you'll want to keep your writes in an ordered manner, and present them to the disk with a high queue depth. Ideally you will take advantage of HDD queues, and send latency-sensitive reads to a higher-priority queue, or head of queue. Additionally, with write-back cache, you have the opportunity to enable write cache on the HDD. Each HDD vendor/model has a slightly different write-cache handling mechanism, so I'd recommend testing.
I’ve been impressed with PingCap’s use of some of the new algorithms. Check out their YouTube architecture overview talks which provide details on exactly which papers/algorithms they’re using.
If your goal is to use Shingled Magnetic Recording drives, these optimizations become even more complex, with best practices not yet fully defined.
Sure, on higher-performance systems you will be dealing with bigger demons such as cache and TLB performance, depending on data size. But many embedded systems are performance-limited for cost and power reasons. There is nothing here the cloud will solve, nor anything other than more expensive NAND flash, which requires more power and money. Hence, designing algorithms for critical data throughput is not as simple as using the cloud or a filesystem.