Examples of some questions I'd like to be able to answer, or at least make reasonable decisions about (note: I don't actually want answers to these now; they're just examples of the sort of thing I'd like to read about in depth and build up some background knowledge on):
* how to ensure data's been safely written (e.g. when to flush, fsync, what
guarantees that gives, using WAL)
* block sizes to read/write for different purposes, tradeoffs, etc.
* considerations for writing to different media/filesystems (e.g. disk, SSD, NFS)
* when to rely on OS disk cache vs. using own cache
* when to use/not use mmap
* performance considerations (e.g. multiple small files vs. few larger ones,
concurrent readers/writers, locking, etc.)
* OS specific considerations
I recall reading some posts related to this (in the context of Redis/SQLite/Postgres), which made me realise that it's a fairly complex topic, but not one I've found a good entry point for. Any pointers to books, documentation, etc. on the above would be much appreciated.
* Bigger blocks = better performance. The bigger you can make them, the faster you'll go. Your limiting factor is usually the resolution the user needs (i.e. aggregation will inevitably result in under-utilized space).
* Disk, SSD and NFS don't all belong to the same category. Most modern storage products are developed with the expectation that the media is SSD. Virtually nobody wants to enter the HDD market: the performance gap is just too big, and the existing products that still use HDDs rely on fast caching in something like flash memory anyway. NFS is a hopelessly backwards and outdated technology. It's the least common denominator, and that's why various storage products do support it, but if you want to go fast, forget about it. The tradeoff here is usually between writing your own client (usually a kernel module) to do I/O efficiently, or sparing users the need to install a custom kernel module (often a security-audit issue) and letting them go slow...
* "OS disk cache" is somewhat of a misnomer, and there are two things that tend to get confused here. The OS doesn't cache data written to disk -- the disk does, in its own write cache; the OS just provides the mechanism to talk to the disk and instruct it to flush that cache. Separately there's the filesystem (page) cache -- that's what the OS does: it keeps the contents of recently accessed files in the memory it manages.
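A minimal Linux-flavoured sketch of that distinction (the file name is made up, and the behaviour described in the comments assumes a common setup such as ext4 with write barriers enabled):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* hypothetical file */
    if (fd == -1) { perror("open"); return 1; }

    const char buf[] = "record\n";

    /* 1. write() only copies into the kernel's page cache; nothing is durable yet */
    if (write(fd, buf, sizeof buf - 1) == -1) { perror("write"); return 1; }

    /* 2. sync_file_range() kicks off writeback to the device, but does NOT
       ask the drive to flush its own volatile write cache -- no durability */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

    /* 3. fdatasync()/fsync() wait for writeback and, on common filesystems,
       also issue a cache-flush command to the drive -- this is the point at
       which the data can reasonably be considered on stable storage */
    if (fdatasync(fd) == -1) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```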
* I/O through mmap is a gimmick. Just one of the ways to abuse a system API to do something it's not really intended to do. You can safely ignore it. If you are looking into making I/O more efficient, look into io_uring.
I highly recommend Gregg's Systems Performance (2nd edition came out in 2020). While the book is focused on performance rather than development, Gregg does a great job explaining a huge number of concepts without going too deep, specifically related to memory, fs, and block I/O.
Unfortunately, in terms of many of the things you care about, books tend to be outdated. Kerrisk's Linux Programming Interface is over 10 years old, and covers only ext2. Robert Love's great books on the kernel are hugely useful (though less intended for application developers) but also slightly outdated.
As far as books are concerned:
- Database Internals
- Designing Data Intensive Applications
- Disk-Based Algorithms for Big Data
- Database Systems by Ullman et al. (http://infolab.stanford.edu/~ullman/pub/dscbtoc.txt). Part IV covers implementation details of a database system.
https://stackoverflow.com/questions/75697877/why-is-liburing...
If you use io_uring (e.g. via liburing) on Linux, it forces you to split your I/O in two halves: submit, then wait for completion, but you can still do other things in between. You can submit multiple writes or reads in parallel and handle them when they're ready.
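A bare-bones sketch of that split using liburing (the file name is made up; link with -luring):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

    int fd = open("data.bin", O_RDONLY);               /* hypothetical input file */
    if (fd == -1) { perror("open"); return 1; }

    static char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);   /* read 4 KiB from offset 0 */

    io_uring_submit(&ring);   /* first half: hand the request to the kernel */

    /* ...the program is free to do unrelated work while the read is in flight... */

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);              /* second half: harvest it */
    if (ret < 0) { fprintf(stderr, "wait_cqe: %s\n", strerror(-ret)); return 1; }
    printf("read completed, result = %d\n", cqe->res); /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```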
This white paper talks about writing to disk in S3 in a scheduled order so that concurrent requests aren't corrupted in the event of a crash.
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3"
Apache Kafka supposedly doesn't need fsync due to the recovery protocol. So you might want to investigate why that is the case and whether or not you can create the same behaviour.
A good start on device drivers (old 2.6.10 kernel but still good): https://lwn.net/Kernel/LDD3/
Operating Systems: Three Easy Pieces - Files and directories: https://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf (Another great book from an OS perspective with some userspace interactions)
A thorough Linux internal engineering book is: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul... (The bibliography has tons of links on topics you might be interested in, Chapter 7 on locks is great)
I recommend implementing a basic key/value single table "database" in C/C++ and then add threading/multi-process interfaces so you can mentally figure out all the pros/cons. It's not technically "hard" and you'll learn a lot.
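If it helps to see how small the starting point can be, here is a hedged sketch of roughly that exercise: an append-only key/value log with a linear-scan get, no index and no concurrency yet (those are exactly the parts you'd add next). File name and keys are made up.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DB_FILE "kv.log"   /* hypothetical database file */

/* set: append "key\tvalue\n" and fsync so the record survives a crash */
static int kv_set(const char *key, const char *value) {
    FILE *f = fopen(DB_FILE, "a");
    if (!f) return -1;
    fprintf(f, "%s\t%s\n", key, value);
    fflush(f);               /* flush stdio's user-space buffer... */
    fsync(fileno(f));        /* ...then push the page cache out to the device */
    fclose(f);
    return 0;
}

/* get: scan the whole log; the LAST record for a key wins */
static int kv_get(const char *key, char *out, size_t outlen) {
    FILE *f = fopen(DB_FILE, "r");
    if (!f) return -1;
    char line[512];
    int found = -1;
    while (fgets(line, sizeof line, f)) {
        char *tab = strchr(line, '\t');
        if (!tab) continue;
        *tab = '\0';
        if (strcmp(line, key) == 0) {
            strncpy(out, tab + 1, outlen - 1);
            out[outlen - 1] = '\0';
            out[strcspn(out, "\n")] = '\0';
            found = 0;
        }
    }
    fclose(f);
    return found;
}

int main(void) {
    kv_set("answer", "42");
    char val[256];
    if (kv_get("answer", val, sizeof val) == 0)
        printf("answer = %s\n", val);
    return 0;
}
```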
- "Practical Filesystem Design": http://www.nobius.org/dbg/practical-file-system-design.pdf
- "Robert Love: Linux Kernel Development (chapters 13-14)" https://www.amazon.com/Linux-Kernel-Development-Robert-Love/...
- "The Linux Programming Interface (File I/O chapters)": https://www.amazon.com/Linux-Programming-Interface-System-Ha...
Books can't get you very far. All they're really good for is informing your own exploration.
https://www.youtube.com/watch?v=oeYBdghaIjc&list=PLSE8ODhjZX...
What I did to learn the lower-level APIs, and perform initial testing on the driver, was write a "mirror" drive. The user-mode code pointed to a folder on disk, the driver made a virtual disk drive, and all reads and writes in the virtual disk drive went to the mirror folder. All of our (cough) unit tests for virtual drive handling used the mirror drive. ("cough" because the tests fit into that happy area that truly is a unit test but drives people nuts about splitting hairs about the semantics between unit and integration tests.)
On Windows, you can implement something like that using Dokany, Dokan, or WinFsp. On Linux, there's the FUSE API. On Mac, there's MacFUSE.
Even if you don't do a "mirror" drive, understanding the callbacks that libraries like Dokany, Dokan, WinFsp, and FUSE expose helps you understand how I/O happens in the driver. Many I/O methods provided in popular languages are abstractions above what the OS does. (For example, the Windows kernel has no concept of the "Stream" in your C# program; the "Stream"'s Position property is purely a construct within the .NET framework.)
https://github.com/dokan-dev/dokany
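For a sense of what such a "mirror" looks like on the FUSE side, here's a hedged, read-only sketch against the libfuse 3 high-level API. The backing folder path is made up, and write/create callbacks are omitted for brevity, so this is a starting point rather than a complete passthrough filesystem.

```c
/* mirror.c -- build: gcc mirror.c -o mirror $(pkg-config fuse3 --cflags --libs)
   mount:  ./mirror <mountpoint> */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *backing = "/tmp/mirror-backing";  /* hypothetical mirror folder */

/* translate a path inside the mount into a path inside the backing folder */
static void real_path(char *out, size_t n, const char *path) {
    snprintf(out, n, "%s%s", backing, path);
}

static int mirror_getattr(const char *path, struct stat *st,
                          struct fuse_file_info *fi) {
    (void)fi;
    char rp[4096];
    real_path(rp, sizeof rp, path);
    return lstat(rp, st) == -1 ? -errno : 0;
}

static int mirror_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                          off_t off, struct fuse_file_info *fi,
                          enum fuse_readdir_flags flags) {
    (void)off; (void)fi; (void)flags;
    char rp[4096];
    real_path(rp, sizeof rp, path);
    DIR *d = opendir(rp);
    if (!d) return -errno;
    struct dirent *de;
    while ((de = readdir(d)) != NULL)
        fill(buf, de->d_name, NULL, 0, 0);
    closedir(d);
    return 0;
}

static int mirror_open(const char *path, struct fuse_file_info *fi) {
    char rp[4096];
    real_path(rp, sizeof rp, path);
    int fd = open(rp, fi->flags);
    if (fd == -1) return -errno;
    fi->fh = fd;                      /* keep the real fd around for reads */
    return 0;
}

static int mirror_read(const char *path, char *buf, size_t size, off_t off,
                       struct fuse_file_info *fi) {
    (void)path;
    ssize_t n = pread(fi->fh, buf, size, off);
    return n == -1 ? -errno : (int)n;
}

static const struct fuse_operations mirror_ops = {
    .getattr = mirror_getattr,
    .readdir = mirror_readdir,
    .open    = mirror_open,
    .read    = mirror_read,
};

int main(int argc, char *argv[]) {
    /* every stat/open/read on the mount point is redirected into `backing` */
    return fuse_main(argc, argv, &mirror_ops, NULL);
}
```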
Another place to start is the OS's documentation itself. For example, you can start with Windows' CreateFileA function. This is typically what gets called "under the hood" in most programming languages when you open or create a file: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/...
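To make that concrete, a hedged sketch of calling those Win32 APIs directly (the file name is made up; FlushFileBuffers is roughly the Windows counterpart of fsync):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* open (or create) the file directly via the Win32 API */
    HANDLE h = CreateFileA("example.txt",         /* hypothetical file name */
                           GENERIC_WRITE,
                           0,                     /* no sharing */
                           NULL,
                           CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL,
                           NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
        return 1;
    }

    const char msg[] = "hello\n";
    DWORD written = 0;
    if (!WriteFile(h, msg, sizeof msg - 1, &written, NULL)) {
        fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());
        CloseHandle(h);
        return 1;
    }

    /* ask the OS to flush buffered data for this handle out to the device */
    FlushFileBuffers(h);
    CloseHandle(h);
    return 0;
}
```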
Too many times have I seen some data scientist trying to parse and write 6 Petabytes of data with multiple cores, while the disk is thrashing about.
Spinning disks are still the backbone of most data science operations because they deal with >>4 TB datasets, which can't be stored on SSDs without breaking some serious bank.
So yes, how to use producer/consumer multiprocessing queues correctly should be taught to everyone who does computing, as a standard template (a minimal sketch follows below).
Disk thrash is a threat.
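A sketch of that pattern using POSIX threads rather than multiprocessing (the shape is the same: one producer streams work sequentially, consumers do the CPU-bound part, and the disk never sees a crowd of seekers; all names and sizes are made up):

```c
#include <pthread.h>
#include <stdio.h>

#define QSIZE 8

/* bounded queue guarded by a mutex and two condition variables */
typedef struct {
    int items[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static queue_t q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

static void q_push(int v) {
    pthread_mutex_lock(&q.lock);
    while (q.count == QSIZE)                  /* block when the queue is full */
        pthread_cond_wait(&q.not_full, &q.lock);
    q.items[q.tail] = v;
    q.tail = (q.tail + 1) % QSIZE;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
}

static int q_pop(void) {
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)                      /* block when the queue is empty */
        pthread_cond_wait(&q.not_empty, &q.lock);
    int v = q.items[q.head];
    q.head = (q.head + 1) % QSIZE;
    q.count--;
    pthread_cond_signal(&q.not_full);
    pthread_mutex_unlock(&q.lock);
    return v;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++)
        q_push(i);            /* in real code: read the next chunk from disk, in order */
    q_push(-1);               /* sentinel: no more work */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (;;) {
        int v = q_pop();
        if (v < 0) break;     /* sentinel reached */
        /* in real code: parse/transform the chunk (the CPU-bound part) */
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    puts("done");
    return 0;
}
```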
https://blog.koehntopp.info/2023/05/05/50-years-in-filesyste...
https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-in...
An SSD is just a completely different beast from a spinning disk. Spinning disks are much slower and really want to read things linearly.
For most of your other questions - block sizes, caching, mmapping, file sizes, concurrency - the answers will be completely different on SSDs vs. spinning disks.
https://databasearchitects.blogspot.com/2021/06/what-every-p...
https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap...