Examples of some questions I'd like to be able to answer, or at least make reasonable decisions about (note: I don't actually want answers to these now; they're just examples of the sort of thing I'd like to read about in depth and build up some background knowledge on):
* how to ensure data's been safely written (e.g. when to flush, fsync, what
guarantees that gives, using WAL)
* block sizes to read/write for different purposes, tradeoffs, etc.
* considerations for writing to different media/filesystems (e.g. disk, SSD, NFS)
* when to rely on OS disk cache vs. using own cache
* when to use/not use mmap
* performance considerations (e.g. multiple small files vs. few larger ones,
concurrent readers/writers, locking, etc.)
* OS specific considerations
I recall reading some posts related to this (in the context of Redis/SQLite/Postgres), which made me realise that it's a fairly complex topic, but not one I've found a good entry point for. Any pointers to books, documentation, etc. on the above would be much appreciated.
* Bigger blocks = better performance. The bigger you can make them, the faster you'll go. Your limiting factor is usually the resolution the user needs (i.e. aggregation will inevitably result in under-utilized space).
* Disk, SSD and NFS don't all belong to the same category. Most modern storage products are developed with the expectation that the media is SSD. Virtually nobody wants to enter the HDD market: the performance gap is just too big, and the existing products that still use HDDs rely on fast caching in something like flash memory anyway. NFS is a hopelessly backwards and outdated technology. It's the least common denominator, and that's why various storage products do support it, but if you want to go fast, forget about it. The tradeoff here is usually between writing your own client (usually a kernel module) to do I/O efficiently, or sparing users the need to install a custom kernel module (often a security-audit issue) and letting them go slow...
* "OS disk cache" is somewhat of a misnomer, and there are two things that tend to get confused here. The OS doesn't cache data written to disk -- the disk does, in its own write cache; the OS just provides the mechanism to talk to the disk and instruct it to flush that cache. Separately there's the filesystem (page) cache -- that's what the OS does: it keeps the contents of recently accessed files in the memory it manages.
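A minimal Linux-flavoured sketch of that distinction (the file name is made up, and the behaviour described in the comments assumes a common setup such as ext4 with write barriers enabled):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* hypothetical file */
    if (fd == -1) { perror("open"); return 1; }

    const char buf[] = "record\n";

    /* 1. write() only copies into the kernel's page cache; nothing is durable yet */
    if (write(fd, buf, sizeof buf - 1) == -1) { perror("write"); return 1; }

    /* 2. sync_file_range() kicks off writeback to the device, but does NOT
       ask the drive to flush its own volatile write cache -- no durability */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

    /* 3. fdatasync()/fsync() wait for writeback and, on common filesystems,
       also issue a cache-flush command to the drive -- this is the point at
       which the data can reasonably be considered on stable storage */
    if (fdatasync(fd) == -1) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```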
* I/O through mmap is a gimmick. Just one of the ways to abuse a system API to do something it's not really intended to do. You can safely ignore it. If you are looking into making I/O more efficient, look into io_uring.
I highly recommend Gregg's Systems Performance (2nd edition came out in 2020). While the book is focused on performance rather than development, Gregg does a great job explaining a huge number of concepts without going too deep, specifically related to memory, fs, and block I/O.
Unfortunately, in terms of many of the things you care about, books tend to be outdated. Kerrisk's Linux Programming Interface is over 10 years old, and covers only ext2. Robert Love's great books on the kernel are hugely useful (though less intended for application developers) but also slightly outdated.
As far as books are concerned:
- Database Internals
- Designing Data Intensive Applications
- Disk-Based Algorithms for Big Data
- Database Systems by Ullman et al. (http://infolab.stanford.edu/~ullman/pub/dscbtoc.txt). Part IV covers implementation details of a database system.
https://stackoverflow.com/questions/75697877/why-is-liburing...
If you use io_uring (e.g. via liburing) on Linux, it forces you to split your I/O in two halves: submit, then wait for completion, but you can still do other things in between. You can submit multiple writes or reads in parallel and handle them when they're ready.
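A bare-bones sketch of that split using liburing (the file name is made up; link with -luring):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

    int fd = open("data.bin", O_RDONLY);               /* hypothetical input file */
    if (fd == -1) { perror("open"); return 1; }

    static char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);   /* read 4 KiB from offset 0 */

    io_uring_submit(&ring);   /* first half: hand the request to the kernel */

    /* ...the program is free to do unrelated work while the read is in flight... */

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);              /* second half: harvest it */
    if (ret < 0) { fprintf(stderr, "wait_cqe: %s\n", strerror(-ret)); return 1; }
    printf("read completed, result = %d\n", cqe->res); /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```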
This white paper talks about writing to disk in S3 in a scheduled order so that concurrent requests aren't corrupted in the event of a crash.
"Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3"
Apache Kafka supposedly doesn't need fsync due to the recovery protocol. So you might want to investigate why that is the case and whether or not you can create the same behaviour.
A good start on device drivers (old 2.6.10 kernel but still good): https://lwn.net/Kernel/LDD3/
Operating Systems: Three Easy Pieces - Files and directories: https://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf (Another great book from an OS perspective with some userspace interactions)
A thorough Linux internal engineering book is: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul... (The bibliography has tons of links on topics you might be interested in, Chapter 7 on locks is great)
I recommend implementing a basic key/value single table "database" in C/C++ and then add threading/multi-process interfaces so you can mentally figure out all the pros/cons. It's not technically "hard" and you'll learn a lot.
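If it helps to see how small the starting point can be, here is a hedged sketch of roughly that exercise: an append-only key/value log with a linear-scan get, no index and no concurrency yet (those are exactly the parts you'd add next). File name and keys are made up.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DB_FILE "kv.log"   /* hypothetical database file */

/* set: append "key\tvalue\n" and fsync so the record survives a crash */
static int kv_set(const char *key, const char *value) {
    FILE *f = fopen(DB_FILE, "a");
    if (!f) return -1;
    fprintf(f, "%s\t%s\n", key, value);
    fflush(f);               /* flush stdio's user-space buffer... */
    fsync(fileno(f));        /* ...then push the page cache out to the device */
    fclose(f);
    return 0;
}

/* get: scan the whole log; the LAST record for a key wins */
static int kv_get(const char *key, char *out, size_t outlen) {
    FILE *f = fopen(DB_FILE, "r");
    if (!f) return -1;
    char line[512];
    int found = -1;
    while (fgets(line, sizeof line, f)) {
        char *tab = strchr(line, '\t');
        if (!tab) continue;
        *tab = '\0';
        if (strcmp(line, key) == 0) {
            strncpy(out, tab + 1, outlen - 1);
            out[outlen - 1] = '\0';
            out[strcspn(out, "\n")] = '\0';
            found = 0;
        }
    }
    fclose(f);
    return found;
}

int main(void) {
    kv_set("answer", "42");
    char val[256];
    if (kv_get("answer", val, sizeof val) == 0)
        printf("answer = %s\n", val);
    return 0;
}
```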
- "Practical Filesystem Design": http://www.nobius.org/dbg/practical-file-system-design.pdf
- "Robert Love: Linux Kernel Development (chapters 13-14)" https://www.amazon.com/Linux-Kernel-Development-Robert-Love/...
- "The Linux Programming Interface (File I/O chapters)": https://www.amazon.com/Linux-Programming-Interface-System-Ha...
Books can't get you very far. All they're really good for is informing your own exploration.
https://www.youtube.com/watch?v=oeYBdghaIjc&list=PLSE8ODhjZX...
What I did to learn the lower-level APIs, and perform initial testing on the driver, was write a "mirror" drive. The user-mode code pointed to a folder on disk, the driver made a virtual disk drive, and all reads and writes in the virtual disk drive went to the mirror folder. All of our (cough) unit tests for virtual drive handling used the mirror drive. ("cough" because the tests fit into that happy area that truly is a unit test but drives people nuts about splitting hairs about the semantics between unit and integration tests.)
On Windows, you can implement something like that using Dokany, Dokan, or WinFsp. On Linux, there's the FUSE API. On Mac, there's MacFUSE.
Even if you don't do a "mirror" drive, understanding the callbacks that libraries like Dokany, Dokan, WinFsp, and FUSE expose helps you understand how I/O happens in the driver. Many I/O methods provided in popular languages are abstractions above what the OS does. (For example, the Windows kernel has no concept of the "Stream" in your C# program; the "Stream"'s Position property is purely a construct within the .NET framework.)
https://github.com/dokan-dev/dokany
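For a sense of what such a "mirror" looks like on the FUSE side, here's a hedged, read-only sketch against the libfuse 3 high-level API. The backing folder path is made up, and write/create callbacks are omitted for brevity, so this is a starting point rather than a complete passthrough filesystem.

```c
/* mirror.c -- build: gcc mirror.c -o mirror $(pkg-config fuse3 --cflags --libs)
   mount:  ./mirror <mountpoint> */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *backing = "/tmp/mirror-backing";  /* hypothetical mirror folder */

/* translate a path inside the mount into a path inside the backing folder */
static void real_path(char *out, size_t n, const char *path) {
    snprintf(out, n, "%s%s", backing, path);
}

static int mirror_getattr(const char *path, struct stat *st,
                          struct fuse_file_info *fi) {
    (void)fi;
    char rp[4096];
    real_path(rp, sizeof rp, path);
    return lstat(rp, st) == -1 ? -errno : 0;
}

static int mirror_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                          off_t off, struct fuse_file_info *fi,
                          enum fuse_readdir_flags flags) {
    (void)off; (void)fi; (void)flags;
    char rp[4096];
    real_path(rp, sizeof rp, path);
    DIR *d = opendir(rp);
    if (!d) return -errno;
    struct dirent *de;
    while ((de = readdir(d)) != NULL)
        fill(buf, de->d_name, NULL, 0, 0);
    closedir(d);
    return 0;
}

static int mirror_open(const char *path, struct fuse_file_info *fi) {
    char rp[4096];
    real_path(rp, sizeof rp, path);
    int fd = open(rp, fi->flags);
    if (fd == -1) return -errno;
    fi->fh = fd;                      /* keep the real fd around for reads */
    return 0;
}

static int mirror_read(const char *path, char *buf, size_t size, off_t off,
                       struct fuse_file_info *fi) {
    (void)path;
    ssize_t n = pread(fi->fh, buf, size, off);
    return n == -1 ? -errno : (int)n;
}

static const struct fuse_operations mirror_ops = {
    .getattr = mirror_getattr,
    .readdir = mirror_readdir,
    .open    = mirror_open,
    .read    = mirror_read,
};

int main(int argc, char *argv[]) {
    /* every stat/open/read on the mount point is redirected into `backing` */
    return fuse_main(argc, argv, &mirror_ops, NULL);
}
```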
Another place to start is the OS's documentation itself. For example, you can start with Windows' CreateFileA function. This is typically what gets called "under the hood" in most programming languages when you open or create a file: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/...
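To make that concrete, a hedged sketch of calling those Win32 APIs directly (the file name is made up; FlushFileBuffers is roughly the Windows counterpart of fsync):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* open (or create) the file directly via the Win32 API */
    HANDLE h = CreateFileA("example.txt",         /* hypothetical file name */
                           GENERIC_WRITE,
                           0,                     /* no sharing */
                           NULL,
                           CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL,
                           NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
        return 1;
    }

    const char msg[] = "hello\n";
    DWORD written = 0;
    if (!WriteFile(h, msg, sizeof msg - 1, &written, NULL)) {
        fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());
        CloseHandle(h);
        return 1;
    }

    /* ask the OS to flush buffered data for this handle out to the device */
    FlushFileBuffers(h);
    CloseHandle(h);
    return 0;
}
```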
Too many times have I seen some data scientist trying to parse and write 6 Petabytes of data with multiple cores, while the disk is thrashing about.
Spinning disks are still the backbone of most data science operations because they deal with >>4 TB datasets, which can't be stored on SSDs without breaking some serious bank.
So yes, how to use producer/consumer multiprocessing queues correctly should be taught to everyone who does computing, as a standard template (a minimal sketch follows below).
Disk thrash is a threat.
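A sketch of that pattern using POSIX threads rather than multiprocessing (the shape is the same: one producer streams work sequentially, consumers do the CPU-bound part, and the disk never sees a crowd of seekers; all names and sizes are made up):

```c
#include <pthread.h>
#include <stdio.h>

#define QSIZE 8

/* bounded queue guarded by a mutex and two condition variables */
typedef struct {
    int items[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static queue_t q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

static void q_push(int v) {
    pthread_mutex_lock(&q.lock);
    while (q.count == QSIZE)                  /* block when the queue is full */
        pthread_cond_wait(&q.not_full, &q.lock);
    q.items[q.tail] = v;
    q.tail = (q.tail + 1) % QSIZE;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.lock);
}

static int q_pop(void) {
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)                      /* block when the queue is empty */
        pthread_cond_wait(&q.not_empty, &q.lock);
    int v = q.items[q.head];
    q.head = (q.head + 1) % QSIZE;
    q.count--;
    pthread_cond_signal(&q.not_full);
    pthread_mutex_unlock(&q.lock);
    return v;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++)
        q_push(i);            /* in real code: read the next chunk from disk, in order */
    q_push(-1);               /* sentinel: no more work */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (;;) {
        int v = q_pop();
        if (v < 0) break;     /* sentinel reached */
        /* in real code: parse/transform the chunk (the CPU-bound part) */
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    puts("done");
    return 0;
}
```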
https://blog.koehntopp.info/2023/05/05/50-years-in-filesyste...
https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-in...
An SSD is just a completely different beast from a spinning disk. Spinning disks are much slower and really want to read things linearly.
For most of your other questions - block sizes, caching, mmapping, file sizes, concurrency - the answers will be completely different on SSDs vs. spinning disks.
https://databasearchitects.blogspot.com/2021/06/what-every-p...
https://itnext.io/modern-storage-is-plenty-fast-it-is-the-ap...