HACKER Q&A
📣 nphase

Why are there no open source NVMe-native key value stores in 2023?


Hi HN, NVMe disks, when addressed natively in userland, offer massive performance improvements over other forms of persistent storage. However, despite the existence of projects like SPDK and SplinterDB, there don't seem to be any open-source, non-embedded key-value stores or DBs out in the wild yet.

Why do you think that is? Are there possibly other projects out there that I'm not familiar with?


  👤 diggan
I don't remember exactly why I saved any of them, but these are some experimental data stores that seem to roughly fit what you're looking for:

- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"

- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."

- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"


👤 bestouff
Naive question: are there really expected gains from addressing an NVMe disk natively versus using a regular key-value database on a filesystem?

👤 delfinom
https://github.com/OpenMPDK/KVRocks

Given, however, that most of the world has shifted to VMs, I don't think this kind of KV storage is accessible to most users, because the disks are often split out to multiple tenants. So the overall demand for this would be low.


👤 zupa-hu
Is there any performance gain over writing append-only data to a file?

I mean, using a merkle tree or something like that to make sense of the underlying data.


👤 brightball
SolidCache and SolidQueue from Rails will be doing that when released.

Otherwise though…you have the file system. Is that not enough?


👤 telegpt
The Key to Value: Understanding the NVMe Key-Value Standard

https://cliprecaps.com/read/?v=uQFl5T7IKpI


👤 znpy
I once attended a presentation by a presales engineer from Aerospike, and IIRC they're doing some NVMe-in-userspace stuff.

👤 formerly_proven
There's actually an NVMe command set which allows you to use the FTL directly as a K/V store. (This is limited to 16-byte keys [1], however, so it is not that useful and probably not implemented anywhere.)

[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Stora... but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...
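To make the 16-byte limit concrete: a store built on that command set has to collapse arbitrary application keys into two 64-bit words. A minimal illustrative C sketch (the FNV-1a hashing and the seeds are assumptions for the sketch, not anything from the spec; a real store would also need to handle hash collisions, e.g. by keeping the full key alongside the value):

  #include <stdint.h>
  #include <stddef.h>

  /* Two 64-bit words: the maximum key the KV command set allows. */
  struct nvme_kv_key { uint64_t lo, hi; };

  static uint64_t fnv1a64(const void *data, size_t len, uint64_t seed) {
      const unsigned char *p = data;
      uint64_t h = seed;
      for (size_t i = 0; i < len; i++) {
          h ^= p[i];
          h *= 0x100000001b3ULL;   /* FNV-1a 64-bit prime */
      }
      return h;
  }

  /* Hash a long user key down to the 16 bytes the device accepts. */
  static struct nvme_kv_key make_key(const void *user_key, size_t len) {
      struct nvme_kv_key k = {
          .lo = fnv1a64(user_key, len, 0xcbf29ce484222325ULL), /* FNV offset basis */
          .hi = fnv1a64(user_key, len, 0x9e3779b97f4a7c15ULL), /* arbitrary 2nd seed */
      };
      return k;
  }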


👤 CubsFan1060
Interesting article here: https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...

Utilizing: https://memcached.org/blog/nvm-caching/, https://github.com/m...

TL;DR: Grafana Cloud needed tons of caching, and it was expensive, so they used extstore in memcached to hold most of it on NVMe disks. This massively reduced their costs.
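(For context, extstore is memcached's built-in flash tier, enabled with the ext_path option: hot items stay in RAM and colder values get spilled to a file on flash. A minimal sketch of an invocation, with placeholder path and sizes:)

  # ~4 GiB of RAM for hot items, colder values spilled to a 64 GiB file on NVMe
  memcached -m 4096 -o ext_path=/mnt/nvme0/extstore:64G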


👤 jiggawatts
Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way.

The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.

Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/las...


👤 javierhonduco
There’s Kvrocks. It uses the Redis protocol and it’s built on RocksDB https://github.com/apache/kvrocks

👤 ilyt
It becomes complex when you want to support multiple NVMe drives.

Even more complex when you want any kind of redundancy, as you'd essentially need to build some kind of RAID-like layer into your database.

Also, a few terabytes of NVMe in RAID10 + PostgreSQL (or something like it) covers about 99% of companies' needs for speed.

So you're left with the 1% that needs that kind of speed.


👤 gavinray
What do you mean by non-embedded?

You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:

https://github.com/OpenMPDK/xNVMe

https://github.com/OpenMPDK/KVSSD

https://github.com/OpenMPDK/KVRocks


👤 threeseed
Crail [1] is a distributed K/V store on top of NVMe-oF.

[1] https://craillabs.github.io


👤 otterley
Because you haven't written it yet!

👤 rubiquity
I think it's mostly because, while the internal parallelism of NVMe drives is fantastic, our logical use of them is still largely sequential.
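The drive only shows that parallelism when you keep many requests in flight. A minimal liburing sketch of the difference (device path, queue depth, and offsets are placeholder assumptions, and error handling is mostly elided):

  #define _GNU_SOURCE            /* for O_DIRECT */
  #include <liburing.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define DEPTH 64               /* requests kept in flight at once */
  #define BLKSZ 4096             /* O_DIRECT needs aligned, block-sized buffers */

  int main(void) {
      struct io_uring ring;
      int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
      if (fd < 0 || io_uring_queue_init(DEPTH, &ring, 0) < 0)
          return 1;

      /* Queue DEPTH scattered reads, then submit them together, so the drive
         sees a deep queue instead of one synchronous read at a time. */
      void *bufs[DEPTH];
      for (int i = 0; i < DEPTH; i++) {
          if (posix_memalign(&bufs[i], BLKSZ, BLKSZ))
              return 1;
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, bufs[i], BLKSZ, (off_t)i * 1024 * BLKSZ);
      }
      io_uring_submit(&ring);

      for (int i = 0; i < DEPTH; i++) {      /* reap all completions */
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);
          io_uring_cqe_seen(&ring, cqe);
      }
      io_uring_queue_exit(&ring);
      close(fd);
      return 0;
  }

Keeping the queue full like this, instead of serializing on each completion, is the difference between a few tens of thousands of IOPS and what the hardware can actually do.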

👤 caeril
> non-embedded key value stores or DBs out in the wild yet

I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS.

You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.

You don't get to do both simultaneously.

Embedded is a feature for performance-aware software, not a bug.



👤 Already__Taken
A SeaweedFS volume store sounds like a good candidate for splitting some of the performance volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.

👤 nerpderp82
Aerospike does direct NVMe access.

https://github.com/aerospike/aerospike-server/blob/master/cf...

There are other occurrences in the codebase, but that is the most prominent one.


👤 nerpderp82
Eatonphil posted a link to this paper https://web.archive.org/web/20230624195551/https://www.vldb.... a couple of hours after this post (zero comments [0]).

> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.

[0] https://news.ycombinator.com/item?id=37899886