HACKER Q&A
📣 nphase

Why are there no open source NVMe-native key value stores in 2023?


Hi HN, NVMe disks, when addressed natively in userland, offer massive performance improvements over other forms of persistent storage. However, despite the existence of projects like SPDK and SplinterDB, there don't seem to be any open-source, non-embedded key-value stores or DBs out in the wild yet.

Why do you think that is? Are there possibly other projects out there that I'm not familiar with?


  👤 diggan
I don't remember exactly why I saved any of them, but these are some experimental data stores that seem to roughly fit what you're looking for:

- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"

- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."

- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"


👤 bestouff
Naive question: are there really expected gains from addressing an NVMe disk natively versus using a regular key-value database on a filesystem?

👤 delfinom
https://github.com/OpenMPDK/KVRocks

Given, however, that most of the world has shifted to VMs, I don't think this kind of KV storage is accessible to most users, because the disks are often split out to multiple tenants. So the overall demand for this would be low.


👤 zupa-hu
Is there any performance gain over writing append-only data to a file?

I mean, using a merkle tree or something like that to make sense of the underlying data.


👤 brightball
SolidCache and SolidQueue from Rails will be doing that when released.

Otherwise though…you have the file system. Is that not enough?


👤 telegpt
The Key to Value: Understanding the NVMe Key-Value Standard

https://cliprecaps.com/read/?v=uQFl5T7IKpI


👤 znpy
I once attended a presentation by a presales engineer from Aerospike, and IIRC they're doing some NVMe-in-userspace stuff.

👤 formerly_proven
There's actually an NVMe command set which allows you to use the FTL directly as a K/V store. (This is limited to 16-byte keys [1], however, so it is not that useful and probably not implemented anywhere.)

[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Stora... but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...
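To make the 16-byte limit concrete: a store built on that command set has to collapse arbitrary application keys into two 64-bit words. A minimal illustrative C sketch (the FNV-1a hashing and the seeds are assumptions for the sketch, not anything from the spec; a real store would also need to handle hash collisions, e.g. by keeping the full key alongside the value):

  #include <stdint.h>
  #include <stddef.h>

  /* Two 64-bit words: the maximum key the KV command set allows. */
  struct nvme_kv_key { uint64_t lo, hi; };

  static uint64_t fnv1a64(const void *data, size_t len, uint64_t seed) {
      const unsigned char *p = data;
      uint64_t h = seed;
      for (size_t i = 0; i < len; i++) {
          h ^= p[i];
          h *= 0x100000001b3ULL;   /* FNV-1a 64-bit prime */
      }
      return h;
  }

  /* Hash a long user key down to the 16 bytes the device accepts. */
  static struct nvme_kv_key make_key(const void *user_key, size_t len) {
      struct nvme_kv_key k = {
          .lo = fnv1a64(user_key, len, 0xcbf29ce484222325ULL), /* FNV offset basis */
          .hi = fnv1a64(user_key, len, 0x9e3779b97f4a7c15ULL), /* arbitrary 2nd seed */
      };
      return k;
  }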


👤 CubsFan1060
Interesting article here: https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...

Utilizing: https://memcached.org/blog/nvm-caching/, https://github.com/m...

TL;DR: Grafana Cloud needed tons of caching, and it was expensive, so they used extstore in memcached to hold most of it on NVMe disks. This massively reduced their costs.
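(For context, extstore is memcached's built-in flash tier, enabled with the ext_path option: hot items stay in RAM and colder values get spilled to a file on flash. A minimal sketch of an invocation, with placeholder path and sizes:)

  # ~4 GiB of RAM for hot items, colder values spilled to a 64 GiB file on NVMe
  memcached -m 4096 -o ext_path=/mnt/nvme0/extstore:64G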


👤 jiggawatts
Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way.

The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.

Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/las...


👤 javierhonduco
There’s Kvrocks. It uses the Redis protocol and it’s built on RocksDB https://github.com/apache/kvrocks

👤 ilyt
It becomes complex when you want to support multiple NVMe drives.

Even more complex when you want any kind of redundancy, as you'd essentially need to build some kind of RAID-like layer into your database.

Also, a few terabytes of NVMe in RAID10 + PostgreSQL (or something like it) covers about 99% of companies' needs for speed.

So you're left with the 1% that needs that kind of speed.


👤 gavinray
What do you mean by non-embedded?

You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:

https://github.com/OpenMPDK/xNVMe

https://github.com/OpenMPDK/KVSSD

https://github.com/OpenMPDK/KVRocks


👤 threeseed
Crail [1] is a distributed K/V store on top of NVMe-oF.

[1] https://craillabs.github.io


👤 otterley
Because you haven't written it yet!

👤 rubiquity
I think it's mostly because, while the internal parallelism of NVMe drives is fantastic, our logical use of them is still largely sequential.
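The drive only shows that parallelism when you keep many requests in flight. A minimal liburing sketch of the difference (device path, queue depth, and offsets are placeholder assumptions, and error handling is mostly elided):

  #define _GNU_SOURCE            /* for O_DIRECT */
  #include <liburing.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define DEPTH 64               /* requests kept in flight at once */
  #define BLKSZ 4096             /* O_DIRECT needs aligned, block-sized buffers */

  int main(void) {
      struct io_uring ring;
      int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
      if (fd < 0 || io_uring_queue_init(DEPTH, &ring, 0) < 0)
          return 1;

      /* Queue DEPTH scattered reads, then submit them together, so the drive
         sees a deep queue instead of one synchronous read at a time. */
      void *bufs[DEPTH];
      for (int i = 0; i < DEPTH; i++) {
          if (posix_memalign(&bufs[i], BLKSZ, BLKSZ))
              return 1;
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, bufs[i], BLKSZ, (off_t)i * 1024 * BLKSZ);
      }
      io_uring_submit(&ring);

      for (int i = 0; i < DEPTH; i++) {      /* reap all completions */
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);
          io_uring_cqe_seen(&ring, cqe);
      }
      io_uring_queue_exit(&ring);
      close(fd);
      return 0;
  }

Keeping the queue full like this, instead of serializing on each completion, is the difference between a few tens of thousands of IOPS and what the hardware can actually do.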

👤 caeril
> non-embedded key value stores or DBs out in the wild yet

I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS.

You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.

You don't get to do both simultaneously.

Embedded is a feature for performance-aware software, not a bug.



👤 Already__Taken
A SeaweedFS volume store sounds like a good candidate for splitting some of the performance volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.

👤 nerpderp82
Aerospike does direct NVMe access.

https://github.com/aerospike/aerospike-server/blob/master/cf...

There are other occurrences in the codebase, but that is the most prominent one.


👤 nerpderp82
Eatonphil posted a link to this paper https://web.archive.org/web/20230624195551/https://www.vldb.... a couple of hours after this post (zero comments [0]).

> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.

[0] https://news.ycombinator.com/item?id=37899886