HACKER Q&A
📣 database_guy

Why are there no easy-to-use highly-available open source databases?


I'm a single engineer who wants to run a database by myself, no cloud vendors involved. I don't really care about scalability to a point, I don't really care about the API to interact with the database. I just want something that I can set up on multiple machines with minimal effort and have the loss of one of those nodes not cause catastrophic failures. I want the experience of something like etcd without being limited to a few gigabytes of data in the cluster (a Terabyte or maybe a couple hundred Gigabytes as a limit would probably be fine). I've spent a lot of time looking through the various options and wrote up my thoughts on why each of them wasn't a good fit.

Postgres: A great database but HA options either have far too many moving parts, require manual intervention, are proprietary, or are Fake Open Source. Seems like there's a lack of interest in addressing this because companies that sponsor development generate a lot of value off complex operations.

MySQL: This database has some odd behavior but that would be fine if it weren't Fake Open Source

MariaDB: HA Options are poor or abandoned

SQL Server: Proprietary

Oracle: haha

CockroachDB: Fake Open Source

YugabyteDB: Fake Open Source. Special shout out here for not even linking to instructions for how to build the database in the readme.

MongoDB: Proprietary, and even before the license change it was Fake Open Source

Cassandra: Not fun to run, and even though I said I didn't care about API I don't necessarily love how it works. But it comes probably the closest?

ScyllaDB: Fake Open Source

TiDB: Fake Open Source

Singlestore: Fake Open Source

FoundationDB: This one comes close but its beginning as a proprietary database really hurt its community, which is way smaller now than it should be. Could grow into something great if more folks got behind it.

etcd: not suitable for use above a couple GB of data

What do I mean by Fake Open Source? A project that has a large percentage of its contributors beholden to a single organization/entity to me is not really open source in spirit. I'm looking for a project where I can feel confident my contributions won't effectively end up behind some proprietary license down the line if/when the VC backed organization that primarily sponsors development decides it needs to protect itself from AWS. If there's an "Enterprise" product and the organization calls the source code for the main project the "Community Edition" or something like it, it's not Real Open Source. If a single organization shuts down and contributions fall off a cliff (https://github.com/rethinkdb/rethinkdb/graphs/contributors) it's not Real Open Source. There are lots of Real Open Source projects with great communities of users/contributors, but many of the newer databases don't have legitimate open source development communities behind them, in my opinion.

I've probably missed some examples. Mostly, I wonder why there hasn't been a general purpose open source database that does the operations stuff as well as the proprietary databases do. Am I missing something?


  👤 ummonk Accepted Answer ✓
> What do I mean by Fake Open Source? A project that has a large percentage of its contributors beholden to a single organization/entity to me is not really open source in spirit.

Then you should use a different term - something like "community project". The single organization projects are still open source, both technically and in spirit.

> I'm looking for a project where I can feel confident my contributions won't effectively end up behind some proprietary license down the line if/when the VC backed organization that primarily sponsors development decides it needs to protect itself from AWS.

If you're talking about ElasticSearch, I'll point out that Amazon forked it and OpenSearch is not behind a proprietary license, so all contributions made to ElasticSearch continue to be available as open source, with improvements being made.


👤 elmerfud
I'm not sure if you're just picky or discerning but it seems you can find a reason to exclude anything if all you do is look for reasons to exclude.

Why not just use SQLite with streaming replication? It should fit your bill.

Databases rarely have what you would define as real open source with real contributors because the nature of a database means you need one owner, and that owner has to be picky and exclude things. Allowing the wrong commit into MariaDB could introduce regressions that no one could even imagine because of the complexity of things.

Even when a database starts out the way you describe with good intentions, in order for it to become a widely adopted product it has to be pulled in under a single umbrella to direct it and build it toward its vision. This puts it firmly in your fake open source camp.


👤 ddorian43
> YugabyteDB: Fake Open Source. Special shout out here for not even linking to instructions for how to build the database in the readme.

All the features are open source. Here is how to build from source https://docs.yugabyte.com/latest/contribute/core-database/bu....

> What do I mean by Fake Open Source? A project that has a large percentage of its contributors beholden to a single organization/entity to me is not really open source in spirit.

Well, somebody gotta start the project, no? Feel free to contribute though. Since it reuses PostgreSQL, it directly inherits the "postgresql community commits". The same goes for it being a fork of Apache Kudu.

> If there's an "Enterprise" product and the organization calls the source code for the main project the "Community Edition" or something like it, it's not Real Open Source.

The "Enterprise" edition is "just" some scripts that make deployment & monitoring easier (and includes 24/7 developer support). All C++ features are open source.

And it's still young. You can't compare against PostgreSQL that has 20+ years of being available.


👤 endisneigh
I’m not understanding why it matters if it’s fake open source or not. Even with “real open source” there’s no guarantee your contributions will be included. Or that the license won’t change in the future.

👤 iAm25626
PostgreSQL - Not sure which HA solution you have experience with, but https://patroni.readthedocs.io/en/latest isn't too bad. An old adage comes to mind: Fast/Cheap/Good, pick 2. HA designs are not all created equal. Rubber-stamping something as HA often gives a false sense of security. HA to me is an explicitly defined risk (downtime) tolerance. For each additional 9 it gets more complex and the cost goes up. Most commercial DBs with HA are opinionated, which is often the opposite of the open source ethos.
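To make the "each 9" point concrete, here is a rough back-of-the-envelope sketch in Python that converts a number of nines into the downtime budget it allows per year. The function name and the 365-day year are my own simplifications for illustration, not anything taken from Patroni or a particular SLA.

```python
# Back-of-the-envelope downtime budget for N nines of availability.
# Assumption: a 365-day year, ignoring leap years and maintenance windows.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget, in seconds, for `nines` nines.

    Two nines means 99% availability, three means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines: about {minutes:,.1f} minutes of downtime per year")
```

Each extra nine shrinks the budget by a factor of ten, which is why the tolerance you actually need is worth writing down before picking an HA design.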

👤 neoCrimeLabs
You are missing something.

You list several completely different types of databases and then mention you are looking for a "general purpose database".

This tells me you do not understand the problem you are trying to solve, therefore do not know how to define the requirements nor features you require to solve it.

It also tells me you do not understand the difference between different types of databases, nor the niche they fill, let alone why their HA functions very differently.


👤 newaccount74
It's not clear to me why you need a distributed database in the first place. If it's just for general purpose small scale projects, does it really matter if your database is down once a year?

I run a couple of Postgres databases on cheap Linux VMs, for various projects, and they have been running smoothly for years. The only problems I had were the two times the disk was full. If I had multiple nodes they would all have been full...

Github has been down more often than my Postgres databases.

HA adds so much complexity and so many tradeoffs that I would really think hard about whether it's worth it for your use case.


👤 snikolaev
Manticore Search is not in the list:

* 100% open source (GPLv2)

* Easy replication:

  - node 1: CREATE CLUSTER c; ALTER CLUSTER c ADD tbl; 

  - node 2: JOIN CLUSTER c AT 'host:port'

* Easy HA: CREATE TABLE dist TYPE='distributed' agent='...' agent='...'

* Real alternative to Elasticsearch in terms of built-in full-text search capabilities, but easier to use.

* Works fine with small and large data volumes:

  - with in-memory storage for smaller data

  - with columnar storage (separate library, Apache2 license) for big data that doesn't fit into RAM

* Does analytical queries well

* Not fully ACID-compliant (like Elasticsearch, ClickHouse, and others)


👤 d_t_w
I have deployed and run Cassandra myself, basically as you describe.

The first 'dev' Cassandra install was three nodes. I downloaded the .tar.gz, installed it, started each process in turn on each node with the required configuration.

That was 2012 and there is a chance that cluster is still running. It was low-volume in terms of data. TTL configured so it would never run out of disk. Never had any issues in particular. I used it for ~5 years before concluding work with that client.

The problem in that case was Cassandra proliferated in that small company, they didn't build any particular expertise beyond me, and in the end I was being pulled into discussions from different teams, different products, split across about 13 clusters. Wasn't much fun - but that wasn't the DB's fault.


👤 otoolep
rqlite https://github.com/rqlite/rqlite

I'm the creator of this project. While it's not going to work super well at very large datasets, it's explicitly designed to be trivial to deploy, and very easy to operate. You can get it up and running in seconds, and clustering seconds later. My practical experience with databases told me that operating the database is at least as important as performing queries with it. So I put a lot of work into easy clustering, clear diagnostics, and solid code.
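For anyone curious what talking to rqlite looks like from a client, here is a minimal sketch in Python using its HTTP API. It assumes a node listening on localhost:4001 and uses the /db/execute and /db/query endpoints; the table name foo and the helper function names are made up for illustration. The helpers only build the requests, so nothing is sent until you pass one to urllib.request.urlopen against a running node.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local rqlite node reachable on this address.
BASE_URL = "http://localhost:4001"


def build_execute_request(statements, base_url=BASE_URL):
    """Build (but do not send) a write request.

    `statements` is a list of SQL strings, sent as a JSON array
    to the /db/execute endpoint.
    """
    body = json.dumps(statements).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/db/execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_query_request(sql, base_url=BASE_URL):
    """Build (but do not send) a read request for the /db/query endpoint."""
    query = urllib.parse.urlencode({"q": sql})
    return urllib.request.Request(f"{base_url}/db/query?{query}", method="GET")


if __name__ == "__main__":
    # Hypothetical table, for illustration only.
    write = build_execute_request(
        ["CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT)"]
    )
    read = build_query_request("SELECT * FROM foo")
    print(write.get_method(), write.full_url)
    print(read.get_method(), read.full_url)
    # To actually run these, start a node and call, for example:
    #   with urllib.request.urlopen(read) as resp:
    #       print(json.load(resp))
```

Because the interface is plain HTTP and JSON, any language with an HTTP client can use it without a dedicated driver.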


👤 jakobdabo
I'm not sure if Percona's HA solutions are any better than what MariaDB offers, but it's not in your list, so maybe worth mentioning.

https://www.percona.com/services/support/high-availability

https://www.percona.com/blog/2021/04/14/percona-distribution...


👤 PeterZaitsev
One thing Percona has that MySQL and MariaDB do not is mature operators with HA support, which make High Availability much easier if you're using Kubernetes.

https://www.percona.com/software/percona-kubernetes-operator...

Disclosure: I'm CEO at Percona


👤 jka
Yep, your assessment seems broadly accurate to me. I was going to suggest Cassandra until I saw it in the list. I'm {interested in/optimistic about} FoundationDB too, although haven't had a chance to use it in practice yet.

Out of curiosity: what would your preferred choice(s) be to fit these requirements using existing proprietary products?


👤 asadawadia
What do you want specifically in a DB? Consistency? Sql or nosql? HA or sharded? homogeneous nodes?

Btw, the risk of nodes dying is quite exaggerated. DO/Vultr/Linode all provide 99.99% uptime.


👤 znpy
The reason: HA for databases is hard.

However, Galera Cluster and/or Percona XtraDB Cluster work remarkably well, considering they’re open source.


👤 techdragon
The community is still slowly recovering from the collapse of the company, but RethinkDB is worth considering in your analysis.

👤 PaulHoule
I'd say Oracle has been a better steward of MySQL than Sun was back when Sun was a separate company.