Postgres: A great database, but the HA options either have far too many moving parts, require manual intervention, are proprietary, or are Fake Open Source. There seems to be a lack of interest in addressing this, because the companies that sponsor development generate a lot of value from complex operations
MySQL: This database has some odd behavior, but that would be fine if it weren't Fake Open Source
MariaDB: HA options are poor or abandoned
SQL Server: Proprietary
Oracle: haha
CockroachDB: Fake Open Source
YugabyteDB: Fake Open Source. Special shout out here for not even linking to instructions for how to build the database in the readme.
MongoDB: Proprietary, and even before the license change it was Fake Open Source
Cassandra: Not fun to run, and even though I said I didn't care about the API, I don't necessarily love how it works. But it probably comes closest?
ScyllaDB: Fake Open Source
TiDB: Fake Open Source
Singlestore: Fake Open Source
FoundationDB: This one comes close but its beginning as a proprietary database really hurt its community, which is way smaller now than it should be. Could grow into something great if more folks got behind it.
etcd: not suitable for use above a couple GB of data
What do I mean by Fake Open Source? To me, a project where a large percentage of the contributors are beholden to a single organization/entity is not really open source in spirit. I'm looking for a project where I can feel confident my contributions won't effectively end up behind some proprietary license down the line if/when the VC-backed organization that primarily sponsors development decides it needs to protect itself from AWS. If there's an "Enterprise" product and the organization calls the source code for the main project the "Community Edition" or something like it, it's not Real Open Source. If a single organization shuts down and contributions fall off a cliff (https://github.com/rethinkdb/rethinkdb/graphs/contributors), it's not Real Open Source. There are lots of Real Open Source projects with great communities of users/contributors, but many of the newer databases don't have legitimate open source development communities behind them, in my opinion.
I've probably missed some examples. Mostly, I wonder why there hasn't been a general purpose open source database that does the operations stuff as well as the proprietary databases do. Am I missing something?
Then you should use a different term - something like "community project". The single organization projects are still open source, both technically and in spirit.
> I'm looking for a project where I can feel confident my contributions won't effectively end up behind some proprietary license down the line if/when the VC backed organization that primarily sponsors development decides it needs to protect itself from AWS.
If you're talking about ElasticSearch, I'll point out that Amazon forked it and OpenSearch is not behind a proprietary license, so all contributions made to ElasticSearch continue to be available as open source, with improvements being made.
Why not just use SQLite with streaming replication? It should fit your bill.
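To make the SQLite suggestion concrete: true streaming replication usually means an external tool (e.g. Litestream, which ships WAL frames to object storage), but as a minimal sketch, here is a crude snapshot-based stand-in using only Python's stdlib `sqlite3` online backup API. The file paths are made up for illustration; this copies the whole database rather than streaming changes, so it's an approximation of the idea, not the real mechanism.

```python
import sqlite3

# Crude stand-in for replication: periodically snapshot the primary
# database into a replica file using SQLite's online backup API.
# Real streaming replication (e.g. Litestream) ships WAL frames instead.

def snapshot(primary_path: str, replica_path: str) -> None:
    src = sqlite3.connect(primary_path)
    dst = sqlite3.connect(replica_path)
    with dst:
        src.backup(dst)  # online backup; readers on the primary keep working
    src.close()
    dst.close()

# Illustrative on-disk paths (":memory:" wouldn't survive reconnects)
primary = "/tmp/primary.db"
replica = "/tmp/replica.db"

conn = sqlite3.connect(primary)
conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES ('a', '1')")
conn.commit()
conn.close()

snapshot(primary, replica)

check = sqlite3.connect(replica)
print(check.execute("SELECT v FROM kv WHERE k = 'a'").fetchone()[0])
check.close()
```

Of course, this gives you a warm standby at best: there's no automatic failover, which is exactly the "HA with few moving parts" gap the thread is complaining about.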
Databases rarely have what you would define as real open source with real contributors, because the nature of a database means you need one owner, and that owner has to be picky and exclude things. Allowing the wrong commit into MariaDB could introduce regressions that no one could even imagine, because of the complexity involved.
Even when a database starts out the way you describe, with good intentions, in order for it to become a widely adopted product it has to be pulled under a single umbrella that directs it and builds it toward a vision. That puts it firmly in your Fake Open Source camp.
All the features are open source. Here is how to build from source https://docs.yugabyte.com/latest/contribute/core-database/bu....
> What do I mean by Fake Open Source? A project that has a large percentage of its contributors beholden to a single organization/entity to me is not really open source in spirit.
Well, somebody's gotta start the project, no? Feel free to contribute though. Since it reuses PostgreSQL, it directly inherits the PostgreSQL community's commits. The same goes for it being a fork of Apache Kudu.
> If there's an "Enterprise" product and the organization calls the source code for the main project the "Community Edition" or something like it, it's not Real Open Source.
The "Enterprise" edition is "just" some scripts that make deployment & monitoring easier (and includes 24/7 developer support). All C++ features are open source.
And it's still young. You can't compare it against PostgreSQL, which has 20+ years of being available.
You list several completely different types of databases and then mention you are looking for a "general purpose database".
This tells me you do not understand the problem you are trying to solve, therefore do not know how to define the requirements nor features you require to solve it.
It also tells me you do not understand the difference between different types of databases, nor the niche they fill, let alone why their HA functions very differently.
I run a couple of Postgres databases on cheap Linux VMs, for various projects, and they have been running smoothly for years. The only problem I had was the two times the disk filled up. If I had had multiple nodes, they would all have been full...
GitHub has been down more often than my Postgres databases.
HA adds so much complexity and so many tradeoffs that I would really think hard about whether it's worth it for your use case.
* 100% open source (GPLv2)
* Easy replication:
- node 1: CREATE CLUSTER c; ALTER CLUSTER c ADD tbl;
- node 2: JOIN CLUSTER c AT 'host:port'
* Easy HA: CREATE TABLE dist TYPE='distributed' agent='...' agent='...'
* Real alternative to Elasticsearch in terms of built-in full-text search capabilities, but easier to use.
* Works fine with small and large data volumes:
- with in-memory storage for smaller data
- with columnar storage (separate library, Apache2 license) for big data that doesn't fit into RAM
* Does analytical queries well
* Not fully ACID-compliant (as with Elasticsearch, ClickHouse, and others)
The first 'dev' Cassandra install was three nodes. I downloaded the .tar.gz, installed it, started each process in turn on each node with the required configuration.
That was 2012 and there is a chance that cluster is still running. It was low-volume in terms of data. TTL configured so it would never run out of disk. Never had any issues in particular. I used it for ~5 years before concluding work with that client.
The problem in that case was that Cassandra proliferated in that small company, they didn't build any particular expertise beyond me, and in the end I was being pulled into discussions from different teams and different products, split across about 13 clusters. Wasn't much fun, but that wasn't the DB's fault.
I'm the creator of this project. While it's not going to work super well at very large datasets, it's explicitly designed to be trivial to deploy, and very easy to operate. You can get it up and running in seconds, and clustering seconds later. My practical experience with databases told me that operating the database is at least as important as performing queries with it. So I put a lot of work into easy clustering, clear diagnostics, and solid code.
https://www.percona.com/services/support/high-availability
https://www.percona.com/blog/2021/04/14/percona-distribution...
https://www.percona.com/software/percona-kubernetes-operator...
Disclosure: I'm CEO at Percona
Out of curiosity: what would your preferred choice(s) be to fit these requirements using existing proprietary products?
Just btw, the risk of nodes dying is quite exaggerated. DO/Vultr/Linode all provide 99.99% uptime.
However, Galera Cluster and/or Percona XtraDB Cluster work remarkably well, considering they're open source.
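For a sense of what a Galera/XtraDB Cluster node takes to configure, here is a minimal sketch of a my.cnf fragment. The node names, addresses, and library path are illustrative assumptions; exact settings and defaults vary by version and distribution, so treat this as a shape, not a recipe.

```ini
# my.cnf fragment for one node of a three-node Galera cluster
# (hostnames, IPs, and the provider path are illustrative)
[mysqld]
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2

wsrep_on              = ON
wsrep_provider        = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name    = my_cluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name       = node1
wsrep_node_address    = 10.0.0.1
wsrep_sst_method      = mariabackup
```

Compared to most Postgres HA stacks, it really is fewer moving parts: the replication plugin lives inside the server process, though you still typically put a load balancer or proxy in front for failover.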