Or put another way, what are some cutting edge OLTP database techniques/features/architectures that PostgreSQL and MySQL would have a hard time supporting?
1. Horizontal scaling. PG can do sharding or replication, but that's not the same thing. Manual resharding is something that, once you've experienced it, you never want to do again, especially when things like Spanner/Cockroach exist. Database-level liquid sharding is such a dramatically superior solution that it makes any system that depends on an enumerated set of master instances seem completely obsolete.
2. Strong commit timestamps, again a la Spanner/Cockroach. Globally-ordered commit times and read snapshots aren't something you'll need in every schema, but when you do, they're magical - suddenly, you can use your OLTP system for strong lease assignment (conditioning other transactions on lease validity), you can construct arbitrary application-level transaction semantics (non-transactional read followed by transactional compare-and-swap is pretty powerful in some scenarios), and all sorts of other things.
3. Interface abstraction. Databases that ship language-support drivers should also ship a fake in-memory implementation for testing purposes that supports the same interfaces and options. There's no reason why I should have to start a subprocess and emulate a network connection to tell whether a query functions properly or not, or whether I've set isolation properties correctly for a pipeline.
4. Declarative schemas, and a well-defined update process (that isn't human-written DDL), are essential at any sort of organizational scale. Every time I see an "ALTER TABLE" statement checked into a repository, I tear my hair out.
5. Queue support, where message publishing/consumption is transactional with other read/write operations. You can always build a queue with a time-keyed table, but it's kind of a pain in the ass to get all the edge cases right. It's much better as a first-class concept supporting automatic sharding, retries, batching, and bounded-out-of-order delivery with a global cursor.
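For reference, here's a minimal sketch of the DIY time-keyed-table approach in stock PostgreSQL (table and column names are made up); it handles the happy path and concurrent consumers, but none of the sharding/retry/ordering concerns above:

    -- Hypothetical queue table; publish by INSERTing in the same transaction as other writes.
    CREATE TABLE job_queue (
        id          bigserial PRIMARY KEY,
        enqueued_at timestamptz NOT NULL DEFAULT now(),
        payload     jsonb       NOT NULL
    );

    -- Consume one message transactionally. SKIP LOCKED lets concurrent consumers
    -- proceed without blocking each other; the DELETE only takes effect if the
    -- surrounding transaction commits.
    BEGIN;
    DELETE FROM job_queue
    WHERE id = (
        SELECT id FROM job_queue
        ORDER BY enqueued_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;
    -- ... do the work and write results in the same transaction ...
    COMMIT;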
2. A powerful graph query language like Cypher. It might not perform well in real life, but my personal experience left me amazed[2]. There are a number of issues with SQL that could be addressed[3], but the current standard is far too entrenched.
3. Zero impedance mismatch between the database and application representation. In a Smalltalk database like GemStone, writing database code and application code is a really seamless experience for the developer[4]. To some extent, MongoDB's success can be attributed to this aspect.
4. Datomic's temporal capabilities[5]. It is hard to maintain temporal tables in Postgres, and there are some use cases where you really want to query at a point in time. Strictly speaking not an OLTP feature, but I can see this being useful in many scenarios.
[1] https://www.youtube.com/watch?v=Fo1dPRqbF-Q
[2] https://www.youtube.com/watch?v=pMjwgKqMzi8&t=726s
[3] https://www.edgedb.com/blog/we-can-do-better-than-sql
[4] https://www.youtube.com/watch?v=EyBkLbNlzbM
[5] https://www.youtube.com/watch?v=7lm3K8zVOdY
Edited: independence -> impedance
It has broad PGSQL language compatibility (and wire compatibility too, I think), yet has a clustered peer architecture well suited to running in a dynamic environment like the cloud or k8s. Nodes can join dynamically and it can survive nodes leaving dynamically as long as there's a quorum. Data is distributed across the nodes without the administrator needing to make any shard-rebalancing interventions.
PGSQL is designed for deployment as a single server with replica servers for HA. It's not really designed for horizontal scalability like Cockroach. You can do it - the foreign data wrappers feature and table partitioning can give you a poor man's scale-out. Or you can use Citus, which won itself a FOSS license earlier this year. And there are other FOSS and proprietary approaches too.
MySQL is similar - you can do it, like with their recent router feature, but it has been retrofitted, and it's not as fluid as Cockroach. IIRC MySQL router is similar in configuration to Galera - that is, a static config file containing a list of cluster members.
Listen I'm sure that the design approach of Cockroach could be retrofitted to PGSQL and MySQL, but I'm pretty sure that doing a good job of it would be a lot of work.
So in answer to your question, I'm not sure there's all that much an RDBMS can't be made to do. Geospatial, graph, time series, GPU acceleration - Postgres has it all, and often the new stuff comes to Postgres first.
By the way I love MySQL and PostgreSQL, and the amazing extensions for PGSQL make it extra awesome. Many are super mature and make pgsql perfect for many many diverse use cases.
For XXL use cases though, CockroachDB is taking a very interesting new path and I think it's worth watching.
We think a lot about this exact question. Here are some of the things YugabyteDB can do as a "modern database" that a PostgreSQL/MySQL cannot (or will struggle to):
* High availability with resilience / zero data loss on failures and upgrades. This comes from the inherent architecture, whereas with traditional leader-follower replication you can lose data, and with solutions like Patroni you can lose availability / optimal utilization of the cluster's resources.
* Scaling the database. This includes scaling transactions, connections and data sets *without* complicating the app (like having to read from replicas sometimes and from the primary at other times depending on the query). Scaling connections is also important for lambda/function-style apps in the cloud, as they could all try to connect to the DB in a short burst.
* Replicating data across regions. Use cases like geo-partitioning and multi-region sync replication to tolerate a region failure without compromising ACID properties. Some folks think this is far-fetched - it's not. Examples: the recent fire at an OVH datacenter and the Texas snowstorm both caused regional outages.
* Built-in async replication. Typically, async replication of data is "external" to DBs like PG and MySQL. In YugabyteDB, since replication is a first-class feature, it is supported out of the box.
* Follower reads / reads from nearest region with programmatic bounds. So read stale data for a particular query from the local region if the data is no more than x seconds old.
* We recently enhanced the JDBC driver to be cluster aware, eliminating the need to maintain an external load balancer because each node of the cluster is "aware" of the other nodes at all times - including node failures / add / remove / etc.
* Finally, we give users control over how data is distributed across nodes - for example, do you want to preserve ASC/DESC ordering of the PKs, or use a HASH-based distribution of data? (See the sketch below.)
There are a few others, but this should give an idea.
(Disclosure: I am the cto/co-founder of Yugabyte)
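To make that last distribution-control bullet concrete, in YugabyteDB's YSQL the choice is expressed in the primary key itself - roughly like this (a sketch; check the docs for the exact syntax in your version):

    -- Hash-distributed: rows are spread across nodes by hash of id (scales writes,
    -- avoids hot spots, loses range-scan locality).
    CREATE TABLE events_by_hash (
        id      bigint,
        payload jsonb,
        PRIMARY KEY (id HASH)
    );

    -- Range-distributed: rows are stored in ascending id order (good for range scans,
    -- but the newest ids can become a hot spot).
    CREATE TABLE events_by_range (
        id      bigint,
        payload jsonb,
        PRIMARY KEY (id ASC)
    );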
While MVCC is fashionable nowadays, and more or less every platform offers it at least as an option, my experience, and also opinions I have heard from people using SQL Server and similar platforms professionally, is that for true OLTP at least, good ol’ locking-based protocols in practice outperform MVCC-based protocols (when transactions are well programmed).
The “inconvenient truth” [0] that maintaining multiple versions of records badly affects performance might in the future make MVCC less appealing. There’s ongoing research, such as [0], to improve things, but it’s not clear to me at this point that MVCC is a winning idea.
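For what it's worth, even MVCC engines let a well-programmed transaction opt into pessimistic locking explicitly; a small sketch in plain SQL (hypothetical accounts table):

    -- MVCC-style: the read sees a consistent snapshot and never blocks writers.
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT balance FROM accounts WHERE id = 42;
    COMMIT;

    -- Locking-style: take the row lock up front so the read-modify-write
    -- can't be invalidated by a concurrent writer.
    BEGIN;
    SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    COMMIT;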
CockroachDB is a distributed SQL database, with strong consistency. I think that Yugabyte is similar.
Support for realtime changes, including queries over those changes, is better in other databases, like RethinkDB, OVSDB or Firebase.
Relational is not always the best way to store your data. Sometimes a graph model is better, or for full-text search something specialized is better. In that sense Dgraph or Elasticsearch may be better OLTP databases in some cases.
Columnar storage, as used in Vertica or BigQuery, has several advantages in processing and data compression. But implementing it in PG or MySQL would, I think, require almost a full rewrite.
2. Progressive automatic scalability. As load increases or storage runs out, the DB should be able to scale automatically. NewSQL databases do this already.
3. Tiered storage.
4. Support for streaming, stream processing, in-memory data structures, etc. I feel like this is one of those weird things, but I keep wishing it were possible when I work on side projects or startups. I don't want to have to spin up mysql/postgres, kafka/pulsar, flink/whatever stream processing, redis, etc. separately, because that would be prohibitively expensive when you're just getting off the ground unless you have VC money. So I find myself wishing I could deploy one thing that did all of those jobs, that could scale somewhat until there's a need to break everything out into its own infrastructure, and if it were wire-compatible with the popular projects, that would be perfect. Will it happen? I doubt it, but it would be lovely, assuming it worked well.
https://www.usenix.org/legacy/events/lisa11/tech/slides/ston...
Have a look at Michael Stonebraker, he's a database genius, and keeps starting "NewSQL" companies which use the relational model, but specialized for different applications, e.g. column stores.
1) Incremental materialized view maintenance, à la Materialize (bonus points for supporting even gnarly bits of SQL like window functions).
2) Really ergonomic and scalable pub/sub of some sort, à la RethinkDB.
3) Fine tuned control over query plans if I want it.
4) Probably very deep Apache Arrow integration.
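On point 1, the Materialize-style experience is roughly the following (a sketch with a made-up orders schema): you declare an ordinary-looking materialized view and the engine keeps it up to date incrementally as the base tables change, instead of recomputing it on an explicit REFRESH the way PostgreSQL does:

    -- In Materialize this view is maintained incrementally as new rows arrive.
    CREATE MATERIALIZED VIEW revenue_by_region AS
    SELECT o.region,
           sum(li.quantity * li.unit_price) AS revenue
    FROM orders o
    JOIN line_items li ON li.order_id = o.id
    GROUP BY o.region;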
And, I think over the years they have been such an awesome community that has helped me irregularly when I popped into IRC every year or two for a problem I was having. https://github.com/davidfetter ; https://github.com/RhodiumToad ; https://github.com/bmomjian - you are great peeps!
What I would like to see are more security usability enhancements over and above what you find in https://www.crunchydata.com/products/hardened-postgres/
Security Enhancements List:
Attack Resistance
Better protections for the DBMS itself in assume compromised scenarios, or malicious users
Customized implementations per compliance/security regime
Accelerators for various security configurations
Self-audit, self-secure
Functionality Enhancements List:
Continuous Sync Postgres to other serious DB (not a one-way sync, not a DB move)
For example: https://redis.io/commands/BLPOP
If I could do something like:
Select id, msg
From job
Where id > 1224
And have that query block until there actually was a job with an id > 1224, it would open up some interesting use cases.
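The closest approximation in stock PostgreSQL today is LISTEN/NOTIFY plus a re-query, which works but isn't the truly blocking query being asked for (a sketch; trigger and channel names are made up):

    -- Notify listeners whenever a job row is inserted.
    CREATE FUNCTION notify_new_job() RETURNS trigger LANGUAGE plpgsql AS $$
    BEGIN
        PERFORM pg_notify('new_job', NEW.id::text);
        RETURN NEW;
    END;
    $$;

    CREATE TRIGGER job_insert_notify
    AFTER INSERT ON job
    FOR EACH ROW EXECUTE FUNCTION notify_new_job();

    -- Consumer session: block on the notification, then re-run the query.
    LISTEN new_job;
    SELECT id, msg FROM job WHERE id > 1224;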
Once your developers have completed an iteration, your DB will see the same queries over and over again (if it doesn't, then it should be an OLAP aggregate). These databases optimize for writes and defer complexity to reads, which, considering that you could see millions more reads than writes, makes no sense whatsoever.
- Comprehensive auto tuning: automatic index creation, automatic schema tuning, dynamically switching between column/row-oriented, etc. User specifies SLOs, database does the rest.
- Deeply related to the previous two points: perfect horizontal scalability
- Configurable per-query ACID properties (e.g. delayed indexing)
- All of the above while maintaining complex SQL features like arbitrary joins, large transactions, triggers, etc.
Sure, some of these are in some form in some existing databases. But none offer all of them.
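A sliver of the per-query ACID idea does exist today - e.g., PostgreSQL lets you trade durability per transaction, though nothing like delayed indexing (a sketch with a made-up table):

    -- Relax durability for this transaction only: the commit may be lost in a crash,
    -- but it can never be seen in a partially applied state.
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO page_views (user_id, url) VALUES (1, '/home');
    COMMIT;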
EDIT: yes, thanks for the comments - views, creating API layers, and adding instead of subtracting all work, but I believe they're all workarounds for the underlying problem. Fixing versioning would make everyone's lives easier.
I believe PostgreSQL and MySQL will support this within a few years.
One can achieve this to a degree by storing the data as JSON, but it would be nice to be able to remove the chance of introducing errors when converting to/from JSON.
- Very smart clustering/redundancy features
- Can be queried over HTTP
- Has a great API
- Counterpart project (Pouch DB) can work on the client
- Queries can be cached (it's HTTP)
- Said to be 'schemaless' but can be structured
The project is always overlooked, but I still feel it's one of the better NoSQL datastores.
Real-time queries should be the default. Some kind of smart query invalidation (similar to incremental view maintenance) as a result of DML statements.
2. Can run in a web browser / mobile phone
Your client-side cache is usually a messy denormalization of your SQL database. For offline/local-first apps you essentially need to run your entire backend API in your browser. So you should be running an SQL database in the browser. For this to be seamless, you need your database to target JS from the design stage.
3. Syncing
To support offline/local/mobile use cases, your database needs to be able to sync changes efficiently.
4. Logical-first data schema
You should be able to give your db a logical schema instead of a physical one, and the database can take care of the necessary denormalizations for performance reasons automatically.
5. Lower-level api for query planner
Instead of having SQL as your only interface, there should be a way to interact with the query tree and the query plan directly. People are always creating new query languages hacked over SQL (like Google's Logica).
6. Graph-based queries
Support a query language like Cypher. Provide better solutions for tree-based data structures - the queries get too unwieldy today.
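To illustrate point 6, "unwieldy" in practice means boilerplate like this recursive CTE just to walk a tree (hypothetical categories table), where a graph language would express the traversal in a line or two:

    -- Walk a parent/child tree from a root node; every traversal needs this scaffolding.
    WITH RECURSIVE subtree AS (
        SELECT id, parent_id, name, 0 AS depth
        FROM categories
        WHERE id = 1
      UNION ALL
        SELECT c.id, c.parent_id, c.name, s.depth + 1
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT * FROM subtree ORDER BY depth;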
This is a tricky engineering problem - have two kinds of indexes in the same database. But disk space is cheap. Network is expensive (especially if you're on AWS).
So they have synchronous clients that do not use HTTP or JSON. You need asynchronous clients that do not consume thread context switches.
Also you always want to replicate all data in more than one location in real-time because 100% read uptime is really easy to provide, while 100% write uptime is a complex nightmare.
I made this database after using MySQL, Oracle and Postgres for 10 years (it's a 2000 line replacement for my needs that these databases filled, a very narrow subset of the features you'll find in legacy databases): http://root.rupy.se
Also, getting up and running with an investor demo ASAP with zero technical fuss, because you have a startup idea but you're broke and can't pay your next month's rent unless you either (A) finish this demo and get that investor money next week, or (B) quit working on your idea and take the Google offer. (Yes, I've actually been there.)
There's a variant of the question that interests me though. It's said that much of large-scale software nowadays is built the way we build our cities: one layer on top of another on top of another. Even though a city like London is large and capable and full-featured, it has years of history below it, and sometimes that slow evolution even complicates things. Some might argue that if we had unlimited resources, we could go find a nice new piece of virgin ground and build a New London that could be simplified and wouldn't have to contend with its historical baggage.
So by analogy, if we could respin a fresh new simplified/unified/rationalized version of postgresql today, what would it look like?
While possible with older SQL databases through your own code, distributed sharding and keeping multiple databases in sync would, I think, be more useful at the DB level than in user code.
You can certainly argue that shouldn't be part of the database software though.
A modern database language/interface would be far more useful to most than general purpose programming languages are. We can do better for end-users than forcing them to grapple with archaic `SELECT` statements. The crowd that have invested in learning the language already will disagree, of course, but far too many newbies self-select out of that group because it seems too complicated to them. It doesn't have to be so.
Calvin (http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.p...) and VoltDB are examples of different architectures that trade off the ability to do SQL-style interactive transactions (`BEGIN`+run code client side+`COMMIT`), forcing everything through predefined procedures. Combining that with better stored procedures is interesting.
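In PostgreSQL terms, the "no interactive transactions" model roughly means every transaction becomes a single predefined server-side call - something like this sketch (made-up accounts/transfer names; only an approximation of what Calvin/VoltDB actually enforce):

    -- The whole transaction is predefined on the server: one round trip,
    -- no client code running between BEGIN and COMMIT.
    CREATE PROCEDURE transfer(from_id bigint, to_id bigint, amount numeric)
    LANGUAGE plpgsql AS $$
    BEGIN
        UPDATE accounts SET balance = balance - amount WHERE id = from_id;
        UPDATE accounts SET balance = balance + amount WHERE id = to_id;
    END;
    $$;

    CALL transfer(1, 2, 100.00);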
This greatly improves latency and simplifies code. Trying to manually merge the mutations into all of your queries would be very awkward.
Also, the whole concept of browser clients interacting directly with the database server is darn interesting. Combine that with better support for stored procedures etc for actually enforcing business logic (somewhat made possible by Firestore security rules, but very awkward).
* Exclusion Constraints support for GIN indexes (which supports Arrays and JSONB.) This would allow unique keys on array elements without having to use triggers to put the individual array elements in another table.
* Array element foreign keys, which I think is a subset of the Inclusion Constraints proposal (which would additionally allow constraints into ranges.)
* A faster procedural language. While plv8 is an improvement over plpgsql, it still seems substantially slower than running the equivalent in a separate NodeJS process. (Possibly just build issues for me.)
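For reference, the workaround the first bullet alludes to looks roughly like this today (hypothetical tables; the side table has to be kept in sync with the array via triggers or application code):

    -- Desired: "no two documents may share a tag", enforced directly on the array column.
    CREATE TABLE documents (
        id   bigserial PRIMARY KEY,
        tags text[] NOT NULL
    );

    -- Workaround: mirror each array element into a side table where a plain UNIQUE
    -- constraint applies; triggers must keep it consistent with documents.tags.
    CREATE TABLE document_tags (
        document_id bigint NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
        tag         text   NOT NULL UNIQUE
    );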
Edit - reread the question and it mentions OLTP workloads. Snowflake and Redshift are specialized analytics DBs so they don’t meet that criteria.
The tl;dr is that its schema definitions and query language align much more closely to typical app logic, theoretically eliminating the draw of an ORM and being very good at things that are unwieldy with SQL, like retrieving sets of related records.
You can do most anything with the wrong tool. I'd rather ask the question, when is PostgreSQL or MySQL the wrong tool for the job. I'm not sure I'm qualified to answer this, but I can point you in the direction of a book that has given me a much better understanding of the space. https://www.oreilly.com/library/view/designing-data-intensiv...
I sorely miss them on PG.
MariaDB has them https://mariadb.com/kb/en/bitemporal-tables/
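Roughly what that looks like in MariaDB - system versioning for the "when did the database know it" axis plus an application-time period for the "when was it true" axis (a sketch; see the linked docs for exact syntax):

    CREATE TABLE prices (
        item_id    bigint,
        price      numeric,
        valid_from date,
        valid_to   date,
        PERIOD FOR valid_period(valid_from, valid_to)
    ) WITH SYSTEM VERSIONING;

    -- Point-in-time query against the system timeline.
    SELECT * FROM prices
    FOR SYSTEM_TIME AS OF TIMESTAMP '2021-01-01 00:00:00';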
2) Distributed DB, like Spanner and CRDB.
3) Real-time analysis of performance, adapting the DB config accordingly, without requiring HUMAN intervention.
4) Multiple (joint?) storage types per table (Memory, Columnar or Rows, for example);
5) Native support for Copy On Write replicas (useful for ephemeral environments, Stage and development)
6) Serverless capabilities
Linear relationships must be defined. Your database doesn't do much for you. You must define every relationship between tables.
- Algebraic data types, removal of NULLs.
- Including a relational language, not just a partial query language (SQL). (I'm making one at https://tablam.org, just to get the idea.)
- So it's fully relational (you can store tables in tables, you can model trees with tables because of the above, etc.)
- SQL is an interface for compatibility and such, but the above is for everything else, because:
- The engine is not one big black box but a composite of blocks, so:
-- The inner and only black box is the full ACID storage layer, concerned with managing PAGEs, WALs, etc., written in a language like Rust.
-- The user-facing/high-level storage layer sits above this. I think this would allow it to be coded in the language above, because there exists:
-- A pluggable language interface (a "WASM for the database/VM") that others (like SQL) compile to. And probably WASM for stored procedures and/or to extend it. THEN:
-- This would allow compiling "SELECT field FROM table" CLIENT-SIDE and checking it (once supplied with the schema definition). AND ALSO:
- Because it doesn't have a limited query language but a full one, you can code a new INDEX with it. Note that doing this in any language (ignoring the complexity of storage and ACID - this is where the high-level interface is needed) is simple, but impossible in current RDBMSs.
- Because the DB is truly, fully, relational, you can do "SELECT * FROM Index"
- Then, you can add a cargo-like package manager to share code with the community
- Then, you can "db-pkg add basic-auth" to install things like auth modules that are actually used, unlike the security features included in old databases for a use case not many care about
- Allow real-time subscriptions to data/schema changes
- Make it HTTP-native, so it's already REST/GraphQL/WebSocket/etc. endpoint-capable, and
- Go the extra mile with the idea of Apache Avro or similar and make the description of the DB schema integral to it, so you can compile interfaces to the DB
- Store the schema changes, so it has built-in MIGRATION support (git-like?)
- Then auto-generate DOCS with something like Swagger?
----
In relation to the engine itself:
- The storage is mixed row/columnar (PAX-like) to support mixed-workloads
- The engine, like sqlite, is a single library. Server support is a separate exe, and the package manager is what installs support for operations
- The DB is stored in a single-file?
- We want to store:
-- Rows/Tables: BTrees + PAX like today, nothing out of the ordinary
-- LOGs/Metrics: this is the same as the WAL! An RDBMS already has it, but it's buried: surface it, so you can do 'SELECT * FROM my_wal'
-- Vectors: the same as PAX storage, but with only 1 column
-- Trees: something you can do if the DB is truly relational and allows storing tables/algebraic types in it
IF the storage has a high-level interface and a full-featured ("WASM-like") language interface to it exists, you can add optimizations to the query planner and to the code that manipulates the data without having to get into the depths of the engine.
This means that people who want to disable the query planner would instead NEED to improve it - IF the query planner is a surfaced component of the engine that they can tweak.
- Expose lower-level access to MVCC etc. Something finer grained than global transactions.
- Similarly, make indexes more first-class and visible.
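Today about the only MVCC internals PostgreSQL surfaces are the hidden xmin/xmax/ctid system columns and the current transaction id; a finer-grained API would have to go well beyond peeking at these (made-up table name):

    -- Which transaction created (xmin) and, if any, deleted/updated (xmax) each row
    -- version, plus its physical location (ctid) and the current transaction id.
    SELECT xmin, xmax, ctid, * FROM accounts LIMIT 5;
    SELECT txid_current();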
Sometimes I feel like databases were created by people who never built a website before. Most websites are pretty similar, and databases historically have never felt (to me at least) "modern". I always feel like I'm fighting against them, and making usability concessions for the sake of performance.
First, the ability to subscribe to external data sets. I feel like I spend so much time writing crappy syncing code with external APIs (like Clearbit, GitHub, etc), and it would be so much nicer if I could just "connect" with them and know it will be fairly up to date.
I also think there's so many things everyone finds themselves redoing for no reason. For example, almost every site on the internet has a user database with sessions, and each user has an email (that must be valid + verified), a password (that's hashed + salted) and 2FA (which everyone is basically implementing themselves). It'd be so nice if the database just "knew" it was a user, and you were tweaking the presets rather than building it from scratch.
Every single company has similar workflows they each solve themselves (often in insecure ways): they all have a production database, staging environments, migrations, direct access for customer support to fix things, local db access/clones for development, etc. I'd LOVE a database that was created with all these use-cases in mind... such as a way to connect to a DB locally but scrub sensitive data, take care of migrations seamlessly, etc.
This might be a bit too "in a magical world"-y, but I'd love to not have to think about tradeoffs. Kind of like an automatic car, I'd love my database to be able to shift based on the types of data and the amount of reads/writes. At my company, we have 3-4 different databases for different reasons (Mongo, Redis, Elasticsearch, ClickHouse), and it gets really difficult to keep all the data synced and connect them behind the scenes. I'd love to just never have to think about the low-level data store ever again, and have the DB do all the work without us having to worry.
There's a number of primitives that I think are used a lot, and it'd be amazing if they were built in. For example, time. It'd be great to easily get the difference between two times, or total the times of a bunch of rows. Airtable has a time primitive, and it's amazing how much friendlier it is to use.
Overall, I'd also love it to just feel a lot more like Airtable, including an Airtable-like interface for working with it (right down to the ability to create custom views and create on-the-fly forms people can submit data to). I honestly use Airtable for most of my DB needs these days (for one-off small projects), and it's such a delight to use.
Maybe I'm underestimating the importance but... I feel like databases are pretty performant these days. I hope that we can start seeing dramatic UX improvements, since we don't have to optimize for performance the same way we have in the past.
Persistent queries, where any change to the answer is propagated, like a subscription to a feed.
The rest of my admittedly very snarky, but arguably highly accurate commentary on various veins of discussion on this topic will be left in a time capsule below for future generations of people who actually have a legitimate interest in computer programming to discover after the Taliban are driven out of the computer industry.
Theoretically you can replace hiring smart programmers with magical artificial intelligence that reads your mind and then uses the computer to bend the world to your iron will, but if someone else could produce such a magical artificially intelligent database why would they share it with you? Perhaps they are an omnipotent omnipresent omnibenevolent entity that loves you just a little bit more because of your good looks and high intelligence?
Is this different from wanting to do a persistent query? Basically, since you have a programmable computer that you can program to talk to the database program, you write a computer program that periodically queries the database and then takes the action when the condition is met.
SQL is an implementation of a mathematical descriptive language for relationships. The whole point is that temporal logic like "wait until this happens, then do this" can be kept somewhat separate from logic describing the data you are tracking.
You have SQL that describes what it is you want to store, and particular questions you want to ask about what it is you want to store, and then the job of the database program is to figure out how to store the data safely and efficiently and answer your questions quickly. How you write the SQL that describes the way you want to store the data depends some on what kind of questions you want to ask, and this is what an actually skilled "Database Application Programmer" can figure out for you.
Some proprietary (and probably also Postgres) databases do provide support for the kind of thing you are asking to do here, in the form of what are called "Stored Procedures". Your average corporation accumulates an utter shitload of these stored procedures that various non-technical technical question askers in different departments don't tell each other about, and they are often doing the same thing in different ways at different times. Then later they crash the database and break the application itself because there is insufficient technical oversight and communication at the actual decision-making levels of the corporate bureaucracy.
Long story short, do all of this stuff in persistent queries done outside of the database and tracked in a shared medium like a wiki page, or even better a physical notebook routinely reviewed as part of management actually doing some managing. https://en.wikipedia.org/wiki/Gantt_chart
This is an indication that you are not using a Model View Controller approach to building your database client. There are some python ORMs that were doing this automatically and correctly a decade ago, but there are presumably still several corporations with hundreds of millions of dollars using Enterprise Java Beans in 2021, and maybe some people with nuclear arsenals as well, so you shouldn't consider yourself too behind the times.