HACKER Q&A
📣 vaughan

Is there a way to efficiently subscribe to an SQL query for changes?


I know [RethinkDB][1] used to do this with their SQL-like ReQL language, but I looked around a bit and can't find much else about it - and I would have thought it would be more common.

I'm more interested in queries with joins and doing it efficiently, instead of just tracking updates to tables that are modified and re-running the entire query.

If we think about modern frontends using SQL-based backends, essentially every time we render, it's ultimately the result of a tree of SQL queries (queries depend on results of other queries) running in the backend. Our frontend app state is just a tree of materialized views of our database which depend on each other. We've got a bunch of state management libraries that deal with trees, but they don't fit so well with relational/graph-like data.

I came across a Postgres proposal for [Incremental View Maintenance][2] which generates a diff against an existing query with the purpose of updating a materialized view. Oracle also has [`FAST REFRESH`][3] for materialized views.
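
For reference, Oracle's FAST REFRESH works roughly like this (table and column names here are made up): you declare materialized view logs on the base tables, and the view can then be maintained incrementally at commit time:

    -- materialized view log records the row-level changes needed for FAST refresh
    CREATE MATERIALIZED VIEW LOG ON orders
      WITH ROWID (user_id, price) INCLUDING NEW VALUES;

    -- fast-refreshable aggregate view, maintained incrementally on commit
    CREATE MATERIALIZED VIEW order_totals_mv
      REFRESH FAST ON COMMIT
      AS SELECT user_id, SUM(price) AS total, COUNT(price) AS price_cnt, COUNT(*) AS cnt
         FROM orders
         GROUP BY user_id;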

I guess it's relatively easy to do until you start needing joins or traversing graphs/hierarchies - which is why it's maybe avoided.

EDIT: [Materialize][4] looks interesting in this space: "Execute streaming SQL Joins" - but it's more focused on event streams than on general-purpose DML/OLTP.

[1]: https://github.com/rethinkdb/rethinkdb_rebirth

[2]: https://wiki.postgresql.org/wiki/Incremental_View_Maintenance

[3]: https://docs.oracle.com/database/121/DWHSG/refresh.htm#DWHSG8361

[4]: https://materialize.com/


  👤 rockwotj Accepted Answer ✓
I'm not sure I would call this relatively easy even without joins. Without joins, Google Cloud Firestore does exactly what you're describing. The initial query runs against the DB, then the client gets a filtered stream of updates to only that query. It's distributed and scales logarithmically with the number of queries, as it doesn't need to keep the query results in memory/materialized.

The fun part of this problem is that it's really inverted from most traditional database literature out there. Mostly the problem is you have a query and need to find the matching rows. To make updates efficient you need to do this the other way around - if there is a row (that has changed) find all the matching queries.
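
To make that inversion concrete, a toy sketch (purely illustrative, not how Firestore is implemented): store each subscription's simple predicates as data, then match a changed row against all of them:

    -- one row per subscribed query's predicate bounds
    -- (illustrative schema, supports only simple range predicates on one field)
    CREATE TABLE subscriptions (
      subscription_id bigint PRIMARY KEY,
      collection      text,
      price_min       numeric,
      price_max       numeric
    );

    -- given a changed row (:collection, :new_price), find every subscribed query it matches
    SELECT s.subscription_id
    FROM subscriptions s
    WHERE s.collection = :collection
      AND :new_price BETWEEN s.price_min AND s.price_max;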

With joins (or any data-dependent query where you can't tell a row is in the result set without looking at other data) you need to keep the query results materialized; otherwise you don't have enough information without going back to disk or keeping everything in memory, which isn't really feasible in most cases.

Source: I worked on Google Cloud Firestore from its launch until 2020 and was the one responsible for the current implementation and data structures of how changes get broadcast to subscribed queries.


👤 cirego
Disclaimer: I work at Materialize.

This is the exact problem that we are solving here at Materialize! I wrote an example blog post that details how to subscribe to a SQL query in practice: https://materialize.com/a-simple-and-efficient-real-time-app...

Regarding your comment about "focus on streams", it's true that we first focused on extracting event data from other sources (Kafka, Debezium, direct change data capture from external databases such as Postgres, S3, etc). Over time, however, we plan to add additional features that will allow users to also treat Materialize as a general purpose database.

Hope this helps and happy to answer any questions!

edit: I was imprecise in my usage of the term event streams. Materialize supports inserts, updates and deletes at its core (the top-K query in the blog post above demonstrates this). Materialize is a more general solution than something focused on append-only event streams.


👤 wolframhempel
Here's how it works with JOINs in Postgres, using the NOTIFY/LISTEN mechanism:

  -- Run the update and collect the ids of the affected rows in updated_ids
  WITH updated_ids AS (
   UPDATE public.doc
   SET version = version + 1
   
  --  Joins are not directly part of Postgres' UPDATE syntax, but FROM + WHERE achieves the same
   FROM public.org
   WHERE public.org.id = public.doc.org_id AND public.org.type = 'customer'
   
  -- Return the ids of all affected rows
   RETURNING public.doc.id
  )
  -- convert the returned ids into a comma-separated string that we can send via the NOTIFY mechanism
  , id_string_result AS (
   SELECT ARRAY_TO_STRING(ARRAY_AGG(id), ',') AS id_string FROM updated_ids
  )
  
  -- use LISTEN/NOTIFY to publish a changed_row_ids event with a comma-separated list of changed row ids
  SELECT pg_notify( 'changed_row_ids', id_string_result.id_string ) FROM id_string_result
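
The consuming side is just a LISTEN on the same channel (the payload arrives asynchronously via your driver; note that NOTIFY payloads are limited to roughly 8kB):

  -- in any session/connection that wants to react to changes
  LISTEN changed_row_ids;

  -- the comma-separated id list sent by pg_notify above arrives as the payload
  -- of an asynchronous notification on this channel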

👤 BrentOzar
Microsoft SQL Server has a rarely-used feature to do exactly this: [Query Notifications][1].
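
The server-side prerequisites look roughly like this (names are made up); the subscription itself is registered from the client, e.g. via ADO.NET's SqlDependency, and notification-eligible queries have to follow fairly strict rules such as two-part table names and explicit column lists:

    -- Service Broker must be enabled on the database
    ALTER DATABASE MyAppDb SET ENABLE_BROKER;

    -- a notification-eligible query shape: explicit columns, two-part names, no SELECT *
    SELECT o.OrderId, o.Status, o.Total
    FROM dbo.Orders AS o
    WHERE o.Status = N'open';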

It sounded great in theory when it came out in 2005, but the drawback was that every open subscription was an open connection to the database server, from every application server. The more web servers you have, and the more queries you want to subscribe to, the more open connections you end up with, which at large scale can cause performance issues even when the connections aren't doing anything.

You can mitigate this by using fewer subscriptions to monitor more data - but when anything changes in the underlying result set, you end up fetching the entire result set. With a lot of web/app servers, this means an avalanche of queries running at exactly the same time, trying to hit the same data to refresh their cache.

In a perfect world, I'd try to mitigate this by offloading these kinds of queries to read-only replicas. That's a feature limited to the expensive Enterprise Edition ($7K USD per CPU core), and when you scale out to a lot of database servers to leverage this, then it becomes prohibitively expensive.

Another way to mitigate the many-app-servers, many-subscriptions problem is to have a central app server manage the caching for everyone else.

[1]: https://docs.microsoft.com/en-us/sql/connect/ado-net/sql/ena...


👤 thruflo22
Supabase’s realtime library consumes the logical WAL and provides a Phoenix channel + JavaScript API to subscribe to matching events from it: https://github.com/supabase/realtime

Under the hood, the core implementation was copied (with credit / attribution) from: https://github.com/cainophile/cainophile

I happened to do a similar thing but I adapted cainophile into an Elixir “OffBroadway” producer: https://github.com/integratedb/integrate/blob/main/lib/integ...

These approaches rely on acking the WAL to confirm data has been processed. It’s simpler than running Debezium / Kafka for “zookept” CDC. However, they are “at least once” at best and it’s easy to shoot yourself in the foot so think twice before relying on this kind of thing for a real application.

Materialize is nice — TAIL is a lovely abstraction and their data ingest uses Debezium under the hood. That said, I believe their Postgres binlog source is still alpha / under active community dev.


👤 theospears
Disclaimer: I work at Asana

We have an in house system (LunaDb) which is a little like this. There's a tech talk available about how it works at https://blog.asana.com/2015/10/asana-tech-talk-reactive-quer... - it's from a few years ago, but the core ideas are there. There's also some details on the caching layer we built for it at https://blog.asana.com/2020/09/worldstore-distributed-cachin...

A few properties based on your questions and the observations here:

- We don't attempt to incrementally update query results. Given the number of simultaneous queries the system handles, we've found it much more important to instead be very precise about only re-running exactly the right queries in response to data modification.

- We support joins (although not queries of arbitrary complexity). We avoid race conditions and cross-table locking issues by using the binlog as our source of changes, which imposes a linearization on the changes matching the state in the db. Correctly determining which queries to update for these requires going back to the database.

- Performance is an interesting problem. It's easy to arrange situations where total load is a function of "rate of data change" * "number of queries over that data", so being overly broad in triggering recalculations gets expensive fast.

We're actively hiring to work on this - if you are interested my contact details are in my hn profile.


👤 stuiterbal
Maybe https://hasura.io is something to have a look at? It's a GraphQL server that gives you realtime GraphQL APIs over Postgres/SQL.

I've been trying out https://nhost.io for some time now. They use Hasura and I'm impressed by how easy subscriptions are using Apollo to query the GraphQL API.


👤 sathorn
There is also the research DB Noria[0] that's based on this idea. It maintains materialized views for queries and efficiently updates them when the data changes.

[0] https://github.com/mit-pdos/noria


👤 brianr
I like this idea a lot. In a sense it's thinking of your application as a spreadsheet, where the database is a data tab and the frontend is the summary tabs. If there's a change to the data tab, the "spreadsheet engine" (or "materialized view engine" in your case) walks the dependency graph and updates all the relevant parts of the summary tabs.

In the Rollbar UI, we implemented a lot of this type of logic by hand to make our real-time aggregated Items list work (i.e. each time an Item has a new occurrence, we update its stats in the database and push the diff to connected clients). It would save an immense amount of code to have had a solution that did this out of the box.

The closest thing I'm aware of to this is BigQuery's materialized views ( https://cloud.google.com/bigquery/docs/materialized-views-in... ), which take care of making otherwise expensive queries cheap and fast, but they are rather limited (i.e. no subqueries or joins), and don't have the "streaming output of changes" you describe.


👤 hpeinar
Very interesting topic!

I feel like this would be the right time to ask for advice regarding doing something similar for user search results with PostgreSQL (v11).

Eg. User "subscribes" to product searches for "Women - Nike - Size M" and the system sends her a daily notification or email if there are new result within her filter.

How would one solve this kind of subscription logic? So far what I've come up with is just to save and re-run the user queries from the API side and keep the last result's primary ID in a separate table to know which results are to be considered new (so I can use a primary key filter on the query to reduce how much the DB has to go through, e.g. "SELECT id FROM products WHERE [user filters] AND id > 1234").
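
A rough sketch of what I mean, with made-up table and column names - the saved filter plus last-seen id lives in its own table and the daily check per search is just:

    -- one saved search, re-run daily; saved_searches holds the filter and the
    -- highest product id already reported for that search
    SELECT p.id
    FROM products p
    JOIN saved_searches s ON s.id = 42
    WHERE p.gender = s.gender AND p.brand = s.brand AND p.size = s.size
      AND p.id > s.last_seen_id
    ORDER BY p.id;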

It doesn't feel right to bombard the DB with 10k queries on a daily basis to achieve this, but maybe I'm overthinking it and it is viable.

I looked through some of the comments mentioning NOTIFY / LISTEN, but I don't think I could viably use this as I can't apply the separate per-user filters at the DB level.

Note: I'm looking for general ideas / things I should consider. I'm not expecting anyone to do my work.


👤 kissgyorgy
It's not as simple as "subscribe to this SQL query", but you can do this relatively easily with PostgreSQL LISTEN / NOTIFY: https://gist.github.com/kissgyorgy/beccba1291de962702ea9c237...

You just notify the clients with the primary keys of what changed, so you only need to run the query if something changed. If you implement it correctly, you only have to run the query once and push the same result to every client.

I have a small project where I used this, and whenever you change anything and send the NOTIFY, every client updates in real time.
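
A minimal version of that pattern (table and channel names are illustrative); clients then LISTEN on the channel and re-run the query for the received keys:

    -- trigger function that publishes the changed row's primary key
    CREATE OR REPLACE FUNCTION notify_doc_changed() RETURNS trigger AS $$
    BEGIN
      PERFORM pg_notify('doc_changed', NEW.id::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- fire it after every insert or update on the watched table
    CREATE TRIGGER doc_changed_trigger
    AFTER INSERT OR UPDATE ON doc
    FOR EACH ROW EXECUTE FUNCTION notify_doc_changed();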


👤 devit
In PostgreSQL you can either add yourself as a replica with logical decoding (e.g. PostGraphile supports GraphQL with live queries that way) or do it manually with LISTEN/NOTIFY.

On top of that you can use transaction ids or manual transaction serials to get only new rows (but of course note that transactions can be long-running, so you need to either accept potentially unbounded repeated data, or serialize all write transactions on a single serial assignment single-row table, or record the whole set of committed transactions whose data you have processed rather than a single serial number).
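
For the logical-decoding route, the built-in SQL-level functions are enough to experiment with (the slot name is illustrative; the server must run with wal_level = logical):

    -- create a logical replication slot using the built-in test_decoding output plugin
    SELECT pg_create_logical_replication_slot('my_slot', 'test_decoding');

    -- peek at pending changes without consuming them
    SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);

    -- consume (and acknowledge) the changes
    SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);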


👤 mtrovo
> I guess it's relatively easy to do

This is a very hard problem to do right and would probably need some changes to the RDBMS itself. You would need to monitor all tables that might affect your query for changes, and work out how those changes affect your query (say you're just reading a value, aggregating with a sum, taking an average with a count of rows - the list goes on). Add more complexity on top of that if you want to support querying from other views that also aggregate the data in your query.

As pointed out in another comment, there's DB Noria [0], but I'm not sure how production-ready it is right now. You can get an idea of the complexity of the task from an interview with one of the project leads [1].

[0] https://github.com/mit-pdos/noria [1] https://corecursive.com/030-rethinking-databases-with-jon-gj...


👤 justsomeuser
If you have spare CPU capacity, you might want to just re-run and diff the query results. You are already paying for the idle CPU anyway.
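
A crude way to get the diff in SQL itself (the table and query here are made up; the previous run's result is assumed to be kept in result_prev):

    -- re-run the query into a fresh snapshot
    CREATE TABLE result_new AS
      SELECT user_id, SUM(price) AS total
      FROM orders
      GROUP BY user_id;

    -- rows that appeared or changed since the last run
    SELECT * FROM result_new EXCEPT SELECT * FROM result_prev;

    -- rows that disappeared or changed since the last run
    SELECT * FROM result_prev EXCEPT SELECT * FROM result_new;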

I think the issue is that most SQL engines are not designed for streams, just a single request/response. Once the query is complicated enough I think it would be difficult to do efficiently and correctly.

I would go with SQLite in WAL mode (concurrent reads), many cores, and many threads/green threads in a compiled language.


👤 joel_dice
I've implemented an RDBMS that supports this [1]. It handles joins, views (which are automatically materialized and incrementally updated), etc. It's memory-only, and it doesn't support exotic stuff like recursive CTEs, but it does exactly what you're asking for. We used it in production successfully for frequently-updated real-time data at the company where I used to work.

Notably, it uses persistent search trees such that each revision shares structure with the previous one, which makes diffing two closely-related revisions extremely efficient (just skip over any shared structure). Subscribers just receive a stream of diffs, with backpressure handled automatically by skipping over intermediate revisions. See [2] for a more detailed summary.

It also exposes revisions as first-class objects, which allows you to tag, diff, and even three-way merge them. Specifically, you can run arbitrary queries on both revisions and diffs. See [3] for examples.

It's no longer maintained, unfortunately. Someday I may revive it, perhaps adding support for spilling data that won't fit in memory to log-structured merge trees. I'd also rewrite it in a language like Rust, which will help flatten some of the more pointer-heavy data structures and reduce tail latencies. If anyone is interested in seeing that happen or helping out, let me know.

I'm really surprised this still isn't supported in mainstream DBMSes. The MVCC model in PostgreSQL seems particularly well suited to it.

[1]: https://github.com/ReadyTalk/revori

[2]: https://github.com/ReadyTalk/revori/wiki/Design-and-Implemen...

[3]: https://github.com/ReadyTalk/revori/wiki/CLI-Revori-Client


👤 pachico
It is a feature of ClickHouse, although an experimental one: https://clickhouse.tech/docs/en/sql-reference/statements/cre...
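
A minimal sketch of that experimental feature (names are illustrative):

    -- live views are behind an experimental flag
    SET allow_experimental_live_view = 1;

    CREATE LIVE VIEW order_totals_lv AS
      SELECT user_id, sum(price) AS total
      FROM orders
      GROUP BY user_id;

    -- blocks and streams a new version of the result whenever the underlying data changes
    WATCH order_totals_lv;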

👤 _nhh
The first thing that comes to mind is LISTEN/NOTIFY from Postgres.

You can both LISTEN and NOTIFY from code.

https://www.postgresql.org/docs/current/sql-listen.html

The js adapter for pg can handle this: https://medium.com/@simon.white/postgres-publish-subscribe-w...


👤 BenoitP
I think streaming-first systems like Flink are best positioned to answer your need.

You might need to abstract the query or subscription infrastructure into your data model though:

Subscriptions:

    INSERT INTO user_orders_total_feed(user_id, data)
    SELECT 
      uo.user_id,
      SUM(o.price) AS total
    FROM user_order uo
    JOIN order o ON uo.user_id = o.user_id
    JOIN user_live_subscription uls ON uo.user_id = uls.user_id
    WHERE uls.subscription_topic = 'orders_total'
    GROUP BY uo.user_id
And for queries you would join on a temporal snapshot using 'FOR SYSTEM_TIME AS OF' [1].
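
For instance, a sketch in Flink SQL (assuming user_order carries an event-time column order_time and user_live_subscription is a versioned table):

    -- enrich each order with the subscription row as of the order's event time
    SELECT o.user_id, o.price
    FROM user_order AS o
    JOIN user_live_subscription FOR SYSTEM_TIME AS OF o.order_time AS uls
      ON o.user_id = uls.user_id
    WHERE uls.subscription_topic = 'orders_total';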

You could have your data fully stored in Kafka or Pulsar, Kappa-architecture-style; or you could have it in a regular RDBMS but with Change Data Capture on top [2] (basically shipping your WAL to Kafka/Pulsar).

The good thing about Kafka and Flink is that they scale. Your regular db might not sustain the query rate if every query needs a write to a table, but Kafka/Flink will.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...

[2] https://debezium.io/


👤 lsiebert
Here's my thoughts. Not so much for you as for myself because it's an interesting question.

I once did a very hacky version of something like this with MySQL, where I was batch-updating ranking data for a leaderboard using internal variables based on a Stack Overflow answer, when I was a junior. Nobody should do that lol.

In fact, I'd avoid MySQL for this if at all possible.

It seems like this is fairly easy to do in Postgres, and a pain in the neck in MySQL.

In Postgres I'd use a materialized view with a unique index and refresh it CONCURRENTLY, which compares the view with the underlying data and only changes what's needed. It just doesn't do this as a stream; you have to request updates. You can presumably trigger the refresh internally using TRIGGERs.
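
Roughly (names are made up):

    CREATE MATERIALIZED VIEW order_totals AS
      SELECT user_id, SUM(price) AS total
      FROM orders
      GROUP BY user_id;

    -- the unique index is what makes CONCURRENTLY possible
    CREATE UNIQUE INDEX ON order_totals (user_id);

    -- recomputes the view and applies only the delta, without locking readers out
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;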

I mean, you can do your own stream with Postgres as well (using NOTIFY and LISTEN to subscribe to notifications created with AFTER CUD TRIGGERs), or even in MySQL in a hacky way, and refresh it on the back end or in the browser, but it's likely more efficient to let the db handle batch updating itself rather than handrolling your own updates. Probably easier to upgrade to incremental views if and when those are added.

I guess the question is, what to do when either of the above isn't fast enough? My thought would be to switch between two identical materialized views with one always updating (not using concurrently). Alternatively, use an external cache to do something similar. Then multiple systems can query data, with refreshes being triggered by one of the selects, which switches the view to read from after updating, or which updates the cache atomically.


👤 gsvclass
In Go you can use GraphJin, an automatic GraphQL-to-SQL compiler that supports GraphQL subscriptions (aka Live Queries). GraphJin works either as a standalone service or as a library in your own code. It uses a very efficient polling mechanism to manage subscriptions, keeping db load low while subscribing to an SQL query.

https://pkg.go.dev/github.com/dosco/graphjin@v0.16.44/core#e...

https://github.com/dosco/graphjin


👤 kopos
We used flexviews for MySQL. It would use the MySQL bin log to read through the actual DML queries (UPDATE, DELETE, INSERT), parse the query and then make changes to the materialised views based on the dependency tree defined by the SQL queries.

👤 Fire-Dragon-DoL
This problem can't really be solved without application participation, because to keep the data up to date across joins you need cross-table locking at some point, which would become highly inefficient. This is the whole CQRS thing, by the way.

👤 jgraettinger1
Can you speak more to how updates get pushed out to your UI clients? Are you using Firestore? Hasura? Polling?

Something we're looking at enabling w/ Estuary Flow [1] is materializing to Firestore for this use case (and Flow already supports PostgreSQL materializations).

Flow would capture changes from your database tables (or you write normalized events directly to Flow collections). You aggregate and join as needed, and then materialize your views into Firestore (or another Postgres table) where you'd subscribe for live updates.

[1]: https://github.com/estuary/flow


👤 ikawe
A SQLite database observation system in Swift: https://github.com/groue/GRDB.swift#valueobservation

I don't believe it includes any efficiency magic to know "which rows" are affected. Even knowing what tables are affected required some... creative coding. A little insight into how it's implemented: https://www.mail-archive.com/sqlite-users@mailinglists.sqlit...


👤 neximo64
Try Supabase out. It's basically a Postgres plugin, and you can subscribe to tables' live changes using WebSockets.

If you have a query, you could reorient around the tables' changes and manually check whether those changes affect the output incrementally.


👤 okennedy
We haven't updated it in a while, but check out DBToaster [1]. Give it some SQL queries and a set of tables, and it'll compile for you a Scala or C++ class with insert() and delete() functions for every table listed, as well as functions to get the (incrementally maintained) result of any query. Supports SQL92, with some decent extensibility for UDFs, etc., and has quite a bit of theory backing it.

[1] https://dbtoaster.github.io/


👤 lgas
PipelineDB might be of interest. https://github.com/pipelinedb/pipelinedb

👤 numtel
A few years ago, I made node.js packages to do this for MySQL (by reading changes from the replication log) and Postgres (using triggers). Both require specifying cache invalidation functions for each dependent table.

https://github.com/numtel/mysql-live-select

https://github.com/numtel/pg-live-select


👤 avinassh
> I'm more interested in queries with joins and doing it efficiently, instead of just tracking updates to tables that are modified and re-running the entire query.

can you give an example?


👤 slifin
I think most of the time I would use SQL subscriptions to notify web clients that they are stale, but not auto-update the client's state unless the form is read-only; otherwise current client state could be destroyed.

React server components can do intelligent merges, in a client action or on tick; I guess it would just mean hooking into that system to prevent automatic merging in all cases, in favour of manual merges client-side sometimes.


👤 TristanBall
Informix has its "push data" feature - except it appears to be limited to a single table, not complex queries - I suspect you're right, it's hard to do safely and efficiently.

https://www.ibm.com/docs/en/informix-servers/14.10/14.10?top...


👤 tobilg
Debezium is used for Change Data Capture and creates database change event streams: https://debezium.io/documentation/reference/1.5/connectors/i...

It's not exactly what you're searching for, but might be worth a look.


👤 daniel_iversen
I'm not 100% sure but might you be able to use some sort of Event Stream Processing[1] product on top of your database (if you need the database) and use that layer in the architecture instead of SQL and the DB?

[1]: https://en.wikipedia.org/wiki/Event_stream_processing


👤 edoceo
if you're looking at pg materialized views, look at listen/notify too.

So the view builder LISTENs and write queries NOTIFY, then re-do the m-view?

or just use Notify with good payload after each write you think is important? (I use this sometimes)

I mean, a trigger to push could work, yea?

or, some clever rig with pg-logical replication (I'd have to think on this one more, may be crap)


👤 bleonard
It's certainly not production-ready, but I quickly made this [1] a while back to help with "watching" certain queries for debugging. It uses a polling approach.

[1] https://github.com/grouparoo/db_watch


👤 caseymarquis
It's less turnkey, but Postgres does support Pub/Sub. If you're willing to follow your updates with a pub message, you can then manually refresh your data elsewhere in response. I use this feature to horizontally scale websocket updates across multiple servers.

👤 mjdrogalis
ksqlDB (http://ksqldb.io/) does exactly this out of the box. You write SQL queries whose results you can subscribe to with a client connection. Today these subscriptions are somewhat expensive, but we're actively working on making them lightweight and scalable. I'm really looking forward to seeing the end result, I think it opens up whole new use cases.

👤 1337shadow
We do this in Ryzom, but we had to build it from the ground up, and implemented a meteorjs-like protocol.

👤 raguessner
Disclaimer: I work at EsperTech

Esper provides streaming SQL that gets compiled to JVM or CLR bytecode.


👤 quickthrower2
In SQL Server you could create an indexed view for a simple query, then add triggers.
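
A minimal sketch of the indexed-view part (names are made up); grouped indexed views require COUNT_BIG(*), and SUM can't be over a nullable expression, hence the ISNULL:

    -- schema-bound view over the base table
    CREATE VIEW dbo.OrderTotals WITH SCHEMABINDING AS
      SELECT CustomerId,
             SUM(ISNULL(Total, 0)) AS TotalSpend,
             COUNT_BIG(*)          AS RowCnt
      FROM dbo.Orders
      GROUP BY CustomerId;
    GO

    -- the unique clustered index is what materializes the view
    CREATE UNIQUE CLUSTERED INDEX IX_OrderTotals ON dbo.OrderTotals (CustomerId);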

👤 raguessner
Disclaimer: I work at EsperTech

Esper is open source and provides streaming SQL.


👤 andyxor
Check out Kafka Streams; they have streaming SQL as well: https://www.confluent.io/online-talks/intro-to-ksql-streamin...

👤 PanosJee
ksqlDB

👤 100011_100001
It's not clear to me if this would solve your problem, because it kind of depends on how you are planning on reacting, but GitHub and GitLab have the concept of a Code Owner, and it can work for specific files. So if you know that the queries you want to monitor live in tidy DB impl classes or maybe a specific repository, you could be the code owner of those files.

This approach is pro-active because you can block merges that aren't as well thought out, and you can be the final arbiter of when something gets merged.

I feel like it's kind of an outside-the-box way of doing what you want. It doesn't give you any visual tools or anything that makes the relationships clearer to see, but it does allow you to get closer to the code that affects what you care about.

Also, since I do DevOps: if you are competent or your DevOps is competent (aka they are not just glorified Ops), you can create performance tests that either trigger on merges or run like a cron job. They can even trigger daily or hourly depending on code change frequency. Then you can see trends, for example how certain inserts, deletes, etc. change over time; if you have a semi-competent metrics-driven architecture you could even see how things change over time graphically.