HACKER Q&A
📣 kordlessagain

How do you do real-time updates and deletes on massive datasets?


I’m building something that needs to do very fast updates on multi-TB datasets. I’m wondering how others are solving this.


  👤 nivertech Accepted Answer ✓
There are 4 solutions I can think of:

1. Lambda architecture - split the data into a Speed/Realtime layer and a Batch/Historical layer. You stream updates into the realtime layer and periodically (e.g. each night) merge them into the historical layer. Each query then needs to merge data from both layers (or you can pre-compute the results if the queries are predefined/fixed) - see the sketch after this list.

https://en.wikipedia.org/wiki/Lambda_architecture

kdb+/q was one of the first databases to use this scheme: it has an RDB (Realtime, in-memory DB) and an HDB (Historical DB, partitioned on disk), and in some configurations even an IDB (Intraday DB) on disk for cases where the RDB would otherwise spill over while holding the day's data. Each query is processed in parallel against all of them and the results are then merged with a UDF.

https://en.wikipedia.org/wiki/Kdb%2B

2. Hot + cold storage (very similar to the Lambda architecture, but not exactly the same).

3. Use an OLAP DB/service that already does realtime data ingestion for you, e.g. Druid, kdb+/q, AWS Kinesis+Athena, GCP Pub/Sub ingesting directly into BigQuery, etc.

4. Use Hadoop M/R, Spark, Flink, Apache Beam/GCP Dataflow, etc. to build pre-computed results (i.e. the Lambda architecture without the serving layer).
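
To make the query-time merge in option 1 concrete, here is a minimal Python sketch. It assumes a toy setup: the speed layer is an in-memory map of upserts plus a set of delete tombstones, and query_batch_layer is a stand-in for whatever scans the historical partitions; the names are illustrative, not any particular product's API.

    from dataclasses import dataclass, field
    from typing import Dict, Optional, Set

    @dataclass
    class RealtimeLayer:
        """Speed layer: upserts and deletes accumulated since the last batch merge."""
        upserts: Dict[str, dict] = field(default_factory=dict)
        deletes: Set[str] = field(default_factory=set)

        def upsert(self, record_id: str, record: dict) -> None:
            self.deletes.discard(record_id)
            self.upserts[record_id] = record

        def delete(self, record_id: str) -> None:
            self.upserts.pop(record_id, None)
            self.deletes.add(record_id)

    def query_batch_layer(record_ids: list) -> Dict[str, dict]:
        # Stand-in for the historical layer (e.g. partitioned files scanned on disk).
        historical = {"a": {"v": 1}, "b": {"v": 2}}
        return {rid: historical[rid] for rid in record_ids if rid in historical}

    def query(record_ids: list, speed: RealtimeLayer) -> Dict[str, Optional[dict]]:
        # Query-time merge: realtime upserts override batch rows, deletes mask them.
        merged = query_batch_layer(record_ids)
        for rid in record_ids:
            if rid in speed.deletes:
                merged.pop(rid, None)             # deleted since the last nightly merge
            elif rid in speed.upserts:
                merged[rid] = speed.upserts[rid]  # fresher value from the speed layer
        return merged

    speed = RealtimeLayer()
    speed.upsert("b", {"v": 20})     # recent update not yet merged into the historical layer
    speed.delete("a")                # recent delete
    print(query(["a", "b"], speed))  # {'b': {'v': 20}}

The same shape handles deletes on a massive dataset without rewriting it in real time: deletes become tombstones that mask historical rows until the periodic merge rewrites the affected partitions.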


👤 dossy
map/reduce
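
One reading of this, in the spirit of the batch options above: treat the base dataset plus a change log of updates/deletes as the map input, shuffle by key, and reduce by keeping only the latest version of each key (dropping keys whose latest operation is a delete). A toy single-process sketch; the key/timestamp/op layout is an assumption for illustration:

    from itertools import groupby
    from operator import itemgetter

    # Toy input: (key, timestamp, op, value). In a real job these records would come
    # from the base dataset plus a change log, partitioned across many workers.
    records = [
        ("user:1", 1, "upsert", {"name": "Ada"}),
        ("user:2", 1, "upsert", {"name": "Bob"}),
        ("user:1", 5, "upsert", {"name": "Ada Lovelace"}),
        ("user:2", 7, "delete", None),
    ]

    # Map: emit (key, (timestamp, op, value)) pairs.
    mapped = [(k, (ts, op, val)) for k, ts, op, val in records]

    # Shuffle: group by key (a real framework does this across the cluster).
    mapped.sort(key=itemgetter(0))

    # Reduce: keep only the latest operation per key; drop keys whose latest op is a delete.
    compacted = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        ts, op, val = max((v for _, v in group), key=itemgetter(0))
        if op != "delete":
            compacted[key] = val

    print(compacted)  # {'user:1': {'name': 'Ada Lovelace'}}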