HACKER Q&A
📣 Kkoala

How to store billions of rows of time-series data?


I'm trying to build a system to store price data. According to my calculations, I'm going to be inserting about 20 million rows every day. This will quickly add up to billions of rows.
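Rough back-of-envelope behind that estimate (the ~50 bytes per row is just a guess on my part, not a measured figure):

    # Scale sketch: 20M rows/day at an assumed ~50 bytes per raw row.
    rows_per_day = 20_000_000
    bytes_per_row = 50  # assumption: timestamp + product id + price + a few attributes

    rows_per_year = rows_per_day * 365              # ~7.3 billion rows/year
    raw_bytes_per_year = rows_per_year * bytes_per_row

    print(f"{rows_per_year / 1e9:.1f}B rows/year, "
          f"~{raw_bytes_per_year / 1e12:.2f} TB/year uncompressed")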

The queries will be very ad-hoc, mostly things like histograms and grouping by certain attributes. It's also possible that some of the products will rarely or never be queried; I'm not sure if that makes a difference. Read performance is more important than write performance.

Is something like TimescaleDB a good choice?

I would also like to not spend a fortune on this (a couple hundred dollars a month at most). So is this even possible within that cost constraint?


  👤 valyala Accepted Answer ✓
Take a look at VictoriaMetrics. It supports an OHLC function (rollup_candlestick) via MetricsQL [1], which is backwards compatible with PromQL, the query language used in Prometheus. PromQL is optimized for typical queries over time-series data [2].

VictoriaMetrics works out of the box without the need to set up third-party systems or write complex configs. It is also very fast for both data ingestion and querying [3], [4].
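Not from the links below, just a rough sketch of what querying it can look like: VictoriaMetrics exposes a Prometheus-compatible HTTP API, so daily candlesticks can be pulled with a few lines of Python. The metric name "price", the label filter, and the port are assumptions for illustration.

    # Sketch: daily OHLC candles via the Prometheus-compatible /api/v1/query_range
    # endpoint of a single-node VictoriaMetrics (default HTTP port 8428 assumed).
    import time
    import requests

    now = int(time.time())
    resp = requests.get("http://localhost:8428/api/v1/query_range", params={
        "query": 'rollup_candlestick(price{symbol="ACME"}[1d])',  # MetricsQL rollup
        "start": now - 30 * 86400,  # last 30 days
        "end": now,
        "step": "1d",
    })
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(series["metric"], series["values"][:3])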

[1] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/Metr...

[2] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

[3] https://medium.com/@valyala/measuring-vertical-scalability-f...

[4] https://medium.com/@valyala/billy-how-victoriametrics-deals-...


👤 nknealk
We use Kinesis Firehose and Athena as a SQL API for customer-interaction analytics on our digital properties (among other things). Data sizes are upwards of 10M rows per day. Athena will blow through months of data pretty quickly at very reasonable cost.

Some gotchas: AWS lock-in, especially since S3 is your storage back end. Kinesis Firehose writes files by insert date, and you may need ETLs to partition by some other time dimension inherent to the data. This gets tricky if records come in really late, and you may have to decide to drop records older than N days to make such an ETL work at reasonable cost.

Some pros: easy to POC. Pure variable-cost architecture. Serverless (also a downside, because if any of the services go down, so do you) with little administrative work. It can get expensive if you don't think through partitioning for analytics, but Kinesis Firehose can be configured to write Parquet to S3 natively, allowing you to use existing Spark/Hive/whatever if your org has that as its SQL layer.
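For reference, a minimal sketch of what one of those ad-hoc queries looks like through the Athena API with boto3; the database, table, column names, and results bucket are placeholders, not anything from our setup.

    # Sketch: run an ad-hoc aggregation over Firehose-written Parquet via Athena.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    qid = athena.start_query_execution(
        QueryString="""
            SELECT product_id,
                   approx_percentile(price, 0.5) AS median_price,
                   count(*)                      AS n
            FROM prices
            WHERE dt BETWEEN '2019-06-01' AND '2019-06-30'  -- dt = partition column
            GROUP BY product_id
        """,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes, then fetch the first page of results.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
        print(rows[:5])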

Edit: come to think of it, there is another division in my company using BigQuery with whatever Google's streaming ingestion product is. They're achieving similar results at similar cost. I haven't personally done development in GCP, though, so I can't comment further.


👤 gwittel
A time-series DB may work fine. 20 million rows per day isn't bad, unless you expect huge spikes over a short duration.

What do your rows look like? What time ranges do you expect to query? Does your data arrive as a stream or in large batches? This will help determine the rough I/O and storage range you need (or at least the worst case).

At this scale, and aiming for low cost, you have some trade-offs that could help reduce costs and improve performance. I would try to keep it simple and test something like Timescale with dummy data first. Chunks of Parquet or ORC data plus Presto/Athena may also work, depending on your workload.

That aside, other things may make it even easier for you. Can you reduce the problem at all (e.g. reduce the number of rows you need to store)?

For example:

What is the minimum time period granularity you need for a single datapoint? Lower res = fewer rows.

Can you aggregate upon ingest? (See the sketch right after this list.)

For distributions, can you use approximations (e.g. t-digest), or do you need exact values?
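To make the "aggregate upon ingest" point concrete, here is a rough sketch of collapsing raw ticks into one-minute OHLC bars before they ever hit the database; the field names (ts, product, price) are made up for illustration:

    # Sketch: turn raw price ticks into 1-minute OHLC bars at ingest time,
    # so the store holds one row per product per minute instead of per tick.
    def minute_bars(ticks):
        """ticks: time-ordered iterable of dicts with 'ts' (unix seconds),
        'product', and 'price'. Returns {(product, minute): bar} with OHLC fields."""
        bars = {}
        for t in ticks:
            key = (t["product"], t["ts"] // 60 * 60)  # truncate timestamp to the minute
            bar = bars.get(key)
            if bar is None:
                bars[key] = {"open": t["price"], "high": t["price"],
                             "low": t["price"], "close": t["price"], "count": 1}
            else:
                bar["high"] = max(bar["high"], t["price"])
                bar["low"] = min(bar["low"], t["price"])
                bar["close"] = t["price"]  # assumes ticks arrive in time order
                bar["count"] += 1
        return bars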


👤 seektable
Yandex ClickHouse is one more free/open-source solution that can be considered for this purpose. Setup/maintenance may be a bit tricky, but you'll be surprised how fast CH executes ad-hoc queries (SQL!) over billions of rows.
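A quick sketch of the kind of ad-hoc query that stays fast in ClickHouse, sent here through the clickhouse-driver Python package; the table and column names are invented for illustration.

    # Sketch: bucketed price histogram over the last 30 days in ClickHouse.
    from clickhouse_driver import Client

    client = Client(host="localhost")
    rows = client.execute("""
        SELECT floor(price / 10) * 10 AS price_bucket,
               count()                AS n
        FROM prices
        WHERE toDate(ts) >= today() - 30
        GROUP BY price_bucket
        ORDER BY price_bucket
    """)
    print(rows[:10])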

👤 mfreed
(Timescale person here.)

Many of our users store many billions of rows and ingest tens of millions per day. Native compression will give you significant storage savings, and there are very flexible indexing options to support various queries. And if some things are commonly queried, it's easy to build automated continuous rollups (e.g., continuously rolling up to OHLCV candlesticks).
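For concreteness, a sketch of what that can look like in SQL, run here through psycopg2; the table layout, bucket size, and names are assumptions for illustration, and the syntax follows recent TimescaleDB versions.

    # Sketch: a hypertable for raw ticks plus a continuous aggregate that rolls
    # them up into daily OHLCV candles. Names and columns are illustrative only.
    import psycopg2

    conn = psycopg2.connect("dbname=prices user=postgres")
    conn.autocommit = True  # continuous aggregates can't be created inside a transaction
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE ticks (
            time   TIMESTAMPTZ NOT NULL,
            symbol TEXT        NOT NULL,
            price  DOUBLE PRECISION,
            volume DOUBLE PRECISION
        )
    """)
    cur.execute("SELECT create_hypertable('ticks', 'time')")

    cur.execute("""
        CREATE MATERIALIZED VIEW daily_candles
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 day', time) AS day,
               symbol,
               first(price, time) AS open,
               max(price)         AS high,
               min(price)         AS low,
               last(price, time)  AS close,
               sum(volume)        AS volume
        FROM ticks
        GROUP BY time_bucket('1 day', time), symbol
    """)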

You can deploy either self-managed or on Timescale Cloud (timescale.com/cloud), where you can find deployments within a few hundred dollars per month.

Cheers!


👤 GauntletWizard
Yes, it sounds like what you want is a time-series DB. Prometheus can handle tens of millions of rows and billions of datapoints with ease, and its query language is designed for that kind of grouping via set arithmetic.

👤 icsa
Try the 32-bit evaluation version of kdb+ (https://kx.com/). Twenty million rows of data per day is easy with kdb+ - even on a laptop.

👤 sman393
Depending on what your reporting needs are, I would definitely recommend TimescaleDB. I love Prometheus for storing server resource metrics or basic app metrics (counts/timers), but for more business-facing data, having the query functionality and visualization tooling of Postgres is a major win. I am working on a similar use case to yours; we considered Cassandra, Timescale, and ClickHouse, and with our insert rates and reporting needs, we decided to start with TimescaleDB.

👤 runT1ME
My current gig is using Scylla for storing time-series data; we're partitioning by day and ingesting anywhere from 5-15 million rows per day.
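Roughly, the day-partitioned layout looks like the sketch below, created here via the DataStax Python driver (which also talks to Scylla); the keyspace, table, and column names are invented, not our actual schema.

    # Sketch: a per-product, per-day partitioned time-series table in Scylla/Cassandra.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS market
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS market.prices (
            product_id text,
            day        date,       -- (product_id, day) keeps partitions bounded in size
            ts         timestamp,
            price      double,
            PRIMARY KEY ((product_id, day), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)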

We've mostly been using Apache Beam/Cloud Dataflow for working with the data en masse for occasional reports, and we're planning on trying Presto soon as well. I'm interested in what solution you come up with; happy to chat more about this kind of thing on Twitter or email.


👤 jll29
Google BigQuery is your friend.