Long story short, we are a data analytics company from the S23 batch, currently storing >20 billion rows of data, and it's growing exponentially.
There are a lot of custom properties we have to store as JSON (over 100 KB per row) and many mutable data points (we do daily syncs with other platforms). If you've ever worked with ClickHouse, you'll know we are pushing it to its unintended limits here :D
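To make the "mutable data points" part concrete, here's roughly the pattern involved. This is an illustrative sketch only (made-up table and column names, not our actual schema, and assuming the clickhouse_connect Python client): instead of mutating rows in place, each daily sync inserts a new version and ReplacingMergeTree deduplicates at merge/read time.

    import json
    from datetime import datetime, timezone

    import clickhouse_connect  # official ClickHouse Python client

    client = clickhouse_connect.get_client(host="localhost")  # hypothetical host

    # Illustrative schema: the big custom-properties JSON stored as a String,
    # plus a version column so a daily sync can insert a newer row instead of
    # running an ALTER TABLE ... UPDATE mutation. ReplacingMergeTree keeps the
    # row with the latest synced_at per (customer_id, entity_id) after merges.
    client.command("""
    CREATE TABLE IF NOT EXISTS crm_objects
    (
        customer_id UInt64,
        entity_id   UInt64,
        properties  String,   -- the >100 KB custom-properties JSON
        synced_at   DateTime
    )
    ENGINE = ReplacingMergeTree(synced_at)
    ORDER BY (customer_id, entity_id)
    """)

    # Daily sync: just insert the fresh version of each object.
    client.insert(
        "crm_objects",
        [[42, 1001, json.dumps({"stage": "closed_won", "arr": 120000}),
          datetime.now(timezone.utc)]],
        column_names=["customer_id", "entity_id", "properties", "synced_at"],
    )

    # Reads that need only the latest version: FINAL forces dedup at query time
    # (argMax over synced_at is the cheaper alternative at our row counts).
    rows = client.query(
        "SELECT properties FROM crm_objects FINAL "
        "WHERE customer_id = 42 AND entity_id = 1001"
    ).result_rows
    print(rows)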
We just landed three of our biggest customers last month, and the trend is that it's going to keep growing from here on out. We have 8 ClickHouse servers with 128 GB of memory each, and it still doesn't feel like enough.
We are looking for some OG ClickHouse experts to give us advice on how to scale this setup to handle 100x the load. I'm sure we'll hit that scale by next year.
You can reach me at arda [at] hockeystack [dot] com if you are interested in getting your hands dirty with a big production system within 30 minutes. Even better if you are in SF, I'll buy you a coffee :)
Maybe pay them as a consultant? :)
Would be useful to indicate what your scaling constraint is: if you aren't cost-bound, you can always run 800 servers to scale 100x. And if you are cost-bound, it would make sense to quantify the ops/cost metric you're designing against.
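For what it's worth, the back-of-envelope I mean looks something like this (every number below is invented, purely to show the shape of the metric):

    # Toy version of "quantify the ops/cost metric you design against".
    # All numbers are invented for illustration, not HockeyStack's actual
    # load or hardware pricing.
    servers = 8
    dollars_per_server_month = 800       # hypothetical price of a 128 GB box
    queries_per_second = 500             # hypothetical current query load

    seconds_per_month = 30 * 24 * 3600
    monthly_cost = servers * dollars_per_server_month
    queries_per_month = queries_per_second * seconds_per_month

    cost_per_million_queries = monthly_cost / (queries_per_month / 1e6)
    print(f"~${cost_per_million_queries:.2f} per million queries today")

    # If you are cost-bound, this is the number to hold (or improve) while
    # load grows 100x; if you aren't, you can simply add servers.
    print(f"100x load at the same efficiency: ~${monthly_cost * 100:,.0f}/month")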
Why do you need ClickHouse? Why do you need to store 20 billion rows of data? And even if you do need to store that much, why isn't something like Postgres sufficient?
If you're doing revenue attribution based on things like website visits or activities, and only have hundreds of customers, then I imagine at worst you are processing a thousand events every few seconds? Mostly writes. Seems like you could get a long way by buffering events and bulk-inserting them. I still think Postgres is viable here. If you need to scale horizontally, you could shard by customer ID too. And if you need to run analytics on this data, Postgres should still hold up as long as you create the right indices.
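Rough sketch of what I mean by buffering + bulk inserting (hypothetical table and column names, using psycopg2; adjust to your own schema):

    import json
    from datetime import datetime, timezone

    import psycopg2
    from psycopg2.extras import execute_values, Json

    # Hypothetical DSN and table, not the actual schema. Assumes something like:
    #   CREATE TABLE events (customer_id bigint, event_type text,
    #                        properties jsonb, occurred_at timestamptz);
    # plus an index on (customer_id, occurred_at) for the analytics queries.
    conn = psycopg2.connect("dbname=analytics user=app")

    BATCH_SIZE = 1000   # flush roughly every "thousand events every few seconds"
    _buffer = []

    def record_event(customer_id, event_type, properties, occurred_at=None):
        """Buffer an event in memory instead of issuing one INSERT per event."""
        _buffer.append((
            customer_id,
            event_type,
            Json(properties),                    # custom properties go into jsonb
            occurred_at or datetime.now(timezone.utc),
        ))
        if len(_buffer) >= BATCH_SIZE:
            flush()

    def flush():
        """Bulk-insert everything buffered so far in a single round trip."""
        if not _buffer:
            return
        with conn, conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO events (customer_id, event_type, properties, occurred_at)"
                " VALUES %s",
                _buffer,
                page_size=BATCH_SIZE,
            )
        _buffer.clear()

    # Example: one page-view event, then a final flush on shutdown.
    record_event(42, "page_view", {"path": "/pricing", "utm_source": "linkedin"})
    flush()

Sharding by customer ID later would basically just mean picking the shard/connection from customer_id before the insert, since no query ever needs to join across customers.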