Long story short, we are a data analytics company from the S23 batch, currently storing >20 billion rows of data, and it's growing exponentially.
There are a lot of custom properties we have to store as JSON (over 100 KB per row) and many mutable data points (we do daily syncs with other platforms). If you've ever worked with ClickHouse, you'll know we are pushing it to its unintended limits here :D
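To make the "mutable data points" part concrete, here's roughly the pattern involved. This is an illustrative sketch only (made-up table and column names, not our actual schema, and assuming the clickhouse_connect Python client): instead of mutating rows in place, each daily sync inserts a new version and ReplacingMergeTree deduplicates at merge/read time.

    import json
    from datetime import datetime, timezone

    import clickhouse_connect  # official ClickHouse Python client

    client = clickhouse_connect.get_client(host="localhost")  # hypothetical host

    # Illustrative schema: the big custom-properties JSON stored as a String,
    # plus a version column so a daily sync can insert a newer row instead of
    # running an ALTER TABLE ... UPDATE mutation. ReplacingMergeTree keeps the
    # row with the latest synced_at per (customer_id, entity_id) after merges.
    client.command("""
    CREATE TABLE IF NOT EXISTS crm_objects
    (
        customer_id UInt64,
        entity_id   UInt64,
        properties  String,   -- the >100 KB custom-properties JSON
        synced_at   DateTime
    )
    ENGINE = ReplacingMergeTree(synced_at)
    ORDER BY (customer_id, entity_id)
    """)

    # Daily sync: just insert the fresh version of each object.
    client.insert(
        "crm_objects",
        [[42, 1001, json.dumps({"stage": "closed_won", "arr": 120000}),
          datetime.now(timezone.utc)]],
        column_names=["customer_id", "entity_id", "properties", "synced_at"],
    )

    # Reads that need only the latest version: FINAL forces dedup at query time
    # (argMax over synced_at is the cheaper alternative at our row counts).
    rows = client.query(
        "SELECT properties FROM crm_objects FINAL "
        "WHERE customer_id = 42 AND entity_id = 1001"
    ).result_rows
    print(rows)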
We just landed three of our biggest customers last month, and the trend is that it's going to keep growing from here on out. We have 8 ClickHouse servers with 128 GB of memory each, and it still doesn't feel like enough.
We are looking for some OG ClickHouse experts to give us advice on how to scale this setup to handle 100x the load. I'm sure we'll hit that scale by next year.
You can reach me at arda [at] hockeystack [dot] com if you are interested in getting your hands dirty with a big production system within 30 minutes. Even better if you are in SF, I'll buy you a coffee :)
Maybe pay them as a consultant? :)
Would be useful to indicate what your scaling constraint is: if you aren't cost-bound, you can always run 800 servers to scale 100x. And if you are cost-bound, it would make sense to quantify the ops/cost metric you're designing against.
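For what it's worth, the back-of-envelope I mean looks something like this (every number below is invented, purely to show the shape of the metric):

    # Toy version of "quantify the ops/cost metric you design against".
    # All numbers are invented for illustration, not HockeyStack's actual
    # load or hardware pricing.
    servers = 8
    dollars_per_server_month = 800       # hypothetical price of a 128 GB box
    queries_per_second = 500             # hypothetical current query load

    seconds_per_month = 30 * 24 * 3600
    monthly_cost = servers * dollars_per_server_month
    queries_per_month = queries_per_second * seconds_per_month

    cost_per_million_queries = monthly_cost / (queries_per_month / 1e6)
    print(f"~${cost_per_million_queries:.2f} per million queries today")

    # If you are cost-bound, this is the number to hold (or improve) while
    # load grows 100x; if you aren't, you can simply add servers.
    print(f"100x load at the same efficiency: ~${monthly_cost * 100:,.0f}/month")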
Why do you need ClickHouse? Why do you need to store 20 billion rows of data? And even if you do need to store that much, why isn't something like Postgres sufficient?
If you're doing revenue attribution based on things like website visits or activities, and only have hundreds of customers, then I imagine at worst you are processing a thousand events every few seconds? Mostly writes. Seems like you could get a long way by buffering events and bulk-inserting them. I still think Postgres is viable here. If you need to scale horizontally, you could shard by customer ID too. And if you need to run analytics on this data, Postgres should still hold up as long as you create the right indices.
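Rough sketch of what I mean by buffering + bulk inserting (hypothetical table and column names, using psycopg2; adjust to your own schema):

    import json
    from datetime import datetime, timezone

    import psycopg2
    from psycopg2.extras import execute_values, Json

    # Hypothetical DSN and table, not the actual schema. Assumes something like:
    #   CREATE TABLE events (customer_id bigint, event_type text,
    #                        properties jsonb, occurred_at timestamptz);
    # plus an index on (customer_id, occurred_at) for the analytics queries.
    conn = psycopg2.connect("dbname=analytics user=app")

    BATCH_SIZE = 1000   # flush roughly every "thousand events every few seconds"
    _buffer = []

    def record_event(customer_id, event_type, properties, occurred_at=None):
        """Buffer an event in memory instead of issuing one INSERT per event."""
        _buffer.append((
            customer_id,
            event_type,
            Json(properties),                    # custom properties go into jsonb
            occurred_at or datetime.now(timezone.utc),
        ))
        if len(_buffer) >= BATCH_SIZE:
            flush()

    def flush():
        """Bulk-insert everything buffered so far in a single round trip."""
        if not _buffer:
            return
        with conn, conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO events (customer_id, event_type, properties, occurred_at)"
                " VALUES %s",
                _buffer,
                page_size=BATCH_SIZE,
            )
        _buffer.clear()

    # Example: one page-view event, then a final flush on shutdown.
    record_event(42, "page_view", {"path": "/pricing", "utm_source": "linkedin"})
    flush()

Sharding by customer ID later would basically just mean picking the shard/connection from customer_id before the insert, since no query ever needs to join across customers.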