HACKER Q&A
📣 tomerbd

What Are the Limitations of Druid


It's frustrating how hard it is to dig up the limitations of architectures. Here is what I have so far:

1. Not for joins

2. Not for SQL DISTINCT

3. Limited for high-variability (high-cardinality) columns

Any other gotchas? (Yes, denormalization is stated clearly, but the other two are not; I'm looking for similar limitations that aren't clearly stated.)


  👤 cheddar Accepted Answer ✓
I apparently wrote a novel that HackerNews claims is too long, so let me try to break it up into parts. Hopefully HN doesn't hate me for it...

As the guy who wrote the first lines of code in Druid, I can venture a description of the limitations as I tend to see them. Of course, all systems have limitations, because they are built for a purpose. The purpose that Druid was built for was to power an analytics-oriented product. What's an analytics-oriented product, and why am I starting to talk all meta when you are asking for very specific technical limitations? Because understanding the meta leads to a much stronger understanding of fit and purpose than talking about a specific technical blahty-blah.

So, an analytics-oriented product is a product that renders a screen for an end user based on some data/analytics. The very first one of these that we powered was a digital-advertising-focused dashboard that provided visibility into impressions, clicks, conversions and revenue across marketplaces. Other examples are things like metering and billing that show usage and costs attributed all the way down to an individual request level. Yet other examples are product recommendations based on purchasing behaviors, or fraud analysis based on recent purchase history, or observability-style use cases viewing the flow of events through a system.

The thing that all of these have in common is that they are products that tend to:

1) Be multi-tenant

2) Follow a similar general "pattern" of queries, with highly variable boolean filtering criteria and a consistent need to view the same data across a wide array of different dimensions

3) Have an SLA defined for the product experience that determines a budget for how fast queries must run
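To make that query "pattern" a bit more concrete, here is a rough sketch in Python of what one dashboard request might turn into. It is an illustration only: the helper `build_groupby_query`, the data source `ad_events`, and the columns `tenant_id`, `marketplace`, `country` and `device` are all made up for this example, and the dictionary just follows the general shape of a Druid native groupBy query (an `and` filter of `selector` clauses, `longSum` aggregations, and a `timeout` in the query context standing in for the SLA budget).

```python
import json


def build_groupby_query(tenant_id, group_by, filters, interval, timeout_ms):
    """Build a Druid-style native groupBy query as a plain dict.

    Hypothetical helper: the data source and column names are assumptions
    made for illustration, not something taken from a real deployment.
    """
    # 1) Multi-tenant: every query is scoped to a single tenant.
    fields = [{"type": "selector", "dimension": "tenant_id", "value": tenant_id}]

    # 2) Highly variable boolean filters picked by the end user.
    for dimension, value in sorted(filters.items()):
        fields.append({"type": "selector", "dimension": dimension, "value": value})

    return {
        "queryType": "groupBy",
        "dataSource": "ad_events",  # assumed name
        "granularity": "all",
        "intervals": [interval],
        # 2) Same data, viewed across whichever dimensions the user chose.
        "dimensions": list(group_by),
        "filter": {"type": "and", "fields": fields},
        "aggregations": [
            {"type": "longSum", "name": "impressions", "fieldName": "impressions"},
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
        ],
        # 3) SLA budget: give up if the query runs longer than this (milliseconds).
        "context": {"timeout": timeout_ms},
    }


query = build_groupby_query(
    tenant_id="acme",
    group_by=["marketplace"],
    filters={"country": "US", "device": "mobile"},
    interval="2024-01-01/2024-01-08",
    timeout_ms=2000,
)
print(json.dumps(query, indent=2))
```

The point of the sketch is that only the filter clauses and the grouped dimensions change from request to request; the tenant scoping, the aggregations and the time budget stay the same.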