HACKER Q&A
📣 stat_throwaway

Anyone else seeing people averaging percentiles?


I work at a middling SV company, and I see people taking averages of percentiles (or, even crazier, percentiles of percentiles) every day - that is, each server computes its own "50%/90%/95% latency" over all its requests, sends it to a central time-series database, and the Grafana console averages them all to show a "nice" graph. These numbers are used for everything from alerts to launch decisions.

It's driving me crazy because that makes no sense: you can't average percentiles; you get bogus numbers that jump up and down depending on how you aggregate (how many servers you have, which tag sets you use, and so on). And I'm apparently the only one who is seriously bothered. Everyone else is somewhere between "Eh, that's the best data we have." and "What do you mean the numbers are wrong?"
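
To make the objection concrete, here's a toy sketch (Python, entirely synthetic numbers, nothing from our real systems) of what goes wrong when one server handles most of the traffic:

```python
# Toy example: an unweighted average of per-server p90s vs. the fleet-wide p90.
# All numbers are synthetic; the point is only the shape of the failure.
import numpy as np

rng = np.random.default_rng(0)

# Nine lightly loaded fast servers and one heavily loaded slow server.
fast = [rng.lognormal(mean=3.0, sigma=0.5, size=1_000) for _ in range(9)]
slow = [rng.lognormal(mean=4.5, sigma=0.5, size=50_000)]
servers = fast + slow

all_requests = np.concatenate(servers)
true_p90 = np.percentile(all_requests, 90)                       # what users actually experience
avg_of_p90s = np.mean([np.percentile(s, 90) for s in servers])   # what the dashboard shows

print(f"fleet-wide p90:             {true_p90:6.1f} ms")
print(f"average of per-server p90s: {avg_of_p90s:6.1f} ms")
```

Add or remove a few fast servers and the "p90" on the dashboard moves, even though no user's latency changed.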

Is this normal?


  👤 richk449 Accepted Answer ✓
On one hand, your point seems to be technically correct.

On the other hand, don't we average things that we can't technically justify all the time? Teachers give out homework and tests, assign a weight to each homework and each test, and average the results to assign you a grade. Is that grade arbitrary? Yes. Does that mean it is useless? Probably not.

If I were you, I would make your case based less on "you can't do that" and more on "if we used this approach to aggregation, we would improve our ability to detect cases X and Y that our customers really care about".


👤 NumberCruncher
> Is this normal?

It is. At my last job the first ticket assigned to me was about fixing the bug "average is greater than the 80th percentile". When I told them that this isn't a bug, they thought I was just kidding.
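
For anyone puzzled by that: with heavy-tailed data the mean really can sit above the 80th percentile. A quick synthetic check (not the original ticket's data):

```python
# A few huge outliers drag the mean above most of the distribution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=2.0, size=100_000)   # heavy-tailed "latencies"
print(f"mean = {x.mean():.0f}, 80th percentile = {np.percentile(x, 80):.0f}")
```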


👤 Bostonian
I think the idea is to measure average latency but to give extra weight to the worst 5% and 10% of latency cases. I've not seen such a measure before but don't think it's absurd.

👤 wizwit999
I think they might be getting at what I've seen called a 'trimmed mean'. It's actually a pretty useful statistic for something like latency: unlike a single percentile, an average will reflect changes in the distribution below the cutoff.

But that's the wrong way to collect it; they should aggregate at the end.
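
A minimal sketch of what "aggregate at the end" means here, assuming you can get the raw (or sampled) latencies into one place; the filename is made up:

```python
# Trimmed mean over raw latencies, computed centrally rather than per server.
import numpy as np
from scipy import stats

latencies = np.loadtxt("latencies_ms.txt")   # hypothetical dump of raw request latencies (ms)

# Drop the top and bottom 5% before averaging: robust to outliers, but still
# sensitive to shifts in the bulk of the distribution, unlike a single percentile.
trimmed = stats.trim_mean(latencies, proportiontocut=0.05)
p90 = np.percentile(latencies, 90)
print(f"5% trimmed mean: {trimmed:.1f} ms   p90: {p90:.1f} ms")
```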


👤 sobriquet9
It's actually not as stupid as it looks. There is a whole class of estimators [1] that use a weighted average of quantiles.

[1] https://en.wikipedia.org/wiki/L-estimator
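
For a concrete toy example, Tukey's trimean is an L-estimator: a weighted average of the quartiles of a single sample. Note that this is different from averaging the same percentile across many separate samples, which is what the OP is describing.

```python
# Tukey's trimean: (Q1 + 2*median + Q3) / 4, computed on ONE sample.
import numpy as np

def trimean(x):
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (q1 + 2 * q2 + q3) / 4

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
print(f"trimean: {trimean(sample):.1f}   median: {np.percentile(sample, 50):.1f}")
```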


👤 millrawr
It happens frequently because the underlying systems often lack support for mergeable latency sketches (e.g. DDSketch, HDR Histogram, or libcircllhist). Thankfully, support for doing this the correct way seems to be slowly increasing.
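
A minimal sketch of the mergeable idea, using plain fixed log-spaced buckets instead of a real sketch library (the bucket layout and numbers here are made up):

```python
# Each server keeps bucket counts; merging is just summing counts, and
# percentiles are computed once, from the merged histogram.
import numpy as np

BUCKETS = np.logspace(0, 5, num=200)   # log-spaced latency bucket edges, 1 ms .. 100 s

def to_histogram(latencies_ms):
    counts, _ = np.histogram(latencies_ms, bins=BUCKETS)
    return counts

def merged_percentile(histograms, q):
    total = np.sum(histograms, axis=0)            # the merge step: just add counts
    cdf = np.cumsum(total) / total.sum()
    idx = np.searchsorted(cdf, q / 100.0)
    return BUCKETS[idx + 1]                       # upper edge of the matching bucket

rng = np.random.default_rng(0)
per_server = [rng.lognormal(3.0, 1.0, size=n) for n in (1_000, 50_000, 8_000)]
merged_p90 = merged_percentile([to_histogram(s) for s in per_server], 90)
print(f"merged p90 ≈ {merged_p90:.1f} ms (accurate to within one bucket)")
```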

👤 wikibob
Do you work at my company? This is literally rampant. Nobody understands.

👤 kderbyma
You need to do aggregate operations. Some operations are point-based (think matrix dot multiplication) and others are series-based (FFT).

This is typically fixed by realising which one you need.


👤 ryanmonroe
Not sure it’s possible to know the answer without more context. What do they use the averages for? What do you think they should do instead and why is it better?

👤 lolln
Sounds like another dev who doesn’t know about LLN tbqh.