P50 vs P95 vs P99 Latency: What These Percentiles Actually Mean (And How to Use Them)
Posted by OuPeaNut@reddit | ExperiencedDevs | 11 comments
servermeta_net@reddit
Not only should you care about P99, but in 15 years of career I have NEVER seen a P99 computed correctly.
Since exact algorithms are too expensive, people assume that the central limit theorem holds for their dataset, but it absolutely does not. Think of the same endpoint (let alone different endpoints): one time it fails with an authorization error, another time it goes through. How could you assume those two response times are draws from IID random variables? To correctly calculate P99 you need nonparametric statistics, or you need to model the distribution of your latency (VERY hard).
Datadog offers some tools in this regard, but NOBODY I've worked with knows them; even people at big companies like Amazon and Google make the same mistake. One time, after doing the math for a mission-critical system, the P99 went from 8.72 ms (computed with the CLT assumption) to 23.19 ms (without it). That is how big the difference can be.
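As a toy illustration of the gap (the distributions and numbers below are invented for the example, not data from that system):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical bimodal latency stream: slow successful responses plus
# fast authorization failures, so the sample is NOT identically
# distributed normal data.
ok = rng.lognormal(mean=2.0, sigma=0.6, size=9_000)     # successes, heavy right tail
auth_fail = rng.normal(loc=1.0, scale=0.1, size=1_000)  # ~1 ms fast failures
latencies = np.concatenate([ok, auth_fail])

# CLT-style estimate: pretend latency is normal and read P99 off
# mean + z * stddev, with z the 99th percentile of a standard normal.
p99_normal = latencies.mean() + 2.326 * latencies.std()

# Nonparametric estimate: the empirical 99th percentile of the sample.
p99_empirical = np.percentile(latencies, 99)

print(f"P99 assuming normality: {p99_normal:6.2f} ms")
print(f"P99 empirical:          {p99_empirical:6.2f} ms")
```

On a skewed, mixed sample like this, the normality assumption understates the tail noticeably, in the same direction as the gap I saw in production.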
lordnacho666@reddit
You sound like you might have a link or two about this?
servermeta_net@reddit
Well, not much... honestly, between probability, statistics, and measure theory I took a total of 11 exams on the topic. If you want some academic references, you could read:
R.V. Hogg, E.A. Tanis, Probability and Statistical Inference
G.G. Roussas, A Course in Mathematical Statistics
N. Draper, H. Smith, Applied Regression Analysis
Also, while searching the Datadog documentation I found this nice article:
https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/
It's very light on mathematical detail, but it's a good short intro. They say they can compute accurate percentiles, but that was written by a marketing guy; it's still an approximation, although orders of magnitude better than the standard approach of assuming the CLT.
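To make the bucketing idea concrete, here's a toy Python version of the approach the article describes (my own simplification, not Datadog's code; the class name and parameters are made up for illustration):

```python
import math
import random
from collections import Counter

class MiniDDSketch:
    """Toy DDSketch-style quantile sketch with relative accuracy alpha.

    Values land in exponentially sized buckets, so every value in a
    bucket is within a small relative error of the bucket's
    representative value. Memory grows with the number of distinct
    buckets, not with the number of samples.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Bucket i covers (gamma**(i-1), gamma**i]; positive values only.
        i = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[i] += 1
        self.count += 1

    def quantile(self, q):
        # Walk buckets in increasing order until the cumulative count
        # passes the target rank, then return that bucket's
        # representative value.
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                return 2 * self.gamma ** i / (self.gamma + 1)

# Quick check against the exact empirical percentile.
data = [random.lognormvariate(2.0, 0.6) for _ in range(100_000)]
sketch = MiniDDSketch(alpha=0.01)
for x in data:
    sketch.add(x)
exact = sorted(data)[int(0.99 * (len(data) - 1))]
print(f"sketch P99 {sketch.quantile(0.99):.2f} vs exact {exact:.2f}")
```

The part the article undersells is the guarantee: the returned value is within relative error alpha of a true sample quantile no matter how weird the latency distribution is, which is exactly what you want for tails.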
lordnacho666@reddit
Much appreciated. What are the keywords, other than perhaps "percentile"?
servermeta_net@reddit
Nonparametric statistics, computational statistics, central limit theorem.
Also please note: MANY common books on statistics for engineers contain crude mistakes, like the ones I mentioned above. Please refer to this paper:
https://eric.ed.gov/?id=ED395989
It not only talks about this topic (although without proposing solutions that are a good fit for us), but also mentions many errors in the academic literature.
servermeta_net@reddit
Maybe I will write a post about this, with some examples.
ccb621@reddit
You’re just gonna leave us hanging?
servermeta_net@reddit
https://docs.datadoghq.com/metrics/distributions/
https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/
This is approximate nonparametric statistics. Computing the exact result would be too computationally expensive, but the folks at Datadog are VERY smart (they actually publish a lot of very good operations research) and came up with an approach that doesn't break the bank.
I feel the article, although light on mathematical detail, is a good intro to the topic.
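One more thing worth noting, building on the toy MiniDDSketch from my earlier comment: because the whole state is just per-bucket counts, two sketches built on different hosts merge by summing counters, which (as far as I understand it) is what makes this kind of structure cheap to aggregate across a fleet:

```python
# Assumes the MiniDDSketch toy class from the earlier comment.
def merge(a, b):
    # Sketches with the same alpha share the same bucket layout,
    # so merging two of them is just summing per-bucket counts.
    assert a.alpha == b.alpha
    merged = MiniDDSketch(alpha=a.alpha)
    merged.buckets = a.buckets + b.buckets  # Counter '+' sums counts
    merged.count = a.count + b.count
    return merged
```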
globalaf@reddit
AI slop.
Jealous-Weekend4674@reddit
another bot account that likes to spam
OuPeaNut@reddit (OP)
:(