P50 vs P95 vs P99 Latency: What These Percentiles Actually Mean (And How to Use Them)
Posted by OuPeaNut@reddit | ExperiencedDevs | 11 comments
servermeta_net@reddit
Not only should you care about P99, but in 15 years of career I have NEVER seen a P99 computed correctly.
Since exact algorithms are too expensive, people assume that the central limit theorem holds for their dataset, but it absolutely does not. Think of the same endpoint (let alone different endpoints): one time it fails with an authorization error, another time it goes through. How could you assume those two response times are draws from IID random variables? To correctly calculate P99 you need nonparametric statistics, or you need to model the distribution of your latency (VERY hard).
Datadog offers some tools in this regard, but NOBODY I've worked with knows them; even people at big companies like Amazon and Google make the same mistake. One time, after doing the math for a mission-critical system, the P99 went from 8.72 ms (computed with the CLT assumption) to 23.19 ms (without it). That is how big the difference can be.
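As a toy illustration of the gap (the distributions and numbers below are invented for the example, not data from that system):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical bimodal latency stream: slow successful responses plus
# fast authorization failures, so the sample is NOT identically
# distributed normal data.
ok = rng.lognormal(mean=2.0, sigma=0.6, size=9_000)     # successes, heavy right tail
auth_fail = rng.normal(loc=1.0, scale=0.1, size=1_000)  # ~1 ms fast failures
latencies = np.concatenate([ok, auth_fail])

# CLT-style estimate: pretend latency is normal and read P99 off
# mean + z * stddev, with z the 99th percentile of a standard normal.
p99_normal = latencies.mean() + 2.326 * latencies.std()

# Nonparametric estimate: the empirical 99th percentile of the sample.
p99_empirical = np.percentile(latencies, 99)

print(f"P99 assuming normality: {p99_normal:6.2f} ms")
print(f"P99 empirical:          {p99_empirical:6.2f} ms")
```

On a skewed, mixed sample like this, the normality assumption understates the tail noticeably, in the same direction as the gap I saw in production.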
lordnacho666@reddit
You sound like you might have a link or two about this?
servermeta_net@reddit
Well, not much... honestly, between probability, statistics, and measure theory I took a total of 11 exams on the topic. If you want some academic references, you could read:
R.V. Hogg, E.A. Tanis, Probability and Statistical Inference
G.G. Roussas, A Course in Mathematical Statistics
N. Draper, H. Smith, Applied Regression Analysis
Also, while searching the Datadog documentation I found this nice article:
https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/
It's very light on mathematical detail, but it's a good short intro. They say they can compute accurate percentiles, but that was written by a marketing guy; it's still an approximation, although orders of magnitude better than the standard approach of assuming the CLT.
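To make the bucketing idea concrete, here's a toy Python version of the approach the article describes (my own simplification, not Datadog's code; the class name and parameters are made up for illustration):

```python
import math
import random
from collections import Counter

class MiniDDSketch:
    """Toy DDSketch-style quantile sketch with relative accuracy alpha.

    Values land in exponentially sized buckets, so every value in a
    bucket is within a small relative error of the bucket's
    representative value. Memory grows with the number of distinct
    buckets, not with the number of samples.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Bucket i covers (gamma**(i-1), gamma**i]; positive values only.
        i = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[i] += 1
        self.count += 1

    def quantile(self, q):
        # Walk buckets in increasing order until the cumulative count
        # passes the target rank, then return that bucket's
        # representative value.
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                return 2 * self.gamma ** i / (self.gamma + 1)

# Quick check against the exact empirical percentile.
data = [random.lognormvariate(2.0, 0.6) for _ in range(100_000)]
sketch = MiniDDSketch(alpha=0.01)
for x in data:
    sketch.add(x)
exact = sorted(data)[int(0.99 * (len(data) - 1))]
print(f"sketch P99 {sketch.quantile(0.99):.2f} vs exact {exact:.2f}")
```

The part the article undersells is the guarantee: the returned value is within relative error alpha of a true sample quantile no matter how weird the latency distribution is, which is exactly what you want for tails.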
lordnacho666@reddit
Much appreciated. What are the keywords, other than perhaps "percentile"?
servermeta_net@reddit
Nonparametric statistics, computational statistics, central limit theorem.
Also please note: MANY common books on statistics for engineers contain crude mistakes, like the ones I mentioned above. Please refer to this paper:
https://eric.ed.gov/?id=ED395989
It not only talks about this topic (although without proposing solutions that are a good fit for us), but also mentions many errors in the academic literature.
servermeta_net@reddit
Maybe I will write a post about this, with some examples.
ccb621@reddit
You’re just gonna leave us hanging?
servermeta_net@reddit
https://docs.datadoghq.com/metrics/distributions/
https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/
This is approximate nonparametric statistics. Computing the exact result would be too computationally expensive, but the folks at Datadog are VERY smart (they actually publish a lot of very good operations research) and came up with an approach that doesn't break the bank.
I feel the article, although light on mathematical detail, is a good intro to the topic.
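One more thing worth noting, building on the toy MiniDDSketch from my earlier comment: because the whole state is just per-bucket counts, two sketches built on different hosts merge by summing counters, which (as far as I understand it) is what makes this kind of structure cheap to aggregate across a fleet:

```python
# Assumes the MiniDDSketch toy class from the earlier comment.
def merge(a, b):
    # Sketches with the same alpha share the same bucket layout,
    # so merging two of them is just summing per-bucket counts.
    assert a.alpha == b.alpha
    merged = MiniDDSketch(alpha=a.alpha)
    merged.buckets = a.buckets + b.buckets  # Counter '+' sums counts
    merged.count = a.count + b.count
    return merged
```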
globalaf@reddit
AI slop.
Jealous-Weekend4674@reddit
another bot account that likes to spam
OuPeaNut@reddit (OP)
:(