Please Implement This Simple SLO
Posted by IEavan@reddit | programming | 8 comments
In all the companies I've worked for, engineers have treated SLOs as a simple and boring task. There are, however, many ways that you could do it, and they all have trade-offs.
I wrote this satirical piece to illustrate the underappreciated art of writing good SLOs.
QuantumFTL@reddit
Sure would be nice to define SLO the first time you use it. I have to adhere to SLAs at my day job, and they're constantly mentioned. I have never heard someone discuss an SLO.
ThatNextAggravation@reddit
Thanks for giving me nightmares.
IEavan@reddit (OP)
If those nightmares make you reflect deeply on how to implement the perfect SLO, then I've done my job.
ThatNextAggravation@reddit
Primarily it just activates my impostor syndrome and makes me want to curl up in the fetal position and implement Fizz Buzz for my next job interview.
CatpainCalamari@reddit
eye twitching intensifies
I hate this so much. Good writeup though, thank you.
IEavan@reddit (OP)
I'm glad you liked it
Arnavion2@reddit
I know it's a made-up story, but for the second issue about service down -> no failure metrics -> SLO false positive, the better fix would've been to expect the service to report metrics for the number of successful and failed requests in each time period T. That would also have avoided the issues after that.
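A minimal sketch of that idea, assuming the service pushes one report of success and failure counts per period T and the monitor sees `None` when a report never arrives. The `Report` shape and the function names are hypothetical and only illustrate the approach; they are not taken from the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Report:
    """Counts reported by the service for one period T (hypothetical shape)."""
    succeeded: int
    failed: int


def evaluate_period(report: Optional[Report]) -> str:
    """Classify a single period.

    A missing report is surfaced as its own state instead of being read
    as "no failures", which is the false positive described above.
    """
    if report is None:
        return "missing"      # nothing was reported: alert, do not assume healthy
    total = report.succeeded + report.failed
    if total == 0:
        return "no_traffic"   # the service reported, but served no requests
    return "ok" if report.failed == 0 else "degraded"


def availability(reports: List[Optional[Report]]) -> Optional[float]:
    """Fraction of successful requests over the periods that did report."""
    present = [r for r in reports if r is not None]
    total = sum(r.succeeded + r.failed for r in present)
    if total == 0:
        return None           # no data at all, so no ratio can be computed
    return sum(r.succeeded for r in present) / total
```

Keeping "missing" separate from "no_traffic" is the design choice that matters here: a service that is up but idle still reports zeros, while a service that is down reports nothing.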
IEavan@reddit (OP)
I've seen this solution in the wild as well. If you expect consistent traffic to your service, then it can generally work well. But some services have time periods where they don't expect traffic. You can easily modify your alerts to exclude these times, but will you remember to update these exclusions when daylight savings comes and goes? :p
Also, it might still mess up your SLO data for certain types of partial failures. If your service is crashing sporadically and being restarted, your SLI will not record some failures, but no metrics will be missing, so there will be no alert from the secondary system.
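On the daylight saving point, one way to avoid hand-editing exclusions twice a year is to define the quiet hours in the service's local time zone and convert each UTC timestamp before checking. The sketch below assumes a hypothetical zone and window (`Europe/Amsterdam`, 01:00 to 05:00 local) and uses Python's standard `zoneinfo` module, which relies on time zone data being available on the system or through the `tzdata` package.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Hypothetical values chosen only for illustration.
SERVICE_TZ = ZoneInfo("Europe/Amsterdam")
QUIET_START = time(1, 0)   # 01:00 local time
QUIET_END = time(5, 0)     # 05:00 local time


def in_quiet_window(now_utc: datetime) -> bool:
    """Return True if a UTC instant falls inside the local no-traffic window.

    The conversion to local time applies the current UTC offset, so the
    window follows daylight saving changes without manual edits.
    """
    local = now_utc.astimezone(SERVICE_TZ)
    return QUIET_START <= local.time() < QUIET_END


# The same UTC clock time lands on different sides of the window
# depending on the season.
winter = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)  # 00:30 local
summer = datetime(2024, 7, 15, 23, 30, tzinfo=timezone.utc)  # 01:30 local
print(in_quiet_window(winter), in_quiet_window(summer))
```

This only addresses the exclusion windows. It does nothing for the sporadic-crash case, where every period still reports metrics and the lost requests are never counted.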