The Root Cause Fallacy: Systems fail for multiple reasons, not one
Posted by dmp0x7c5@reddit | programming | View on Reddit | 84 comments
SanityInAnarchy@reddit
I'm curious if there are places that do RCA so religiously that they don't consider these other contributing factors. I've worked in multiple places where the postmortem template would have a clear "root cause" field to fill in, but in a major outage, you'd end up with 5-10 action items to address other contributing factors.
Every postmortem I've ever written for a major incident had a dozen action items.
Murky-Relation481@reddit
If they're doing RCA religiously then they would be getting multiple action items. Root cause analysis is analyzing the root cause, not just identifying the thing that went wrong. It's how you got to the thing that went wrong in terms of process and procedures.
SanityInAnarchy@reddit
Right, but these "root cause is a fallacy" comments talk about how it's never just one thing, as if there's a level of "RCA" religion where you insist there can only be one root cause and nothing else matters.
ThatDunMakeSense@reddit
Yeah, it's mostly because people don't understand how to determine a root cause. They go "oh, this code had a bug" and say "that's the root cause" instead of actually looking at the things that allowed that to happen. I would say generally, if someone does an RCA that doesn't come up with a number of action items, then they've probably not done it right. It's not *impossible* I suppose, but I've personally never seen a well-done RCA with only one action item.
En-tro-py@reddit
Exactly, and 5-Whys is just a warm-up...
Why was it missed in review? etc.
Eventually you reach the level/scope of the 'why' that is outside your pay-grade/jurisdiction/control and that's when you know it's been done right.
We can't change the whole world, but we can change a lot about how this gets done to prevent that from happening again.
Murky-Relation481@reddit
It's more so that a ton of people don't actually know what RCA is and practice it wrong, which is why so many people are commenting on why there are multiple causes.
CurtainDog@reddit
Interesting choice of words. Religion is the OG RCA when you think about it.
LessonStudio@reddit
When I was training as a pilot, there were old VHS cassettes of good, non-hyperbolic accident investigations, not like those stupid crash "documentaries".
It was crash after crash after crash and their investigations.
Being 19, we had way too much fun watching these, as we lived at the school, and between nights and poor weather we had lots of time on our hands.
The story was pretty much the same for every crash: a series of factors, any one of which, if removed, would have turned a huge accident into something potentially not even worth reporting, or at most a minor maintenance report.
The ones that came close to having a single factor usually involved gross levels of incompetence. But again, it would be layers of different incompetence.
My favourite seemed simple: a crash in a Florida swamp where one of the gear-down lights had burned out. So they lowered the gear, and the three pilots (including the engineer) were screwing with the bulb right down into the ground, as nobody was paying attention to flying the plane. I would argue that this wasn't only a bulb and bad pilots, but that the bulb should never have been a single point of failure. There should have been redundant bulbs, or something. It was also a failure of the engineers and a lack of safety-critical design. And a lack of training that would tell them that if it doesn't light up, it could be the bulb, not the gear, and on and on. Many failures.
Other "simple" accidents like the Concorde getting taken out by a single piece of scrap still really had a long set of failures going back to the problem of that plane barely being able to fly, and that the loss of an engine could be so catastrophic.
The Gimli glider might have been one of the shorter ones, and still was a chain of mistakes, that removing any single one would have probably turned it into something not even mentioned in a log somewhere.
CurtainDog@reddit
I love this kind of stuff. The 737 MAX is a particularly interesting case. A whole raft of failures. But I think the lion's share of the blame should have gone to the FAA, for making the cost of compliance too high.
LessonStudio@reddit
I build systems which are often safety/mission critical, and I will impose some standards on projects where this is not a requirement.
Doing it in the classic engineering V Matrix style is begging for unsafe products.
My theory is that people focus less on making a good product than on making good paperwork, and a product which complies with the paperwork. Then, when they are holding the earliest versions in their hands, the obvious iterations they need to do in order to make a great product are now prohibitively expensive. Not only because you have to push those changes all the way back to the start, but because there is a good chance the team has been disbanded or moved on to another project. So, you keep your mouth shut and point to the fact that everyone signed off and it passed all the tests.
My methodology is to focus on the goal, keep it within known and immutable constraints, and then just go to town. Make the product with many iterations, prototypes, and whatnot. Doing this while keeping various safety critical things in mind. Things like selecting lockstep MCUs, etc.
Then, when the product is nailed down, you blast through the process of papering the product properly. It is super easy to document the requirements in detail when you are holding it in your hand, all the way to the other end of testing it, when you are dead certain of the final product. Everything during design and development is now just copy-paste. Even the testing people could have been validating their tests against the prototype in anticipation that the "real" product will be no different.
This papering process is fast and requires very few people.
You are 100% correct that regulators, lawyers, and insurance companies have enabled the worst kinds of engineers to rise to positions of power. The librarians of the engineering world instead of the builders. Those librarians should be more like archivists: let the builders do their thing and hand it over to them to document, not the other way around.
En-tro-py@reddit
Unfortunately, I've seen this too often... FMEAs done as a deliverable long after they would be useful...
It's what I'd call the human equivalent of AI slop... As long as it looks right and no one notices, it's all good, until it isn't...
En-tro-py@reddit
Good designers do this whenever possible, but the bean counters love to cut 'waste' to save pennies.
grauenwolf@reddit
This is your root cause. There is a design flaw in the database server that causes it to run out of memory and crash when a query is too complex. That shouldn't even be possible. There are known ways for the database to deal with this.
This is a contributing factor in delaying the recovery, but not the root cause. While it needs to be fixed, it doesn't change the fact that the database shouldn't have failed in the first place.
Again, this is a contributing factor in delaying the recovery. The database should have, at worst, suffered a performance degradation, or maybe killed the one query that exceeded its memory limit. And then, after the issue occurred, the scaling policy should have kicked in to reduce the chance of a recurrence.
Unoptimized queries are to be expected. Again, databases should not crash because a difficult query comes through.
There was no root cause analysis in this article.
Root cause analysis doesn't answer the question, "What happened in this specific occurrence?". It answers the question, "How do we prevent this from happening again?".
What this article did was identify some proximate causes. It didn't take the next step of looking at the root causes.
Not all of these questions will lead you to a root cause, but not asking them will guarantee that the problem will occur again.
moratnz@reddit
I'm glad I'm not the only one thinking this. The database running out of memory isn't the root cause of the crash; it's the proximal cause. The root cause is almost certain to be something organisational that explains why it was possible to have dangerously unoptimised queries in production, and why there was no monitoring and alerting to catch the issue before it broke stuff.
Similarly, the linked reddit comment says that when looking at the 737max, root cause analysis gets you to the malfunctioning angle of attack sensor; no it doesn't - again that's the start of the hunt. The next very obvious question is why the fuck is such a critical sensor not redundant, and on we go.
Ultimately, yeah, eventually we're going to end up tracing a lot of root causes back to the problem being something societal ('misaligned incentives in modern limited liability public companies' is a popular candidate here), but that doesn't mean root cause analysis is useless, just that in practical terms you're only going to be able to go so far down the causal chain before solving the problem moves out of your control.
Kache@reddit
IME, tracing deeply pretty much has to end at "societal" because past that, it can start to get finger-pointy and erode trust
En-tro-py@reddit
People aren't the problem, the problem is the problem.
It's often the lowest grunt who gets the blame, when the actual fault lies with the management and the environment, which never get questioned.
People can cause problems, but the root cause is generally lack of support/training, accountability, etc.
Plank_With_A_Nail_In@reddit
I have worked on systems that were so poorly designed that crazy SQL was the best anyone could do.
grauenwolf@reddit
That's been my life for most of the year. Their data model was flat out garbage, but I couldn't change it.
Murky-Relation481@reddit
I swear people think root cause analysis is figuring out what went wrong and not why it went wrong.
Almost every single root cause analysis system starts with already knowing what went wrong. You can't figure out the why if not for the what.
Pinewold@reddit
I worked on five nines systems. A good root cause analysis seeks to find all contributing factors, all monitoring weaknesses, all of the alert opportunities, all of the recovery impediments. The goal is to eliminate all contributing factors, including the whole class of problem. So a memory failure would prompt a review of memory usage, garbage collection, memory monitoring, memory alerts, memory exceptions, memory exception handling, and recovery from memory exceptions.
This would be done for all of the code, not just the offending module. When there are millions of dollars on the line every day, you learn to make everything robust, reliable, redundant, restartable, replayable and recordable. You work to find patterns that work well and reuse them over and over again to the point of boredom.
At first it is hard, but over time it becomes second nature.
You learn the limits of your environment, put guard posts around them, work to find the weakest links and rearchitect for strength.
Do you know how much disk space your programs require to install and consume on a daily basis? Do you know your program memory requirements? What processes are memory intensive, storage intensive, network intensive? How many network connections do you use? How many network connections do you allow? How many logging streams do you have? How many queues do you subscribe to, how many do you publish to?
19andNuttin@reddit
Given your experience working on these systems, is there anything you’ve learned over time that you wish you’d known earlier?
Maybe something that people often underestimate or only realize after a few years of doing this kind of work?
Pinewold@reddit
Find reliable patterns and use them relentlessly! To make it reliable, load test everything to the breaking point and understand why it fails.
The heavy load exposes memory leaks, excessive disk utilization, network constraints and hardware constraints. Even multithreaded issues happen more frequently under load.
Turns out even state-of-the-art routers can fail when you get over a million simultaneous connections. By limiting connections to 700k we spared the router and avoided swamping the server with dropped connections. It is cool to watch half a million connections breeze through with no issues.
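A minimal sketch of the kind of load ramp being described, in Go; the target URL, step sizes, and request counts are placeholders rather than anything from the comment:

```go
// Ramp up concurrency in steps and count failures at each step; the point at
// which errors start climbing is where you go digging for the weakest link.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func hammer(url string, workers, perWorker int) (ok, failed int64) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				resp, err := http.Get(url)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	// Hypothetical target; in practice you'd point this at a staging replica.
	const target = "http://localhost:8080/health"
	for _, workers := range []int{100, 1000, 10000} {
		ok, failed := hammer(target, workers, 100)
		fmt.Printf("workers=%d ok=%d failed=%d\n", workers, ok, failed)
	}
}
```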
Last-Independence554@reddit
Not the commentator, but IMHO the key mindset is: Defense in depth. There will be failures, there will be bugs.
Obviously you should try to catch these bugs and failures early, but things will slip through the cracks. So think about how you can keep them from exploding, or mitigate the damage when they do. Think about what-ifs. What if this part/system fails? How will the rest of my system react? Can I make it more resilient? Is there a reasonable fallback? Can I shed load? Will the system degrade gracefully or will it fall apart? If this API call fails, can I still continue? Do humans need to do repetitive tasks / manual checks for deployment? Can these be automated? Can I test / simulate failures? Can I simplify my system? What would happen if the number of requests suddenly multiplies? What are my failure zones? What's the blast radius of each of these? Are they acceptable?
Also, the Google SRE book is always a good resource: https://sre.google/sre-book/table-of-contents/
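To make one of those what-ifs concrete, here is a minimal Go sketch of the "if this API call fails, can I still continue?" question: a short timeout plus a fallback, so the dependency failing degrades the response instead of taking the caller down. The recommendation service and fallback values are hypothetical stand-ins:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errSlow = errors.New("recommendation service timed out")

// fetchRecommendations simulates a flaky downstream dependency.
func fetchRecommendations(ctx context.Context, userID string) ([]string, error) {
	select {
	case <-time.After(2 * time.Second): // pretend the service is slow today
		return []string{"live-result"}, nil
	case <-ctx.Done():
		return nil, errSlow
	}
}

// recommendationsWithFallback degrades gracefully: a bounded wait plus a
// cached/default answer keeps the page rendering when the dependency is down.
func recommendationsWithFallback(userID string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	recs, err := fetchRecommendations(ctx, userID)
	if err != nil {
		return []string{"popular-item-1", "popular-item-2"} // stale/default fallback
	}
	return recs
}

func main() {
	fmt.Println(recommendationsWithFallback("user-42"))
}
```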
Poromenos@reddit
With my experience not working on these systems, here's something I wish I knew earlier: You're not working on these systems. Stop spending time and money trying to make your systems more reliable than you absolutely need to.
If your site being down at 3am means nobody is up to buy anything anyway, fixing it is worth exactly zero dollars to you.
jl2l@reddit
What a garbage take. The problem is it might be 3am in your time zone, but the Internet is 24/7. It's 5pm in Australia and they buy shit, last time I checked.
grauenwolf@reddit
My pizza shop only has a 5 mile delivery area. Are you really going to order a pizza at 5 pm your time and then fly for 14+ hours to pick it up?
jl2l@reddit
If you are running the pizza shop infrastructure on your local ISP sure. But I bet your pizza shop is running on Shopify or Wix and it also might run hundreds of other pizza shops in different timezones.
grauenwolf@reddit
That scenario was already precluded.
EscapeTomMayflower@reddit
It's so annoying how many companies want to burn through money trying to do "something" even if there's nothing that can be done.
When there was that big AWS outage last month we had teams of people working 16 hours straight to try and "fix" things. None of the fixes worked and we had to spend the rest of the week undoing all the things the fixes broke.
Companies want to become completely dependent on cloud providers and then act like they're not completely dependent on cloud providers.
Norphesius@reddit
The CEO of that company that had all their IoT beds lock up because of the outage made some statement like "our developers are working around the clock until the issue is resolved." It's stupid on two levels, the first being what you said: there was nothing they could do, because the issue was entirely beyond their control at the time.
The really annoying thing is that it was in their control, up until the moment they decided to make their product completely reliant on a network connection. All the "do something now" money they're wasting now would've gone a long way if they actually spent it on designing a product that doesn't brick if someone unplugs a router. If the company wanted devs working around the clock during that incident, it should've been to develop an offline mode update for the beds to deploy ASAP when AWS came back up.
cronofdoom@reddit
Unless you work for an organization that operates globally…
CpnStumpy@reddit
Reminds me of the fishbone diagram exercise I participated in once at a previous job, where the whole product engineering department was sat down in groups of ~3-5 with a poster board to collect the failures and causes in an incident, which were then compiled into a huge fishbone diagram. It was insightful to see how many different points of detail went into the result, each somehow contributing to the incident, many of them invisible to folks from other teams.
We walked out of the exercise with pretty much every team having an action item or three, either code- or process-wise.
Pinewold@reddit
Bet you also walked out with a sense of hope for a more reliable product too!
Murky-Relation481@reddit
Yah, I used to build root cause analysis software long long ago for use in industrial settings and other multi disciplinary engineering systems (think oil and gas, nuclear, aerospace, etc.).
At least for the group we worked with, it was about covering all contributing causes to the same degree as the root cause. This meant you often had trees of causes, with each branch getting as much investigation and evidence gathering as the final cause would. I don't even know if you could call it root cause analysis without that, since root cause analysis is almost always an exercise in pattern recognition and pattern extraction, and then adjusting your practices and procedures to prevent it in the future.
Pinewold@reddit
Good insight! Patterns are key. When we hit a memory error for the first time (as opposed to memory exceptions, which we felt we had well covered), we worked really hard to understand the pattern of when memory errors occurred. We had to refactor for these new patterns and come up with new patterns for handling memory-intensive situations. Automated tests helped us simulate memory errors and verify how we handled them.
mycall@reddit
I like to call it minimal maintenance.
Pinewold@reddit
Agreed, once you realize that reliable code frees you up to write new code, you work harder to get it right the first time!
InTheASCII@reddit
Programmers and IT techs would benefit from reading final FAA reports, especially ones involving complex mechanical failures. It really would open some eyes I think. If we were to assume a failure could cost lives, we would definitely go beyond, "root cause."
Just because something broke, doesn't mean it broke in total isolation, or that it was unavoidable, or that there's a single party responsible for fixing the problem.
Pinewold@reddit
Agreed, our five nines system was the servers for a life alert system, so literally people's lives depended on our servers being up. When that is well understood, it makes it easier to choose to do the right thing.
The most toxic root cause analysis is a witch hunt. I have seen good people thrown under the bus because a senior manager did not want a stain on their reputation. It is a major red flag when anyone says they have never had a server crash!
omgFWTbear@reddit
I’m old, but when I was in school, they taught (what the textbook called “the” but appears to have been “a”) Therac “disaster.”
Code that was technically correct (and before anyone replies, hop in a time machine to 30+ years ago) that, due to a race condition, cooked a human alive.
I’ve certainly worked with a lot of developers since who clearly either did not have that experience or did not internalize it.
TheDevilsAdvokaat@reddit
I think your title itself is a fallacy.
Sometimes systems DO fail for one reason. I agree that many times they do not, but sometimes it really is one thing.
CurtainDog@reddit
Eh, let historians decide on "truth", engineers should be more concerned with the utility of their models. In this case the Root Cause model is unhelpful, so it should be discarded.
TheDevilsAdvokaat@reddit
Eh in return. You commented on a fallacy, so I pointed out your own fallacy.
LookingRadishing@reddit
This is very true -- especially as systems get more complex. Root cause analysis is often scapegoating dressed-up so as to look professional.
Sometimes I feel like we aren't all that much more advanced than the Aztecs and their practice of human sacrifice. In our own ways we do similar things, but we've convinced ourselves that it's rational. The irony is that in the process, we do very little to address the underlying problem(s). We treat the symptoms while the disease lingers in dormancy.
crashorbit@reddit
There is always a key factor for each failure.
From the article
Cascades like the above are a red flag. They are a sign of immature capability. The root problem is a fear of making changes. It's distrust in the automation. It's a lack of knowledge in the team of how to operate the platform.
You develop confidence in your operational procedures, including your monitoring, by running them.
Amateurs practice till they get it right. Professionals practice till they can't get it wrong.
Ok-Substance-2170@reddit
And then stuff still fails in unexpected ways anyway.
grauenwolf@reddit
While that's certainly a possibility, the "database ran out of memory" is something I expect to happen frequently. There's no reason to worry about the unexpected when you already know the expected is going to cause problems.
Ok-Substance-2170@reddit
Frequently? You might need bigger boxes.
grauenwolf@reddit
Look at your actual execution plan.
In SQL Server the warning you are looking for is "Operator used tempdb to spill data during the execution". This means it unexpectedly ran out of memory.
I forget what the message is when it planned to use TempDB because it knew there wouldn't be enough memory. And of course each database handles this differently, but none should just crash.
Ok-Substance-2170@reddit
That's interesting, thanks.
I'm kinda just poking fun at the idea that maturity models and endless practice can defeat Murphy's laws.
grauenwolf@reddit
There's a lot to be learned from the original Murphy's law. To paraphrase, "If there are two ways of doing something, and one will result in catastrophe, someone will do it that way."
The answer isn't to shrug. It is to eliminate the wrong way as a possibility. In the original story, some sensors could be installed forward or backward. By changing the mounts so that they could only be installed one way, installing them backwards would no longer be possible.
We see this all the time with electrical and electronic connectors. If they aren't keyed, people will install them upside-down. (Or in my case, off by one pin. Man that was embarrassing.)
There are always going to be things you can't anticipate. But so much of what we do can be anticipated if we just stop to ask, "What happens if X fails?".
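The software version of keying the connector is making the wrong combination unrepresentable, for example in the type system. A small Go sketch of the idea (the angle types are invented for illustration, not a reference to any real avionics API):

```go
// Distinct types for the two "pins" mean they can't be swapped, the same way
// a keyed plug only fits one way: the wrong hookup becomes a compile error
// instead of a runtime surprise.
package main

import "fmt"

// Separate types instead of two bare float64 parameters that could be swapped.
type PitchAngle float64
type RollAngle float64

type Attitude struct {
	Pitch PitchAngle
	Roll  RollAngle
}

func NewAttitude(pitch PitchAngle, roll RollAngle) Attitude {
	return Attitude{Pitch: pitch, Roll: roll}
}

func main() {
	a := NewAttitude(PitchAngle(2.5), RollAngle(-1.0))
	fmt.Printf("pitch=%.1f roll=%.1f\n", float64(a.Pitch), float64(a.Roll))

	// NewAttitude(RollAngle(-1.0), PitchAngle(2.5)) // does not compile: the wrong way is impossible
}
```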
Ok-Substance-2170@reddit
Well yeah I don't think anyone can work with technology and stay employed if they don't think about what might fail and what can we do about it.
grauenwolf@reddit
Oh I wish that was true.
Ok-Substance-2170@reddit
It is where I work.
crashorbit@reddit
Stuff will always fail in novel ways. It's when it keeps failing in well-known ways that the maturity level of the deployed capability is exposed.
https://en.wikipedia.org/wiki/Capability_Maturity_Model
Ok-Substance-2170@reddit
You should tell AWS and Azure about that I guess.
omgFWTbear@reddit
But can i introduce you to my friend, profits?
br0ck@reddit
During the Azure Front Door outage two weeks ago, they linked from the alert on their status page to their doc telling you that you should have your own backup outside of Microsoft in case Front Door fails, with specifics on how to do that, and... that page was down due to the outage.
BigHandLittleSlap@reddit
Someone should tell them about circular dependencies like using a high-level feature for low-level control plane access.
jug6ernaut@reddit
Or in ways you completely expect but can’t realistically account for until it’s a “priority” to put man-hours into addressing, be that architectural issues, dependencies, vendors, etc.
syklemil@reddit
And in those cases you hopefully have an error budget, so you're able to make some decisions about how to prioritise, and not least reason around various states of degradation and their impact.
In the case of a known wonky subsystem, the right course of action might be to introduce the ability to run without it, rather than debug it.
Cheeze_It@reddit
You do know who capitalists will hire right?
Last-Independence554@reddit
> Cascades like the above are a red flag. They are a sign of immature capability.
I disagree (although the example in the article isn't great and is confusing). If you have a complex, mature system and maturity/experience in operating it, then any incident usually has multiple contributing factors / multiple failures. Addressing any one of these could have / should have prevented the incident or significantly reduced its impact.
Sure, if the unoptimized query got shipped to production without any tests, without any monitoring, no scaling, etc. then it's a sign of an immaturity. But often, these things were in place, but they had gaps or edge cases.
Sweet_Television2685@reddit
and then management lays off the professionals and keeps the amateurs and shuffles the team, true story!
crashorbit@reddit
It reminds me of the aphorism: "Why plan for failure? We can't afford it anyway."
Linguistic-mystic@reddit
Yes. We've just discovered a huge failure in our team's code and it's indeed lots of causes:
- one piece of code not taking a lock on the reads (only the writes) for performance reasons
- another piece of code taking the lock correctly but still in a race with the previous piece
- even then, there was no race because they ran at different times. But then we split databases and now there were foreign tables involved, slowing down the transactions - that's when the races started
- turns out, the second piece of code is maybe not needed anymore at all since the first was optimized (so it could've been scrapped months ago)

There is no single method or class to blame here. Each had its reasons to be that way. We just didn't see how it would all behave together, had no way to monitor the problem, and the clients didn't notice for months (we learned of the problem from a single big client). It's a terrible result, but it showcases the real complexity of software development.
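A stripped-down Go sketch of the shape described above, with an invented Ledger type standing in for the real code: the writer takes a lock, the reader skips it "for performance", and `go run -race` flags the data race even on runs where the numbers happen to look right:

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a stand-in for the shared state the two pieces of code touched.
type Ledger struct {
	mu      sync.Mutex
	balance int
}

// Credit takes the lock, like the code path that guarded its writes.
func (l *Ledger) Credit(amount int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.balance += amount
}

// Balance skips the lock "for performance", like the read path described
// above. This is a data race: the read can observe a stale or inconsistent value.
func (l *Ledger) Balance() int {
	return l.balance
}

func main() {
	l := &Ledger{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			l.Credit(1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = l.Balance() // racy read
		}
	}()
	wg.Wait()
	fmt.Println("final balance:", l.Balance())
}
```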
LessonStudio@reddit
I would argue that less than 5% of programmers really grok threading at any level. I also argue that threading is way more than threads in a single application, but inter process communications, networked communications, and that even the user's GUI and potentially crap in their head is kind of a thread.
Your DB was just one more "thread"
What I often see as a result is wildly over-aggressive locking, to the point where the code really isn't multithreaded because nothing runs in parallel; everything is just waiting for locks to free up. And that one-writer, many-readers case is a great place for people to get things wrong.
A solid sign that people don't understand threading is code that just sledgehammers the threads into working at all.
bwainfweeze@reddit
You have neither a single source of truth nor a single system of record.
That’s your root cause. The concurrency bugs are incidental to that.
grauenwolf@reddit
That sounds like the root cause to me. It should have used a reader-writer lock instead of just hoping that no writes would overlap a read.
By root cause I don't mean "this line of code was wrong". I mean "this attitude towards locking was wrong" and the code needs to be reviewed for other places where reads aren't protected from concurrent writes.
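A minimal sketch of that fix on the same invented Ledger example from above: a `sync.RWMutex` lets readers run in parallel with each other but never overlap a writer.

```go
package main

import (
	"fmt"
	"sync"
)

type Ledger struct {
	mu      sync.RWMutex
	balance int
}

func (l *Ledger) Credit(amount int) {
	l.mu.Lock() // exclusive: blocks all readers and writers
	defer l.mu.Unlock()
	l.balance += amount
}

func (l *Ledger) Balance() int {
	l.mu.RLock() // shared: readers don't block each other, only writers
	defer l.mu.RUnlock()
	return l.balance
}

func main() {
	l := &Ledger{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				l.Credit(1)
				_ = l.Balance()
			}
		}()
	}
	wg.Wait()
	fmt.Println("final balance:", l.Balance()) // always 4000, and race-clean
}
```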
For the counterfactual analysis, let's consider the other possibilities.
Item #2 was correctly written code. You can't search the code for other examples of correctly written code to 'fix' as a pre-emptive measure. Therefore it wasn't a root cause.
Item #3 was not incorrectly written code either. Moreover, even if it wasn't in place the race condition could still be triggered, just less frequently. So like item 2, it doesn't lead to any actionable recommendations.
Item #4 is purely speculative. You could, and probably should, ask "maybe this isn't needed" about any feature, but that doesn't help you solve the problem beyond a generic message of "features that don't exist don't have bugs".
vytah@reddit
I grew up watching the [Mayday](https://en.wikipedia.org/wiki/Mayday_(Canadian_TV_series)) documentary series. It taught me that virtually any failure has multiple preventable causes.
MjolnirMark4@reddit
Makes me think about what someone said about accidents involving SCUBA tanks: the actual error happened 30 minutes before the person went under water.
Example: the person filling the tank messed up settings, and the tanks only had oxygen in them. When the guys were underwater, oxygen toxicity set in, and it was pretty much too late to do anything to save them.
LessonStudio@reddit
When I trained for some work related diving, they also used re-breathers. One of the things they did was give us all a "taste" of pure oxygen, argon, and pure nitrogen, vs air.
They told us, remember this, and if you tasted these from your air tank, don't go down.
The nitrogen was interesting because it had no taste at all. Air tasted dry and cool, but oxygen had a "crisp" feeling as it passed through my mouth. Argon was different; I'm not sure I would pick up on that one.
This was to avoid this very thing, as all these gases were on deck and some asshat might mix them up when refilling our tanks. I suspect there were other measures than beaten-to-hell labels keeping us safe, but it was nice to know this, and to understand that if we even had a hint something wasn't right, we should investigate, because this was a possibility.
Had they not done this and my air tasted weird, I would have assumed it was a dirty regulator or something.
wslagoon@reddit
The defect that crippled Apollo 13 happened months before the mission.
Jim_84@reddit
Garbage blog post based on playing loose with definitions. The term "root cause" implies the existence of a chain of causes, of which some are more important than others in causing the failure.
Jim_84@reddit
Sometimes there's one cause, sometimes it's more complex. I guess a realistic stance doesn't generate much engagement though.
SuperfluidBosonGas@reddit
This is my favorite model of explaining why catastrophic failures are rarely the result of a single factor: https://en.wikipedia.org/wiki/Swiss_cheese_model
grauenwolf@reddit
In the case of this article, it was a single factor. A database that crashes in response to a low memory situation is all hole and no cheese.
And that's often the situation. The temptation is to fix "the one bad query". And then next week, you fix "the other one bad query". And the week after that you fix the "two bad queries". They never do the extra step and ask why the database is failing when it runs out of memory. They just keep slapping on patches.
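The real fix for that class of problem lives in the database engine's memory handling, which is the comment's point. As a complementary guardrail on the application side, bounding every query with a deadline at least keeps a runaway query from piling up indefinitely. A sketch using Go's `database/sql`; the driver choice, DSN, and query are placeholders:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // one possible Postgres driver; an assumption, not from the thread
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Every query gets a deadline; one that would otherwise run away is
	// cancelled by the driver instead of consuming resources indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, "SELECT id FROM orders WHERE created_at > now() - interval '1 day'")
	if err != nil {
		log.Printf("query failed or timed out: %v", err)
		return
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
	}
	if err := rows.Err(); err != nil {
		log.Printf("iteration error: %v", err)
	}
}
```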
phillipcarter2@reddit
Always love an opportunity to plug: https://how.complexsystems.fail/
swni@reddit
A fine essay but I think it goes a little too far to absolve humans of human error. It is true that there is a bias in retrospective analysis to believe that the pending failure should have appeared obvious to operators, but conversely there is also a bias for failures to occur to operators that are error-prone or oblivious.
Humans are not interchangeable and not constant in their performance. Complex systems require multiple errors to fail (as the essay points out), and as that threshold rises, the failure rate of skilled operators declines faster than the failure rate of less-skilled operators, so system failures increasingly occur only in the presence of egregious human error.
BinaryIgor@reddit
I don't know, it feels like a semantics exercise. From the article's example:
To me, the root cause was that the database ran out of memory. Sure, then you can ask why did the database run out of memory, but that's a different thing.
El_Wij@reddit
What failed first?
Haplo12345@reddit
The phrase for that is "due to a confluence of events".
RobotIcHead@reddit
People love to think it is just one problem that is causing systems to fail, and that fixing it will fix everything. Usually it is multiple underlying issues that were never addressed, combined with some larger ongoing problems, and then one or two huge issues that happen at once.
People are good at adapting to problems, sometimes too good at working around them, putting in temporary fixes that become permanent and building on unstable structures. It is the same in nearly everything people create. It takes disasters to force people to learn and to force those in charge to stop actions like that from happening in the first place.
jogz699@reddit
I’d highly recommend checking out John Allspaw, who coined the “infinite hows” in the incident management space.
I’d highly recommend reading up on Allspaw’s work, then supplementing it with some systems thinking stuff (see: Drifting into Failure by Sidney Dekker).