The Root Cause Fallacy: Systems fail for multiple reasons, not one
Posted by dmp0x7c5@reddit | programming | View on Reddit | 84 comments
SanityInAnarchy@reddit
I'm curious if there are places that do RCA so religiously that they don't consider these other contributing factors. I've worked in multiple places where the postmortem template would have a clear "root cause" field to fill in, but in a major outage, you'd end up with 5-10 action items to address other contributing factors.
Every postmortem I've ever written for a major incident had a dozen action items.
Murky-Relation481@reddit
If they're doing RCA religiously then they would be getting multiple action items. Root cause analysis is analyzing the root cause, not just identifying the thing that went wrong. It's how you got to the thing that went wrong in terms of process and procedures.
SanityInAnarchy@reddit
Right, but these "root cause is a fallacy" comments talk about how it's never just one thing, as if there's a level of "RCA" religion where you insist there can only be one root cause and nothing else matters.
ThatDunMakeSense@reddit
Yeah, it's mostly because people don't understand how to determine a root cause. They go "oh, this code had a bug" and say "that's the root cause" instead of actually looking at the things that allowed that to happen. I would say generally, if someone does an RCA that doesn't come up with a number of action items, then they've probably not done it right. It's not *impossible* I suppose, but I've personally never seen a well-done RCA with only one action item.
En-tro-py@reddit
Exactly, and 5-Whys is just a warm-up...
Why was it missed in review? etc.
Eventually you reach the level/scope of the 'why' that is outside your pay-grade/jurisdiction/control and that's when you know it's been done right.
We can't change the whole world, but we can change a lot about how this gets done to prevent that from happening again.
Murky-Relation481@reddit
It's more so that a ton of people don't actually know what RCA is and practice it wrong, which is why so many people are commenting on why there are multiple causes.
CurtainDog@reddit
Interesting choice of words. Religion is the OG RCA when you think about it.
LessonStudio@reddit
When I was training as a pilot, there were old VHS cassettes of good, non-hyperbolic accident investigations, not like those stupid crash "documentaries".
It was crash after crash after crash and their investigations.
Being 19, we had way too much fun watching these, as we lived at the school, and between nights and poor weather we had lots of time on our hands.
The story was pretty much the same for every crash: a series of factors, any one of which, if removed, would have turned a huge accident into something potentially not even worth reporting, or at most a minor maintenance report.
The ones that came close to having a single factor usually involved gross levels of incompetence. But again, it would be layers of different incompetence.
My favourite seemed simple: a crash in a Florida swamp where one of the gear-down lights had burned out. So they lowered the gear, and the three pilots (including the engineer) were screwing with the bulb right down into the ground, as nobody was paying attention to flying the plane. I would argue that this wasn't only a bulb and bad pilots, but that the bulb should never have been a single point of failure. There should have been redundant bulbs, or something. It was also a failure of the engineers and a lack of safety-critical design. And a lack of training that would tell them that if it doesn't light up, it could be the bulb, not the gear, and on and on. Many failures.
Other "simple" accidents like the Concorde getting taken out by a single piece of scrap still really had a long set of failures going back to the problem of that plane barely being able to fly, and that the loss of an engine could be so catastrophic.
The Gimli glider might have been one of the shorter ones, and still was a chain of mistakes, that removing any single one would have probably turned it into something not even mentioned in a log somewhere.
CurtainDog@reddit
I love this kind of stuff. The 737 MAX is a particularly interesting case. A whole raft of failures. But I think the lion's share of the blame should have gone to the FAA, for making the cost of compliance too high.
LessonStudio@reddit
I build systems which are often safety/mission critical, and I will impose some standards on projects where this is not a requirement.
Doing it in the classic engineering V Matrix style is begging for unsafe products.
My theory is that people focus less on making a good product than on making good paperwork, and a product which complies with the paperwork. Then, when they are holding the earliest versions in their hands, the obvious iterations they need to do in order to make a great product are now prohibitively expensive. Not only because you have to push those changes all the way back to the start, but because there is a good chance the team has been disbanded or moved on to another project. So, you keep your mouth shut and point to the fact that everyone signed off and it passed all the tests.
My methodology is to focus on the goal, keep it within known and immutable constraints, and then just go to town. Make the product with many iterations, prototypes, and whatnot. Doing this while keeping various safety critical things in mind. Things like selecting lockstep MCUs, etc.
Then, when the product is nailed down, you blast through the process of papering the product properly. It is super easy to document the requirements in detail when you are holding it in your hand, all the way to the other end of testing it, when you are dead certain of the final product. Everything during design and development is now just copy-paste. Even the testing people could have been validating their tests against the prototype in anticipation that the "real" product will be no different.
This papering process is fast and requires very few people.
You are 100% correct that regulators, lawyers, and insurance companies have enabled the worst kinds of engineers to rise to positions of power. The librarians of the engineering world instead of the builders. Those librarians should be more like archivists: let the builders do their thing and hand it over to them to document, not the other way around.
En-tro-py@reddit
Unfortunately, I've seen this too often... FMEAs done as a deliverable long after they would be useful...
It's what I'd call the human equivalent of AI slop... As long as it looks right and no one notices, it's all good, until it isn't...
En-tro-py@reddit
Good designers do this whenever possible, but the bean counters love to cut 'waste' to save pennies.
grauenwolf@reddit
This is your root cause. There is a design flaw in the database server that causes it to run out of memory and crash when a query is too complex. That shouldn't even be possible. There are known ways for the database to deal with this.
This is a contributing factor in delaying the recovery, but not the root cause. While it needs to be fixed, it doesn't change the fact that the database shouldn't have failed in the first place.
Again, this is a contributing factor in delaying the recovery. The database should have, at worst, suffered a performance degradation, or maybe killed the one query that exceeded its memory limit. And then, after the issue occurred, the scaling policy should have kicked in to reduce the chance of a recurrence.
Unoptimized queries are to be expected. Again, databases should not crash because a difficult query comes through.
There was no root cause analysis in this article.
Root cause analysis doesn't answer the question, "What happened in this specific occurrence?". It answers the question, "How do we prevent this from happening again?".
What this article did was identify some proximate causes. It didn't take the next step of looking at the root causes.
Not all of these questions will lead you to a root cause, but not asking them will guarantee that the problem will occur again.
moratnz@reddit
I'm glad I'm not the only one thinking this. The database running out of memory isn't the root cause of the crash; it's the proximal cause. The root cause is almost certain to be something organisational that explains why it was possible to have dangerously unoptimised queries in production, and why there was no monitoring and alerting to catch the issue before it broke stuff.
Similarly, the linked reddit comment says that when looking at the 737max, root cause analysis gets you to the malfunctioning angle of attack sensor; no it doesn't - again that's the start of the hunt. The next very obvious question is why the fuck is such a critical sensor not redundant, and on we go.
Ultimately, yeah, eventually we're going to end up tracing a lot of root causes back to the problem being something societal ('misaligned incentives in modern limited liability public companies' is a popular candidate here), but that doesn't mean root cause analysis is useless, just that in practical terms you're only going to be able to go so far down the causal chain before solving the problem moves out of your control.
Kache@reddit
IME, tracing deeply pretty much has to end at "societal" because past that, it can start to get finger-pointy and erode trust
En-tro-py@reddit
People aren't the problem, the problem is the problem.
It's often the lowest grunt who gets the blame, when the actual fault lies with the management and the environment, which never get questioned.
People can cause problems, but the root cause is generally lack of support/training, accountability, etc.
Plank_With_A_Nail_In@reddit
I have worked on systems that were so poorly designed that crazy SQL was the best anyone could do.
grauenwolf@reddit
That's been my life for most of the year. Their data model was flat out garbage, but I couldn't change it.
Murky-Relation481@reddit
I swear people think root cause analysis is figuring out what went wrong and not why it went wrong.
Almost every single root cause analysis system starts with already knowing what went wrong. You can't figure out the why if not for the what.
Pinewold@reddit
I worked on five nines systems. A good root cause analysis seeks to find all contributing factors, all monitoring weaknesses, all of the alert opportunities, all of the recovery impediments. The goal is to eliminate all contributing factors, including the whole class of problem. So a memory failure would prompt a review of memory usage, garbage collection, memory monitoring, memory alerts, memory exceptions, memory exception handling, and recovery from memory exceptions.
This would be done for all of the code, not just the offending module. When there are millions of dollars on the line every day, you learn to make everything robust, reliable, redundant, restartable, replayable and recordable. You work to find patterns that work well and reuse them over and over again to the point of boredom.
At first it is hard, but over time it becomes second nature.
You learn the limits of your environment, put guard posts around them, work to find the weakest links and rearchitect for strength.
Do you know how much disk space your programs require to install and consume on a daily basis? Do you know your program memory requirements? What processes are memory intensive, storage intensive, network intensive? How many network connections do you use? How many network connections do you allow? How many logging streams do you have? How many queues do you subscribe to, how many do you publish to?
19andNuttin@reddit
Given your experience working on these systems, is there anything you’ve learned over time that you wish you’d known earlier?
Maybe something that people often underestimate or only realize after a few years of doing this kind of work?
Pinewold@reddit
Find reliable patterns and use them relentlessly! To make it reliable, load test everything to the breaking point and understand why it fails.
The heavy load exposes memory leaks, excessive disk utilization, network constraints and hardware constraints. Even multithreaded issues happen more frequently under load.
Turns out even state-of-the-art routers can fail when you get over a million simultaneous connections. By limiting connections to 700k we spared the router and avoided swamping the server with dropped connections. It is cool to watch half a million connections breeze through with no issues.
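A minimal sketch of the kind of load ramp being described, in Go; the target URL, step sizes, and request counts are placeholders rather than anything from the comment:

```go
// Ramp up concurrency in steps and count failures at each step; the point at
// which errors start climbing is where you go digging for the weakest link.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func hammer(url string, workers, perWorker int) (ok, failed int64) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				resp, err := http.Get(url)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	// Hypothetical target; in practice you'd point this at a staging replica.
	const target = "http://localhost:8080/health"
	for _, workers := range []int{100, 1000, 10000} {
		ok, failed := hammer(target, workers, 100)
		fmt.Printf("workers=%d ok=%d failed=%d\n", workers, ok, failed)
	}
}
```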
Last-Independence554@reddit
Not the commentator, but IMHO the key mindset is: Defense in depth. There will be failures, there will be bugs.
Obviously you should try to catch these bugs and failures early, but things will slip through the cracks. So think about how you can keep them from exploding, or mitigate the damage when they do. Think about what-ifs. What if this part/system fails? How will the rest of my system react? Can I make it more resilient? Is there a reasonable fallback? Can I shed load? Will the system degrade gracefully or will it fall apart? If this API call fails, can I still continue? Do humans need to do repetitive tasks / manual checks for deployment? Can these be automated? Can I test / simulate failures? Can I simplify my system? What would happen if the number of requests suddenly multiplies? What are my failure zones? What's the blast radius of each of these? Are they acceptable?
Also, the Google SRE book is always a good resource: https://sre.google/sre-book/table-of-contents/
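To make one of those what-ifs concrete, here is a minimal Go sketch of the "if this API call fails, can I still continue?" question: a short timeout plus a fallback, so the dependency failing degrades the response instead of taking the caller down. The recommendation service and fallback values are hypothetical stand-ins:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errSlow = errors.New("recommendation service timed out")

// fetchRecommendations simulates a flaky downstream dependency.
func fetchRecommendations(ctx context.Context, userID string) ([]string, error) {
	select {
	case <-time.After(2 * time.Second): // pretend the service is slow today
		return []string{"live-result"}, nil
	case <-ctx.Done():
		return nil, errSlow
	}
}

// recommendationsWithFallback degrades gracefully: a bounded wait plus a
// cached/default answer keeps the page rendering when the dependency is down.
func recommendationsWithFallback(userID string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	recs, err := fetchRecommendations(ctx, userID)
	if err != nil {
		return []string{"popular-item-1", "popular-item-2"} // stale/default fallback
	}
	return recs
}

func main() {
	fmt.Println(recommendationsWithFallback("user-42"))
}
```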
Poromenos@reddit
With my experience not working on these systems, here's something I wish I knew earlier: You're not working on these systems. Stop spending time and money trying to make your systems more reliable than you absolutely need to.
If your site being down at 3am means nobody is up to buy anything anyway, fixing it is worth exactly zero dollars to you.
jl2l@reddit
What a garbage take. The problem is it might be 3am in your time zone, but the Internet is 24/7. It's 5pm in Australia and they buy shit, last time I checked.
grauenwolf@reddit
My pizza shop only has a 5 mile delivery area. Are you really going to order a pizza at 5 pm your time and then fly for 14+ hours to pick it up?
jl2l@reddit
If you are running the pizza shop infrastructure on your local ISP sure. But I bet your pizza shop is running on Shopify or Wix and it also might run hundreds of other pizza shops in different timezones.
grauenwolf@reddit
That scenario was already precluded.
EscapeTomMayflower@reddit
It's so annoying how many companies want to burn through money trying to do "something" even if there's nothing that can be done.
When there was that big AWS outage last month we had teams of people working 16 hours straight to try and "fix" things. None of the fixes worked and we had to spend the rest of the week undoing all the things the fixes broke.
Companies want to become completely dependent on cloud providers and then act like they're not completely dependent on cloud providers.
Norphesius@reddit
The CEO of that company that had all their IoT beds lock up because of the outage made some statement like "our developers are working around the clock until the issue is resolved." It's stupid on two levels, the first being what you said: there was nothing they could do, because the issue was entirely beyond their control at the time.
The really annoying thing is that it was in their control, up until the moment they decided to make their product completely reliant on a network connection. All the "do something now" money they're wasting now would've gone a long way if they actually spent it on designing a product that doesn't brick if someone unplugs a router. If the company wanted devs working around the clock during that incident, it should've been to develop an offline mode update for the beds to deploy ASAP when AWS came back up.
cronofdoom@reddit
Unless you work for an organization that operates globally…
CpnStumpy@reddit
Reminds me of the fishbone diagram exercise I participated in once at a previous job, where the whole product engineering department was sat down in groups of ~3-5 with a poster board to collect the failures and causes in an incident, which were then compiled into a huge fishbone diagram. It was insightful to see how many different points of detail went into the result, each somehow contributing to the incident, many of them invisible to folks from other teams.
We walked out of the exercise with pretty much every team having an action item or three, either code- or process-wise.
Pinewold@reddit
Bet you also walked out with a sense of hope for a more reliable product too!
Murky-Relation481@reddit
Yah, I used to build root cause analysis software long long ago for use in industrial settings and other multi disciplinary engineering systems (think oil and gas, nuclear, aerospace, etc.).
At least for the group we worked with, it was about covering all contributing causes to the same degree as the root cause. This meant you often had trees of causes, with each branch getting as much investigation and evidence gathering as the final cause would. I don't even know if you could call it root cause analysis without that, since root cause analysis is almost always an exercise in pattern recognition and pattern extraction, and then adjusting your practices and procedures to prevent it in the future.
Pinewold@reddit
Good insight! Patterns are key. When we hit a memory error for the first time (as opposed to memory exceptions, which we felt we had well covered), we worked really hard to understand the pattern of when memory errors occurred. We had to refactor for these new patterns and come up with new patterns for handling memory-intensive situations. Automated tests helped us simulate memory errors and verify how we handled them.
mycall@reddit
I like to call it minimal maintenance.
Pinewold@reddit
Agreed, once you realize that reliable code frees you up to write new code, you work harder to get it right the first time!
InTheASCII@reddit
Programmers and IT techs would benefit from reading final FAA reports, especially ones involving complex mechanical failures. It really would open some eyes I think. If we were to assume a failure could cost lives, we would definitely go beyond, "root cause."
Just because something broke, doesn't mean it broke in total isolation, or that it was unavoidable, or that there's a single party responsible for fixing the problem.
Pinewold@reddit
Agreed, our five nines system was the servers for a life alert system, so literally people's lives depended on our servers being up. When that is well understood, it makes it easier to choose to do the right thing.
The most toxic root cause analysis is a witch hunt. I have seen good people thrown under the bus because a senior manager did not want a stain on their reputation. It is a major red flag when anyone says they have never had a server crash!
omgFWTbear@reddit
I’m old, but when I was in school, they taught (what the textbook called “the” but appears to have been “a”) Therac “disaster.”
Code that was technically correct (and before anyone replies, hop in a time machine to 30+ years ago) that, due to a race condition, cooked a human alive.
I’ve certainly worked with a lot of developers since who clearly either did not have that experience or did not internalize it.
TheDevilsAdvokaat@reddit
I think your title itself is a fallacy.
Sometimes systems DO fail for one reason. I agree that many times they do not, but sometimes it really is one thing.
CurtainDog@reddit
Eh, let historians decide on "truth", engineers should be more concerned with the utility of their models. In this case the Root Cause model is unhelpful, so it should be discarded.
TheDevilsAdvokaat@reddit
Eh in return. You commented on a fallacy, so I pointed out your own fallacy.
LookingRadishing@reddit
This is very true -- especially as systems get more complex. Root cause analysis is often scapegoating dressed-up so as to look professional.
Sometimes I feel like we aren't all that much more advanced than the Aztecs and their practice of human sacrifice. In our own ways we do similar things, but we've convinced ourselves that it's rational. The irony is that in the process, we do very little to address the underlying problem(s). We treat the symptoms while the disease lingers in dormancy.
crashorbit@reddit
There is always a key factor for each failure.
From the article
Cascades like the above are a red flag. They are a sign of immature capability. The root problem is a fear of making changes. It's distrust in the automation. It's a lack of knowledge in the team of how to operate the platform.
You develop confidence in your operational procedures, including your monitoring, by running them.
Amateurs practice till they get it right. Professionals practice till they can't get it wrong.
Ok-Substance-2170@reddit
And then stuff still fails in unexpected ways anyway.
grauenwolf@reddit
While that's certainly a possibility, the "database ran out of memory" is something I expect to happen frequently. There's no reason to worry about the unexpected when you already know the expected is going to cause problems.
Ok-Substance-2170@reddit
Frequently? You might need bigger boxes.
grauenwolf@reddit
Look at your actual execution plan.
In SQL Server the warning you are looking for is "Operator used tempdb to spill data during the execution". This means it unexpectedly ran out of memory.
I forget what the message is when it planned to use TempDB because it knew there wouldn't be enough memory. And of course each database handles this differently, but none should just crash.
Ok-Substance-2170@reddit
That's interesting, thanks.
I'm kinda just poking fun at the idea that maturity models and endless practice can defeat Murphy's laws.
grauenwolf@reddit
There's a lot to be learned from the original Murphy's law. To paraphrase, "If there are two ways of doing something, and one will result in catastrophe, someone will do it that way."
The answer isn't to shrug. It is to eliminate the wrong way as a possibility. In the original story, some sensors could be installed forward or backward. By changing the mounts so that they could only be installed one way, installing them backwards would no longer be possible.
We see this all the time with electrical and electronic connectors. If they aren't keyed, people will install them upside-down. (Or in my case, off by one pin. Man that was embarrassing.)
There are always going to be things you can't anticipate. But so much of what we do can be anticipated if we just stop to ask, "What happens if X fails?".
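The software version of keying the connector is making the wrong combination unrepresentable, for example in the type system. A small Go sketch of the idea (the angle types are invented for illustration, not a reference to any real avionics API):

```go
// Distinct types for the two "pins" mean they can't be swapped, the same way
// a keyed plug only fits one way: the wrong hookup becomes a compile error
// instead of a runtime surprise.
package main

import "fmt"

// Separate types instead of two bare float64 parameters that could be swapped.
type PitchAngle float64
type RollAngle float64

type Attitude struct {
	Pitch PitchAngle
	Roll  RollAngle
}

func NewAttitude(pitch PitchAngle, roll RollAngle) Attitude {
	return Attitude{Pitch: pitch, Roll: roll}
}

func main() {
	a := NewAttitude(PitchAngle(2.5), RollAngle(-1.0))
	fmt.Printf("pitch=%.1f roll=%.1f\n", float64(a.Pitch), float64(a.Roll))

	// NewAttitude(RollAngle(-1.0), PitchAngle(2.5)) // does not compile: the wrong way is impossible
}
```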
Ok-Substance-2170@reddit
Well yeah I don't think anyone can work with technology and stay employed if they don't think about what might fail and what can we do about it.
grauenwolf@reddit
Oh I wish that was true.
Ok-Substance-2170@reddit
It is where I work.
crashorbit@reddit
Stuff will always fail in novel ways. It's when it keeps failing in well-known ways that the maturity level of the deployed capability is exposed.
https://en.wikipedia.org/wiki/Capability_Maturity_Model
Ok-Substance-2170@reddit
You should tell AWS and Azure about that I guess.
omgFWTbear@reddit
But can i introduce you to my friend, profits?
br0ck@reddit
During the Azure Front Door outage two weeks ago, they linked from the alert on their status page to their doc telling you that you should have your own backup outside of Microsoft in case Front Door fails, with specifics on how to do that, and... that page was down due to the outage.
BigHandLittleSlap@reddit
Someone should tell them about circular dependencies like using a high-level feature for low-level control plane access.
jug6ernaut@reddit
Or in ways you completely expect but can’t realistically account for until it’s a “priority” to put man-hours into addressing, be that architectural issues, dependencies, vendors, etc.
syklemil@reddit
And in those cases you hopefully have an error budget, so you're able to make some decisions about how to prioritise, and not least reason around various states of degradation and their impact.
In the case of a known wonky subsystem, the right course of action might be to introduce the ability to run without it, rather than debug it.
Cheeze_It@reddit
You do know who capitalists will hire right?
Last-Independence554@reddit
> Cascades like the above are a red flag. They are a sign of immature capability.
I disagree (although the example in the article isn't great and is confusing). If you have a complex, mature system and maturity/experience in operating it, then any incident usually has multiple contributing factors / multiple failures. Addressing any one of these could have / should have prevented the incident or significantly reduced its impact.
Sure, if the unoptimized query got shipped to production without any tests, without any monitoring, no scaling, etc. then it's a sign of an immaturity. But often, these things were in place, but they had gaps or edge cases.
Sweet_Television2685@reddit
and then management lays off the professionals and keeps the amateurs and shuffles the team, true story!
crashorbit@reddit
It reminds me of the aphorism: "Why plan for failure? We can't afford it anyway."
Linguistic-mystic@reddit
Yes. We've just discovered a huge failure in our team's code and it's indeed lots of causes:
- one piece of code not taking a lock on the reads (only the writes) for performance reasons
- another piece of code taking the lock correctly but still in a race with the previous piece
- even then, there was no race because they ran at different times. But then we split databases and now there were foreign tables involved, slowing down the transactions - that's when the races started
- turns out, the second piece of code is maybe not needed anymore at all since the first was optimized (so it could've been scrapped months ago)

There is no single method or class to blame here. Each had its reasons to be that way. We just didn't see how it would all behave together, had no way to monitor the problem, and the clients didn't notice for months (we learned of the problem from a single big client). It's a terrible result, but it showcases the real complexity of software development.
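A stripped-down Go sketch of the shape described above, with an invented Ledger type standing in for the real code: the writer takes a lock, the reader skips it "for performance", and `go run -race` flags the data race even on runs where the numbers happen to look right:

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a stand-in for the shared state the two pieces of code touched.
type Ledger struct {
	mu      sync.Mutex
	balance int
}

// Credit takes the lock, like the code path that guarded its writes.
func (l *Ledger) Credit(amount int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.balance += amount
}

// Balance skips the lock "for performance", like the read path described
// above. This is a data race: the read can observe a stale or inconsistent value.
func (l *Ledger) Balance() int {
	return l.balance
}

func main() {
	l := &Ledger{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			l.Credit(1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = l.Balance() // racy read
		}
	}()
	wg.Wait()
	fmt.Println("final balance:", l.Balance())
}
```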
LessonStudio@reddit
I would argue that less than 5% of programmers really grok threading at any level. I also argue that threading is way more than threads in a single application, but inter process communications, networked communications, and that even the user's GUI and potentially crap in their head is kind of a thread.
Your DB was just one more "thread"
What I often see as a result is wildly over-aggressive locking, to the point where the code really isn't multithreaded because nothing runs in parallel; everything is just waiting for locks to free up. And that one-writer, many-readers case is a great place for people to get things wrong.
A solid sign that people don't understand threading is code that just sledgehammers the threads into working at all.
bwainfweeze@reddit
You have neither a single source of truth nor a single system of record.
That’s your root cause. The concurrency bugs are incidental to that.
grauenwolf@reddit
That sounds like the root cause to me. It should have used a reader-writer lock instead of just hoping that no writes would overlap a read.
By root cause I don't mean "this line of code was wrong". I mean "this attitude towards locking was wrong" and the code needs to be reviewed for other places where reads aren't protected from concurrent writes.
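A minimal sketch of that fix on the same invented Ledger example from above: a `sync.RWMutex` lets readers run in parallel with each other but never overlap a writer.

```go
package main

import (
	"fmt"
	"sync"
)

type Ledger struct {
	mu      sync.RWMutex
	balance int
}

func (l *Ledger) Credit(amount int) {
	l.mu.Lock() // exclusive: blocks all readers and writers
	defer l.mu.Unlock()
	l.balance += amount
}

func (l *Ledger) Balance() int {
	l.mu.RLock() // shared: readers don't block each other, only writers
	defer l.mu.RUnlock()
	return l.balance
}

func main() {
	l := &Ledger{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				l.Credit(1)
				_ = l.Balance()
			}
		}()
	}
	wg.Wait()
	fmt.Println("final balance:", l.Balance()) // always 4000, and race-clean
}
```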
For the counterfactual analysis, let's consider the other possibilities.
Item #2 was correctly written code. You can't search the code for other examples of correctly written code to 'fix' as a pre-emptive measure. Therefore it wasn't a root cause.
Item #3 was not incorrectly written code either. Moreover, even if it wasn't in place the race condition could still be triggered, just less frequently. So like item 2, it doesn't lead to any actionable recommendations.
Item #4 is purely speculative. You could, and probably should, ask "maybe this isn't needed" about any feature, but that doesn't help you solve the problem beyond a generic message of "features that don't exist don't have bugs".
vytah@reddit
I grew up watching the [Mayday](https://en.wikipedia.org/wiki/Mayday_(Canadian_TV_series)) documentary series. It taught me that virtually any failure has multiple preventable causes.
MjolnirMark4@reddit
Makes me think about what someone said about accidents involving SCUBA tanks: the actual error happened 30 minutes before the person went under water.
Example: the person filling the tank messed up settings, and the tanks only had oxygen in them. When the guys were underwater, oxygen toxicity set in, and it was pretty much too late to do anything to save them.
LessonStudio@reddit
When I trained for some work related diving, they also used re-breathers. One of the things they did was give us all a "taste" of pure oxygen, argon, and pure nitrogen, vs air.
They told us, remember this, and if you tasted these from your air tank, don't go down.
The nitrogen was interesting because it had no taste at all. Air tasted dry and cool, but oxygen had a "crisp" feeling as it passed through my mouth. Argon was different; I'm not sure I would pick up on that one.
This was to avoid this very thing, as all these gases were on deck and some asshat might mix them up when refilling our tanks. I suspect there were other measures than beaten-to-hell labels keeping us safe, but it was nice to know this, and to understand that if we even had a hint something wasn't right, we should investigate, because this was a possibility.
Had they not done this and my air tasted weird, I would have assumed it was a dirty regulator or something.
wslagoon@reddit
The defect that crippled Apollo 13 happened months before the mission.
Jim_84@reddit
Garbage blog post based on playing loose with definitions. The term "root cause" implies the existence of a chain of causes, of which some are more important than others in causing the failure.
Jim_84@reddit
Sometimes there's one cause, sometimes it's more complex. I guess a realistic stance doesn't generate much engagement though.
SuperfluidBosonGas@reddit
This is my favorite model of explaining why catastrophic failures are rarely the result of a single factor: https://en.wikipedia.org/wiki/Swiss_cheese_model
grauenwolf@reddit
In the case of this article, it was a single factor. A database that crashes in response to a low memory situation is all hole and no cheese.
And that's often the situation. The temptation is to fix "the one bad query". And then next week, you fix "the other one bad query". And the week after that you fix the "two bad queries". They never do the extra step and ask why the database is failing when it runs out of memory. They just keep slapping on patches.
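The real fix for that class of problem lives in the database engine's memory handling, which is the comment's point. As a complementary guardrail on the application side, bounding every query with a deadline at least keeps a runaway query from piling up indefinitely. A sketch using Go's `database/sql`; the driver choice, DSN, and query are placeholders:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // one possible Postgres driver; an assumption, not from the thread
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Every query gets a deadline; one that would otherwise run away is
	// cancelled by the driver instead of consuming resources indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, "SELECT id FROM orders WHERE created_at > now() - interval '1 day'")
	if err != nil {
		log.Printf("query failed or timed out: %v", err)
		return
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
	}
	if err := rows.Err(); err != nil {
		log.Printf("iteration error: %v", err)
	}
}
```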
phillipcarter2@reddit
Always love an opportunity to plug: https://how.complexsystems.fail/
swni@reddit
A fine essay but I think it goes a little too far to absolve humans of human error. It is true that there is a bias in retrospective analysis to believe that the pending failure should have appeared obvious to operators, but conversely there is also a bias for failures to occur to operators that are error-prone or oblivious.
Humans are not interchangeable and not constant in their performance. Complex systems require multiple errors to fail (as the essay points out), and as that threshold rises, the failure rate of skilled operators declines faster than the failure rate of less-skilled operators, so system failures increasingly occur only in the presence of egregious human error.
BinaryIgor@reddit
I don't know, it feels like a semantics exercise. From the article's example:
To me, the root cause was that the database ran out of memory. Sure, then you can ask why did the database run out of memory, but that's a different thing.
El_Wij@reddit
What failed first?
Haplo12345@reddit
The phrase for that is "due to a confluence of events".
RobotIcHead@reddit
People love to think it is just one problem that is causing systems to fail, and that fixing it will fix everything. Usually it is multiple underlying issues that were never addressed, combined with some larger ongoing problems, and then one or two huge issues that happen at once.
People are good at adapting to problems, sometimes too good at working around them, putting in temporary fixes that become permanent and building on unstable structures. It is the same in nearly everything people create. It takes disasters to force people to learn and to force those in charge to stop actions like that from happening in the first place.
jogz699@reddit
I’d highly recommend checking out John Allspaw, who coined the “infinite hows” in the incident management space.
I’d highly recommend reading up on Allspaw’s work, then supplementing it with some systems thinking stuff (see: Drifting into Failure by Sidney Dekker).