Why is manual root cause analysis still a thing in 2026?
Posted by Heavy_Banana_1360@reddit | sysadmin | 20 comments
Every outage, I'm digging through logs, metrics, and traces like some kind of caveman. Alerts fire, my phone blows up, but actually pinpointing the cause? Hours of toil every time.
AI promises automatic RCA with pattern detection and anomaly flagging, but half the tools I have tried either spit out noise or need constant tuning to stay useful. Proactive detection sounds great until it is paging you at 3am for a CPU blip that resolved itself.
Does anyone actually cut their MTTR meaningfully with this stuff? Or are we all just hoping the next tool is finally the one? What are you running, and does it actually deliver? Tired of senior engineers getting pulled in for things that should be detectable automatically.
SGG@reddit
The best way to view AI is as a combo of the following, because it's basically a "this word follows this word most of the time" machine.
You always need to double-check all of its work. Sure, it can be useful as an advanced form of rubber duck debugging, but if you take its response at face value you'll need to prepare the three envelopes much sooner.
At most I would be using AI to help ingest some logs about an incident.
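Even that "ingest some logs" use case benefits from cheap preprocessing before anything touches a model. A minimal sketch (pure stdlib, hypothetical log lines) that collapses variable parts of log lines into templates and counts them, so the AI, or a human, sees "this error happened 4,000 times" instead of 4,000 near-identical lines:

```python
import re
from collections import Counter

def template(line: str) -> str:
    """Collapse variable parts (hex values, numbers) so similar lines group together."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\d+", "<N>", line)
    return line.strip()

def summarize(lines):
    """Return (template, count) pairs, most frequent first."""
    counts = Counter(template(l) for l in lines if l.strip())
    return counts.most_common()

# hypothetical incident logs
logs = [
    "2026-01-03 04:12:01 ERROR conn to 10.0.0.5 timed out",
    "2026-01-03 04:12:07 ERROR conn to 10.0.0.9 timed out",
    "2026-01-03 04:12:09 WARN retrying request 4831",
]
for tpl, n in summarize(logs):
    print(n, tpl)
```

Feeding deduplicated templates rather than raw logs also keeps the context window small, which matters when an incident produces gigabytes of output.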
InfraScaler@reddit
Because we are still in diapers when it comes to understanding how to steer AI to help us. We're used to deterministic outcomes, but AI is probabilistic. Most people say they understand this, but they don't. Those who do understand it haven't found a surefire solution.
We're working on it anyway. The last 4 months I've been heads-down on this for one of my customers, and it's currently my special interest :) if there are any communities or folks in general working on this, I'm looking forward to collaborating and building together.
My customer is quite particular, so even though I can bring in a lot of context engineering expertise, I haven't got the resources to work on an RCA assistant for stuff like k8s and similar, which I think is where we could add a lot of value.
Ssakaa@reddit
Those who really understand it definitely aren't going to trust it. They'll take it into consideration, just as they'll take the musings of an over-eager intern into consideration. Different perspectives can be valuable. They can also send you down rabbit holes that're completely the wrong path.
Envelope_Torture@reddit
"Why do I still have a job in 2026?"
Ssakaa@reddit
They have a job because they're out here shilling for "AI" to prop up the hype.
AngleBackground157@reddit
What’s your biggest pain point when trying to automate RCA? We use Anomalog to monitor our app and catch errors in real time, and also rely on Sentry and Datadog for broader observability.
Zackey_TNT@reddit
Because AI won't get the job done and is just another buzzword. It's wrong most of the time. We're systems engineers: we get it wrong and companies collapse. Devs get it wrong and they just move it to the next fake sprint.
Sea-Aardvark-756@reddit
This 100%. Even if it correctly identifies the error, it's not going to contextualize it correctly. We recently had some machines with errors that led technicians to believe we had an epidemic of RAM failures, but rolling back the most recent Windows 11 update solved everything. AI is built on a pyramid of knowledge but it doesn't know how to come to new conclusions and add new bricks.
man__i__love__frogs@reddit
The OP is also leaving out a lot of context about their specific situation. On my team, RCA also means identifying risks and improvements, which may end up going to the C-suite.
zatset@reddit
Because understanding the root cause requires analysis. And that requires actual understanding, critical thinking, and answers. Problems are hardly ever "template cases".
ExtensionNervous7500@reddit
Worth checking out hud: they stream runtime data in real time, which actually helps flag anomalies before they blow up rather than just detecting them after the fact. What makes it different is that it ties anomalies to specific code or config changes rather than just surfacing "something looks weird." It reduced toil noticeably for us on the RCA side. For your specific question about MTTR: yes, it's actually moved the number, not just in theory.
fdeyso@reddit
How often do you need to attend RCA meetings? If it's daily/weekly, there are two options: A) people don't know what they're doing, causing constant major outages, and there's probably no proper change control; or B) the org classes outages/incidents as major and requires an RCA without any benefit (like lessons learned).
What you describe can be managed even by Nagios Core, though XI would definitely cover it. You can set up alert age thresholds to log a ticket, and even a schedule for when to log it and when not to; a simple CPU blip would never require an RCA. RCA is needed for events that fall outside an "easily detectable oopsie" and usually legitimately require some subject matter experts to piece together what happened.
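The Nagios side of this can be sketched with standard object-definition directives (host name and check command here are hypothetical; the directives themselves are stock Nagios Core). The idea is to require several consecutive failed checks and delay the first notification, so a self-resolving CPU blip never pages anyone:

```cfg
define service {
    use                      generic-service
    host_name                app01                ; hypothetical host
    service_description      CPU Load
    check_command            check_nrpe!check_load
    max_check_attempts       5    ; 5 consecutive failures before a HARD state
    retry_interval           2    ; minutes between rechecks while SOFT
    first_notification_delay 15   ; wait 15 min before the first page
    notification_interval    60
    notification_period      24x7
}
```

With `first_notification_delay` set, transient blips recover before anyone is woken up, while sustained problems still escalate normally.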
Loki-L@reddit
How do you know if AI is correct?
AI doesn't understand things like cause and effect, it just matches patterns.
It might be okay at writing a reasonable-sounding report based on your inputs and lots of other reasonable-sounding reports it was trained on, but there is no reason it would have to actually be true.
If the goal is just to produce paperwork, that is okay, but if you actually want to fix things and prevent them from happening again, you need facts, not hallucinations.
It would be okay to use AI and other tools to give you a starting point on what to check. Pattern recognition is a thing we use with natural intelligence too, after all. It is how we can read about a major outage in the news and confidently predict it was DNS without knowing anything about their system or what happened.
But for confirmation you need to actually check yourself.
GoldTap9957@reddit
Are you getting enough function-level data from your current tools to actually pinpoint causes, or is it mostly infrastructure-level metrics? In my experience that's the gap, you know the system is unhealthy but not which specific thing made it unhealthy.
barrulus@reddit
I have found it to be invaluable in tracing really obscure performance issues. This is NOT automatic RCA. This is a manual investigation augmented by AI.
I can collect logs from 10 independent systems, crash dumps, and client-side HAR files to perform a massive cross-functional analysis, looking for odd patterns or chasing a hunch.
If I were doing this myself, I'd get there, but it would take weeks. Now I can get it done in an hour or so.
I have created log embedding servers and numerous log-parsing scripts to accelerate this process.
One RAG store has detailed information about the network and server architecture, the applications in use, and more. All of that just to provide meaningful insight as I investigate.
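The retrieval half of a setup like this can be sketched in a few lines. This is a toy stand-in, not the commenter's actual system: it uses bag-of-words cosine similarity where a real deployment would use a trained embedding model, and the architecture notes are invented examples:

```python
import math
import re
from collections import Counter

def vec(text: str) -> Counter:
    """Bag-of-words vector; a real setup would use an embedding model instead."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical architecture notes a RAG store might hold
docs = [
    "payments service talks to postgres via pgbouncer on db01",
    "frontend pods autoscale on cpu in the web namespace",
    "nightly backup job saturates the san around 02:00",
]

def retrieve(query: str, k: int = 2):
    """Return the k notes most similar to the incident query."""
    return sorted(docs, key=lambda d: cosine(vec(query), vec(d)), reverse=True)[:k]

print(retrieve("postgres latency spikes in payments"))
```

The point of the store is exactly what the comment says: the model's answer is only as good as the architecture context you can retrieve and put next to the logs.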
shikkonin@reddit
Because there is still no other way to do it.
HTDutchy_NL@reddit
Because AI is still a black box that decides which words would go together nicely, with a "be nice" modifier, as a response to your input. There is zero capacity for critical thinking.
No ability to take various disconnected subjects and see how they connect, unless some dude on Server Fault or Reddit once saw the same thing.
SlickAstley_@reddit
I've thought this myself.
My guesses would be:
Security: imagine an AI seeing the LDAP names of every user in your business.
AI folk are tied to the industry, and RCA/looking at logs is the only thing keeping them and their kind in a job (so they're doing us a solid, really).
AmazingHand9603@reddit
I wish I could say that AI made a real impact on outage response times, but most of the time the improvements are just incremental. You save a few minutes on obvious problems, but when it's something new, or a chain of failures, you're still stuck hunting for the root cause. The best automation is using scripts to collect logs and metrics faster, but the actual thinking still takes up most of the time.
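That "collect logs faster" step is worth automating even without any AI. A minimal sketch (the file and timestamps are invented for the demo; in practice you'd point it at real log paths) that pulls only the lines stamped after an incident cutoff, so you hand reviewers a small bundle instead of whole log files:

```python
import datetime as dt
import pathlib
import tempfile

def recent_lines(path, since, fmt="%Y-%m-%dT%H:%M:%S"):
    """Yield lines whose leading ISO timestamp is at or after `since`."""
    for line in pathlib.Path(path).read_text().splitlines():
        stamp = line.split(" ", 1)[0]
        try:
            when = dt.datetime.strptime(stamp, fmt)
        except ValueError:
            continue  # skip unstamped lines (stack traces, wrapped output)
        if when >= since:
            yield line

# demo with a throwaway file standing in for something like /var/log/app.log
tmp = pathlib.Path(tempfile.mkdtemp()) / "app.log"
tmp.write_text(
    "2026-01-03T04:00:00 old noise\n"
    "2026-01-03T04:58:00 ERROR db timeout\n"
)
cutoff = dt.datetime(2026, 1, 3, 4, 30)
bundle = list(recent_lines(tmp, cutoff))
print(bundle)
```

Run the same function over each system's logs and concatenate the results, and the incident bundle is ready in seconds; the thinking is still on you, but at least the gathering isn't.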
l0g0ut@reddit
Every time I got pulled into an RCA meeting, I thanked god that I still have a job.