Brendan Gregg's 55-minute outage story has a missing piece — the sos command

Posted by jlrueda@reddit | sysadmin

Brendan Gregg wrote a post in 2024 that every sysadmin should read: "Linux Crisis Tools" — https://www.brendangregg.com/blog/2024-03-24/linux-crisis-tools.html

He walks through a fictional-but-real 55-minute outage where a team couldn't even install iostat because the firewall blocked apt, the filesystem was immutable, and nobody knew the package management policy. The site came back at 4:55pm — not because they found the root cause, but because they reverted a VM snapshot.

Then at 12:50am it happened again. Because nothing was actually fixed.

Brendan's takeaway: pre-install your crisis tools. He's absolutely right.

But there's a second problem his story illustrates that nobody talks about: the snapshot revert destroyed all forensic evidence. No logs. No command outputs. No idea what actually happened.

I think this is where the sos command belongs on that crisis tools list. Run sos report once during the incident, even on a crawling system, and it captures logs, configs, and the output of dozens of diagnostic commands into a single archive. Copy that archive off the host before the snapshot restore, and your team still has what they need to find the real root cause.

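sos is open source, is packaged for most major Linux distros (though not always installed by default), and takes one command to run: sos report (sosreport on older versions). Add it to your crisis toolkit ahead of time.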
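For anyone who wants to wire this into a runbook, here is a minimal sketch of that one command wrapped in Python. The function names and the /var/tmp default are my own placeholders, not anything from Brendan's post or the sos project. It assumes the `--batch` (no interactive prompts) and `--tmp-dir` (where the archive is written) options, and falls back to the older standalone `sosreport` binary if `sos` is not on the PATH.

```python
import shutil
import subprocess


def find_sos_command():
    """Return the base argv for sos, or None if it is not installed.

    Newer releases are invoked as `sos report`; older ones ship a
    standalone `sosreport` binary.
    """
    if shutil.which("sos"):
        return ["sos", "report"]
    if shutil.which("sosreport"):
        return ["sosreport"]
    return None


def build_sos_argv(base, tmp_dir="/var/tmp"):
    """Build a non-interactive invocation that writes its archive to tmp_dir.

    tmp_dir is a hypothetical default; point it at a location with enough
    free space, ideally one that survives the snapshot revert.
    """
    # --batch skips the interactive prompts, which matters mid-incident.
    # --tmp-dir controls where the resulting archive is written.
    return list(base) + ["--batch", "--tmp-dir", str(tmp_dir)]


def collect_sos_report(tmp_dir="/var/tmp"):
    """Run sos once and fail loudly if it was never pre-installed."""
    base = find_sos_command()
    if base is None:
        raise RuntimeError("sos is not installed; install it before the crisis")
    # sos needs root to read most of what it collects.
    subprocess.run(build_sos_argv(base, tmp_dir), check=True)
```

Calling `collect_sos_report()` as root does the collection. The step that actually matters still has to happen afterwards: get the archive off the box (scp, object storage, whatever your policy allows) before anyone reverts the snapshot.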

What do you think? Is there any other tool like this (preferably open-source)?