Does anyone know about Linux crisis tools? I think the sos command is missing from this list

Posted by jlrueda@reddit | linuxadmin

Brendan Gregg published a Linux Crisis Tools list in 2024 — https://www.brendangregg.com/blog/2024-03-24/linux-crisis-tools.html — covering everything from procps to bpftrace. It's an excellent reference and if you manage Linux systems it's worth bookmarking.

But reading through his outage scenario something stood out: at 4:55pm the team reverted a VM snapshot to restore the site. Problem "solved." Except all the logs, all the command outputs, every piece of forensic evidence — gone. The outage returned at 12:50am because the root cause was never found.

I think that there's one tool missing from his list: the sos command.

I would have run it during the incident, before anyone touched anything else. It would have captured a complete picture of system state (logs, configs, running processes, network stats, storage info) in a single archive. The archive can optionally be encrypted, though given that the server was already faulty I might have skipped that. After the snapshot restore the team would still have everything needed to find the actual root cause, without racing the clock on a live production system.

sos is open source, pre-installed on most enterprise Linux distros, and takes literally one command (sos report, or sosreport on older versions). It should be standard practice alongside every other crisis tool on Brendan's list.
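For anyone who wants to put this into a runbook, here is a rough sketch of how that one command could be wrapped so it runs unattended. It is my own illustration, not something from Brendan's post. The helper names, the incident ID and the /var/tmp output directory are all made up for the example. The flags (--batch, --tmp-dir, --case-id, --all-logs) are ones I know from sos 4.x, so check `sos report --help` on your version first.

```python
import shutil
import subprocess
from typing import List, Optional


def build_sos_command(case_id: str, tmp_dir: str = "/var/tmp",
                      all_logs: bool = False) -> List[str]:
    """Build the argument list for a non-interactive `sos report` run.

    case_id and tmp_dir are illustrative defaults, not recommendations.
    """
    # --batch skips the interactive prompts, which matters mid-incident.
    cmd = ["sos", "report", "--batch",
           "--tmp-dir", tmp_dir,
           "--case-id", case_id]
    if all_logs:
        # Also collect rotated logs; the archive gets larger and slower to build.
        cmd.append("--all-logs")
    return cmd


def collect_report(case_id: str, tmp_dir: str = "/var/tmp") -> Optional[int]:
    """Run sos if it is installed; return its exit code, or None if missing."""
    if shutil.which("sos") is None:
        print("sos not found on PATH; install the sos package first")
        return None
    # sos needs root to read most of what it collects.
    result = subprocess.run(build_sos_command(case_id, tmp_dir), check=False)
    return result.returncode
```

Calling `collect_report("INC-1234")` as root would leave the archive under /var/tmp. One thing to keep in mind: the archive has to be copied off the VM before the snapshot is reverted, otherwise it gets rolled back along with everything else.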

What do you guys think? Are there any other tools available to solve this?