Does anyone know about Linux crisis tools? I think the sos command is missing from this list
Posted by jlrueda@reddit | linuxadmin | 15 comments
Brendan Gregg published a Linux Crisis Tools list in 2024 — https://www.brendangregg.com/blog/2024-03-24/linux-crisis-tools.html — covering everything from procps to bpftrace. It's an excellent reference and if you manage Linux systems it's worth bookmarking.
But reading through his outage scenario something stood out: at 4:55pm the team reverted a VM snapshot to restore the site. Problem "solved." Except all the logs, all the command outputs, every piece of forensic evidence — gone. The outage returned at 12:50am because the root cause was never found.
I think that there's one tool missing from his list: the sos command.
I would have run it during the incident, before anyone touched anything else. It would have captured a complete picture of system state (logs, configs, running processes, network stats, storage info) in a single archive (possibly encrypted, though given that the server was faulty, maybe not). After the snapshot restore the team would still have everything needed to find the actual root cause, without racing the clock on a live production system.
sos is open source, pre-installed on most enterprise Linux distros, and takes literally one command. It should be standard practice alongside every other crisis tool on Brendan's list.
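As a rough sketch of what that one command could look like when scripted, the Python below builds and runs a non-interactive `sos report`. It assumes a recent sos release where `sos report` accepts `--batch`, `--tmp-dir`, `--label`, and `--only-plugins`. The function names, the label, and the default directory are hypothetical, and a full report needs root.

```python
import shutil
import subprocess


def build_sos_command(tmp_dir="/var/tmp", label=None, only_plugins=None):
    """Build the argument list for a non-interactive `sos report` run."""
    # --batch skips the interactive prompts; --tmp-dir sets where the archive lands.
    cmd = ["sos", "report", "--batch", "--tmp-dir", tmp_dir]
    if label:
        # The label becomes part of the archive name (keep it alphanumeric).
        cmd += ["--label", label]
    if only_plugins:
        # Restricting plugins keeps the run shorter on an already struggling host.
        cmd += ["--only-plugins", ",".join(only_plugins)]
    return cmd


def capture_sos_report(**kwargs):
    """Run sos report and return its exit code; raise if sos is not installed."""
    if shutil.which("sos") is None:
        raise FileNotFoundError("sos is not installed or not on PATH")
    return subprocess.run(build_sos_command(**kwargs), check=False).returncode


if __name__ == "__main__":
    # Hypothetical label tying the archive to the incident; only prints the command.
    print(" ".join(build_sos_command(label="prerestore")))
```

Calling `capture_sos_report(label="prerestore")` before the snapshot revert would leave an archive in `/var/tmp`, which still has to be copied off the VM or it is lost with the rollback.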
What do you guys think? Are there any other tools available to solve this?
Single-Virus4935@reddit
Most of the stuff like logs etc. should be sent to a remote system and ingested. If a VM is compromised, take a snapshot of the compromised state. If the host is compromised, power off, take a disk image, and reprovision.
nethack47@reddit
Do you learn about sos during certification?
I did not know it, and most people I interview don’t mention it in a recovery scenario.
Will have a look at it later today now.
I tend to view a lot of data since the terminal often is the only thing I can trust. Rolling back to a snapshot really isn’t my go to resolution.
fatmanwithabeard@reddit
I run stateless mostly. Logs are captured elsewhere.
Any restart is effectively a rollback recovery.
We try to run an sos report as part of our non-standard restart procedures, but due to node configurations, we're successful less than 75% of the time. (To be fair, nearly 15% of non-standard node restarts are NMIs or remote power cycles, so, yeah.)
(I learned about sos while looking for any easy way to get reports from a TLA we supported. They still insisted on sending us logs by fax.)
Hotshot55@reddit
I wouldn't call sos a recovery scenario tool, it's more of a general troubleshooting tool.
If I'm in recovery mode, I'm restoring access, not trying to find a root cause.
nethack47@reddit
Now that I know about the tool, and have had a chance to test it, it is a useful tool for when something is wonky and I am not in a hurry.
When I am in a recovery situation, this is slow and not very useful. Just checking top, logs, and other indicators is much more efficient and informative.
When I ask a candidate about tools they might use when investigating a server with a vague "it's not working right", this seems to fit for data gathering. That said, so many people don't think about checking logs and dmesg when they get something vague.
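For illustration, a quick first-look pass of that kind could be scripted roughly as below. The list of commands is an assumption about what a typical systemd-based host has installed (`dmesg --level` comes from util-linux, `journalctl` from systemd), and `QUICK_CHECKS` and `run_checks` are hypothetical names, not part of any tool mentioned here.

```python
import subprocess

# Assumed quick checks; adjust for the distro and for what is installed.
QUICK_CHECKS = [
    ["uptime"],
    ["free", "-m"],
    ["df", "-h"],
    ["dmesg", "--level=err,warn"],
    ["journalctl", "-p", "err", "-b", "--no-pager", "-n", "50"],
]


def run_checks(checks=QUICK_CHECKS, timeout=10):
    """Run each command and return {command string: output or a skip note}."""
    results = {}
    for cmd in checks:
        key = " ".join(cmd)
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            # Fall back to stderr so permission errors are visible too.
            results[key] = proc.stdout or proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing binary or a hung command: note it and move on.
            results[key] = f"skipped: {exc}"
    return results
```

Printing `run_checks().items()` gives one block of output per command. The per-command timeout is there so a single hung check does not stall the rest.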
Now I am quite curious. In what scenario would you use the tool?
Hotshot55@reddit
99.9999999% of the time, I'm only ever going to run sos to generate a report for a vendor. If you can log into the system, it'll pretty much always be more useful to read through the current logs.
An sosreport is a collection of logs and stats from a certain point in time, it's more useful for reviewing after the fact or when you don't have access to the system.
jlrueda@reddit (OP)
What certification? I honestly can't remember where or when I learned about it; it was many years ago. But I have used it very often ever since. When you use it as part of your operations, you end up with another kind of problem: archiving, classifying, and tracking down old sosreports and comparing them. But that is a completely different issue. It has been very useful to me.
nethack47@reddit
It looks like one of the functions that would be part of something like a Red Hat certification.
It seems to collect a snapshot of what my basic checks cover. It's an interesting thing to add as an option; I'm just concerned about dropping potentially large archives into /var/tmp that need cleanup. I can see it being very helpful with a bash wrapper that pulls the archive off the server, cleans up, and renames it.
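A minimal sketch of such a wrapper, written in Python rather than bash, might look like this. It assumes archives are named `sosreport-*.tar.*` with an optional `.sha256` or `.md5` checksum file next to them, and that the destination is a directory reachable from the server (for example a mounted share). `newest_sos_archive`, `ship_archive`, and the `tag` argument are hypothetical.

```python
import shutil
import time
from pathlib import Path

CHECKSUM_SUFFIXES = (".sha256", ".md5")


def newest_sos_archive(tmp_dir):
    """Return the most recently modified sos archive in tmp_dir, or None."""
    candidates = [
        p for p in Path(tmp_dir).glob("sosreport-*.tar.*")
        if p.suffix not in CHECKSUM_SUFFIXES
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def ship_archive(tmp_dir, dest_dir, tag):
    """Move the newest archive to dest_dir under a tagged name, then clean up."""
    archive = newest_sos_archive(tmp_dir)
    if archive is None:
        return None
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = dest / f"{tag}-{stamp}-{archive.name}"
    shutil.move(str(archive), str(target))
    # Remove any checksum file left next to the original archive.
    for suffix in CHECKSUM_SUFFIXES:
        archive.with_name(archive.name + suffix).unlink(missing_ok=True)
    return target
```

Something like `ship_archive("/var/tmp", "/mnt/incident-share", "web01")` would then leave `/var/tmp` clean; both paths and the tag are placeholders. Deleting the checksum file is a simplification, and a real wrapper might prefer to move it along with the archive.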
wossack@reddit
When logging a call with Red Hat, you are quite often asked for an sosreport to be uploaded (depending on the issue etc etc)
serverhorror@reddit
What's it part of or where is the Homepage?
Never heard of that tool.
jlrueda@reddit (OP)
https://github.com/sosreport/sos. It is years old and is what Red Hat uses for customer support. You can type: man sos report
gforke@reddit
Debian and openSUSE don't seem to include it by default; only RHEL-family distros seem to have it preinstalled.
ChrisTX4@reddit
It’s more commonly used on RHEL, but you can also use it on other distros.
There’s also a tool called xsos to distill information from sos even quicker, but that throws a lot of errors when not on RH I’ve found.
Kurgan_IT@reddit
TIL about bpfcc-tools. This looks insanely cool. https://www.super-man.dev/package/bpfcc-tools
MaxRK@reddit
I don't think sosreport sits at the low-level observability layers that he's focused on, and he does mention it's a minimal list. It's still up to experienced sysadmins to work out what's good for their environment, including bulk collection scripts like sosreport.
sosreport isn't quite ubiquitous yet if you're aiming for the largest audience; for example, SLES prefers supportconfig. I've also had a couple of incidents where sosreport's resource load and OS probing commands caused production issues. The low-level tools can be more appealing once you know them.