Does anyone know about Linux crisis tools? I think the sos command is missing from this list.

[-]

Single-Virus4935@reddit

Most of the stuff like logs etc. should be sent to a remote host and ingested. If a VM is compromised, take a snapshot of the compromised state. If the host is compromised, power off, take a disk image, and reprovision.

[-]

nethack47@reddit

Do you learn about sos during certification?

I did not know it, and most people I interview don’t mention it in a recovery scenario.

Will have a look at it later today now.

I tend to view a lot of data, since the terminal is often the only thing I can trust. Rolling back to a snapshot really isn’t my go-to resolution.

[-]

fatmanwithabeard@reddit

I run stateless mostly. Logs are captured elsewhere.

Any restart is effectively a rollback recovery.

We try to run an sos report as part of our non-standard restart procedures, but due to node configurations, we're successful less than 75% of the time. (To be fair, nearly 15% of non-standard node restarts are NMIs or remote power cycles, so, yeah.)

(I learned about sos while looking for any easy way to get reports from a TLA we supported. They still insisted on sending us logs by fax.)

[-]

Hotshot55@reddit

I did not know it, and most people I interview don’t mention it in a recovery scenario.

I wouldn't call sos a recovery scenario tool, it's more of a general troubleshooting tool.

If I'm in recovery mode, I'm restoring access, not trying to find a root cause.

[-]

nethack47@reddit

Now that I know about the tool, and have had a chance to test it, it is a useful tool for when something is wonky and I am not in a hurry.
When I am in a recovery situation, this is slow and not very useful. Just checking top, logs and other indicators is much more efficient and informative.

When I ask a candidate about tools they might use when investigating a server with a vague "it's not working right", this seems to fit for data gathering. That said, so many people don't think about checking logs and dmesg when they get something vague.

Now I am quite curious. What scenario would you use the tool?

[-]

Hotshot55@reddit

What scenario would you use the tool?

99.9999999% of the time, I'm only ever going to run sos to generate a report for a vendor. If you can log into the system, it'll pretty much always be more useful to read through the current logs.

An sosreport is a collection of logs and stats from a certain point in time, so it's more useful for reviewing after the fact or when you don't have access to the system.
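To illustrate the "reviewing after the fact" point: a minimal Python sketch that looks inside a report archive on your own workstation without unpacking it. It only assumes the archive is a (possibly compressed) tarball; the function names and the example paths are hypothetical, and the actual layout inside a report depends on the sos version and enabled plugins.

```python
import tarfile


def find_members(archive_path, needle):
    """Return sorted names of regular files whose path contains `needle`."""
    # "r:*" lets tarfile detect the compression (gz, xz, bz2) by itself.
    with tarfile.open(archive_path, "r:*") as tar:
        return sorted(
            m.name for m in tar.getmembers() if m.isfile() and needle in m.name
        )


def read_tail(archive_path, member_name, max_lines=20):
    """Return the last `max_lines` lines of one file stored in the archive."""
    with tarfile.open(archive_path, "r:*") as tar:
        handle = tar.extractfile(member_name)
        if handle is None:  # not a regular file (directory, device, ...)
            return []
        text = handle.read().decode("utf-8", errors="replace")
    return text.splitlines()[-max_lines:]
```

For example, `find_members("report.tar.xz", "var/log")` would list whatever log files were captured (archive name hypothetical), and `read_tail` then shows the end of one of them.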

[-]

jlrueda@reddit (OP)

What certification? I honestly can't remember where or when I learned about it; it was many years ago. But I have used it very often ever since. When you use it as part of your operation, you get another kind of problem: archiving, classifying and tracking down old sosreports and comparing them. But that is a completely different issue. It has been very useful to me.

[-]

nethack47@reddit

It looks like one of the functions that would be part of something like Red Hat certification.

It seems to collect a snapshot of what my basic checks cover. Interesting thing to add as an option; I'm just concerned about dropping potentially large archives into /var/tmp that need cleanup. I can see it could be very helpful with a bash wrapper that pulls the archive off the server, cleans up, and renames it.
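A rough sketch of what such a wrapper could look like, written in Python rather than bash so the pieces are easy to test. It only builds the command lines (collect, copy, clean up) and the new local file name; nothing is executed. Assumptions: the host is reachable over ssh/scp, the login user can sudo and can read the archive (sos archives are normally root-owned, so that may need adjusting), and the caller passes in the archive path that sos printed. The function names and the naming scheme are made up for illustration.

```python
import shlex
from datetime import datetime, timezone
from pathlib import PurePosixPath


def collect_command(host, tmp_dir="/var/tmp"):
    """ssh command that runs sos non-interactively, writing into tmp_dir."""
    remote = ["sudo", "sos", "report", "--batch", "--tmp-dir", tmp_dir]
    # ssh hands a single string to the remote shell, so quote it properly.
    return ["ssh", host, shlex.join(remote)]


def local_name(host, remote_archive, when):
    """Hypothetical naming scheme: <host>_<UTC timestamp><extension>."""
    name = PurePosixPath(remote_archive).name
    if ".tar" in name:
        ext = ".tar" + name.rsplit(".tar", 1)[1]  # keeps e.g. ".tar.xz"
    else:
        ext = PurePosixPath(name).suffix
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{host}_{stamp}{ext}"


def fetch_and_clean_commands(host, remote_archive, when):
    """Commands to copy the archive under its new name, then delete it remotely."""
    target = local_name(host, remote_archive, when)
    copy = ["scp", f"{host}:{remote_archive}", target]
    clean = ["ssh", host, shlex.join(["sudo", "rm", "-f", "--", remote_archive])]
    return [copy, clean]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; running the cleanup only after the copy succeeds avoids deleting the only copy of a report.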

[-]

wossack@reddit

When logging a call with Red Hat, you are quite often asked for an sosreport to be uploaded (depending on the issue etc etc)

[-]

serverhorror@reddit

What's it part of, or where is the homepage?

Never heard of that tool.

[-]

jlrueda@reddit (OP)

https://github.com/sosreport/sos. It's years old, and it is what Red Hat uses for customer support. You can type: man sos-report

[-]

gforke@reddit

Debian and openSUSE don't seem to have it included from the start; only RHEL stuff seems to have it preinstalled.

[-]

ChrisTX4@reddit

It’s more commonly used on RHEL, but you can also use it on other distros.

There’s also a tool called xsos to distill information from sos even quicker, but that throws a lot of errors when not on RH I’ve found.

[-]

Kurgan_IT@reddit

TIL about bpfcc-tools. This looks insanely cool. https://www.super-man.dev/package/bpfcc-tools

[-]

MaxRK@reddit

I don't think sosreport sits at the low-level observability layers that he's focused on, and he does mention it's a minimal list. It's still up to experienced sysadmins to work out what's good for their environment, including bulk collection scripts like sosreport.

sosreport isn't quite ubiquitous yet if you're aiming for the largest audience; for example, SLES prefers supportconfig. I've also had a couple of incidents where sosreport's resource load and OS probing commands caused production issues, so the low-level knowledge can be more appealing once you know the tools.
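If you need one entry point across a mixed fleet, a small Python sketch of picking the collector per distro from /etc/os-release could look like this. The only mapping taken from this thread is "SLES prefers supportconfig"; treating every other distro as an sos candidate is an assumption, and the function names are hypothetical.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict (quotes stripped)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def pick_collector(fields):
    """Return the argv of the support-data collector to run on this host."""
    distro_id = fields.get("ID", "").lower()
    if distro_id.startswith("sles"):
        return ["supportconfig"]
    # Assumption: everywhere else sos is installed (often it is not by default).
    return ["sos", "report", "--batch"]
```

Typical use would be `pick_collector(parse_os_release(open("/etc/os-release").read()))`, ideally after checking that the chosen binary actually exists on the host.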