terminusd release - Shutdown control and systemd offline-updates without dual reboots.
Posted by jonnywhatshisface@reddit | linuxadmin | View on Reddit | 19 comments
Hi, folks. I come from pretty large infrastructures, as in ~300k+ servers. I wrote https://jonnywhatshisface.github.io/systemd-shutdown-inhibitor/ to solve problems I've hit in some of those infrastructures, and figured I'd share with everyone in case you may potentially have a use-case for it as well.
We had serious challenges around patch maintenance and management when we switched from SysV init to systemd (RHEL 6 -> RHEL 7) quite a while back.
Given the size of our plant and the count of unique hosts in the infrastructure (thousands of departments and super-orgs, 97k employees - all with their own server infra, and just 15 operations members and 7 engineers globally), the entire plant was set up to do rolling reboots with dynamically controlled scheduling, where the users set their own maintenance windows. They handled things such as their own shutdown scripts for scenarios like HA failover, service stops prior to package upgrades, and so on.
With the switch to systemd, we had to leverage offline updates (the system-update state) to align with those strategies, and that introduced dual reboots on every system, because the updates would happen on the way UP while in the system-update state, instead of on the way DOWN when the shutdown/reboot was executed. Why that's a big issue in that plant is that POST on some of these servers can take more than 30 minutes (think boxes with more than 1TB RAM, 12 NICs, RAID cards, JBODs attached, etc.). This was turning simple reboots and patching into an hour-long adventure in some cases, particularly when a host was being rebooted specifically to roll back a set of patches.
So, I addressed this back then using a similar methodology to terminusd (though not as feature-rich), and it resolved the issue after many years of just dealing with the ridiculous dual reboots.
Now that I've left the company, I've rewritten it into a daemon with far more flexibility, because I was bored and wanted to leverage it on my own systems.
Then I got pinged by an old colleague inquiring about ways to dynamically disable reboot/shutdown entirely on boxes, so that normal systemctl and /sbin/shutdown commands wouldn't work - so I decided to extend that functionality into it as well. Apparently, a node in an HA pair that looked as though the other side was up was shut down by someone on the operations team, and it had serious financial impact because the other node was not in a seeded state and couldn't take the handover.
I decided to take that scenario and cover for it in terminusd as well.
What came out of it is terminusd - a lightweight daemon that gives full control and flexibility over shutdowns and reboots by leveraging a systemd delay inhibitor, and a shutdown guard that can dynamically enable/disable shutdown, halt, reboot and kexec based on environmental factors determined by administrator scripts.
To handle shutdown actions before the system goes down - and before systemd is even in a shutdown state - it registers a delay inhibitor. During this time, all systemctl commands work as normal and systemd is still in a 100% fully running state, but has a pending shutdown. That pending state is controlled by the InhibitDelayMaxSec parameter in logind.conf - which terminusd can optionally configure for you. The delay is only held as long as the inhibitor holds it, or until this timeout is reached - at which point the shutdown/reboot/halt proceeds regardless of whether the inhibitor has finished (to prevent a total deadlock/hang).
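For reference, that timeout is a standard logind setting; the value below is just an illustration (the default is only 5 seconds, far too short for heavyweight shutdown actions):

```ini
# /etc/systemd/logind.conf
[Login]
# Maximum time logind will wait for delay inhibitors (such as the one
# terminusd holds) to be released before proceeding with the shutdown
# anyway. Raise it to cover your longest shutdown actions.
InhibitDelayMaxSec=300
```

You can see active inhibitors (including delay-mode ones) at any time with `systemd-inhibit --list`.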
Commands for shutdown actions are dynamically configured as drop-ins or in the config file. You can set a full command to run (with args), optionally set the user/group to run as and the environment for it, and mark it as critical. The actions are executed in ascending-order "priority groups", meaning commands you set with equal priority will run in parallel. If any task marked "critical" fails, no further priority groups are run and the inhibitor is released.
This is currently being used on large storage clusters and HA kits where shutdowns require things such as triggering failovers and migrating services and VIPs, as well as stopping various services before applying patches/upgrades.
The shutdown guard can disable system-wide reboots, shutdowns, halts and kexecs, even if the command is issued as root. It can run in one of two modes. In oneshot mode, it runs your guard command/script/binary at timed intervals with a configured failure threshold: a zero exit of the command re-enables reboots, while non-zero exits disable them. In persist mode, it attaches a pipe to the stdio of your script/command/binary and monitors it, logging all stdout/stderr to syslog. With persist mode, your app only needs to write a command out to enable or disable shutdowns on the system.
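A minimal sketch of the oneshot-style polling logic - interval, consecutive-failure threshold, zero exit re-enabling - might look like this. The exact threshold semantics and names here are assumptions for illustration, not terminusd's real implementation:

```python
import subprocess
import time

def guard_loop(check_cmd, interval, fail_threshold,
               set_shutdown_allowed, ticks=None):
    """Run check_cmd every `interval` seconds. After `fail_threshold`
    consecutive non-zero exits, disable shutdowns; a single zero exit
    resets the counter and re-enables them. `ticks` bounds the number
    of iterations (handy for testing; None means run forever)."""
    failures = 0
    n = 0
    while ticks is None or n < ticks:
        rc = subprocess.run(check_cmd).returncode
        if rc == 0:
            failures = 0
            set_shutdown_allowed(True)
        else:
            failures += 1
            if failures >= fail_threshold:
                set_shutdown_allowed(False)
        n += 1
        if ticks is None or n < ticks:
            time.sleep(interval)
```

In the HA case, `check_cmd` would be your "can the peer take the handoff?" probe, and `set_shutdown_allowed` the hook that flips the guard.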
Currently, the persist mode is being used on HA clusters where the script monitors the readiness of the servers to take the handoff if one of them is rebooted. If at any point one is not able to take the handoff for whatever reason (reboots, service failures, etc.), then reboots are disabled on the other side to prevent accidental reboots.
terminusctl lets you visualize the action order, see the shutdown enable/disable state, stop/start the shutdown guard, and reload the configuration live without restarting the daemon. This is useful while developing your shutdown guard scripts and configuring your shutdown actions, since you can see the result without restarting the daemon. It can also be used to enable/disable system-wide shutdowns from the CLI on the spot, including overriding the shutdown guard.
If you find it useful, I'd love to hear about it. It may not be for everyone, but I'm sure someone else out there has some kind of need for it, given that we did.
faxattack@reddit
TLDR; lol?
Lots of words and words…what problem are you solving?
hursofid@reddit
Title says that, bud. They have to reboot servers twice and POST takes up to 30 minutes due to hardware onboard.
faxattack@reddit
No, they talk about solutions that solve problems such as the dual reboot, but not about the premise - why you would reboot twice in the first place.
The_Real_Grand_Nagus@reddit
Yeah you have to do some digging, it's not explained clearly. Basically they were using a feature of systemd where it installs updates on boot, and therefore has to reboot a second time.
I've actually never used it - in my world packages are either OK to upgrade continuously, or you have to hold back the upgrade until you've tested that everything will be OK when you do.
Part of what his utility does seems to be "install updates right before rebooting" instead. That part makes some sense. The part I still don't get is that he seems to imply he needs to keep other processes or teams from rebooting the system, which is really a separate problem.
jonnywhatshisface@reddit (OP)
Yeah, I solved both cases in a single hit. Admittedly, it breaks the Unix philosophy of "do one thing and do it well" to some degree, but it's solving both problems and I didn't want to separate the two.
Thank you for pointing out that the explanation isn't really clear. I've spent 26 years working in massive - and I do mean massive - infrastructures. So, it's often an automatic assumption on my end that most people will have experienced the challenges those infrastructures do. I have not worked in smaller environments (less than 100k servers) in a very long time, so it didn't dawn on me that most people are still handling patching somewhat manually or just pushing out Ansible playbooks, which doesn't scale in these kinds of plants and isn't the means of patching in publicly traded organizations with a global footprint and endless red tape.
dodexahedron@reddit
Which systemd clearly adheres closely to, of course.
...For extremely high values of "one."
jonnywhatshisface@reddit (OP)
It's rare I physically smile. This actually pulled one out of me.
dodexahedron@reddit
Glad to be of daemon. 😉
The_Real_Grand_Nagus@reddit
I was skeptical at first, but I looked at your code, and it looks good to me - relatively simple and straightforward. I suppose using SELinux to prevent people from rebooting can be a pain to manage. I'm not sure whether I'll use the systemd upgrade feature or not; I suppose we don't really need it in our environment.
jonnywhatshisface@reddit (OP)
The whole concept really is more useful than some people are realizing.
The shutdown guard came about because too many times people have rebooted a node in an HA pair when the other side simply wasn't ready to take the handoff. Mistakes happen way more than one would think, especially if people had a rough week and were horsewhipped into running the sewing machines all week long and through the weekend.
The shutdown actions are also very useful, because you can do things with them that you cannot do with something started via a unit file - such as upgrading packages or executing other systemctl commands. Things like Spectrum Scale or InfoScale require various systemctl commands to be executed before upgraded versions are installed - i.e. stopping their services.
Imagine this scenario: your entire infrastructure has pre-defined windows for hosts that are "maintenance windows," and the reboot controls have a few different options:
1) Do not reboot/skip
2) Rebuild at reboot (a total reinstall through PXE)
3) Threshold Reboot (i.e. it's a large HPC compute grid or cluster, so X% must remain up at all times)
4) HA Pair - do not reboot if the other side is not available/ready
5) No Update - do not install updates/patches at reboot
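Option 3 above boils down to a simple capacity check before each node is allowed into the reboot window. This illustrative helper (not from terminusd - just the arithmetic) assumes you can query total and currently-up node counts:

```python
def reboot_allowed(total_nodes: int, up_nodes: int, min_up_pct: float) -> bool:
    """Allow a reboot only if taking one more node down still leaves
    at least min_up_pct percent of the cluster up."""
    remaining_up = up_nodes - 1
    # Compare in integer-ish form to avoid float rounding surprises.
    return remaining_up * 100 >= total_nodes * min_up_pct
```

For example, in a 100-node grid with a 90% floor, a reboot is fine while 95 nodes are up, but refused once only 90 remain up.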
So, in this case the entire infra, the reboot maintenance windows and the above information/actions are all controlled by the BUs / owners of the host (there are more than 100k human beings in the company total, across three regions of the world). It's almost hands-off for the majority of the operations that need to be performed, because a lot of the actions are automated / handled simply by ensuring every host reboots at least once a month, and the users can themselves mark or control what happens during those reboots.
A Java developer in India did a dumb-dumb thing and filled up one of the hard drives on his server by not QA-testing his code and flooding data out to the local disk? No problem - they have several other hosts, so instead of flooding operations with requests to log in and fix it, they mark it as "rebuild" via tooling and either reboot it, or wait for the maintenance window to reboot it automatically, and it gets rebuilt.
The rolling reboot schedules also ensure that particular sides are patched first. Of the hosts in DC1 and DC2, only one side will be patched/updated in a two-week span by the automation. This means that if there are issues with the patches that impact the users' stack, you catch them and either roll the patch back entirely via repo snapshot, or exclude that particular set of hosts from the patch until it's resolved.
This doesn't only apply package updates in the infrastructure it's being used in. The concept of an "update" also covers rolling out custom code changes, in-house software deployments, and even configuration changes on the host(s) through configuration management. So on reboot/shutdown, ALL updates are automatically triggered. No unit files, no limitations on what can be executed during the process - as long as it happens within the timeline you've set the max delay inhibit value to.
faxattack@reddit
Sounds like you have organisational challenges. I have experience with large environments and never needed such a stack of precautions. Anything above the OS layer is handled by systemd services. Any other scenario must be handled by the relevant application team. Idk…
jonnywhatshisface@reddit (OP)
No - far from it. Life was quite easy. It doesn't get much easier than scheduled reboots installing patches.
Perhaps not the best thing to make such an assumption. 26 years in tech working in government and financial systems, the smallest infra I’ve touched in the past eleven years is 300k servers. Largest 940k.
When you get to that scale, you either find a good way to manage the infrastructure or you hire an army of people and burn your opex to hell.
Trust me - that infra operates just fine and has passed every audit.
If you don't need it? That's fine, then - it isn't for you. But don't say others don't or wouldn't have needed to write it.
faxattack@reddit
Whatever, this still doesn't make any sense, just like most AI stuff cobbled together with incomprehensible marketing lingo.
jonnywhatshisface@reddit (OP)
Sounds like you’re one of the countless people I’ve hired and fired.
jonnywhatshisface@reddit (OP)
Yah, it’s a bit long.
Check out the site - less words, more to the point.
faxattack@reddit
Too many distractions on the site; I still don't understand your unique problem behind all the solutions.
AdrianTeri@reddit
Confused. Tool is to inhibit shutdowns & reboots but several things do not add up on this post.
At the physical-infra owner level, you surely must have maintenance/repair windows (SLAs) that you send out to clients or customers. You then perform your hardware changes/repairs, and updates to the hypervisor and to firmware for CPUs, NICs, storage controllers, etc.
On the clients'/customers' end, they perform their migrations or ensure HA is properly functioning in anticipation of this scheduled downtime window.
So who needs this tool to inhibit shutdowns/reboots? Are you (at the physical infra level) performing operations that could impair performance or outright corrupt your customers'/clients' data without warning or giving them notice?
All this sounds like a comms & coordination problem not systems.
jonnywhatshisface@reddit (OP)
It's quite clear you have never been responsible for a plant of thousands, let alone hundreds of thousands, of machines.
The_Real_Grand_Nagus@reddit
I don't understand your problem. How does systemd cause more reboots than sysV? None of my RHEL systems reboot themselves unless I schedule them to do so. Why would you allow other entities to schedule reboots, only to have to block them with this other tool?