Struggling with forcing systemd to keep restarting a service.
Posted by mamelukturbo@reddit | linuxadmin | 21 comments
I have a service I need to keep alive. The command it runs sometimes fails (on purpose), and instead of retrying until the command works, systemd just gives up.
Regardless of what parameters I use, systemd decides after some arbitrary number of attempts "no, I've tried enough times, Restart=always or not, I ain't gonna bother anymore" and I get "Failed with result 'exit-code'."
I googled and googled and rtfm'd and I don't really care what systemd is trying to achieve. I want it to try to restart the service every 10 seconds until the thermal death of the universe no matter what error the underlying command spits out.
For the love of god, how do I do this apart from calling "systemctl restart" from cron each minute?
The service file itself is irrelevant, I tried every possible combination of StartLimitIntervalSec, Restart, RestartSec, StartLimitInterval, StartLimitBurst you can think of.
Emergency-Scene3044@reddit
Dude, I feel your pain. Systemd is like that one stubborn friend who insists they know best and just won’t listen.
Nementon@reddit
To configure a systemd service to restart indefinitely with a 10-second delay after failure, create or modify the service file (e.g., /etc/systemd/system/myservice.service) with the following configuration:
[Unit]
Description=My Indestructible Service
After=network.target

[Service]
ExecStart=/path/to/your/command
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
Explanation:
Restart=always: Ensures the service restarts regardless of the failure reason.
RestartSec=10s: Sets a 10-second delay before restarting the service.
WantedBy=multi-user.target: Ensures the service starts at boot.
Apply Changes:
After editing the file, reload systemd and start the service:
sudo systemctl daemon-reload
sudo systemctl enable myservice
sudo systemctl start myservice
This setup ensures that your service keeps restarting every 10 seconds upon failure—until the universe collapses (or until you manually disable it).
ChatGPT
2FalseSteps@reddit
Instead of applying bandaids that overcomplicate everything, find out WHY the service fails and resolve that PROPERLY.
punkwalrus@reddit
Because some programmer won't fix his damn Java app, that's why. I see that as a "solution" more often than not. I have spent too many days of my sysadmin life, in hours of meetings, where "this is the interim solution" and politics play over function. "Just disable SELinux" or "remove the firewall" or "remove that OOM thing that keeps filling our logs, what is that anyway?" But oh no, it's not the shitty app the programmer won't fix. "Works on my system," yeah I doubt that, too.
/rant
2FalseSteps@reddit
Nothing irritates me more than lazy, incompetent devs not wanting to lift a finger to do their own fucking job.
I completely understand and agree with your rant. I rant about it daily, but at least I'm at a level where I can give the devs AND their managers shit and point out their laziness/incompetence.
HR has my manager's # on speed-dial, and he wholly supports me when I push back on that shit.
Do your own fucking job and don't demand I put bandaids on my servers just so your shitty code stays up, because I won't.
mamelukturbo@reddit (OP)
No.
The service fails because it creates a worker inside another service, which itself might not be up due to being updated/offline w/e. The service is supposed to keep trying to restart until the other service comes back up. The WHY is unimportant. The setup is for the purpose of this post immutable and I need to achieve my goal within its constraints.
2FalseSteps@reddit
So you identify and resolve the issue PROPERLY.
Bandaids are for amateurs that don't understand what they're doing.
mamelukturbo@reddit (OP)
You tell that to a corporation, I'm just trying to work within the constraints set by their system and would appreciate an answer related to my question.
Creating the service as their manual suggests doesn't keep the worker alive if nextcloud server restarts/goes offline.
https://docs.nextcloud.com/server/latest/admin_manual/ai/overview.html#systemd-service
I'm just trying to modify the service so it works in my environment properly.
What do you want me to do? Refactor and reprogram the whole underlying application?
meditonsin@reddit
Considering those docs tell you to write your own shell script for the systemd unit, you could just modify it to wait until Nextcloud is available before trying to start the worker.
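As a minimal sketch of that suggestion: the generic retry helper below re-runs a readiness check until it succeeds, so the wrapper script only launches the worker once Nextcloud is reachable. The container name and the occ commands in the usage comment are placeholders, not the actual ones from the Nextcloud docs:

```shell
#!/bin/sh
# Sketch: retry the given command every 10s until it exits 0, so the
# systemd unit never registers a failed start while Nextcloud is down.
wait_for() {
    until "$@"; do
        echo "not ready yet, retrying in 10s..." >&2
        sleep 10
    done
}

# Hypothetical usage inside the wrapper the Nextcloud docs have you write
# (container name and occ invocations are assumptions):
#   wait_for docker exec nextcloud php occ status
#   exec docker exec nextcloud php occ background-job:worker
```

With the wait loop inside the script, systemd's start rate limiter never trips while Nextcloud is merely offline, and a non-zero exit can still mean "something is actually broken."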
mamelukturbo@reddit (OP)
I understand that, but I'm asking whether the same effect is achievable with modifying the service instead.
meditonsin@reddit
The problem with that approach is that it's harder to notice when there's an actual problem with the service if systemd will just try to restart it forever. Making it so it can still fail if there's an actual problem seems way cleaner.
mamelukturbo@reddit (OP)
That makes sense. I'll look into modifying the shell script instead.
Failure of the service is not much of an issue in the end. The worker by default runs via the container's cron every 5 minutes (not configurable to be lower). The systemd service is there to keep spawning workers that pick up tasks quicker than every 5 minutes. So even if the worker spawned by the external shell script fails, the internal worker will pick the task up; it will just take between 0 and 5 minutes depending on timing.
schorsch3000@reddit
That's what dependencies are for. Have you set Requires=?
mamelukturbo@reddit (OP)
The other service isn't a systemd unit.
schorsch3000@reddit
That's the problem :-)
mamelukturbo@reddit (OP)
Which I'm trying to solve lol. This whole convo feels like early 2000 4chan, you ask a question and after 3 pages of ppl telling you to kill yourself, someone finally answers.
2FalseSteps@reddit
Bingo!
A proper resolution that doesn't rely on bandaids.
jambry@reddit
From the man page for StartLimitInterval=, StartLimitBurst=: set to 0 to disable any kind of rate limiting.
mamelukturbo@reddit (OP)
Looking at my attempts I'm realizing it's not the numbers I'm trying that are the issue, but the placement of the statements. Half the examples put them into [Unit], half into [Service]. It confuses me mightily.
I changed the numbers to 0 so now I have: the task is supposed to run for 60 sec and then restart, but the whole service should also restart every 5 sec if the command fails to run (the .sh file calls docker exec container_name and the container ain't always up), until it doesn't fail, at which point it should be back to restarting every 60 sec. Hope that makes sense.
This seems to work, but from your reply I should move the statements to [Service]?
yrro@reddit
If the directive is documented in systemd.unit then it goes into the Unit section. If it's documented in systemd.service or systemd.exec or others then it goes into Service.
The man pages themselves probably mention this but if you always skip to the description of the options then you might miss that.
I believe systemd will complain if it notices unexpected stuff within a unit file, maybe you're missing those messages... try journalctl -e after running daemon-reload, jump to the end with G and check.
You can also run systemctl show whatever.service and confirm that the values you're trying to set are actually effective. I think -p OptionName can be used to print out a single value, not sure off the top of my head.
_mick_s@reddit
They should be in [Unit]; the docs list all the options and where they go:
https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#
https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#
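Putting the thread's answer together, a unit that keeps retrying forever might look like this sketch (the description, wrapper path, and timings are placeholders): the rate-limiter knobs are documented in systemd.unit and so belong in [Unit], while Restart= and RestartSec= are documented in systemd.service and belong in [Service].

```ini
[Unit]
Description=Nextcloud worker keepalive (placeholder)
# Documented in systemd.unit, so it goes here, not in [Service];
# 0 disables start rate limiting entirely
StartLimitIntervalSec=0

[Service]
# Placeholder path for the wrapper script the Nextcloud docs have you write
ExecStart=/usr/local/bin/worker-wrapper.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

After a daemon-reload, running systemctl show on the unit (as yrro suggests above) can confirm the values actually took effect.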