Best way to limit total memory used by all users on a shared multi-user system
Posted by pi_epsilon_rho@reddit | linuxadmin | 16 comments
Our site has many CentOS 7 and Rocky 8/9 Linux systems that are shared by many users concurrently via SSH login for assorted interactive work. Many of these are large 128GB+ desktops sitting at one person's desk in a group; that person logs in locally, but many other users in the group SSH in to that desktop to run various analysis programs and do development.
Anyway, one thing that happens a lot is that one user will run MATLAB or some other program that consumes all the RAM in the box, slowing it to a crawl for everyone else. Eventually the kernel's OOM killer kicks in. However, many system processes, though not killed by the OOM killer, end up in a stuck, non-operating state.
One of these is SSSD, the main account-services daemon, which does not recover, which then prevents any new logins and hangs other processes on things like user name/ID lookups. One can restart sssd to fix it, but one cannot SSH to the box or even log in locally to do this, so most of the time we have to hard power-cycle the box.
One attempt I made at "fixing" this was to create the following rsyslog configuration in /etc/rsyslog.d/oom-sssd-restart.conf
:msg, contains, "was terminated by own WATCHDOG" ^/usr/etc/sssd-restart.sh
as one usually sees that message in /var/log/messages when sssd gets into its hung state, but this has only worked about 50% of the time.
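The restart script itself is nothing special. The actual contents of /usr/etc/sssd-restart.sh aren't shown here, so the following is only a sketch of what such a hook might look like (the 5-second sleep and the logger tag are assumptions):

#!/bin/bash
# Sketch of a hook for the rsyslog rule above: bounce sssd once the
# watchdog-termination message shows up in the logs.
sleep 5                                   # assumption: let the OOM event settle first
systemctl restart sssd.service
logger -t sssd-restart "restarted sssd after watchdog termination"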
Ultimately, I want to make sure that 4GB or so of each system's RAM is reserved only for system processes (UID < 1000), or equivalently to limit users with UID > 1000 to about 96% of the system's RAM. Is there any simple and accepted way to do this? I am NOT looking for a per-user memory limit via the /etc/security/limits.d/ system; that does not work for what I want.
One thing I am looking at is using cgroup slices and running
systemctl set-property user.slice MemoryHigh=120G
for example on a 128G system. It is unclear to me if this requires cgroups v2, meaning changing GRUB on all boxes to add the kernel parameter systemd.unified_cgroup_hierarchy=1 and rebooting them.
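For what it's worth, here is a minimal sketch of that approach on Rocky 8/9 (CentOS 7's systemd 219 predates MemoryHigh=, so it won't help there; the exact values are only illustrative):

# Check which hierarchy is mounted; "cgroup2fs" means the unified (v2) hierarchy.
stat -fc %T /sys/fs/cgroup

# Rocky 8 defaults to cgroups v1, so switch to the unified hierarchy and reboot;
# Rocky 9 already boots with cgroups v2.
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"

# Cap all user sessions combined with a drop-in, e.g.
# /etc/systemd/system/user.slice.d/50-memory.conf:
[Slice]
MemoryHigh=120G
MemoryMax=124G    # illustrative hard cap, leaving ~4G for system.slice

Running systemctl set-property user.slice MemoryHigh=120G writes an equivalent drop-in and persists across reboots unless --runtime is given.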
BTW, I do use SLURM on an HPC cluster, and I consider it too heavy-handed and difficult a solution for an interactive shared desktop where local GUI login is also used.
venquessa@reddit
You could try terminating some users.
Often it's a social problem not a technical one.
I jest!
Well... sometimes.
michaelpaoli@reddit
Uhm, actually often that really works! If the host is tight on resources, just be sure all the users can well and easily see how much of that resource is being consumed by which users, top on down.
Many a user disk-space issue I've gotten resolved very quickly just by making sure all users were well aware of that.
venquessa@reddit
Servers in the last place I worked were a bit like that. They would try education first.
Then they would move to naming and shaming.
Only then would they "move" your files out of the way to scare you into thinking they had deleted them, forcing you to come looking for them and be told:
"Don't leave multi-gigabyte files in /tmp, it's a RAM disk!"
seidler2547@reddit
I did something like this back in the days well before systemd, when cgroups were only just supported, so I don't know if this will still work.
I wrote a script that went through the process list and put the processes into cgroups according to the user they belonged to. This way, I could gather all processes belonging to each user in a group and limit that cgroup's total memory allowance.
Additionally, I changed oom_score_adj for system critical processes (including my watchdog) so that they won't be looked at so much by the OOM killer. I think there's also a way to nudge it towards larger processes. This might even be easier nowadays with systemd if you create a service override file.
Good luck.
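For illustration, a rough sketch of that kind of per-user classifier against the old cgroup v1 memory controller might look like this (the user_$uid group names and the 8G cap are made-up values; on current systemd the per-user user-<UID>.slice units do most of this for you):

#!/bin/bash
# Sketch: put every non-system user's processes into a per-UID memory cgroup
# (cgroups v1) and cap each group. Names and values are illustrative only.
CG=/sys/fs/cgroup/memory
ps -eo pid=,uid= | while read -r pid uid; do
    [ "$uid" -ge 1000 ] || continue                                 # skip system users
    d="$CG/user_$uid"
    mkdir -p "$d"
    echo $((8 * 1024 * 1024 * 1024)) > "$d/memory.limit_in_bytes"   # 8G per user
    echo "$pid" > "$d/cgroup.procs" 2>/dev/null || true             # move the process
done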
michaelpaoli@reddit
So ... if it's then already up at that limit, then what, all additional users attempting to log in, gonna just throw 'em some sort of relevant error/status message and not let 'em log in?
arkham1010@reddit
Ulimit to the rescue!
pi_epsilon_rho@reddit (OP)
ulimit (aka /etc/security/limits.conf) is only PER USER as far as I can tell, so that does not cover what I need. I need to limit the total combined memory of all users with UID > 1000 to less than 96% of the system's RAM. So far it appears that only cgroups provides a method for this, and it does appear to require cgroups v2, as v1 does not properly put SSH user processes in the correct slice.
arkham1010@reddit
You can use the group syntax in the limits.conf file. Make sure your users are in the same group and you should be fine. Secondary groups are good for this sort of thing, so you can restrict the matlab secondary group to X amount of memory, while other users don't need to be part of that group and can have other restrictions.
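For illustration, the group syntax looks like this (the @matlab group name and the 16G value are made up; note that, as the replies point out, the limit still applies to each individual process, not to the group's combined usage):

@matlab    hard    as    16777216    # address-space limit in KB (16G) per process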
pi_epsilon_rho@reddit (OP)
That applies the ulimit to each user in the group, but to each separately. And as AmusingVegetable says above, it applies to each process group (session) for that user. So a user may have a mem ulimit of 10G but they could have 100 process groups (on 100 different ssh's to a box) running using 10G each. The cgroup v2 method seems the only workable way so far but I need to test.
AmusingVegetable@reddit
ulimit is per process. It does nothing to control the memory usage of a group of processes.
IllllIIlIllIllllIIIl@reddit
Your solution is basically what I did recently for a shared RStudio box and it's been working well. You should also look into setting oom_score_adj to some negative value on important system processes like SSSD.
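With systemd that can be done through a drop-in, e.g. via systemctl edit sssd.service (the -900 value is just an illustration; -1000 would exempt the process from the OOM killer entirely):

[Service]
OOMScoreAdjust=-900    # strongly deprioritize sssd for the OOM killer

Restart sssd afterwards for the setting to take effect.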
pi_epsilon_rho@reddit (OP)
OOM never kills sssd. But something that happens during the situation, probably some failed malloc or other allocation (maybe network) in sssd, puts sssd into some kind of non-working condition.
pi_epsilon_rho@reddit (OP)
Hmm. Sadly it seems a lot of the user processes for users who SSH into the box end up under system.slice/sshd.service rather than user.slice (at least with the default cgroups v1).
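A quick way to check where a given SSH session actually landed is to inspect its cgroup from inside that session:

cat /proc/self/cgroup
# On a working setup this shows user.slice/user-<UID>.slice/session-<N>.scope;
# if it shows system.slice/sshd.service instead, pam_systemd isn't moving the
# session into the user slice (see the link in the next reply).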
ITaggie@reddit
Perhaps this will offer some insight: https://serverfault.com/questions/968717/how-does-systemd-put-sshd-processes-in-slices
SuperQue@reddit
Yup, using cgroups is the best way. I think your user.slice method is the easiest way to deal with it. I don't know about the cgroups v1 vs v2 issues tho.
jaymef@reddit
look into cGroups