Best way to limit total memory used by all users on a shared multi-user system

Posted by pi_epsilon_rho@reddit | linuxadmin | View on Reddit | 19 comments

Our site has many CentOS7, Rocky8/9 linux systems that are shared by many users concurrently via ssh login for random interactive uses. Many of these are large 128GB+ desktops at one person in a a groups desk where that person logins in person but many other users in the group SSH in to that desktop to run various analysis programs and development.

Anyway, one thing that happens a lot is one user will run a MATLAB or other program that consuses all the RAM in the box slowing it down to a crawl for all others. Eventually the kernel implements its OOM procedure. However, many system processes, though not killed by the OOM procedure get in a stuck non-operating state.

One of these is SSSD the main account services daemon which does not recover and then prevents any new logins and hangs other processes on things like user name/id lookups. One can restart sssd to fix it but one cannot ssh to the box or even login locally to do this. So most of the time we have to hard powercycle the box.

One attempt I made at "fixing" this was to create the following rsyslog configuration in /etc/rsyslog.d/oom-sssd-restart.conf

:msg, contains, "was terminated by own WATCHDOG" ^/usr/etc/sssd-restart.sh

as one usually sees that message in /var/log/messages when sssd gets in its hung state but this has only worked about 50% of the time

Ultimately, I want to make sure that 4GB or so of the RAM of each system is reserved only for system processes (UID < 1000) or just limit RAM to 96% of the systems ram to users with UID > 1000. Is there any simple and accepted way to do this? I am NOT looking for a per user memory limit via the /etc/security/limits.d/ system. That does not work for what I want.

One thing I am looking at is using cgroup slices and running

systemctl set-property user.slice MemoryHigh=120G

for example on a 128G system. It is unclear to me if this requires cgroups v2 meaning changing GRUB on all boxes to have kernel paramater systemd.unified_cgroup_hierarchy=1 and rebooting them.

BTW, I do use SLURM on a HPC cluster and consider that a too heavy handed and difficult solution for an interactive user desktop shared by users where local GUI login is used.

[-]

Iciciliser@reddit

bit late but: against the other recommendations here: set MemoryLow on the sssd service to a small but sensible value. MemoryLow. This makes it so that when memory pressure occurs the kernel will leave SSSD well alone.

Also, would recommend installing systemd-oomd for better OOM handling so you don't get the freezes associated with kernel OOM killer by killing offending processes earlier.

[-]

pi_epsilon_rho@reddit (OP)

Hmm. Sadly it seems a lot of the user processes for users who ssh into the box end up under system.slice/sshd.slice rather than user.slice (at least with the default cgroups v1)

[-]

ITaggie@reddit

Perhaps this will offer some insight: https://serverfault.com/questions/968717/how-does-systemd-put-sshd-processes-in-slices

[-]

pi_epsilon_rho@reddit (OP)

I really don't understand what is going on. Looking at sshd.service on a bunch of other workstations we have running Rocky 8 with cgroups v1 and lots of users ssh'ed in, I don't see those user processing ending up in the sshd.service slice. But I definitely saw it on this problem box before I rebooted it to cgroups v2. Nothing should be different among my boxes with the way sssh/pam/... are all configured.

With cgroups v2 the main sshd process does not even show up in the systemd-cgls output.

# find /sys/fs/cgroup/ -name cgroup.procs -exec grep -l '\<5564\>' {} \;

/sys/fs/cgroup/system.slice/sshd.service/cgroup.procs

# systemd-cgls | grep sshd.service | grep -v grep

#

which is very odd. Hmm. Actually it does now up on another Rocky 8 box I have running with cgroups v2. No idea.

[-]

pi_epsilon_rho@reddit (OP)

Update on this

First, as to social engineering aspect, we are an academic research institution. So there is a lot of experimentation going on with data analysis with users not always having a good idea how much memory a program is going to use. Being research we have a fairly loose mentality about these things with somewhat a motto of "if you will learn something, do it". Also, we have high turnover of students and research assistants, so training is a very hard thing for each group to manage and very unique to each group.

I have rebooted the box with systemd.unified_cgroup_hierarchy=1 in kernel parameters to turn on cgroups v2. I then ran

# systemctl set-property user.slice MemoryMax=498G

# systemctl set-property user.slice MemoryHigh=494G

as free showed essentially 502G of RAM (539907670016).

So far the box shows on user.slice

[user.slice]# cat memory.peak

526904508416

[user.slice]# cat memory.max

534723428352

[user.slice]# cat memory.high

530428461056

[user.slice]# grep . user-*.slice/memory.{peak,current}

user-0.slice/memory.peak:45514752

user-5023442.slice/memory.peak:383312285696

user-5587260.slice/memory.peak:51254865920

user-5923529.slice/memory.peak:167326973952

user-0.slice/memory.current:40755200

user-5023442.slice/memory.current:61406142464

user-5587260.slice/memory.current:32752504832

user-5923529.slice/memory.current:112188551168

so this method appears to be working well

[-]

venquessa@reddit

You could try terminating some users.

Often it's a social problem not a technical one.

I jest!

Well... sometimes.

[-]

michaelpaoli@reddit

Uhm, actually often that really works! If the host is tight on resources, just be sure all the users can well and easily see how much of that resource is being consumed by which users, top on down.

Many a user disk space issues I've gotten resolved very quickly by just making sure all users were well aware of that.

[-]

venquessa@reddit

Servers in the last place I worked were a bit like that. They would try education first.
Then they would move to naming and shaming.
Only then would they "move" your files out of the way to scare you into the thinking they deleted them, forcing you to come looking for them to be told.

"Don't leave multi-gigabyte files in /tmp, it's a RAM disk!"

[-]

seidler2547@reddit

I did something like this in the times well before systems, when cgroups was just about supported, so I don't know of this will still work.

I wrote a script that went through the process list and put the processes into cgroups according to the user they belonged to. This way, I could gather all processes belonging to each user in a group and limit that cgroups total memory allowance.

Additionally, I changed oom_score_adj for system critical processes (including my watchdog) so that they won't be looked at so much by the OOM killer. I think there's also a way to nudge it towards larger processes. This might even be easier nowadays with systemd if you create a service override file.

Good luck.

[-]

michaelpaoli@reddit

So ... if then already up to that limit, then what, all additional users attempting to login, gonna just throw 'em some sort of relevant error/status message and not let 'em log in?

[-]

arkham1010@reddit

Ulimit to the rescue!

[-]

pi_epsilon_rho@reddit (OP)

ulimit (aka /etc/security/limits.conf) is only PER USER as far as I can tell. So that does cover what I need. I need to limit the total memory combined of all users of UID > 1000 to less than 96% of the system's RAM. So far it appears only cgroups seems to provide a method for this and it does appear to require cgroups v2 as v1 does not properly put ssh user processes in the correct slice.

[-]

arkham1010@reddit

You can set a group syntax in the limits.conf file. Make sure your users are in the same group and you should be fine. Secondary groups are good for this sort of thing, so you can restrict the matlab secondary group to X amount of memory, but other users don't need to be part of that group and can have other restrictions.

[-]

pi_epsilon_rho@reddit (OP)

That applies the ulimit to each user in the group, but to each separately. And as AmusingVegetable says above, it applies to each process group (session) for that user. So a user may have a mem ulimit of 10G but they could have 100 process groups (on 100 different ssh's to a box) running using 10G each. The cgroup v2 method seems the only workable way so far but I need to test.

[-]

AmusingVegetable@reddit

ulimit is per process. It does nothing to control the memory usage of a group of processes.

[-]

IllllIIlIllIllllIIIl@reddit

Your solution is basically what I did recently for a shared RStudio box and it's been working well. You should also look into setting oom_score_adj to some negative value on important system processes like SSSD.

[-]

pi_epsilon_rho@reddit (OP)

oom never kills sssd. But something that happens during the situation, probably some failed malloc or other allocation (maybe network) in sssd, puts sssd in to some kind of non-working condition

[-]

SuperQue@reddit

Yup, using cgroups is the best way. I think your user.slice method is the easiest way to deal with it. I don't know about the cgroups v1 vs v2 issues tho.

[-]

jaymef@reddit

look into cGroups