Everyone kept crashing the lab server, so I wrote a tool to limit cpu/memory
Posted by TheDevilKnownAsTaz@reddit | linuxadmin | 47 comments
Hey everyone,
I’m not a real sysadmin or anything. I’ve just always been the “computer guy” in my grad lab and at a couple jobs. We’ve got a few shared machines that everyone uses, and it’s a constant problem where someone runs a big job, eats all the RAM or CPU, and the whole thing crashes for everyone else.
I tried using systemdspawner with JupyterHub for a while, and it actually worked really well. Users had to sign out a set amount of resources and were limited by systemd. The problem was that people figured out they could just SSH into the server and bypass all the limits.
I looked into schedulers like SLURM, but that felt like overkill for what I needed. What I really wanted was basically systemdspawner, but for everything a user does on the system, not just Jupyter sessions.
So I ended up building something called fairshare. The idea was simple: the admin sets a default (like 1 CPU and 2 GB RAM per user), and users can check how many resources are available and request more. Systemd enforces the limits automatically so people can’t hog everything.
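Under the hood it leans on systemd's resource controls. I don't want to oversell the internals, but conceptually what ends up applied per user is roughly a slice drop-in like this (the path and values are illustrative defaults, not fairshare's exact output):

    # e.g. /etc/systemd/system/user-1000.slice.d/99-fairshare.conf (illustrative)
    [Slice]
    CPUQuota=100%      # roughly one CPU's worth of time
    MemoryMax=2G       # hard cap; the kernel OOM-kills inside the slice past this
    MemorySwapMax=0    # optionally stop users from spilling into swap

The same limits can also be applied live with systemctl set-property user-1000.slice CPUQuota=100% MemoryMax=2G.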
Not sure if this is something others would find useful, but it’s been great for me so far. Just figured I’d share in case anyone else is dealing with the same shared server headaches.
https://github.com/WilliamJudge94/fairshare/tree/main
CelDaemon@reddit
Aaaand it has a CLAUDE.md... :/
casper_trade@reddit
Caught me off guard, too. It seemed like an excellent project. I do wish we would move away from using the phrase "I wrote" when describing a vibe-coded codebase.
TheDevilKnownAsTaz@reddit (OP)
Haha very true. The tool still works and is useful to me. Just wanted to share it in case others also have a need for something similar.
SaladOrPizza@reddit
Like the idea, but CPU and memory are meant to be used.
TheDevilKnownAsTaz@reddit (OP)
True! This tool was built mainly because the system was being overused. Daily crashes from memory overload, and daily stalls because someone used every core and stopped the rest of the group from being able to work.
kryptkpr@reddit
This is a 12 core, 16 GB machine? I hate to tell you this, but it's crashing because those are terrible specs for even a single user, never mind multiple.
resonantfate@reddit
True, but they're students and this is education. Not a lot of money to go around. Also, the resource limitations could help train users to be more frugal with their requests.
kryptkpr@reddit
Resource limitations in constrained, single user embedded environments are both fun and educational.
Resource limitations in shared multiuser environments are frustrating and nothing else.
stufforstuff@reddit
A server that only has 12G - why?
hdkaoskd@reddit
Student use.
kryptkpr@reddit
there are $600 laptops at bestbuy with far better specs
making anyone use 15-year-old e-waste in a shared environment is a crime
kryptkpr@reddit
when e-waste misses the garbage and somehow ends up as a server that crashes daily because the specs can't even support one user
Amidatelion@reddit
420GB@reddit
Test machine
Z3t4@reddit
Integrated gpu
circularjourney@reddit
Did you try systemd-nspawn?
Add some resource limits to that and you're good to go.
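If anyone goes that route, nspawn accepts the same resource-control properties as regular systemd units. A minimal sketch, assuming a container tree already unpacked at a made-up path:

    # boot a container with hard CPU/memory caps (values illustrative)
    sudo systemd-nspawn -b -D /var/lib/machines/labbox \
      --property=CPUQuota=200% \
      --property=MemoryMax=4G

The trade-off versus per-user slices is that people have to work inside the container instead of their normal shell environment.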
crazyjungle@reddit
Interesting, can come in handy when different "me"s are trying to overload the server at different times ;p
rwu_rwu@reddit
Nice.
SnooChocolates7812@reddit
Nice one 👍
whenwillthisphdend@reddit
For interactive and perpetually running jobs, which is what I gathered from your comments, our lab treats them as shared workstations. I simply restrict concurrent users to two logins at any one time. And if they still manage to crash each other, then they can duke it out amongst themselves / have a conversation. Or move on to one of the other 8 workstations we have available. What ends up happening is that regular users tend to keep using the same workstation, people start to remember who is on which station, and they organise themselves accordingly. Never had any issues with this method, and we have almost 20 people in our group! (We also have a cluster, but that's another story.)
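For anyone wanting to copy the two-login cap, one way to enforce it is pam_limits' maxlogins (the group name below is illustrative):

    # /etc/security/limits.conf (illustrative group name)
    @lab-users    hard    maxlogins    2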
TheDevilKnownAsTaz@reddit (OP)
I wish we had 8 computers! Usually it is a single large computer (512 GB RAM, 32 cores, 4 GPUs) for 10 people. Users would constantly go over their allocation budget and crash the computer.
not-your-typical-cs@reddit
This is incredibly solid!!! I built something similar but for GPU partitioning. I'll take a look at your repo and star it so I can follow your progress. Here's mine in case you're curious: https://github.com/Oabraham1/chronos
TheDevilKnownAsTaz@reddit (OP)
This is so cool!! From the docs it is unclear, but does this allow you to do MIG on any GPU? So I can set up two different experiments at the same time, each using half the VRAM?
archontwo@reddit
Kudos..
Good to scratch your itch.
You could improve it significantly with cgroups as they have been in Linux for a long time now.
You might want to flex those budding sysadmin muscles.
Good luck.
TheDevilKnownAsTaz@reddit (OP)
I think what I have built relies on cgroups, but I am actually not sure. Fairshare allows users to create and modify their own systemd user.slice, which is then maybe controlled by a cgroup? I am not totally sure though, so if this is wrong, pointing me in the correct direction would be much appreciated!
fishmapper@reddit
Is that not what they are already doing with adding limits in user-uid.slice?
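On cgroup v2 a slice literally is a cgroup, so you can watch the limits land in the cgroup filesystem. A quick check, using uid 1000 as an example:

    # apply limits to a user's slice at runtime
    sudo systemctl set-property user-1000.slice MemoryMax=2G CPUQuota=100%

    # the same limits show up as cgroup v2 control files
    cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.max   # 2147483648
    cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.max      # 100000 100000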
skillzz_24@reddit
This is pretty cool I must say, but is it really fair to say you wrote it if the whole thing is vibe coded? Don't mean to slam on you, but it's a little misleading. Either way, dope project.
TheDevilKnownAsTaz@reddit (OP)
That is a really good point. And I don't actually know. Maybe if an AI system had been able to one-shot this I would say Claude did this? But it took about two full days and more than a few manual debug sessions to get to version 0.3.0. Either way, I will edit the post to be clearer that Claude did a lot of the heavy lifting.
TheDevilKnownAsTaz@reddit (OP)
It looks like I am unable to edit because it is an image post :( hopefully others see this comment and the additional one where I mention Claude did a lot of heavy lifting on this project.
Exzellius2@reddit
The CLAUDE.md file makes me think AI.
TheDevilKnownAsTaz@reddit (OP)
Edit: Claude was used a lot during this project’s development.
Julian-Delphiki@reddit
You may want to check out /etc/security/limits.conf :)
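For reference, entries look roughly like this (group name and numbers are illustrative). Worth noting these are classic rlimits, mostly per process (nproc is per user), so they won't cap a user's total memory across many processes the way a systemd slice / cgroup does:

    # /etc/security/limits.conf (illustrative)
    @lab-users    hard    as       2097152   # address space per process, in KB (~2 GB)
    @lab-users    hard    nproc    100       # max processes per user
    @lab-users    hard    cpu      60        # CPU time per process, in minutes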
keesbeemsterkaas@reddit
He wrote a nice wrapper around systemd limits, which will also work.
kobumaister@reddit
Nice job!
aieidotch@reddit
you might want to look at zram, and nohang.
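For anyone unfamiliar: zram gives you compressed swap backed by RAM so memory pressure degrades more gracefully, and nohang is a userspace OOM daemon you configure separately. A hand-rolled zram setup looks roughly like this (size and algorithm are illustrative; many distros ship zram-generator for a cleaner setup):

    # create a compressed swap device backed by RAM
    sudo modprobe zram
    sudo zramctl --algorithm zstd --size 8G /dev/zram0
    sudo mkswap /dev/zram0
    sudo swapon --priority 100 /dev/zram0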
crackerjam@reddit
Personally I have no use for this, but it is a very neat project. Good job OP!
xtigermaskx@reddit
This is neat. I manage clusters and use Slurm; if you ever want to try it, it's not too big an undertaking if you were able to build this.
You might want to share it over at /r/hpc; there might be some folks there that would like it.
TheDevilKnownAsTaz@reddit (OP)
Thanks for the input! I have tried slurm a few times and never really liked its integration for persistent tasks. Unless it has gotten easier?
xtigermaskx@reddit
Ohh, you're running things full time? Yeah, I don't use it for that, just jobs that will dump outputs.
TheDevilKnownAsTaz@reddit (OP)
A lot of devs like their Jupyter notebooks haha, but others like the command line. I needed a way to rein in both types of users.
HeavyNuclei@reddit
Just use Open OnDemand? Jupyter notebooks running in a Slurm allocation. Piece of cake to set up. Tried and tested.
xtigermaskx@reddit
So we have a similar issue we solved a completely different way. A faculty member asked us to stand up a server for students to all be able to run docker containers and notebooks.
We worried that the students could possibly mess up each other's containers on a single server, so we took some old big iron and used Terraform to build them each their own personal VMs. Then they have their own little environment to work in, and we don't have to worry about someone doing anything that could mess up other folks.
We could use this for something that got brought up in a call today. Spin up a similar environment but for group projects.
TheDevilKnownAsTaz@reddit (OP)
As an additional point, if you are looking to have a single Docker image for a group and then limit resources within that Docker image, fairshare should be able to do that.
Within the repo I have a .devcontainer directory that you can use as a Docker template, since it requires a little bit of setup to allow systemd to be run from within the Docker image.
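The exact flags depend on your Docker and cgroup setup, and the .devcontainer in the repo is the reference, but the usual shape of running a systemd-enabled image is something like this (image name is made up):

    # run a container with systemd as PID 1 (flags vary by host; cgroup v2 assumed)
    docker run -d --name fairshare-dev \
      --privileged \
      --cgroupns=host \
      -v /sys/fs/cgroup:/sys/fs/cgroup:rw \
      my-systemd-image /sbin/init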
TheDevilKnownAsTaz@reddit (OP)
I thought about Docker too. The main reason I didn't is I wanted to keep the onboarding process as simple as possible, i.e. here is how to SSH into the machine and use ‘fairshare request’ to sign out resources.
i_am_buzz_lightyear@reddit
This is what is most used from what I know -- GitHub - chpc-uofu/arbiter: Monitor and manage user behavior on login nodes https://share.google/zMO7nplW6osB1gqvT
H3rbert_K0rnfeld@reddit
Don't sell yourself short. Look up the history of Linux. It was just a thing a guy made for class. His post to newsgroups was just like yours.
Make your thing fun to use. Support it. Don't be jerky if someone says "Hey, what about this?" You never know where the project will take you.
robvas@reddit
Couldn't you do the same with cgroups?