Everyone kept crashing the lab server, so I wrote a tool to limit cpu/memory
Posted by TheDevilKnownAsTaz@reddit | linuxadmin | 47 comments
Hey everyone,
I’m not a real sysadmin or anything. I’ve just always been the “computer guy” in my grad lab and at a couple jobs. We’ve got a few shared machines that everyone uses, and it’s a constant problem where someone runs a big job, eats all the RAM or CPU, and the whole thing crashes for everyone else.
I tried using systemdspawner with JupyterHub for a while, and it actually worked really well. Users had to sign out a set amount of resources and were limited by systemd. The problem was that people figured out they could just SSH into the server and bypass all the limits.
I looked into schedulers like SLURM, but that felt like overkill for what I needed. What I really wanted was basically systemdspawner, but for everything a user does on the system, not just Jupyter sessions.
So I ended up building something called fairshare. The idea was simple: the admin sets a default (like 1 CPU and 2 GB RAM per user), and users can check how many resources are available and request more. Systemd enforces the limits automatically so people can’t hog everything.
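Under the hood it leans on systemd's resource controls. I don't want to oversell the internals, but conceptually what ends up applied per user is roughly a slice drop-in like this (the path and values are illustrative defaults, not fairshare's exact output):

    # e.g. /etc/systemd/system/user-1000.slice.d/99-fairshare.conf (illustrative)
    [Slice]
    CPUQuota=100%      # roughly one CPU's worth of time
    MemoryMax=2G       # hard cap; the kernel OOM-kills inside the slice past this
    MemorySwapMax=0    # optionally stop users from spilling into swap

The same limits can also be applied live with systemctl set-property user-1000.slice CPUQuota=100% MemoryMax=2G.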
Not sure if this is something others would find useful, but it’s been great for me so far. Just figured I’d share in case anyone else is dealing with the same shared server headaches.
https://github.com/WilliamJudge94/fairshare/tree/main
CelDaemon@reddit
Aaaand it has a CLAUDE.md... :/
casper_trade@reddit
Caught me off guard, too. It seemed like an excellent project. I do wish we would move away from using the phrase "I wrote" when describing a vibe-coded codebase.
TheDevilKnownAsTaz@reddit (OP)
Haha very true. The tool still works and is useful to me. Just wanted to share it in case others also have a need for something similar.
SaladOrPizza@reddit
Like the idea, but CPU and memory are meant to be used.
TheDevilKnownAsTaz@reddit (OP)
True! This tool was built mainly because the system was being overused. Daily crashes from memory overload, and daily stalls because someone used every core and stopped the rest of the group from being able to work.
kryptkpr@reddit
This is a 12 core, 16 GB machine? I hate to tell you this, but it's crashing because those are terrible specs for even a single user, never mind multiple.
resonantfate@reddit
True, but they're students and this is education. Not a lot of money to go around. Also, the resource limitations could help train users to be more frugal with their requests.
kryptkpr@reddit
Resource limitations in constrained, single user embedded environments are both fun and educational.
Resource limitations in shared multiuser environments are frustrating and nothing else.
stufforstuff@reddit
A server that only has 12G - why?
hdkaoskd@reddit
Student use.
kryptkpr@reddit
there are $600 laptops at bestbuy with far better specs
making anyone use 15-year-old e-waste in a shared environment is a crime
kryptkpr@reddit
when e-waste misses the garbage and somehow ends up as a server that crashes daily because the specs can't even support one user
Amidatelion@reddit
420GB@reddit
Test machine
Z3t4@reddit
Integrated gpu
circularjourney@reddit
Did you try systemd-nspawn?
Add some resource limits to that and you're good to go.
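If anyone goes that route, nspawn accepts the same resource-control properties as regular systemd units. A minimal sketch, assuming a container tree already unpacked at a made-up path:

    # boot a container with hard CPU/memory caps (values illustrative)
    sudo systemd-nspawn -b -D /var/lib/machines/labbox \
      --property=CPUQuota=200% \
      --property=MemoryMax=4G

The trade-off versus per-user slices is that people have to work inside the container instead of their normal shell environment.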
crazyjungle@reddit
Interesting, can come in handy when different "me"s are trying to overload the server at different times ;p
rwu_rwu@reddit
Nice.
SnooChocolates7812@reddit
Nice one 👍
whenwillthisphdend@reddit
For interactive and perpetually running jobs, which is what I gathered from your comments, our lab treats them as shared workstations. I simply restrict concurrent users to two logins at any one time. And if they still manage to crash each other, then they can duke it out amongst themselves / have a conversation. Or move on to one of the other 8 workstations we have available. What ends up happening is that regular users tend to keep using the same workstation, people start to remember who is on which station, and they organise themselves accordingly. Never had any issues with this method, and we have almost 20 people in our group! (We also have a cluster, but that's another story.)
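For anyone wanting to copy the two-login cap, one way to enforce it is pam_limits' maxlogins (the group name below is illustrative):

    # /etc/security/limits.conf (illustrative group name)
    @lab-users    hard    maxlogins    2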
TheDevilKnownAsTaz@reddit (OP)
I wish we had 8 computers! Usually it is a single large computer (512 GB RAM, 32 cores, 4 GPUs) for 10 people. Users would constantly go over their allocation budget and crash the computer.
not-your-typical-cs@reddit
This is incredibly solid!!! I built something similar but for GPU partitioning. I'll take a look at your repo and star it so I can follow your progress. Here's mine in case you're curious: https://github.com/Oabraham1/chronos
TheDevilKnownAsTaz@reddit (OP)
This is so cool!! From the docs it is unclear, but does this allow you to do MIG on any GPU? So I can set up two different experiments at the same time, each using half the VRAM?
archontwo@reddit
Kudos..
Good to scratch your itch.
You could improve it significantly with cgroups as they have been in Linux for a long time now.
You might want to flex those budding sysadmin muscles.
Good luck.
TheDevilKnownAsTaz@reddit (OP)
I think what I have built relies on cgroups, but I am actually not sure. Fairshare allows users to create and modify their own systemd user.slice, which is then maybe controlled by a cgroup? I am not totally sure though, so if this is wrong, pointing me in the correct direction would be much appreciated!
fishmapper@reddit
Is that not what they are already doing with adding limits in user-uid.slice?
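On cgroup v2 a slice literally is a cgroup, so you can watch the limits land in the cgroup filesystem. A quick check, using uid 1000 as an example:

    # apply limits to a user's slice at runtime
    sudo systemctl set-property user-1000.slice MemoryMax=2G CPUQuota=100%

    # the same limits show up as cgroup v2 control files
    cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.max   # 2147483648
    cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.max      # 100000 100000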
skillzz_24@reddit
This is pretty cool I must say, but is it really fair to say you wrote it if the whole thing is vibe coded? Don't mean to slam on you, but it's a little misleading. Either way, dope project.
TheDevilKnownAsTaz@reddit (OP)
That is a really good point. And I don't actually know. Maybe if an AI system had been able to one-shot this I would say Claude did this? But it took about two full days and more than a few manual debug sessions to get to version 0.3.0. Either way, I will edit the post to be clearer that Claude did a lot of the heavy lifting.
TheDevilKnownAsTaz@reddit (OP)
It looks like I am unable to edit because it is an image post :( hopefully others see this comment and the additional one where I mention Claude did a lot of heavy lifting on this project.
Exzellius2@reddit
The CLAUDE.md file makes me think AI.
TheDevilKnownAsTaz@reddit (OP)
Edit: Claude was used a lot during this project’s development.
Julian-Delphiki@reddit
You may want to check out /etc/security/limits.conf :)
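For reference, entries look roughly like this (group name and numbers are illustrative). Worth noting these are classic rlimits, mostly per process (nproc is per user), so they won't cap a user's total memory across many processes the way a systemd slice / cgroup does:

    # /etc/security/limits.conf (illustrative)
    @lab-users    hard    as       2097152   # address space per process, in KB (~2 GB)
    @lab-users    hard    nproc    100       # max processes per user
    @lab-users    hard    cpu      60        # CPU time per process, in minutes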
keesbeemsterkaas@reddit
He wrote a nice wrapper around systemd limits, which will also work.
kobumaister@reddit
Nice job!
aieidotch@reddit
you might want to look at zram, and nohang.
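For anyone unfamiliar: zram gives you compressed swap backed by RAM so memory pressure degrades more gracefully, and nohang is a userspace OOM daemon you configure separately. A hand-rolled zram setup looks roughly like this (size and algorithm are illustrative; many distros ship zram-generator for a cleaner setup):

    # create a compressed swap device backed by RAM
    sudo modprobe zram
    sudo zramctl --algorithm zstd --size 8G /dev/zram0
    sudo mkswap /dev/zram0
    sudo swapon --priority 100 /dev/zram0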
crackerjam@reddit
Personally I have no use for this, but it is a very neat project. Good job OP!
xtigermaskx@reddit
This is neat. I manage clusters and use Slurm; if you ever want to try it, it's not too big an undertaking if you were able to build this.
You might want to share it over at /r/hpc; there might be some folks there that would like it.
TheDevilKnownAsTaz@reddit (OP)
Thanks for the input! I have tried slurm a few times and never really liked its integration for persistent tasks. Unless it has gotten easier?
xtigermaskx@reddit
Ohh, you're running things full time? Yeah, I don't use it for that, just jobs that will dump outputs.
TheDevilKnownAsTaz@reddit (OP)
A lot of devs like their Jupyter notebooks haha, but others like the command line. I needed a way to rein in both types of users.
HeavyNuclei@reddit
Just use Open OnDemand? Jupyter notebooks running in a Slurm allocation. Piece of cake to set up. Tried and tested.
xtigermaskx@reddit
So we have a similar issue we solved a completely different way. A faculty member asked us to stand up a server for students to all be able to run docker containers and notebooks.
We worried that the students could possibly mess up each other's containers on a single server, so we took some old big iron and used Terraform to build them each their own personal VMs. Then they have their own little environment to work in, and we don't have to worry about someone doing anything that could mess up other folks.
We could use this for something that got brought up in a call today. Spin up a similar environment but for group projects.
TheDevilKnownAsTaz@reddit (OP)
As an additional point, if you are looking to have a single Docker image for a group and then limit resources within that Docker image, fairshare should be able to do that.
Within the repo I have a .devcontainer directory that you can use as a Docker template, since it requires a little bit of setup to allow systemd to be run from within the Docker image.
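The exact flags depend on your Docker and cgroup setup, and the .devcontainer in the repo is the reference, but the usual shape of running a systemd-enabled image is something like this (image name is made up):

    # run a container with systemd as PID 1 (flags vary by host; cgroup v2 assumed)
    docker run -d --name fairshare-dev \
      --privileged \
      --cgroupns=host \
      -v /sys/fs/cgroup:/sys/fs/cgroup:rw \
      my-systemd-image /sbin/init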
TheDevilKnownAsTaz@reddit (OP)
I thought about Docker too. The main reason I didn't is I wanted to keep the onboarding process as simple as possible, i.e. here is how to SSH into the machine and use ‘fairshare request’ to sign out resources.
i_am_buzz_lightyear@reddit
This is what is most used from what I know -- GitHub - chpc-uofu/arbiter: Monitor and manage user behavior on login nodes https://share.google/zMO7nplW6osB1gqvT
H3rbert_K0rnfeld@reddit
Don't sell yourself short. Look up the history of Linux. It was just a thing a guy made for class. His post to newsgroups was just like yours.
Make your thing fun to use. Support it. Don't be jerky if someone says "Hey, what about this?" You never know where the project will take you.
robvas@reddit
Couldn't you do the same with cgroups?