Rack server for local LLM
Posted by Typhoon-UK@reddit | LocalLLaMA | View on Reddit | 10 comments
Hi, has anyone tried running a local LLM on a Dell/HP rack server with older Xeon processors, 100+ GB RAM, and no GPU?
Dell PowerEdge R720, 2x Xeon E5-2650 v2, 128GB RAM
I currently run Qwen3.5-2B at Q8_0 on a Dell XPS 7590 with 16GB RAM and a 4GB NVIDIA GPU. It's alright in chat mode but struggles when integrating with opencode.
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1rqo2s0/can_i_run_this_model_on_my_hardware/?
SM8085@reddit
I have a 2x E5-2697 v2 with 256GB RAM and I'm pretty happy with it.
It's definitely slow as fuck, but if you have a "Set it and forget it" mentality then it's not so bad.
Like last night and today it was just grinding through thousands of images in a script.
It needs to check 4,573 frames in total for a 38-minute video at 2 FPS. According to Qwen3.5-35B-A3B, I can edit out 80% of this live video I recorded because it doesn't have what I want in frame.
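The frame count above is easy to sanity-check: sampling at 2 FPS over a 38-minute video gives roughly that many frames (the small difference just means the video runs a few seconds past 38:00).

```python
# Sanity check: frames produced when sampling a 38-minute video at 2 FPS.
fps = 2
minutes = 38
frames = minutes * 60 * fps  # 38 min * 60 s/min * 2 frames/s
print(frames)  # 4560, close to the 4573 reported
```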
MoE models are king: a dense 32B model can be a pain to run, but 122B-A10B isn't so bad, and 35B-A3B is alright.
Qwen3.5-122B-A10B can take an hour in OpenCode just to think up a proper session title. So, there's that. It feels faster when it's done with the title and processes the 10k opencode system instructions. Then at least you can see it ingesting files instead of thinking in circles about the title.
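The MoE point above comes down to memory bandwidth: CPU token generation is largely bound by how many bytes of weights are read per token, so a MoE model with few *active* parameters decodes much faster than a dense model of the same quality class. A rough sketch, where the ~60 GB/s bandwidth figure and the ~0.56 bytes/param (Q4-style quant) are assumptions, not measurements:

```python
# Rough CPU decode-speed estimate: generation is mostly memory-bandwidth
# bound, so tok/s ~= usable bandwidth / bytes of active weights per token.
# Bandwidth and bytes-per-param below are assumed, not benchmarked.

def est_tok_per_s(active_params_b, bytes_per_param=0.56, bandwidth_gbs=60):
    """active_params_b: parameters touched per token, in billions.
    bytes_per_param ~0.56 approximates a Q4_K_M-style quant."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

for name, active_b in [("dense 32B", 32), ("122B-A10B", 10), ("35B-A3B", 3)]:
    print(f"{name}: ~{est_tok_per_s(active_b):.1f} tok/s")
```

With these assumed numbers the dense 32B lands around 3 tok/s while the A3B MoE is over 30, which matches the "32B is a pain, A3B is alright" experience.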
One-Replacement-37@reddit
I would not bother without GPUs. It’ll be 100x slower than your Dell XPS 7590 which has a GPU.
Typhoon-UK@reddit (OP)
Thank you. If I add an RTX 3060 or a GTX 1070/1080, will that be a better setup?
MutantEggroll@reddit
Trying to use those cards in that server is likely more trouble than it's worth. Notably, powering them will be difficult.
I had an R710, so not sure if the R720 has the same limitation, but there were no PCIe power cables available inside the chassis, and the PCIe slot only supplied 25W instead of the standard 75W. So if you want to power a 3060/1080, you're gonna have to do some real ugly stuff with an external PSU, and even then the 25W slot power limitation may cause issues anyways.
AutonomousHangOver@reddit
I gave up on the idea of old servers. Just bought a custom case (14U here), put a Genoa motherboard in it, and added some (ever more) serious GPUs. First there were two 3090s, then appetite came and I got my hands on 5090s. Then my favourite beasts came.
It was a long journey. I started with a 'normal' PC with not nearly enough PCIe lanes to handle 2 GPUs at x16, then bought more and more hardware. Thing is, if I could go back in time, I would go straight to this solution.
Typhoon-UK@reddit (OP)
That’s some serious hardware. My budget’s limited, so I was keen to see if I could leverage any older hardware.
AutonomousHangOver@reddit
It wasn't serious at the beginning. Invest in a motherboard with multiple PCIe slots (PCIe 4.0 is fine too), grab as much RAM as possible within budget (yeah, tricky), and then some 3090s. These are really fine, even in 2026.
The trick is to stay flexible and get a frame/case that can be expanded. Mine is rack-mounted, hence a case rather than a frame.
Typhoon-UK@reddit (OP)
And if I bump my XPS RAM up to 32GB, will that help with a bigger context window or a larger model like a 7B? Is there any Linux distribution optimised for running local LLMs?
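For a rough answer on whether 32GB helps: a 7B model at Q8_0 takes ~7.5GB for weights, and the KV cache grows linearly with context. A back-of-envelope sketch, assuming a Llama-style 7B layout (32 layers, 32 KV heads of dim 128, fp16 cache), which is an assumption rather than any specific model's spec:

```python
# Back-of-envelope memory estimate for a 7B model on a 32GB laptop.
# The KV-cache math assumes a Llama-style 7B (32 layers, 32 KV heads
# x 128 dims, fp16 cache) -- an assumption, not a spec sheet.

def kv_cache_gb(ctx_tokens, layers=32, kv_heads=32, head_dim=128, bytes_per=2):
    # factor of 2 for keys and values
    return 2 * layers * kv_heads * head_dim * bytes_per * ctx_tokens / 1e9

weights_gb_q8 = 7.5  # ~7B params at Q8_0: roughly 1 byte/param plus overhead
for ctx in (4096, 16384, 32768):
    total = weights_gb_q8 + kv_cache_gb(ctx)
    print(f"ctx {ctx}: ~{total:.1f} GB total")
```

Under these assumptions a 32k context alone costs ~17GB of fp16 KV cache, so 32GB does open up both bigger models and longer contexts; quantizing the KV cache (which llama.cpp supports) stretches it further.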
plopperzzz@reddit
I'm currently using an old Dell 7820 with dual E5-2697A V4, 192GB RAM, and an M40 24GB.
Even the best CPUs you can get for these computers are quite slow. Also, nearly everything in the computer will be proprietary and not easily upgradable (PSU and such). But if that hasn't changed your mind: I can get around 25 tok/s on Qwen3.5-35B with --n-cpu-moe 20, and maybe 18 with just --cpu-moe.
Here is what I can get on some of the other models:
Qwen3.5-122B: 8 tok/s
gpt-oss-120b: 15-20 tok/s
Gemma-4-26B: 28 tok/s
Gemma-4-31B: 6 tok/s (very little context on Q4 fully on GPU)