Software stack for local LLM server: 2x RTX 5090 + Xeon (willing to wipe Ubuntu, consider Proxmox)
Posted by maxwarp79@reddit | LocalLLaMA | 15 comments
Hello,
I'm setting up a dedicated machine for local LLM inference/serving. With this hardware, Ollama isn't fully using the multi-GPU potential, especially tensor parallelism for large models (e.g., 70B+ with long context or concurrent requests). Currently on Ubuntu Server 24.04 with the latest NVIDIA drivers/CUDA, running Ollama via its OpenAI-compatible API, but it's single-GPU heavy and lacks advanced batching.
Hardware specs:
- CPU: Intel(R) Xeon(R) w3-2435 (8 cores/16 threads)
- RAM: 128 GB DDR5 4400 MT/s (4x 32 GB)
- GPUs: 2x NVIDIA GeForce RTX 5090 32 GB GDDR7 (full PCIe 5.0)
- Storage: 2x Samsung 990 PRO 2TB NVMe SSD
- Other: Enterprise mobo w/ dual PCIe 5.0 x16, 1200W+ PSU
Goals:
- Max throughput: large models (Llama 3.1 405B quantized, Qwen2.5 72B) split across both GPUs, continuous batching for a multi-user API.
- OpenAI-compatible API (faster/more efficient than Ollama); a drop-in client sketch follows this list.
- Easy model mgmt (HuggingFace GGUF/GPTQ/EXL2), VRAM monitoring, Docker/VM support.
- Bonus: RAG, long contexts (128k+ tokens), LoRA serving.
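For illustration, the kind of drop-in client this implies, as a minimal sketch; the base URL, port, and model name are placeholders for whichever server ends up running, not a working config:

```python
# Minimal sketch: any OpenAI-compatible server (vLLM, llama-server, TGI)
# should be a drop-in swap. base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-72b-instruct",  # whatever name the server registers
    messages=[{"role": "user", "content": "Hello from the dual-5090 box"}],
)
print(resp.choices[0].message.content)
```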
We’re open to completely wiping the current Ubuntu install for a clean start—or even switching to Proxmox for optimal VM/container management (GPU passthrough, LXC isolation).
Alternatives like vLLM, ExLlamaV2/text-generation-webui, and TGI look great for RTX 50-series multi-GPU on Ubuntu 24.04 (e.g., a vLLM build with CUDA 12.8). I need step-by-step setup advice. Any Blackwell/sm_120 gotchas? Benchmarks from similar dual-5090 rigs?
Thanks—aiming to turn this into a local AI beast!
huzbum@reddit
Sounds like the next gen version of my AI Rig after I get the next round of upgrades in. Ryzen 5900XT 16 core, 128GB DDR4, and dual RTX 3090s.
I use it as a workstation for software development and full time AI server. Because I use it as a workstation, I put the server inside docker, but that is not necessary, especially for a dedicated server for a single model.
If you are serving a single model, just install vllm or llama.cpp and you’re all set. As for which one, the model you want to use plays a role in that decision. Also, how many simultaneous users?
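If you go the vLLM route, a minimal launch sketch; the model, quant, and flag values here are assumptions to tune for 2x 32 GB, not a verified dual-5090 recipe:

```python
# Sketch: launch vLLM's OpenAI-compatible server with tensor parallelism
# across both GPUs. Model choice and flag values are assumptions.
import subprocess

subprocess.run([
    "vllm", "serve", "Qwen/Qwen2.5-72B-Instruct-AWQ",  # 4-bit AWQ to fit 64 GB
    "--tensor-parallel-size", "2",       # split every layer across both 5090s
    "--max-model-len", "32768",          # cap context so the KV cache fits
    "--gpu-memory-utilization", "0.90",  # leave headroom per GPU
])
```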
Plus_Regret4490@reddit
If I build a dedicated full-time AI server, can I remote into it from my Windows PC or MacBook Pro to use it as both a Linux workstation for software development and a server at the same time? ...and can it support multiple users running LLMs simultaneously?
huzbum@reddit
Yeah, that's how I use mine. I've got it set up with llama.cpp, Hermes Agent, and Open WebUI.
I remote in using SSH and Tailscale from my MacBook and iPhone.
I'm using llama.cpp for a single model. I have it set up to serve up to 8 parallel requests at a time. It does 4 requests at 1/2 speed and 8 requests at 1/4 speed, so there are diminishing returns to batching on a 3090 with llama.cpp.
As for how many users that would support, I'd guess around 20 active users having conversations where they read between responses. You would definitely want to set up parallel contexts and context caching with multiple users.
Even with a single user, some harnesses are aggressively parallel and make like 8 requests all at once *cough* IntelliJ! *cough*, which will blow out the input caching if you don't have parallel contexts and caching set up. Then it has to re-process the entire context for every message, every time.
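Roughly what that slot setup looks like as a launch, sketched via subprocess; the model path and sizes are made up, and note that -c is the total context, split evenly across the -np slots:

```python
# Sketch of a llama-server launch with parallel slots. Path and sizes are
# assumptions; -c is the *total* KV cache, divided across the -np slots
# (65536 / 8 = 8192 tokens per slot).
import subprocess

subprocess.run([
    "llama-server",
    "-m", "/models/model-q4_k_m.gguf",  # hypothetical path
    "-ngl", "99",      # offload all layers to the GPUs
    "-c", "65536",     # total context, shared by all slots
    "-np", "8",        # 8 parallel slots = 8 concurrent requests
    "--port", "8080",  # serves OpenAI-compatible endpoints under /v1
])
```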
Since it's got 128 GB of system RAM, it also runs all of my development Docker containers, and I use IntelliJ IDEA Gateway to run all my heavy IDE features, language services, etc., so my old MacBook only has to render the UI.
Plus_Regret4490@reddit
What’s the annual electricity cost for this server running 24/7?
maxwarp79@reddit (OP)
I don't know, sorry.
see_spot_ruminate@reddit
I hope I don't come off as mean...
You've got a beast of a machine but don't want to "tinker" with it to get the most performance?? Honestly, stay on Ubuntu, since at least all the beginner guides are written for it. For you, I would not even touch Proxmox, GPU passthrough, or any other headaches.
Start off small (again, not to be mean): go on the vLLM GitHub and figure out how to install it. Read the GitHub issues and discussions.
maxwarp79@reddit (OP)
Fair point, starting simple with Ubuntu/vLLM. Any pinned guide for TP=2 on the 5090?
see_spot_ruminate@reddit
Not that I know of. Probably the best way to find out is to check out their github or just fuck around with it.
Also, vLLM is good, but you bought all that system RAM and I don't think vLLM is going to use it. Try llama.cpp or ik_llama.cpp (the new split-mode graph is good for dense models, and either can be faster depending on the model). This way you can leverage that system RAM.
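As a sketch of what leveraging that system RAM looks like with llama.cpp's partial offload; the path and layer split are assumptions, so tune -ngl down until the model loads without OOM:

```python
# Sketch: run a quant bigger than 64 GB of VRAM by leaving some layers on
# the CPU, backed by the 128 GB of system RAM. Path and split are assumed.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "/models/big-dense-model-q8_0.gguf",  # hypothetical ~75 GB quant
    "-ngl", "60",      # e.g. 60 of 80 layers on GPU, the rest from system RAM
    "-c", "16384",
    "--port", "8080",
])
```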
Narrow-Belt-5030@reddit
Switched from Ollama to vLLM... the difference is night and day, even more so if you have multiple calls to the same LLM.
maxwarp79@reddit (OP)
Where can I find tutorials/guides for installing it? Any problems with the RTX 5090?
Narrow-Belt-5030@reddit
I know it sounds stupid, but I just asked Claude to do it for me.
maxwarp79@reddit (OP)
Thanks! Huge jump from Ollama—do you have a good Claude prompt for 5090 vLLM setup? Specific issues you hit?
lemondrops9@reddit
I like Linux Mint and LM Studio personally because it makes playing with models easy.
But you want multi-user, so vLLM should be the easy choice. LM Studio just got multi-user support, but it just divides the speed by however many people are generating.
LA_rent_Aficionado@reddit
2x 5090 will not be enough VRAM for TP, especially with batching, on the models you mention if you want full GPU offload. I'd at least move to llama-server over Ollama; vLLM will further increase your VRAM requirements. EXL3/EXL2 is good but not as mature an ecosystem for batching and multi-user; I'd suggest using tabbyAPI as the front end for any ExLlama backend. You'll likely need to upgrade your PSU if you want TP and batching, since the 5090s will be under more load.
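A back-of-envelope check of that VRAM claim; the layer/head counts below are from Qwen2.5-72B's published config, while the quant size, batch, and context are assumptions:

```python
# Rough VRAM budget for a 72B model at ~4-bit with batched 32k contexts.
# Layer/head counts come from the model config; everything else is assumed.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, n_seqs, bytes_per_elem=2):
    """FP16 KV cache: K and V tensors per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx * n_seqs / 2**30

weights_gib = 72e9 * 0.5 / 2**30             # ~4-bit quant: ~0.5 bytes/param
kv_gib = kv_cache_gib(80, 8, 128, 32768, 4)  # four concurrent 32k sequences
print(f"weights ~{weights_gib:.0f} GiB + KV cache ~{kv_gib:.0f} GiB "
      f"vs 64 GiB total on two 5090s")       # ~34 + ~40: over budget
```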
Infinite100p@reddit
Don't 5090s have some passthrough bug where they idle at high wattage or something like that?