Software stack for local LLM server: 2x RTX 5090 + Xeon (willing to wipe Ubuntu, consider Proxmox)

Posted by maxwarp79@reddit | LocalLLaMA

Hello,

we're setting up a dedicated machine for local LLM inference/serving. With this hardware, Ollama isn't making full use of both GPUs, in particular tensor parallelism for large models (e.g., 70B+ with long context or concurrent requests). We're currently on Ubuntu Server 24.04 with the latest NVIDIA drivers/CUDA, running Ollama behind its OpenAI-compatible API, but it leans on a single GPU and lacks advanced batching.
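
For context, this is roughly how clients talk to the current setup (a minimal sketch; Ollama exposes an OpenAI-compatible endpoint on port 11434 by default, and the model tag is just an example of whatever we've pulled):

```python
# Minimal sketch of the current client path: the OpenAI SDK pointed at
# Ollama's OpenAI-compatible endpoint. The model tag is only an example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by Ollama

resp = client.chat.completions.create(
    model="llama3.1:70b",  # example tag; use whatever has been pulled locally
    messages=[{"role": "user", "content": "Summarize tensor parallelism in two sentences."}],
)
print(resp.choices[0].message.content)
```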

Hardware specs:

Goals:

We're open to completely wiping the current Ubuntu install for a clean start, or even switching to Proxmox for VM/container management (GPU passthrough, LXC isolation).
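
If we go the Proxmox route, a quick sanity check we'd run inside the VM/LXC after passthrough, just to confirm both cards are visible to CUDA (assumes a PyTorch build matching the installed CUDA driver stack):

```python
# Sanity check after GPU passthrough: both RTX 5090s should show up here.
# Assumes a PyTorch wheel built against a CUDA version the driver supports.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```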

Alternatives like vLLM, ExLlamaV2 (via text-gen-webui), and TGI look like a better fit for multi-GPU on the RTX 50 series under Ubuntu 24.04 (e.g., a vLLM build with CUDA 12.8). We'd appreciate step-by-step setup advice. Any Blackwell/sm_120 gotchas? Benchmarks from similar dual-5090 rigs?
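
For what it's worth, this is the kind of vLLM setup I have in mind (a rough sketch, not yet tested on Blackwell; the model ID and context length are placeholders, and a 70B model would need a quantized checkpoint such as AWQ/GPTQ/FP8 to fit in 2x32 GB):

```python
# Rough sketch of tensor-parallel serving with vLLM's Python API across both 5090s.
# The model ID is a placeholder; a 70B model must be quantized to fit in 64 GB VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/Llama-3.1-70B-Instruct-AWQ",  # placeholder quantized checkpoint
    tensor_parallel_size=2,        # shard the model across both GPUs
    gpu_memory_utilization=0.90,   # leave headroom for KV cache and activations
    max_model_len=32768,           # placeholder "high context" target
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain why tensor parallelism helps on dual GPUs."], params)
print(outputs[0].outputs[0].text)
```

The same settings map onto vLLM's OpenAI-compatible server mode (e.g., --tensor-parallel-size 2), which is what we'd actually run for concurrent requests.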

Thanks—aiming to turn this into a local AI beast!