Software stack for local LLM server: 2x RTX 5090 + Xeon (willing to wipe Ubuntu, consider Proxmox)
Posted by maxwarp79@reddit | LocalLLaMA | 15 comments
Hello,
I'm setting up a dedicated machine for local LLM inference/serving. With this hardware, Ollama isn't fully using the multi-GPU potential, especially tensor parallelism for large models (e.g., 70B+ with long context or concurrent requests). Currently on Ubuntu Server 24.04 with the latest NVIDIA drivers/CUDA, running Ollama via its OpenAI-compatible API, but it's single-GPU heavy and lacks advanced batching.
Hardware specs:
- CPU: Intel(R) Xeon(R) w3-2435 (8 cores/16 threads)
- RAM: 128 GB DDR5 4400 MT/s (4x 32 GB)
- GPUs: 2x NVIDIA GeForce RTX 5090 32 GB GDDR7 (full PCIe 5.0)
- Storage: 2x Samsung 990 PRO 2TB NVMe SSD
- Other: Enterprise mobo w/ dual PCIe 5.0 x16, 1200W+ PSU
Goals:
- Max throughput: large models (Llama 3.1 405B quantized, Qwen2.5 72B) split across both GPUs, continuous batching for a multi-user API.
- OpenAI-compatible API (faster/more efficient than Ollama); a drop-in client sketch follows this list.
- Easy model mgmt (HuggingFace GGUF/GPTQ/EXL2), VRAM monitoring, Docker/VM support.
- Bonus: RAG, long contexts (128k+ tokens), LoRA serving.
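For illustration, the kind of drop-in client this implies, as a minimal sketch; the base URL, port, and model name are placeholders for whichever server ends up running, not a working config:

```python
# Minimal sketch: any OpenAI-compatible server (vLLM, llama-server, TGI)
# should be a drop-in swap. base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-72b-instruct",  # whatever name the server registers
    messages=[{"role": "user", "content": "Hello from the dual-5090 box"}],
)
print(resp.choices[0].message.content)
```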
We’re open to completely wiping the current Ubuntu install for a clean start—or even switching to Proxmox for optimal VM/container management (GPU passthrough, LXC isolation).
Alternatives like vLLM, ExLlamaV2/text-generation-webui, and TGI look great for RTX 50-series multi-GPU on Ubuntu 24.04 (e.g., a vLLM build with CUDA 12.8). I need step-by-step setup advice. Any Blackwell/sm_120 gotchas? Benchmarks from similar dual-5090 rigs?
Thanks—aiming to turn this into a local AI beast!
huzbum@reddit
Sounds like the next gen version of my AI Rig after I get the next round of upgrades in. Ryzen 5900XT 16 core, 128GB DDR4, and dual RTX 3090s.
I use it as a workstation for software development and full time AI server. Because I use it as a workstation, I put the server inside docker, but that is not necessary, especially for a dedicated server for a single model.
If you are serving a single model, just install vllm or llama.cpp and you’re all set. As for which one, the model you want to use plays a role in that decision. Also, how many simultaneous users?
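If you go the vLLM route, a minimal launch sketch; the model, quant, and flag values here are assumptions to tune for 2x 32 GB, not a verified dual-5090 recipe:

```python
# Sketch: launch vLLM's OpenAI-compatible server with tensor parallelism
# across both GPUs. Model choice and flag values are assumptions.
import subprocess

subprocess.run([
    "vllm", "serve", "Qwen/Qwen2.5-72B-Instruct-AWQ",  # 4-bit AWQ to fit 64 GB
    "--tensor-parallel-size", "2",       # split every layer across both 5090s
    "--max-model-len", "32768",          # cap context so the KV cache fits
    "--gpu-memory-utilization", "0.90",  # leave headroom per GPU
])
```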
Plus_Regret4490@reddit
If I build a dedicated full-time AI server, can I remote into it from my Windows PC or MacBook Pro to use it as both a Linux workstation for software development and a server at the same time? ...and can it support multiple users running LLMs simultaneously?
huzbum@reddit
Yeah, that's how I use mine. I've got it set up with llama.cpp, Hermes Agent, and Open WebUI.
I remote in using SSH and Tailscale from my MacBook and iPhone.
I'm using llama.cpp for a single model. I have it set up to serve up to 8 parallel requests at a time. It does 4 requests at 1/2 speed and 8 requests at 1/4 speed, so there are diminishing returns to batching on a 3090 with llama.cpp.
As for how many users that would support, I'd guess around 20 active users having conversations where they read between responses. You would definitely want to set up parallel contexts and context caching with multiple users.
Even with a single user, some harnesses are aggressively parallel and make like 8 requests all at once *cough* IntelliJ! *cough*, which will blow out the input caching if you don't have parallel contexts and caching set up. Then it has to re-process the entire context for every message, every time.
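Roughly what that slot setup looks like as a launch, sketched via subprocess; the model path and sizes are made up, and note that -c is the total context, split evenly across the -np slots:

```python
# Sketch of a llama-server launch with parallel slots. Path and sizes are
# assumptions; -c is the *total* KV cache, divided across the -np slots
# (65536 / 8 = 8192 tokens per slot).
import subprocess

subprocess.run([
    "llama-server",
    "-m", "/models/model-q4_k_m.gguf",  # hypothetical path
    "-ngl", "99",      # offload all layers to the GPUs
    "-c", "65536",     # total context, shared by all slots
    "-np", "8",        # 8 parallel slots = 8 concurrent requests
    "--port", "8080",  # serves OpenAI-compatible endpoints under /v1
])
```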
Since it's got 128 GB of system RAM, it also runs all of my development Docker containers, and I use IntelliJ IDEA Gateway to run all my heavy IDE features, language services, etc., so my old MacBook only has to render the UI.
Plus_Regret4490@reddit
What’s the annual electricity cost for this server running 24/7?
maxwarp79@reddit (OP)
I don't know, sorry.
see_spot_ruminate@reddit
I hope I don't come off as mean...
You've got a beast of a machine but don't want to "tinker" with it to get the most performance?? Honestly, stay on Ubuntu, since at least all the beginner guides are written for it. For you, I would not even touch Proxmox, GPU passthrough, or any other headaches.
Start off small (again, not to be mean): go on the vLLM GitHub and figure out how to install it. Read the GitHub issues and discussions.
maxwarp79@reddit (OP)
Fair point, starting simple with Ubuntu/vLLM. Any pinned guide for TP=2 on the 5090?
see_spot_ruminate@reddit
Not that I know of. Probably the best way to find out is to check out their github or just fuck around with it.
Also, vLLM is good, but you bought all that system RAM and I don't think vLLM is going to use it. Try llama.cpp or ik_llama.cpp (the new split-mode graph is good for dense models, and either can be faster depending on the model). This way you can leverage that system RAM.
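As a sketch of what leveraging that system RAM looks like with llama.cpp's partial offload; the path and layer split are assumptions, so tune -ngl down until the model loads without OOM:

```python
# Sketch: run a quant bigger than 64 GB of VRAM by leaving some layers on
# the CPU, backed by the 128 GB of system RAM. Path and split are assumed.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "/models/big-dense-model-q8_0.gguf",  # hypothetical ~75 GB quant
    "-ngl", "60",      # e.g. 60 of 80 layers on GPU, the rest from system RAM
    "-c", "16384",
    "--port", "8080",
])
```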
Narrow-Belt-5030@reddit
Switched from Ollama to vLLM... the difference is night and day, even more so if you have multiple calls to the same LLM.
maxwarp79@reddit (OP)
Where can I find tutorials/guides for installing it? Any problems with the RTX 5090?
Narrow-Belt-5030@reddit
I know it sounds stupid, but I just asked Claude to do it for me.
maxwarp79@reddit (OP)
Thanks! Huge jump from Ollama—do you have a good Claude prompt for 5090 vLLM setup? Specific issues you hit?
lemondrops9@reddit
I like Linux Mint and LM Studio personally because it makes playing with models easy.
But you want multi-user, so vLLM should be the easy choice. LM Studio just got multi-user support, but it just divides the speed by however many people are generating.
LA_rent_Aficionado@reddit
2x 5090 will not be enough VRAM for TP, especially with batching, on the models you mention if you want full GPU offload. I'd at least move to llama-server over Ollama; vLLM will further increase your VRAM requirements. EXL3/EXL2 is good but not as mature an ecosystem for batching and multi-user; I'd suggest using tabbyAPI as the front end for any ExLlama backend. You'll likely need to upgrade your PSU if you want TP and batching, since the 5090s will be under more load.
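A back-of-envelope check of that VRAM claim; the layer/head counts below are from Qwen2.5-72B's published config, while the quant size, batch, and context are assumptions:

```python
# Rough VRAM budget for a 72B model at ~4-bit with batched 32k contexts.
# Layer/head counts come from the model config; everything else is assumed.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, n_seqs, bytes_per_elem=2):
    """FP16 KV cache: K and V tensors per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx * n_seqs / 2**30

weights_gib = 72e9 * 0.5 / 2**30             # ~4-bit quant: ~0.5 bytes/param
kv_gib = kv_cache_gib(80, 8, 128, 32768, 4)  # four concurrent 32k sequences
print(f"weights ~{weights_gib:.0f} GiB + KV cache ~{kv_gib:.0f} GiB "
      f"vs 64 GiB total on two 5090s")       # ~34 + ~40: over budget
```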
Infinite100p@reddit
Don't 5090s have some passthrough bug where they idle at high wattage or something like that?