What Size of LLM Can 4x RTX 5090 Handle? (96GB VRAM)
Posted by Affectionate_Arm725@reddit | LocalLLaMA | 14 comments
I currently have access to a server equipped with 4x RTX 5090 GPUs. This setup provides a total of 96GB of VRAM (assuming 24GB per card).
I'm planning to use this machine specifically for running and fine-tuning open-source Large Language Models (LLMs).
Has anyone in the community tested a similar setup? I'm curious to know:
- What is the maximum size (in parameters) of a model I can reliably run for inference with this 96GB configuration (e.g., Qwen-72B, Llama 3 70B)?
- What size of model could I feasibly fine-tune, and what training techniques would be recommended (e.g., QLoRA, full fine-tuning)? A rough sketch of the kind of setup I have in mind is below.
Any real-world benchmarks or advice would be greatly appreciated! Thanks in advance!
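For context, the QLoRA-style setup I'm picturing looks roughly like this (the base model name, target modules, and hyperparameters are just illustrative placeholders, not a tested recipe):

```python
# Minimal QLoRA-style setup sketch: 4-bit base weights + LoRA adapters via peft.
# The base model name and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",          # example base model, not a recommendation
    quantization_config=bnb_config,
    device_map="auto",                      # shard the quantized weights across the GPUs
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # placeholder projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```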
Aggressive-Bother470@reddit
Nothing that will justify the cost, unfortunately.
TechnicalGeologist99@reddit
Half the 96 tells you the biggest dense model you could fit at FP16 precision (2 bytes per parameter), so about 48B parameters; at INT4 you can fit roughly four times that.
However, you need to allow overhead for the KV cache on top of the weights. I'd recommend around 20-30% as overhead for stability with large context windows.
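Rough sketch of that arithmetic (the bit-widths and the 25% overhead figure are ballpark assumptions, not measurements):

```python
# Back-of-the-envelope VRAM estimate: weights at a given bit-width plus a
# fractional KV-cache/activation allowance. All numbers are rough assumptions.

def estimate_vram_gb(params_b: float, bits_per_weight: float, kv_overhead: float = 0.25) -> float:
    weight_gb = params_b * bits_per_weight / 8   # params in billions -> GB of weights
    return weight_gb * (1 + kv_overhead)

if __name__ == "__main__":
    for params in (48, 70, 120, 235):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit: ~{estimate_vram_gb(params, bits):.0f} GB")
```

Swap in your own bit-width and overhead fraction; real usage also depends on context length and batch size.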
Loose_Historian@reddit
Qwen3 235B Instruct EXL3 4.0bpw using TabbyAPI fits with 8k of context and 49k of cache. This model is significantly better than Llama3 70B or equivalents. Token generation is around 40 tokens per second if all cards are on PCIe 5.0 x16. I recommend either MikeRoz's 4.0bpw quant or DoctorShotgun's 3.07bpw optimized quant if you need to fit more context or cache. You can find them on HuggingFace.
dinerburgeryum@reddit
This person knows. If you go Tabby, remember you can quantize your KV cache super effectively, given their use of Hadamard transforms to create a sort of dynamic SmoothQuant implementation for the cache. 6-bit holds up really well and can more than double your context window.
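Back-of-the-envelope on why (the layer/head numbers below are made-up placeholders, not any particular model's config):

```python
# How many tokens of KV cache fit in a fixed VRAM budget at different cache precisions.
# The model dimensions below are hypothetical placeholders; plug in the real config.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bits: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * bits / 8  # 2x for keys and values

def max_context(cache_budget_gb: float, bits: float, **dims) -> int:
    return int(cache_budget_gb * 1e9 // kv_bytes_per_token(bits=bits, **dims))

if __name__ == "__main__":
    dims = dict(n_layers=80, n_kv_heads=8, head_dim=128)  # placeholder GQA-style model
    for bits in (16, 8, 6, 4):
        print(f"{bits}-bit cache: ~{max_context(10.0, bits, **dims):,} tokens per 10 GB")
```

With those placeholder dims, going from a 16-bit to a 6-bit cache gives roughly 2.7x the tokens for the same budget.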
danny_094@reddit
4x RTX 5090 won't do you any good. You can't share the VRAM. This only works with the server cards. You can only use one graphics card per model.
wild_abra_kadabra@reddit
that's not true...
StardockEngineer@reddit
Open LM Studio, or go to Ollama's website. Any model and quant with a file size below 128GB will fit (well, minus some overhead and however much context you want).
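If you want to sanity-check that file-size rule before downloading anything, you can sum a repo's file sizes with huggingface_hub (the repo id and quant pattern below are just example placeholders):

```python
# Sum a Hugging Face repo's matching file sizes to see whether a quant plausibly fits.
# Repo id and filename pattern are example placeholders; point them at the quant you want.
from huggingface_hub import HfApi

def quant_size_gb(repo_id: str, pattern: str) -> float:
    info = HfApi().model_info(repo_id, files_metadata=True)
    return sum(f.size or 0 for f in info.siblings if pattern in f.rfilename) / 1e9

if __name__ == "__main__":
    repo = "unsloth/gpt-oss-120b-GGUF"   # example placeholder repo
    print(f"{repo} Q4_K_M: ~{quant_size_gb(repo, 'Q4_K_M'):.1f} GB of files")
```

Compare the printed size against your VRAM, leaving headroom for context.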
AMOVCS@reddit
4x 32GB = 128GB VRAM is what you should have. With this alone you could run GPT OSS 120B and GLM 4.5 Air very, very fast. You can also try Minimax M2 at lower quants, and Qwen 3 Next. If you pair this system with an additional 128GB of DDR5 RAM, you could run Minimax M2 at a higher quant, GLM 4.6 at decent speed, and Qwen 3 235B.
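When a quant spills a bit past VRAM, partial offload looks roughly like this with llama-cpp-python (the model path, layer count, and context size are placeholders to tune, not recommendations):

```python
# Partial GPU offload with llama-cpp-python when a quant doesn't fully fit in VRAM:
# keep as many layers as possible on the GPUs, run the rest from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/example-moe-q4_k_m.gguf",  # hypothetical local GGUF path
    n_gpu_layers=60,   # layers kept on the GPUs; lower this if you run out of VRAM
    n_ctx=32768,       # context window
)

out = llm("Briefly explain mixture-of-experts offloading.", max_tokens=200)
print(out["choices"][0]["text"])
```

llama.cpp splits the offloaded layers across all visible GPUs by default, so you mostly just tune n_gpu_layers until it fits.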
Affectionate_Arm725@reddit (OP)
Thanks, dude
swagonflyyyy@reddit
GPT-OSS-120b
Affectionate_Arm725@reddit (OP)
Thanks
tomz17@reddit
5090s have 32GB of VRAM.
Affectionate_Arm725@reddit (OP)
My mistake, sorry
valdev@reddit
Essentially the same models that 4x 3090s can run, for the price of 1x 5090 (albeit quite a bit faster; think 270 tk/s on gpt-oss-120b vs 140 tk/s).
If you choose to do this, make sure you have the PCIe lanes. My life has been a nightmare figuring that out without throwing insane amounts of cash at it.