What Size of LLM Can 4x RTX 5090 Handle? (128GB VRAM)

Posted by Affectionate_Arm725@reddit | LocalLLaMA | 14 comments

I currently have access to a server equipped with 4x RTX 5090 GPUs. This setup provides a total of 128GB of VRAM (32GB per card).

I'm planning to use this machine specifically for running and fine-tuning open-source Large Language Models (LLMs).

Has anyone in the community tested a similar setup? I'm curious to know:

  1. What is the maximum size (in parameters) of a model I can reliably run for inference with this 128GB configuration? (e.g., Qwen-72B, Llama 3 70B; rough VRAM math sketched below)
  2. What size of model could I feasibly fine-tune, and which training techniques would be recommended (e.g., QLoRA vs. full fine-tuning)? A minimal QLoRA sketch of what I had in mind follows the list.
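
For question 1, here is the back-of-the-envelope math I've been using so far: weights take roughly parameters × bytes-per-parameter, plus headroom for the KV cache, activations, and runtime overhead. The 20% overhead factor below is just a guess, not a measurement, and the model list is only for illustration.

```python
# Rough VRAM estimate: weights = params * bytes-per-parameter,
# scaled by an assumed overhead fraction for KV cache, activations, and runtime.
def estimate_vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 0.2) -> float:
    weights_gb = params_billions * bytes_per_param  # 1B params at 2 bytes ~= 2 GB
    return weights_gb * (1 + overhead)

for name, size_b in [("Llama 3 70B", 70), ("Qwen-72B", 72)]:
    for label, bpp in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
        print(f"{name} @ {label}: ~{estimate_vram_gb(size_b, bpp):.0f} GB")
```

By that estimate a 70B-class model at FP16 (~170 GB) would not fit, but INT8 (~85 GB) or 4-bit (~45 GB) should leave room for a usable context window, which is exactly the kind of thing I'd like real-world confirmation on.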
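For question 2, this is the kind of QLoRA setup I had in mind, based on the standard transformers + peft + bitsandbytes stack. The model name and hyperparameters are placeholders rather than recommendations, and I haven't validated this on the 4x 5090 box yet.

```python
# Minimal QLoRA sketch: load a 4-bit quantized base model sharded across the GPUs,
# then train only small LoRA adapters on top of it.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-70B"  # placeholder; any causal LM should work

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # NF4 keeps a 70B base at roughly 35-40 GB
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # shard the quantized weights across the 4 GPUs
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA adapters are trained
```

My working assumption is that QLoRA on a 70B-class model is realistic with this setup, while full fine-tuning at that size is not; corrections welcome.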

Any real-world benchmarks or advice would be greatly appreciated! Thanks in advance!