Built myself a bit of a local LLM workhorse. What's a good model to try out with llama.cpp that will put my 56GB of VRAM to good use? Any other fun suggestions?
Posted by SBoots@reddit | LocalLLaMA | 40 comments
javiers@reddit
It looks nice (silently cries over my RTX 4070 with a whopping 12GB of VRAM).
Aggravating_Pinch@reddit
Is it a good idea to mix unlike GPUs?
SBoots@reddit (OP)
Some software won't play nice, but so far it's been working pretty well
Aggravating_Pinch@reddit
thanks, I got this setup today...installations in progress
Specs:
CPU: AMD Ryzen Threadripper 9960X (24C/48T, Zen 5)
Motherboard: ASUS Pro WS TRX50-SAGE WIFI A (4-channel DDR5, PCIe 5.0, ECC RDIMM)
RAM: 128GB Kingston DDR5-6400 ECC Registered (2×64GB)
GPU: RTX 5090 Founders Edition 32GB GDDR7
NVMe: 2TB Samsung 9100 PRO Gen5
HDD: 16TB Seagate IronWolf Pro (cold model storage)
OS: Kubuntu 26.04 LTS
Any gotchas? thanks in advance
BitGreen1270@reddit
I get 20 t/s on Gemma4-26B on my potato 780M iGPU. It goes up to 110 t/s on a 3090. What do you get on your monster?
SBoots@reddit (OP)
With just the 5090 I get 52 t/s on gemma-4-31B-it-Q6_K.gguf
Across both cards I'm seeing 28 t/s on meta-llama-3-70b-instruct.Q4_K_M.gguf
The 5090 alone hits 149.3 t/s on gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf
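If anyone wants to reproduce numbers like these, llama.cpp ships a llama-bench tool that measures prompt processing and generation speed; something along these lines should do it (the model path is just a placeholder for whatever GGUF you have locally):

# -ngl 99 offloads all layers to the GPU; -p/-n set prompt and generation lengths
llama-bench -m ./models/gemma-4-31B-it-Q6_K.gguf -ngl 99 -p 512 -n 128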
BitGreen1270@reddit
Brilliant! 28 t/s on a 70B dense model is astounding, thanks for testing!
c4talystza@reddit
Does it matter that the second GPU is only in an x4 slot? MSI MPG X870E Carbon W?
I'm about to put in a second 3090 and I'm pulling my hair out that my mobo (Z790 ASRock Steel Legend, Intel) can't do x8/x8
SBoots@reddit (OP)
MSI X870E Carbon Max Wifi. Both PCIe slots are running x8/x8. The 4090 isn't running at peak performance, but it's still fast enough for me
c4talystza@reddit
ChatGPT totally got your specs wrong, sorry :/ Your motherboard is decent! I'll have to see how I end up, because x4 via the chipset (not the CPU) will likely be a bit crap for a single model? We shall see.
SBoots@reddit (OP)
The LLMs always confuse the MAX with the non-MAX variant. One of the changes they made to the MAX version was the ability to do x8/x8
Plastic-Stress-6468@reddit
nvidia-smi dmon -s pt
Check to see if pci bandwidth is ever saturated at all.
I assume it's probably no big deal. RPC over the network sends traffic in the sub-GB/s range, hundreds of MB/s if I recall correctly, and that's with even more overhead due to TCP protocol chatter.
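If you want it on a timeline while a model loads, something like this works (stop with Ctrl+C):

# -s p = power draw and temperature, t = PCIe rx/tx throughput in MB/s
# -d 1 samples every second; -o T adds a timestamp to each row
nvidia-smi dmon -s pt -d 1 -o T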
SBoots@reddit (OP)
I saw one burst of 24xxx MB/s on my 5090 and one burst of 13xxx MB/s on my 4090 when loading a model, but after that the numbers are insignificant.
c4talystza@reddit
Yeah, seems very lean? This is a single 3090 running qwen3.5-9b on vLLM, not very optimized, FP16... (Newbie!)
SBoots@reddit (OP)
Oh very cool. I'm going to test that out tonight and see where mine lands.
c4talystza@reddit
Thanks! Will dig more tomorrow. Learning so much
klenen@reddit
Welcome to tensor parallelism and all-reduce!
b0tbuilder@reddit
Probably fine. If you're on PCIe 5.0, an x4 link has the same bandwidth as PCIe 4.0 x8, roughly 16 GB/s each way
Long_comment_san@reddit
Did you forget why you built it?
Maleficent-Ad5999@reddit
Not blaming him.. I want at least 4x RTX 6000 GPUs but I have no idea why I need them
Long_comment_san@reddit
I would blame. The most memory you need is enough to fit the ~50B active parameters of a big MoE plus the context; at Q4 that's roughly 25-30 GB of weights before KV cache. I have no idea why anyone would need more memory except for speed, which is an irrelevant metric for long tasks.
HornyGooner4402@reddit
Must've taken a while to save up to buy it
SBoots@reddit (OP)
😂
Just seeing what people are doing with comparable setups
No_Night679@reddit
🤣😂🥹
IngwiePhoenix@reddit
RIP power bill tho... x)
Adrenolin01@reddit
Solar! 🖕🏻the power company. 😆
SBoots@reddit (OP)
I have a power monitor on my house and it's funny seeing the usage spike at the command line lol
IngwiePhoenix@reddit
That bad? xD Damn. I have only a 4090 and went to buy a 1200W PSU because nobody could properly tell me what to go for (paired with a Ryzen 9 3900X). What are you using to power both of those?
SBoots@reddit (OP)
Corsair HX1500i
m31317015@reddit
I would say Qwen3.6 27B Q4 with the full 256k context, but the tool calling is kinda bad on my side, so I'd recommend Gemma4 31B instead, also at full 256k context, Q4_K_M.
Maybe also an embedding model to use LanceDB with; currently playing with it and it's quite good for RAG alongside the big context window.
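If you're serving through llama.cpp, a rough llama-server invocation for that setup might look like this (the model filename is a placeholder, and keep in mind the KV cache for 256k context eats a lot of VRAM on top of the weights):

# -c 262144 sets the full 256k context window; -ngl 99 keeps all layers on GPU
llama-server -m ./models/gemma4-31B-it-Q4_K_M.gguf -c 262144 -ngl 99 --port 8080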
SBoots@reddit (OP)
I've used Gemma4 a bit and I do like it
ambient_temp_xeno@reddit
Put Ubuntu 24.04 on it, I reckon.
SBoots@reddit (OP)
Running 26.04
Califorskin@reddit
I think what he means is that kernel 7 is still pretty new. I tried to get ROCm working on 26.04 and it just didn't work. Switched to 24.04 and had no issues
specify_@reddit
Qwen 3.6 27B cyankiwi AWQ-INT4, running in vLLM with tensor parallelism and speculative decoding, using opencode with oh-my-openagent. Clone a GitHub repo like llama.cpp and ask it to do a full Rust port.
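The tensor parallelism part is just a flag on vllm serve; a minimal sketch, with the model ID below as a placeholder and the speculative decoding config left to the vLLM docs:

# --tensor-parallel-size 2 splits each layer across both GPUs
vllm serve Qwen/Qwen3.6-27B-AWQ --tensor-parallel-size 2 --max-model-len 32768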
hec_ovi@reddit
Yes, I made it work with AMD but it was a pain... a lot of patch scripts
oxygen_addiction@reddit
Q8 Qwen 3.6 27B, ideally via vLLM so you can use MTP or Dflash to get anywhere from 1.2-2x the speed for token generation.
SBoots@reddit (OP)
Going to check out 3.6 and vLLM. Thanks!
deenspaces@reddit
Second this, try qwen3.6-35b-a3b as well
b0tbuilder@reddit
Yup, both good calls.