Built myself a bit of a local LLM workhorse. What's a good model to try out with llama.cpp that will put my 56GB of VRAM to good use? Any other fun suggestions?
Posted by SBoots@reddit | LocalLLaMA | 40 comments
javiers@reddit
It looks nice (silently cries over my RTX 4070 with a whopping 12GB of VRAM).
Aggravating_Pinch@reddit
Is it a good idea to mix unlike GPUs?
SBoots@reddit (OP)
Some software won't play nice, but so far it's been working pretty well
Aggravating_Pinch@reddit
thanks, I got this setup today...installations in progress
Specs:
CPU: AMD Ryzen Threadripper 9960X (24C/48T, Zen 5)
Motherboard: ASUS Pro WS TRX50-SAGE WIFI A (4-channel DDR5, PCIe 5.0, ECC RDIMM)
RAM: 128GB Kingston DDR5-6400 ECC Registered (2×64GB)
GPU: RTX 5090 Founders Edition 32GB GDDR7
NVMe: 2TB Samsung 9100 PRO Gen5
HDD: 16TB Seagate IronWolf Pro (cold model storage)
OS: Kubuntu 26.04 LTS
Any gotchas? thanks in advance
BitGreen1270@reddit
I get 20 t/s on Gemma4-26B on my potato 780M iGPU. It goes up to 110 t/s on a 3090. What do you get on your monster?
SBoots@reddit (OP)
With just the 5090 I get 52 t/s on gemma-4-31B-it-Q6_K.gguf
Across both cards I'm seeing 28 t/s on meta-llama-3-70b-instruct.Q4_K_M.gguf
The 5090 alone hits 149.3 t/s on gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf
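If anyone wants to reproduce numbers like these, llama.cpp ships a llama-bench tool that measures prompt processing and generation speed; something along these lines should do it (the model path is just a placeholder for whatever GGUF you have locally):

# -ngl 99 offloads all layers to the GPU; -p/-n set prompt and generation lengths
llama-bench -m ./models/gemma-4-31B-it-Q6_K.gguf -ngl 99 -p 512 -n 128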
BitGreen1270@reddit
Brilliant! 28 t/s on a 70B dense model is astounding, thanks for testing!
c4talystza@reddit
Does it matter that the second GPU is only in an x4 slot? MSI MPG X870E Carbon W?
I'm about to put in a second 3090 and I'm pulling my hair out that my mobo (Z790 ASRock Steel Legend, Intel) can't do x8/x8
SBoots@reddit (OP)
MSI X870E Carbon Max Wifi. Both PCIe slots are running x8/x8. The 4090 isn't running at peak performance, but it's still fast enough for me
c4talystza@reddit
ChatGPT totally got your specs wrong, sorry :/ Your motherboard is decent! I'll have to see how I end up, because x4 via the chipset (not the CPU) will likely be a bit crap for a single model? We shall see.
SBoots@reddit (OP)
The LLMs always confuse the MAX with the non-MAX variant. One of the changes they made to the MAX version was the ability to do x8/x8
Plastic-Stress-6468@reddit
nvidia-smi dmon -s pt
Check to see if pci bandwidth is ever saturated at all.
I assume it's probably no big deal. RPC over the network sends traffic in the sub-GB/s range, hundreds of MB/s if I recall correctly, and that's with even more overhead due to TCP protocol chatter.
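If you want it on a timeline while a model loads, something like this works (stop with Ctrl+C):

# -s p = power draw and temperature, t = PCIe rx/tx throughput in MB/s
# -d 1 samples every second; -o T adds a timestamp to each row
nvidia-smi dmon -s pt -d 1 -o T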
SBoots@reddit (OP)
I saw one burst of 24xxx MB/s on my 5090 and one burst of 13xxx MB/s on my 4090 when loading a model, but after that the numbers are insignificant.
c4talystza@reddit
Yeah, seems very lean? This is a single 3090 running qwen3.5-9b on vLLM, not very optimized, FP16... (Newbie!)
SBoots@reddit (OP)
Oh very cool. I'm going to test that out tonight and see where mine lands.
c4talystza@reddit
Thanks! Will dig more tomorrow. Learning so much
klenen@reddit
Welcome to tensor parallelism and all-reduce!
b0tbuilder@reddit
Probably fine. If you're on PCIe 5.0, an x4 link has the same bandwidth as PCIe 4.0 x8, roughly 16 GB/s each way
Long_comment_san@reddit
Did you forget why you built it?
Maleficent-Ad5999@reddit
Not blaming him.. I want at least 4x RTX 6000 GPUs but I have no idea why I need them
Long_comment_san@reddit
I would blame. The most memory you need is enough to fit the ~50B active parameters of a big MoE plus the context; at Q4 that's roughly 25-30 GB of weights before KV cache. I have no idea why anyone would need more memory except for speed, which is an irrelevant metric for long tasks.
HornyGooner4402@reddit
Must've taken a while to save up to buy it
SBoots@reddit (OP)
😂
Just seeing what people are doing with comparable setups
No_Night679@reddit
🤣😂🥹
IngwiePhoenix@reddit
RIP power bill tho... x)
Adrenolin01@reddit
Solar! 🖕🏻the power company. 😆
SBoots@reddit (OP)
I have a power monitor on my house and it's funny seeing the usage spike at the command line lol
IngwiePhoenix@reddit
That bad? xD Damn. I have only a 4090 and went to buy a 1200W PSU because nobody could properly tell me what to go for (paired with a Ryzen 9 3900X). What are you using to power both of those?
SBoots@reddit (OP)
Corsair HX1500i
m31317015@reddit
I would say Qwen3.6 27B Q4 with the full 256k context, but the tool calling is kinda bad on my side, so I'd recommend Gemma4 31B instead, also at full 256k context, Q4_K_M.
Maybe also an embedding model to use LanceDB with; currently playing with it and it's quite good for RAG alongside the big context window.
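If you're serving through llama.cpp, a rough llama-server invocation for that setup might look like this (the model filename is a placeholder, and keep in mind the KV cache for 256k context eats a lot of VRAM on top of the weights):

# -c 262144 sets the full 256k context window; -ngl 99 keeps all layers on GPU
llama-server -m ./models/gemma4-31B-it-Q4_K_M.gguf -c 262144 -ngl 99 --port 8080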
SBoots@reddit (OP)
I've used Gemma4 a bit and I do like it
ambient_temp_xeno@reddit
Put Ubuntu 24.04 on it, I reckon.
SBoots@reddit (OP)
Running 26.04
Califorskin@reddit
I think what he means is that kernel 7 is still pretty new. I tried to get ROCm working on 26.04 and it just didn't work. Switched to 24.04 and had no issues
specify_@reddit
Qwen 3.6 27B cyankiwi AWQ-INT4, running in vLLM with tensor parallelism and speculative decoding, using opencode with oh-my-openagent. Clone a GitHub repo like llama.cpp and ask it to do a full Rust port.
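The tensor parallelism part is just a flag on vllm serve; a minimal sketch, with the model ID below as a placeholder and the speculative decoding config left to the vLLM docs:

# --tensor-parallel-size 2 splits each layer across both GPUs
vllm serve Qwen/Qwen3.6-27B-AWQ --tensor-parallel-size 2 --max-model-len 32768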
hec_ovi@reddit
Yes, I made it work with AMD but it was a pain... a lot of patch scripts
oxygen_addiction@reddit
Q8 Qwen 3.6 27B, ideally via vLLM so you can use MTP or Dflash to get anywhere from 1.2-2x the speed for token generation.
SBoots@reddit (OP)
Going to check out 3.6 and vLLM. Thanks!
deenspaces@reddit
Second this, try qwen3.6-35b-a3b as well
b0tbuilder@reddit
Yup, both good calls.