Ask me to run models
Posted by monoidconcat@reddit | LocalLLaMA | 136 comments
Hi guys, I am currently in the process of upgrading my 4×3090 setup to 2×5090 + 1×RTX Pro 6000. As a result, I have all three kinds of cards in the rig temporarily, and I thought it would be a good idea to take some requests for models to run on my machine.
Here is my current setup: - 1× RTX Pro 6000 Blackwell, power limited to 525 W - 2× RTX 5090, power limited to 500 W - 2× RTX 3090, power limited to 280 W - WRX80E (PCIe 4.0 x16) with 3975WX - 512 GB DDR4 RAM
If you have any model that you want me to run with a specific setup (certain cards, parallelism methods, etc.), let me know in the comments. I’ll run them this weekend and reply with the tok/s!
Suspicious-Elk-4638@reddit
Can you run the DeepSeek 685B Math V2? I'm really curious
Real-Valuable-5303@reddit
MiniMax M2 FP8 as big of a context as you can.
SkyFeistyLlama8@reddit
I'm gonna ask you about your power consumption. How many kW are you seeing at full tilt and did you have to do any changes to your wiring?
deathtoallparasites@reddit
I ask you to run models
monoidconcat@reddit (OP)
I reply with tok/s
Ok-Internal9317@reddit
qwen3-coder vllm
IAmBackForMore@reddit
Kimi K2 thinking instruct Max DOLPHIN LIBERATED GIGAMAX FREEDED 4028B
qwer1627@reddit
LLaMa4 405B 😇
MoffKalast@reddit
SmolLM2-135M, let's see those tokens.
monoidconcat@reddit (OP)
Single stream generation
- RTX 6000: 715 tok/s
- RTX 5090(x1): 683 tok/s
- RTX 3090(x1): 523 tok/s
Concurrency=64
- RTX 6000: 11639 tok/s
- RTX 5090(x1): 11056 tok/s
- RTX 3090(x1): 6171 tok/s
MoffKalast@reddit
Hmm so not quite up to the "I'm doing 1000 calculations per second and they're all wrong" meme. But it does get close :P
Agreeable-Market-692@reddit
at that speed though best-of-n is very feasible for multiple people even
sourpatchgrownadults@reddit
Asrock or ASUS mobo?
I have an ASUS WRX80 that I'm trying to build and it's so unstable, even after sending it to the manufacturer :(
mytreya96@reddit
The more you buy the more you save!! Nice setup 👍🏻
PraxisOG@reddit
An interesting one might be comparing 2x3090 under tensor parallelism to 1x5090 with a model under 24gb. In theory the 3090s have more combined bandwidth, but in practice probably don't scale linearly.
monoidconcat@reddit (OP)
Qwen3 30b a3b at AWQ 4bit would be a good candidate for this
OctaviaZamora@reddit
I'm curious as well! By the way, here I am, being so very happy with my single RTX 5090, and then I saw your picture. Damn. Good for you.
monoidconcat@reddit (OP)
cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit
Input=1024, Output=2048, Concurrency=1
- RTX 5090: 218 tok/s
- RTX 3090(x1): 157 tok/s
- RTX 3090(x2, TP): 155 tok/s
Input=1024, Output=2048, Concurrency=16
- RTX 5090: 1793 tok/s
- RTX 3090(x1): 783 tok/s
- RTX 3090(x2, TP): 1032 tok/s
Nepherpitu@reddit
What is your PCIe configuration? Are the 3090s at full speed 4.0 x16? Did you compile the patched driver with P2P access?
monoidconcat@reddit (OP)
The patched P2P driver unfortunately broke my machine once (abrupt power-off under high load) so I had to roll back. Full speed 4.0 x16, w/o NVLink.
panchovix@reddit
Not him but most WRX80 boards do have either PCIe 4.0 X8 or PCIe 4.0 X16. PCIe X4 is mostly seen on TR40 boards.
s2k4ever@reddit
How do you run things to get such performance? I have a 6000 and a 5090 and I can't seem to get this perf.
Also, the image
monoidconcat@reddit (OP)
If you are using llama.cpp, I highly recommend switching over to vllm (and an AWQ quant). I can't say that this is the definitive reason, but in my experience vllm had better-optimized inference. That little port is for additional output; it supports many more displays than DP
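A minimal vllm launch along those lines could look like this sketch. The model ID is the AWQ quant mentioned elsewhere in the thread; the parallelism and context settings are illustrative assumptions, not OP's actual command:

```shell
# Serve a 4-bit AWQ quant with vLLM, tensor-parallel across 2 GPUs.
# Context length here is an illustrative choice, not a verified config.
vllm serve cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 32768
```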
qwer1627@reddit
Gods work ty
Distinct-Target7503@reddit
Interesting... are you using the two 3090s at PCIe x4 or x16? Also, how are they connected? Did you use a switch?
If you are using those two cards at x16, would it be possible for you to test them at x4? That comparison would be really helpful for me.
panchovix@reddit
Not him but most WRX80 boards do have either PCIe 4.0 X8 or PCIe 4.0 X16. PCIe X4 is mostly seen on TR40 boards.
Distinct-Target7503@reddit
Ok thanks!
PraxisOG@reddit
Thanks! It's interesting that it scales better with higher concurrency. I briefly considered throwing together like 16 RX 580s to run in groups of 4, and I'm really glad I went a different route instead.
MatlowAI@reddit
Can you do the same for 2x 5090 vs RTX 6000 Pro for this model, and then Qwen3-Next-80B-A3B-Instruct-AWQ-4bit?
s2k4ever@reddit
What is this rig called?
commonsasquatch@reddit
What rig is that?
Puzzleheaded_Tip7043@reddit
so you're the one taking all the ram!
NeverLookBothWays@reddit
Nice space heater!
monoidconcat@reddit (OP)
Entropy to intelligence machine
0point01@reddit
I don't get it. The way I understand it, you only create more entropy, no? Everything else is just a side effect
monoidconcat@reddit (OP)
I should have said negative entropy to intelligence machine haha
Asleep-Ingenuity-481@reddit
Qwen3 0.6b. Might be a bit too big but idk.
koflerdavid@reddit
Unsloth made some Q1S quants, they might fit.
monoidconcat@reddit (OP)
Still too large for my machine!
Tiny-Criticism-86@reddit
This bro's got the GDP of a small African country on his server rack
Interesting-Tip-2712@reddit
PrimeIntellect/INTELLECT-3
moistiest_dangles@reddit
Would you be interested in helping to open source some medical models?
Wubbywub@reddit
minecraft
xzpyth@reddit
https://i.redd.it/dx1sdxj8z04g1.gif
random-tomato@reddit
Woah first time I've seen this in GIF/Video form haha
Sudden-Lingonberry-8@reddit
it was ai generated lmao
xzpyth@reddit
It's because I've made it with grok imagine 😅
joelW777@reddit
GLM4.6 exl3 4 bpw would be interesting with TabbyAPI
ClearApartment2627@reddit
Great idea, very much appreciated.
I would be interested in Qwen 235b 2507-thinking as an AWQ quant:
https://huggingface.co/AIDXteam/Qwen3-235B-A22B-Thinking-2507-AWQ
danny_094@reddit
What exactly is your goal with this setup?
Important to understand: you cannot pool consumer GPUs into one big VRAM pool.
Reason:
NVLink on RTX practically only means:
"fast asset/texture sharing for software that explicitly supports it."
- Games? No
- LLM frameworks? Definitely not
So:
You can't simply add up VRAM to load one big model.
Each card can host one model at a time (or its own jobs).
If VRAM really were additive,
then everyone buying an H100 with 94 GB of VRAM would be completely crazy,
since you could bolt together 4 consumer cards much more cheaply. 😄
monoidconcat@reddit (OP)
First of all, neither the RTX Pro 6000 nor the RTX 5090 supports NVLink anyway. I bought these two kinds of cards for specific purposes: the 6000 for inference, and the 5090s for training. When I need to utilize them all at once, I can just use pipeline parallelism 🙃
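In vllm terms, that combination could be sketched roughly like this; the split sizes are illustrative assumptions, not OP's actual command, and the model ID is just the AWQ quant mentioned earlier in the thread:

```shell
# Combine pipeline and tensor parallelism in vLLM: 2 pipeline stages,
# each stage tensor-parallel across 2 GPUs (4 GPUs total).
vllm serve cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2
```

Pipeline parallelism tolerates mismatched cards better than tensor parallelism, since each stage only hands activations to the next over PCIe rather than syncing every layer.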
kearm@reddit
As someone who trains on two RTX 6000 Pros, why the 5090s for training?
tat_tvam_asshole@reddit
have you utilized the recent DMA unlocks for 5090s?
danny_094@reddit
Pipeline parallelism? You split a model into BLOCKS and push data piece by piece through multiple GPUs? You wait on the next GPU for every token. Every GPU has to wait for the previous one = bottleneck = extremely slow, GIGANTIC communication overhead. Consumer PCs have no NVSwitch, PCIe is far too slow, and no mainstream frameworks support this stably. Why do you put yourself through this? Every GPU waits for the next one and it just keeps getting slower. Do you even manage 1 token per second? Seriously, I'm curious :D
egnegn1@reddit
You can also split models across multiple cards. But where you're right is that the performance doesn't increase much, because communication over PCIe is too slow.
There are plenty of videos about this on YouTube.
danny_094@reddit
Of course it works in theory.
But only to a very limited extent, and we have to distinguish between "works technically" and "works usefully". Purely technically you can also start DeepSeek-600B on an RTX 2080 + 800 GB of RAM. Yes, it starts. Yes, it works. The only question is: how well does it work? 0.0001 tokens per second? 1 token per minute? 1 token per hour? "It runs" doesn't automatically mean "usable", and that's exactly the difference many people overlook.
egnegn1@reddit
That's not true, as you can see from many examples. For example, there are many setups with 4x RTX 3090 or 4x Apple Studio M4 Max which can run a large model at usable t/s.
Of course, it always depends on your personal requirements. The higher they are, the more you have to pay. Most of those discussing here are private users, so you can take it a little more loosely.
danny_094@reddit
I take it easy too. But don't you think people should be careful not to spend money on false expectations? I don't begrudge anyone a GPU rig with, say, 10 RTX cards, but false claims spread false expectations.
Let's assume 4x RTX 3090:
That's 96 GB VRAM for 5,600 euros.
An Nvidia H100 with 94 GB costs 34,000 euros.
If I can achieve almost the same with 4x RTX 3090 as with an Nvidia H100,
at 6x lower purchase cost, why would I, as a company, buy an H100?
egnegn1@reddit
Please take a look at this post: https://www.reddit.com/r/LocalLLaMA/s/B99oXseNNi
MitsotakiShogun@reddit
Does Reddit auto-translate for you?
It's the second or third time I see someone replying in German to a post in English, and I simply don't get what it is with the German replies on an English sub/post. Original posts in German (or other native languages) I get, replying to someone in another language I do not. And I also haven't seen any stray Spaniards / Namibians / Mongolians / ... land here, just Germans D:
NemesisCrow@reddit
For me reddit does this in the web browser. So annoying. If I open any reddit link a "?lang=de" or something like this gets added as an url suffix and I constantly have to manually remove it. It's unusable and reddit does seem to override my personal settings. Give me the option, but don't force something like this on me. Additionally it just confuses people, as seen in this thread.
danny_094@reddit
Reddit translates automatically, yes.
MitsotakiShogun@reddit
Cool, are you using an app or web? I don't see the setting on web.
danny_094@reddit
On the smartphone it's right at the top of the screen, in the Reddit app. And in the browser on the MacBook the preferred language is selected automatically.
MitsotakiShogun@reddit
I see it now, thanks!
chub0ka@reddit
I have 8x3090 with pairs on NVLink; EP8/TP2 works well for me, >4x speedup
kearm@reddit
Wait, as a previous owner of an 8x3090 rig and now a 4x3090 Ti plus 2x RTX 6000 Pro rig: why the dual 5090s?
jacek2023@reddit
don't know if you have the diskspace for it, but I would like to see Grok 2 benchmark on that
monoidconcat@reddit (OP)
Okay, this one is quite massive. I can download it, but running it will require intense CPU offloading
jacek2023@reddit
why? how much VRAM can you have max now?
monoidconcat@reddit (OP)
At max, 96 + 32 + 32 + 24 + 24, so 208 GB, but vllm distributed inference requires an AWQ quant, which I cannot find on Hugging Face. Grok 2 at a GGUF 1-bit quant is like 89 GB, which may fit on the Pro 6000. Maybe I can try that one
zynbobguey@reddit
Could've got two DGX Sparks instead of this mess
monoidconcat@reddit (OP)
I mean yeah, they'd look much cleaner on my desk, but 2x 5090 had roughly equivalent tok/s to 2x DGX Spark on inference. Also I am mainly trying to use these for fine-tuning, so I would choose GPUs over the DGX Spark.
twnznz@reddit
this is like being the F18 and asking tower for a ground speed check while the SR-71 is above you
byteleaf@reddit
This is just insane.
natufian@reddit
😮
...this fucking guy 😅!
jacek2023@reddit
Ah I was hoping you could use llama.cpp
monoidconcat@reddit (OP)
Grok 2 IQ1_S quant, GGUF, with llama-bench, ran on RTX Pro 6000 only
Single stream: 52 tok/s
Concurrency=32: 303 tok/s
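A llama-bench invocation along those lines could look like this sketch. The flags are standard llama.cpp options, but the GGUF file name and token counts are illustrative assumptions, not OP's exact command:

```shell
# Benchmark a GGUF quant on a single GPU with llama.cpp's llama-bench.
# -ngl 99 offloads all layers to the GPU; -p/-n set prompt and
# generation token counts for the test.
./llama-bench -m grok-2-IQ1_S.gguf -ngl 99 -p 512 -n 128
```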
jacek2023@reddit
I tried Q3 on my 3x3090 and I think I got 2 or 3t/s
Opteron67@reddit
Currently only 2x3090 and 2x5090 for me. How do you share the load evenly? What vllm command line? What PSU setup?
aeroumbria@reddit
The stack of 12VHPWR connectors sending shiver down my spine... How do you make sure such a system is safe to run even unattended?
koflerdavid@reddit
He underclocks them a bit.
iMrParker@reddit
It's mostly a non-issue if they are attached properly
Lissanro@reddit
I wonder what performance you would get with the Q4_X quant of K2 Thinking if you can fit it https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF (given 512 GB RAM + 32×2+24×2+96=208GB VRAM, 720 GB in total, so it might fit if the layers are allocated to GPUs correctly). Or in case of issues, the lightweight IQ3_K quant may work. It would be especially interesting to see performance using ik_llama.cpp (I shared details here on how to build and set it up): it has greatly improved prompt processing speed, and I am curious how much difference the RTX PRO 6000 would make.
I have a rig with 4x3090 and 1 TB of DDR4 3200MHz RAM with an EPYC 7763, and this makes me even more interested in whether upgrading GPUs would make a noticeable difference. Currently with the Q4_X quant of K2 Thinking I get around 8 tokens/s and 100-150 tokens/s prompt processing, and in 96 GB VRAM I can fit either 128K context cache at Q8 + 4 full layers, or 256K context at Q8 without full layers (the reduction in performance is very small, around 5% maybe), in both cases with common expert tensors in VRAM. My guess is the RTX PRO 6000 may give a massive boost to prompt processing speed, likely over 300 tokens/s.
realpm_net@reddit
Speaking as a guy with one card in a box, can you explain this hardware rig? How do the cards connect to the computer? How are the cards powered?
monoidconcat@reddit (OP)
I use two 1600w power supply units, and also power limit the cards to just make sure everything works safely. I use a mining rig to host threadripper motherboard and all the GPUs. The cards are connected via thermaltake riser cables.
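Power limiting like that is typically done with nvidia-smi. A sketch using the wattages from the post (the GPU indices are assumptions about how the cards enumerate):

```shell
# Cap board power per GPU; settings revert on reboot unless re-applied.
sudo nvidia-smi -i 0 -pl 525   # RTX Pro 6000
sudo nvidia-smi -i 1 -pl 500   # RTX 5090
sudo nvidia-smi -i 2 -pl 500   # RTX 5090
```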
jdchmiel@reddit
Can you detail the electrical PCIe connection to each card? I'm learning the hard way: a lot of motherboards have x16 slots but are not electrically x16.
pmttyji@reddit
Could you please check CPU-only performance since you have bulk RAM? Please proceed with these medium size models. Ignore MOE list if you don't have time.
Dense models
MOE models
Add Qwen3-Next too. Thanks.
theblackcat99@reddit
Reminder
monoidconcat@reddit (OP)
Perfect, will do that. So entirely on CPU without using the GPU, right? Not just offloading?
pmttyji@reddit
Right. I use CPU version of llama.cpp setup for this.
One additional request on your benchmarks. This is to see how much t/s differs between those contexts.
Thanks again.
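A CPU-only llama.cpp setup for such a test could be sketched like this; the build flags are standard llama.cpp options, and the model file name and thread count (tuned to the 3975WX's 32 cores) are illustrative assumptions:

```shell
# Build llama.cpp without CUDA so inference stays entirely on the CPU,
# then benchmark with -ngl 0 to keep all layers off any GPU.
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j
./build/bin/llama-bench -m model-Q8_0.gguf -ngl 0 -t 32
```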
monoidconcat@reddit (OP)
Do you have any specific quant in mind? BF16? FP8?
pmttyji@reddit
Q8 please as I use llama.cpp(GGUF). Anything equivalent to Q8 is fine.
thisoilguy@reddit
Nice heater
paul_tu@reddit
Good luck with Kimi k2
burntheheretic@reddit
Which open frame is that?
I'm trying to do something similar, but on the cheap frame I purchased on AliExpress, the GPU support bar interferes with the end of the PCIe connector, so I'm not super comfortable plugging my GPUs in
InterstellarReddit@reddit
Can you run the model from weird science in the 80s? That’s the one I’m trying to get to
Agreeable-Market-692@reddit
This might be too big without clever tricks. If you can get any backend to load this on your hardware with at least a 30,000-token context window, that will satisfy my curiosity.
https://huggingface.co/meituan-longcat/LongCat-Flash-Thinking
Agreeable-Market-692@reddit
And if you need to use a quant that's OK!
egnegn1@reddit
Great idea!
Please take a look at this comparison test for inspiration:
https://www.reddit.com/r/LocalLLaMA/s/B99oXseNNi
https://www.cloudrift.ai/blog/benchmarking-rtx6000-vs-datacenter-gpus
alex_bit_@reddit
How much faster is GLM 4.5 Air on the RTX Pro 6000 Blackwell compared to your previous setup with four RTX 3090s?
Is there any noticeable difference in real world use?
Successful-Willow-72@reddit
Man, sick setup. I wish for dual 3090s but can't afford it atm. What a beast you got there
Former-Tangerine-723@reddit
Glm 4.5 air Q4_K_M 🥳
segmond@reddit
glm4.6-Q4
kimi-k2-Q3/Q4
deepseekv3Terminus - Q4
CarelessOrdinary5480@reddit
Qwen 7b to enrich paragraphs.
TomatilloPutrid3939@reddit
GLM-4.6
sleepingsysadmin@reddit
How do you get enough pcie lanes on this setup? Are they all running at 4x?
Qwen3 80b next
Magistral 2509
gpt 120b
Seed-OSS-36B-Instruct
monoidconcat@reddit (OP)
All 16x. Threadripper pro provides 128 pcie lanes.
favicocool@reddit
All via PCIe 4.0 risers? No MCIO/SlimSAS/etc?
Humble-Pick7172@reddit
Is comfyUI allowed? I wanna see the speed difference between 4x 3090, 5090 and 6000 in Wan 2.2 T2V + Native upscale
Own-Junket6393@reddit
Is it possible to achieve unified memory without NVLink? Correct me if I'm wrong; I started researching building a custom AI station to run large LLM models on a desktop. I read that only older cards such as the RTX 3090 or Titan cards support NVLink, and that no later cards up to the latest support it, which is a bottleneck for achieving unified memory. I read the DGX Spark solves this problem: you can connect two Sparks and get 256 GB of unified memory.
luwuke@reddit
Curious about Ultravox (both the llama 70b and GLM ones!)
CV514@reddit
https://huggingface.co/SicariusSicariiStuff/Sweet_Dreams_12B
If you don't mind, do a proper high-context coherency test. Your hardware should be pretty snappy with it. And reply with the tok/s as well, I've got 6.
EnthusiasmPurple85@reddit
run models!
amitbahree@reddit
I am curious to know more on the motherboard and which risers do you use.
chub0ka@reddit
Kimi deepseek minimax gptoss glm4.6
DuckyBlender@reddit
Insane amount of compute! I shall list some models: gpt-oss-120b, Qwen3 Next 80b, GLM-4.6 or GLM-4.5-Air
tat_tvam_asshole@reddit
?s for you
what's your choice of thermal interface material?
what's your power set up like?
any concerns running vertical for extended periods like that?
AmbitiousOnion7327@reddit
run openai-vpt
madsheepPL@reddit
please run models
Pixer---@reddit
Minimax m2 4/8bit
Educational-Sun-1447@reddit
Hi,
If you can, run Qwen3-VL 32B at a 4-bit quant on 1 RTX 5090. I want to see the inference speed on this. I plan to buy a 5090 to run this model but am not sure if the tok/s is good or not. If you can max out the context length, that would be helpful too.
kev_11_1@reddit
Gpt oss 120b
Willing_Landscape_61@reddit
Not sure how to bench that, but I would be interested to see if/how much Blackwell improves Q4 with a native type compared to the flops difference with the 3090. I guess pp speed of the right kind of Q4 on the 3090 vs the RTX Pro Blackwell could tell us. Does anybody know what kind of Q4 gets a pp speed boost on Blackwell?
monoidconcat@reddit (OP)
NVFP4 inference is natively supported on Blackwell consumer cards afaik. I can try running a single model with an NVFP4 quant on both cards.
fragment_me@reddit
What's your favorite model thus far and for what use case?
monoidconcat@reddit (OP)
I literally just plugged in the rtx 6000, but I enjoyed running glm 4.5 air on my previous 4x 3090 setup. The best model for personal use case.
Zerowind88@reddit
Qwen 3 next
Anyusername7294@reddit
Qwen 3 coder (the big one)
egnegn1@reddit
Nobody claims that you can achieve the same result. But by loading into VRAM you can at least achieve a significantly better result than if you only run the model in CPU memory. Some people have already tried this with a 64-core EPYC processor and lots of memory. From memory, that was also worse than the rig with 4x RTX 3090.
Ultimately it is a compromise between performance and cost. Everyone has to find their sweet spot. For many, a computer with an AMD Strix Halo AI Max+ 395, an Nvidia DGX Spark, or an Apple Mac Studio M4 Max may be a good compromise, with up to 96 GB of VRAM.
garlopf@reddit
One Claudia Schiffer plz
Orolol@reddit
Claude Opus 4.5
monoidconcat@reddit (OP)
It will take a few years as I need to buy additional 63 rtx 6000s
UniqueAttourney@reddit
Is this pewds doing it all over again ?
BenniB99@reddit
onetimeiateaburrito@reddit
Exactly what I was thinking, after being upset about not being able to afford my pleb 2x 5070 Ti setup, which I think is actually achievable for me financially. Lmao