Ask me to run models
Posted by monoidconcat@reddit | LocalLLaMA | 136 comments
Hi guys, I am currently in the process of upgrading my 4×3090 setup to 2×5090 + 1×RTX Pro 6000. As a result, I have all three kinds of cards in the rig temporarily, and I thought it would be a good idea to take some requests for models to run on my machine.
Here is my current setup: - 1× RTX Pro 6000 Blackwell, power limited to 525 W - 2× RTX 5090, power limited to 500 W - 2× RTX 3090, power limited to 280 W - WRX80E (PCIe 4.0 x16) with 3975WX - 512 GB DDR4 RAM
If you have any model that you want me to run with a specific setup (certain cards, parallelism methods, etc.), let me know in the comments. I’ll run them this weekend and reply with the tok/s!
Suspicious-Elk-4638@reddit
Can you run the DeepSeek 685B Math V2? I'm really curious
Real-Valuable-5303@reddit
MiniMax M2 FP8 as big of a context as you can.
SkyFeistyLlama8@reddit
I'm gonna ask you about your power consumption. How many kW are you seeing at full tilt and did you have to do any changes to your wiring?
deathtoallparasites@reddit
I ask you to run models
monoidconcat@reddit (OP)
I reply with tok/s
Ok-Internal9317@reddit
qwen3-coder vllm
IAmBackForMore@reddit
Kimi K2 thinking instruct Max DOLPHIN LIBERATED GIGAMAX FREEDED 4028B
qwer1627@reddit
LLaMa4 405B 😇
MoffKalast@reddit
SmolLM2-135M, let's see those tokens.
monoidconcat@reddit (OP)
Single stream generation
- RTX 6000: 715 tok/s
- RTX 5090(x1): 683 tok/s
- RTX 3090(x1): 523 tok/s
Concurrency=64
- RTX 6000: 11639 tok/s
- RTX 5090(x1): 11056 tok/s
- RTX 3090(x1): 6171 tok/s
MoffKalast@reddit
Hmm so not quite up to the "I'm doing 1000 calculations per second and they're all wrong" meme. But it does get close :P
Agreeable-Market-692@reddit
at that speed though best-of-n is very feasible for multiple people even
sourpatchgrownadults@reddit
Asrock or ASUS mobo?
I have an ASUS WRX80 that I'm trying to build and it's so unstable, even after sending it to the manufacturer :(
mytreya96@reddit
The more you buy the more you save!! Nice setup 👍🏻
PraxisOG@reddit
An interesting one might be comparing 2x3090 under tensor parallelism to 1x5090 with a model under 24gb. In theory the 3090s have more combined bandwidth, but in practice probably don't scale linearly.
monoidconcat@reddit (OP)
Qwen3 30b a3b at AWQ 4bit would be a good candidate for this
OctaviaZamora@reddit
I'm curious as well! By the way, here I am, being so very happy with my single RTX 5090, and then I saw your picture. Damn. Good for you.
monoidconcat@reddit (OP)
cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit
Input=1024, Output=2048, Concurrency=1
- RTX 5090: 218 tok/s
- RTX 3090(x1): 157 tok/s
- RTX 3090(x2, TP): 155 tok/s
Input=1024, Output=2048, Concurrency=16
- RTX 5090: 1793 tok/s
- RTX 3090(x1): 783 tok/s
- RTX 3090(x2, TP): 1032 tok/s
Nepherpitu@reddit
What is your PCIe configuration? Are the 3090s at full speed 4.0 x16? Did you compile the patched driver with P2P access?
monoidconcat@reddit (OP)
The patched P2P driver unfortunately broke my machine once (abrupt power-off under high load) so I had to roll back. Full speed 4.0 x16, w/o NVLink.
panchovix@reddit
Not him but most WRX80 boards do have either PCIe 4.0 X8 or PCIe 4.0 X16. PCIe X4 is mostly seen on TR40 boards.
s2k4ever@reddit
How do you run things to get such performance? I have a 6000 and a 5090 and I can't seem to get this perf.
Also, the image
monoidconcat@reddit (OP)
If you are using llama.cpp, I highly recommend switching over to vllm (and an AWQ quant). I can't say that this is the definitive reason, but in my experience vllm had better-optimized inference. That little port is for additional output; it supports many more displays than DP
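A minimal vllm launch along those lines could look like this sketch. The model ID is the AWQ quant mentioned elsewhere in the thread; the parallelism and context settings are illustrative assumptions, not OP's actual command:

```shell
# Serve a 4-bit AWQ quant with vLLM, tensor-parallel across 2 GPUs.
# Context length here is an illustrative choice, not a verified config.
vllm serve cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 32768
```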
qwer1627@reddit
Gods work ty
Distinct-Target7503@reddit
Interesting... are you using the two 3090s at PCIe x4 or x16? Also, how are they connected? Did you use a switch?
If you are using those two cards at x16, would it be possible for you to test them at x4? That comparison would be really helpful for me.
panchovix@reddit
Not him but most WRX80 boards do have either PCIe 4.0 X8 or PCIe 4.0 X16. PCIe X4 is mostly seen on TR40 boards.
Distinct-Target7503@reddit
Ok thanks!
PraxisOG@reddit
Thanks! It's interesting that it scales better with higher concurrency. I briefly considered throwing together like 16 RX 580s to run in groups of 4, and I'm really glad I went a different route instead.
MatlowAI@reddit
Can you do the same for 2x 5090 vs RTX 6000 Pro for this model, and then Qwen3-Next-80B-A3B-Instruct-AWQ-4bit?
s2k4ever@reddit
What is this rig called?
commonsasquatch@reddit
What rig is that?
Puzzleheaded_Tip7043@reddit
so you're the one taking all the ram!
NeverLookBothWays@reddit
Nice space heater!
monoidconcat@reddit (OP)
Entropy to intelligence machine
0point01@reddit
I don't get it. The way I understand it, you only create more entropy, no? Everything else is just a side effect
monoidconcat@reddit (OP)
I should have said negative entropy to intelligence machine haha
Asleep-Ingenuity-481@reddit
Qwen3 0.6b. Might be a bit too big but idk.
koflerdavid@reddit
Unsloth made some Q1S quants, they might fit.
monoidconcat@reddit (OP)
Still too large for my machine!
Tiny-Criticism-86@reddit
This bro's got the GDP of a small African country on his server rack
Interesting-Tip-2712@reddit
PrimeIntellect/INTELLECT-3
moistiest_dangles@reddit
Would you be interested in helping to open source some medical models?
Wubbywub@reddit
minecraft
xzpyth@reddit
https://i.redd.it/dx1sdxj8z04g1.gif
random-tomato@reddit
Woah first time I've seen this in GIF/Video form haha
Sudden-Lingonberry-8@reddit
it was ai generated lmao
xzpyth@reddit
It's because I've made it with grok imagine 😅
joelW777@reddit
GLM4.6 exl3 4 bpw would be interesting with TabbyAPI
ClearApartment2627@reddit
Great idea, very much appreciated.
I would be interested in Qwen 235b 2507-thinking as an AWQ quant:
https://huggingface.co/AIDXteam/Qwen3-235B-A22B-Thinking-2507-AWQ
danny_094@reddit
What exactly is your goal with this setup?
Important to understand: you cannot pool consumer GPUs into one big VRAM pool.
Reason:
NVLink on RTX practically only means:
"fast asset/texture sharing for software that explicitly supports it."
- Games? No
- LLM frameworks? Definitely not
So:
You can't simply add up VRAM to load one big model.
Each card can host one model at a time (or its own jobs).
If VRAM really were additive,
then everyone buying an H100 with 94 GB of VRAM would be completely crazy,
since you could bolt together 4 consumer cards much more cheaply. 😄
monoidconcat@reddit (OP)
First of all, neither the RTX Pro 6000 nor the RTX 5090 supports NVLink anyway. I bought these two kinds of cards for specific purposes: the 6000 for inference, and the 5090s for training. When I need to utilize them all at once, I can just use pipeline parallelism 🙃
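In vllm terms, that combination could be sketched roughly like this; the split sizes are illustrative assumptions, not OP's actual command, and the model ID is just the AWQ quant mentioned earlier in the thread:

```shell
# Combine pipeline and tensor parallelism in vLLM: 2 pipeline stages,
# each stage tensor-parallel across 2 GPUs (4 GPUs total).
vllm serve cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2
```

Pipeline parallelism tolerates mismatched cards better than tensor parallelism, since each stage only hands activations to the next over PCIe rather than syncing every layer.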
kearm@reddit
As someone who trains on two RTX 6000 Pros, why the 5090s for training?
tat_tvam_asshole@reddit
have you utilized the recent DMA unlocks for 5090s?
danny_094@reddit
Pipeline parallelism? You split a model into BLOCKS and push data piece by piece through multiple GPUs? You wait on the next GPU for every token. Every GPU has to wait for the previous one = bottleneck = extremely slow, GIGANTIC communication overhead. Consumer PCs have no NVSwitch, PCIe is far too slow, and no mainstream frameworks support this stably. Why do you put yourself through this? Every GPU waits for the next one and it just keeps getting slower. Do you even manage 1 token per second? Seriously, I'm curious :D
egnegn1@reddit
You can also split models across multiple cards. But where you're right is that the performance doesn't increase much, because communication over PCIe is too slow.
There are plenty of videos about this on YouTube.
danny_094@reddit
Of course it works in theory.
But only to a very limited extent, and we have to distinguish between "works technically" and "works usefully". Purely technically you can also start DeepSeek-600B on an RTX 2080 + 800 GB of RAM. Yes, it starts. Yes, it works. The only question is: how well does it work? 0.0001 tokens per second? 1 token per minute? 1 token per hour? "It runs" doesn't automatically mean "usable", and that's exactly the difference many people overlook.
egnegn1@reddit
That's not true, as you can see from many examples. For example, there are many setups with 4x RTX 3090 or 4x Apple Studio M4 Max which can run a large model at usable t/s.
Of course, it always depends on your personal requirements. The higher they are, the more you have to pay. Most of those discussing here are private users, so you can take it a little more loosely.
danny_094@reddit
I take it easy too. But don't you think people should be careful not to spend money on false expectations? I don't begrudge anyone a GPU rig with, say, 10 RTX cards, but false claims spread false expectations.
Let's assume 4x RTX 3090:
That's 96 GB VRAM for 5,600 euros.
An Nvidia H100 with 94 GB costs 34,000 euros.
If I can achieve almost the same with 4x RTX 3090 as with an Nvidia H100,
at 6x lower purchase cost, why would I, as a company, buy an H100?
egnegn1@reddit
Please take a look at this post: https://www.reddit.com/r/LocalLLaMA/s/B99oXseNNi
MitsotakiShogun@reddit
Does Reddit auto-translate for you?
It's the second or third time I see someone replying in German to a post in English, and I simply don't get what it is with the German replies on an English sub/post. Original posts in German (or other native languages) I get, replying to someone in another language I do not. And I also haven't seen any stray Spaniards / Namibians / Mongolians / ... land here, just Germans D:
NemesisCrow@reddit
For me reddit does this in the web browser. So annoying. If I open any reddit link a "?lang=de" or something like this gets added as an url suffix and I constantly have to manually remove it. It's unusable and reddit does seem to override my personal settings. Give me the option, but don't force something like this on me. Additionally it just confuses people, as seen in this thread.
danny_094@reddit
Reddit translates automatically, yes.
MitsotakiShogun@reddit
Cool, are you using an app or web? I don't see the setting on web.
danny_094@reddit
On the smartphone it's right at the top of the screen, in the Reddit app. And in the browser on the MacBook the preferred language is selected automatically.
MitsotakiShogun@reddit
I see it now, thanks!
chub0ka@reddit
I have 8x3090 with pairs on NVLink; EP8/TP2 works well for me, >4x speedup
kearm@reddit
Wait, as a previous owner of an 8x3090 rig and now a 4x3090 Ti plus 2x RTX 6000 Pro rig: why the dual 5090s?
jacek2023@reddit
don't know if you have the diskspace for it, but I would like to see Grok 2 benchmark on that
monoidconcat@reddit (OP)
Okay, this one is quite massive. I can download it, but running it will require intense CPU offloading
jacek2023@reddit
why? how much VRAM can you have max now?
monoidconcat@reddit (OP)
At max, 96 + 32 + 32 + 24 + 24, so 208 GB, but vllm distributed inference requires an AWQ quant, which I cannot find on Hugging Face. Grok 2 at a GGUF 1-bit quant is like 89 GB, which may fit on the Pro 6000. Maybe I can try that one
zynbobguey@reddit
Could've got two DGX Sparks instead of this mess
monoidconcat@reddit (OP)
I mean yeah, they'd look much cleaner on my desk, but 2x 5090 had roughly equivalent tok/s to 2x DGX Spark on inference. Also I am mainly trying to use these for fine-tuning, so I would choose GPUs over the DGX Spark.
twnznz@reddit
this is like being the F18 and asking tower for a ground speed check while the SR-71 is above you
byteleaf@reddit
This is just insane.
natufian@reddit
😮
...this fucking guy 😅!
jacek2023@reddit
Ah I was hoping you could use llama.cpp
monoidconcat@reddit (OP)
Grok 2 IQ1_S quant, GGUF, with llama-bench, ran on RTX Pro 6000 only
Single stream: 52 tok/s
Concurrency=32: 303 tok/s
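A llama-bench invocation along those lines could look like this sketch. The flags are standard llama.cpp options, but the GGUF file name and token counts are illustrative assumptions, not OP's exact command:

```shell
# Benchmark a GGUF quant on a single GPU with llama.cpp's llama-bench.
# -ngl 99 offloads all layers to the GPU; -p/-n set prompt and
# generation token counts for the test.
./llama-bench -m grok-2-IQ1_S.gguf -ngl 99 -p 512 -n 128
```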
jacek2023@reddit
I tried Q3 on my 3x3090 and I think I got 2 or 3t/s
Opteron67@reddit
Currently only 2x3090 and 2x5090 for me. How do you share the load evenly? What vllm command line? What PSU setup?
aeroumbria@reddit
The stack of 12VHPWR connectors sending shiver down my spine... How do you make sure such a system is safe to run even unattended?
koflerdavid@reddit
He underclocks them a bit.
iMrParker@reddit
It's mostly a non-issue if they are attached properly
Lissanro@reddit
I wonder what performance you would get with the Q4_X quant of K2 Thinking if you can fit it https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF (given 512 GB RAM + 32×2+24×2+96=208GB VRAM, 720 GB in total, so it might fit if the layers are allocated to GPUs correctly). Or in case of issues, the lightweight IQ3_K quant may work. It would be especially interesting to see performance using ik_llama.cpp (I shared details here on how to build and set it up): it has greatly improved prompt processing speed, and I am curious how much difference the RTX PRO 6000 would make.
I have a rig with 4x3090 and 1 TB of DDR4 3200MHz RAM with an EPYC 7763, and this makes me even more interested in whether upgrading GPUs would make a noticeable difference. Currently with the Q4_X quant of K2 Thinking I get around 8 tokens/s and 100-150 tokens/s prompt processing, and in 96 GB VRAM I can fit either 128K context cache at Q8 + 4 full layers, or 256K context at Q8 without full layers (the reduction in performance is very small, around 5% maybe), in both cases with common expert tensors in VRAM. My guess is the RTX PRO 6000 may give a massive boost to prompt processing speed, likely over 300 tokens/s.
realpm_net@reddit
Speaking as a guy with one card in a box, can you explain this hardware rig? How do the cards connect to the computer? How are the cards powered?
monoidconcat@reddit (OP)
I use two 1600w power supply units, and also power limit the cards to just make sure everything works safely. I use a mining rig to host threadripper motherboard and all the GPUs. The cards are connected via thermaltake riser cables.
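Power limiting like that is typically done with nvidia-smi. A sketch using the wattages from the post (the GPU indices are assumptions about how the cards enumerate):

```shell
# Cap board power per GPU; settings revert on reboot unless re-applied.
sudo nvidia-smi -i 0 -pl 525   # RTX Pro 6000
sudo nvidia-smi -i 1 -pl 500   # RTX 5090
sudo nvidia-smi -i 2 -pl 500   # RTX 5090
```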
jdchmiel@reddit
Can you detail the electrical PCIe connection to each card? I'm learning the hard way: a lot of motherboards have x16 slots but are not electrically x16.
pmttyji@reddit
Could you please check CPU-only performance since you have bulk RAM? Please proceed with these medium size models. Ignore MOE list if you don't have time.
Dense models
MOE models
Add Qwen3-Next too. Thanks.
theblackcat99@reddit
Reminder
monoidconcat@reddit (OP)
Perfect, will do that. So entirely on CPU without using the GPU, right? Not just offloading?
pmttyji@reddit
Right. I use CPU version of llama.cpp setup for this.
One additional request on your benchmarks. This is to see how much t/s differs between those contexts.
Thanks again.
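A CPU-only llama.cpp setup for such a test could be sketched like this; the build flags are standard llama.cpp options, and the model file name and thread count (tuned to the 3975WX's 32 cores) are illustrative assumptions:

```shell
# Build llama.cpp without CUDA so inference stays entirely on the CPU,
# then benchmark with -ngl 0 to keep all layers off any GPU.
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j
./build/bin/llama-bench -m model-Q8_0.gguf -ngl 0 -t 32
```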
monoidconcat@reddit (OP)
Do you have any specific quant in mind? BF16? FP8?
pmttyji@reddit
Q8 please as I use llama.cpp(GGUF). Anything equivalent to Q8 is fine.
thisoilguy@reddit
Nice heater
paul_tu@reddit
Good luck with Kimi k2
burntheheretic@reddit
Which open frame is that?
I'm trying to do something similar, but on the cheap frame I purchased on AliExpress, the GPU support bar interferes with the end of the PCIe connector, so I'm not super comfortable plugging my GPUs in
InterstellarReddit@reddit
Can you run the model from weird science in the 80s? That’s the one I’m trying to get to
Agreeable-Market-692@reddit
This might be too big without clever tricks. If you can get any backend to load this on your hardware with at least a 30,000-token context window, that will satisfy my curiosity.
https://huggingface.co/meituan-longcat/LongCat-Flash-Thinking
Agreeable-Market-692@reddit
And if you need to use a quant that's OK!
egnegn1@reddit
Great idea!
Please take a look at this comparison test for inspiration:
https://www.reddit.com/r/LocalLLaMA/s/B99oXseNNi
https://www.cloudrift.ai/blog/benchmarking-rtx6000-vs-datacenter-gpus
alex_bit_@reddit
How much faster is GLM 4.5 Air on the RTX Pro 6000 Blackwell compared to your previous setup with four RTX 3090s?
Is there any noticeable difference in real world use?
Successful-Willow-72@reddit
Man, sick setup. I wish for dual 3090s but can't afford it atm. What a beast you got there
Former-Tangerine-723@reddit
Glm 4.5 air Q4_K_M 🥳
segmond@reddit
glm4.6-Q4
kimi-k2-Q3/Q4
deepseekv3Terminus - Q4
CarelessOrdinary5480@reddit
Qwen 7b to enrich paragraphs.
TomatilloPutrid3939@reddit
GLM-4.6
sleepingsysadmin@reddit
How do you get enough pcie lanes on this setup? Are they all running at 4x?
Qwen3 80b next
Magistral 2509
gpt 120b
Seed-OSS-36B-Instruct
monoidconcat@reddit (OP)
All 16x. Threadripper pro provides 128 pcie lanes.
favicocool@reddit
All via PCIe 4.0 risers? No MCIO/SlimSAS/etc?
Humble-Pick7172@reddit
Is comfyUI allowed? I wanna see the speed difference between 4x 3090, 5090 and 6000 in Wan 2.2 T2V + Native upscale
Own-Junket6393@reddit
Is it possible to achieve unified memory without NVLink? Correct me if I'm wrong; I started researching building a custom AI station to run large LLM models on a desktop. I read that only older cards such as the RTX 3090 or Titan cards support NVLink, and that no later cards up to the latest support it, which is a bottleneck for achieving unified memory. I read the DGX Spark solves this problem: you can connect two Sparks and get 256 GB of unified memory.
luwuke@reddit
Curious about Ultravox (both the llama 70b and GLM ones!)
CV514@reddit
https://huggingface.co/SicariusSicariiStuff/Sweet_Dreams_12B
If you don't mind, do a proper high-context coherency test. Your hardware should be pretty snappy with it. And reply with the tok/s as well, I've got 6.
EnthusiasmPurple85@reddit
run models!
amitbahree@reddit
I am curious to know more on the motherboard and which risers do you use.
chub0ka@reddit
Kimi deepseek minimax gptoss glm4.6
DuckyBlender@reddit
Insane amount of compute! I shall list some models: gpt-oss-120b, Qwen3 Next 80b, GLM-4.6 or GLM-4.5-Air
tat_tvam_asshole@reddit
?s for you
what's your choice of thermal interface material?
what's your power set up like?
any concerns running vertical for extended periods like that?
AmbitiousOnion7327@reddit
run openai-vpt
madsheepPL@reddit
please run models
Pixer---@reddit
Minimax m2 4/8bit
Educational-Sun-1447@reddit
Hi,
If you can, run Qwen3-VL 32B at a 4-bit quant on 1 RTX 5090. I want to see the inference speed on this. I plan to buy a 5090 to run this model but am not sure if the tok/s is good or not. If you can max out the context length, that would be helpful too.
kev_11_1@reddit
Gpt oss 120b
Willing_Landscape_61@reddit
Not sure how to bench that, but I would be interested to see if/how much Blackwell improves Q4 with a native type compared to the flops difference with the 3090. I guess pp speed of the right kind of Q4 on the 3090 vs the RTX Pro Blackwell could tell us. Does anybody know what kind of Q4 gets a pp speed boost on Blackwell?
monoidconcat@reddit (OP)
NVFP4 inference is natively supported on Blackwell consumer cards afaik. I can try running a single model with an NVFP4 quant on both cards.
fragment_me@reddit
What's your favorite model thus far and for what use case?
monoidconcat@reddit (OP)
I literally just plugged in the rtx 6000, but I enjoyed running glm 4.5 air on my previous 4x 3090 setup. The best model for personal use case.
Zerowind88@reddit
Qwen 3 next
Anyusername7294@reddit
Qwen 3 coder (the big one)
egnegn1@reddit
Nobody claims that you can achieve the same result. But by loading into VRAM you can at least achieve a significantly better result than if you only run the model in CPU memory. Some people have already tried this with a 64-core EPYC processor and lots of memory. From memory, that was also worse than the rig with 4x RTX 3090.
Ultimately it is a compromise between performance and cost. Everyone has to find their sweet spot. For many, a computer with an AMD Strix Halo AI Max+ 395, an Nvidia DGX Spark, or an Apple Mac Studio M4 Max may be a good compromise, with up to 96 GB of VRAM.
garlopf@reddit
One Claudia Schiffer plz
Orolol@reddit
Claude Opus 4.5
monoidconcat@reddit (OP)
It will take a few years as I need to buy additional 63 rtx 6000s
UniqueAttourney@reddit
Is this pewds doing it all over again ?
BenniB99@reddit
onetimeiateaburrito@reddit
Exactly what I was thinking, after being upset about not being able to afford my pleb 2x 5070 Ti setup, which I think is actually achievable for me financially. Lmao