Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results
Posted by Visual_Synthesizer@reddit | LocalLLaMA | View on Reddit | 253 comments
I've been optimizing a 2-GPU inference server for the past week and wanted to share the results. Full data is public with raw JSONs, launch commands, and methodology.
**Hardware:**
- 2x RTX PRO 6000 Blackwell (96GB GDDR7 each)
- EPYC 4564P
- 128GB DDR5 ECC
- c-payne PM50100 Gen5 PCIe switch
- AsRock Rack B650D4U server board
**Results (C=1, single-user decode, tok/s):**
| Model | tok/s | Engine | Config |
|---|---|---|---|
| Qwen3.5-122B NVFP4 | 198 | SGLang b12x+NEXTN | modelopt_fp4, speculative decode |
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
| MiniMax M2.5 NVFP4 | 148 | vLLM b12x Docker | modelopt_fp4 |
| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |
**Before you ask:**
*"198 tok/s on 122B? No way."*
3-run verified: 197, 200, 198. Also confirmed with curl: 2000 tokens in 12.7s. Raw JSONs linked below.
*"That's just ctx=0 cherry-picking."*
Tested context scaling today at C=1: 4K=1.8s, 16K=2.3s, 57K=7.1s, 150K=23.3s TTFT. No crashes at any length. Decode speed stays \~198 regardless of context — TTFT increases, decode doesn't.
*"85% VRAM utilization leaves no headroom."*
VRAM breakdown per GPU from server logs: weights 39.75GB + KV cache 13.9GB + Mamba state 26.4GB + 13.5GB free. KV budget is 2.4M tokens — model only supports 131K max context. Headroom is fine.
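If you want to sanity-check a breakdown like this yourself, the arithmetic is quick. These are the numbers from the post; the per-token cache figure is derived from them, not separately measured:

```python
# Per-GPU VRAM budget from the server logs above (96 GB cards).
weights_gb = 39.75
kv_cache_gb = 13.9
mamba_state_gb = 26.4
free_gb = 13.5

used_gb = weights_gb + kv_cache_gb + mamba_state_gb
total_gb = used_gb + free_gb
print(f"accounted for: {total_gb:.2f} GB of 96 GB")  # remainder is runtime overhead

# Derived: cache bytes per token, given the quoted 2.4M-token KV budget.
kv_bytes_per_token = kv_cache_gb * 1e9 / 2.4e6
print(f"~{kv_bytes_per_token / 1024:.1f} KiB of cache per token")
```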
*"Why not just buy a Threadripper?"*
I have one too. This build is 18% faster (198 vs 168 tok/s) because the PCIe switch routes P2P through silicon at sub-microsecond latency instead of through the CPU root complex. For MoE TP decode, every forward pass blocks on dozens of small allreduces. The messages are tiny (10B active params), so bandwidth doesn't matter. Latency per sync does. PIX topology wins on latency, not bandwidth.
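To make the latency-vs-bandwidth point concrete, here's a toy cost model. Every constant in it (sync count, message size, latencies, bandwidth) is an illustrative placeholder, not a measurement from this build:

```python
# Toy model: comms time per decode step = n_syncs * latency + bytes / bandwidth.
# All constants are illustrative placeholders, not measurements from this build.
def comms_us_per_step(n_syncs: int, msg_bytes: int,
                      latency_us: float, bandwidth_gb_s: float) -> float:
    """Allreduce overhead per decode step, in microseconds."""
    transfer_us = n_syncs * msg_bytes / (bandwidth_gb_s * 1e3)  # GB/s -> bytes/us
    return n_syncs * latency_us + transfer_us

MSG = 16 * 1024  # ~16 KB per allreduce for one fp16 hidden state (order of magnitude)
for name, lat_us in [("switch (PIX)", 0.7), ("CPU root complex (PHB)", 2.5)]:
    t = comms_us_per_step(n_syncs=60, msg_bytes=MSG, latency_us=lat_us,
                          bandwidth_gb_s=50)
    print(f"{name}: {t:.0f} us/step")
```

With those placeholder numbers the latency term is already about double the transfer term through the switch, and it dominates completely through the root complex. That's the effect the PIX topology is buying.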
**The secret sauce:**
- PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not CPU
- SGLang with b12x MoE kernels — 26% faster than FlashInfer CUTLASS
- NEXTN speculative decoding — +65% over no speculation
- PCIe oneshot allreduce + fusion — optimized multi-GPU communication
- modelopt_fp4 checkpoint (txn545) — required for b12x kernels; compressed-tensors checkpoints don't work with b12x
- Performance governor + `pci=noacs` + `uvm_disable_hmm=1` — without these, P2P hangs and GPUs wedge
**All data is public:**
- Results & methodology:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/results.md](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/results.md)
- Raw benchmark JSONs:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/inference-throughput](https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput)
- 3-run verification data:
[run1](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run1.json),
[run2](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run2.json),
[run3](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run3.json)
Happy to answer questions. If you think the numbers are wrong, the launch commands are in the repo — reproduce it yourself.
EuphoricAnimator@reddit
That's a seriously impressive speed on the Blackwells, and good catch on the corrections; details matter a lot with these builds. I've been running models locally on a Mac Studio M4 Max with 128GB of RAM for a while now, and it's a different world from even a year ago. People often underestimate how much RAM really helps, especially when swapping between models.
I mostly play with Qwen 3.5, Gemma 4, and a bunch of stuff through Ollama. I can usually get Qwen 3.5-7B running comfortably at around 60-70 tokens per second with a decent context window. The 14B version is still usable, but it slows down to maybe 30-40. Going much higher than that gets pretty painful, even with quantization. It's not just about VRAM either, the unified memory architecture on the Apple Silicon is a big part of what keeps things from grinding to a halt.
One thing I've noticed is the overhead when switching models isn't zero. It takes a bit for a model to load and initialize, so constantly flipping between different 7B models isn't ideal. I tend to pick one and stick with it for a longer session. Also, be careful about system memory usage. Even with 128GB, a really large context window can cause issues, and macOS will start swapping to disk, killing performance.
It’s cool to see builds like this pushing the limits on the PC side. It gives us Mac users a benchmark to compare against and shows what's possible when you have dedicated GPUs. I'm curious if Metal acceleration will continue to improve enough to close the gap, but for now, the Blackwells definitely have an edge.
eliko613@reddit
Good thread. If you're tackling this in production, this is the pattern that usually works:
1) Start with a weekly top-10 token spend report by endpoint/use-case.
2) A/B routing policies (cheap-first vs quality-first) and compare quality + cost together.
3) Cap max tokens and require explicit override for outliers.
We started evaluating zenllm.io to identify multi-endpoint waste in production and it's been decent so far.
tecneeq@reddit
Wow, what a wealth of information. Thanks!
I have an unoptimized build at work, 2x 6000 Blackwell QMax (with slots for two more). I get 50 t/s for qwen 3.5 27b and 100 t/s for 122b with llama.cpp out of the box.
qwen3.5 doesn't work with speculative decoding for llama.cpp yet. I need to look into your stuff.
AlwaysLateToThaParty@reddit
Realistically, how many people do you think that system could serve in a professional environment?
Visual_Synthesizer@reddit (OP)
TBH im not sure. would depend a lot on the duty cycle and actual use case. are we talking devs with longer context or agentic chat usage? generally speaking 8 concurrent users would be 100 tok/s. 16 would be 65 tok/s at 100% duty cycle.
AlwaysLateToThaParty@reddit
Would that require a more comprehensive cooling solution? I've got an RTX 6000 pro. It gets hot. just wondering if that's something you've considered.
Visual_Synthesizer@reddit (OP)
Pic of the fan rail. this works way better than when i had them in the white case behind the rack
AlwaysLateToThaParty@reddit
My point is that if the room they're in isn't conditioned as well, all it will do is recycle increasingly warm air.
Visual_Synthesizer@reddit (OP)
cooling is definitely important. on my old trx40 in a case with blower cards i had to downclock them from 350w to 275 ish. I tested down-clocking the 6000s to 300w only drops performance a small amount based on my testing (3-5% IIRC). the mining rack does have fans behind the GPUs. i wrote a small script that ramps them up.
couple this with some nice PWM 3000 RPM fans and you have an effective cooling solution. mining racks are pretty standard for AI rigs. you can stack 8x in them and put them on top of each other.
https://github.com/Visual-Synthesizer/asrock-rack-fan-control
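The real script lives in the repo above; as a rough sketch of the idea, with made-up thresholds and the hardware I/O (nvidia-smi reads, BMC writes) left out:

```python
# Sketch of a temperature-driven fan ramp (thresholds invented for illustration;
# the real script reads GPU temps via nvidia-smi and writes fan duty via the BMC).
def fan_duty_percent(gpu_temp_c: float) -> int:
    """Map the hottest GPU temperature to a PWM duty cycle."""
    if gpu_temp_c < 50:
        return 30  # idle floor: keep some airflow moving
    if gpu_temp_c < 70:
        # ramp linearly from 30% at 50C to 80% at 70C
        return int(30 + (gpu_temp_c - 50) / 20 * 50)
    return 100     # full blast above 70C

for t in (45, 60, 75):
    print(f"{t}C -> {fan_duty_percent(t)}%")
```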
PassengerPigeon343@reddit
Please drop a link to the budget RTX PRO 6000s, been waiting for these bad boys to drop into budget GPU range
Visual_Synthesizer@reddit (OP)
central computer and get nvidia inception pricing!
NoahFect@reddit
Wow, they put up a lot of hoops to jump through. How much of a discount do you get on these puppies?
ipcoffeepot@reddit
Interesting! I'm seeing around 100 tok/s on the same cards. I suspect its the wrong kernel (gonna need to try the b12x!) and NCCL. Thanks for posting this!
Visual_Synthesizer@reddit (OP)
good luck!
Do this for 7-9% speed unlock on PCIE direct connect (multiGPU): https://github.com/vllm-project/vllm/pull/39040
this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042
rtx6kpro discord: https://discord.gg/AGxz5eYf
ipcoffeepot@reddit
Amazing. Thank you so much! Im new to this and appreciate the info
FullOf_Bad_Ideas@reddit
tbh this does sound like a buggy reading
i see 206 t/s reading in the repo too.
Visual_Synthesizer@reddit (OP)
did some analysis with my claude test harness:
Good catch — you were right. Those numbers were buggy.
I re-tested properly and wanted to share what actually happened, because it is a useful benchmarking lesson.
**Re-test methodology**
Corrected cold-start TTFT on 2× RTX PRO 6000 + SGLang b12x+NEXTN, 122B. The corrected curve is roughly linear up to around 32K, then super-linear above that as attention's O(n²) term starts to dominate. The shape matches what you would expect from a 122B transformer.
**What went wrong in my original numbers**
Two separate methodology errors stacked on top of each other. (The original numbers are the ones quoted in the post: 4K=1.8s, 16K=2.3s, 57K=7.1s, 150K=23.3s TTFT.)
**1) 4K was too high: 1.8s measured vs 0.67s real**
That measurement was my first request after server startup, so it paid the JIT / cudagraph warmup tax. The re-test showed the same pattern: the first post-startup request is slow, then repeats settle at 0.67s. That is a huge difference. I was partially measuring compile/warmup overhead and calling it prefill.
**2) 57K and 150K were too low: 7.1s → 14.7s real, 23.3s → \~60s extrapolated**
SGLang's radix prefix cache was hitting. My original test sent sequential prompts that shared a common prefix: same base prompt, then extended versions of that same prompt at larger context sizes. So each later measurement was not a true cold prefill; it was mostly measuring the incremental delta on top of already-cached work. That means those numbers were artificially low for true cold-start TTFT.
**The giveaway**
The biggest clue was that in my original numbers, going from 4K to 16K only added about 0.5s.
That would imply a prefill rate of around 24k tok/s for the delta, which is much faster than this rig's actual sustained prefill. That should have been an immediate red flag.
For a real cold 16K prefill, the delta should have been closer to 2s, and that is exactly what the re-test shows.
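Spelling out that red-flag arithmetic (approximating 4K/16K as 4096/16384 tokens, and using the \~2s cold delta from the re-test):

```python
# Implied prefill rate from the delta between two TTFT measurements.
def implied_prefill_rate(tokens_lo, ttft_lo_s, tokens_hi, ttft_hi_s):
    return (tokens_hi - tokens_lo) / (ttft_hi_s - ttft_lo_s)

# Contaminated numbers: 4K=1.8s, 16K=2.3s -> impossibly fast incremental prefill.
print(implied_prefill_rate(4096, 1.8, 16384, 2.3))    # ~24.6k tok/s: red flag
# Cold re-test: 0.67s at 4K plus a ~2s delta implies a plausible sustained rate.
print(implied_prefill_rate(4096, 0.67, 16384, 2.67))  # ~6.1k tok/s
```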
**What stays the same**
Decode speed still holds at \~198 tok/s regardless of context length. That part of the original claim was correct. It was only the specific TTFT values that were contaminated.
Classic LLM benchmarking gotcha: prefix cache + lack of warmup isolation.
Thanks for pushing on it.
FullOf_Bad_Ideas@reddit
I wasn't even pushing on prefill lol, I was just doubting that decoding speed would hold.
so what's the decode speed at 128k/150k? what is "roughly constant"?
Visual_Synthesizer@reddit (OP)
Decode drops about 9% at 128K and 13% at 240K prompt tokens (measured 241 tok/s at 120K and 230 tok/s at 240K with a high-acceptance NEXTN task); call it \~1% per 15K additional context, not flat but not catastrophic either.
FullOf_Bad_Ideas@reddit
so it's 265 t/s at 1 token depth?
that's way higher than 198 t/s claimed earlier.
Visual_Synthesizer@reddit (OP)
yes, NEXTN makes a huge difference.
Visual_Synthesizer@reddit (OP)
thanks! good catch. i updated the post with more testing I did last night.
Frizzy-MacDrizzle@reddit
Budget Builds? Not fancy but runs what I want.
Visual_Synthesizer@reddit (OP)
thats a couple sweet AI systems! Kudos.
Frizzy-MacDrizzle@reddit
Thanks. I’m happy they work. Lot of knowledge learned. No more than 1 k each!
Visual_Synthesizer@reddit (OP)
that knowledge is priceless!
ofan@reddit
So it's the model + speculative decoding speed, not the model speed
Visual_Synthesizer@reddit (OP)
P2P makes a big difference
Do this for 7-9% speed unlock on PCIE direct connect (no switch): https://github.com/vllm-project/vllm/pull/39040
this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042
jiria@reddit
Would appreciate your advice on the following: currently I've got my 2x RTX PRO 6000 Blackwell Max-Q running on a very very budget host: ASUS ProArt X870E Creator WiFi + Ryzen 5 9600X CPU + Teamgroup 32GB (2x16) CL30 DDR5 RAM. The motherboard does x8/x8 PCIe 5.0 bifurcation (`nvidia-smi topo -m` returns PHB). I'm trying hard to avoid buying DDR5 RAM at current prices. I only use this server for inference and can load models just fine as long as the safetensors are split (after models sit in VRAM, vllm/sglang RAM usage stays at a constant 10GB). What I would like to understand is: if I buy a c-payne PM50100 Gen5 PCIe switch (and use it at full x16 with my mobo), do I get (most of) the benefits you describe above, or would there be something preventing me from achieving those speeds in my particular setup? By the way, your github repo is a great resource for people like myself, keep it going, and if you'd like me to contribute some measurements please don't hesitate to ask.
Visual_Synthesizer@reddit (OP)
Thanks, but I just added trx40 and B650 2x GPU test info to my fork. The repo is maintained by others.
you probably dont need a c-payne unless you want more GPUs or are locked out of P2P by your chipset. im not totally sure without testing the topology myself. two x8 links will mostly only slow down model loading, maybe a bit of prompt processing/prefill. you could combine them into one x16 and run a switch like i am. my direct-connect gen4 trx40 was only 10% slower. you dont need more ram for inference.
Do this for 7-9% speed unlock on PCIE direct connect: https://github.com/vllm-project/vllm/pull/39040
this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042
rtx6kpro discord: https://discord.gg/AGxz5eYf
jiria@reddit
Thanks so much, your info is gold and confirms my (limited) understanding. I ran some tests after posting my comment and I'm getting 0.43-0.50us latency and ~52GB/s bidirectional (P2P enabled). My Qwen3.5-27B benchmarks at ~130tok/s (tp=2, long prompts). Good to know that the upgrade path to 4x RTX 6000 in my case is via a PCIe switch on this mobo rather than having to buy a whole set of workstation hardware. In my brief usage of my setup, whenever I run NVFP4 models I find that quality of the answers to be perceptively worse than the FP16 versions of the same models, that's why I'm still using Qwen3.5-27B. Is that your experience as well, or perhaps I misconfigured something for NVFP4? Thanks for the discord invite, I intend to join when I'm back from holidays.
Visual_Synthesizer@reddit (OP)
nice! yeah save that money for more GPUs! Higher precision will always be better, but I am surprised you notice it that much. Generally, dense models perform a lot better than MoE. Perhaps that's the quality you are noticing? Have you seen much difference with the 27B at FP4?
fragment_me@reddit
20k just to run 122b what did you blow the budget on some hookers and blow?
Visual_Synthesizer@reddit (OP)
it also runs M2.5 and Qwen 3.5 397b :-)
Also can scale to 5 GPUs on this board.
fragment_me@reddit
It’s a solid build I’m just jelly.
laterbreh@reddit
You're not running 397B on 2 cards unless you're in sub-Q4 territory with llama.cpp, which is a waste of two RTX Pros.
FullOf_Bad_Ideas@reddit
you can run ~3.5bpw exl3 Qwen 3.5 397B quant with full 262k ctx. That's roughly the performance of IQ_4XS.
Hector_Rvkp@reddit
His post reads "| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |" File is 179gb as per HF https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF. Limited context window for sure, but the math does say Q3 XL fits.
sautdepage@reddit
Actually it does fit the full 256K context at F16 + vision using ik_llama graph split mode. The Qwen3.5 architecture is awesome.
Very usable, but quite a bit slower (especially prompt processing) than models that can run in vllm/sglang.
--Rotten-By-Design--@reddit
You lost me at budget build...
FatheredPuma81@reddit
Man probably has a "Budget private jet" too.
Visual_Synthesizer@reddit (OP)
it flies 18% faster than the expensive one though
FatheredPuma81@reddit
Sorry man, it's just a combination of factors. Qwen3.5 122B is a model that those with high-end gaming PCs can run at 20-30 t/s. So you're comparing a "budget $3,000 Qwen3.5 122B build" to a "budget $20,000+ Qwen3.5 122B build", which, uh, yeah, you're not winning that one.
Should have titled it "Budget Qwen3.5 397B build" and switched to llama.cpp or SGLang and did Expert Offloading.
Visual_Synthesizer@reddit (OP)
20-30 tok/s on a gaming PC is like saying you can tow a boat with a Honda Civic. Technically true. Not the same experience. The post is about optimized throughput on purpose-built hardware, not "can it run." And I did post the 397B result too: 79 tok/s on 2 GPUs. Show me the $3K gaming PC running the full 397B at any speed.
RoutineSundaes@reddit
I don't think you understand how real users are putting these things to use.
FatheredPuma81@reddit
No the post is about 2 RTX 6000 Pro Blackwells in a build being considered a "Budget build" :). Just read the title of the post you made.
nihnuhname@reddit
Budget-friendly for business, not for hobbies.
AlwaysLateToThaParty@reddit
That was my take, yeah.
Far-Low-4705@reddit
honestly thats not that difficult.
the 400b model only has 17b active params, so as long as you have a higher-end GPU to run the computation-bound layers, and offload the memory-heavy but compute-efficient experts to the CPU, you'd only need like 200GB of RAM, which before the RAM crisis wasn't unheard of for less than $3k.
i have an AI server with 64GB VRAM + 16GB RAM that i built for $80 net cost (scavenged old parts, collected them, got some free parts, sold them, upgraded, bought only extreme budget deals).
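A back-of-envelope way to see why this works, and what it costs: decode is memory-bandwidth bound, so time per token is roughly bytes touched per tier divided by that tier's bandwidth. All numbers below are illustrative guesses, not measurements:

```python
# Toy decode-speed model for MoE CPU offload. Decode is memory-bandwidth bound,
# so seconds per token ~= (GB read from VRAM)/VRAM_bw + (GB read from RAM)/RAM_bw.
# Bandwidths and byte counts below are rough illustrative guesses.
def moe_decode_tok_s(gpu_gb_per_tok: float, cpu_gb_per_tok: float,
                     gpu_bw: float = 900.0, cpu_bw: float = 60.0) -> float:
    return 1.0 / (gpu_gb_per_tok / gpu_bw + cpu_gb_per_tok / cpu_bw)

# ~17B active params at ~4.5 bits -> call it ~9.5 GB touched per token.
print(moe_decode_tok_s(9.5, 0.0))  # everything resident in VRAM
print(moe_decode_tok_s(2.5, 7.0))  # attention on GPU, most experts in system RAM
```

The point isn't the absolute numbers; it's that the RAM term dominates as soon as a few GB per token come from system memory, which is why "fully in VRAM" is worth chasing.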
DeepOrangeSky@reddit
Sorry for the beginner question (and also, not sure if this was a typo or not) but, isn't there some rule of thumb where the minimum amount of system RAM you can have has to be at least the same amount (or more) as the amount of VRAM you have, in order for it to work? Or are there ways around that?
Also, as far as how MoEs work in regards to GPU + a bunch of system RAM setups (as opposed to, say, a mac), is the basic idea that as long as you have enough VRAM to fit all the active parameter size of the MoE model + context into VRAM, you lose almost no speed compared to fitting the whole entire MoE model into VRAM?
Like, let's say someone has an RTX 3090 with 24GB of VRAM, enough to fit the 10 billion active parameters of MiniMax 230b a10b (or even the 17 billion active of Qwen 397b a17b, as you mentioned) plus a decent amount of context. Would that run at close to the speed of a setup that fits the entire model, total parameters and all, into VRAM (say, a dozen 3090s)? In other words, when only the active parameters + context live in VRAM and the rest of the MoE sits in system RAM, does it run at like 90% of full-fit speed, or 50%, or 10%?
I realize there are all sorts of complicating factors (single vs. multi-card, ROCm vs. Vulkan, and so on), but I'm just curious at a rough ballpark level. I'm fairly new to this, so I don't have even a rough idea. I assume it still goes pretty fast with just the active params + context in VRAM, but is that like 30% of full-fit speed, or 80%?
Thanks-Suitable@reddit
I personally really value the perspective. Yes, it's not consumer-grade hardware, but everything that doesn't sit in a datacentre is a win in my book. Great job on the post!
_bones__@reddit
The word "budget" in this context means "restricted budget". Not "as much budget as one can spend".
psychicsword@reddit
That is a value concept, not a budget one.
Visual_Synthesizer@reddit (OP)
r/LocalLLaMA: "we need cheaper local inference" me: here's how to save $10K and go faster on any platform
r/LocalLLaMA: "too expensive"
Visual_Synthesizer@reddit (OP)
its 18% faster and roughly 10k cheaper than a threadripper on the exact same stack! this can trickle down to gen4 parts also! gen4 plx switches are really cheap ;-)
--Rotten-By-Design--@reddit
I get that, and it is a very nice build, but there is just nothing "budget" about it, unless we are talking a Nasa budget, or a yearly rent budget perhaps
Visual_Synthesizer@reddit (OP)
Think of it as a blueprint. i could do this with 3090s hacked to 48gb or 4090s hacked to 48gb and make a really nice inference engine
triptickon@reddit
If you taught people to do this with 3090s you could sell a course 😂
Visual_Synthesizer@reddit (OP)
Love the 3090s. My first AI server I built in 2020 had them. FYI, my old TRX40 system could scale to 8x 3090s GPUs with P2P low latency. The latency is what makes a ton of difference for TG. Scratch out threadripper pro and insert any CPU into this chart, and swap out the PM50100 for Gen4 PLX parts :-)
tat_tvam_asshole@reddit
Correct me if I'm wrong, but I'm not sure just 'any' CPU/mobo has enough PCIe lanes to support 8 3090s. And even if you bifurcate PCIe 5.0 slots to fit more cards, you're introducing latency, and I'm not sure it's even spatially possible.
Visual_Synthesizer@reddit (OP)
That's exactly what the PLX switch solves. You don't need 8 x16 slots from the CPU. The switch takes one x16 upstream from the CPU and fans it out to multiple downstream x16 ports. My PM50100 has 2 downstream ports (2 GPUs), and can scale to 5. The 8+ GPU setups in the rtx6kpro community use 2-3 switches, all hanging off a single CPU with limited lanes. It's not bifurcation, it's switching. The latency through the switch is sub-microsecond. That's the whole point of this build.
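The lane math, assuming an x16 upstream link and x16 per GPU on a 100-lane part:

```python
# How many x16 GPUs a 100-lane switch can fan out from a single x16 upstream link.
total_lanes, upstream, per_gpu = 100, 16, 16
print((total_lanes - upstream) // per_gpu)  # -> 5
```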
Hedede@reddit
That works only if your cards support P2P (which rtx6kpros support). 3090s don't support P2P without a hacked driver, they need to go `gpu -> switch -> host -> switch -> gpu`.
Visual_Synthesizer@reddit (OP)
very true, the hacked driver is used on gen4 switches.
Hedede@reddit
No, you need the tinygrad kernel module: https://github.com/tinygrad/open-gpu-kernel-modules
There are also newer drivers: https://github.com/aikitoria/open-gpu-kernel-modules
But with those I'm getting MCEs when 3090s try to communicate with each other
Hedede@reddit
There's no such thing as "3090s hacked to 48gb".
Georgefakelastname@reddit
They exist but they’re custom done and expensive as shit.
Hedede@reddit
I keep hearing that but I haven't seen any proof of their existence. And don't confuse them with 4090s.
banecroft@reddit
They ship the hacked 3090 with custom bios, huge business in China.
Hedede@reddit
Where? Haven't seen a single one.
nihnuhname@reddit
You can watch reviews on YouTube where people unbox, run, and test these GPUs. There are even modified versions of the RTX 4090. The downside of these GPUs is that they run very hot. If they use air cooling, the fans are also modified and roar like a jet engine.
Hedede@reddit
Where??? I'm talking about **3090**.
CryptoUsher@reddit
so you're getting 198 tok/s on 2x RTX PRO 6000 Blackwell, thats a pretty solid setup, how do you think the results would change if you swapped out the EPYC 4564P for something like an AMD 7742 or an intel xeon, would the difference in cpu architecture have a noticeable impact on performance?
CheatCodesOfLife@reddit
Good idea! I just tried swapping to an AMD 7742 and now I'm getting 67 tok/s... Why is that??
CryptoUsher@reddit
so the 7742 is a pretty old cpu at this point, iirc it's zen 2 architecture which is a lot different from the zen 3 in the 4564p, that might be why you're seeing such a huge drop in performance
NoahFect@reddit
Can someone explain why this guy is getting dragged?
ffpeanut15@reddit
Check the top post in the sub
BardlySerious@reddit
What is your name? What is your quest? What is your favorite color?
kwinz@reddit
Give me a solution to the Navier-Stokes equations
ikkiyikki@reddit
Huge difference. I have a 2x 6000 setup on a regular gaming pc w/ a shitty AMD 7950x3d cpu and get \~100tk/s on Qwen3.5 122B (Q5)
CryptoUsher@reddit
yeah the cpu bottleneck is real with those big models, 7950x3d isn't helping much despite the cache. wonder if pcie overhead or memory bandwidth is dragging it down even more on your end
Visual_Synthesizer@reddit (OP)
No difference on inference. CPU sits at \~3% during decode. It's 100% GPU memory-bandwidth bound at C=1. The CPU only matters for FlashInfer JIT compilation and server startup. An older Xeon or EPYC 7742 would give identical tok/s, just slower boot times.
CryptoUsher@reddit
so that's pretty much what i expected, gpu memory bandwidth is a huge bottleneck for these kinds of workloads, tbh i'm surprised the cpu usage is that low even during decode.
FinalCap2680@reddit
It is a quality budget - running q3, q4 quants with two RTX PRO 6000?!?!?!
If that is not a waste....
PS With that budget, OP can probably afford some more RAM. If it was me, I would have 1TB (or more) RAM and run GLM 5.1 Q8.... :)
unjustifiably_angry@reddit
If this is replacing an online model and you have a lot of token usage then it's a budget build within a matter of months.
hesperaux@reddit
Came here to say this
Nick-Sanchez@reddit
Well it's a budget. Humongous budget, but a budget indeed.
--Rotten-By-Design--@reddit
I'll say, if I were to pick up 2 of those cards alone where I live, they would cost 26.5K
Visual_Synthesizer@reddit (OP)
sounds like the real budget hack is moving
Mysterious_Bison_907@reddit
$20,000+ in GPUs alone is not a "budget" build.
Interesting-Town-433@reddit
Are you doing speculative decoding?
Visual_Synthesizer@reddit (OP)
Yes. The 198 tok/s is with NEXTN speculative decoding (5 steps, 6 draft tokens) on SGLang. Without speculation the same setup does \~120 tok/s. The 122B has built-in MTP heads that NEXTN uses as the draft model, so no separate drafter needed. Full launch command with all the flags is in the repo.
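For intuition on where a gain like that comes from, the standard model: with k draft tokens and per-token acceptance probability a, the expected accepted tokens per verify step is a geometric series. The acceptance values below are free parameters, not measurements from this rig:

```python
# Expected accepted tokens per verify step with k draft tokens and per-token
# acceptance probability a: 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
def expected_tokens_per_step(k: int, a: float) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

# Measured net gain here: 198/120 ~= 1.65 accepted tokens per step.
print(round(198 / 120, 2))  # -> 1.65
for a in (0.4, 0.6, 0.8):
    print(a, round(expected_tokens_per_step(6, a), 2))
```

Reading the measured 1.65x back through the series suggests a fairly low effective per-token acceptance once draft/verify overhead is folded in, which is why high-acceptance tasks speed up so much more.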
Interesting-Town-433@reddit
Did you evaluate other speculative decoders?
Visual_Synthesizer@reddit (OP)
yeah a bit buggy and all over the place. will post more results when i get the time
Interesting-Town-433@reddit
Yeah lmk I'm optimizing these things atm so lmk happy to help if you want to test some of my builds, dm me
JayPSec@reddit
Please provide links for the models used.
libbyt91@reddit
20,000+ budget build, lol
Visual_Synthesizer@reddit (OP)
The good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap! Think of it as a blueprint. i could do this with 3090s hacked to 48gb or 4090s hacked to 48gb and make a really nice inference engine
NoahFect@reddit
Very cool. Will it scale to 4x RTX6000 boards, or does the PCIe switch method only support 2x?
You haven't run into problems with DRAM << total VRAM? Seems like the conventional wisdom is that DRAM should be at least 2x VRAM.
Visual_Synthesizer@reddit (OP)
yes! the point is the scaling. the switch has 100 lanes and i think that would support 5 gpus on this board with a single x16 root. for 2 gpus, its probably not worth bothering. but if you ever want to scale above that its ideal.
NoahFect@reddit
Thanks much. I'm getting increasingly interested in putting a 4x setup together.
This unit, right? With one of these and two of these?
What sort of support frame are you using to hold the GPUs?
JayPSec@reddit
correct pci lane switch
for 4xGPUs you'd need double the adapters and cables, plus a host board and 2 more mcio cables to connect your main pcie to the switch. The host board can be a retimer but that may be overkill.
In my case I have everything in case and the retimer host board was not needed. Bought [this](https://c-payne.com/products/mcio-pcie-gen5-host-adapter-x16-passive) instead.
JayPSec@reddit
c-payne?
JayPSec@reddit
I run 5 Max-Qs with the same board on a 9950X and 128GB of DDR5. No issues here.
aabelr@reddit
All data is public
Where are the links and Discord?
Look_0ver_There@reddit
Posting about $30K of equipment and calling it a "Budget Build" in the same breath is certainly something.
Also "Secret Sauce" automatically gives this away as AI-produced drivel.
Visual_Synthesizer@reddit (OP)
im dyslexic so i use ai to help me write. i also use calculators. The good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap! Think of it as a blueprint. i could do this with 3090s hacked to 48gb or 4090s hacked to 48gb and make a really nice inference engine
Zidrewndacht@reddit
Tell us more about 3090s hacked to 48GB.
Intraluminal@reddit
Now THAT you could sell for small businesses
Look_0ver_There@reddit
I'd actually appreciate the human aspect of your natural writing, even if the words were a bit jumbled. At least then it would feel authentic.
I do have a question though. The 198t/s, is that parallel generation, or single user generation? I'm going to guess parallel, right?
a_beautiful_rhind@reddit
Dumb models keep saying this and it's still wrong. In what world will your utilized and active GPU put the link to sleep? Plus you are disabling ASPM for all your other devices with that kernel parameter. Enjoy your pointlessly high idle.
Visual_Synthesizer@reddit (OP)
haha, facts. they never know when they dont know. ill do some tests for fun. thanks!
StopwatchGod@reddit
Budget build... 8-10 grand per GPU...
We are not playing the same game here lol
Visual_Synthesizer@reddit (OP)
i believe in you!
Such_Advantage_6949@reddit
Thanks for sharing, i have a threadripper too. So basically the switch helps make the inter-GPU connection faster than plain PCIe? Did you encounter any issues or difficulty with the setup? Plain PCIe pretty much doesn't need anything extra other than installing the nvidia driver
Visual_Synthesizer@reddit (OP)
updated the post. looks like switch is mostly enabling scaling and unlocking more consumer parts for multi GPU rigs.
smflx@reddit
Thank you so much!!
BTW, how do you feel nvfp4 quality? I had an experience of messy awq of glm-4.6. So, still staying in fp8.
Visual_Synthesizer@reddit (OP)
seems good so far, makes it a lot easier to run large models.
smflx@reddit
Thank you. I will try.
Aware_Photograph_585@reddit
I have 2x RTX PRO6000 on a EPYC 7003 platform (PCIE 4.0) running Ubuntu 22.04, and would like to implement some of this.
Can you explain more about: PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not CPU
If I understand correctly: when using the PLX chip, GPU-to-GPU communication occurs through the PLX chip, not PCIe lanes interconnect on CPU.
1) Does this work with any PLX chip? Or are there certain requirements?
2) I'm assuming the GPUs have to support P2P, which RTX PRO6000 does, but standard consumer GPUs do not.
Can you also explain further this part, I have no idea what it means:
Performance governor + pci=noacs + uvm_disable_hmm=1 — without these, P2P hangs and GPUs wedge
Thanks in advance.
Visual_Synthesizer@reddit (OP)
Your EPYC 7003 + 2x RTX PRO 6000 is a solid starting point.
PCIe switch: You're correct. P2P DMA goes through the switch silicon, CPU is completely bypassed. Every TP decode step does dozens of small allreduces and the GPU blocks on each one. For MoE models the messages are tiny (10B active params). Bandwidth doesn't matter, latency per sync does. Sub-microsecond through a switch vs microseconds through a CPU root complex, hundreds of times per second.
Which switches: Microchip/Microsemi (c-payne PM50100, PM40108) are what you want. Broadcom PEX890xx has a posted-write collapse bug (52 GB/s vs 196 GB/s on Microchip in 8-GPU community tests). For your Gen4 platform, Gen4 Microchip switches show up on eBay in old mining boards for $200-500. Also, rumor has it we will see gen6 switches with 160 lanes this summer.
P2P support: RTX PRO 6000 supports P2P natively. 3090 does too. Consumer cards (4090, 5090) have P2P disabled in driver but community patches exist.
Kernel params (critical, took me days to find):
`pci=noacs` -- disables Access Control Services. Without this, P2P still routes through the CPU even with a switch. Your switch becomes useless.

`uvm_disable_hmm=1` -- add `options nvidia_uvm uvm_disable_hmm=1` to `/etc/modprobe.d/uvm.conf`. Without this, sustained P2P DMA wedges the GPU into ERR! state after a few minutes. Hardest bug to find.

`performance` governor -- `echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. \~5% uplift; CPU stops downclocking between allreduce calls.

Also add `iommu=pt` to kernel params and disable ASPM in BIOS.

Gen4 PLX with Blackwell GPUs: I haven't tested this combination so I can't confirm the latency advantage holds across mixed generations. The theory is sound but I'd want real numbers before claiming it. Also, the Gen5 c-payne PLX is programmable, so it's theoretically possible to configure a Gen4 upstream root that fans downstream to a Gen5 GPU cluster with custom firmware. I considered trying this but ran out of time and moved to a native Gen5 platform. If you experiment with it, I'd love to hear the results.
https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/hardware/topology.md
The ACS and HMM bugs are the ones that'll waste your time if you don't know about them in advance. Happy to help if you run into issues.
Aware_Photograph_585@reddit
Thanks for the info. I'm a bit familiar with simple multi-gpu topology and its effect on speed from playing with training models on FSDP/deepspeed.
Looks pretty easy to setup. I can get a PCIe 4.0 PEX88048 card for $150-200, and it supports DMA P2P. Cheap & easy to setup, just need to swap out my retimer cards for it.
Didn't know the RTX3090 supported P2P natively. I thought you needed the tinygrad drivers. I've got a 2x RTX3090 build I threw together from used parts, and the motherboard has PEX8747 chips. I'll research if I can also make it work with that.
Thanks again. Much appreciated.
Visual_Synthesizer@reddit (OP)
hey! i double checked, the p2p i had was through nvlink on my 3090s. looks like you do need to unlock p2p at the driver level with older cards. blackwell 6000s support it.
i also did more testing. turns out you dont even need a switch to get p2p. from my AI sysadmin:
Good questions. For 2× RTX PRO 6000 on EPYC 7003 Gen4, you probably do not need a PCIe switch at all.
Your board likely already gives you two direct Gen4 x16 slots from the CPU, and for a 2-GPU setup that is effectively the same as, or better than, adding a switch. The switch only becomes necessary if you want to scale to 4+ GPUs later.
Also, small correction to my own framing: I had mentally conflated NVLink with P2P. NVLink is just one interconnect for GPU-to-GPU communication. PCIe can also do P2P, as long as the hardware, topology, driver, and kernel settings allow it.
Here is the breakdown.
1. How PCIe switches work, and whether you need one
You are right that PCIe switches (PLX was one vendor; today it is mostly Microchip/Microsemi and Broadcom) create a private GPU-to-GPU fabric. When GPU 0 writes to GPU 1 through a switch, the data never touches the CPU. The switch routes it directly between downstream ports.
But the key point is this: direct-attach topologies do the same thing, just through the CPU root complex instead of switch silicon.
On a modern server CPU like EPYC 7003 with 128 Gen4 lanes, each GPU slot is a real x16 directly off the CPU, and P2P DMA between two slots on the same CPU works correctly. The CPU is not really "in the way" here. It behaves more like a passive router.
So at exactly 2 GPUs, the performance difference between two direct CPU-attached x16 slots and two slots behind a switch is basically zero, assuming both are configured correctly.
A switch becomes useful when you want 4+ GPUs in one system, because then you start running out of clean direct x16 slots and need fan-out.
Bottom line for your setup: skip the switch, use the board's native slots, and focus on kernel/driver config instead.
2. GPU requirements
You have this right.
RTX PRO 6000 supports hardware P2P natively. NVIDIA enables BAR1 peer-to-peer writes on workstation and datacenter cards.
Consumer cards like 4090 / 5090 have P2P disabled in the driver.
3090 is a weird middle case. It can sort of work through the UVM software path, but generally routes through host memory unless you use patches, so most people treat it as effectively disabled.
So for your setup: both RTX PRO 6000s do P2P natively, no driver patches needed.
3. What the kernel and modprobe settings actually do
CPU performance governor
Linux often defaults to `ondemand` or `powersave`. During LLM inference, the CPU can go briefly idle between allreduce sync points. The governor sees that and downclocks. Then the next sync arrives and the CPU has to ramp back up again. That ramp latency can eat a noticeable chunk of decode throughput.

Setting the governor to `performance` keeps clocks pinned and avoids that behavior. To make it persistent:
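A minimal sketch of one way to do that (assumes systemd; the unit name `cpu-performance.service` is mine, not from the thread):

```shell
# One-shot: pin all cores to the performance governor right now
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Persistent: a tiny systemd unit that re-applies it at every boot
sudo tee /etc/systemd/system/cpu-performance.service <<'EOF'
[Unit]
Description=Pin CPU governor to performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable cpu-performance.service
```

Installing `cpufrequtils` and setting `GOVERNOR=performance` in `/etc/default/cpufrequtils` is an equivalent route on Ubuntu.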
pci=noacs
ACS (Access Control Services) is a PCIe feature mainly meant for virtualization and isolation.
The problem is that ACS can force cross-device traffic back up through the CPU root complex in a way that hurts GPU-to-GPU P2P behavior.
`pci=noacs` disables that behavior and lets the GPUs DMA directly to each other. This is useful on a trusted host where you care about performance more than strict isolation.
On Ubuntu/GRUB, that means adding it to the kernel cmdline and rebooting.
uvm_disable_hmm=1
HMM (Heterogeneous Memory Management) is a newer path in NVIDIA UVM that can migrate pages between system RAM and VRAM.
That is useful in some cases, but under sustained P2P DMA load it can interact badly with the driver and produce the classic failure mode where the system works for a while, then a GPU wedges into
`ERR!` state and falls off the bus. Disabling HMM forces the older, more stable path.
This is one of those bugs that is hard to diagnose if you do not already know about it.
4. The extra thing that was missed originally
For direct-attach topologies like a normal 2-GPU EPYC system, there is one more modprobe setting worth knowing about.
Without it, SGLang's custom allreduce kernel can silently fall back to a much slower path.
What this does is force NVIDIA to use BAR1 P2P writes instead of host-memory staging.
Without it,
`--enable-pcie-oneshot-allreduce` may auto-disable itself for small message sizes and fall back to NCCL. With it, the custom allreduce path stays active and performs much better for the decode-sized allreduces that actually matter.
Important: only do this on direct-attach / NODE systems.
If your topology is PIX / PXB because you are using a real PCIe switch, do not apply it.
Check first: run `nvidia-smi topo -m` and see what sits between the two GPUs.
Rule of thumb: NODE or SYS (direct-attach through the CPU) means apply it; PIX or PXB (behind a real switch) means don't.
Verify after reboot that the module option actually loaded.
5. Complete setup sequence
Assuming Ubuntu 22.04 and starting clean:
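A sketch of what that sequence presumably looks like on Ubuntu/GRUB (stock paths; double-check your existing `GRUB_CMDLINE_LINUX_DEFAULT` before letting sed touch it):

```shell
# 1. Kernel cmdline: disable ACS, set IOMMU passthrough.
#    The sed '&' re-emits the matched prefix, then prepends our flags.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&pci=noacs iommu=pt /' /etc/default/grub
sudo update-grub

# 2. NVIDIA UVM: disable HMM to avoid the ERR!-state wedge under sustained P2P
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
sudo update-initramfs -u

# 3. CPU governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 4. Reboot so the cmdline and module options take effect
sudo reboot
```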
After reboot, verify:
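A sketch of the verification (the `nvidia_uvm` sysfs path assumes a recent driver that exposes its module parameters there):

```shell
# Flags actually applied at boot
grep -o -e 'pci=noacs' -e 'iommu=pt' /proc/cmdline

# UVM module option loaded? Expect: 1 (HMM disabled)
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm

# Governor pinned on every core? Expect a single line: performance
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Link topology between GPUs looks right (PIX/PXB behind a switch, NODE direct)
nvidia-smi topo -m
```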
Also disable ASPM in BIOS for extra stability.
6. How to verify it is actually working
Build and run the CUDA sample:
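Presumably `p2pBandwidthLatencyTest` from NVIDIA's cuda-samples repo (build layout differs between releases; newer tags use CMake at the repo root, older ones ship per-sample Makefiles):

```shell
git clone --depth 1 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest

# Older tags: plain make in the sample dir; newer tags: cmake fallback
make || (mkdir -p build && cd build && cmake .. && make)

# Prints unidirectional/bidirectional bandwidth and latency matrices,
# each measured with P2P disabled and enabled
./p2pBandwidthLatencyTest
```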
You want to see the P2P-enabled bandwidth numbers well above the disabled ones, and enabled latency dropping to low single-digit microseconds on Gen4 x16.
If enabled bandwidth is low and latency is still \~14 µs, something is wrong. Usually that means:
- `uvm_disable_hmm=1` is not loaded

7. Final software-stack tip
Once the kernel side is correct, one of the biggest wins on 2× RTX PRO 6000 is running SGLang with the right container and kernels.
For Qwen3.5-122B NVFP4, people are seeing around 170-200 tok/s at single-user decode on tuned 2-GPU setups, which is substantially better than stock vLLM on the same workload.
So the practical conclusion is: at exactly 2 GPUs, skip the switch, use the board's native x16 slots, get the kernel and driver settings right, and run a tuned SGLang stack on top.
NoahFect@reddit
Really useful to see all this info in one place. Thanks!
fmlitscometothis@reddit
I thought P2P works without needing a dedicated switch?
Ruin-Capable@reddit
LOL $20K in GPUs and it's a "budget" build.
Fit-Statistician8636@reddit
Interesting and kudos :). For one or just two parallel requests, would running 122B on a single RTX PRO 6000 be significantly slower?
Visual_Synthesizer@reddit (OP)
would be slower yes.
ironmatrox@reddit
Interesting. I'm just standing up a dual 6000 pro as well. Will definitely look into your configuration. Thank you!
Visual_Synthesizer@reddit (OP)
updated the post with new info that can save you lots of time
david_0_0@reddit
the pcie switch topology insight is sharp - you're optimizing for tiny allreduce synchronization latency, which matters for moe sparse ops. but does the 18% gain hold with dense models or variable batch sizes? because the latency win disappears if you're padding batches to hide allreduce overhead. also curious if sglang's b12x moe kernels work equally well outside moe - the 26% over flashinfer might be infrastructure-specific.
Visual_Synthesizer@reddit (OP)
updated the post! i was actually wrong. did a bunch of testing. the switch mostly helps with scaling and offers equivalent speeds.
Edzomatic@reddit
A lot of people are commenting on the "budget". However a few years ago the A100 was 30k by itself and had 80gb of vram. I'm glad we can get more than double the vram for about the same price
Visual_Synthesizer@reddit (OP)
yeah! plus gen4 parts are cheap and abundant. switches really help those systems scale to more gpus. for example, i have a trx40. two 16x gen4. could run two switches and 8 GPUs on a 6 year old platform and get really close to gen5 speeds! even old am4 systems with one 16x could run 4 gpus with a cheap Chinese switch.
R_Duncan@reddit
I have available cheaper setup (single rtx 6000, epyc cpu) and interested in how much context and in particular:
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
1) why not NVFP4? should be faster, but in my setup (flash_attn as flashinfer seems to have issues on cuda 13.0) I barely reach 30 t/s at 256k context with vllm (llama.cpp is at about 60)
Visual_Synthesizer@reddit (OP)
updated the post with new testing i did last night. you can enable p2p on your system without a switch! commands in the post.
Dflash has really good MTP. IIRC it's diffusion based. looking forward to testing their 122b and 397b models that are coming out soon.
xgiovio@reddit
2 6000 are 20k euro. Imagine the api calls you can make to qwen 3.6 plus with the same money.
Budget build is 2x 3090 with a system that costs around 1500 euro, running models with ~30b parameters and 3-4b activated: 15-20 GB for the model with 4-bit quants and 20-25 GB for a 100k qkv cache context at full precision.
Bye
Hector_Rvkp@reddit
imagine you're a law firm, or have devs that dont want their code to be studied and stolen by Scam Altman & the likes. Budget is relative. The point of the OP is that this rig costs less than a threadripper build, i think.
xgiovio@reddit
Tr with 64 or 96 core have max bandwidth of 600/800GB/s but you can have TBs of ram. The rtx6000 is like a 5090 with 96gb of ram with a bandwidth of 2000 GB/s.
A tr gives you more ram but less inference speed. You can load large models.
A gpu less ram but more speed on inference. Is a tradeoff os size and speed
Hector_Rvkp@reddit
Didn't realize DDR5 ram could go that fast (bandwidth). LLMs are telling me that 256 GB (8×32 GB) DDR5-5600 ECC RDIMM Registered kit cost over $10k, though, and the CPU costs 7.5+, and the mobo 1.3k, so you're looking at $20k for something that's got 256gb ram, but way, way slower (and no CUDA stack).
Inversely, 2 blackwell 6000 w cheap everything (Ryzen 7, full PCI5 mobo like Gigabyte X870E Aorus Elite AX, little ram) also ends up costing you $20k.
That surely helps explain why RAM prices went up, didn't know you could get 600+ gb/s bandwidth w a CPU + DDR5.
At current prices though, given the speed & compute differences (dramatically better w Nvidia), i dont think it makes sense to run LLMs on DDR5 RAM. The RAM would have to cost way, way less. And i'm also shocked how expensive the threadripper pro mobo / CPU is.
Visual_Synthesizer@reddit (OP)
yeah ram is insane right now. i scored this am5 used, with 128gb ram and CPU plus switch for 5k.
i think its possible to just run 1 ram stick though? if you are optimizing for VRAM you dont need much. if you want to offload with llama.cpp thats another story.
I went for cheapest platform and max vram. would rather have a threadripper wx90 and tons of ram. but it would cost a lot more.
xgiovio@reddit
it depends on the size of the model. tr 7k and 9k use 8 channels and 12 channels, so that's why bandwidth is high. A normal non-pro cpu stalls at 250 gb/s on 4 channels. Consumer on 2 channels stalls at 100gb/s. Now yes, 2x 6000 give you 200gb of space, but you have to allocate space for the model and kv cache for the context. So if you play with models around 100gb it's ok, because you can use another 100gb for big contexts or multi-user serving. If you instead need more context than speed, or very big models, you have to add more gpus for more vram or buy more ram kits on a tr.
The best solution is having a tr with 96core, 2tb of ram and 6x6000. So you can have the best of all. But you have to spend around 100k or more
Infninfn@reddit
[insert snark here]
Visual_Synthesizer@reddit (OP)
Fair point lol. The GPUs are the GPUs. The "budget" part is the platform: $2K for board+CPU+RAM vs $15K+ for Threadripper Pro + DDR5. Same GPUs, 18% faster, $10K less. The PLX switch pays for itself. I scored the B650 used. Good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap!
UnifiedFlow@reddit
Show me where to get 128GB DDR5 ECC and a board and a cpu for $2k and I'll give you a $500 finders fee. That ram ALONE is 3-4k
Visual_Synthesizer@reddit (OP)
i bought it used. should have been in the post. my bad!
EbbNorth7735@reddit
Nothing about your post is budget
NoahFect@reddit
Not everybody is stone cold broke.
Hedede@reddit
You don't need $15K Threadripper Pro. You can buy EPYC 9124 for $200 + SP5 motherboard for $800 and have 128 PCIe lanes.
jleuey@reddit
Point us where
Hedede@reddit
eBay
texasdude11@reddit
Lol exactly.
Clear-Ad-9312@reddit
found something called rackrat, which seems decent, but looks to be only for ebay listings.
Visual_Synthesizer@reddit (OP)
am4 with a plx would smoke the old gen4 server parts and be cheaper
Hector_Rvkp@reddit
Interesting. so the idea is that the switch does the work of getting the GPUs to work together, instead of the CPU/mobo? Do you need a threadripper to run 2 blackwell 6000 properly though? I thought the threadripper was useful primarily because it increased the bandwidth on the DDR5 RAM?
Visual_Synthesizer@reddit (OP)
the idea was to see if i could get a cheap motherboard to scale to 4+GPUs with equal or more performance. could do this with gen4 boards.
Frosty_Chest8025@reddit
nice, didnt know about c-payne PM50100 Gen5 PCIe switch
thought 2x full pcie 16x gen5 slots would be the best possible option but thats not the case then.
Visual_Synthesizer@reddit (OP)
gen 5 does work well if its configured properly. I updated the post with new tests i did last night. the switches are great for scaling beyond 4+ GPUs or using cheaper motherboards to save money on builds
gurilagarden@reddit
So what you're saying is you put an F1 engine in a Kia Rio. What's even the point? Anyone who has the money for 2x 6k's isn't gonna slap them on a raspberry pi.
Visual_Synthesizer@reddit (OP)
having fun and seeing if it would cost me less for equal or more performance.
vishalgoklani@reddit
I’m new to pci switches can you explain. Where did you buy it and how much does it cost. Do you plug the cards directly into the switch ? Does it support the latest cuda drivers? Or are you using Georges hacked cuda driver? Thanks
Visual_Synthesizer@reddit (OP)
they help systems scale. you attach two 8x cables off the host adapter or MB MCIO connector to the switch. it expands your PCIe lanes. my c-payne has 100, others have more or less. it supports cuda. im not using a hacked driver. blackwell 6000 supports p2p. not sure about others, as i had nvlinks in my old system.
CalligrapherFar7833@reddit
You couldve bought a multi pcie lane mb and cpu for the price of that switch
Visual_Synthesizer@reddit (OP)
true! it was a fun experiment though. It unlocks a lot of lower end hardware on all gens. plus, ECC ram is really expensive on gen5 server boards. i guess i could buy just one 32gb stick and a threadripper or epyc might boot. dont really need much ram for inference. but i do a lot of loras and stuff for work so i need a decent amount of ram. came off 256 with the trx40. this setup will scale nice to 5 GPUs for me
smflx@reddit
Ahh, with this setup. Hot air will go from one GPU to another. Temperature will be different :)
Visual_Synthesizer@reddit (OP)
temps have been good so far. mining racks are pretty standard for ai rigs.
OmarBessa@reddit
How much money is all that?
FullOf_Bad_Ideas@reddit
You can run Qwen 3.5 397B on 2x RTX 6000 Pro with EXL3 and TabbyAPI at full 262k ctx and probably good speed, at quality that's probably higher than NVFP4 or GGUF Q3
speeds reported by vllm with speculative decoding enabled are often false, make sure to double check that they match with actual output tokens outside of VLLM
slop, RTX 6000 Pro does not have HBM memory.
exact_constraint@reddit
Nice! Someone using the C-Payne switch. Next big step in my setup is to make the switch (lol) so I can get full bandwidth between cards, hard to find people using them though. Interesting that you went that direction even w/ an Epyc system on two cards. Good to know the latency benefit is there.
Visual_Synthesizer@reddit (OP)
a bunch of people on the rtx6kpro discord using them. lots of info in the github: https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/hardware/wrx90-cpayne-microchip-switches.md
exact_constraint@reddit
King shit, thanks.
rebelSun25@reddit
Budget?
Visual_Synthesizer@reddit (OP)
Fair point lol. The GPUs are the GPUs. The "budget" part is the platform: $2K for used board+CPU+RAM vs roughly $15K+ for Threadripper Pro + DDR5. Same GPUs, 18% faster. The PLX switch pays for itself. Good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap! It's a blueprint
TableSurface@reddit
I regret going AM5 instead of the EPYC build I was looking at, especially now that RAM is so much more expensive. At the time, MoE models weren't a thing and I couldn't justify spending 2-3x for the platform.
In retrospect, having 8x platform memory bandwidth for 2-3x cost is cheap...
banecroft@reddit
Bro the GPU is like the most important and expensive part of this build.
Visual_Synthesizer@reddit (OP)
i can buy another 6000pro off the money i saved not going threadripper and paying the ram tax. GPUs are all that matter for inference. did you miss the part where i said you can run any generation of GPU on this blueprint? this unlocks more hardware for all types of budgets and offers optimal performance.
banecroft@reddit
I’m not saying it’s not efficient, I’m saying it’s not a budget build. You can have a setup that saves you 2 million, but if it costs 1 million to get it running - that’s not a budget build, you see what people are saying?
wektor420@reddit
What they mean is a single rtx 6000 pro is like 3 months of salary for a developer outside US
Blanketsniffer@reddit
in concurrent serving scenarios, at what concurrency does per-user throughput stay above a minimum of 50 tok/s?
Mean-Sprinkles3157@reddit
Can you share the parameters for sglang (b12x+NEXTN)? Thanks.
brickout@reddit
Budget. Lol
Visual_Synthesizer@reddit (OP)
the GPUs cost what the GPUs cost. I saved $10K on everything else and got 18% more speed. that's like complaining a race car is expensive while ignoring that the other guy's race car costs more and is slower
brickout@reddit
No, it isn't. It just shows how different your perspective is when you have money. It's even more worrisome that you're defending it so hard. You live in a different reality from the vast, vast majority and clearly don't appreciate it. It's condescending.
laterbreh@reddit
If you havent figured it out yet, you jebaited everyone with the title of your post. Im starting to think you did this on purpose so you can have passive-aggressive comments to people.
Available-Goose9245@reddit
These are solid numbers thanks for sharing
Own_Ambassador_8358@reddit
Have you tried qwen3.5-397b? 3bit quant? How fast would it go on your build?
Thanks :)
Visual_Synthesizer@reddit (OP)
yes! 79 tok/s
Own_Ambassador_8358@reddit
Thank you <3
SpicyWangz@reddit
Every time I use the 122B model in opencode, it starts getting into a loop of endlessly compacting. 35B never has this issue though.
Visual_Synthesizer@reddit (OP)
probably a low quality provider.
AlwaysLateToThaParty@reddit
And that's why I don't use cloud providers. You have no idea what you're talking to.
acetaminophenpt@reddit
2x rtx Pro 6000 - budget build I just literally spitted my coffee
Hector_Rvkp@reddit
spit spat spat, dummy.
Expert_Bat4612@reddit
What does a machine like this cost ?
Hector_Rvkp@reddit
A lot, but less than a threadripper. Ask an LLM. Each GPU alone is 8k+, and the switch is probably 2+k. The novelty here, afaik, is the use of the switch. Usually you would plug the gpus on the mobo and call it a day.
Fresh_Month_2594@reddit
has anyone experienced that when using MTP/Speculative Decoding with Qwen 3.5 and VLLM, structured outputs breaks/becomes unreliable ?
Prize_Negotiation66@reddit
I have 30 t/s with single 4090 + 32 ddr5 6400, at ud iq2
Visual_Synthesizer@reddit (OP)
thats great! nice work.
gurkburk76@reddit
Rtx PRO 6000 and budget... We do not live in the same universe clearly 😅
pmttyji@reddit
Wish me luck & prosperous situation soon onwards. In future I also want to do budget builds like this.
Visual_Synthesizer@reddit (OP)
good luck! start with whatever GPUs you can afford and a cheap PLX switch. the methodology scales down to any generation. the benchmarks don't care how much you paid
romedatascience@reddit
Is the budget in the room with us?
Visual_Synthesizer@reddit (OP)
it's hiding behind the $10K I saved on the platform
ga239577@reddit
"Budget Build" ... Uhm 😅
Visual_Synthesizer@reddit (OP)
i suppose its relative. i work for billionaires, and this is budget to them. i think the cool part is that this can scale to smaller budgets. super cheap am4 systems with gen4 chinese plx's running as many gpus as people can afford. this optimizes for inference speed and low tok per dollar costs.
david_0_0@reddit
the 2.4M token KV budget vs 131K context ceiling is interesting. that suggests youre hitting the cache efficiency wall before the model maxes out context. did you test whether enabling dynamic attention or switching to paged attention further improves throughput? also curious if the 26.4GB mamba state is per-token or fixed overhead - if its fixed, concurrent requests would tank the effective batch size.
Visual_Synthesizer@reddit (OP)
Great catch on the Mamba state. Pulled from server logs:
Mamba Cache (per GPU)
max_mamba_cache_size: 173
conv_state: 0.22 GB
ssm_state: 12.23 GB
intermediate_ssm_state: 13.92 GB
intermediate_conv_win: 0.24 GB
Total: \~26.6 GB per GPU
Key point:
- Mamba state is per-sequence, not per-token
- 173 slots = hard concurrency ceiling for this hybrid GDN model
Implications:
- KV cache supports \~2.4M tokens (\~18 users @ 131K ctx)
- But Mamba caps at 173 concurrent sequences regardless of length
- Explains why SGLang + NEXTN peaks at C=32 (\~1411 tok/s) instead of scaling like pure attention models
Notes:
- SGLang paged attention: page_size=1 (default), dynamic chunking disabled
- Linear attention backend: decode=triton, prefill=triton (not flashinfer)
Next:
- Test page_size + --enable-dynamic-chunking
- Try --linear-attn-decode-backend flashinfer (if supported)
Appreciate the technical pushback. Most replies don’t get into this layer.
ayushere@reddit
Did you use turboquant? For kv caches?
Visual_Synthesizer@reddit (OP)
not yet, hope to soon
_hypochonder_@reddit
Can you post the parameter for vLLM docker?
>| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
I had 2x RTX PRO 6000 Blackwell Max-Q at work. We use a ThinkStation PX.
Now I run Qwen3.5-122B-A10B with:

docker run --gpus all --shm-size=16g -e NCCL_P2P_DISABLE=1 -v /home/ai/models/Qwen3.5-122B-A10B-GPTQ-Int4:/model -p 8000:8000 vllm/vllm-openai:cu130-nightly-x86_64 /model --host 0.0.0.0 --port 8000 --served-model-name Qwen3.5-122B-A10B --tensor-parallel-size 2 --gpu-memory-utilization 0.80 --max-model-len 131072 --enable-expert-parallel --disable-custom-all-reduce --reasoning-parser qwen3 --language-model-only --enable-prefix-caching --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

Visual_Synthesizer@reddit (OP)
https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput
sandropuppo@reddit
Budget build wtf
pfn0@reddit
"budget build".... lol
Visual_Synthesizer@reddit (OP)
its all relative. im og in the space. already had a 4x gen4 system and needed to go gen5 and blackwell for work. didnt want to pay the RAM tax. figured this out and im smoking builds that cost 10k more.
Thrumpwart@reddit
I appreciate the post. Professional budget build is still a thing and these results prove it. Great work.
pfn0@reddit
GPU does all the heavy lifting, not surprised (I also have an rtx6000pro, just 1 tho)
rangorn@reddit
So how useful is this for real world purpose such as coding?
Visual_Synthesizer@reddit (OP)
well these AI systems have paid my bills since 2020....
Hedede@reddit
You don't need $15K Threadripper Pro. You can buy EPYC 9124 for $200 + SP5 motherboard for $800 and have 128 PCIe lanes. Of course RAM is a problem but the price is more or less the same.
If you don't need Gen 5, you can buy Threadripper Pro 3945WX for $150 + $500 motherboard + DDR4 RAM.
Visual_Synthesizer@reddit (OP)
Yes thats the whole point of the post. But you're missing the option of the switch architecture. You don't need 128 PCIe lanes from the CPU at all.
A cheap AM5 board with 1x PCIe x16 slot + a PLX switch gives full x16 P2P between GPUs, lower latency than direct attach through any CPU. The GPUs talk through the switch fabric, not through the CPU.
Add a second switch and a second x16 slot = 4-8 GPUs. Still don't need 128 CPU lanes. The switches provide the interconnect, not the CPU.
The whole point — decouple GPU communication from CPU platform. Any cheap board with PCIe x16 slots works. The CPU just loads weights and runs the API server.
EPYC SP5 and TR PRO 3945WX are solid options. I have a 3960x I ran for 5 years, its in the BG. But you're paying for CPU lanes that sit idle during inference. The GPU-to-GPU traffic goes through the switch regardless.
ycnz@reddit
Does the length of the cables start affecting your performance?
Visual_Synthesizer@reddit (OP)
yes, you need a redriver for longer runs. best to optimize for short runs or mcio off the motherboard. i had these laying around, a bit long. have shorter ones coming to clean up the build next week.
anomaly256@reddit
'budget'
https://i.redd.it/ztd0eampr9ug1.gif
Visual_Synthesizer@reddit (OP)
inconceivable that a $2K motherboard could outperform a $15K one? and yet here we are
anomaly256@reddit
the motherboard is the only reasonably-priced part of that whole build, that doesn't make it a budget build
Visual_Synthesizer@reddit (OP)
I cant control the price of GPUs, but i did just show how to build a system that's cheaper and faster on any hardware generation. if i did it on 3090s people would still complain and miss the point
anomaly256@reddit
It's not just the GPUs though, that's a super expensive chunk of RAM, the pcie switch is 2k on its own, another k for the cpu, etc
Visual_Synthesizer@reddit (OP)
very true. i sniped this system used. even ddr4 prices are up these days, not much any of us can do about this. its supply and demand in action. everyone is having fun building local systems. if i was on a tighter budget i would deal hunt for an older am4 system and run a cheap chinese plx with as many GPUs as i can afford. its best to optimize GPUs over system if you only do inference and lora training.
Acceptable-Yam2542@reddit
budget build, 198 tok/s. sure. thats more than my entire setup costs.
Visual_Synthesizer@reddit (OP)
r/LocalLLaMA: "we need cheaper local inference" me: here's how to save $10K and go faster
r/LocalLLaMA: "too expensive"
KvAk_AKPlaysYT@reddit
What's the max multi stream throughput?
Visual_Synthesizer@reddit (OP)
Multi-stream throughput (ctx=0)
122B — SGLang (b12x + NEXTN)
C=1 → 207 tok/s (207/user)
C=4 → 490 tok/s (122/user)
C=8 → 823 tok/s (103/user)
C=32 → 1411 tok/s (44/user)
122B — vLLM (MTP=1)
C=1 → 133 tok/s (133/user)
C=8 → 672 tok/s (84/user)
C=32 → 1910 tok/s (60/user)
C=128 → 3851 tok/s (30/user)
Notes:
- SGLang peaks earlier (C=32) due to speculation overhead
- vLLM scales higher at large concurrency (lighter MTP=1)
Takeaway:
SGLang wins for single-user latency
vLLM wins for high-concurrency serving
Tema_Art_7777@reddit
how is 2x6000 pro a budget build? 😀
Visual_Synthesizer@reddit (OP)
its a blueprint, this can also work on am4 with cheap gen 4 chinese plx switches
BasaltLabs@reddit
X 96GB ??!? HOLY
someone383726@reddit
I’ve got 2x 6000 pros on AM5 so the motherboard drops to x8/x8 on the pcie. I’d be curious to run a side by side and see how much worse my performance is.
Visual_Synthesizer@reddit (OP)
that would be great! ive only been able to compare it to my old trx40 and a gen5 threadripper pro on the same stack. I was 15% faster than the threadripper pro wx90 running direct-attach pcie. you should be able to reproduce everything in the stack using the startup files I posted in the OP
nero10579@reddit
You don't really need full pcie 5.0 x16 for only 2 cards. I run 2x using pcie 4.0 x16 without any communication bottlenecks, even for training.
Visual_Synthesizer@reddit (OP)
For vLLM with NCCL, you're right — Gen4 x16 is fine for 2 GPUs. Our TRX40 (Gen4 direct attach) and B650D4U+PLX (Gen5) are within 2-5% on vLLM.
The difference shows up with SGLang's PCIe oneshot allreduce, which bypasses NCCL and uses raw P2P. That's where the PLX switch matters — dedicated switch fabric path, not going through the CPU.
The bigger win was platform cost — way cheaper than Threadripper + RDIMM RAM for 15% faster inference.
The PLX also scales — add a second switch and go to 5 GPUs without changing the CPU or board. Community is already running 8+ GPUs on 2x c-payne switches with flat topology. Gen 6 boards incoming with 160 lanes this summer.
For 2 GPUs today, PCIe gen barely matters. But the switch architecture pays off when you scale up.
nero10579@reddit
Oh yea I guess I haven’t tried sglang specifically but I assumed training would be even more bandwidth heavy than any inference and in my experience training will almost always max the TDP which to me looks like it isn’t bottlenecked.
Visual_Synthesizer@reddit (OP)
depends on the type of training. full weights, yeah, plx not ideal. for loras though, where everything is in vram, p2p is faster.
putrasherni@reddit
downvoted
Visual_Synthesizer@reddit (OP)
understandable. 198 tok/s can be hard to process
qwen_next_gguf_when@reddit
Budge build my bald ass, bro.
Visual_Synthesizer@reddit (OP)
the budget part is the platform, not the GPUs. nobody calls a Honda Civic with a Ferrari engine a luxury car
Visual_Synthesizer@reddit (OP)
my hair left when I saw RAM prices too
Shoddy_Bed3240@reddit
We believe you. In theory, the maximum throughput comes from taking the bandwidth (1792 GB/s) and dividing it by the memory required per iteration (about 6 GB per token for Q4), which works out to roughly 298 tokens per second.
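Spelling out that estimate (the 6 GB/token figure is the commenter's assumption for active weights read per forward pass at Q4):

```shell
# Roofline-style decode ceiling: memory bandwidth / bytes touched per token
awk 'BEGIN {
  bw_gbs = 1792   # RTX PRO 6000 memory bandwidth, GB/s
  gb_tok = 6      # GB read per generated token at Q4 (commenter assumption)
  printf "ceiling: ~%d tok/s\n", bw_gbs / gb_tok
  # The measured 198 tok/s lands at roughly 2/3 of this theoretical ceiling,
  # which is plausible once speculative-decode acceptance and kernel
  # overheads are accounted for.
}'
```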
iMrParker@reddit
Formatted table
Visual_Synthesizer@reddit (OP)
thank you!
libbyt91@reddit
I think people are reacting to a somewhat misleading topic sentence.
m94301@reddit
Budget build? Congrats on your budget and your excellent results!
Visual_Synthesizer@reddit (OP)
its 18% faster and roughly 10k cheaper than a threadripper on the exact same stack! this can trickle down to gen4 parts also! gen4 plx switches are really cheap ;-)