Benchmarking Qwen3 30B and 235B on dual RTX PRO 6000 Blackwell Workstation Edition

Posted by blackwell_tart@reddit | LocalLLaMA | View on Reddit | 45 comments

As promised in the banana thread. OP delivers.

Benchmarks

The following benchmarks were taken using official Qwen3 models from Huggingface's Qwen repo for consistency:

Qwen3 235B A22B GPTQ Int4 quant in Tensor Parallel
Qwen3 30B A3B BF16 in Tensor Parallel
Qwen3 30B A3B BF16 on a single GPU
Qwen3 30B A3B GPTQ Int4 quant in Tensor Parallel
Qwen3 30B A3B GPTQ Int4 quant on a single GPU

All benchmarking was done with vllm benchmark throughput ... using full context space of 32k and incrementing the number of input tokens through the tests. The 235B benchmarks were performed with input lengths of 1024, 4096, 8192, and 16384 tokens. In the name of expediency the remaining tests were performed with input lengths of 1024 and 4096 due to the scaling factors seeming to approximate well with the 235B model.

Hardware

2x Blackwell PRO 6000 Workstation GPUs, 1x EPYC 9745, 512GB DDR5 5200 MT/s, PCIe 5.0 x16.

Software

Ubuntu 24.04.2
NVidia drivers 575.57.08
CUDA 12.9

This was the magic Torch incantation that got everything working:

pip install --pre torch==2.9.0.dev20250707+cu128 torchvision==0.24.0.dev20250707+cu128 torchaudio==2.8.0.dev20250707+cu128 --index-url https://download.pytorch.org/whl/nightly/cu128

Otherwise these instructions worked well despite being for WSL: https://github.com/fuutott/how-to-run-vllm-on-rtx-pro-6000-under-wsl2-ubuntu-24.04-mistral-24b-qwen3

Results

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 1k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 1024
Throughput: 5.03 requests/s, 5781.20 total tokens/s, 643.67 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 4k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 4096
Throughput: 1.34 requests/s, 5665.37 total tokens/s, 171.87 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 8k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 8192
Throughput: 0.65 requests/s, 5392.17 total tokens/s, 82.98 output tokens/s
Total num prompt tokens:  8189599
Total num output tokens:  128000

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 16k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 16384
Throughput: 0.30 requests/s, 4935.38 total tokens/s, 38.26 output tokens/s
Total num prompt tokens:  16383966
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 1k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --tensor-parallel 2 --input-len 1024
Throughput: 11.27 requests/s, 12953.87 total tokens/s, 1442.27 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 4k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --tensor-parallel 2 --input-len 4096
Throughput: 5.13 requests/s, 21651.80 total tokens/s, 656.86 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 1k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --input-len 1024
Throughput: 13.32 requests/s, 15317.81 total tokens/s, 1705.46 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 4k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --input-len 4096
Throughput: 3.89 requests/s, 16402.36 total tokens/s, 497.61 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 1k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 1024
Throughput: 23.17 requests/s, 26643.04 total tokens/s, 2966.40 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B FP16 (Qwen official GPTQ Int4) @ 4k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 4096
Throughput: 5.03 requests/s, 21229.35 total tokens/s, 644.04 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 1k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --input-len 1024
Throughput: 17.44 requests/s, 20046.60 total tokens/s, 2231.96 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 4k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --input-len 4096
Throughput: 4.21 requests/s, 17770.35 total tokens/s, 539.11 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

[-]

polawiaczperel@reddit

Great benchmarks, thanks for that. It shows that those models can be really fast on consumer hardware (Yes RTX 6000 is still semi consumer hardware).

I am curious how good it can be for making a web applications using recursive agentic flow. So there is an error in logs, or design is not accurate to figma design and it is trying to fix it to the moment that everything is working fine. It is still not bruteforce, but kind of.

Has anyone some experience with flow I described?

[-]

pointer_to_null@reddit

It shows that those models can be really fast on consumer hardware (Yes RTX 6000 is still semi consumer hardware).

"consumer" or "semi-consumer" is an odd word choice. What hypothetical consumer hardware are you imagining?

I'd imagine ~$20k just in GPU costs (2x RTX 6000 Blackwell cards) alone... on top of a server with 128 Zen5 cores and 768GB DDR5-5200 (12 channel) pushes hardware past your typical "prosumer" and well into "professional" category. Sure, each Blackwell card is running the same GB202 chip from the RTX 5090- albeit unlocked 10% more cores and with 3x the VRAM.

[-]

No_Afternoon_4260@reddit

Well yeah, I mean any of those vscode extension can do that if you auto approve read,write, commands..
I'm sure there's better way to do that but that's how I experiment, my main cobay is devstral these days.

In my experience they tend to achieve what you want as long as it's not too complicated and you don't try to be too explicit (constraints in another way than they'll do it by themself).

Also if it's too complicated they tend to diverge from you goal but that's because the agentic concept is still in its infancy imo

But still i prefer to keep the leash tight and go by small iterations (idk may be I just like yo use my brain and see what's happening)

[-]

blackwell_tart@reddit (OP)

A small update: after following the instructions provided by Daniel (https://old.reddit.com/r/LocalLLaMA/comments/1lzps3b/kimi_k2_18bit_unsloth_dynamic_ggufs/) we were able to hit 20 tokens/sec with the Kimi K2 UD_Q4_K_XL GGUF quite easily.

[-]

Direct_Turn_1484@reddit

Thanks for this.

I sure wish I could build a machine like this for somewhere closer to $5k. Happy for those that can build and play with such a nice rig.

[-]

DAlmighty@reddit

Congrats on even getting vLLM to run on the pro 6000. That’s a feat I haven’t been about to accomplish yet.

[-]

blackwell_tart@reddit (OP)

The instructions in the original post should work quite well, assuming you are running Ubuntu Linux.

[-]

DAlmighty@reddit

I’ve tried that method but it still isn’t working for me. I think a refreshed install of Ubuntu is in my future.

[-]

blackwell_tart@reddit (OP)

To quote the young ‘uns cultural references: this is the way.

[-]

Traclo@reddit

They added sm120 a couple of weeks ago, so the latest release should work with it out of the box.

[-]

Impossible_Art9151@reddit

thanks for your reports!
You have built a setup that I am considering myself.
Can I ask you a few details please.
1) Your server just has 512GB RAM. Why didn't you go with 1TB or 2TB?
CPU RAM isn't that expensive compared to VRAM.
My considerations go like: With 2TB I can load a deepseek, a qwen3:235 and a few more into memory preventing cold starts.
2 x rtx6000 pro is high-end prosumer and I would aim for running the big models under it without heavy degradation from low quants.
This is no critic! I am just curious about your thoughts and your use case.

2) You are using vllm. Does it have any advantage over ollama, that I am using right now? Especially anything regarding your specific setup?

3) How do the MOE models scale with 2 x GPU? I would expect a qwen3:235b in q4 should run completely from GPU since 2 x 96GB = 192GB VRAM << 144GB for qwen plus context. Does qwen run GPU only?
Since I am using a nvida A6000 /48GB ollama shows me for the 235b 70% CPU/30%GPU. That means I am loosing > 70% in speed due to VRAM limitations.

Can you specify the loss from 2 x GPU versus a single GPU with 192GB? There must be some losses due to overhead and latency.

4) How much did you pay for your 2 x rtx 6000 hardware overall?

5) Last but not least: Who happend to the banana? is she save?

thx in advance

[-]

DepthHour1669@reddit

VLLM is a lot better than ollama for actual production workloads

[-]

night0x63@reddit

Why? I've heard this before... But always without reasons.

[-]

DepthHour1669@reddit

VLLM batches parallel inference very well, plus features like chunked attention. So 1 user may get 60 token/sec, 2 users each get 55tok/sec each, and 3 users get 50tok/sec each, etc (up to a point). Whereas Ollama will serve 3 users at 20tok/sec.

[-]

Impossible_Art9151@reddit

Thanks for your input. Indeed the actual ollama is broken regarding multiuser usage. Two request are really killing the performance.
Next ollama release will get a bug fix, parallel setting = 1 as standard.
link here: https://github.com/ollama/ollama/releases/tag/v0.9.7-rc0

In my case a sequential processing is sufficient. Apart from heavy commercial systems I wonder about parallel processing anyway. I cannot see any big user advantage from this.

Overall user experience will suffer because any multiuser processing will be slower than sequential proceesing (one GPU and Amdahls Law given).

I will give ollama alternatives a try.

[-]

DepthHour1669@reddit

No, the idea is that below the token limit, it takes the same amount of time to process 1 user or 10 users.

Sequential would be slower.

[-]

Impossible_Art9151@reddit

Okay - thanks again.
This means, the overall processing of one user keeps some hardwareressources unused. These ressources can be used otherwise?
I am really deep into efficency, hardware usage, paralellism vs sequential workloads. But I am still learning - the whole GPU world has different aspects than its CPU counterpart.

[-]

DepthHour1669@reddit

Correct. It’s a consequence of the hardware design of GPUs. Inference is much more efficient when batch processing. If you only have one small user query in the batch, then most compute cores are idle and can’t take advantage of the required memory bandwidth going through all the params.

[-]

night0x63@reddit

(Side north about lots of system memory: About 3 months ago I specced out an AI server. And did Les CPU and memory. So like 400 GB memory.

Now with qwen 235b and and mixtral 176b and llama4 200b. And llama4 behemoth 2T... All are MOE... All split vram and CPU.

If you want to run Kimi/moonshot or llama 4 behemoth... You need 1-2 TB memory.)

[-]

Lissanro@reddit

Ollama is not efficient for GPU+CPU inference. For single user requests, ik_llama.cpp is the best, at least 2-3 times faster compared to llama.cpp (which Ollama is based on). For multiple users, vllm is probably better though.

Multiple GPUs, depending on how they are used, either maintain about the same speed as a single GPU would, or even bring huge speed up if used with tensor parallelism enabled.

[-]

Expensive-Apricot-25@reddit

Man… do u have any Benchmarks that aren’t throughput?

[-]

blackwell_tart@reddit (OP)

No. Do you have any questions that explain what it is you wish to know?

[-]

Expensive-Apricot-25@reddit

single request speeds.

[-]

blackwell_tart@reddit (OP)

Capitalization, punctuation and good manners cost nothing, unlike benchmarking models on your behalf, which costs time. Time is precious when one is old.

No, sir. You did not take the time to be polite and I have no mind to take the time to entertain your frippery.

Good day.

[-]

Expensive-Apricot-25@reddit

You're right, and I apologize.

I didn't expect you to read/reply, and I was just frustrated because most real benchmarks posted here are throughput, and not single request speeds, which aren't as relevant to most people here. But thats no excuse for the poor manners, which I again apologize again for.

I typed it very late at night on my phone, hence the poor grammar.

I didn't mean to ask you to run more benchmarks, just if you had done it already.

I don't mean to ask anything of you, I just don't want to leave things on a bad foot. Have fun with your new cards!

[-]

blackwell_tart@reddit (OP)

It would seem that manners are indeed alive and well in the world, for which I am thankful. I retract my ire and apologize for being a curmudgeon.

Qwen3 235B A22B GPTQ Int4 runs at 75 tokens/second for the first thousand tokens or so, but starts to drop off quickly after that.

Qwen3 30B A3B GPTQ Int4 runs at 151 tokens/second in tensor parallel across both GPUs.

Interestingly, 30B A3 Int4 also runs at 151 tokens/second on a single GPU.

[-]

notwhobutwhat@reddit

This one is interesting, and I wonder if it's due to how the experts are distributed with -tp enabled by default in VLLM. From what I gather, if you're activating only experts on a single GPU (no idea how likely this might be), this might explain it.

I'm running the 30B-A3B AWQ quant, and I did notice on boot that it disables MoE distribution across GPUs due to the quant I'm using, perhaps GPTQ might allow it?

[-]

sautdepage@reddit

Throughput measures the maximum number of tok/sec achievable when processing parallel requests, is that right? I might be wrong - let me know.

But if so, that doesn't reveal the experience a single user gets, ie in agentic coding. So sequential processing throughout would be useful too.

[-]

blackwell_tart@reddit (OP)

75 tokens/second with Qwen3 235B A22B GPTQ Int4.

https://i.imgur.com/5vGk4Qs.png

[-]

Expensive-Apricot-25@reddit

no you are correct, generally speaking throughput is only useful if you are serving for a large group of people, like in the hundereds where concurrent requests are actually common.

but for the vast majority of people here its pretty irrelevant, so single request speeds are a more useful metric.

[-]

jarec707@reddit

Ha ha good reply

[-]

Steuern_Runter@reddit

The drop off that comes with more context length is huge. Is this the effect of parallelism becoming less efficient or something?

1k input - 643.67 output tokens/s

4k input - 171.87 output tokens/s

8k input - 82.98 output tokens/s

[-]

GreenVirtual502@reddit

More input=more prefill time=less t/s

[-]

night0x63@reddit

I thought input and output were independent? I guess not?

Big input... Means slower output token rate?

[-]

SandboChang@reddit

It does but I agree the fall off seems larger than expected.

[-]

blackwell_tart@reddit (OP)

less

Fewer.

[-]

Substantial-Ebb-584@reddit

Thank you for the benchmarks! The one thing I noticed is how Moe models are bad at content size scaling, compared to dense models. I don't say it's bad per se. But like in model size equivalent. Those benchmarks just showed that in a very nice scale. From one side they're faster, but the bigger the content the more it kills the purpose of moe as local LLM.

[-]