Benchmarking Qwen3 30B and 235B on dual RTX PRO 6000 Blackwell Workstation Edition

Posted by blackwell_tart@reddit | LocalLLaMA

As promised in the banana thread. OP delivers.

Benchmarks

The following benchmarks were taken using the official Qwen3 models from Hugging Face's Qwen repository, for consistency.

All benchmarking was done with vllm bench throughput ... using the full 32k context window and increasing the input length across the tests. The 235B benchmarks were run with input lengths of 1024, 4096, 8192, and 16384 tokens. In the name of expediency, the remaining tests were run with input lengths of 1024 and 4096 only, since their scaling factors appeared to track the 235B model's closely.
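
The sweep above can be sketched as a short script. The model name and flags are copied from the 235B runs below; this is just an illustration of the sweep, not the author's actual harness:

```python
# Sketch: build the benchmark commands for each input length in the sweep.
# Model and flags are taken verbatim from the 235B runs below.
MODEL = "Qwen/Qwen3-235B-A22B-GPTQ-Int4"
INPUT_LENS = [1024, 4096, 8192, 16384]

def bench_cmd(model: str, input_len: int) -> str:
    """Return the vllm bench throughput invocation for one input length."""
    return (
        f"vllm bench throughput --model {model} "
        f"--max-model-len 32768 --tensor-parallel 2 --input-len {input_len}"
    )

commands = [bench_cmd(MODEL, n) for n in INPUT_LENS]
for cmd in commands:
    print(cmd)
```

The reported totals are consistent with the tool's defaults of 1000 prompts: prompt tokens come out just under 1000 x input-len, and every run emits 128000 output tokens (1000 prompts x 128 output tokens each).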

Hardware

2x Blackwell PRO 6000 Workstation GPUs, 1x EPYC 9745, 512GB DDR5 5200 MT/s, PCIe 5.0 x16.

Software

This was the magic Torch incantation that got everything working:

pip install --pre torch==2.9.0.dev20250707+cu128 torchvision==0.24.0.dev20250707+cu128 torchaudio==2.8.0.dev20250707+cu128 --index-url https://download.pytorch.org/whl/nightly/cu128

Otherwise, these instructions worked well despite being written for WSL: https://github.com/fuutott/how-to-run-vllm-on-rtx-pro-6000-under-wsl2-ubuntu-24.04-mistral-24b-qwen3

Results

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 1k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 1024
Throughput: 5.03 requests/s, 5781.20 total tokens/s, 643.67 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 4k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 4096
Throughput: 1.34 requests/s, 5665.37 total tokens/s, 171.87 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 8k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 8192
Throughput: 0.65 requests/s, 5392.17 total tokens/s, 82.98 output tokens/s
Total num prompt tokens:  8189599
Total num output tokens:  128000

Qwen3 235B A22B GPTQ Int4 (Qwen official Int4) @ 16k input

$ vllm bench throughput --model Qwen/Qwen3-235B-A22B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 16384
Throughput: 0.30 requests/s, 4935.38 total tokens/s, 38.26 output tokens/s
Total num prompt tokens:  16383966
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 1k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --tensor-parallel 2 --input-len 1024
Throughput: 11.27 requests/s, 12953.87 total tokens/s, 1442.27 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 4k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --tensor-parallel 2 --input-len 4096
Throughput: 5.13 requests/s, 21651.80 total tokens/s, 656.86 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 1k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --input-len 1024
Throughput: 13.32 requests/s, 15317.81 total tokens/s, 1705.46 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official FP16) @ 4k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B --max-model-len 32768 --input-len 4096
Throughput: 3.89 requests/s, 16402.36 total tokens/s, 497.61 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 1k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 1024
Throughput: 23.17 requests/s, 26643.04 total tokens/s, 2966.40 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 4k input | tensor parallel

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --tensor-parallel 2 --input-len 4096
Throughput: 5.03 requests/s, 21229.35 total tokens/s, 644.04 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 1k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --input-len 1024
Throughput: 17.44 requests/s, 20046.60 total tokens/s, 2231.96 output tokens/s
Total num prompt tokens:  1021646
Total num output tokens:  128000

Qwen3 30B A3B (Qwen official GPTQ Int4) @ 4k input | single GPU

$ vllm bench throughput --model Qwen/Qwen3-30B-A3B-GPTQ-Int4 --max-model-len 32768 --input-len 4096
Throughput: 4.21 requests/s, 17770.35 total tokens/s, 539.11 output tokens/s
Total num prompt tokens:  4091212
Total num output tokens:  128000
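
For the 30B runs above, the tensor-parallel vs. single-GPU ratios work out as follows (computed directly from the reported output tokens/s; note that FP16 at 1k input is actually faster on a single GPU):

```python
# Output tokens/s for the 30B runs above: (tensor parallel, single GPU).
runs = {
    ("FP16", 1024): (1442.27, 1705.46),
    ("FP16", 4096): (656.86, 497.61),
    ("Int4", 1024): (2966.40, 2231.96),
    ("Int4", 4096): (644.04, 539.11),
}

# Speedup from adding the second GPU; < 1.0 means single GPU wins.
ratios = {key: tp / single for key, (tp, single) in runs.items()}
for (variant, n), ratio in ratios.items():
    print(f"{variant} @ {n}: TP/single = {ratio:.2f}x")
```

So the second GPU buys roughly 1.2-1.3x at best here, and at 1k input the FP16 model loses about 15% to tensor-parallel overhead.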