RTX PRO 6000 Blackwell Max-Q bad performance
Posted by YouBePortnt@reddit | LocalLLaMA | 33 comments
I just got my RTX PRO 6000 and I'm running into performance problems.
llama-bench on Ubuntu:

llama-bench on Windows:

Even Geekbench (ONNX/DirectML):

Shouldn't it be something like 150% faster?
Two days of struggling with driver and toolkit versions, reinstalling, and recompiling, and I am out of ideas.
Could misconfiguration alone cause performance this bad, on both Windows and Ubuntu?
Or did I buy broken hardware?
FullOf_Bad_Ideas@reddit
Try mamf finder - https://github.com/mag-/gpu_benchmark
I ran the MAMF GPU benchmark that I linked earlier on 3 instances from different hosts (to account for cooling environment, etc.) and got 298.7, 296.8, and 322.9 TFLOPS.
I did the same with the 600W Workstation GPUs and got 374.7, 398.5, and 403.9 TFLOPS.
So the average of the peak MAMF values is 306.13 TFLOPS for the Max-Q and 392.37 TFLOPS for the WS.
Let's see if you get similar numbers.
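Those averages are just the mean of the three runs each; a quick check in Python, with the WS advantage made explicit:

```python
# Peak MAMF results quoted above, in TFLOPS.
max_q_runs = [298.7, 296.8, 322.9]
ws_runs = [374.7, 398.5, 403.9]

avg_max_q = sum(max_q_runs) / len(max_q_runs)
avg_ws = sum(ws_runs) / len(ws_runs)

print(f"Max-Q average: {avg_max_q:.2f} TFLOPS")     # 306.13
print(f"WS average:    {avg_ws:.2f} TFLOPS")        # 392.37
print(f"WS / Max-Q:    {avg_ws / avg_max_q:.2f}x")  # 1.28x
```

So the 600W card buys roughly 28% more peak matmul throughput at double the power budget.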
YouBePortnt@reddit (OP)
After 47 trials the max is 110.4 TFLOPS.
Unfortunately that's consistent with my other tests.
FullOf_Bad_Ideas@reddit
Yeah, something is definitely wrong. Run it again, and while it's running, use a separate terminal to monitor clocks, temperature, and power in nvtop/nvitop, then share what power draw, utilization, and clocks it reaches. It's either throttling or not getting enough power. Also, please share your NVIDIA driver version.
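If you'd rather log the numbers than eyeball nvtop, here's a minimal Python sketch that polls nvidia-smi (the query fields come from the standard `--query-gpu` interface; the sample line in the comment is hypothetical):

```python
import subprocess

QUERY = "power.draw,temperature.gpu,clocks.sm,utilization.gpu"

def parse_smi_line(line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`
    output, e.g. "221.46, 70, 1860, 100" (hypothetical sample)."""
    power, temp, clock, util = (field.strip() for field in line.split(","))
    return {
        "power_w": float(power),
        "temp_c": int(temp),
        "sm_clock_mhz": int(clock),
        "util_pct": int(util),
    }

def sample_gpu(index: int = 0) -> dict:
    """Query the GPU once; call this in a loop while the benchmark runs."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_smi_line(out.splitlines()[0])
```

Call `sample_gpu()` once a second from a second terminal while mamf finder runs and you get numbers you can paste straight into a comment.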
YouBePortnt@reddit (OP)
CPU is at 100%
FullOf_Bad_Ideas@reddit
My CPU usage shows 100% too, but it's just one core - https://pixeldrain.com/u/SJdPLhmF
I get similar TFLOPS to yours, but that's purely coincidental - the 5080 just gets this kind of performance because it's half of the full chip.
Your performance is close to what I'd expect from an RTX PRO 4000 Blackwell.
YouBePortnt@reddit (OP)
nvitop with driver and cuda versions
YouBePortnt@reddit (OP)
nvtop shows 100% on CPU
FullOf_Bad_Ideas@reddit
CUDA 13.2 is often considered borked and it's advised to avoid it - https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/
Not in a way that should impact performance in mamf finder, but I think you should downgrade anyway and then test again. Clocks and power look to be in the right ballpark.
YouBePortnt@reddit (OP)
Downgraded to driver version 590.48.01 and CUDA 13.1.
No change: 110.5 TFLOPS, power 221W/300W, temperature around 70°C.
FullOf_Bad_Ideas@reddit
Regarding CPU usage during the mamf test, I'll check later how it looks on my local 5080 and a rented 6000 Pro Max-Q.
Also, you can check Linux kernel error messages with dmesg; they could be silently piling up. I have an unstable PCIe link on one rig and it silently produces gigabytes of errors.
Can you open GPU-Z on Windows, share a screenshot of the numbers there, and compare it with other screenshots of the same GPU model online? Some RTX 5000-series gaming GPUs shipped with missing ROPs; maybe your chip has a hardware defect that would show up there. Did you buy it new from a reputable store? Do you have warranty? Is it in a normal PC case, connected straight to a PCIe slot on the motherboard without any risers/OCuLink/Thunderbolt in between?
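A sketch of that dmesg check in Python (the error patterns are typical AER/PCIe kernel-log markers, not an exhaustive list):

```python
import re
import subprocess

# Typical markers for PCIe link problems in kernel logs
# (AER = PCIe Advanced Error Reporting).
PCIE_ERROR_RE = re.compile(
    r"AER|pcieport|Bus error|(Corrected|Uncorrected|Uncorrectable) error",
    re.IGNORECASE,
)

def find_pcie_errors(dmesg_text: str) -> list:
    """Return the kernel log lines that look like PCIe errors."""
    return [line for line in dmesg_text.splitlines() if PCIE_ERROR_RE.search(line)]

def scan_dmesg() -> list:
    """Grab the live kernel log (may need root) and filter it."""
    out = subprocess.check_output(["dmesg"], text=True)
    return find_pcie_errors(out)
```

If `scan_dmesg()` comes back with thousands of lines, suspect the link (slot, riser, or signal integrity) before the chip.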
__JockY__@reddit
Don't run GGUFs and llama.cpp on that hardware, what a waste! It's optimized for FP8 kernels, which means you need SGLang or vLLM.
Also... Llama 70B? Seriously??? First: it's ancient. Second: it's dense! Of course a dense 70B is slow.
Go run nvidia/Qwen3.5-122B-A10B-NVFP4 in a recent version of vLLM and watch it smoke.
YouBePortnt@reddit (OP)
Tried with RedHatAI/Qwen3-30B-A3B-NVFP4:
https://www.reddit.com/r/LocalLLaMA/comments/1srpjrs/comment/ohlk67t/
stormy1one@reddit
Have you experienced any strange output with RedHat's NVFP4 variant? It feels a bit off to me compared to Qwen's own FP8. No idea how RedHat configured the NVIDIA optimizer. Their NVFP4 variant seems to make more loopy decisions, broken code, etc. Could just be NVFP4, though. Running latest OpenCode, vLLM 19.1 with a 6000 Max-Q.
__JockY__@reddit
Oh, run the Qwen one, 100%. They can calibrate on their own training data; it's the best possible way.
Sticking_to_Decaf@reddit
Blackwell GPUs like the Pro 6000 are optimized for FP8 and NVFP4. Software support is better for FP8.
I don’t think the hardware is optimized for Q4_K quants and certainly not GGUF.
Try running an FP8 model like the official Qwen3.6 or 3.5 FP8s. The dense 27B is a good test, but the MoE Qwen3.6-35B will absolutely fly, especially with MTP.
I have a single Pro 6000 Max-Q 300W 96GB card, and on a single request with Qwen3.6-35B in FP8 it outputs 225-250 tps (vLLM, speculative decoding using MTP with a prediction length of 3). Right now I have it running MMMLU (image benchmark) with 16 concurrent requests and it is putting out 1800-1900 tps combined across the 16 concurrent requests.
Now, that is an MoE model with only about 3B params active. The Qwen 27B and Gemma 31B dense models in FP8 are more like 45 tps unoptimized, 80 tps in NVFP4 with some optimization. I haven't tested them with MTP though.
I am spending this week optimizing and running benchmarks on the Qwen 3.5 and 3.6 models. Video analysis is a key part of my workflow so the Qwen models > Gemma for their ability to understand sequence of events and time in videos.
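A quick sanity check on the batching math in those numbers:

```python
# Single-request throughput quoted above vs 16-way concurrency.
single_tps = (225, 250)       # tps, one request at a time
aggregate_tps = (1800, 1900)  # tps combined across all streams
concurrent = 16

for agg in aggregate_tps:
    print(f"{agg} tps combined -> {agg / concurrent:.1f} tps per request")
# 1800 -> 112.5 tps, 1900 -> 118.8 tps per request

# Per-stream speed roughly halves, but total throughput is ~8x:
print(f"aggregate gain: ~{aggregate_tps[0] / single_tps[0]:.0f}x")
```

That's the usual MoE batching trade: each individual stream slows down, but the card's total output scales dramatically.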
YouBePortnt@reddit (OP)
This is from "vllm bench serve" on RedHatAI/Qwen3-30B-A3B-NVFP4 on Ubuntu.
What do you think of the results?
Blanketsniffer@reddit
What!! Are you serious? You can literally serve 16 users (I guess memory would limit the KV cache beyond that) at 100 tok/sec each with a single RTX 6000 Pro? Man, that's crazy for such a capable model.
mr_zerolith@reddit
I have two consumer Blackwell cards and they still don't have support for NVFP4.
4-bit is of course faster than 8-bit on practically any card, including these.
Sticking_to_Decaf@reddit
Which cards?
mr_zerolith@reddit
5090 and RTX PRO 6000 ( regular PCIE version )
Sticking_to_Decaf@reddit
Those both support NVFP4. You need the latest drivers and CUDA 13, IIRC (12.9 might work).
Support in vLLM / llama.cpp might be spottier. That's the software gap I mentioned.
PyTorch might need updating too.
CalligrapherFar7833@reddit
SM120 vs SM100. No, they don't support NVFP4 the way real Blackwell chips with CUDA-optimized kernels do.
Sticking_to_Decaf@reddit
They have 5th gen tensor cores and Blackwell architecture with native acceleration for FP8 and NVFP4. The problems arise from drivers and inference software.
CalligrapherFar7833@reddit
Again: SM100 vs SM120. Read about it instead of spewing BS.
mr_zerolith@reddit
I thought it was the runtimes, for example llama.cpp, not having support.
Sticking_to_Decaf@reddit
That is correct. That's why I mentioned the gap in software support and spotty support in vLLM / llama.cpp. vLLM is catching up though; it's supposed to have fixed most of its NVFP4 issues. I am on vLLM 19.1 but haven't had a chance to test it yet. I don't use or keep up with llama.cpp, but last I heard they didn't have full support yet.
llama.cpp is great when facing VRAM limitations, but with a Pro 6000 card I think vLLM is the better platform.
stormy1one@reddit
That depends on what you are using to run the model. Both 5090 and RTX Pro 6000 have NVFP4 support natively in vLLM. There are trade-offs and caveats to using it though
stormy1one@reddit
Agreed on FP8 vs NVFP4. NVFP4 needs to be quantized correctly, otherwise it feels drunk most of the time. When running a random NVFP4 quant you never really know if they applied QAT/QAD, as it is not automatic.
Sticking_to_Decaf@reddit
Definitely! I feel like RedHat has been my most reliable source of NVFP4 quants. Nvidia a close second. But I haven’t benchmarked them so this is going by feel.
mr_zerolith@reddit
This is one of the slowest models you could run (dense, not MoE, plus large), so this is not surprising.
Why are you using this very outdated model to test the performance of your card?
bigboyparpa@reddit
Did you compare against the Max-Q or the full 600W card? The full 600W version of course has better perf.
DinoAmino@reddit
Try an AWQ on vLLM. Seriously, give it a shot.
https://huggingface.co/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
BobbyL2k@reddit
Yeah, that seems slow. I'm getting 1800 tok/s PP and 28 tok/s TG running a 70B Q6_K at ~2000 context tokens on dual 5090s. You should be getting more, since you don't have dual-GPU overhead.