RTX PRO 6000 Blackwell Max-Q bad performance
Posted by YouBePortnt@reddit | LocalLLaMA | 33 comments
I just got my RTX PRO 6000 and I'm running into performance problems.
llama-bench on Ubuntu:

llama-bench on Windows:

Even Geekbench (ONNX/DirectML):

Shouldn't it be something like 150% faster?
Two days of struggling with driver and toolkit versions, reinstalling, and recompiling, and I am out of ideas.
Could misconfiguration alone cause performance this bad, on both Windows and Ubuntu?
Or did I buy broken hardware?
FullOf_Bad_Ideas@reddit
Try mamf finder - https://github.com/mag-/gpu_benchmark
I ran the MAMF GPU benchmark that I linked earlier on 3 instances from different hosts (to account for cooling environment, etc.) and got 298.7, 296.8, and 322.9 TFLOPS.
I did the same with the 600W Workstation GPUs and got 374.7, 398.5, and 403.9 TFLOPS.
So the average of the peak MAMF values is 306.13 TFLOPS for the Max-Q and 392.37 TFLOPS for the WS.
Let's see if you get similar numbers.
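Those averages are just the mean of the three runs each; a quick check in Python, with the WS advantage made explicit:

```python
# Peak MAMF results quoted above, in TFLOPS.
max_q_runs = [298.7, 296.8, 322.9]
ws_runs = [374.7, 398.5, 403.9]

avg_max_q = sum(max_q_runs) / len(max_q_runs)
avg_ws = sum(ws_runs) / len(ws_runs)

print(f"Max-Q average: {avg_max_q:.2f} TFLOPS")     # 306.13
print(f"WS average:    {avg_ws:.2f} TFLOPS")        # 392.37
print(f"WS / Max-Q:    {avg_ws / avg_max_q:.2f}x")  # 1.28x
```

So the 600W card buys roughly 28% more peak matmul throughput at double the power budget.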
YouBePortnt@reddit (OP)
After 47 trials the max is 110.4 TFLOPS.
Unfortunately that's consistent with my other tests.
FullOf_Bad_Ideas@reddit
Yeah, something is definitely wrong. Run it again, and while it's running, use a separate terminal to monitor clocks, temperature, and power in nvtop/nvitop, then share what power draw, utilization, and clocks it reaches. It's either throttling or not getting enough power. Also, please share your NVIDIA driver version.
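If you'd rather log the numbers than eyeball nvtop, here's a minimal Python sketch that polls nvidia-smi (the query fields come from the standard `--query-gpu` interface; the sample line in the comment is hypothetical):

```python
import subprocess

QUERY = "power.draw,temperature.gpu,clocks.sm,utilization.gpu"

def parse_smi_line(line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`
    output, e.g. "221.46, 70, 1860, 100" (hypothetical sample)."""
    power, temp, clock, util = (field.strip() for field in line.split(","))
    return {
        "power_w": float(power),
        "temp_c": int(temp),
        "sm_clock_mhz": int(clock),
        "util_pct": int(util),
    }

def sample_gpu(index: int = 0) -> dict:
    """Query the GPU once; call this in a loop while the benchmark runs."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_smi_line(out.splitlines()[0])
```

Call `sample_gpu()` once a second from a second terminal while mamf finder runs and you get numbers you can paste straight into a comment.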
YouBePortnt@reddit (OP)
CPU is at 100%
FullOf_Bad_Ideas@reddit
My CPU usage shows 100% too, but it's just one core - https://pixeldrain.com/u/SJdPLhmF
I get similar TFLOPS to yours, but that's purely coincidental - the 5080 just gets this kind of performance because it's half of the full chip.
Your performance is close to what I'd expect from an RTX PRO 4000 Blackwell.
YouBePortnt@reddit (OP)
nvitop with driver and cuda versions
YouBePortnt@reddit (OP)
nvtop shows 100% on CPU
FullOf_Bad_Ideas@reddit
CUDA 13.2 is often considered borked and it's advised to avoid it - https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/
Not in a way that should impact performance in mamf finder, but I think you should downgrade anyway and then test again. Clocks and power look to be in the right ballpark.
YouBePortnt@reddit (OP)
Downgraded to driver version 590.48.01 and CUDA 13.1.
No change: 110.5 TFLOPS, power 221W/300W, temperature around 70°C.
FullOf_Bad_Ideas@reddit
Regarding CPU usage during the mamf test, I'll check later how it looks on my local 5080 and a rented 6000 Pro Max-Q.
Also, you can check Linux kernel error messages with dmesg; they could be silently piling up. I have an unstable PCIe link on one rig and it silently produces gigabytes of errors.
Can you open GPU-Z on Windows, share a screenshot of the numbers there, and compare it with other screenshots of the same GPU model online? Some RTX 5000-series gaming GPUs shipped with missing ROPs; maybe your chip has a hardware defect that would show up there. Did you buy it new from a reputable store? Do you have warranty? Is it in a normal PC case, connected straight to a PCIe slot on the motherboard without any risers/OCuLink/Thunderbolt in between?
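A sketch of that dmesg check in Python (the error patterns are typical AER/PCIe kernel-log markers, not an exhaustive list):

```python
import re
import subprocess

# Typical markers for PCIe link problems in kernel logs
# (AER = PCIe Advanced Error Reporting).
PCIE_ERROR_RE = re.compile(
    r"AER|pcieport|Bus error|(Corrected|Uncorrected|Uncorrectable) error",
    re.IGNORECASE,
)

def find_pcie_errors(dmesg_text: str) -> list:
    """Return the kernel log lines that look like PCIe errors."""
    return [line for line in dmesg_text.splitlines() if PCIE_ERROR_RE.search(line)]

def scan_dmesg() -> list:
    """Grab the live kernel log (may need root) and filter it."""
    out = subprocess.check_output(["dmesg"], text=True)
    return find_pcie_errors(out)
```

If `scan_dmesg()` comes back with thousands of lines, suspect the link (slot, riser, or signal integrity) before the chip.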
__JockY__@reddit
Don't run GGUFs and llama.cpp on that hardware, what a waste! It's optimized for FP8 kernels, which means you need SGLang or vLLM.
Also... Llama 70B? Seriously??? First: it's ancient. Second: it's dense! Of course a dense 70B is slow.
Go run nvidia/Qwen3.5-122B-A10B-NVFP4 in a recent version of vLLM and watch it smoke.
YouBePortnt@reddit (OP)
Tried with RedHatAI/Qwen3-30B-A3B-NVFP4:
https://www.reddit.com/r/LocalLLaMA/comments/1srpjrs/comment/ohlk67t/
stormy1one@reddit
Have you experienced any strange output with RedHat's NVFP4 variant? It feels a bit off to me compared to Qwen's own FP8. No idea how RedHat configured the NVIDIA optimizer. Their NVFP4 variant seems to make more loopy decisions, broken code, etc. Could just be NVFP4, though. Running latest OpenCode, vLLM 19.1 with a 6000 Max-Q.
__JockY__@reddit
Oh, run the Qwen one, 100%. They can calibrate on their own training data; it's the best possible way.
Sticking_to_Decaf@reddit
Blackwell GPUs like the Pro 6000 are optimized for FP8 and NVFP4. Software support is better for FP8.
I don’t think the hardware is optimized for Q4_K quants and certainly not GGUF.
Try running an FP8 model like the official Qwen3.6 or 3.5 FP8s. The dense 27B is a good test, but the MoE Qwen3.6-35B will absolutely fly, especially with MTP.
I have a single Pro 6000 Max-Q 300W 96GB card, and on a single request with Qwen3.6-35B in FP8 it outputs 225-250 tps (vLLM, speculative decoding using MTP with a prediction length of 3). Right now I have it running MMMLU (image benchmark) with 16 concurrent requests and it is putting out 1800-1900 tps combined across the 16 concurrent requests.
Now, that is an MoE model with only about 3B params active. The Qwen 27B and Gemma 31B dense models in FP8 are more like 45 tps unoptimized, 80 tps in NVFP4 with some optimization. I haven't tested them with MTP though.
I am spending this week optimizing and running benchmarks on the Qwen 3.5 and 3.6 models. Video analysis is a key part of my workflow so the Qwen models > Gemma for their ability to understand sequence of events and time in videos.
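A quick sanity check on the batching math in those numbers:

```python
# Single-request throughput quoted above vs 16-way concurrency.
single_tps = (225, 250)       # tps, one request at a time
aggregate_tps = (1800, 1900)  # tps combined across all streams
concurrent = 16

for agg in aggregate_tps:
    print(f"{agg} tps combined -> {agg / concurrent:.1f} tps per request")
# 1800 -> 112.5 tps, 1900 -> 118.8 tps per request

# Per-stream speed roughly halves, but total throughput is ~8x:
print(f"aggregate gain: ~{aggregate_tps[0] / single_tps[0]:.0f}x")
```

That's the usual MoE batching trade: each individual stream slows down, but the card's total output scales dramatically.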
YouBePortnt@reddit (OP)
This is from "vllm bench serve" on RedHatAI/Qwen3-30B-A3B-NVFP4 on Ubuntu.
What do you think of the results?
Blanketsniffer@reddit
What!! Are you serious? You can literally serve 16 users (I guess memory would limit the KV cache beyond that) at 100 tok/sec each with a single RTX 6000 Pro? Man, that's crazy for such a capable model.
mr_zerolith@reddit
I have two consumer Blackwell cards and they still don't have support for NVFP4.
4-bit is of course faster than 8-bit on practically any card, including these.
Sticking_to_Decaf@reddit
Which cards?
mr_zerolith@reddit
5090 and RTX PRO 6000 ( regular PCIE version )
Sticking_to_Decaf@reddit
Those both support NVFP4. You need the latest drivers and CUDA 13, IIRC (12.9 might work).
Support in vLLM / llama.cpp might be spottier. That's the software gap I mentioned.
PyTorch might need updating too.
CalligrapherFar7833@reddit
SM120 vs SM100. No, they don't support NVFP4 the way real Blackwell chips with CUDA-optimized kernels do.
Sticking_to_Decaf@reddit
They have 5th gen tensor cores and Blackwell architecture with native acceleration for FP8 and NVFP4. The problems arise from drivers and inference software.
CalligrapherFar7833@reddit
Again: SM100 vs SM120. Read about it instead of spewing BS.
mr_zerolith@reddit
I thought it was the runtimes, for example llama.cpp, not having support.
Sticking_to_Decaf@reddit
That is correct. That's why I mentioned the gap in software support and spotty support in vLLM / llama.cpp. vLLM is catching up though; it's supposed to have fixed most of its NVFP4 issues. I am on vLLM 19.1 but haven't had a chance to test it yet. I don't use or keep up with llama.cpp, but last I heard they didn't have full support yet.
llama.cpp is great when facing VRAM limitations, but with a Pro 6000 card I think vLLM is the better platform.
stormy1one@reddit
That depends on what you are using to run the model. Both 5090 and RTX Pro 6000 have NVFP4 support natively in vLLM. There are trade-offs and caveats to using it though
stormy1one@reddit
Agreed on FP8 vs NVFP4. NVFP4 needs to be quantized correctly, otherwise it feels drunk most of the time. When running a random NVFP4 quant you never really know if they applied QAT/QAD, as it is not automatic.
Sticking_to_Decaf@reddit
Definitely! I feel like RedHat has been my most reliable source of NVFP4 quants. Nvidia a close second. But I haven’t benchmarked them so this is going by feel.
mr_zerolith@reddit
This is one of the slowest models you could run (dense, not MoE, plus large), so this is not surprising.
Why are you using this very outdated model to test the performance of your card?
bigboyparpa@reddit
Did you compare against the Max-Q or the full 600W card? The full 600W version of course has better perf.
DinoAmino@reddit
Try an AWQ on vLLM. Seriously, give it a shot.
https://huggingface.co/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
BobbyL2k@reddit
Yeah, that seems slow. I'm getting 1800 tok/s PP and 28 tok/s TG running a 70B Q6_K at ~2000 context tokens on dual 5090s. You should be getting more, since you don't have dual-GPU overhead.