Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings
Posted by AvocadoArray@reddit | LocalLLaMA | View on Reddit | 156 comments
Transparency: I used an LLM to help figure out a good title and write the TLDR, but the post content is otherwise 100% written by me with zero help from an LLM.
Background
I recently asked Reddit to talk me out of buying an RTX Pro 6000. Of course, it didn't work, and I finally broke down and bought a Max-Q. Task failed successfully, I guess?
Either way, I still had a ton of questions leading up to the purchase and went through a bit of trial and error getting things set up. I wanted to share some of my notes to hopefully help someone else out in the future.
This post has been 2+ weeks in the making. I didn't plan on it getting this long, so I ran it through an LLM to get a TLDR:
TLDR
- Double check UPS rating (including non-battery backed ports)
- No issues running in an "unsupported" PowerEdge r730xd
- Use Nvidia's "open" drivers instead of proprietary
- Idles around 10-12w once OS drivers are loaded, even when keeping a model loaded in VRAM
- Coil whine is worse than expected. Wouldn't want to work in the same room as this thing
- Max-Q fans are lazy and let the card get way too hot for my taste. Use a custom fan curve to keep it cool
- VLLM docker container needs a workaround for now (see end of post)
- Startup times in VLLM are much worse than previous gen cards, unless I'm doing something wrong.
- Qwen3-Coder-Next fits entirely in VRAM and fucking slaps (FP8, full 262k context, 120+ tp/s).
- Qwen3.5-122B-A10B-UD-Q4_K_XL is even better
- Don't feel the need for a second card
- Expensive, but worth it IMO
Be careful if connecting to a UPS, even on a non-battery backed port
I have a 900w UPS backing my other servers and networking hardware. It normally fluctuates between 300-400w depending on load. I thought I was fine plugging it into the UPS's surge protector port, but I didn't realize the 900w rating covered both battery and non-battery backed ports. This thing easily pulls 600w+ total under load, and I ended up tripping the UPS breaker while running multiple concurrent requests. Luckily, it doesn't seem to have caused any damage, but it sure freaked me out.
Cons
Let's start with an answer to my previous post (i.e., why you shouldn't buy an RTX 6000 Pro).
Long startup times (VLLM)
This card takes much longer to fully load a model and start responding to a request in VLLM. Of course, larger models = longer time to load the weights. But even after that, VLLM's CUDA graph capture phase alone takes several minutes, compared to just a few seconds on my Ada L4 cards.
Setting --compilation-config '{"cudagraph_mode": "PIECEWISE"}' in addition to my usual --max-cudagraph-capture-size 2 speeds up the graph capture, but at the cost of worse overall performance (~30 tp/s vs 120 tp/s). I'm hoping this gets better in the future with more Blackwell optimizations.
Even worse, once the model is loaded and "ready" to serve, the first request takes an additional ~3 minutes before it starts responding. Not sure if I'm the only one experiencing that, but it's not ideal if you plan to do a lot of live model swapping.
For reference, I found a similar issue noted here #27649. Might be dependent on model type/architecture but not 100% sure.
All together, it takes almost 15 minutes after a fresh boot to start getting responses with VLLM. llama.cpp is slightly faster. I prefer to use FP8 quants in VLLM for better accuracy and speed, but I'm planning to test Unsloth's UD-IQ3_XXS quant soon, as they claim it scores higher than Qwen's FP8 quant, and it would free up some VRAM to keep other models loaded without swapping.
Note that this is VLLM only. llama.cpp does not have the same issue.
Update: Right before I posted this, I realized this ONLY happens when running VLLM in a docker container for some reason. Running it on the host OS uses the cached graphs as expected. Not sure why.
Coil whine
The high-pitched coil whine on this card is very audible and quite annoying. I think I remember seeing that mentioned somewhere, but I had no idea it was this bad. Luckily, the server is 20 feet away in a different room, but it's crazy that I can still make it out from here. I'd lose my mind if I had to work next to it all day.
Pros
Works in older servers
It's perfectly happy running in an "unsupported" PowerEdge r730xd using a J30DG power cable. The xd edition doesn't even "officially" support a GPU, but the riser + cable are rated for 300w, so there's no technical limitation to running the card.
I wasn't 100% sure whether it was going to work in this server, but I got a great deal on the server from a local supplier and didn't see any reason why it would pose a risk to the card. Space was a little tight, but it's been running for over a week and seems to be rock solid.
Currently running a Debian 13 VM on ESXi 8.0 with CUDA 13.1 drivers.
Some notes if you decide to go this route:
- Use a high-quality J30DG power cable (8 Pin Male to Dual 6+2 Male). Do not cheap out here.
- A safer option would probably be pulling one 8-pin cable from each riser card to distribute the load better. I ordered a second cable and will make this change once it comes in.
- Double-triple-quadruple check the PCI and power connections are tight, firm, and cables tucked away neatly. A bad job here could result in melting the power connector.
- Run dual 1100w PSUs in non-redundant mode (i.e., able to draw power from each simultaneously).
Power consumption
Idles at 10-12w, and doesn't seem to go up at all by keeping a model loaded in VRAM.
The entire r730xd server "idles" around 193w, even while running six other VMs and a couple dozen docker containers, which is about 50-80w less than my old r720xd setup. Huge win here. It only shoots up to 600w under heavy load.
Funny enough, turning off the GPU VM actually increases power consumption by 25-30w. I guess it needs the OS drivers loaded to put it into its sleep state.
Models
So far, I've mostly been using two models:
Seed OSS 36b
AutoRound INT4 w/ 200k F16 context fits in ~76GB VRAM and gets 50-60 tp/s depending on context size. About twice the speed and context that I was previously getting on 2x 24GB L4 cards.
This was the first agentic coding model that was viable for me in Roo Code, but only after fixing VLLM's tool call parser. I have an open PR with my fixes, but it's been stale for a few weeks. For now, I'm just bind mounting it to `/usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/seed_oss_tool_parser.py`.
Does a great job following instructions over long multi-turn tasks and generates code that's nearly indistinguishable from what I would have written.
It still has a few quirks and occasionally fails the apply_diff tool call, and sometimes gets line indentation wrong. I previously thought it was a quantization issue, but the same issues are still showing up in AWQ-INT8 as well. Could actually be a deeper tool-parsing error, but not 100% sure. I plan to try making my own FP8 quant and see if that performs any better.
MagicQuant mxfp4_moe-EHQKOUD-IQ4NL performs great as well, but tool parsing in llama.cpp is more broken than VLLM and does not work with Roo Code.
Qwen3-Coder-Next (Q3CN from here on out)
FP8 w/ full 262k F16 context barely fits in VRAM and gets 120+ tp/s (!).
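For anyone wondering how these "fits in VRAM" numbers pencil out, here's a rough KV-cache sizing sketch. The formula is the standard one for dense attention; the layer/head numbers in the example are made up for illustration and are NOT Q3CN's actual config (models with hybrid/linear attention layers, like the Qwen3-Next family, need noticeably less than this estimate):

```python
# Rough KV-cache sizing for a dense-attention transformer:
# 2 (K and V) x layers x kv_heads x head_dim x context_len x bytes-per-element.
# The example numbers below are purely illustrative, not any real model's config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, dtype_bytes: int) -> float:
    """Estimate KV cache size in GiB at full context."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes
    return total_bytes / 2**30

# Hypothetical 48-layer model, 4 KV heads of dim 128, F16 cache, 262k context:
print(kv_cache_gib(48, 4, 128, 262144, 2))  # 24.0
```

Halving the context or quantizing the cache to 8-bit each cut the figure in half, which is why context length and cache dtype dominate whether a model "barely fits".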
Man, this model was a pleasant surprise. It punches way above its weight and actually holds it together at max context unlike Qwen3 30b a3b.
Compared to Seed, Q3CN is:
- Twice as fast at FP8 than Seed at INT4
- Stronger debugging capability (when forced to do so)
- More consistent with tool calls
- Highly sycophantic. HATES to explain itself. I know it's non-thinking, but many of the responses in Roo are just tool calls with no explanation whatsoever. When asked to explain why it did something, it often just says "you're right, I'll take it out/do it differently".
- More prone to "stupid" mistakes than Seed, like writing a function in one file and then importing/calling it by the wrong name in subsequent files, but it's able to fix its mistakes 95% of the time without help. Might improve by lowering the temp a bit.
- Extremely lazy without guardrails. Strongly favors sloppy code as long as it works. Gladly disables linting rules instead of cleaning up its code, or "fixing" unit tests to pass instead of fixing the bug.
Side note: I couldn't get Unsloth's FP8-dynamic quant to work in VLLM, no matter which version I tried or what options I used. It loaded just fine, but always responded with infinitely repeating exclamation points "!!!!!!!!!!...". I finally gave up and used the official Qwen/Qwen3-Coder-Next-FP8 quant, which is working great.
I remember Devstral 2 small scoring quite well when I first tested it, but it was too slow on L4 cards. It's broken in Roo right now after they removed the legacy API/XML tool calling features, but will give it a proper shot once that's fixed.
Also tried a few different quants/reaps of GLM and Minimax series, but most felt too lobotomized or ran too slow for my taste after offloading experts to RAM.
UPDATE: I'm currently testing Qwen3.5-122B-A10B-UD-Q4_K_XL as I'm posting this, and it seems to be a huge improvement over Q3CN.
It's definitely "enough".
Lots of folks said I'd want 2 cards to do any kind of serious work, but that's absolutely not the case. Sure, it can't get the most out of Minimax m2(.5) or GLM models, but that was never the goal. Seed was already enough for most of my use-case, and Qwen3-Coder-Next was a welcome surprise. New models are also likely to continue getting faster, smarter, and smaller.
Coming from someone who only recently upgraded from a GTX 1080ti, I can easily see myself being happy with this for the next 5+ years.
Also, if Unsloth's UD-IQ3_XXS quant holds up, then I might have even considered just going with the RTX Pro 5000 48GB for ~$4k, or even dual RTX Pro 4000 24GB cards for <$3k.
Neutral / Other Notes
Cost comparison
There's no sugar-coating it, this thing is stupidly expensive and out of most peoples' budget. However, I feel it's a pretty solid value for my use-case.
Just for the hell of it, I looked up openrouter/chutes pricing and plugged it into Roo Code while putting Qwen3-Coder-Next through its paces:
- Input: $0.12 / 1M tokens
- Output: $0.75 / 1M tokens
- Cache reads: $0.06 / 1M tokens
- Cache writes: $0 (probably should have set this to the output price, not sure if it affected it)
I ran two simultaneous worktrees asking it to write a frontend for a half-finished personal project (one in react, one in HTMX).
After a few hours, both tasks combined came out to $13.31. This is actually the point where I tripped my UPS breaker and had to stop so I could reorganize power and make sure everything else came up safely.
In this scenario it would take approximately 566 heavy coding sessions, or 2,265 hours of full use, for the card to pay for itself (electricity cost included). Of course, there are lots of caveats here, the most obvious one being that subscription models are more cost-effective for heavy use. But for me, it's all about the freedom to run the models I want, as much as I want, without ever having to worry about usage limits, overage costs, price hikes, routing to worse/inconsistent models, or silent model "updates" that break my workflow.
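If you want to redo the math for your own hardware, here's a minimal sketch. The $13.31 per session comes from the experiment above; the ~$7,533 total cost and ~4-hour session length are back-solved from my own 566-session figure rather than quoted prices, so treat them as illustrative:

```python
# Back-of-envelope break-even math. The $13.31/session comes from the
# OpenRouter pricing test above; the ~$7,533 total (card + electricity) and
# ~4-hour session length are back-solved from the 566-session figure, so
# they're illustrative assumptions, not quoted prices.
import math

def break_even(total_cost_usd: float, session_cost_usd: float,
               session_hours: int) -> tuple[int, int]:
    """Sessions and hours of equivalent API usage needed to offset the cost."""
    sessions = math.ceil(total_cost_usd / session_cost_usd)
    return sessions, sessions * session_hours

print(break_even(7533, 13.31, 4))  # (566, 2264)
```

Swap in your actual card price and measured per-session API cost to get your own break-even point.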
Tuning
At first, the card was only hitting 93% utilization during inference until I realized the host and VM were in BIOS mode. It hits 100% utilization now and slightly faster speeds after converting to (U)EFI boot mode and configuring the recommended MMIO settings on the VM.
The card's default fan curve is pretty lazy and waits until temps are close to thermal throttling before fans hit 100% (approaching 90c). I solved this by customizing this gpu_fan_daemon script with a custom fan curve that hits 100% at 70c. Now it stays under 80c during real-world prolonged usage.
The Dell server ramps its fans up to ~80% once the card is installed, but it's not a huge issue since I've already been using a Python script to set custom fan speeds on my r720xd based on CPU temps. I adapted it to include a custom curve for the exhaust temp as well, so it can assist with clearing the heat under sustained load.
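For reference, the curve math itself is trivial. This is just a sketch of the temp-to-duty-cycle mapping I described (100% at 70c); the 40c floor and 30% minimum are values I picked, and the actual fan control is handled by the gpu_fan_daemon script, not this snippet:

```python
# Sketch of a custom GPU fan curve: ramp linearly from a 30% floor at 40c
# up to 100% at 70c, instead of the stock behavior of waiting until ~90c.
# The 40c/30% low end is an arbitrary choice; only the 70c/100% point
# matches the curve described in the post.

def fan_percent(temp_c: float, low_c: float = 40.0, high_c: float = 70.0,
                floor_pct: float = 30.0) -> float:
    """Map a GPU temperature to a fan duty cycle (percent)."""
    if temp_c <= low_c:
        return floor_pct
    if temp_c >= high_c:
        return 100.0
    # Linear interpolation between the floor and 100%
    frac = (temp_c - low_c) / (high_c - low_c)
    return floor_pct + frac * (100.0 - floor_pct)

# Temperature readings would come from something like:
#   nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits
print(fan_percent(55))  # 65.0
```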
Use the "open" drivers (not proprietary)
I wasted a couple hours with the proprietary drivers and couldn't figure out why nvidia-smi refused to see the card. Turns out that only the "open" version is supported on current generation cards, whereas proprietary is only recommended for older generations.
VLLM Docker Bug
Even after fixing the driver issue above, the VLLM v0.15 docker image still failed to see any CUDA devices (empty nvidia-smi output), which was caused by this bug #32373.
It should be fixed in v0.17 or the most recent nightly build, but as a workaround you can bind-mount /dev/null over the broken config(s) like this: `-v /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf -v /dev/null:/etc/ld.so.conf.d/cuda-compat.conf`
Wrapping up
Anyway, I've been slowly writing this post over the last couple weeks. I cut a lot out, but it genuinely would have saved me a lot of time if I'd had this info beforehand. Hopefully it helps someone else out in the future!
Fabix84@reddit
I’m one of the people who replied to you in your previous post. I’m glad you eventually decided to go with the RTX PRO 6000 Max-Q. I’ll soon be ordering my fourth, and hopefully last, card.
For your use case, I would actually recommend against using vLLM. It’s excellent software, but it’s mainly designed for professional environments where you need to serve dozens or hundreds of requests in parallel. The typical scenario is a workstation running 24/7 as an LLM server for an entire company.
For single-user access, the best combination I’ve personally tested is llama.cpp + OpenCode.
With the high-end hardware I built my workstation with, the noise only bothers me during training (never during inference). I currently run 3 RTX PRO 6000 Max-Q cards. During normal use, even when running LLM inference, the noise level is comparable to my gaming laptop. Video generation inference is a bit more noticeable in terms of noise.
I run a dual-boot Linux/Windows setup. I mostly use Linux for training. I’m using the official NVIDIA Studio drivers, and if you enable the channel that includes the latest improvements, SM120 is fully supported.
I’m glad that "for now" you feel like you don’t need another card. However, I still believe everyone eventually ends up wanting more. Maybe not a few days after buying one, but with how fast AI is evolving and will continue to evolve, there’s really no true point of satisfaction. There’s only the limit of what we can afford, unfortunately.
If you manage to stay satisfied with your setup for the next 12 months, then honestly, good for you.
Many people think that having multiple GPUs is only useful for running larger non-quantized models or very lightly quantized ones. That’s partially true. But the real power of a multi-GPU setup is being able to keep multiple models loaded at the same time for different tasks and run them together.
For example:
an LLM generating responses, while simultaneously passing them to a TTS model that speaks them out loud. At the same time you might be generating images and videos, while an agent powered by a coding-focused LLM is implementing other tasks in parallel.
Each of these things individually could run on a single GPU, but having all of them running simultaneously is a completely different experience. In the AI space it almost makes you feel omnipotent.
That said, I absolutely don’t want to downplay the sacrifices required to afford even one of these cards. Owning one is already a huge milestone. I’m just saying that over time, sooner or later depending on your ambitions and experiences, it’s normal to want more hardware.
There’s nothing to be ashamed of in admitting that. And there’s nothing to be ashamed of if someone can’t afford even one of these cards.
I bought mine one at a time, always telling myself “okay, this will be the last one.”
The fourth will probably really be the last, but only because I’ve reached the limit of the electrical power I can dedicate to them, not the limit of my hunger for VRAM.
kl__@reddit
If you don’t mind sharing, what’s your use case? And what are you training the models on?
Laabc123@reddit
Naive question. What’re the advantages of using llama.cpp over vLLM for single user usage?
Fabix84@reddit
llama.cpp allows me to load even very large models in just a few seconds. That makes it easy to quickly switch between different models depending on what I need at the moment.
The GGUF ecosystem is extremely active, and it lets you find pretty much any model already quantized in many different ways, sometimes even experimental or “heretic” versions.
In the past, the performance differences between the two engines were more significant. Today things are closer.
Personally, I would only use vLLM for a production server that runs 24/7 and needs to serve many users simultaneously. Otherwise, for single-user usage, I strongly prefer the simplicity and flexibility of llama.cpp and the GGUF ecosystem.
Bit_Poet@reddit
I've had no success getting SOTA mixed media models to work with bare metal llama.cpp. As I understand it, they've got issues with the licenses for essential stuff for that and any pull requests for it get shot down at some point. VLLM is one step ahead because of that, and it's pretty much the only platform that fully supports A+VL models without jumping through a lot of hoops. That said, I experience the same spin up time issues with VLLM in docker+WSL2 with my Pro 6000, no matter if the models are stored inside the container or on a mapped drive.
Ok-Ad-8976@reddit
I've spent three or four nights dealing with vLLM and it's been such a pain, so I gave up on it altogether. Because on top of everything else, single-user performance was abysmal on AMD RDNA 3.5 and 4 with the new qwen3.5 models.
Especially it would take forever to load vision kernels or whatever the terminology is.
AvocadoArray@reddit (OP)
I’m not necessarily saying I won’t want another one, but I don’t think I could justify it unless prices dropped significantly (like over 50%).
I still have a 1080 in my old server running the embedding/rerank models and CodeProject AI for blue iris, a 1080ti on the shelf, and a 5070ti in my gaming rig. So I do have a bit of wiggle room for additional models if needed.
Also, the quad channel DDR4 2400 RAM held up quite well when offloading 20-ish experts from Q3.5 122B. I think I saw around half the speed (40 tp/s), but it shaved off around 20GB of VRAM. Prompt processing took a bigger hit, but still usable if I ever felt the need to keep other models loaded.
I think my CPUs are a bit of a bottleneck with their low single thread performance, but I have a new set in the mail and would be interested to try it again once they come in.
buttonstraddle@reddit
do you regret going with the Max-Q model?
AvocadoArray@reddit (OP)
Not at all. My server wouldn’t be able to power the 600w server/workstation cards, and even if they did, I’d have to run the server fans much higher to keep them cool.
I’d only consider the workstation card if I planned to run it in my desktop.
MR_Weiner@reddit
Could you let me know what the footprint of your kv cache is with qwen3.5 122b a10b q4? I’m assuming you’re running full 264k context? Bf16 or quantized? Sorry if your post mentioned this and I just missed it. But I’m thinking about picking up one of these and running concurrent requests on it with this model, so trying to understand what kind of cache setup I should expect to be able to use with that model on the 6000.
Icy_Bid6597@reddit
u/AvocadoArray I had the same issues with long delay on first request to VLLM on RTX6000 in docker.
What I found so far:
- mounting a directory for the Triton cache cut it down by ~50%
- adding a dir for the CUDA cache cut it further by 60%
I went down from 2 minutes for the first request to ~11 seconds. Still not perfect, but better.
`-v ~/nv_cache/:/root/nv_cache -v ~/triton_cache:/root/triton_cache --env TRITON_CACHE_DIR=/root/triton_cache --env CUDA_CACHE_PATH=/root/nv_cache/ComputeCache --env CUDA_CACHE_MAXSIZE=10737418240`
I am not sure about the last argument (CUDA_CACHE_MAXSIZE). It theoretically keeps the CUDA cache size under control, but I don't think it is necessary.
Armym@reddit
Godlike
AvocadoArray@reddit (OP)
This is not correct. There are no `nv_cache` or `triton_cache` directories. VLLM caches torch graphs under `/root/.cache/vllm/torch_compile_cache`, which is already mounted in my container (and being used, just very slowly for some reason). I do not believe CUDA graphs are written to disk; they're only cached in memory.
Respectfully, was this LLM generated?
AvocadoArray@reddit (OP)
Holy crap, this fixed it. How did you figure this out /u/Icy_Bid6597? I don't see these directories or env variable mentioned in VLLM's docs at all.
I created a new cuda-cache volume and updated my llama-swap config:
Could also probably be solved by just mounting the vllm-cache directory to /root/ instead of /root/.cache/vllm.
I haven't compared the full logs, but this shaved off around four minutes from startup time. Torch graphs load in 7s instead of 70+, and CUDA graph capture only takes 4s instead of 145+.
Will update my post after more testing.
Glittering_Carrot_88@reddit
Does it run crysis?
AvocadoArray@reddit (OP)
Not trying to void the warranty just yet.
eliko613@reddit
Really thorough writeup! Your cost comparison methodology with OpenRouter pricing is clever - I've seen a lot of people struggle to get accurate ROI calculations for local LLM infrastructure.
One thing that might be interesting for your setup: since you're already tracking utilization and performance across different models/quants, you might want to look into more structured observability tooling. I've been using ZenLLM.io to track costs and performance across both local and API endpoints, and it's been helpful for getting better visibility into which model configurations actually perform best for different use cases.
The startup time issues you're seeing with VLLM are fascinating - 15 minutes is brutal for model swapping workflows. Have you tried any of the newer VLLM optimizations for Blackwell, or are you stuck waiting for better upstream support? The container vs host performance difference is particularly weird.
Orlandocollins@reddit
I couldn't help myself and bought a second one. Brought models like minimax m2 into play and have no regrets
AvocadoArray@reddit (OP)
Would you be willing to test out Qwen 3.5 122b and let me know how it compares? I haven't used minimax m2/m2.5 in any meaningful capacity, but 122b feels like it works as well as I've seen others describe the Minimax and GLM models.
Orlandocollins@reddit
Yeah I can try. I've never had much luck with qwen models and tool calling using llama.cpp, but I can see if it has improved.
TokenRingAI@reddit
Prediction: 4 months from now you'll be buying a 2nd card
AvocadoArray@reddit (OP)
Counter-prediction: 4 months from now, I'll be able to run better models on a single card than two cards can run today.
Arguably happened after spending some time with Qwen3.5-122B-A10B, but not sure we'll see many more open weight models from them going forward.
TokenRingAI@reddit
Prediction: Deepseek V4 is coming, and you will get extreme FOMO
AvocadoArray@reddit (OP)
You think a 2nd card is going to make a dent in running a ~1T parameter model?
TokenRingAI@reddit
🤫🤐
AvocadoArray@reddit (OP)
😂
In all reality, models that large tend to still be quite impressive around the 1-2 bpw range. I could possibly play around with it while offloading a bunch of weights to RAM/NVMe, but wouldn't expect any real-time usable speeds.
TokenRingAI@reddit
https://github.com/deepseek-ai/Engram
radomird@reddit
Why? Are they stopping open sourcing?
Orlandocollins@reddit
Is this what I sounded like before I folded and bought another?
big___bad___wolf@reddit
😅 I bought the second one two weeks after daily driving one.
parrot42@reddit
And I have a BIOS bug with an ASUS mainboard, where I have to disable REBAR, otherwise POST does not complete with a 2nd card. Which is very annoying, because every time testing a new bios version, I have to take out a card temporarily.
Jarlsvanoid@reddit
I have the workstation model, although I limit it to 450w. I am using qwen3.5 122b as the main model for everything. With the maximum context of 256k, vLLM gives me a concurrency of more than 3x. I am using an NVFP4 version.
Same as you, the model takes a long time to load, but once everything is in memory it is very fast, both in prompt processing and response. I don't need ChatGPT anymore.
If I regret anything, it is perhaps not having bought the Max-Q model so I could fit another card.
AvocadoArray@reddit (OP)
122b is definitely a game-changer. What quant are you running? I tried running AWQ 4-bit in VLLM but it never returned any thinking tokens, even with `--reasoning-parser qwen3`.
Might have been a version issue. I'm going to try it with v0.17.0 now.
Jarlsvanoid@reddit
I'm using Sehyo/Qwen3.5-122B-A10B-NVFP4 with reasoning disabled:
--reasoning-parser qwen3
--default-chat-template-kwargs '{"enable_thinking": false}'
iamvikingcore@reddit
Meanwhile a used Macbook Pro with 64 or 128 gigs of RAM can run all of those same models just not as fast for about 1:15 of the cost
AvocadoArray@reddit (OP)
Technically, my PC can run the same models with 64GB RAM and 120GB SSD, just "not as fast".
Speed does matter for real-time usage, and everything I've seen suggests that prompt processing speed on Mac's unified memory is painfully slow, like 3-5 minutes or more for larger prompts.
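For a rough sense of scale, here's a quick back-of-envelope; the tok/s figures below are illustrative assumptions, not measured benchmarks:

```python
# Why prompt-processing speed matters for real-time use. The rates here are
# illustrative assumptions for the sake of the math, not benchmark results.

def pp_minutes(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Minutes spent on prompt processing before the first output token."""
    return prompt_tokens / pp_tok_per_s / 60

# A 100k-token prompt at a hypothetical ~500 tok/s (slow unified memory)
# vs a hypothetical ~5000 tok/s (fast discrete GPU):
print(round(pp_minutes(100_000, 500), 1))   # 3.3
print(round(pp_minutes(100_000, 5000), 1))  # 0.3
```

In agentic coding, where every tool call re-sends a large context, that difference compounds fast.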
jacek2023@reddit
"but tool parsing in llama.cpp is more broken than VLLM and does not work with Roo Code."
autoparser branch has been merged into llama.cpp after your post ;)
AvocadoArray@reddit (OP)
Nice, I'll have to take a look!
This is why I love this sub. It'd be almost impossible to keep up with everything without it.
mmazing@reddit
My UPS beeps under load but hasn’t caused any trouble so far…
AvocadoArray@reddit (OP)
Mine does too... except I turned it off years ago because it freaked the dog out every time the power went out 🤦
radomird@reddit
Great write up. I've tried qwen3.5-122B (UD-Q3_K_XL) and 35B (UD-Q8_K_XL) on dual RTX 5000 Ada (2x32gb); using llama.cpp they load in a few seconds. Performance-wise 35B works better for me, but 122B gives somewhat better results at the cost of speed.
Spezisasackofshit@reddit
Coil whine is an interesting issue that I hadn't considered, makes sense given the intended application would have these cards off site mostly, but would be a total deal breaker for my home setup. Thank you for sharing your experiences, there's a lot of info here that is great for someone like myself who is considering setting up a dedicated AI rig.
jeekp@reddit
this got my attention as well. My dream setup is a 2x Max Q but it would have to go in my office.
AvocadoArray@reddit (OP)
I honestly cannot overstate how noisy it is. Some people talked about the fan noise, but the coil whine is 1000% more annoying than the fans, even when they're running at 100%.
From what others have said, the workstation card might not have the same issue, so take that into consideration as well.
suicidaleggroll@reddit
I have the same issue with vLLM, startup takes an eternity. llama.cpp should be much faster though, on the order of 10-15 seconds plus however long it takes to pull the model weights off disk.
AvocadoArray@reddit (OP)
Yes, that's exactly what I'm experiencing. I use llama-swap and absolutely dread making any config changes or accidentally sending a request to another model because the GPU pretty much takes a coffee break while it's waiting.
I searched for so long trying to find whether anyone else had the same issue, but could barely find any mention of it.
There are actually two points where the loading slows down.
The first is the `Directly load the compiled graphs...` part, which should only take 3-4 seconds, but takes 55-65 seconds for some reason.
The second is the `Capturing CUDA graphs` section, which should also only take about 3 seconds after the first run, but takes an additional 2-3 minutes.
I enabled debug logging, and it seems to hang when loading the "0-th" graph for some reason, which is strange because it's a relatively small file and is backed with SSD storage.
AvocadoArray@reddit (OP)
And for reference, this is what it SHOULD look like (running the same exact command from VLLM on the host outside the container).
jiria@reddit
I have two Max-Q, and this (a few seconds) is the time it takes on mine. So there is definitely something wrong with your setup. Feel free to PM me.
laterbreh@reddit
Stick with vllm unless you need system RAM. llamacpp is a total downgrade in sustained token speed under heavy context load.
AvocadoArray@reddit (OP)
Apples-to-apples, I agree.
However, it can depend on what quants are available for the model you're trying to run.
For Qwen3.5-122b-a10b, I can't run it at full FP8 on a single card, but unsloth's UD-Q4_K_XL quant fits in VRAM and runs plenty fast at 90+ tp/s.
VLLM's GGUF support is spotty at best, so I just always run those in llama.cpp.
Waiting for a proper NVFP4 quant to try out in VLLM.
big___bad___wolf@reddit
I'm running 122b-a10b FP8 at 155 tok/s on two Max-Qs.
AvocadoArray@reddit (OP)
Sounds solid 💪
Laabc123@reddit
FWIW, I’ve been driving an nvfp4 quant for 4 days now and it’s performing exceedingly well. >100 output tok/s with cuda graphs loaded.
AvocadoArray@reddit (OP)
Which nvfp4 quant are you using?
Laabc123@reddit
https://huggingface.co/Sehyo/Qwen3.5-122B-A10B-NVFP4
AvocadoArray@reddit (OP)
Huh, I saw that one but thought it needed this PR to work https://github.com/vllm-project/llm-compressor/pull/2383
Laabc123@reddit
I think Sehyo pulled this PR in before quantizing. MTP is definitely working.
AvocadoArray@reddit (OP)
Neat! Been looking for an excuse to try nvfp4 somewhere. Will give it a shot!
DanielWe@reddit
How :(
Which vllm version with what settings. I can't get that to run. Tried for hours :(
kouteiheika@reddit
Note that there's a proper quant for vLLM available here:
https://huggingface.co/cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit
AvocadoArray@reddit (OP)
Thanks, will take a look!
wektor420@reddit
Llama.cpp worked way better on vulkan backend than cuda on my 5060 ti (also blackwell)
AD7GD@reddit
What's in the environment? There are a lot of CUDA env vars that can change behavior
AvocadoArray@reddit (OP)
Nothing special, really.
Tried a bunch of different args and env vars based on similar GH issue reports, but no change.
It could come down to a driver or kernel incompatibility, but even that seems strange as it's almost an identical setup and config as our server at work, just different hardware.
AvocadoArray@reddit (OP)
Also, full sample llama-swap config below:
zipperlein@reddit
I had such long loading only when I tried to load the models from a HDD. Never made that mistake again.
Yorn2@reddit
I have two and if you told me I had to get rid of one of them I'd say "from my cold dead hands".
You definitely want 2 cards if you want to run multiple models or types of models all together like chatterboxtts, comfyui, big models quantized like GLM, Qwen3, or Minimax, etc. and/or Omni models. I guess to each his own, but once you get used to having them, you can't not have them and they'll be worth every cent.
AvocadoArray@reddit (OP)
Totally fair.
I played around with StableDiffusion years ago on my 1080ti, but I was basically just screwing around and having fun. I don't have a real need for it, but maybe it would be fun to see how far it's come.
Right now, I'm happy running the biggest model that I can for general purpose/coding tasks, and swapping out as needed.
I can't deny that I'd like to play around with the bigger models, but IMO the law of diminishing returns kicks in quite heavily. For the vast majority of my use-cases, "good" is "good enough".
Also, open-weight models are still continuing to smarter and smaller. Qwen3-Coder-Next already beat my expectations, and Qwen3.5-122b at UD-Q4_K_XL is absolutely blowing my expectations away and crushed several personal benchmarks that I never expected to get with this card.
So if I feel the hedonic treadmill kicking in and I want more, odds are that I can just wait another month or two for the next open model to make a splash.
That being said, I'll be making a case for a proper 2x or 4x GPU server at work. Maybe I'll play around with the bigger models there, but its primary goal will be scaling out to handle more concurrent requests.
Yorn2@reddit
Yeah the bug will probably get to you eventually but enjoy the card you have for now, it's still a big step up. BTW, you can't highlight the power supply requirements enough, IMHO. I'm glad you highlighted them in your post.
AvocadoArray@reddit (OP)
Perhaps! Time will tell.
I really only included that section to push back on folks saying I'd be disappointed in it because I wouldn't be able to run anything useful on just one card.
But from my experience, there seems to be a pretty big quality difference about every \~24GB:
After that, it seems like returns diminish quite rapidly.
128GB-192GB gets you closer and closer to running the largest open-weight models, but still heavily quantized.
Like you mentioned, the biggest benefit is being able to run multiple models simultaneously, which sounds awesome and all, but isn't anywhere near the difference in going from <=16GB to 96GB IMO.
Bit_Poet@reddit
It really gets interesting once you get into diffusion models as well. Imagine a workflow that takes a story, runs it through TTS, creates an SRT, then analyzes both and creates a script of one to 10 second scenes with prompts for images and video, and finally batches 70+ clips with image generation and first-frame+audio2video workflows including LLM prompt enhancement. (I want a second Pro 6000 now!) Or if you're training big LoRAs and want to run diffusion inference or agentic coding in parallel...
t4a8945@reddit
Awesome post. I'm in the poor gang; I bought a DGX Spark. (See my own write-up here: https://www.reddit.com/r/LocalLLM/comments/1rmlclw/ )
Interested in your performances with Qwen3.5-122b at UD-Q4_K_XL.
What do you get out of it in terms of tokens per second, prefill, and context size?
I'd be eager to compare $ to performance ratio with both our setups :D
big___bad___wolf@reddit
I tested the medium to large Qwen 3.5 models i could fit at FP8 or IQ4 on two RTX 6000 Max Qs.
laterbreh@reddit
If you want to fix the coil whine, stop caring about heat. From Nvidia, the card targets 90°C before the fan starts to ramp. I have several of these crammed into a machine, and even under heavy load it genuinely doesn't ramp up that often; when it does, it does a burst and then winds back.
Second, you can save your graph compilations to be reused. You just need to set up persistent cache folders/volumes that the Docker container can access. I'd pasta my config, but I'm on mobile at the moment.
AvocadoArray@reddit (OP)
I don't think the coil whine is from the fans, though. It's very noticeable even while idling.
Way ahead of you on the graph compilations. The graphs are saved and it does try to re-use them, but see my other comment that shows the bottleneck occurring while loading the cached torch graph, and forcing to re-capture the CUDA graphs (which should be cached in memory, not on disk).
This same delay is NOT present when running VLLM on the host outside a container.
BillyBoberts@reddit
The coil whine must be a card by card thing, I have the workstation edition in a box next to me and I only notice it every now and then when it’s processing.
big___bad___wolf@reddit
Both of mine whine when vLLM is grinding.
Especially when building cuda graphs 😂
AvocadoArray@reddit (OP)
I think I read that it's much worse on the Max-Q, but I can't remember exactly. Thanks for the insight!
big___bad___wolf@reddit
I bought one, then I bought another, I’m thinking of getting two more.
tomByrer@reddit
I tend to add extra cooling on my GPUs, like a case fan on top or side to push extra air.
AvocadoArray@reddit (OP)
I've been keeping an eye on it and it's pretty stable for the time being. It's also squashed between two other servers and I noticed it runs cooler in the rack vs. running on my workbench. Probably because they're acting as giant heatsinks lol.
I did order low-profile heatsinks for the CPUs so that the server fans can help with the GPU heat a bit more. Cranking up the server fans doesn't make much of a difference until they're at 40% or more.
tomByrer@reddit
I'm thinking of re-pasting my 3080 & 3090. ..
AvocadoArray@reddit (OP)
If they're sharing a case together, consider a water cooling loop!
tomByrer@reddit
Eh... I don't want to drop too much money on something that isn't making me money yet.
Ok-Measurement-1575@reddit
It's nice to be able to do VM shortcut stuff, but installing Linux natively would prolly solve most of your problems. Unless you've got a way of powering down that card when only the hypervisor and/or other VMs are running, it's ultimately an abstraction layer you don't need.
Qwen models in vllm do seem to take ages for the first request. I've never timed it but it feels like over a minute on my 3090s.
I'm surprised you're seeing 15 minutes end to end.
AvocadoArray@reddit (OP)
Bare metal does remove some variables from the equation, but I have no reason to think virtualization is the cause of the VLLM loading time issue.
15 minutes was probably a bit drastic. That was from a cold boot before I set up my bcache device, so the model loading time was taking a decent chunk of it.
Still, I think it's around 5 minutes total once model weights are loaded.
Yes, the extra \~60s TTFT on first request for the Qwen models sounds accurate. I don't see that same problem with Seed or others.
Johnwascn@reddit
Could you compare the capabilities of qwen3-coder-next and qwen3.5-122b on this device when you have time?
this-just_in@reddit
Mount a host folder to the container's vLLM cache path and you'll solve this one
AvocadoArray@reddit (OP)
I wish it were that easy. It's storing and loading the cache... it just takes forever for some reason. See my other comment
nofdak@reddit
I'm glad to see you write this up; I was writing up my own experience with vLLM and its extremely slow loading times.
The lowest time I've seen from vLLM loading a model to returning tokens is ~45s, and that's with small models. When using larger models like Qwen3.5-122B-A10B the time goes up even further. My llama.cpp built for my hardware can load Qwen3.5-9B in ~7s, but vLLM takes 45.
I've seen higher times when running in a container as well, so now I run directly on the host:
uvx --torch-backend auto --extra-index-url https://wheels.vllm.ai/nightly/cu130 vllm serve Qwen/Qwen3.5-35B-A3B-FP8 --host=:: --gpu-memory-utilization=0.90 --max_num_batched_tokens=16384 --enable-prefix-caching --max-num-seqs=4 --dtype=bfloat16 --reasoning-parser=qwen3 --tool-call-parser=qwen3_coder --enable-chunked-prefill --enable-auto-tool-choice --speculative-config '{"method":"mtp","num_speculative_tokens":2}' --mm-encoder-tp-mode data --mm-processor-cache-type shm
I'm running a non-power-limited RTX Pro 6000 Workstation, so it could pull 600W if needed.
I've tried various different vLLM flags but nothing seems to make a big difference. With ~1m minimum iteration times, it's pretty frustrating testing different quants or flags.
AvocadoArray@reddit (OP)
So glad to see others hitting the same issue and I'm not just going crazy.
I've been hesitant to open a GH issue because I thought it must just be something with my config (I mean, it probably still is, but I don't know *what* it is).
Maybe I'll take the combined notes from this thread and open an issue.
Do you mind posting the relevant startup logs that show the times for loading weights, cached graphs and CUDA graph capture times?
nofdak@reddit
I uploaded my startup logs here: https://pastebin.com/7Ra8Jwqf Note that I was loading the 2B model to limit the size that needs loading.
AvocadoArray@reddit (OP)
Hmm, I don't see the same issues in your log that I'm getting. Specifically, in this line:
I get times closer to 55-75s, which is extremely long for loading cached graphs.
Your CUDA graph capture time is longer than expected as well, so we might have that part in common.
segmond@reddit
I'm just waiting for the mac m5 max/ultra studio to be released and hoping I won't regret my waiting.
cicoles@reddit
Regarding the coil whine, I am wondering if I am deaf but I get nothing from the one I had.
AvocadoArray@reddit (OP)
It sounds like one of those "mosquito" noises that kids played in school to annoy the rest of us, but the teachers almost never heard it.
Maybe I didn't go to enough concerts growing up.
sautdepage@reddit
Even without coil whine, the fan runs at 30% speed all the time and is quite audible for anyone who has spent a bit of effort on having a silent PC.
So we pay thousands over the almost-identical 5090 FE but lose the auto-fan-off feature and the better liquid-metal TIM. Oh, and the HDMI port is gone: the card is targeted at AI video generation tasks, but no 4K TV output for you!
But.. oh the shiny VRAM.
cicoles@reddit
😀 Sorry. I deserved that! Just feeling lucky now that I dodged the whine coil.
FullOf_Bad_Ideas@reddit
Can you run real-time video generation with Helios on it? It claims to run in real time on a single H100, so you might not be that far off.
https://huggingface.co/BestWishYsh/Helios-Distilled
Why not the 600W workstation version? I am glad you didn't go with MI210.
AvocadoArray@reddit (OP)
I haven't dabbled in that area yet, but I'll have to give it a shot one of these days!
Stuck with Max-Q because it fits the 300w power budget in my server and the blower fan exhausts air out the back (don't need to crank the server fans to keep it cool).
The server already runs 24/7 anyway, so it's more efficient to piggy back off its RAM, CPU, and storage in a Linux VM rather than keeping another box running full time.
running101@reddit
So was it worth the cost or was reddit right ?
AvocadoArray@reddit (OP)
100% worth the cost in my opinion.
I'm fairly prone to buyers remorse, but I haven't experienced an ounce of that for this purchase.
Economically speaking, I've easily thrown $100 worth of tokens at it overall, and the market price for the card has already gone up about 10% from when I bought it. It's hard to feel bad about a decision like that.
Mythril_Zombie@reddit
You should update the post with how much you actually spent. And what specs the card has. You know, the useful information.
AvocadoArray@reddit (OP)
I mean, the specs are public knowledge and easily found elsewhere. I wouldn't be adding anything to the conversation by reposting them here.
Prices change by the day, but I got the GPU for a hair under $7k after edu discount.
I scored pretty big on the r730xd server from a local shop. 2x Xeon E5-2660 v4's and 128GB RAM (4x 32GB) for $650.
The shop was kind enough to upgrade the existing DIMMs to match the 4x 32GB DDR4-2400T DIMMs I had on hand for $10/DIMM. So 256GB total running at the full 2400MHz.
GPU power cables off Amazon were about $30 for a pair.
Ordered a pair of E5-2690 v4s and low profile heat-sinks off eBay for about $65 total (in the mail).
I supplemented it with a few other things I had on hand to round out the build. Probably would have been an extra $500 at market price
Whiz_Markie@reddit
Haven’t had time to read it all but was on the verge of going either 6000 or 2x 5090 FE and 1x 4090 and making a system with separate pcs for inference in my use case. I’m thanking you ahead of time for sharing verbose notes and experiences from this endeavor, as I fight the urge to switch over to the 96gb. Cheers
Solid-Roll6500@reddit
Are you using the cu130 nightly vllm openai image? I was having issues with some of the qwen models until going with that.
Also curious, for your ESXi host are you using GPU pass thru or vGPU to the VM? And did you have to setup grid licensing to get it working?
AvocadoArray@reddit (OP)
Tried the default (cu12) and cu130 builds for v0.16.0 and a few of the nightlies (currently running cu130-nightly-097eb544e9a22810c9b7a59e586b61627b308362). The long load times happen with all models, even Seed OSS at INT4, which works perfectly fine using the same exact image on our VM at work.
I'm just passing through the entire GPU to the VM. No need for vGPUs or GRID licensing.
Solid-Roll6500@reddit
Appreciate the response. Cheers.
Glittering_Way_303@reddit
Thank you for the interesting write up! I was considering buying the Max-Q version for concurrent inference for transcription and summarisation for a huge group of people. Intending to use parakeet for STT and qwen3.5 35B-A3B for summarisation and as a chat model. Do you have any thoughts on this use case? In an Asus ESC4000A-E12 server with 96GB DDR5 RAM
AvocadoArray@reddit (OP)
I haven't done a lot of STT work, so I can't offer much in that regard.
However, I remember using OpenAI's whisper model to transcribe a couple of YouTube videos when I first started dabbling in AI. I'm pretty sure it ran much faster than real-time, so I'd guess that this card would handle that just fine if everything is set up correctly.
Captain21_aj@reddit
Hey great write up. Thanks for giving a reference just in case I want to build similar thing with my R730XD in the future. On the other post you mentioned you have 2x L4 GPUs (48GB VRAM total) at work. May I ask what makes your office self host GPU than using API key or claude code/cursor/copilot subscription?
AvocadoArray@reddit (OP)
Thank you! Good to see another r730xd user here.
Technically we have 3x L4s. The first one runs embedding and rerank models, as well as GPT-OSS 20b full time for general purpose chat/research/RAG/automated workflows etc. The other two are for the bigger models that we swap in and out as needed.
They've somehow gone up in price in the year since we bought them. I'm pretty sure we could sell all three and have enough for an RTX 6000 with money to spare.
The biggest reason is security/privacy/compliance. Some of our workflows deal with sensitive data, and we're limited on where that data can be stored or processed. Coding performance was more of a "cool if it can do it, but not the primary purpose".
I did dabble with JetBrains AI using Claude, but I think the last one I properly tested was 3.7 sonnet and still found that I had to babysit it and tweak minor details in order to produce production-worthy code. And once I spent the time setting up the guardrails, prompts, instructions and tool servers, I was getting nearly identical results with Seed 36b for what I actually wanted to use it for.
I don't use it for major architectural decisions or overhauls, just mainly for automating things like unit tests, boilerplate code, and refactoring old codebases (e.g., adding type hinting everywhere). Local models can handle that really well, and in that case, "good" is "good enough".
On top of that, I wanted the ability to learn and tinker with the backend. I've learned quite a bit along the road and still think it was a smart choice, even if we weren't dealing with sensitive data.
a_beautiful_rhind@reddit
I have SAS/SATA drives, so a 10-minute model load is a given for the larger weights not on SSD. My slowest drive is like 120MB/s or something; the fastest is only 500MB/s (the SSDs).
May want to look into rebar, but that's a hell of a lot of ram to map. I don't know how much you have total but it might speed things up. 4x3090 can all do it so why not 1x96gb?
Once a model caches, load is almost instant. If you are taking 10 mins every load, something is fucky.
96gb of vram and hybrid for larger MoE is definitely "comfy".
AvocadoArray@reddit (OP)
Rebar enabled and 106GB RAM reserved/mapped to the VM.
Yes, model weight loading is very fast, it's something screwy with loading the torch compile graphs and forcing CUDA graphs to re-capture on every restart.
One other quick piece of advice if you're using spinning disks: I have around 1.2TB of models so far stored on a 3x 3TB RAID-5 array, but I set up bcache with a 500GB NVMe 970 EVO drive in front of it.
Running it in writethrough mode, so downloading a model to the bcache array stores a copy on the backing device as well as the cache, so it's hot and ready to load right after downloading.
Reading a "cold" model automatically loads it into the NVME cache as well, and evicts the Least Recently Used (LRU) blocks as needed.
While I tend to keep lots of copies of models, I'm usually only swapping between 3-4 on any given day which are easily stored in the NVME cache, and at least one full model can fit in the Linux filesystem cache (RAM).
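For anyone wanting to try this, the setup was roughly as follows. This is a sketch, not my exact commands; the device names are examples for my hardware (make-bcache comes from the bcache-tools package):

```shell
# WARNING: make-bcache formats the named devices; existing data is destroyed.
# /dev/md0 = the RAID-5 backing array, /dev/nvme0n1 = the NVMe cache drive.
make-bcache -B /dev/md0        # register the backing device (creates /dev/bcache0)
make-bcache -C /dev/nvme0n1    # register the cache device

# Find the cache set UUID, then attach it to the bcache device
bcache-super-show /dev/nvme0n1 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# Writethrough: writes land on both cache and backing device, so losing
# the cache drive never loses data
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```

You then format and mount /dev/bcache0 like any other block device.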
If you decide to go this route, make sure you disable sequential read bypass with
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff, otherwise it just bypasses the cache for large reads/writes. This needs to be written after every OS restart.
a_beautiful_rhind@reddit
I didn't even set up raid, but that makes sense. Instead I move heavily used models manually to SSD.. Just keep running out of space and having to get more drives. A couple of mistrals/70b/etc i can store in ram, but these larger ones need to split on NUMA properly and end up not after a few swaps.
The compiling adds up.. it takes a while on all my image models but then usually goes a ton faster. Chroma takes 2 runs of 130s before it settles out and I'm guessing your LLMs are doing similar.
AvocadoArray@reddit (OP)
I've only been using the native vector db built into Open Web UI, but have been meaning to set up Qdrant or Chroma soon.
What made you go with Chroma?
a_beautiful_rhind@reddit
Well it's the other chroma.. the image model. I did use chromaDB in the past for rag but since moved on to other vector solutions.
AvocadoArray@reddit (OP)
Ah, that makes sense. I hadn't heard of Chroma the image model.
Anyway, definitely recommend giving bcache a shot. It basically automates your manual SSD swapping for you. Other alternatives are dm-cache or lvmcache, but I haven't tried them.
a_beautiful_rhind@reddit
I need NVME prices to drop a little first. First server didn't have it. Second one I was like meh, the ram training and boot takes a while anyway.. Now OS and a caching drive sounds nice, as would double the ram but my wallet says no.
AvocadoArray@reddit (OP)
I hear you there. Luckily I had an old 500GB 970 Evo lying around from an old computer, so it was a perfect fit. Just slotted it into a PCI adapter and it worked like a charm.
Also consider a RAID-0 of 4x small SATA SSDs if you can find those for cheaper. Won't lose any data if the RAID dies as long as it's in writethrough mode (not writeback).
PrysmX@reddit
Also, I would power limit the card to something like 450-480W. You only get literally a few percent gain past that point for over 100W more power usage. Extra heat, fans, and electric bill. Absolutely not worth it for pretty much any use case. You can do this via nvidia-smi without even installing additional software, and set it to run the command on startup.
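For example (450W is just an illustrative target; check your card's supported range first):

```shell
# Show the card's min/max/default power limits
nvidia-smi -q -d POWER

# Enable persistence mode, then cap the power limit (in watts)
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 450
```

The limit resets on reboot, so reapply it automatically with a systemd oneshot unit or a cron @reboot entry running the nvidia-smi -pl command.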
MelodicRecognition7@reddit
On the Workstation edition, the gain for prompt processing is linear all the way up to 600W; however, the gain for token generation is a few percent after 310W.
thruston@reddit
Max Q is limited to 300W.
PrysmX@reddit
I'm aware of that. I was going off their 600W+ comment. I thought they had confused editions.
thruston@reddit
Oh my bad!
PrysmX@reddit
No worries! OP responded to me and said they were updating the post to clarify. :-)
AvocadoArray@reddit (OP)
This is the Max-Q edition. Already limited to 300w total and uses a blower fan to exhaust heat out the back :)
PrysmX@reddit
That's what confused me. You kept saying Max-Q, but then you said this thing pulls 600W+, which led me to believe you were actually using a Workstation edition. You didn't make it clear that it was the entire server pulling that much.
AvocadoArray@reddit (OP)
Thank you, I updated the post to clarify that point.
starkruzr@reddit
very curious what those vLLM load times are about.
AvocadoArray@reddit (OP)
I promise to post an update if I figure it out.
AD7GD@reddit
I've used vLLM on a variety of cards, and slow startup time is common.
You could make sure you're at least on the highest PCIe gen your motherboard can handle. Every gen doubles the bandwidth.
AvocadoArray@reddit (OP)
Yes, I've had a lot of experience with vLLM on our work server using Nvidia L4 cards. You're right that it's always slower to start up, but not *this* slow. The model weights load pretty fast overall, but there's some kind of delay when loading cached torch graphs (which are very small and shouldn't be I/O bound).
See my other comment for more logs and info.
bubba-g@reddit
+1 r730 is gen 3 and the rtx 6000 is gen 5
fragment_me@reddit
Good to know. I have a 730 and was worried something this big wouldn't fit or work.
AvocadoArray@reddit (OP)
You and me both. I figured it was time to just bite the bullet and see if it worked.
It was a bit tricky to get it installed, as the two "fins" on the bottom of the card's PCI bracket are pretty fat and barely fit into the slots on the server. I was about 10 seconds away from grabbing my Dremel to widen them out before I finally got it to slot in.
LKama07@reddit
Sorry for the newbie question but how does this type of setup compare to Mac hardware for similar use cases? For example the latest m5? It seems Mac has extremely low power consumptions, but maybe it's much slower?
AvocadoArray@reddit (OP)
I can only speak to what I've seen from other comments, but my understanding is that the Mac is not only slower overall, but the prompt processing speed is atrocious.
This isn't a big deal for light-weight chat, but if you're using it for things like agentic coding, deep research, or long content summarization, it takes a lot longer to process the full prompt before it ever even starts responding. The same issue occurs when running models in RAM (even quad channel DDR5).
Armym@reddit
This is the reason why I follow this sub. Thank you.
LegacyRemaster@reddit
I have a 96GB 600W RTX 6000 running with two 48GB AMD w7800 (one is connected via M2 + external power supply). I took my MSI x570-Pro, added the cards (which were also mounted quite roughly), turned on the PC, installed the AMD+Nvidia drivers, and started using them without any problems. No UPS, but a good insurance policy in case of power failure due to spikes. Easy
Ok_Hope_4007@reddit
When using Docker and vLLM, I think you can mount the cache folder for the CUDA graph into the container just like the model folder (I can't remember the exact path), but at least it won't rebuild it whenever you create a new container.
AvocadoArray@reddit (OP)
Correct, and I'm doing that exactly the same way as my setup at work with
-v vllm-cache:/root/.cache/vllm/. See my other comment for screenshots and logs.
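For reference, the relevant part of the docker run looks roughly like this. The image tag and port are examples from elsewhere in the thread, and the model argument is a placeholder, not my exact invocation:

```shell
docker run --rm --gpus all \
  -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface/ \
  -v vllm-cache:/root/.cache/vllm/ \
  vllm/vllm-openai:v0.16.0 \
  --model <model>
```

With the named volumes in place, both the downloaded weights and the vLLM compile cache survive container recreation.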
somethingdangerzone@reddit
Great write up, thanks for that. Happy coding!
PrysmX@reddit
vLLM startup times are worse because by default vLLM will fill up as much VRAM as possible with caching. Their point of view is that free VRAM is wasted VRAM which, depending on the use case, is a valid statement. There are startup parameters you can pass to limit how much VRAM is used by vLLM if you want quicker startup at the expense of available memory in vLLM. This can actually be important if you do use the same machine for multiple tasks and it isn't a standalone vLLM server.
AvocadoArray@reddit (OP)
I understand that, but that's not quite what's going on.
See my other comment that shows the bottleneck occurring while loading the cached torch graph, and forcing to re-capture the CUDA graph.
This same delay is NOT present when running VLLM on the host outside a container.
swagonflyyyy@reddit
I can attest to a lot of the things you mentioned in this post. Haven't tried vllm tho because I'm on windows, but I was in the process of running Claude Code locally with gpt-oss-120b via vLLM. Any tips?
AvocadoArray@reddit (OP)
IMO, trying to get max performance on Windows is a losing battle. llama.cpp works just fine on Windows, but if you want to run vLLM you need to use WSL2 or Docker Desktop (which also uses WSL2), and that creates a bunch of headaches like reserved system RAM and poor bind-mount volume performance.
If you want to run VLLM, I'd suggest running the card in a separate headless Linux server/desktop and set it up with llama-swap.
Despite the weird long loading times with VLLM in a docker container, the ability to run native FP8 models is awesome in terms of speed and accuracy.
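If you go the headless-Linux route, a minimal llama-swap config might look something like this. The model name, image tag, and flags are placeholders for illustration, not my exact setup:

```yaml
# llama-swap config.yaml sketch: llama-swap starts and stops the backend
# on demand and proxies OpenAI-compatible requests to it
models:
  "qwen3-coder":
    cmd: >
      docker run --rm --gpus all -p ${PORT}:8000
      -v vllm-cache:/root/.cache/vllm/
      vllm/vllm-openai:v0.16.0 --model <model>
    proxy: "http://127.0.0.1:${PORT}"
```

llama-swap substitutes ${PORT} itself, so each model entry gets its own port and only one model occupies VRAM at a time.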
NoahFect@reddit
15 minute startup time? Now try it with CUDA enabled.
AvocadoArray@reddit (OP)
CUDA is enabled, though, and runs perfectly once started and warmed up. There's just an issue with loading cached graphs in Docker for some reason.
Tested with the official vllm/vllm-openai docker image across multiple tags: v0.15.1, v0.16.0, v0.16.0-cu130, and cu130-nightly-097eb544e9a22810c9b7a59e586b61627b308362.
pandar1um@reddit
Fantastic post, thank you for sharing. Well, in any case my broke ass can't afford it (or even a used 3090), but nobody can stop me from reading about it :)
Writer_IT@reddit
As a person that your previous discussion actually convinced to buy this monster.
For the long startup time, have you stored the model into a linux-formatted image? This dropped my loading time from 20-30 minutes to 2-3 for 100+b models.
AvocadoArray@reddit (OP)
The model weights themselves load very fast, especially once they're cached in Linux RAM (>1GB/s). The long loading times are due to it taking forever to load the cached torch graphs and needing to re-capture the CUDA graphs after every restart for some reason.
Writer_IT@reddit
And the reason was that if they were stored on NTFS, the dockerized Linux vllm image would have to essentially decode them on the fly
jonahbenton@reddit
This is incredibly helpful but one thing I hoped you could clarify, regarding power draw- the machine in which you installed the card was pulling 600w with the card at full throttle, not the card itself (as measured via nvtop or nvidia-smi)- is that right?
AvocadoArray@reddit (OP)
Correct. The Max-Q only pulls 300w max. The entire machine idles around 200w, so when the card is running at max power it pulls 500w minimum + \~100w for the additional CPU, RAM, and fan overhead. I think I only saw 600w peak when offloading some MOE experts to the CPU with minimax m2.5.
kaliku@reddit
Fantastic write-up, thank you for taking the time!