Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings
Posted by AvocadoArray@reddit | LocalLLaMA | View on Reddit | 156 comments
Transparency: I used an LLM to help figure out a good title and write the TLDR, but the post content is otherwise 100% written by me with zero help from an LLM.
Background
I recently asked Reddit to talk me out of buying an RTX Pro 6000. Of course, it didn't work, and I finally broke down and bought a Max-Q. Task failed successfully, I guess?
Either way, I still had a ton of questions leading up to the purchase and went through a bit of trial and error getting things set up. I wanted to share some of my notes to hopefully help someone else out in the future.
This post has been 2+ weeks in the making. I didn't plan on it getting this long, so I ran it through an LLM to get a TLDR:
TLDR
- Double check UPS rating (including non-battery backed ports)
- No issues running in an "unsupported" PowerEdge r730xd
- Use Nvidia's "open" drivers instead of proprietary
- Idles around 10-12w once OS drivers are loaded, even when keeping a model loaded in VRAM
- Coil whine is worse than expected. Wouldn't want to work in the same room as this thing
- Max-Q fans are lazy and let the card get way too hot for my taste. Use a custom fan curve to keep it cool
- VLLM docker container needs a workaround for now (see end of post)
- Startup times in VLLM are much worse than previous gen cards, unless I'm doing something wrong.
- Qwen3-Coder-Next fits entirely in VRAM and fucking slaps (FP8, full 262k context, 120+ tp/s).
- Qwen3.5-122B-A10B-UD-Q4_K_XL is even better
- Don't feel the need for a second card
- Expensive, but worth it IMO
Be careful if connecting to a UPS, even on a non-battery backed port
I have a 900w UPS backing my other servers and networking hardware. It normally fluctuates between 300-400w depending on load. I thought I was fine plugging it into the UPS's surge protector port, but I didn't realize the 900w rating covered both battery and non-battery backed ports. This thing easily pulls 600w+ total under load, and I ended up tripping the UPS breaker while running multiple concurrent requests. Luckily, it doesn't seem to have caused any damage, but it sure freaked me out.
Cons
Let's start with an answer to my previous post (i.e., why you shouldn't buy an RTX 6000 Pro).
Long startup times (VLLM)
This card takes much longer to fully load a model and start responding to a request in VLLM. Of course, larger models = longer time to load the weights. But even after that, VLLM's CUDA graph capture phase alone takes several minutes, compared to just a few seconds on my Ada L4 cards.
Setting --compilation-config '{"cudagraph_mode": "PIECEWISE"}' in addition to my usual --max-cudagraph-capture-size 2 speeds up the graph capture, but at the cost of worse overall performance (~30 tp/s vs 120 tp/s). I'm hoping this gets better in the future with more Blackwell optimizations.
Even worse, once the model is loaded and "ready" to serve, the first request takes an additional ~3 minutes before it starts responding. Not sure if I'm the only one experiencing that, but it's not ideal if you plan to do a lot of live model swapping.
For reference, I found a similar issue noted here #27649. Might be dependent on model type/architecture but not 100% sure.
All together, it takes almost 15 minutes after a fresh boot to start getting responses with VLLM. llama.cpp is slightly faster. I prefer to use FP8 quants in VLLM for better accuracy and speed, but I'm planning to test Unsloth's UD-IQ3_XXS quant soon, as they claim it scores higher than Qwen's FP8 quant, and it would free up some VRAM to keep other models loaded without swapping.
Note that this is VLLM only. llama.cpp does not have the same issue.
Update: Right before I posted this, I realized this ONLY happens when running VLLM in a docker container for some reason. Running it on the host OS uses the cached graphs as expected. Not sure why.
Coil whine
The high-pitched coil whine on this card is very audible and quite annoying. I think I remember seeing that mentioned somewhere, but I had no idea it was this bad. Luckily, the server is 20 feet away in a different room, but it's crazy that I can still make it out from here. I'd lose my mind if I had to work next to it all day.
Pros
Works in older servers
It's perfectly happy running in an "unsupported" PowerEdge r730xd using a J30DG power cable. The xd edition doesn't even "officially" support a GPU, but the riser + cable are rated for 300w, so there's no technical limitation to running the card.
I wasn't 100% sure whether it was going to work in this server, but I got a great deal on the server from a local supplier and didn't see any reason why it would pose a risk to the card. Space was a little tight, but it's been running for over a week and seems to be rock solid.
Currently running a Debian 13 VM on ESXi 8.0 with CUDA 13.1 drivers.
Some notes if you decide to go this route:
- Use a high-quality J30DG power cable (8 Pin Male to Dual 6+2 Male). Do not cheap out here.
- A safer option would probably be pulling one 8-pin cable from each riser card to distribute the load better. I ordered a second cable and will make this change once it comes in.
- Double-triple-quadruple check the PCI and power connections are tight, firm, and cables tucked away neatly. A bad job here could result in melting the power connector.
- Run dual 1100w PSUs in non-redundant mode (i.e., able to draw power from each simultaneously).
Power consumption
Idles at 10-12w, and doesn't seem to go up at all by keeping a model loaded in VRAM.
The entire r730xd server "idles" around 193w, even while running six other VMs and a couple dozen docker containers, which is about 50-80w less than my old r720xd setup. Huge win here. It only shoots up to 600w under heavy load.
Funny enough, turning off the GPU VM actually increases power consumption by 25-30w. I guess it needs the OS drivers loaded to put it into its sleep state.
Models
So far, I've mostly been using two models:
Seed OSS 36b
AutoRound INT4 w/ 200k F16 context fits in ~76GB VRAM and gets 50-60 tp/s depending on context size. About twice the speed and context that I was previously getting on 2x 24GB L4 cards.
This was the first agentic coding model that was viable for me in Roo Code, but only after fixing VLLM's tool call parser. I have an open PR with my fixes, but it's been stale for a few weeks. For now, I'm just bind mounting it to `/usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/seed_oss_tool_parser.py`.
Does a great job following instructions over long multi-turn tasks and generates code that's nearly indistinguishable from what I would have written.
It still has a few quirks and occasionally fails the apply_diff tool call, and sometimes gets line indentation wrong. I previously thought it was a quantization issue, but the same issues are still showing up in AWQ-INT8 as well. Could actually be a deeper tool-parsing error, but not 100% sure. I plan to try making my own FP8 quant and see if that performs any better.
MagicQuant mxfp4_moe-EHQKOUD-IQ4NL performs great as well, but tool parsing in llama.cpp is more broken than VLLM and does not work with Roo Code.
Qwen3-Coder-Next (Q3CN from here on out)
FP8 w/ full 262k F16 context barely fits in VRAM and gets 120+ tp/s (!).
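For anyone wondering how these "fits in VRAM" numbers pencil out, here's a rough KV-cache sizing sketch. The formula is the standard one for dense attention; the layer/head numbers in the example are made up for illustration and are NOT Q3CN's actual config (models with hybrid/linear attention layers, like the Qwen3-Next family, need noticeably less than this estimate):

```python
# Rough KV-cache sizing for a dense-attention transformer:
# 2 (K and V) x layers x kv_heads x head_dim x context_len x bytes-per-element.
# The example numbers below are purely illustrative, not any real model's config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, dtype_bytes: int) -> float:
    """Estimate KV cache size in GiB at full context."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes
    return total_bytes / 2**30

# Hypothetical 48-layer model, 4 KV heads of dim 128, F16 cache, 262k context:
print(kv_cache_gib(48, 4, 128, 262144, 2))  # 24.0
```

Halving the context or quantizing the cache to 8-bit each cut the figure in half, which is why context length and cache dtype dominate whether a model "barely fits".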
Man, this model was a pleasant surprise. It punches way above its weight and actually holds it together at max context unlike Qwen3 30b a3b.
Compared to Seed, Q3CN is:
- Twice as fast at FP8 than Seed at INT4
- Stronger debugging capability (when forced to do so)
- More consistent with tool calls
- Highly sycophantic. HATES to explain itself. I know it's non-thinking, but many of the responses in Roo are just tool calls with no explanation whatsoever. When asked to explain why it did something, it often just says "you're right, I'll take it out/do it differently".
- More prone to "stupid" mistakes than Seed, like writing a function in one file and then importing/calling it by the wrong name in subsequent files, but it's able to fix its mistakes 95% of the time without help. Might improve by lowering the temp a bit.
- Extremely lazy without guardrails. Strongly favors sloppy code as long as it works. Gladly disables linting rules instead of cleaning up its code, or "fixing" unit tests to pass instead of fixing the bug.
Side note: I couldn't get Unsloth's FP8-dynamic quant to work in VLLM, no matter which version I tried or what options I used. It loaded just fine, but always responded with infinitely repeating exclamation points "!!!!!!!!!!...". I finally gave up and used the official Qwen/Qwen3-Coder-Next-FP8 quant, which is working great.
I remember Devstral 2 small scoring quite well when I first tested it, but it was too slow on L4 cards. It's broken in Roo right now after they removed the legacy API/XML tool calling features, but will give it a proper shot once that's fixed.
Also tried a few different quants/reaps of GLM and Minimax series, but most felt too lobotomized or ran too slow for my taste after offloading experts to RAM.
UPDATE: I'm currently testing Qwen3.5-122B-A10B-UD-Q4_K_XL as I'm posting this, and it seems to be a huge improvement over Q3CN.
It's definitely "enough".
Lots of folks said I'd want 2 cards to do any kind of serious work, but that's absolutely not the case. Sure, it can't get the most out of Minimax m2(.5) or GLM models, but that was never the goal. Seed was already enough for most of my use-case, and Qwen3-Coder-Next was a welcome surprise. New models are also likely to continue getting faster, smarter, and smaller.
Coming from someone who only recently upgraded from a GTX 1080ti, I can easily see myself being happy with this for the next 5+ years.
Also, if Unsloth's UD-IQ3_XXS quant holds up, then I might have even considered just going with the RTX Pro 5000 48GB for ~$4k, or even dual RTX Pro 4000 24GB cards for <$3k.
Neutral / Other Notes
Cost comparison
There's no sugar-coating it, this thing is stupidly expensive and out of most peoples' budget. However, I feel it's a pretty solid value for my use-case.
Just for the hell of it, I looked up openrouter/chutes pricing and plugged it into Roo Code while putting Qwen3-Coder-Next through its paces:
- Input: $0.12 / 1M tokens
- Output: $0.75 / 1M tokens
- Cache reads: $0.06 / 1M tokens
- Cache writes: $0 (probably should have set this to the output price, not sure if it affected it)
I ran two simultaneous worktrees asking it to write a frontend for a half-finished personal project (one in react, one in HTMX).
After a few hours, both tasks combined came out to $13.31. This is actually the point where I tripped my UPS breaker and had to stop so I could reorganize power and make sure everything else came up safely.
In this scenario it would take approximately 566 heavy coding sessions, or 2,265 hours of full use, for the card to pay for itself (electricity cost included). Of course, there are lots of caveats here, the most obvious one being that subscription models are more cost-effective for heavy use. But for me, it's all about the freedom to run the models I want, as much as I want, without ever having to worry about usage limits, overage costs, price hikes, routing to worse/inconsistent models, or silent model "updates" that break my workflow.
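If you want to redo the math for your own hardware, here's a minimal sketch. The $13.31 per session comes from the experiment above; the ~$7,533 total cost and ~4-hour session length are back-solved from my own 566-session figure rather than quoted prices, so treat them as illustrative:

```python
# Back-of-envelope break-even math. The $13.31/session comes from the
# OpenRouter pricing test above; the ~$7,533 total (card + electricity) and
# ~4-hour session length are back-solved from the 566-session figure, so
# they're illustrative assumptions, not quoted prices.
import math

def break_even(total_cost_usd: float, session_cost_usd: float,
               session_hours: int) -> tuple[int, int]:
    """Sessions and hours of equivalent API usage needed to offset the cost."""
    sessions = math.ceil(total_cost_usd / session_cost_usd)
    return sessions, sessions * session_hours

print(break_even(7533, 13.31, 4))  # (566, 2264)
```

Swap in your actual card price and measured per-session API cost to get your own break-even point.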
Tuning
At first, the card was only hitting 93% utilization during inference until I realized the host and VM were in BIOS mode. It hits 100% utilization now and slightly faster speeds after converting to (U)EFI boot mode and configuring the recommended MMIO settings on the VM.
The card's default fan curve is pretty lazy and waits until temps are close to thermal throttling before fans hit 100% (approaching 90c). I solved this by customizing this gpu_fan_daemon script with a custom fan curve that hits 100% at 70c. Now it stays under 80c during real-world prolonged usage.
The Dell server ramps its fans up to ~80% once the card is installed, but it's not a huge issue since I've already been using a Python script to set custom fan speeds on my r720xd based on CPU temps. I adapted it to include a custom curve for the exhaust temp as well, so it can assist with clearing the heat under sustained load.
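For reference, the curve math itself is trivial. This is just a sketch of the temp-to-duty-cycle mapping I described (100% at 70c); the 40c floor and 30% minimum are values I picked, and the actual fan control is handled by the gpu_fan_daemon script, not this snippet:

```python
# Sketch of a custom GPU fan curve: ramp linearly from a 30% floor at 40c
# up to 100% at 70c, instead of the stock behavior of waiting until ~90c.
# The 40c/30% low end is an arbitrary choice; only the 70c/100% point
# matches the curve described in the post.

def fan_percent(temp_c: float, low_c: float = 40.0, high_c: float = 70.0,
                floor_pct: float = 30.0) -> float:
    """Map a GPU temperature to a fan duty cycle (percent)."""
    if temp_c <= low_c:
        return floor_pct
    if temp_c >= high_c:
        return 100.0
    # Linear interpolation between the floor and 100%
    frac = (temp_c - low_c) / (high_c - low_c)
    return floor_pct + frac * (100.0 - floor_pct)

# Temperature readings would come from something like:
#   nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits
print(fan_percent(55))  # 65.0
```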
Use the "open" drivers (not proprietary)
I wasted a couple hours with the proprietary drivers and couldn't figure out why nvidia-smi refused to see the card. Turns out that only the "open" version is supported on current generation cards, whereas proprietary is only recommended for older generations.
VLLM Docker Bug
Even after fixing the driver issue above, the VLLM v0.15 docker image still failed to see any CUDA devices (empty nvidia-smi output), which was caused by this bug #32373.
It should be fixed in v0.17 or the most recent nightly build, but as a workaround you can bind-mount /dev/null over the broken config(s) like this: `-v /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf -v /dev/null:/etc/ld.so.conf.d/cuda-compat.conf`
Wrapping up
Anyway, I've been slowly writing this post over the last couple weeks. I cut a lot out, but it genuinely would have saved me a lot of time if I'd had this info beforehand. Hopefully it helps someone else out in the future!
Fabix84@reddit
I’m one of the people who replied to you in your previous post. I’m glad you eventually decided to go with the RTX PRO 6000 Max-Q. I’ll soon be ordering my fourth, and hopefully last, card.
For your use case, I would actually recommend against using vLLM. It’s excellent software, but it’s mainly designed for professional environments where you need to serve dozens or hundreds of requests in parallel. The typical scenario is a workstation running 24/7 as an LLM server for an entire company.
For single-user access, the best combination I’ve personally tested is llama.cpp + OpenCode.
With the high-end hardware I built my workstation with, the noise only bothers me during training (never during inference). I currently run 3 RTX PRO 6000 Max-Q cards. During normal use, even when running LLM inference, the noise level is comparable to my gaming laptop. Video generation inference is a bit more noticeable in terms of noise.
I run a dual-boot Linux/Windows setup. I mostly use Linux for training. I’m using the official NVIDIA Studio drivers, and if you enable the channel that includes the latest improvements, SM120 is fully supported.
I’m glad that "for now" you feel like you don’t need another card. However, I still believe everyone eventually ends up wanting more. Maybe not a few days after buying one, but with how fast AI is evolving and will continue to evolve, there’s really no true point of satisfaction. There’s only the limit of what we can afford, unfortunately.
If you manage to stay satisfied with your setup for the next 12 months, then honestly, good for you.
Many people think that having multiple GPUs is only useful for running larger non-quantized models or very lightly quantized ones. That’s partially true. But the real power of a multi-GPU setup is being able to keep multiple models loaded at the same time for different tasks and run them together.
For example:
an LLM generating responses, while simultaneously passing them to a TTS model that speaks them out loud. At the same time you might be generating images and videos, while an agent powered by a coding-focused LLM is implementing other tasks in parallel.
Each of these things individually could run on a single GPU, but having all of them running simultaneously is a completely different experience. In the AI space it almost makes you feel omnipotent.
That said, I absolutely don’t want to downplay the sacrifices required to afford even one of these cards. Owning one is already a huge milestone. I’m just saying that over time, sooner or later depending on your ambitions and experiences, it’s normal to want more hardware.
There’s nothing to be ashamed of in admitting that. And there’s nothing to be ashamed of if someone can’t afford even one of these cards.
I bought mine one at a time, always telling myself “okay, this will be the last one.”
The fourth will probably really be the last, but only because I’ve reached the limit of the electrical power I can dedicate to them, not the limit of my hunger for VRAM.
kl__@reddit
If you don’t mind sharing, what’s your use case? And what are you training the models on?
Laabc123@reddit
Naive question. What’re the advantages of using llama.cpp over vLLM for single user usage?
Fabix84@reddit
llama.cpp allows me to load even very large models in just a few seconds. That makes it easy to quickly switch between different models depending on what I need at the moment.
The GGUF ecosystem is extremely active, and it lets you find pretty much any model already quantized in many different ways, sometimes even experimental or “heretic” versions.
In the past, the performance differences between the two engines were more significant. Today things are closer.
Personally, I would only use vLLM for a production server that runs 24/7 and needs to serve many users simultaneously. Otherwise, for single-user usage, I strongly prefer the simplicity and flexibility of llama.cpp and the GGUF ecosystem.
Bit_Poet@reddit
I've had no success getting SOTA mixed media models to work with bare metal llama.cpp. As I understand it, they've got issues with the licenses for essential stuff for that and any pull requests for it get shot down at some point. VLLM is one step ahead because of that, and it's pretty much the only platform that fully supports A+VL models without jumping through a lot of hoops. That said, I experience the same spin up time issues with VLLM in docker+WSL2 with my Pro 6000, no matter if the models are stored inside the container or on a mapped drive.
Ok-Ad-8976@reddit
I've spent three or four nights dealing with vLLM and it's been such a pain, so I gave up on it altogether. Because on top of everything else, single-user performance was abysmal on AMD RDNA 3.5 and 4 with the new qwen3.5 models.
Especially it would take forever to load vision kernels or whatever the terminology is.
AvocadoArray@reddit (OP)
I’m not necessarily saying I won’t want another one, but I don’t think I could justify it unless prices dropped significantly (like over 50%).
I still have a 1080 in my old server running the embedding/rerank models and CodeProject AI for blue iris, a 1080ti on the shelf, and a 5070ti in my gaming rig. So I do have a bit of wiggle room for additional models if needed.
Also, the quad channel DDR4 2400 RAM held up quite well when offloading 20-ish experts from Q3.5 122B. I think I saw around half the speed (40 tp/s), but it shaved off around 20GB of VRAM. Prompt processing took a bigger hit, but still usable if I ever felt the need to keep other models loaded.
I think my CPUs are a bit of a bottleneck with their low single thread performance, but I have a new set in the mail and would be interested to try it again once they come in.
buttonstraddle@reddit
do you regret going with the Max-Q model?
AvocadoArray@reddit (OP)
Not at all. My server wouldn’t be able to power the 600w server/workstation cards, and even if they did, I’d have to run the server fans much higher to keep them cool.
I’d only consider the workstation card if I planned to run it in my desktop.
MR_Weiner@reddit
Could you let me know what the footprint of your kv cache is with qwen3.5 122b a10b q4? I’m assuming you’re running full 264k context? Bf16 or quantized? Sorry if your post mentioned this and I just missed it. But I’m thinking about picking up one of these and running concurrent requests on it with this model, so trying to understand what kind of cache setup I should expect to be able to use with that model on the 6000.
Icy_Bid6597@reddit
u/AvocadoArray I had the same issues with long delay on first request to VLLM on RTX6000 in docker.
What I found so far:
- mounting a directory for the Triton cache cut it down by ~50%
- adding a dir for the CUDA cache cut it further by 60%
I went down from 2 minutes for the first request to ~11 seconds. Still not perfect, but better.
`-v ~/nv_cache/:/root/nv_cache -v ~/triton_cache:/root/triton_cache --env TRITON_CACHE_DIR=/root/triton_cache --env CUDA_CACHE_PATH=/root/nv_cache/ComputeCache --env CUDA_CACHE_MAXSIZE=10737418240`
I am not sure about the last argument (CUDA_CACHE_MAXSIZE). It theoretically keeps the CUDA cache size under control, but I don't think it is necessary.
Armym@reddit
Godlike
AvocadoArray@reddit (OP)
This is not correct. There are no `nv_cache` or `triton_cache` directories. VLLM caches torch graphs under `/root/.cache/vllm/torch_compile_cache`, which is already mounted in my container (and being used, just very slowly for some reason). I do not believe CUDA graphs are written to disk; they're only cached in memory.
Respectfully, was this LLM generated?
AvocadoArray@reddit (OP)
Holy crap, this fixed it. How did you figure this out /u/Icy_Bid6597? I don't see these directories or env variable mentioned in VLLM's docs at all.
I created a new cuda-cache volume and updated my llama-swap config:
Could also probably be solved by just mounting the vllm-cache directory to /root/ instead of /root/.cache/vllm.
I haven't compared the full logs, but this shaved off around four minutes from startup time. Torch graphs load in 7s instead of 70+, and CUDA graph capture only takes 4s instead of 145+.
Will update my post after more testing.
Glittering_Carrot_88@reddit
Does it run crysis?
AvocadoArray@reddit (OP)
Not trying to void the warranty just yet.
eliko613@reddit
Really thorough writeup! Your cost comparison methodology with OpenRouter pricing is clever - I've seen a lot of people struggle to get accurate ROI calculations for local LLM infrastructure.
One thing that might be interesting for your setup: since you're already tracking utilization and performance across different models/quants, you might want to look into more structured observability tooling. I've been using ZenLLM.io to track costs and performance across both local and API endpoints, and it's been helpful for getting better visibility into which model configurations actually perform best for different use cases.
The startup time issues you're seeing with VLLM are fascinating - 15 minutes is brutal for model swapping workflows. Have you tried any of the newer VLLM optimizations for Blackwell, or are you stuck waiting for better upstream support? The container vs host performance difference is particularly weird.
Orlandocollins@reddit
I couldn't help myself and bought a second one. Brought models like minimax m2 into play and have no regrets
AvocadoArray@reddit (OP)
Would you be willing to test out Qwen 3.5 122b and let me know how it compares? I haven't used minimax m2/m2.5 in any meaningful capacity, but 122b feels like it works as well as I've seen others describe the Minimax and GLM models.
Orlandocollins@reddit
Yeah I can try. I've never had much luck with qwen models and tool calling using llama.cpp, but I can see if it has improved.
TokenRingAI@reddit
Prediction: 4 months from now you'll be buying a 2nd card
AvocadoArray@reddit (OP)
Counter-prediction: 4 months from now, I'll be able to run better models on a single card than two cards can run today.
Arguably happened after spending some time with Qwen3.5-122B-A10B, but not sure we'll see many more open weight models from them going forward.
TokenRingAI@reddit
Prediction: Deepseek V4 is coming, and you will get extreme FOMO
AvocadoArray@reddit (OP)
You think a 2nd card is going to make a dent in running a ~1T parameter model?
TokenRingAI@reddit
🤫🤐
AvocadoArray@reddit (OP)
😂
In all reality, models that large tend to still be quite impressive around the 1-2 bpw range. I could possibly play around with it while offloading a bunch of weights to RAM/NVMe, but wouldn't expect any real-time usable speeds.
TokenRingAI@reddit
https://github.com/deepseek-ai/Engram
radomird@reddit
Why? Are they stopping open sourcing?
Orlandocollins@reddit
Is this what I sounded like before I folded and bought another?
big___bad___wolf@reddit
😅 I bought the second one two weeks after daily driving one.
parrot42@reddit
And I have a BIOS bug with an ASUS mainboard, where I have to disable REBAR, otherwise POST does not complete with a 2nd card. Which is very annoying, because every time testing a new bios version, I have to take out a card temporarily.
Jarlsvanoid@reddit
I have the workstation model, although I limit it to 450w. I am using qwen3.5 122b as the main model for everything. With the maximum context of 256k, vLLM gives me a concurrency of more than 3x. I am using an NVFP4 version.
Same as you, the model takes a long time to load, but once everything is in memory it is very fast, both in prompt processing and response. I don't need ChatGPT anymore.
If I regret anything, it is perhaps not having bought the Max-Q model so I could fit another card.
AvocadoArray@reddit (OP)
122b is definitely a game-changer. What quant are you running? I tried running AWQ 4-bit in VLLM but it never returned any thinking tokens, even with `--reasoning-parser qwen3`.
Might have been a version issue. I'm going to try it with v0.17.0 now.
Jarlsvanoid@reddit
I'm using Sehyo/Qwen3.5-122B-A10B-NVFP4 with reasoning disabled:
--reasoning-parser qwen3
--default-chat-template-kwargs '{"enable_thinking": false}'
iamvikingcore@reddit
Meanwhile a used Macbook Pro with 64 or 128 gigs of RAM can run all of those same models just not as fast for about 1:15 of the cost
AvocadoArray@reddit (OP)
Technically, my PC can run the same models with 64GB RAM and 120GB SSD, just "not as fast".
Speed does matter for real-time usage, and everything I've seen suggests that prompt processing speed on Mac's unified memory is painfully slow, like 3-5 minutes or more for larger prompts.
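For a rough sense of scale, here's a quick back-of-envelope; the tok/s figures below are illustrative assumptions, not measured benchmarks:

```python
# Why prompt-processing speed matters for real-time use. The rates here are
# illustrative assumptions for the sake of the math, not benchmark results.

def pp_minutes(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Minutes spent on prompt processing before the first output token."""
    return prompt_tokens / pp_tok_per_s / 60

# A 100k-token prompt at a hypothetical ~500 tok/s (slow unified memory)
# vs a hypothetical ~5000 tok/s (fast discrete GPU):
print(round(pp_minutes(100_000, 500), 1))   # 3.3
print(round(pp_minutes(100_000, 5000), 1))  # 0.3
```

In agentic coding, where every tool call re-sends a large context, that difference compounds fast.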
jacek2023@reddit
"but tool parsing in llama.cpp is more broken than VLLM and does not work with Roo Code."
autoparser branch has been merged into llama.cpp after your post ;)
AvocadoArray@reddit (OP)
Nice, I'll have to take a look!
This is why I love this sub. It'd be almost impossible to keep up with everything without it.
mmazing@reddit
My UPS beeps under load but hasn’t caused any trouble so far…
AvocadoArray@reddit (OP)
Mine does too... except I turned it off years ago because it freaked the dog out every time the power went out 🤦
radomird@reddit
Great write up. I've tried qwen3.5-122B (UD-Q3_K_XL) and 35B (UD-Q8_K_XL) on dual RTX 5000 Ada (2x32gb); using llama.cpp they load in a few seconds. Performance-wise 35B works better for me, but 122B gives somewhat better results at the cost of speed.
Spezisasackofshit@reddit
Coil whine is an interesting issue that I hadn't considered, makes sense given the intended application would have these cards off site mostly, but would be a total deal breaker for my home setup. Thank you for sharing your experiences, there's a lot of info here that is great for someone like myself who is considering setting up a dedicated AI rig.
jeekp@reddit
this got my attention as well. My dream setup is a 2x Max Q but it would have to go in my office.
AvocadoArray@reddit (OP)
I honestly cannot overstate how noisy it is. Some people talked about the fan noise, but the coil whine is 1000% more annoying than the fans, even when they're running at 100%.
From what others have said, the workstation card might not have the same issue, so take that into consideration as well.
suicidaleggroll@reddit
I have the same issue with vLLM, startup takes an eternity. llama.cpp should be much faster though, on the order of 10-15 seconds plus however long it takes to pull the model weights off disk.
AvocadoArray@reddit (OP)
Yes, that's exactly what I'm experiencing. I use llama-swap and absolutely dread making any config changes or accidentally sending a request to another model because the GPU pretty much takes a coffee break while it's waiting.
I searched for so long trying to find whether anyone else had the same issue, but could barely find any mention of it.
There are actually two points where the loading slows down.
The first is the `Directly load the compiled graphs...` part, which should only take 3-4 seconds, but takes 55-65 seconds for some reason.
The second is the `Capturing CUDA graphs` section, which should also only take about 3 seconds after the first run, but takes an additional 2-3 minutes.
I enabled debug logging, and it seems to hang when loading the "0-th" graph for some reason, which is strange because it's a relatively small file and is backed with SSD storage.
AvocadoArray@reddit (OP)
And for reference, this is what it SHOULD look like (running the same exact command from VLLM on the host outside the container).
jiria@reddit
I have two Max-Q, and this (a few seconds) is the time it takes on mine. So there is definitely something wrong with your setup. Feel free to PM me.
laterbreh@reddit
Stick with vllm unless you need system RAM. llamacpp is a total downgrade in sustained token speed under heavy context load.
AvocadoArray@reddit (OP)
Apples-to-apples, I agree.
However, it can depend on what quants are available for the model you're trying to run.
For Qwen3.5-122b-a10b, I can't run it at full FP8 on a single card, but unsloth's UD-Q4_K_XL quant fits in VRAM and runs plenty fast at 90+ tp/s.
VLLM's GGUF support is spotty at best, so I just always run those in llama.cpp.
Waiting for a proper NVFP4 quant to try out in VLLM.
big___bad___wolf@reddit
I'm running 122b-a10b FP8 at 155 tok/s on two Max-Qs.
AvocadoArray@reddit (OP)
Sounds solid 💪
Laabc123@reddit
FWIW, I’ve been driving an nvfp4 quant for 4 days now and it’s performing exceedingly well. >100 output tok/s with cuda graphs loaded.
AvocadoArray@reddit (OP)
Which nvfp4 quant are you using?
Laabc123@reddit
https://huggingface.co/Sehyo/Qwen3.5-122B-A10B-NVFP4
AvocadoArray@reddit (OP)
Huh, I saw that one but thought it needed this PR to work https://github.com/vllm-project/llm-compressor/pull/2383
Laabc123@reddit
I think Sehyo pulled this PR in before quantizing. MTP is definitely working.
AvocadoArray@reddit (OP)
Neat! Been looking for an excuse to try nvfp4 somewhere. Will give it a shot!
DanielWe@reddit
How :(
Which vllm version with what settings. I can't get that to run. Tried for hours :(
kouteiheika@reddit
Note that there's a proper quant for vLLM available here:
https://huggingface.co/cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit
AvocadoArray@reddit (OP)
Thanks, will take a look!
wektor420@reddit
Llama.cpp worked way better on vulkan backend than cuda on my 5060 ti (also blackwell)
AD7GD@reddit
What's in the environment? There are a lot of CUDA env vars that can change behavior
AvocadoArray@reddit (OP)
Nothing special, really.
Tried a bunch of different args and env vars based on similar GH issue reports, but no change.
It could come down to a driver or kernel incompatibility, but even that seems strange as it's almost an identical setup and config as our server at work, just different hardware.
AvocadoArray@reddit (OP)
Also, full sample llama-swap config below:
zipperlein@reddit
I had such long loading only when I tried to load the models from a HDD. Never made that mistake again.
Yorn2@reddit
I have two and if you told me I had to get rid of one of them I'd say "from my cold dead hands".
You definitely want 2 cards if you want to run multiple models or types of models all together like chatterboxtts, comfyui, big models quantized like GLM, Qwen3, or Minimax, etc. and/or Omni models. I guess to each his own, but once you get used to having them, you can't not have them and they'll be worth every cent.
AvocadoArray@reddit (OP)
Totally fair.
I played around with StableDiffusion years ago on my 1080ti, but I was basically just screwing around and having fun. I don't have a real need for it, but maybe it would be fun to see how far it's come.
Right now, I'm happy running the biggest model that I can for general purpose/coding tasks, and swapping out as needed.
I can't deny that I'd like to play around with the bigger models, but IMO the law of diminishing returns kicks in quite heavily. For the vast majority of my use-cases, "good" is "good enough".
Also, open-weight models are still continuing to smarter and smaller. Qwen3-Coder-Next already beat my expectations, and Qwen3.5-122b at UD-Q4_K_XL is absolutely blowing my expectations away and crushed several personal benchmarks that I never expected to get with this card.
So if I feel the hedonic treadmill kicking in and I want more, odds are that I can just wait another month or two for the next open model to make a splash.
That being said, I'll be making a case for a proper 2x or 4x GPU server at work. Maybe I'll play around with the bigger models there, but its primary goal will be scaling out to handle more concurrent requests.
Yorn2@reddit
Yeah the bug will probably get to you eventually but enjoy the card you have for now, it's still a big step up. BTW, you can't highlight the power supply requirements enough, IMHO. I'm glad you highlighted them in your post.
AvocadoArray@reddit (OP)
Perhaps! Time will tell.
I really only included that section to push back on folks saying I'd be disappointed in it because I wouldn't be able to run anything useful on just one card.
But from my experience, there seems to be a pretty big quality difference about every \~24GB:
After that, it seems like returns diminish quite rapidly.
128GB-192GB gets you closer and closer to running the largest open-weight models, but still heavily quantized.
Like you mentioned, the biggest benefit is being able to run multiple models simultaneously, which sounds awesome and all, but isn't anywhere near the difference in going from <=16GB to 96GB IMO.
Bit_Poet@reddit
It really gets interesting once you get into diffusion models as well. Imagine a workflow that takes a story, runs it through TTS, creates an SRT, then analyzes both and creates a script of one to 10 second scenes with prompts for images and video, and finally batches 70+ clips with image generation and first-frame+audio2video workflows including LLM prompt enhancement. (I want a second Pro 6000 now!) Or if you're training big LoRAs and want to run diffusion inference or agentic coding in parallel...
t4a8945@reddit
Awesome post. I'm in the poor gang; I bought a DGX Spark. (See my own write-up here: https://www.reddit.com/r/LocalLLM/comments/1rmlclw/ )
Interested in your performances with Qwen3.5-122b at UD-Q4_K_XL.
What do you get out of it in terms of tokens per second, prefill, and context size?
I'd be eager to compare $ to performance ratio with both our setups :D
big___bad___wolf@reddit
I tested the medium to large Qwen 3.5 models i could fit at FP8 or IQ4 on two RTX 6000 Max Qs.
laterbreh@reddit
If you want to fix the coil whine, stop caring about heat. From Nvidia, the card targets 90°C before the fan starts to ramp. I have several of these crammed into a machine, and even under heavy load it genuinely doesn't ramp up that often; when it does, it does a burst and then winds back.
Second, you can save your graph compilations to be reused. You just need to set up persistent cache folders/volumes that the Docker container can access. I'd pasta my config, but I'm on mobile at the moment.
AvocadoArray@reddit (OP)
I don't think the coil whine is from the fans, though. It's very noticeable even while idling.
Way ahead of you on the graph compilations. The graphs are saved and it does try to re-use them, but see my other comment that shows the bottleneck occurring while loading the cached torch graph, and forcing to re-capture the CUDA graphs (which should be cached in memory, not on disk).
This same delay is NOT present when running VLLM on the host outside a container.
BillyBoberts@reddit
The coil whine must be a card by card thing, I have the workstation edition in a box next to me and I only notice it every now and then when it’s processing.
big___bad___wolf@reddit
Both of mine whine when vLLM is grinding.
Especially when building cuda graphs 😂
AvocadoArray@reddit (OP)
I think I read that it's much worse on the Max-Q, but I can't remember exactly. Thanks for the insight!
big___bad___wolf@reddit
I bought one, then I bought another, I’m thinking of getting two more.
tomByrer@reddit
I tend to add extra cooling on my GPUs, like a case fan on top or side to push extra air.
AvocadoArray@reddit (OP)
I've been keeping an eye on it and it's pretty stable for the time being. It's also squashed between two other servers and I noticed it runs cooler in the rack vs. running on my workbench. Probably because they're acting as giant heatsinks lol.
I did order low-profile heatsinks for the CPUs so that the server fans can help with the GPU heat a bit more. Cranking up the server fans doesn't make much of a difference until they're at 40% or more.
tomByrer@reddit
I'm thinking of re-pasting my 3080 & 3090. ..
AvocadoArray@reddit (OP)
If they're sharing a case together, consider a water cooling loop!
tomByrer@reddit
Eh... I don't want to drop too much money on something that isn't making me money yet.
Ok-Measurement-1575@reddit
It's nice to be able to do VM shortcut stuff, but installing Linux natively would prolly solve most of your problems. Unless you've got a way of powering down that card when only the hypervisor and/or other VMs are running, it's ultimately an abstraction layer you don't need.
Qwen models in vllm do seem to take ages for the first request. I've never timed it but it feels like over a minute on my 3090s.
I'm surprised you're seeing 15 minutes end to end.
AvocadoArray@reddit (OP)
Bare metal does remove some variables from the equation, but I have no reason to think virtualization is the cause of the VLLM loading time issue.
15 minutes was probably a bit drastic. That was from a cold boot before I set up my bcache device, so the model loading time was taking a decent chunk of it.
Still, I think it's around 5 minutes total once model weights are loaded.
Yes, the extra \~60s TTFT on first request for the Qwen models sounds accurate. I don't see that same problem with Seed or others.
Johnwascn@reddit
Could you compare the capabilities of qwen3-coder-next and qwen3.5-122b on this device when you have time?
this-just_in@reddit
Mount a host folder to the container's vLLM cache path and you'll solve this one
AvocadoArray@reddit (OP)
I wish it were that easy. It's storing and loading the cache... it just takes forever for some reason. See my other comment
nofdak@reddit
I'm glad to see you write this up; I was writing up my own experience with vLLM and its extremely slow loading times.
The lowest time I've seen from vLLM loading a model to returning tokens is ~45s, and that's with small models. When using larger models like Qwen3.5-122B-A10B the time goes up even further. My llama.cpp built for my hardware can load Qwen3.5-9B in ~7s, but vLLM takes 45.
I've seen higher times when running in a container as well, so now I run directly on the host:
uvx --torch-backend auto --extra-index-url https://wheels.vllm.ai/nightly/cu130 vllm serve Qwen/Qwen3.5-35B-A3B-FP8 --host=:: --gpu-memory-utilization=0.90 --max_num_batched_tokens=16384 --enable-prefix-caching --max-num-seqs=4 --dtype=bfloat16 --reasoning-parser=qwen3 --tool-call-parser=qwen3_coder --enable-chunked-prefill --enable-auto-tool-choice --speculative-config '{"method":"mtp","num_speculative_tokens":2}' --mm-encoder-tp-mode data --mm-processor-cache-type shm
I'm running a non-power-limited RTX Pro 6000 Workstation, so it could pull 600W if needed.
I've tried various different vLLM flags but nothing seems to make a big difference. With ~1m minimum iteration times, it's pretty frustrating testing different quants or flags.
AvocadoArray@reddit (OP)
So glad to see others hitting the same issue and I'm not just going crazy.
I've been hesitant to open a GH issue because I thought it must just be something with my config (I mean, it probably still is, but I don't know *what* it is).
Maybe I'll take the combined notes from this thread and open an issue.
Do you mind posting the relevant startup logs that show the times for loading weights, cached graphs and CUDA graph capture times?
nofdak@reddit
I uploaded my startup logs here: https://pastebin.com/7Ra8Jwqf Note that I was loading the 2B model to limit the size that needs loading.
AvocadoArray@reddit (OP)
Hmm, I don't see the same issues in your log that I'm getting. Specifically, in this line:
I get times closer to 55-75s, which is extremely long for loading cached graphs.
Your CUDA graph capture time is longer than expected as well, so we might have that part in common.
segmond@reddit
I'm just waiting for the mac m5 max/ultra studio to be released and hoping I won't regret my waiting.
cicoles@reddit
Regarding the coil whine, I am wondering if I am deaf but I get nothing from the one I had.
AvocadoArray@reddit (OP)
It sounds like one of those "mosquito" noises that kids played in school to annoy the rest of us, but the teachers almost never heard it.
Maybe I didn't go to enough concerts growing up.
sautdepage@reddit
Even without coil whine, the fan runs at 30% speed all the time and is quite audible for anyone who has spent a bit of effort on having a silent PC.
So we pay thousands over the almost-identical 5090 FE but lose the auto-fan-off feature and the better liquid-metal TIM. Oh, and the HDMI port is gone: the card is targeted at AI video generation tasks, but no 4K TV output for you!
But.. oh the shiny VRAM.
cicoles@reddit
😀 Sorry. I deserved that! Just feeling lucky now that I dodged the whine coil.
FullOf_Bad_Ideas@reddit
Can you run real-time video generation with Helios on it? It claims to run in real time on a single H100, so you might not be that far off.
https://huggingface.co/BestWishYsh/Helios-Distilled
Why not the 600W workstation version? I am glad you didn't go with MI210.
AvocadoArray@reddit (OP)
I haven't dabbled in that area yet, but I'll have to give it a shot one of these days!
Stuck with Max-Q because it fits the 300w power budget in my server and the blower fan exhausts air out the back (don't need to crank the server fans to keep it cool).
The server already runs 24/7 anyway, so it's more efficient to piggy back off its RAM, CPU, and storage in a Linux VM rather than keeping another box running full time.
running101@reddit
So was it worth the cost or was reddit right ?
AvocadoArray@reddit (OP)
100% worth the cost in my opinion.
I'm fairly prone to buyers remorse, but I haven't experienced an ounce of that for this purchase.
Economically speaking, I've easily thrown $100 worth of tokens at it overall, and the market price for the card has already gone up about 10% from when I bought it. It's hard to feel bad about a decision like that.
Mythril_Zombie@reddit
You should update the post with how much you actually spent. And what specs the card has. You know, the useful information.
AvocadoArray@reddit (OP)
I mean, the specs are public knowledge and easily found elsewhere. I wouldn't be adding anything to the conversation by reposting them here.
Prices change by the day, but I got the GPU for a hair under $7k after edu discount.
I scored pretty big on the r730xd server from a local shop. 2x Xeon E5-2660 v4's and 128GB RAM (4x 32GB) for $650.
The shop was kind enough to upgrade the existing DIMMs to match the 4x 32GB DDR4-2400T DIMMs I had on hand for $10/DIMM. So 256GB total running at the full 2400MHz.
GPU power cables off Amazon were about $30 for a pair.
Ordered a pair of E5-2690 v4s and low profile heat-sinks off eBay for about $65 total (in the mail).
I supplemented it with a few other things I had on hand to round out the build. Probably would have been an extra $500 at market price
Whiz_Markie@reddit
Haven’t had time to read it all but was on the verge of going either 6000 or 2x 5090 FE and 1x 4090 and making a system with separate pcs for inference in my use case. I’m thanking you ahead of time for sharing verbose notes and experiences from this endeavor, as I fight the urge to switch over to the 96gb. Cheers
Solid-Roll6500@reddit
Are you using the cu130 nightly vllm openai image? I was having issues with some of the qwen models until going with that.
Also curious, for your ESXi host are you using GPU pass thru or vGPU to the VM? And did you have to setup grid licensing to get it working?
AvocadoArray@reddit (OP)
Tried the default (cu12) and cu130 builds for v0.16.0 and a few of the nightlies (currently running cu130-nightly-097eb544e9a22810c9b7a59e586b61627b308362). The long load times happen with all models, even Seed OSS at INT4, which works perfectly fine using the same exact image on our VM at work.
I'm just passing through the entire GPU to the VM. No need for vGPUs or GRID licensing.
Solid-Roll6500@reddit
Appreciate the response. Cheers.
Glittering_Way_303@reddit
Thank you for the interesting write up! I was considering buying the Max-Q version for concurrent inference for transcription and summarisation for a huge group of people. Intending to use parakeet for STT and qwen3.5 35B-A3B for summarisation and as a chat model. Do you have any thoughts on this use case? In an Asus ESC4000A-E12 server with 96GB DDR5 RAM
AvocadoArray@reddit (OP)
I haven't done a lot of STT work, so I can't offer much in that regard.
However, I remember using OpenAI's whisper model to transcribe a couple of YouTube videos when I first started dabbling in AI. I'm pretty sure it ran much faster than real-time, so I'd guess that this card would handle that just fine if everything is set up correctly.
Captain21_aj@reddit
Hey great write up. Thanks for giving a reference just in case I want to build similar thing with my R730XD in the future. On the other post you mentioned you have 2x L4 GPUs (48GB VRAM total) at work. May I ask what makes your office self host GPU than using API key or claude code/cursor/copilot subscription?
AvocadoArray@reddit (OP)
Thank you! Good to see another r730xd user here.
Technically we have 3x L4s. The first one runs embedding and rerank models, as well as GPT-OSS 20b full time for general purpose chat/research/RAG/automated workflows etc. The other two are for the bigger models that we swap in and out as needed.
They've somehow gone up in price in the year since we bought them. I'm pretty sure we could sell all three and have enough for an RTX 6000 with money to spare.
The biggest reason is security/privacy/compliance. Some of our workflows deal with sensitive data, and we're limited on where that data can be stored or processed. Coding performance was more of a "cool if it can do it, but not the primary purpose".
I did dabble with JetBrains AI using Claude, but I think the last one I properly tested was 3.7 sonnet and still found that I had to babysit it and tweak minor details in order to produce production-worthy code. And once I spent the time setting up the guardrails, prompts, instructions and tool servers, I was getting nearly identical results with Seed 36b for what I actually wanted to use it for.
I don't use it for major architectural decisions or overhauls, just mainly for automating things like unit tests, boilerplate code, and refactoring old codebases (e.g., adding type hinting everywhere). Local models can handle that really well, and in that case, "good" is "good enough".
On top of that, I wanted the ability to learn and tinker with the backend. I've learned quite a bit along the road and still think it was a smart choice, even if we weren't dealing with sensitive data.
a_beautiful_rhind@reddit
I have SAS/SATA drives, so a 10-minute model load is a given for the larger weights not on SSD. My slowest drive is like 120MB/s or something; the fastest is only 500MB/s (the SSDs).
May want to look into rebar, but that's a hell of a lot of ram to map. I don't know how much you have total but it might speed things up. 4x3090 can all do it so why not 1x96gb?
Once a model caches, load is almost instant. If you are taking 10 mins every load, something is fucky.
96gb of vram and hybrid for larger MoE is definitely "comfy".
AvocadoArray@reddit (OP)
Rebar enabled and 106GB RAM reserved/mapped to the VM.
Yes, model weight loading is very fast, it's something screwy with loading the torch compile graphs and forcing CUDA graphs to re-capture on every restart.
One other quick piece of advice if you're using spinning disks: I have around 1.2TB of models so far stored on a 3x 3TB RAID-5 array, but I set up bcache with a 500GB NVMe 970 EVO drive in front of it.
Running it in writethrough mode, so downloading a model to the bcache array stores a copy on the backing device as well as the cache, so it's hot and ready to load right after downloading.
Reading a "cold" model automatically loads it into the NVME cache as well, and evicts the Least Recently Used (LRU) blocks as needed.
While I tend to keep lots of copies of models, I'm usually only swapping between 3-4 on any given day which are easily stored in the NVME cache, and at least one full model can fit in the Linux filesystem cache (RAM).
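For anyone wanting to try this, the setup was roughly as follows. This is a sketch, not my exact commands; the device names are examples for my hardware (make-bcache comes from the bcache-tools package):

```shell
# WARNING: make-bcache formats the named devices; existing data is destroyed.
# /dev/md0 = the RAID-5 backing array, /dev/nvme0n1 = the NVMe cache drive.
make-bcache -B /dev/md0        # register the backing device (creates /dev/bcache0)
make-bcache -C /dev/nvme0n1    # register the cache device

# Find the cache set UUID, then attach it to the bcache device
bcache-super-show /dev/nvme0n1 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# Writethrough: writes land on both cache and backing device, so losing
# the cache drive never loses data
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```

You then format and mount /dev/bcache0 like any other block device.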
If you decide to go this route, make sure you disable sequential read bypass with
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff, otherwise it just bypasses the cache for large reads/writes. This needs to be written after every OS restart.
a_beautiful_rhind@reddit
I didn't even set up raid, but that makes sense. Instead I move heavily used models manually to SSD.. Just keep running out of space and having to get more drives. A couple of mistrals/70b/etc i can store in ram, but these larger ones need to split on NUMA properly and end up not after a few swaps.
The compiling adds up.. it takes a while on all my image models but then usually goes a ton faster. Chroma takes 2 runs of 130s before it settles out and I'm guessing your LLMs are doing similar.
AvocadoArray@reddit (OP)
I've only been using the native vector db built into Open Web UI, but have been meaning to set up Qdrant or Chroma soon.
What made you go with Chroma?
a_beautiful_rhind@reddit
Well it's the other chroma.. the image model. I did use chromaDB in the past for rag but since moved on to other vector solutions.
AvocadoArray@reddit (OP)
Ah, that makes sense. I hadn't heard of Chroma the image model.
Anyway, definitely recommend giving bcache a shot. It basically automates your manual SSD swapping for you. Other alternatives are dm-cache or lvmcache, but I haven't tried them.
a_beautiful_rhind@reddit
I need NVME prices to drop a little first. First server didn't have it. Second one I was like meh, the ram training and boot takes a while anyway.. Now OS and a caching drive sounds nice, as would double the ram but my wallet says no.
AvocadoArray@reddit (OP)
I hear you there. Luckily I had an old 500GB 970 Evo lying around from an old computer, so it was a perfect fit. Just slotted it into a PCI adapter and it worked like a charm.
Also consider a RAID-0 of 4x small SATA SSDs if you can find those for cheaper. Won't lose any data if the RAID dies as long as it's in writethrough mode (not writeback).
PrysmX@reddit
Also, I would power limit the card to something like 450-480W. You only get literally a few percent gain past that point for over 100W more power usage. Extra heat, fans, and electric bill. Absolutely not worth it for pretty much any use case. You can do this via nvidia-smi without even installing additional software, and set it to run the command on startup.
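For example (450W is just an illustrative target; check your card's supported range first):

```shell
# Show the card's min/max/default power limits
nvidia-smi -q -d POWER

# Enable persistence mode, then cap the power limit (in watts)
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 450
```

The limit resets on reboot, so reapply it automatically with a systemd oneshot unit or a cron @reboot entry running the nvidia-smi -pl command.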
MelodicRecognition7@reddit
On the Workstation edition, the gain for prompt processing is linear all the way up to 600W; however, the gain for token generation is a few percent after 310W.
thruston@reddit
Max Q is limited to 300W.
PrysmX@reddit
I'm aware of that. I was going off their 600W+ comment. I thought they had confused editions.
thruston@reddit
Oh my bad!
PrysmX@reddit
No worries! OP responded to me and said they were updating the post to clarify. :-)
AvocadoArray@reddit (OP)
This is the Max-Q edition. Already limited to 300w total and uses a blower fan to exhaust heat out the back :)
PrysmX@reddit
That's what confused me. You kept saying Max-Q, but then you said this thing pulls 600W+, which led me to believe you were actually using a Workstation edition. You didn't make it clear that it was the entire server pulling that much.
AvocadoArray@reddit (OP)
Thank you, I updated the post to clarify that point.
starkruzr@reddit
very curious what those vLLM load times are about.
AvocadoArray@reddit (OP)
I promise to post an update if I figure it out.
AD7GD@reddit
I've used vLLM on a variety of cards, and slow startup time is common.
You could make sure you're at least on the highest PCIe gen your motherboard can handle. Every gen doubles the bandwidth.
AvocadoArray@reddit (OP)
Yes, I've had a lot of experience with vLLM on our work server using Nvidia L4 cards. You're right that it's always slower to start up, but not *this* slow. The model weights load pretty fast overall, but there's some kind of delay when loading cached torch graphs (which are very small and shouldn't be I/O bound).
See my other comment for more logs and info.
bubba-g@reddit
+1 r730 is gen 3 and the rtx 6000 is gen 5
fragment_me@reddit
Good to know. I have a 730 and was worried something this big wouldn't fit or work.
AvocadoArray@reddit (OP)
You and me both. I figured it was time to just bite the bullet and see if it worked.
It was a bit tricky to get it installed, as the two "fins" on the bottom of the card's PCI bracket are pretty fat and barely fit into the slots on the server. I was about 10 seconds away from grabbing my Dremel to widen them out before I finally got it to slot in.
LKama07@reddit
Sorry for the newbie question but how does this type of setup compare to Mac hardware for similar use cases? For example the latest m5? It seems Mac has extremely low power consumptions, but maybe it's much slower?
AvocadoArray@reddit (OP)
I can only speak to what I've seen from other comments, but my understanding is that the Mac is not only slower overall, but the prompt processing speed is atrocious.
This isn't a big deal for light-weight chat, but if you're using it for things like agentic coding, deep research, or long content summarization, it takes a lot longer to process the full prompt before it ever even starts responding. The same issue occurs when running models in RAM (even quad channel DDR5).
Armym@reddit
This is the reason why I follow this sub. Thank you.
LegacyRemaster@reddit
I have a 96GB 600W RTX 6000 running with two 48GB AMD w7800 (one is connected via M2 + external power supply). I took my MSI x570-Pro, added the cards (which were also mounted quite roughly), turned on the PC, installed the AMD+Nvidia drivers, and started using them without any problems. No UPS, but a good insurance policy in case of power failure due to spikes. Easy
Ok_Hope_4007@reddit
When using Docker and vLLM, I think you can mount the cache folder for the CUDA graph into the container just like the model folder (I can't remember the exact path), but at least it won't rebuild it whenever you create a new container.
AvocadoArray@reddit (OP)
Correct, and I'm doing that exactly the same way as my setup at work with
-v vllm-cache:/root/.cache/vllm/. See my other comment for screenshots and logs.
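For reference, the relevant part of the docker run looks roughly like this. The image tag and port are examples from elsewhere in the thread, and the model argument is a placeholder, not my exact invocation:

```shell
docker run --rm --gpus all \
  -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface/ \
  -v vllm-cache:/root/.cache/vllm/ \
  vllm/vllm-openai:v0.16.0 \
  --model <model>
```

With the named volumes in place, both the downloaded weights and the vLLM compile cache survive container recreation.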
somethingdangerzone@reddit
Great write up, thanks for that. Happy coding!
PrysmX@reddit
vLLM startup times are worse because by default vLLM will fill up as much VRAM as possible with caching. Their point of view is that free VRAM is wasted VRAM which, depending on the use case, is a valid statement. There are startup parameters you can pass to limit how much VRAM is used by vLLM if you want quicker startup at the expense of available memory in vLLM. This can actually be important if you do use the same machine for multiple tasks and it isn't a standalone vLLM server.
AvocadoArray@reddit (OP)
I understand that, but that's not quite what's going on.
See my other comment that shows the bottleneck occurring while loading the cached torch graph, and forcing to re-capture the CUDA graph.
This same delay is NOT present when running VLLM on the host outside a container.
swagonflyyyy@reddit
I can attest to a lot of the things you mentioned in this post. Haven't tried vllm tho because I'm on windows, but I was in the process of running Claude Code locally with gpt-oss-120b via vLLM. Any tips?
AvocadoArray@reddit (OP)
IMO, trying to get max performance on Windows is a losing battle. llama.cpp works just fine on Windows, but if you want to run vLLM you need to use WSL2 or Docker Desktop (which also uses WSL2), and that creates a bunch of headaches like reserved system RAM and poor bind-mount volume performance.
If you want to run VLLM, I'd suggest running the card in a separate headless Linux server/desktop and set it up with llama-swap.
Despite the weird long loading times with VLLM in a docker container, the ability to run native FP8 models is awesome in terms of speed and accuracy.
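If you go the headless-Linux route, a minimal llama-swap config might look something like this. The model name, image tag, and flags are placeholders for illustration, not my exact setup:

```yaml
# llama-swap config.yaml sketch: llama-swap starts and stops the backend
# on demand and proxies OpenAI-compatible requests to it
models:
  "qwen3-coder":
    cmd: >
      docker run --rm --gpus all -p ${PORT}:8000
      -v vllm-cache:/root/.cache/vllm/
      vllm/vllm-openai:v0.16.0 --model <model>
    proxy: "http://127.0.0.1:${PORT}"
```

llama-swap substitutes ${PORT} itself, so each model entry gets its own port and only one model occupies VRAM at a time.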
NoahFect@reddit
15 minute startup time? Now try it with CUDA enabled.
AvocadoArray@reddit (OP)
CUDA is enabled, though, and runs perfectly once started and warmed up. There's just an issue with loading cached graphs in Docker for some reason.
Tested with the official vllm/vllm-openai docker image across multiple tags: v0.15.1, v0.16.0, v0.16.0-cu130, and cu130-nightly-097eb544e9a22810c9b7a59e586b61627b308362.
pandar1um@reddit
Fantastic post, thank you for sharing. Well, in any case my broke ass can't afford it (or even a used 3090), but nobody can stop me from reading about it :)
Writer_IT@reddit
As a person that your previous discussion actually convinced to buy this monster.
For the long startup time, have you stored the model into a linux-formatted image? This dropped my loading time from 20-30 minutes to 2-3 for 100+b models.
AvocadoArray@reddit (OP)
The model weights themselves load very fast, especially once they're cached in Linux RAM (>1GB/s). The long loading times are due to it taking forever to load the cached torch graphs and needing to re-capture the CUDA graphs after every restart for some reason.
Writer_IT@reddit
And the reason was that if they were stored on NTFS, the dockerized Linux vllm image would have to essentially decode them on the fly
jonahbenton@reddit
This is incredibly helpful but one thing I hoped you could clarify, regarding power draw- the machine in which you installed the card was pulling 600w with the card at full throttle, not the card itself (as measured via nvtop or nvidia-smi)- is that right?
AvocadoArray@reddit (OP)
Correct. The Max-Q only pulls 300w max. The entire machine idles around 200w, so when the card is running at max power it pulls 500w minimum + \~100w for the additional CPU, RAM, and fan overhead. I think I only saw 600w peak when offloading some MOE experts to the CPU with minimax m2.5.
kaliku@reddit
Fantastic write-up, thank you for taking the time!