Qwen 3.6 27B in RTX PRO 6000 - Why high RAM usage?
Posted by ubnew@reddit | LocalLLaMA | View on Reddit | 27 comments

Hey guys, so I am running unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL on an RTX PRO 6000 Blackwell Max-Q and I am not sure what is causing this high amount of (cached) RAM usage.
I am using this llama-server script:
MODEL="unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL"
TEMPLATE="./qwen3.6-27b-chat.jinja"
llama-server -hf "$MODEL" \
--jinja \
--chat-template-file "$TEMPLATE" \
--chat-template-kwargs '{"preserve_thinking": true}' \
--ctx-size 262144 \
-fa on \
-ngl 99 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--host 0.0.0.0 \
--port 8080
with CUDA Version: 13.1

It's practically the same script I was using for other models without any issue, but with Qwen 3.6 35B A3B and the new 27B, prompt processing is getting slow, and I guess it's because it's offloading cache to RAM? I've tried setting the KV cache to Q8 without success.
Any ideas?
CockBrother@reddit
I just did a comparison between llama.cpp and vllm yesterday because tool calling on vllm is kind of... suspect. But the overall performance of llama.cpp was terrible compared to vllm. I'm using an RTX 6000 Pro, so it should be very similar to your experience.
Try vllm with the FP8 quantized model directly from Qwen.
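A minimal invocation sketch for that suggestion. The model id below is an assumption (check the actual repo name on Hugging Face); `vllm serve`, `--max-model-len`, `--gpu-memory-utilization`, and `--port` are real vllm options, but the values are guesses to adapt:

```shell
# Sketch only: Qwen/Qwen3.6-27B-FP8 is an assumed repo id, not verified.
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --port 8080
```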
Legal-Ad-3901@reddit
What are you doing about context then? Did turboquant ever get solved in vllm for hybrid models?
CockBrother@reddit
Context isn't a problem with 96GB GPU RAM on the 35B model - with just 0.8 on GPU memory utilization:
Let me try on the new 27B:
Okay - RAM is tight on the 27B model. Upping vllm memory utilization to 0.95:
Okay - let's try FP8 cache on the 27B ("kv-cache-dtype: fp8"):
Yup, nearly exactly double. That should be enough?
DeltaSqueezer@reddit
Don't forget that is just the KV cache size in tokens. Since Qwen3.5 is a hybrid model, the linear-attention layers don't take up this KV cache space. So with the 3:1 ratio of linear to full-attention layers, the max context length is 4x the reported KV cache size, and the 168,912-token KV cache gives you 675,648 tokens of context.
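The arithmetic above as a one-liner (the 168,912 figure comes from the vllm startup log referenced in this thread; the 4x multiplier follows from the 3:1 hybrid layer ratio):

```shell
# Reported KV cache capacity in tokens, from the vllm startup log
kv_tokens=168912
# 3:1 linear:full-attention ratio -> only 1 in 4 layers uses KV cache,
# so effective context is 4x the reported figure
echo $((kv_tokens * 4))   # prints 675648
```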
TokenRingAI@reddit
Your actual context length is much higher than that, the reported KV cache size in VLLM does not account for the model being a hybrid.
Look at the lines under those and you will see you probably have 4x or higher concurrency at full context length.
CockBrother@reddit
Thanks for that! Very counterintuitive way to report it. The concurrency number was in the range you're saying.
ubnew@reddit (OP)
Will do right now, thx Cock Brother
CockBrother@reddit
I know getting going with vllm can be a bit different... here's a config.yaml I'm using with the 3.6 35B FP8 for you to start with:
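The original config.yaml did not survive in this thread, so here is a hypothetical starting point instead. vllm accepts a YAML file via `vllm serve --config config.yaml`, with keys mirroring CLI flag names; the model id and values below are assumptions to adapt, not the commenter's actual file:

```yaml
# Hypothetical sketch, NOT the commenter's original config.
# Usage: vllm serve --config config.yaml
model: Qwen/Qwen3.6-35B-A3B-FP8   # assumed repo id, verify on Hugging Face
max-model-len: 262144
gpu-memory-utilization: 0.95
kv-cache-dtype: fp8
host: 0.0.0.0
port: 8080
```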
ubnew@reddit (OP)
It's crazy fast now with vLLM, huge difference. Game changer 🏆
Tomr750@reddit
what token/s?
ubnew@reddit (OP)
Finally got it running, this helped a lot, thanks! :))
ubnew@reddit (OP)
Now stuck with OOM issues with vllm:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.06 GiB. GPU 0 has a total capacity of 94.97 GiB of which 993.06 MiB is free. Including non-PyTorch memory, this process has 93.98 GiB memory in use. Of the allocated memory 88.10 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W423 00:56:11.035991834 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
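A common mitigation sketch for this kind of OOM, assuming it comes from allocating the KV cache at full context: leave more headroom by lowering `--gpu-memory-utilization` or shrinking `--max-model-len`. The flags are real vllm options and the `PYTORCH_ALLOC_CONF` suggestion comes straight from the error message above, but the model id and values are assumptions:

```shell
# Sketch: allocator hint from the error message, plus less aggressive
# GPU memory usage and a smaller context window.
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8
```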
TokenRingAI@reddit
You need to either use VLLM (recommended, with mtp set to 3), or switch llama.cpp to use Vulkan
Qwen Next, 3.5, and I assume 3.6, all have bad CUDA problems on llama.cpp with SM120.
For some reason they have been ignoring the problem for half a year.
parrot42@reddit
For me, using llama.cpp b8575 works great, even with CUDA 13.2.1 and Qwen 3.5 397B. Versions after b8575 (up to b8893) do not work anymore.
lacerating_aura@reddit
What's the issue? I'm using b8864 with 3.6 A3B and 3.5 A10B with CPU MoE offload, same CUDA 13.2, no problems.
TokenRingAI@reddit
The technical issue is this: https://github.com/ggml-org/llama.cpp/issues/19345
The bigger issue is that llama.cpp has a dysfunctional bug-reporting process: it uses a 14-day auto-close bot and doesn't seem to maintain a long-term bug tracking system.
Issues don't magically go away after 14 days without a solution. Unfortunately, the tickets for many serious bugs are simply auto-closed and lost rather than tracked long term, which is how the popular Qwen hybrid models can stay broken on Blackwell for half a year.
Easy_Kitchen7819@reddit
Did you try ik_llama.cpp?
lacerating_aura@reddit
Ah, seems to be Blackwell-specific, so it makes sense why I don't get it.
parrot42@reddit
My issue seems to be related to this: https://github.com/ggml-org/llama.cpp/issues/21289 . `ggml_cuda_compute_forward: MUL_MAT_ID failed`, but it used to work.
ubnew@reddit (OP)
Thx for the info, any related documentation?
anzzax@reddit
Use the nightly vllm Docker image; a few optimizations for sm120 and sm121 landed recently.
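A sketch of what that could look like, assuming a `nightly` tag on the official `vllm/vllm-openai` image (check Docker Hub for the current tag name; the model id is also an assumption):

```shell
# Tag and model id are assumptions; verify before use.
docker pull vllm/vllm-openai:nightly
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:nightly --model Qwen/Qwen3.6-27B-FP8
```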
sn2006gy@reddit
Seems absurd to me how much the Spark and RTX 6000 are suffering from these issues, given how much they both cost.
anzzax@reddit
With great community effort it is getting much better, and I'm quite a happy Spark user. For background multi-agent flows it works very well and is super power-efficient, but shame on Nvidia for their empty promises and lack of proper support on the software side.
sn2006gy@reddit
I don't see that as worth 5k though; that kind of work happens on my $100 Pis with a $2.00/week API.
libregrape@reddit
Looks like the RAM prompt cache.
You can test this by adding --cache-ram 0 and seeing if the RAM usage decreases: if it does, then it's the prompt cache. If RAM usage stays the same, then it isn't.
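For reference, `--cache-ram` sets the host-RAM budget (in MiB) for llama-server's prompt cache, with 0 disabling it. Slotted into the OP's script it would look like this (the flag is real; the abbreviated command is illustrative):

```shell
# Same llama-server invocation as the OP's, with the host prompt cache disabled.
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL \
  --cache-ram 0 \
  --ctx-size 262144 -fa on -ngl 99 \
  --host 0.0.0.0 --port 8080
```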
ubnew@reddit (OP)
That lowered the RAM usage, but prompt processing is still stupidly slow... anyway, thx for the tip.
car_lower_x@reddit
There are issues with CUDA 13; not sure RAM is one of them.