Qwen 3.6 27B in RTX PRO 6000 - Why high RAM usage?
Posted by ubnew@reddit | LocalLLaMA | View on Reddit | 27 comments

Hey guys, so I am running unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL on an RTX PRO 6000 Blackwell Max-Q and I am not sure what is causing this high amount of (cached) RAM usage.
I am using this llama-server script:
MODEL="unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL"
TEMPLATE="./qwen3.6-27b-chat.jinja"
llama-server -hf "$MODEL" \
--jinja \
--chat-template-file "$TEMPLATE" \
--chat-template-kwargs '{"preserve_thinking": true}' \
--ctx-size 262144 \
-fa on \
-ngl 99 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--host 0.0.0.0 \
--port 8080
with CUDA Version: 13.1

It's practically the same script I was using for other models without any issue, but with Qwen 3.6 35B A3B and the new 27B, prompt processing is getting slow, and I guess it's because it's offloading cache to RAM? I've tried setting the KV cache to Q8 without success.
Any ideas?
CockBrother@reddit
I just did a comparison between llama.cpp and vllm yesterday because tool calling on vllm is kind of... suspect. But the overall performance of llama.cpp was terrible compared to vllm. I'm using an RTX 6000 Pro, so it should be very similar to your experience.
Try vllm with the FP8 quantized model directly from Qwen.
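A minimal invocation sketch for that suggestion. The model id below is an assumption (check the actual repo name on Hugging Face); `vllm serve`, `--max-model-len`, `--gpu-memory-utilization`, and `--port` are real vllm options, but the values are guesses to adapt:

```shell
# Sketch only: Qwen/Qwen3.6-27B-FP8 is an assumed repo id, not verified.
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --port 8080
```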
Legal-Ad-3901@reddit
What are you doing about context then? Did turboquant ever get solved in vllm for hybrid models?
CockBrother@reddit
Context isn't a problem with 96GB GPU RAM on the 35B model - with just 0.8 on GPU memory utilization:
Let me try on the new 27B:
Okay - RAM is tight on the 27B model. Upping vllm memory utilization to 0.95:
Okay - let's try FP8 cache on the 27B ("kv-cache-dtype: fp8"):
Yup, nearly exactly double. That should be enough?
DeltaSqueezer@reddit
Don't forget that is just the KV cache size in tokens. Since Qwen3.5 is a hybrid model, the linear-attention layers don't take up this KV cache space. So with the 3:1 ratio of linear to full-attention layers, the max context length is 4x the reported KV cache size, and the 168,912-token KV cache gives you 675,648 tokens of context.
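The arithmetic above as a one-liner (the 168,912 figure comes from the vllm startup log referenced in this thread; the 4x multiplier follows from the 3:1 hybrid layer ratio):

```shell
# Reported KV cache capacity in tokens, from the vllm startup log
kv_tokens=168912
# 3:1 linear:full-attention ratio -> only 1 in 4 layers uses KV cache,
# so effective context is 4x the reported figure
echo $((kv_tokens * 4))   # prints 675648
```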
TokenRingAI@reddit
Your actual context length is much higher than that, the reported KV cache size in VLLM does not account for the model being a hybrid.
Look at the lines under those and you will see you probably have 4x or higher concurrency at full context length.
CockBrother@reddit
Thanks for that! Very counterintuitive way to report it. The concurrency number was in the range you're saying.
ubnew@reddit (OP)
Will do right now, thx Cock Brother
CockBrother@reddit
I know getting going with vllm can be a bit different... here's a config.yaml I'm using with the 3.6 35B FP8 for you to start with:
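The original config.yaml did not survive in this thread, so here is a hypothetical starting point instead. vllm accepts a YAML file via `vllm serve --config config.yaml`, with keys mirroring CLI flag names; the model id and values below are assumptions to adapt, not the commenter's actual file:

```yaml
# Hypothetical sketch, NOT the commenter's original config.
# Usage: vllm serve --config config.yaml
model: Qwen/Qwen3.6-35B-A3B-FP8   # assumed repo id, verify on Hugging Face
max-model-len: 262144
gpu-memory-utilization: 0.95
kv-cache-dtype: fp8
host: 0.0.0.0
port: 8080
```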
ubnew@reddit (OP)
It's crazy fast now with vLLM, huge difference. Game changer 🏆
Tomr750@reddit
what token/s?
ubnew@reddit (OP)
Finally got it running, this helped a lot, thanks! :))
ubnew@reddit (OP)
Now stuck with OOM issues with vllm:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.06 GiB. GPU 0 has a total capacity of 94.97 GiB of which 993.06 MiB is free. Including non-PyTorch memory, this process has 93.98 GiB memory in use. Of the allocated memory 88.10 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W423 00:56:11.035991834 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
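A common mitigation sketch for this kind of OOM, assuming it comes from allocating the KV cache at full context: leave more headroom by lowering `--gpu-memory-utilization` or shrinking `--max-model-len`. The flags are real vllm options and the `PYTORCH_ALLOC_CONF` suggestion comes straight from the error message above, but the model id and values are assumptions:

```shell
# Sketch: allocator hint from the error message, plus less aggressive
# GPU memory usage and a smaller context window.
PYTORCH_ALLOC_CONF=expandable_segments:True \
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8
```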
TokenRingAI@reddit
You need to either use VLLM (recommended, with mtp set to 3), or switch llama.cpp to use Vulkan
Qwen Next, 3.5, and I assume 3.6, all have bad CUDA problems on llama.cpp with SM120.
For some reason they have been ignoring the problem for half a year.
parrot42@reddit
For me, using llama.cpp b8575 works great, even with CUDA 13.2.1 and Qwen 3.5 397B. Versions after b8575 (up to b8893) do not work anymore.
lacerating_aura@reddit
What's the issue? I'm using b8864 with 3.6 A3B and 3.5 A10B with CPU MoE offload, same CUDA 13.2, no problems.
TokenRingAI@reddit
The technical issue is this: https://github.com/ggml-org/llama.cpp/issues/19345
The bigger issue is that llama.cpp has a dysfunctional bug-reporting process: it uses a 14-day auto-close bot and doesn't seem to maintain a long-term bug tracking system.
Issues don't magically go away after 14 days without a solution. Unfortunately, the tickets for many serious bugs are simply auto-closed and lost rather than tracked long term, which is how the popular Qwen hybrid models can stay broken on Blackwell for half a year.
Easy_Kitchen7819@reddit
Did you try ik_llama.cpp?
lacerating_aura@reddit
Ah, seems to be Blackwell-specific, so it makes sense why I don't get it.
parrot42@reddit
My issue seems to be related to this: https://github.com/ggml-org/llama.cpp/issues/21289 . `ggml_cuda_compute_forward: MUL_MAT_ID failed`, but it used to work.
ubnew@reddit (OP)
Thx for the info, any related documentation?
anzzax@reddit
Use the nightly vllm Docker image; a few optimizations for sm120 and sm121 landed recently.
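A sketch of what that could look like, assuming a `nightly` tag on the official `vllm/vllm-openai` image (check Docker Hub for the current tag name; the model id is also an assumption):

```shell
# Tag and model id are assumptions; verify before use.
docker pull vllm/vllm-openai:nightly
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:nightly --model Qwen/Qwen3.6-27B-FP8
```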
sn2006gy@reddit
Seems absurd to me how much the Spark and RTX 6000 are suffering from these issues, given how much they both cost.
anzzax@reddit
With great community effort it is getting much better, and I'm quite a happy Spark user. For background multi-agent flows it works very well and is super power-efficient, but shame on Nvidia for their empty promises and lack of proper support on the software side.
sn2006gy@reddit
I don't see that as worth 5k though; that kind of work happens on my $100 Pis with a $2.00/week API.
libregrape@reddit
Looks like the RAM prompt cache.
You can test this by adding --cache-ram 0 and seeing if the RAM usage decreases: if it does, then it's the prompt cache. If RAM usage stays the same, then it isn't.
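For reference, `--cache-ram` sets the host-RAM budget (in MiB) for llama-server's prompt cache, with 0 disabling it. Slotted into the OP's script it would look like this (the flag is real; the abbreviated command is illustrative):

```shell
# Same llama-server invocation as the OP's, with the host prompt cache disabled.
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL \
  --cache-ram 0 \
  --ctx-size 262144 -fa on -ngl 99 \
  --host 0.0.0.0 --port 8080
```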
ubnew@reddit (OP)
That lowered the RAM usage, but prompt processing is still stupidly slow... anyway, thx for the tip.
car_lower_x@reddit
There are issues with CUDA 13; not sure RAM is one of them.