Qwen 3.6 27B on RTX PRO 6000 - Why the high RAM usage?

Posted by ubnew@reddit | LocalLLaMA

Hey guys, so I'm running unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL on an RTX PRO 6000 Blackwell Max-Q, and I'm not sure what's causing such a high amount of RAM usage (it shows up as cached memory).
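For reference, this is how I'm reading the "cached" number (a quick sketch; the point is that the `buff/cache` column in `free` is file-backed page cache, which is normally reclaimable):

```shell
# overall memory picture: "used" vs "buff/cache" (reclaimable page cache)
free -h

# the raw counters behind that, from the kernel
grep -E '^(Cached|MemAvailable):' /proc/meminfo
```

If `MemAvailable` stays high even while `Cached` is huge, the memory isn't really gone, it's just the model file sitting in page cache.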

I am using this llama-server script:

MODEL="unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL"
TEMPLATE="./qwen3.6-27b-chat.jinja"

llama-server -hf "$MODEL" \
  --jinja \
  --chat-template-file "$TEMPLATE" \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --ctx-size 262144 \
  -fa on \
  -ngl 99 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --repeat-penalty 1.0 \
  --presence-penalty 0.0 \
  --host 0.0.0.0 \
  --port 8080

with CUDA Version: 13.1

It's practically the same script I was using for other models without any issue, but with Qwen 3.6 35B A3B and the new 27B, prompt processing is getting slow, and my guess is that the KV cache is being offloaded to RAM? I've tried setting the KV cache to Q8 without success.
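For scale, here's a rough back-of-the-envelope estimate of the f16 KV-cache size at my 262144 context. The layer/head/dim numbers below are placeholders I made up, not the actual Qwen3.6-27B config, so plug in the real values from the GGUF metadata:

```python
# Rough KV-cache size estimate for a GQA transformer.
# NOTE: n_layers / n_kv_heads / head_dim are ASSUMED example values,
# not the real Qwen3.6-27B configuration.
n_layers = 48        # assumed
n_kv_heads = 8       # assumed (grouped-query attention)
head_dim = 128       # assumed
ctx = 262144         # --ctx-size from the script above
bytes_per_elem = 2   # f16; q8_0 KV would be roughly half this

# 2x for K and V tensors, per layer, per KV head, per position
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB")  # → 48.0 GiB with these assumed numbers
```

Even with made-up numbers in that ballpark, a 256K context KV cache is tens of GiB, which may not fit next to the Q8 weights in VRAM.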

Any ideas?