Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs

Posted by InternationalNebula7@reddit | LocalLLaMA | 13 comments

For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on?

For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 t/s token generation (tg) and >800 t/s prompt processing (pp). Qwen3.5:9B is really fast, but I'm experimenting with higher intelligence. I offloaded the vision projector (mmproj) to CPU because it is rarely used.

Currently running Qwen3.6-27B-Q3_K_S.gguf with 64 layers on GPU at the following speeds:

prompt eval time =     462.66 ms /   507 tokens (    0.91 ms per token,  1095.83 tokens per second)
       eval time =   18710.17 ms /   884 tokens (   21.17 ms per token,    47.25 tokens per second)
      total time =   19172.84 ms /  1391 tokens
draft acceptance rate = 0.59677 (  481 accepted /   806 generated)

prompt eval time =    6001.34 ms /  8561 tokens (    0.70 ms per token,  1426.51 tokens per second)
       eval time =    2404.46 ms /   147 tokens (   16.36 ms per token,    61.14 tokens per second)
      total time =    8405.80 ms /  8708 tokens
draft acceptance rate = 0.80357 (   90 accepted /   112 generated)
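
For anyone comparing numbers across runs, the throughput and acceptance figures in these logs can be recomputed from the raw counts. Here is a minimal Python sketch, assuming the log lines keep the layout shown above; `parse_timings` and the two regexes are my own helpers, not part of llama.cpp.

```python
import re

# Matches lines like "eval time = 18710.17 ms / 884 tokens (...)".
TIMING_RE = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")
# Matches the counts inside "draft acceptance rate = ... (481 accepted / 806 generated)".
DRAFT_RE = re.compile(r"(\d+) accepted\s*/\s*(\d+) generated")

SAMPLE = """\
prompt eval time =     462.66 ms /   507 tokens (    0.91 ms per token,  1095.83 tokens per second)
       eval time =   18710.17 ms /   884 tokens (   21.17 ms per token,    47.25 tokens per second)
      total time =   19172.84 ms /  1391 tokens
draft acceptance rate = 0.59677 (  481 accepted /   806 generated)
"""


def parse_timings(log: str) -> dict:
    """Recompute tokens/s and draft acceptance from one timing block."""
    out = {}
    for kind, ms, tokens in TIMING_RE.findall(log):
        # tokens per second = tokens / elapsed seconds
        out[f"{kind} t/s"] = int(tokens) / (float(ms) / 1000.0)
    match = DRAFT_RE.search(log)
    if match:
        accepted, generated = map(int, match.groups())
        out["draft acceptance"] = accepted / generated
    return out


if __name__ == "__main__":
    for name, value in parse_timings(SAMPLE).items():
        print(f"{name}: {value:.3f}")
```

Feeding it both runs makes the spread obvious: generation speed moves with the draft acceptance rate, so a single run is not a stable benchmark.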

Config:

      -m /models/Qwen3.6-27B/Qwen3.6-27B-Q3_K_S.gguf
      --mmproj /models/Qwen3.6-27B/mmproj-BF16.gguf
      --no-mmproj-offload
      --host 0.0.0.0
      --port 8080
      --jinja
      -fa on
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min_p 0.0
      --presence-penalty 1.5
      --repeat-penalty 1.0
      --cache-ram 0
      --fit on
      -np 2
      --fit-ctx 32000
      --cache-type-k q8_0
      --cache-type-v q8_0
      --cache-type-k-draft q8_0
      --cache-type-v-draft q8_0
      --log-verbosity 4
      --chat-template-kwargs '{"preserve_thinking": true}'
      --spec-type draft-mtp
      --spec-draft-n-max 2
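
The `--cache-type-k` / `--cache-type-v` flags above trade KV cache precision for VRAM. As a rough sketch of why that matters on a 16GB card, the snippet below estimates KV cache size per cache type. The per-block byte counts for f16, q8_0, q5_0 and q4_0 follow ggml's block layouts; the layer count, KV head count and head dimension in the example are placeholder values, NOT Qwen3.6-27B's real architecture, so substitute the numbers from the GGUF metadata before trusting the result.

```python
# Bytes per stored element for common ggml KV cache types.
# Quantized types pack 32 values per block plus an fp16 scale
# (q5_0 additionally stores 4 bytes of high bits per block).
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, type_k, type_v):
    """Approximate KV cache size in bytes for n_ctx cached tokens.

    Only layers that actually keep a KV cache should be counted in
    n_attn_layers (an assumption the caller has to get right).
    """
    elems_per_token = n_attn_layers * n_kv_heads * head_dim
    return n_ctx * elems_per_token * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])


if __name__ == "__main__":
    # Hypothetical architecture values, for illustration only.
    shape = dict(n_ctx=32000, n_attn_layers=16, n_kv_heads=4, head_dim=256)
    for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q5_0"), ("q4_0", "q4_0")]:
        gib = kv_cache_bytes(**shape, type_k=k, type_v=v) / 2**30
        print(f"K={k:5s} V={v:5s} -> {gib:.2f} GiB")
```

Whatever the real architecture is, the ratios hold: q8_0 takes roughly 53% of the f16 cache and q4_0 roughly 28%.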

      bin_vulkan/llama-server \
        -m /home/eaman/lm/models/localweights/Qwen3.6-27B-MTP-IQ4_XS-Q8nextn-GGUF/Qwen3.6-27B-MTP-IQ4_XS-Q8nextn.gguf \
        --host 0.0.0.0 -np 1 \
        --fit-target 50 \
        -ctk q4_0 -ctv q4_0 \
        -fa on \
        --temp 0.6 --top-k 30 --top-p 0.95 --min-p 0.0 \
        --repeat-penalty 1.0 --presence_penalty 0.0 \
        -b 128 \
        --jinja \
        --no-mmap \
        --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 3 \
        -ctkd q8_0 -ctvd q8_0 \
        --ctx-size 32000 \
        -ngl 99 \
        --reasoning-budget 1 --chat-template-kwargs '{"enable_thinking":false}' \
        --cache-ram 6000 -lv 3 --no-warmup

      # optimized for coding
      # max context: 81K headless
      bin_vulkan/llama-server \
        -m /home/eaman/lm/models/mradermacher/Qwen3.6-27B/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -np 1 \
        --fit-target 50 \
        -ctk q8_0 \
        -ctv q5_0 \
        -fa on \
        --temp 0.5 \
        --min-p 0.0 \
        --repeat-penalty 1.0 \
        --presence_penalty 0.0 \
        -b 512 \
        --jinja \
        --no-mmap \
        --reasoning-budget 1 \
        --chat-template-kwargs '{"enable_thinking":false}' \
        --spec-type ngram-mod \
        --spec-ngram-mod-n-match 8 \
        --spec-ngram-mod-n-min 3 \
        --spec-ngram-mod-n-max 24 \
        -lv 4


human_bean_@reddit

If you're using the latest llama.cpp, which now has MTP, I got a huge improvement by changing --spec-draft-n-max from 3 to 6.
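
One way to reason about why a longer draft can help: under the simplifying assumption that each drafted token is accepted independently with probability p, a draft of length n yields (1 - p^(n+1)) / (1 - p) tokens per verification pass on average, counting the one token the main model always contributes. This is the textbook speculative-decoding estimate, not something llama.cpp reports, and the logged "draft acceptance rate" is only a rough stand-in for p. It also ignores the cost of drafting and verifying the extra tokens, so real speedups will be smaller.

```python
def expected_tokens_per_step(p: float, n: int) -> float:
    """Expected tokens emitted per verification pass.

    p: assumed per-token acceptance probability (independent across positions).
    n: draft length, i.e. the value passed as --spec-draft-n-max.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if p == 1.0:
        # Every drafted token is accepted, plus the main model's own token.
        return float(n + 1)
    # Geometric series 1 + p + p^2 + ... + p^n.
    return (1.0 - p ** (n + 1)) / (1.0 - p)


if __name__ == "__main__":
    for p in (0.6, 0.8):
        for n in (2, 3, 6):
            print(f"p={p} n={n}: {expected_tokens_per_step(p, n):.2f} tokens/step")
```

The series flattens quickly at low acceptance and keeps growing at high acceptance, so whether going from 3 to 6 pays off depends heavily on the workload.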

ea_man@reddit

I'm running:

* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf (no MTP)

* Qwen3.6-27B-MTP-IQ4_XS-Q8nextn.gguf

If you want to see the launch command for the MTP one:

But I think the non-MTP one may be more useful to me, since it gives me some ~80K context:

ZealousidealBadger47@reddit

If only there were a Qwen 3.6 14B, how nice that would be.

DrBearJ3w@reddit

https://huggingface.co/AesSedai/Qwen3.6-35B-A3B-GGUF/tree/main/IQ3_S

If you want to stay PP >800 consistently

ShadyShroomz@reddit

That's the MoE, which is much dumber than the dense one. Not sure why people recommend it as a replacement, tbh. Might as well recommend 3.5 0.8B; that'll run even faster.

InternationalNebula7@reddit (OP)

At what quant does the intelligence cross over for 35B-A3B’s advantage? I would think identical quants would favor 27B? For example, would q4 35B be better than q3 27B?

4-bit. Speed and decent enough intelligence. 27B would probably be too slow at 32k full context. 35B is smart enough for simple/intermediate tasks.

grumd@reddit

You can run 35B at Q8 on a 5080 with RAM offloading and hit 1500+ pp consistently with 2048 ub

Yeah, sure, if one needs precision.

IgnisIason@reddit

I just want to say I'm jealous. 😅

KeepyUpper@reddit

I was able to run 27B IQ4 with 80k context with turbo4 kv cache. But it was kind of dumb and couldn't fix certain bugs even when I was telling it exactly what the problem was.

It's cool getting these things to run on 16GB cards, but you have to lobotomise them so heavily to make them fit that it's not worth the effort. I think a higher-quant 35B-A3B produced better output than a gimped 27B if you're trying to use them for anything productive. It also runs faster.

LostDrengr@reddit

Same card, same model. KV cache set to q4: 61 t/s.

mixman68@reddit

Which is smarter: Qwen3.6-35B Q5_L_XL or Qwen3.6-27B Q3_K_S?
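
This does not settle which one is smarter, but it helps to see what each option costs in memory first. Below is a back-of-the-envelope Python sketch that estimates weight size from parameter count and nominal bits per weight. The bpw figures are the nominal values for the base ggml types; real GGUF files mix several types per tensor, so actual file sizes differ, and the parameter counts are simply read off the model names (both are assumptions).

```python
# Nominal bits per weight for a few ggml quantization types.
# Real quant mixes (e.g. *_S, *_M, *_XL variants) average somewhat differently.
NOMINAL_BPW = {
    "Q3_K": 3.4375,
    "IQ4_XS": 4.25,
    "Q4_K": 4.5,
    "Q5_K": 5.5,
    "Q8_0": 8.5,
}


def weight_gb(n_params: float, quant: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given quant type."""
    return n_params * NOMINAL_BPW[quant] / 8 / 1e9


if __name__ == "__main__":
    # Parameter counts assumed from the model names: 27B dense, 35B MoE.
    for name, params, quant in [
        ("27B", 27e9, "Q3_K"),
        ("27B", 27e9, "IQ4_XS"),
        ("35B", 35e9, "Q5_K"),
        ("35B", 35e9, "Q8_0"),
    ]:
        print(f"{name} {quant}: ~{weight_gb(params, quant):.1f} GB")
```

On these rough numbers a Q3-class 27B fits in 16GB with room left for KV cache, while a Q5-class 35B does not fit entirely in VRAM and has to rely on RAM offloading, as mentioned earlier in the thread.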