Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs
Posted by InternationalNebula7@reddit | LocalLLaMA | View on Reddit | 13 comments
For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on?
For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 tg and >800 pp. Qwen3.5:9B works really fast, but I'm experimenting with higher intelligence. Offloaded the vision model to CPU because it is infrequently used.
Currently running Qwen3.6-27B-Q3_K_S.gguf with 64 layers on GPU at the following speeds:
prompt eval time = 462.66 ms / 507 tokens ( 0.91 ms per token, 1095.83 tokens per second)
eval time = 18710.17 ms / 884 tokens ( 21.17 ms per token, 47.25 tokens per second)
total time = 19172.84 ms / 1391 tokens
draft acceptance rate = 0.59677 ( 481 accepted / 806 generated)
prompt eval time = 6001.34 ms / 8561 tokens ( 0.70 ms per token, 1426.51 tokens per second)
eval time = 2404.46 ms / 147 tokens ( 16.36 ms per token, 61.14 tokens per second)
total time = 8405.80 ms / 8708 tokens
draft acceptance rate = 0.80357 ( 90 accepted / 112 generated)
Config:
-m /models/Qwen3.6-27B/Qwen3.6-27B-Q3_K_S.gguf
--mmproj /models/Qwen3.6-27B/mmproj-BF16.gguf
--no-mmproj-offload
--host 0.0.0.0
--port 8080
--jinja
-fa on
--temp 0.6
--top-p 0.95
--top-k 20
--min_p 0.0
--presence-penalty 1.5
--repeat-penalty 1.0
--cache-ram 0
--fit on
-np 2
--fit-ctx 32000
--cache-type-k q8_0
--cache-type-v q8_0
--cache-type-k-draft q8_0
--cache-type-v-draft q8_0
--log-verbosity 4
--chat-template-kwargs '{"preserve_thinking": true}'
--spec-type draft-mtp
--spec-draft-n-max 2
human_bean_@reddit
If you're using the latest llama.cpp which now has MTP, I got a huge improvement changing spec-deaft-n-max from 3 to 6.
ea_man@reddit
Me running:
* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf (no MTP)
* Qwen3.6-27B-MTP-IQ4_XS-Q8nextn.gguf
If ya wanna see the launch of the MTD:
But I think I may find more useful the non-MTD one that gives me some \~80K context:
ZealousidealBadger47@reddit
If there is Qwen 3.6 14B, how nice,.
DrBearJ3w@reddit
https://huggingface.co/AesSedai/Qwen3.6-35B-A3B-GGUF/tree/main/IQ3_S
If you want to stay PP >800 consistently
ShadyShroomz@reddit
That's the moe which is much dumber than the dense one. not sure why people recommend it as a replacement tbh. Might as well recommend 3.5 0.8b, that'll run even faster
InternationalNebula7@reddit (OP)
At what quant does the intelligence cross over for 35B-A3B’s advantage? I would think identical quants would favor 27B? For example, would q4 35B be better than q3 27B?
DrBearJ3w@reddit
4bit. Speed/and decent enough intelligence.27B would probably be too slow at 32k full context. 35B is smart enough for simple/intermediate task
grumd@reddit
You can run 35B at Q8 on a 5080 with RAM offloading and hit 1500+ pp consistently with 2048 ub
DrBearJ3w@reddit
Yeah,sure, if one needs precision.
IgnisIason@reddit
I just want to say I'm jealous. 😅
KeepyUpper@reddit
I was able to run 27B IQ4 with 80k context with turbo4 kv cache. But it was kind of dumb and couldn't fix certain bugs even when I was telling it exactly what the problem was.
It's cool getting these things to run on 16GB cards but you have to lobotomise them so heavily to make them fit it's not worth the effort. I think a higher quant 35B A3 produced better output than a gimped 27B if you're trying to use them for anything productive. It also runs faster.
LostDrengr@reddit
Same card, same model. kv's set to q4 61 t/s
mixman68@reddit
What is the more smart ? Qwen3.6-35B Q5_L_XL or Qwen3.6-27B Q3_K_S ?