Qwen3.6-27B IQ4_XS FULL VRAM with 110k context
Posted by Pablo_the_brave@reddit | LocalLLaMA | View on Reddit | 51 comments
Qwen3.6-27B IQ4_XS Bloat: Reverting a llama.cpp commit saves 0.4GB (14.7GB vs 15.1GB) + KV Cache Tests
With the release of Qwen3.6-27B, I noticed that the current quants have bloated compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 version (Qwen3.5-27B-i1-GGUF). The Qwen3.6 equivalent (Qwen3.6-27B-i1-GGUF) now weighs 15.1GB.
The IQ4_XS is a true "unicorn" – in all benchmarks, it offers an incredible ratio of size to model quality. In practice, it is the only viable option for running a 27B model on 16GB VRAM with a decent context. Anything lower than this is unsuitable for coding tasks. Unfortunately, the increase from 14.7GB to 15.1GB breaks the experience for 16GB cards.
The Cause & The Fix
The culprit is a specific llama.cpp commit (1dab5f5a44): GitHub link. It hardcodes the attn_qkv layer quantization to a minimum of Q5_K.
To fix this, I modified the source code and replicated the original IQ4_XS layer quantization 1:1. I used the imatrix from mradermacher (Qwen3.6-27B-i1-GGUF) and performed comparative benchmarks. I observed no significant drop in model quality. In my opinion, the mentioned commit is a pure regression for the IQ4_XS format.
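For context, the commit's effect boils down to something like the snippet below inside the tensor-type selection logic. This is a simplified sketch, not the literal upstream code; my change essentially removes this floor so attn_qkv stays at IQ4_XS.

```cpp
// Simplified sketch of the post-commit behaviour (NOT the literal code in
// src/llama-quant.cpp; the function and variable names here are illustrative).
#include "ggml.h"

static ggml_type pick_attn_qkv_type(ggml_type quant_default) {
    ggml_type new_type = quant_default;      // IQ4_XS for an IQ4_XS quant
    // Commit 1dab5f5a44 floors attn_qkv at Q5_K:
    if (new_type == GGML_TYPE_IQ4_XS) {
        new_type = GGML_TYPE_Q5_K;           // reverting this keeps attn_qkv at IQ4_XS
    }
    return new_type;
}
```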
My custom 14.7GB model with reverted layers is available here: 👉 cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
Perplexity Benchmarks: 65k Context (-c 65536)
Testing parameters: pg19.txt (downloaded from Project Gutenberg here), --chunks 32, -ngl 99 (unless noted), -fa 1, -b 512, -ub 128
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 1 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | q8_0 | 7.3765 ± 0.0276 |
| 2 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.3804 ± 0.0276 |
| 3 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | turbo2 | 7.4260 ± 0.0277 |
| 4 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | turbo3 | 7.4069 ± 0.0277 |
| 5 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q4_0 | q4_0 | 7.3964 ± 0.0277 |
| 6 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | turbo3 | turbo3 | 7.4317 ± 0.0279 |
Command lines for 65k context:
1. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
2. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
3. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv turbo2 -fa 1
4. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv turbo3 -fa 1 -b 512 -ub 128
5. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 128
6. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 128
KV Cache Observations: These tests indicate that for Qwen3.6-27B, the conclusions in turboquant_plus do not apply. There is no significant benefit to keeping the K-cache at higher precision at the expense of the V-cache; for this model, the V-cache appears equally critical.
Perplexity Benchmarks: 110k Context (-c 110000)
Based on the above, I decided to use symmetric Turbo3 quantization. Combined with my custom 14.7GB model, this optimization allowed me to achieve 110k context fully within 16GB VRAM. (This took quite a while to test, so I hope you appreciate the data!)
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 7 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.5205 ± 0.0285 |
| 8 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom, selected final configuration) | turbo3 | turbo3 | 7.5758 ± 0.0287 |
| 9 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | turbo3 | turbo3 | 7.5727 ± 0.0287 |
Command lines for 110k context:
7. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 64
8. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
9. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
The Q3 Debate
There are theories floating around that the Q3 model is fine. Judge for yourselves:
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 10 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | q8_0 | q8_0 | 7.6538 ± 0.0292 |
| 11 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | turbo3 | turbo3 | 7.7085 ± 0.0295 |
Command lines for Q3 tests:
10. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
11. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
xeeff@reddit
just open a PR bro
Pablo_the_brave@reddit (OP)
To be honest, I can't figure out why ddh0 pinned those tensors to Q5_K, but he did it deliberately. It could be that the model falls apart at larger contexts; I have only checked up to 110k. Or it could just be a side effect of changes in the main loop. Maybe someone knows more about what is going on. For people with 16GB of VRAM this has a big impact; that's why I created this topic.
xeeff@reddit
... did you even open a PR? you could just let people know about the issue and it'd get fixed
Pablo_the_brave@reddit (OP)
I thought about it and created this issue: https://github.com/ggml-org/llama.cpp/issues/22544
xeeff@reddit
now imagine reporting it before you even posted - how much of people's time you would have saved if you had just opened a PR with a fix instead of making everyone follow the same steps... before opening a PR anyway
Pablo_the_brave@reddit (OP)
My goal was to show that the March commit inflated IQ4_XS model size for a negligible PPL gain. While maybe not a strict bug, the general consensus is that --tensor-type should always take precedence. Dismissing my work as a "waste of time" is unfair. I thoroughly investigated and defined the problem before opening the issue. Even if no one re-quantizes existing models, documenting this logic is valuable for the community.
Digger412@reddit
1) ddh0 is a she, and I'll point her to this thread if she wants to reply
2) you can use --tensor-type when quantizing a model to specify what level it should be quanted to, you don't need to recompile llama.cpp entirely for that
Pablo_the_brave@reddit (OP)
That's exactly the problem: after commit 1dab5f5, --tensor-type does nothing in the scenario below.
The condition if (qtype != new_type) on line ~682 in src/llama-quant.cpp blocks the override when the type requested via --tensor-type is the same as the default (iq4_xs). For attn_qkv the logic looks like this:
The user wants iq4_xs, the default is iq4_xs → qtype == new_type → the override is skipped → manual = false, so --tensor-type does nothing... Then: if (!manual) → enters llama_tensor_get_type_impl() → overwrites to Q5_K.
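In pseudo-code the flow looks roughly like this (a simplified sketch based on the description above, not the literal upstream source):

```cpp
// Simplified sketch of the override path around src/llama-quant.cpp:~682.
// Names follow the discussion above, not necessarily the upstream code.
ggml_type new_type = default_type;                  // iq4_xs for an IQ4_XS quant
bool manual = false;

for (const auto & ov : tensor_type_overrides) {     // entries from --tensor-type
    if (tensor_matches(tensor_name, ov.pattern)) {
        if (ov.type != new_type) {                  // <-- the problematic guard
            new_type = ov.type;
            manual   = true;
        }
        // user asked for iq4_xs, default is already iq4_xs -> guard is false,
        // manual stays false, and the override is silently dropped
    }
}

if (!manual) {
    // falls back to the built-in heuristics, which bump attn_qkv up to Q5_K
    new_type = llama_tensor_get_type_impl(/* ... */);
}
```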
Do you think this is something for a PR?
Master-Meal-77@reddit
ddh0 never made quants for Qwen3.6, but you can open a discussion for their Qwen3.5 quants here, i'm sure they will be happy to talk and explain more there (i have talked to them before)
FW-Connection68@reddit
I had a look at this and it is definitely just the default that is set to Q5_K.
Setting a custom override works on the latest llama.cpp. As an example, Bartowski's IQ4_KS uses the correct type for attn_qkv. It is larger (15.3GB) due to other design choices, such as the first 24 ssm_out tensors being Q8_0.
Pablo_the_brave@reddit (OP)
Thanks for your input. I have checked, and the --tensor-type parameter is still ignored in this scenario. In my opinion, when the user sets a parameter manually it should take precedence over the default settings.
llama_model_quantize_impl: have importance matrix data with 496 entries
llama_tensor_get_type: output.weight - applying manual override: iq4_xs -> q6_K
llama_tensor_get_type: blk.3.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.7.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.11.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.15.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.19.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.23.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.27.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.31.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.35.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.39.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.43.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.47.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.51.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.55.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.59.attn_v.weight - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.63.attn_v.weight - applying manual override: iq4_xs -> q5_K
[ 1/ 851] output.weight - [ 5120, 248320, 1, 1], type = bf16,
====== llama_model_quantize_impl: did not find weights for output.weight
Danmoreng@reddit
Hm… I am currently using unsloth's Qwen3.6-27B-UD-IQ3_XXS.gguf, which is just 12GB. Gets me around 90k ctx with K/V at q8_0. Would be nice if Q4 worked, but at 14.7GB there is no room for context without turbo3, and llama.cpp doesn't support that yet, right?
btw, for single-user use the better speculative decoding option is ngram-map-k over ngram-mod.
ComfyUser48@reddit
You gotta use llama.cpp that supports it, like https://github.com/TheTom/llama-cpp-turboquant
Danmoreng@reddit
I simply prefer to not use a fork.
tomByrer@reddit
Many of the fork maintainers used to be llama.cpp team members.
Pablo_the_brave@reddit (OP)
Thanks for the tip! I will check it out. This is a new thing for me, and I'm not even sure if these new ngrams are worth the effort. As you can see in my tests, even the strongest Q3 is worse than IQ4_XS with turbo3. How will it behave in real life? Worth a try! 😉
Danmoreng@reddit
Well it's for the specific use case of repeated text from the context. So if you let it regenerate long sections of code with minor changes, a lot of the time ngram will hit. I get ~27-30 t/s on initial generation for an HTML website on my 5080 (mobile). Then I tell it to make some edit and the subsequent rewrite goes up to 50 t/s because large parts are identical to before.
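Rough idea of what the n-gram draft does, as a toy sketch (the general technique only, not the actual llama.cpp or fork implementation):

```cpp
// Toy sketch of n-gram speculative drafting: if the last n tokens already
// appeared earlier in the context, propose the tokens that followed them as a
// draft for the target model to verify. General idea only, not the real code.
#include <algorithm>
#include <cstdint>
#include <vector>

using Token = int32_t;

std::vector<Token> draft_from_ngram(const std::vector<Token> & ctx, size_t n, size_t max_draft) {
    if (ctx.size() <= n) return {};
    const auto key_begin = ctx.end() - n;                // the last n tokens
    // look for an earlier occurrence of the same n-gram
    for (size_t i = 0; i + n < ctx.size() - 1; ++i) {
        if (std::equal(key_begin, ctx.end(), ctx.begin() + i)) {
            const size_t start = i + n;                  // tokens that followed it last time
            const size_t len   = std::min(max_draft, ctx.size() - start);
            return std::vector<Token>(ctx.begin() + start, ctx.begin() + start + len);
        }
    }
    return {};                                           // no hit -> plain decoding
}
```

A real implementation would key a hash map on the n-gram instead of scanning, so misses cost almost nothing; on a hit the target model verifies the whole draft in one pass, which is where the rewrite speed-up comes from.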
Arkenstonish@reddit
I'm being theoretical here, but with -np > 2 and --kv-unified, will speculation trigger for different concurrent requests if their output is streamed through a grammar?
So it's always JSON and mostly the same - will spec operate on the unified cache? Or is it bound per request/slot?
Danmoreng@reddit
There are different implementations for this; if you want it shared between server slots, ngram-mod is the correct one. Documented here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md#n-gram-mod-ngram-mod
ComfyUser48@reddit
I can't get this working. I'm OOM with 110k ctx. What am I missing? I am running llama.cpp with turbo quant support
tomByrer@reddit
Web browsers can use the GPU.
Pablo_the_brave@reddit (OP)
110k is possible with turbo3, and only if the GPU is dedicated to the LLM (no display manager running on it). This is my setup for 110k, but I'm using a rather advanced setup with a script and a router, so you will have to adapt it to your needs. batch-size and ubatch-size are critical:
--models-preset model.ini \
--models-max 1 \
--host 0.0.0.0 \
--port 8081 \
-t 8 \
--parallel 1 \
--cont-batching \
--keep -1 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--defrag-thold 0.3 \
--cache-reuse 1024 \
--jinja \
--temp 0.15 \
--top-k 1 \
--min-p 0.1 \
--spec-type ngram-mod \
--spec-ngram-size-n 24 \
--draft-min 4 \
--draft-max 64 \
--repeat-last-n 512 \
--repeat-penalty 1.05
model.ini:
model = models/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf
ctx-size = 110000
chat-template-file = qwen36/chat_template.jinja
n-gpu-layers = 99
cache-type-k = turbo3
cache-type-v = turbo3
batch-size = 512
ubatch-size = 256
flash-attn = true
no-mmap = true
ComfyUser48@reddit
I have successfully loaded it with 100k context with turbo3. And yes, it's a server running Ubuntu; I'm only connecting to it remotely, no GUI. Thank you so much!
Hodler-mane@reddit
what about tool calling, reasoning etc, is it all working? in opencode?
ComfyUser48@reddit
Yes it's all working. I'm shocked tbh. Getting 24 tokens per sec on my 5060 ti 16gb. Could swap for 5070 ti and I'd probably get double
sylverCode@reddit
There was another post not long ago about an IQ4_XS at 14.3GB, might be of interest to you: https://www.reddit.com/r/LocalLLaMA/comments/1svnmgo/quant_qwen3627b_on_16gb_vram_with_100k_context/
Pablo_the_brave@reddit (OP)
THX, I have tested it but --pure has a much bigger impact.
[Screenshot: Qwen 3.6 27B (IQ4_XS) perplexity (PPL) test results - context 110K, turbo3/turbo3 KV cache]
It's interesting to see how the PPL shifts as the file size decreases. The 14.3GB version shows a noticeable jump in PPL compared to the slightly larger ones.
-Ellary-@reddit
I'm using Bartowski's quants - Qwen_Qwen3.6-27B-IQ4_XS.gguf at 14.2GB.
The mmproj for vision I just keep in CPU RAM with --no-mmproj-offload.
Pablo_the_brave@reddit (OP)
There is no IQ4_XS at 14.2GB.
Sensitive_Ganache571@reddit
12GB VRAM... :(
Glittering-Call8746@reddit
Is it worth buying a 5060 Ti 16GB (elevated prices, closer to a 5070) atm?
ea_man@reddit
If you care, I just bought a used AMD 6800 yesterday for $250.
Glittering-Call8746@reddit
512GB/s not too bad
ea_man@reddit
Also it's decently power efficient: the default is 200W and you can turn that way down. I hope I can run it at ~120W so I can run a couple of them without changing the PSU.
Mine makes no coil whine, unlike a 6700 XT.
Tempest_nano@reddit
If it is for this model, it would be memory-bandwidth bound rather than compute bound. Compare on that metric.
Glittering-Call8746@reddit
What's a good minimum memory bandwidth? 9070 XT speeds?
Tempest_nano@reddit
On my 5080 Laptop, I have 896 GB/s bandwidth (not sure how real this is), and I settled at 25.7 tok/s with 100k context in Windows. The 9070 XT gets 640 GB/s or so, and the 5060 Ti is 448 GB/s. The internet seems to think the 9070 XT is the best of the bunch in that respect. I can't speak to how the different interface (the 9070 XT would use HIP/ROCm, and the 5060 Ti would use CUDA) would affect things.
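A rough way to compare them: for a dense model, single-stream decode speed is roughly capped by memory bandwidth divided by the bytes read per token (about the model size), ignoring KV-cache reads and overhead. A quick sketch with the numbers above:

```cpp
// Rough upper bound on single-stream decode speed for a dense model:
// every generated token has to stream roughly all weights from VRAM once.
// Ignores KV-cache reads, overhead and real-world efficiency (<100%).
#include <cstdio>

int main() {
    const double model_gb = 15.1;                 // IQ4_XS file size in GB
    const struct { const char *name; double bw_gbs; } gpus[] = {
        {"RTX 5080 Laptop", 896.0},
        {"RX 9070 XT",      640.0},
        {"RTX 5060 Ti",     448.0},
    };
    for (const auto & g : gpus) {
        printf("%-16s ~%.0f t/s theoretical ceiling\n", g.name, g.bw_gbs / model_gb);
    }
    return 0;
}
```

Real throughput lands well below that ceiling once long context and KV-cache traffic are factored in, which matches the ~24-26 t/s people report in this thread.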
hybrid_aries@reddit
I was wondering if it was possible to get a better 27B quant than the IQ3_XXS in 16GB! I figured it was impossible to get one at a decent context since I run IQ3_XXS at around 100k context via mainline llama.cpp w/ Vulkan. I have an older 16GB RDNA2 card, and now I'm able to run your custom IQ4_XS model with a similar size context! I had to install ROCm & compile that custom llama.cpp-turboquant branch, but wow is it worth it! Like magic I went up an entire quant. Thank you so much for your work on this!!
Tempest_nano@reddit
I was tinkering last night with the unsloth version of IQ4_XS and buun-llama-cpp. I found that I got good results with a ctk/ctv of turbo4. It doesn't compress the cache as much as turbo3, but its perplexity and KLD were much better. It allowed me to hit 64k context vs 32k with q8_0. I will find the numbers and post them here. Thanks for your work, I will try this quant. It was driving me up the wall that I couldn't hit 128k context to allow full thinking (per the model card).
NickCanCode@reddit
Are you using a single card? I am using dual cards and it will crash after thinking for a few seconds.
Tempest_nano@reddit
I am using a single card for this model. I have absolutely used multiple cards for the MoE models (Qwen3.6 35B A3B), putting the experts on my AMD iGPU, but there wasn't much benefit over CPU. This 27B model is a dense model, so it all needs to be on the same device. At least I thought so, but I have tried so many permutations that it all gets fuzzy.
Borkato@reddit
Is buun-llama-cpp worth trying? Does it actually speed anything up, or is it just context?
Tempest_nano@reddit
From my understanding it is just context compression. It is one of the two llama.cpp implementations of turboquant, with the other being https://github.com/TheTom/llama-cpp-turboquant . I believe that buun's fork is more bleeding-edge (he seems to be playing with turboquant and speculative decoding), but building is DIY. I am getting 25 t/s on my laptop (AMD AI HX 375, 32GB RAM, 16GB 5080) at 64k context on the IQ4 model.
My build script optimized for Nvidia + Strix Point (powershell):
cmake --build buun-llama-cpp/build --config Release --parallel 2>&1 | Tee-Object -FilePath out.txt -Append
Skyne98@reddit
It allows you to use dflash, and yes, depending on the GPU it's a giant speed boost for short ctx and special cases.
ea_man@reddit
Ain't that because 3.6 has larger Hidden Dimension?
[3.5 vs 3.6 language model spec comparison]
Pablo_the_brave@reddit (OP)
No, that impacts the KV cache size, not the model size 😉
ea_man@reddit
Investigating now, but all the AIs tell me otherwise:
Hidden size ↑ → model size ↑ AND KV cache ↑
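For what it's worth, the KV cache grows with layers × KV heads × head dim × context length, which is related to the hidden size but (with GQA) not identical to it. A back-of-the-envelope sketch with purely made-up numbers, not the real Qwen3.6-27B config:

```cpp
// Back-of-the-envelope KV-cache size. The dimensions below are purely
// illustrative placeholders, NOT the real Qwen3.6-27B config -- read the
// actual values from the GGUF metadata before trusting the result.
#include <cstdio>

int main() {
    const double n_layers      = 48;       // hypothetical
    const double n_kv_heads    = 4;        // hypothetical (GQA: fewer KV heads than Q heads)
    const double head_dim      = 128;      // hypothetical
    const double n_ctx         = 110000;
    const double bytes_per_elt = 1.0;      // ~q8_0; fp16 would be ~2, turbo3 less

    // K and V each store n_layers * n_kv_heads * head_dim values per token
    const double bytes = 2.0 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elt;
    printf("KV cache ~= %.2f GiB\n", bytes / (1024.0 * 1024 * 1024));
    return 0;
}
```

Either way, the 15.1GB vs 14.7GB gap discussed in this particular post comes from the Q5_K attn_qkv floor, not from the hidden size.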
DefNattyBoii@reddit
Noob question, but are there any ways to push for better quality Q3 quants? 12GB VRAM here + my old GPU. The Hadamard-Lloyd quant from caiovicentino1 on Hugging Face is interesting, but it mostly focuses on Q4-Q5.
moahmo88@reddit
Good job! Thanks for sharing!
ComfyUser48@reddit
I have a second PC with an RTX 5070 Ti 16GB lying around. Gonna try this and will report!
Thanks!