Qwen3.6-27B IQ4_XS FULL VRAM with 110k context

Posted by Pablo_the_brave@reddit | LocalLLaMA | View on Reddit | 51 comments

Qwen3.6-27B IQ4_XS Bloat: Reverting llama.cpp commit saves 16GB VRAM (14.7GB vs 15.1GB) + KVCache Tests

With the release of Qwen3.6-27B, I noticed that compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 version (Qwen3.5-27B-i1-GGUF), the current images have bloated. The Qwen3.6 equivalent (Qwen3.6-27B-i1-GGUF) now weighs 15.1GB.

The IQ4_XS is a true "unicorn" – in all benchmarks, it offers an incredible ratio of size to model quality. In practice, it is the only viable option for running a 27B model on 16GB VRAM with a decent context. Anything lower than this is unsuitable for coding tasks. Unfortunately, the increase from 14.7GB to 15.1GB breaks the experience for 16GB cards.

The Cause & The Fix The culprit is a specific llama.cpp commit (1dab5f5a44): GitHub link. Its effect is hardcoding attn_qkv layer quantizations to a minimum of Q5_K.

To fix this, I modified the source code and replicated the original IQ4_XS layer quantization 1:1. I used the imatrix from mradermacher (Qwen3.6-27B-i1-GGUF) and performed comparative benchmarks. I observed no significant drop in model quality. In my opinion, the mentioned commit is a pure regression for the IQ4_XS format.

My custom 14.7GB model with reverted layers is available here: 👉 cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF

Perplexity Benchmarks: 65k Context (-c 65536)

Testing parameters: pg19.txt (downloaded from Project Gutenberg here), --chunks 32, -ngl 99 (unless noted), -fa 1, -b 512, -ub 128

ID	Model Size	Model File / Version	`-ctk`	`-ctv`	Final PPL
1	15.1GB	`Qwen3.6-27B.i1-IQ4_XS.gguf` (Standard)	`q8_0`	`q8_0`	7.3765 ± 0.0276
2	14.7GB	`...-IQ4_XS-attn_qkv-IQ4_XS.gguf` (Custom)	`q8_0`	`q8_0`	7.3804 ± 0.0276
3	14.7GB	`...-IQ4_XS-attn_qkv-IQ4_XS.gguf` (Custom)	`q8_0`	`turbo2`	7.4260 ± 0.0277
4	15.1GB	`Qwen3.6-27B.i1-IQ4_XS.gguf` (Standard)	`q8_0`	`turbo3`	7.4069 ± 0.0277
5	14.7GB	`...-IQ4_XS-attn_qkv-IQ4_XS.gguf` (Custom)	`q4_0`	`q4_0`	7.3964 ± 0.0277
6	14.7GB	`...-IQ4_XS-attn_qkv-IQ4_XS.gguf` (Custom)	`turbo3`	`turbo3`	7.4317 ± 0.0279

Command lines for 65k context:

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv turbo2 -fa 1
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv turbo3 -fa 1 -b 512 -ub 128
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 128
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 128

KV Cache Observations: These tests indicate that for Qwen3.6-27B, the conclusions in turboquant_plus do not apply. There is no significant benefit to increasing K-cache at the expense of V-cache. In fact, for this model, the V-cache appears equally critical.

Perplexity Benchmarks: 110k Context (-c 110000)

Based on the above, I decided to use symmetric Turbo3 quantization. Combined with my custom 14.7GB model, this optimization allowed me to achieve 110k context fully within 16GB VRAM. (This took quite a while to test, so I hope you appreciate the data!)

ID	Model Size	Model File / Version	`-ctk`	`-ctv`	Final PPL
7	14.7GB	`...-IQ4_XS-attn_qkv-IQ4_XS.gguf` (Custom)	`q8_0`	`q8_0`	7.5205 ± 0.0285
8	14.7GB	Selected Final Configuration	turbo3	turbo3	7.5758 ± 0.0287
9	15.1GB	`Qwen3.6-27B.i1-IQ4_XS.gguf` (Standard)	`turbo3`	`turbo3`	7.5727 ± 0.0287

Command lines for 110k context:
7. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 64
8. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
9. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256

The Q3 Debate

There are theories floating around that the Q3 model is fine. Judge for yourselves:

ID	Model Size	Model File / Version	`-ctk`	`-ctv`	Final PPL
10	Q3_K_L	`Qwen3.6-27B.i1-Q3_K_L.gguf`	`q8_0`	`q8_0`	7.6538 ± 0.0292
11	Q3_K_L	`Qwen3.6-27B.i1-Q3_K_L.gguf`	`turbo3`	`turbo3`	7.7085 ± 0.0295

Command lines for Q3 tests:
10. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
11. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256

[-]

xeeff@reddit

just open a PR bro

[-]

Pablo_the_brave@reddit (OP)

To be honest, I can't figure out why ddh0 froze those tensors for Q5_K, but he did it himself. It could be that the model will explode at a larger context; I have only checked up to 110k. Sure, this could just be a side effect of the changes in the main loop. Maybe someone knows more about what is going on. For people with 16GB of VRAM, this is a big impact. That's why I created this topic.

[-]

xeeff@reddit

... did you even open a PR? you could just let people know about the issue and it'd get fixed

[-]

Pablo_the_brave@reddit (OP)

I have thinked about it an create this PR. https://github.com/ggml-org/llama.cpp/issues/22544

[-]

xeeff@reddit

now imagine reporting it before you even posted - how much people's time you would have saved if you just opened a PR for a fix instead of making everyone follow the same steps... before opening a PR anyway

[-]

Pablo_the_brave@reddit (OP)

My goal was to show that the March commit for IQ4_XS inflated model size with negligible PPL gain. While maybe not a strict bug, the general consensus is that --tensor-type should always take precedence.

Dismissing my work as a "waste of time" is unfair. I thoroughly investigated and defined the problem before opening the PR. Even if no one re-quantizes existing models, documenting this logic is valuable for the community.

[-]

Digger412@reddit

1) ddh0 is a she, and I'll point her to this thread if she wants to reply

2) you can use --tensor-type when quantizing a model to specify what level it should be quanted to, you don't need to recompile llama.cpp entirely for that

[-]

Pablo_the_brave@reddit (OP)

Thats the case, --tensor-type after the commit 1dab5f5 do nothing in below scenario:

The condition if (qtype != new_type) on line \~682 in src/llama-quant.cpp is blocking the override when the type from --tensor-type is the same as the default (iq4_xs). For attn_qkv the logic looks like:

The user wants iq4_xs, the default is iq4_xs → qtype == new_type → the override is skipped → manual = false and --tensor-type do nothing... Then: if (!manual) → enters llama_tensor_get_type_impl() → overwrites to Q5_K

Do you think this is something for PR?

[-]

Master-Meal-77@reddit

ddh0 never made quants for Qwen3.6, but you can open a discussion for their Qwen3.5 quants here, i'm sure they will be happy to talk and explain more there (i have talked to them before)

[-]

FW-Connection68@reddit

I had a look at this and it is definitely just the default that is set to Q5_K.

Setting a custom override works on latest lLama.cpp. As and example, Bartowskis IQ4_KS uses the correct type for attn_qkv. It is larger (15.3Gb) due to other design choices, such as the first 24 ssm_out being Q8_0.

[-]

Pablo_the_brave@reddit (OP)

Thanks for your input. I have checked and still the --tensor-type parameter is ignored in this scenario. In my opinion, if the user is set parameters manually it should take precedence over the default settings.

llama_model_quantize_impl: have importance matrix data with 496 entries
llama_tensor_get_type: output.weight                        - applying manual override: iq4_xs -> q6_K
llama_tensor_get_type: blk.3.attn_v.weight                  - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.7.attn_v.weight                  - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.11.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.15.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.19.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.23.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.27.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.31.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.35.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.39.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.43.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.47.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.51.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.55.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.59.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
llama_tensor_get_type: blk.63.attn_v.weight                 - applying manual override: iq4_xs -> q5_K
[   1/ 851] output.weight                        - [ 5120, 248320,      1,      1], type =   bf16,
====== llama_model_quantize_impl: did not find weights for output.weight

[-]

Danmoreng@reddit

Hm…I am currently using unsloths Qwen3.6-27B-UD-IQ3_XXS.gguf which is just 12Gb. Gets me around 90k ctx with K/V at q8_0. Would be nice if Q4 works, but at 14.7Gb there is no room for context without turbo3 and llama.cpp doesn’t support that yet, right?

btw for single user use the better speculative decoding option is ngram-map-k over ngram-mod.

[-]

ComfyUser48@reddit

You gotta use llama.cpp that supports it, like https://github.com/TheTom/llama-cpp-turboquant

[-]

Danmoreng@reddit

I simply prefer to not use a fork.

[-]

tomByrer@reddit

Many of the forks used to be llama.cpp team members.

[-]

Pablo_the_brave@reddit (OP)

Thanks for the tip! I will check it out. This is a new thing for me**,** and I'm not even sure if these new ngrams are worth the effort. As you can see in my tests, even the strongest Q3 is worse than IQ4_XS with turbo3. How will it be in real life? Worth a try! 😉

[-]

Danmoreng@reddit

Well it's for the specific use case of repeated text from the context. So if you let it regenerate long sections of code with minor changes, a lot of the time ngram will hit. I get ~27-30 t/s on initial generation for an HTML website on my 5080 (mobile). Then I tell it to make some edit and the subsequent rewrite goes up to 50 t/s because large parts are identical to before.

[-]

Arkenstonish@reddit

I'm theoretical here, but with -np >2 and —kv-unified will speculation trigger for different concurrent request of their output is streamed through grammatic?

So it's always json and same mostly - will spec operate on unified cache? Or is it request/slot bounded?

[-]

Danmoreng@reddit

There are different implementations for this, if you want to have it shared between server slots ngram-mod is the correct one. Documented here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md#n-gram-mod-ngram-mod

[-]

Pablo_the_brave@reddit (OP)

THX, I have tested it but --pure has a much bigger impact.

Qwen 3.6 27B (IQ4_XS) - Perplexity (PPL) Test Results

Context: 110K | Version: turbo3/turbo3

File Size	Final PPL Estimate	+/- Error	Step-by-step PPL Results
15.1GB	7.5727	0.02876	[1]6.9506, [2]7.2310, [3]7.5049, [4]7.4255, [5]7.4722, [6]7.5441, [7]7.5727
14.7GB	7.5758	0.02878	[1]6.9499, [2]7.2330, [3]7.5052, [4]7.4266, [5]7.4748, [6]7.5466, [7]7.5758
14.3GB	7.6171	0.02891	[1]6.9894, [2]7.2823, [3]7.5561, [4]7.4711, [5]7.5183, [6]7.5883, [7]7.6171

[-]

ComfyUser48@reddit

I can't get this working. I'm OOM with 110k ctx. What am I missing? I am running llama.cpp with turbo quant support

[-]

tomByrer@reddit

Webbrowsers can use GPU.

[-]

Pablo_the_brave@reddit (OP)

110k is possible with Turbo3 and only if the GPU is dedicated for LLM only (no any DM at it). This is my setup for 110k but I'm using rather advenced setup with script and router so you have to change it for your needs. batch-size and ubatch-size are critical:

--models-preset model.ini \
--models-max 1 \
--host 0.0.0.0 \
--port 8081 \
-t 8 \
--parallel 1 \
--cont-batching \
--keep -1 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--defrag-thold 0.3 \
--cache-reuse 1024 \
--jinja \
--temp 0.15 \
--top-k 1 \
--min-p 0.1 \
--spec-type ngram-mod \
--spec-ngram-size-n 24 \
--draft-min 4 \
--draft-max 64 \
--repeat-last-n 512 \
--repeat-penalty 1.05

model.ini:
model = models/Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf
ctx-size = 110000
chat-template-file = qwen36/chat_template.jinja
n-gpu-layers = 99
cache-type-k = turbo3
cache-type-v = turbo3
batch-size = 512
ubatch-size = 256
flash-attn = true
no-mmap = true

[-]

ComfyUser48@reddit

I have successfully loaded it with 100k context with turbo3. And yes it's a server running Ubuntu, I'm only connecting it to remotely, no gui. Thank you so much !

[-]

Hodler-mane@reddit

what about tool calling, reasoning etc, is it all working? in opencode?

[-]

ComfyUser48@reddit

Yes it's all working. I'm shocked tbh. Getting 24 tokens per sec on my 5060 ti 16gb. Could swap for 5070 ti and I'd probably get double

[-]

sylverCode@reddit

There was another post not long time ago of IQ4_XS at 14.3GB, might be of interest to you: https://www.reddit.com/r/LocalLLaMA/comments/1svnmgo/quant_qwen3627b_on_16gb_vram_with_100k_context/

[-]

Pablo_the_brave@reddit (OP)

THX, I have tested it but --pure has a much bigger impact.

Qwen 3.6 27B (IQ4_XS) - Perplexity (PPL) Test Results

Context: 110K | Version: turbo3/turbo3

File Size	Final PPL Estimate	+/- Error	Step-by-step PPL Results
15.1GB	7.5727	0.02876	[1]6.9506, [2]7.2310, [3]7.5049, [4]7.4255, [5]7.4722, [6]7.5441, [7]7.5727
14.7GB	7.5758	0.02878	[1]6.9499, [2]7.2330, [3]7.5052, [4]7.4266, [5]7.4748, [6]7.5466, [7]7.5758
14.3GB	7.6171	0.02891	[1]6.9894, [2]7.2823, [3]7.5561, [4]7.4711, [5]7.5183, [6]7.5883, [7]7.6171

It's interesting to see how the PPL shifts as the file size decreases. The 14.3GB version shows a noticeable jump in PPL compared to the slightly larger ones.

[-]

-Ellary-@reddit

I'm using Bartowski's Qs - Qwen_Qwen3.6-27B-IQ4_XS.gguf 14.2GB,
mmproj for vision I just upload to CPU ram with --no-mmproj-offload.

[-]

Pablo_the_brave@reddit (OP)

There is no any IQ4_XS 14.2GB.

[-]

Sensitive_Ganache571@reddit

12gb vram...(

[-]

Glittering-Call8746@reddit

Is it worth to buy 5060ti 16gb (elevated prices and closer to 5070) atm ?

[-]

ea_man@reddit

If you care I just bought an used AMD 6800 yesterday for 250$:

prompt eval time =  110973.76 ms / 22538 tokens (    4.92 ms per token,   203.09 tokens per second)       eval time =   48434.97 ms /  1071 tokens (   45.22 ms per token,    22.11 tokens per second)

llama-server \
 -m /home/eaman/lm/models/mradermacher/Qwen3.6-27B.i1-IQ4_XS.gguf \
        --host 0.0.0.0 \
        -np 1 \
        --fit-target 10 \
        -ctk q8_0 \
        -ctv q8_0 \
        -fa on \
        --temp 0.7 \
        --top-p 0.80 \
        --top-k 20 \
        --min-p 0.0 \
        --repeat-penalty 1.0 \
        --presence_penalty 1.5 \
        -b 512 \
        --jinja  \
        --no-mmap \

[-]

Glittering-Call8746@reddit

512GB/s not too bad

[-]

ea_man@reddit

Also it's a decently power efficient, default is 200w you can run that far down, I hope I can run it at \~120w so I can run a couple without changing the PSU.

Mine makes no coil whine compared to a 6700xt.

[-]

Tempest_nano@reddit

If it is for this model, it would be memory bandwidth bound rather than compute. Compare on that metric.

[-]

Glittering-Call8746@reddit

What's a good memory bandwith as min ? 9070xt speeds ?

[-]

Tempest_nano@reddit

On my 5080 Laptop, I have 896 GB/s bandwidth (not sure how real this is), and I settled at 25.7 tok/s with 100k context in Windows. The 9070xt gets 640 GB/s or so, and the 5060Ti is 448 GB/s. The internet seems to think the 9070xt is the best of the bunch in that respect. I can't speak to how the different interface (9070xt would use HIP/ROCm, adn the 5060Ti would use CUDA) would affect things.

[-]

hybrid_aries@reddit

I was wondering if it was possible to get a better 27B quant than the IQ3_XXS in 16GB! I figured it was impossible to get one at a decent context since I run IQ3_XXS at around 100k context via mainline llama.cpp w/ Vulkan. I have an older 16GB RDNA2 card, and now I'm able to run your custom IQ4_XS model with a similar size context! I had to install ROCm & compile that custom llama.cpp-turboquant branch, but wow is it worth it! Like magic I went up an entire quant. Thank you so much for your work on this!!

[-]

Tempest_nano@reddit

Tinkering last night with the unsloth version of IQ4-XS and buun-llama-cpp. I found that I got good results with a ctv/ctq of turbo4. It doesn't compress the cache as much as turbo3, but its perplexity and KLD were much better. It allowed me to hit 64k context vs 32k with q8_0. I will find the numbers and post them here. Thanks for your work, I will try this image. It was driving me up the wall that I couldn't hit 128k context to allow full thinking (per the model card).

[-]

NickCanCode@reddit

Are you using a single card? I am using dual cards and it will crash after thinking for a few second.

[-]

Tempest_nano@reddit

I am using a single card for this model. I have absolutely used multiple cards for the MoE models (Qwen3.6 35b A3b), putting the experts on my AMD iGPU, but there wasn't much benefit over cpu. This 27b model is a dense model, so it all needs to be on the same device. At least I thought so, but I have tried so many pertubations that it all gets fuzzy.

[-]

Borkato@reddit

Is buun-llama cpp worth trying? Does it actually speed anything up or is it just context?

[-]

Tempest_nano@reddit

From my understanding it is just context compression. It is one of the two llama.cpp implementations of turboquant, with the other being https://github.com/TheTom/llama-cpp-turboquant . I believe that buun's fork is more bleeding-edge (he seems to be playing with turboquant and speculative decoding), but building is dyi. I am getting 25 t/s on my laptop, AMD AI HX 375, 32GB Ram, 16GB 5080 at 64k context on the IQ4 model.

My build script optimized for Nvidia + Strix Point (powershell):

$PSNativeCommandUseErrorActionPreference = $false
$ErrorActionPreference = 'Continue'

# Wipe build dir to avoid stale cmake cache
Remove-Item -Recurse -Force buun-llama-cpp\build -ErrorAction SilentlyContinue

# Bootstrap VS Build Tools environment (sets INCLUDE, LIB, PATH for clang-cl/link/etc.)
$vcvars = "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
cmd /c "`"$vcvars`" && set" | ForEach-Object {
    if ($_ -match "^([^=]+)=(.*)$") {
        [System.Environment]::SetEnvironmentVariable($Matches[1], $Matches[2], 'Process')
    }
}

# Prepend ROCm so hipcc and cmake find-modules are reachable
$env:PATH    = "C:\Program Files\AMD\ROCm\7.1\bin;$env:PATH"
$env:HIP_PATH = "C:\Program Files\AMD\ROCm\7.1"

cmake -B buun-llama-cpp/build -S buun-llama-cpp -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_C_COMPILER=clang-cl `
  -DCMAKE_CXX_COMPILER=clang-cl `
  -DCMAKE_CXX_FLAGS="/EHsc" `
  -DGGML_CUDA=ON `
  -DCMAKE_CUDA_ARCHITECTURES="120a-real" `
  "-DCMAKE_CUDA_FLAGS=-use_fast_math -diag-suppress 221,177" `
  -DGGML_AVX512=ON `
  -DGGML_AVX512_VBMI=ON `
  -DGGML_AVX512_VNNI=ON `
  -DGGML_AVX512_BF16=ON `
  -DGGML_AVX_VNNI=ON `
  -DGGML_BMI2=ON `
  -DGGML_CUDA_GRAPHS=ON `
  -DGGML_CUDA_FA_ALL_QUANTS=ON `
  -DGGML_NATIVE=OFF `
  -DGGML_BACKEND_DL=ON `
  -DGGML_HIP=ON `
  -DGPU_TARGETS="gfx1150" `
  -DGGML_LTO=ON 2>&1 | Tee-Object -FilePath out.txt

cmake --build buun-llama-cpp/build --config Release --parallel 2>&1 | Tee-Object -FilePath out.txt -Append

[-]