Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM
Posted by Pablo_the_brave@reddit | LocalLLaMA | View on Reddit | 19 comments
Hi everyone,
I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp. I'm talking about the KS and KSS quants developed by I-Kawrakow. After many trials, I managed to create a 14.1GB model which, in my testing, delivers results highly comparable to my previous 14.7GB IQ4_XS quantization.
Model Link: cHunter789/Qwen3.6-27B-i1-IQ4_KS-GGUF
Unfortunately, the ik_llama.cpp project required to run this model is NVIDIA CUDA and CPU only. There is currently no way to run this on AMD or Apple Silicon (Metal) :/
Using this model with ik_llama.cpp and a Q4_0 Hadamard KV cache allows for a 105k context window.
Benchmark Results & Real-World Impressions
The model was heavily tested in daily production workflows for several days. It runs much faster (1.5x-1.75x) and more reliably than the previous iteration—completely eliminating the issue of "blank outputs", while the search-replace functionality works flawlessly.
- Qwen Benchmark: Successfully passed the performance evaluations on qwen3-6-27b-benchmark.vercel.app.
- Needle In A Haystack: Successfully evaluated with satisfying results across the full 100k context window.
- Comparison: In direct testing, this model performs slightly better than my previous variant:
Qwen3.6-27B-i1-IQ4_XS-GGUF.
Perplexity (PPL) Testing
Perplexity evaluations were conducted focusing exclusively on the KV Cache quantization setup (q4_0), as this is the primary target use case:
wget [https://www.gutenberg.org/files/2600/2600-0.txt](https://www.gutenberg.org/files/2600/2600-0.txt) -O pg19.txt
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 512
Test Log Output:
perplexity: calculating perplexity over 12 chunks, n_ctx=65536, batch_size=512, n_seq=1
perplexity: 71.10 seconds per pass - ETA 14.22 minutes
[1]6.6897,[2]7.0032,[3]7.1989,[4]7.3327,[5]7.4816,[6]7.3770,[7]7.4325,[8]7.4378,[9]7.4754,[10]7.5192,[11]7.5669,[12]7.4040,
Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4040 +/- 0.02773
Note: I currently do not have the capability to run KLD (Kullback–Leibler divergence) tests.
Example Server Configuration
For reference, here is the server configuration I used during my tests:
llama-server \
-m "$MODEL_PATH" \
-a Qwen3.6-27B \
--ctx-size 105000 \
--chat-template-file chat_template.jinja \
--n-gpu-layers 99 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--batch-size 512 \
--ubatch-size 256 \
--flash-attn on \
--no-mmap \
--host 0.0.0.0 \
--port 8081 \
--reasoning on \
--reasoning-format deepseek \
-t 8 \
--parallel 1 \
-khad \
-vhad \
--chat-template-kwargs '{"preserve_thinking": true}' \
--defrag-thold 0.3 \
--jinja \
--cont-batching \
--temp 0.15 \
--top-k 1 \
--min-p 0.1 \
--keep -1 \
--repeat-last-n 512 \
--repeat-penalty 1.05
```
laul_pogan@reddit
For the 16GB-only case: grumd's 35B Q8 preference implies significant CPU offload since that model is ~35GB. CPU offload at that ratio tanks decode to single-digit tok/s, so it's not a fair quality/speed tradeoff against this 14.1GB 27B fully on VRAM. The more honest 16GB comparison is 35B-A3B MoE at Q4, where sparse activation keeps active params small enough to leave VRAM headroom for the 105k KV cache. Worth benching that before concluding 27B is a quality ceiling for the card.
Ok_Mine189@reddit
You're wrong, that's not the case with MoE models.
WoodYouIfYouCould@reddit
Do you have a recommended setup for 35B-A3B on 16G. Currently running the unsloth model. Though alot of things happened in the last few weeks 😅
laul_pogan@reddit
I don't unfortunately- I do most everything on a DGX Spark and haven't had to quantize down that far
grumd@reddit
Important to note that PPL actually barely moves with kv cache quants. KLD would show the degradation much faster. As much as I'd like to use 27B on my 16GB 5080, it's quite low quality no matter what you do. I'm preferring 35B at Q8 in terms of quality
kivaougu@reddit
For me tool calling shows the real degradation much more clearly than ppl or kld
Pablo_the_brave@reddit (OP)
I have the same experience, especially search-replace or patch.
Pablo_the_brave@reddit (OP)
Will be greate to do some compare in real use of case. Have you tried that: https://qwen3-6-27b-benchmark.vercel.app/
grumd@reddit
Yeah but I don't often draw SVG chessboards, so I wouldn't say this is a good comparison. I'm using my models for coding and 35B-A3B is very consistent and smart when you give it good instructions on what to do.
redblood252@reddit
Did you try MTP ?
Pablo_the_brave@reddit (OP)
Yes, currently no chance for anything full VRAM usable at 16GB VRAM. This model with RTX5070Ti give me \~25t/s at decoding with 100k context loaded.
road-runn3r@reddit
I can get 70-80t/s on a 5070ti using unsloth's Qwen3.6-27B-UD-IQ3_XXS MTP (kv q8). Would you say the quality you get with this IQ4 is worth the 3x slower tg?
rockoruckus@reddit
Have you tried: https://huggingface.co/GianniDPC/Qwen3.6-27B-IQ4_XS-pure-with-MTP-GGUF
FerLuisxd@reddit
Tks? Did you investigate mtp/dflash/n-gram? Also nvfp4?
Pablo_the_brave@reddit (OP)
Yes, no chance currently for 16GB VRAM. Any memory/CPU offloading with a dense model means decoding at 8t/s and unacceptable slow prefill.
Kagemand@reddit
Why is ik_llama required for this type of quant?
pmttyji@reddit
Because ik_llama supports these quants while llama.cpp don't
pan-gregory@reddit
Awesome work! Always wanted to create my own quants but lacking hardware. Also check ubergarm on hf. Im using his quants with MTP on ik_llama and i highly recommend it.
https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF
icedgz@reddit
You’re the fking man