Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Posted by Pablo_the_brave@reddit | LocalLLaMA | View on Reddit | 19 comments

Hi everyone,

I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp. I'm talking about the KS and KSS quants developed by I-Kawrakow. After many trials, I managed to create a 14.1GB model which, in my testing, delivers results highly comparable to my previous 14.7GB IQ4_XS quantization.

Model Link: cHunter789/Qwen3.6-27B-i1-IQ4_KS-GGUF

Unfortunately, the ik_llama.cpp project required to run this model is NVIDIA CUDA and CPU only. There is currently no way to run this on AMD or Apple Silicon (Metal) :/

Using this model with ik_llama.cpp and a Q4_0 Hadamard KV cache allows for a 105k context window.

Benchmark Results & Real-World Impressions

The model was heavily tested in daily production workflows for several days. It runs much faster (1.5x-1.75x) and more reliably than the previous iteration—completely eliminating the issue of "blank outputs", while the search-replace functionality works flawlessly.

Qwen Benchmark: Successfully passed the performance evaluations on qwen3-6-27b-benchmark.vercel.app.
Needle In A Haystack: Successfully evaluated with satisfying results across the full 100k context window.
Comparison: In direct testing, this model performs slightly better than my previous variant: Qwen3.6-27B-i1-IQ4_XS-GGUF.

Perplexity (PPL) Testing

Perplexity evaluations were conducted focusing exclusively on the KV Cache quantization setup (q4_0), as this is the primary target use case:

wget [https://www.gutenberg.org/files/2600/2600-0.txt](https://www.gutenberg.org/files/2600/2600-0.txt) -O pg19.txt

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 512

Test Log Output:

perplexity: calculating perplexity over 12 chunks, n_ctx=65536, batch_size=512, n_seq=1
perplexity: 71.10 seconds per pass - ETA 14.22 minutes
[1]6.6897,[2]7.0032,[3]7.1989,[4]7.3327,[5]7.4816,[6]7.3770,[7]7.4325,[8]7.4378,[9]7.4754,[10]7.5192,[11]7.5669,[12]7.4040,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4040 +/- 0.02773

Note: I currently do not have the capability to run KLD (Kullback–Leibler divergence) tests.

Example Server Configuration

For reference, here is the server configuration I used during my tests:

llama-server \
        -m "$MODEL_PATH" \
        -a Qwen3.6-27B \
        --ctx-size 105000 \
        --chat-template-file chat_template.jinja \
        --n-gpu-layers 99 \
        --cache-type-k q4_0 \
        --cache-type-v q4_0 \
        --batch-size 512 \
        --ubatch-size 256 \
        --flash-attn on \
        --no-mmap \
        --host 0.0.0.0 \
        --port 8081 \
        --reasoning on \
        --reasoning-format deepseek \
        -t 8 \
        --parallel 1 \
        -khad \
        -vhad \
        --chat-template-kwargs '{"preserve_thinking": true}' \
        --defrag-thold 0.3 \
        --jinja \
        --cont-batching \
        --temp 0.15 \
        --top-k 1 \
        --min-p 0.1 \
        --keep -1 \
        --repeat-last-n 512 \
        --repeat-penalty 1.05

```

[-]

laul_pogan@reddit

For the 16GB-only case: grumd's 35B Q8 preference implies significant CPU offload since that model is ~35GB. CPU offload at that ratio tanks decode to single-digit tok/s, so it's not a fair quality/speed tradeoff against this 14.1GB 27B fully on VRAM. The more honest 16GB comparison is 35B-A3B MoE at Q4, where sparse activation keeps active params small enough to leave VRAM headroom for the 105k KV cache. Worth benching that before concluding 27B is a quality ceiling for the card.

Ok_Mine189@reddit

You're wrong, that's not the case with MoE models.

WoodYouIfYouCould@reddit

Do you have a recommended setup for 35B-A3B on 16G. Currently running the unsloth model. Though alot of things happened in the last few weeks 😅

I don't unfortunately- I do most everything on a DGX Spark and haven't had to quantize down that far

grumd@reddit

Important to note that PPL actually barely moves with kv cache quants. KLD would show the degradation much faster. As much as I'd like to use 27B on my 16GB 5080, it's quite low quality no matter what you do. I'm preferring 35B at Q8 in terms of quality

kivaougu@reddit

For me tool calling shows the real degradation much more clearly than ppl or kld

Pablo_the_brave@reddit (OP)

I have the same experience, especially search-replace or patch.

Will be greate to do some compare in real use of case. Have you tried that: https://qwen3-6-27b-benchmark.vercel.app/

Yeah but I don't often draw SVG chessboards, so I wouldn't say this is a good comparison. I'm using my models for coding and 35B-A3B is very consistent and smart when you give it good instructions on what to do.