Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Posted by Pablo_the_brave@reddit | LocalLLaMA | View on Reddit | 19 comments

Hi everyone,

I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp. I'm talking about the KS and KSS quants developed by I-Kawrakow. After many trials, I managed to create a 14.1GB model which, in my testing, delivers results highly comparable to my previous 14.7GB IQ4_XS quantization.

Model Link: cHunter789/Qwen3.6-27B-i1-IQ4_KS-GGUF

Unfortunately, the ik_llama.cpp project required to run this model is NVIDIA CUDA and CPU only. There is currently no way to run this on AMD or Apple Silicon (Metal) :/

Using this model with ik_llama.cpp and a Q4_0 Hadamard KV cache allows for a 105k context window.

Benchmark Results & Real-World Impressions

The model was heavily tested in daily production workflows for several days. It runs much faster (1.5x-1.75x) and more reliably than the previous iteration—completely eliminating the issue of "blank outputs", while the search-replace functionality works flawlessly.

Perplexity (PPL) Testing

Perplexity evaluations were conducted focusing exclusively on the KV Cache quantization setup (q4_0), as this is the primary target use case:

wget [https://www.gutenberg.org/files/2600/2600-0.txt](https://www.gutenberg.org/files/2600/2600-0.txt) -O pg19.txt

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 512

Test Log Output:

perplexity: calculating perplexity over 12 chunks, n_ctx=65536, batch_size=512, n_seq=1
perplexity: 71.10 seconds per pass - ETA 14.22 minutes
[1]6.6897,[2]7.0032,[3]7.1989,[4]7.3327,[5]7.4816,[6]7.3770,[7]7.4325,[8]7.4378,[9]7.4754,[10]7.5192,[11]7.5669,[12]7.4040,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4040 +/- 0.02773

Note: I currently do not have the capability to run KLD (Kullback–Leibler divergence) tests.

Example Server Configuration

For reference, here is the server configuration I used during my tests:

llama-server \
        -m "$MODEL_PATH" \
        -a Qwen3.6-27B \
        --ctx-size 105000 \
        --chat-template-file chat_template.jinja \
        --n-gpu-layers 99 \
        --cache-type-k q4_0 \
        --cache-type-v q4_0 \
        --batch-size 512 \
        --ubatch-size 256 \
        --flash-attn on \
        --no-mmap \
        --host 0.0.0.0 \
        --port 8081 \
        --reasoning on \
        --reasoning-format deepseek \
        -t 8 \
        --parallel 1 \
        -khad \
        -vhad \
        --chat-template-kwargs '{"preserve_thinking": true}' \
        --defrag-thold 0.3 \
        --jinja \
        --cont-batching \
        --temp 0.15 \
        --top-k 1 \
        --min-p 0.1 \
        --keep -1 \
        --repeat-last-n 512 \
        --repeat-penalty 1.05

```