Quant Qwen3.6-27B on 16GB VRAM with 100k context length

Posted by Due-Project-7507@reddit | LocalLLaMA

I have been experimenting with running Qwen3.6-27B on my laptop with an A5000 16GB GPU. I created my own IQ4_XS GGUF, "qwen3.6-27b-IQ4_XS-pure.gguf", using the Unsloth imatrix, and compared its mean KLD against other quants.

As you can see, I also tested different turboquant versions. It looks like the buun-llama-cpp fork is better than the TheTom/llama-cpp-turboquant fork.
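
If you want to reproduce the mean KLD comparison yourself, the stock llama.cpp tools can do it. This is only a rough sketch: the file names below (full-precision GGUF, imatrix, test text, logits file) are placeholders, so substitute your own paths:

    # quantize the full-precision GGUF to IQ4_XS using an imatrix (placeholder file names)
    llama-quantize --imatrix imatrix.dat qwen3.6-27b-bf16.gguf qwen3.6-27b-IQ4_XS-pure.gguf IQ4_XS

    # save the full-precision logits on a test text once ...
    llama-perplexity -m qwen3.6-27b-bf16.gguf -f test.txt --kl-divergence-base logits.bin

    # ... then compute the mean KLD of a quant against those logits
    llama-perplexity -m qwen3.6-27b-IQ4_XS-pure.gguf --kl-divergence-base logits.bin --kl-divergence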

If you want to try my version, you can do the following:

  1. Download my GGUF from Huggingface (an example download command is shown after this list). It already contains an improved chat template based on this one

  2. Clone buun-llama-cpp from https://github.com/spiritbuun/buun-llama-cpp

  3. Build it. On Windows I used:

    cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl
    cmake --build build --config Release -j 16

  4. Check, e.g. with nvidia-smi, that the GPU VRAM is completely free

  5. Run it. I used this command (a quick curl check against the running server is shown after this list):

    build/bin/llama-server --model qwen3.6-27b-IQ4_XS-pure.gguf --alias qwen3.6-27b -np 1 -ctk turbo3_tcq -ctv turbo3_tcq -c 100000 --fit off -ngl 999 --no-mmap -fa on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0

  6. To use it with OpenCode, I use this ~/.config/opencode/opencode.json file:

    {   "$schema": "https://opencode.ai/config.json",   "plugin": [     "opencode-anthropic-auth@latest",     "opencode-copilot-auth@latest"   ],   "share": "disabled",   "provider": {     "llama.cpp": {       "npm": "@ai-sdk/openai-compatible",       "name": "llama.cpp (OpenAI Compatible)",       "options": {         "baseURL": "http://127.0.0.1:8080/v1",         "apiKey": "1234"       },       "models": {         "qwen3.5-27b": {           "name": "Qwen 3.5 27B",           "interleaved": {             "field": "reasoning_content"           },           "limit": {             "context": 100000,             "output": 32000           },           "temperature": true,           "reasoning": true,           "attachment": false,           "tool_call": true,           "modalities": {             "input": [               "text"             ],             "output": [               "text"             ]           },           "cost": {             "input": 0,             "output": 0,             "cache_read": 0,             "cache_write": 0           }         }       }     }   },   "agent": {     "code-reviewer": {       "description": "Reviews code for best practices and potential issues",       "model": "llama.cpp/qwen3.5-27b",       "prompt": "You are a code reviewer. Focus on security, understandability, conciseness, maintainability and performance."     },     "plan": {       "model": "llama.cpp/qwen3.5-27b"     }   },   "model": "llama.cpp/qwen3.5-27b",   "small_model": "llama.cpp/qwen3.5-27b" }

I get around 21 tokens/s generation speed and 550 tokens/s prompt processing at the beginning; at 15k context it drops to around 14 tokens/s generation (485 tokens/s prompt processing).