Quant Qwen3.6-27B on 16GB VRAM with 100k context length
Posted by Due-Project-7507@reddit | LocalLLaMA | 13 comments

I have experimented with running Qwen3.6-27B on my laptop with an A5000 16GB GPU. I created my own IQ4_XS GGUF, "qwen3.6-27b-IQ4_XS-pure.gguf", with the Unsloth imatrix and compared its mean KLD with other quants.
I also tested different turboquant versions. It looks like the buun-llama-cpp fork is better than the TheTom/llama-cpp-turboquant fork.
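If you would rather build the quant yourself than download mine, this is roughly the usual llama.cpp workflow (a sketch only: the FP16 source file, calibration text and imatrix file names are placeholders, I am assuming the buun fork keeps the stock llama-imatrix/llama-quantize tools, and the "-pure" suffix presumably corresponds to the --pure flag):
  # Compute an importance matrix from some calibration text (I used the Unsloth imatrix instead, so you can skip this if you have that file):
  llama-imatrix -m qwen3.6-27b-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99
  # Quantize to IQ4_XS using the imatrix; --pure quantizes all tensors to the same type:
  llama-quantize --imatrix imatrix.dat --pure qwen3.6-27b-f16.gguf qwen3.6-27b-IQ4_XS-pure.gguf IQ4_XS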
If you want to try my version, you can do the following:
- Download my GGUF from Hugging Face. It already contains an improved chat template based on this one.
- Clone buun-llama-cpp from https://github.com/spiritbuun/buun-llama-cpp
- Build it. On Windows I used:
  cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl
  cmake --build build --config Release -j 16
- Check, e.g. with nvidia-smi, that the GPU VRAM is all free.
- Run it; I used this command:
  build/bin/llama-server --model qwen3.6-27b-IQ4_XS-pure.gguf --alias qwen3.6-27b -np 1 -ctk turbo3_tcq -ctv turbo3_tcq -c 100000 --fit off -ngl 999 --no-mmap -fa on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
- To use it with OpenCode, I use this ~/.config/opencode/opencode.json file:
  {
    "$schema": "https://opencode.ai/config.json",
    "plugin": [
      "opencode-anthropic-auth@latest",
      "opencode-copilot-auth@latest"
    ],
    "share": "disabled",
    "provider": {
      "llama.cpp": {
        "npm": "@ai-sdk/openai-compatible",
        "name": "llama.cpp (OpenAI Compatible)",
        "options": {
          "baseURL": "http://127.0.0.1:8080/v1",
          "apiKey": "1234"
        },
        "models": {
          "qwen3.5-27b": {
            "name": "Qwen 3.5 27B",
            "interleaved": { "field": "reasoning_content" },
            "limit": { "context": 100000, "output": 32000 },
            "temperature": true,
            "reasoning": true,
            "attachment": false,
            "tool_call": true,
            "modalities": { "input": [ "text" ], "output": [ "text" ] },
            "cost": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 }
          }
        }
      }
    },
    "agent": {
      "code-reviewer": {
        "description": "Reviews code for best practices and potential issues",
        "model": "llama.cpp/qwen3.5-27b",
        "prompt": "You are a code reviewer. Focus on security, understandability, conciseness, maintainability and performance."
      },
      "plan": {
        "model": "llama.cpp/qwen3.5-27b"
      }
    },
    "model": "llama.cpp/qwen3.5-27b",
    "small_model": "llama.cpp/qwen3.5-27b"
  }
I get around 21 tokens/s generation speed / 550 tokens/s prompt processing at the beginning; at 15k context it goes down to around 14 tokens/s (485 tokens/s prompt processing).
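If you want to reproduce these numbers on your own hardware, upstream llama.cpp ships llama-bench; assuming the buun fork keeps it and accepts the turbo cache types, something like this reports prompt-processing and generation rates:
  build/bin/llama-bench -m qwen3.6-27b-IQ4_XS-pure.gguf -ngl 999 -fa 1 -ctk turbo3_tcq -ctv turbo3_tcq -p 512 -n 128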
Monkey_1505@reddit
If you just quantize the V cache you'll probably find it performs better.
devnull0@reddit
Unfortunately not; when I tried it with llama-cli the output was really just gibberish.
Monkey_1505@reddit
Ah. Well, I saw someone doing it with vLLM in a YouTube video, and apparently it's the recommended approach. Or so they said, having communicated with someone involved with turboquant. Second-hand info though, not exactly my expertise.
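For anyone who wants to try the V-cache-only suggestion themselves, it would presumably just mean dropping the -ctk override from the command above so only the V cache uses the turbo quant (untested sketch; per the reply above this produced gibberish with llama-cli):
  build/bin/llama-server --model qwen3.6-27b-IQ4_XS-pure.gguf --alias qwen3.6-27b -np 1 -ctv turbo3_tcq -c 100000 --fit off -ngl 999 --no-mmap -fa on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0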
harpysichordist@reddit
FYI your opencode.json says "qwen3.5" not 3.6
Due-Project-7507@reddit (OP)
I forgot to change it. I think llama.cpp just ignores the name, so it doesn't matter, but it wouldn't work with vLLM.
Icy-Degree6161@reddit
yeah it ignores the name
grumd@reddit
Tried turbo3_tcq; the overhead in decode is crazy, ~50 tps at zero depth and then dropping to 12 tps at 100k context. Will still probably use it, because IQ4_XS on my 16 GB VRAM with 100k context sounds really good.
oxygen_addiction@reddit
Why --fit off?
OsmanthusBloom@reddit
Not OP, but I think I can see why. All the important params are already specified. Having fit on would probably just move some layers to CPU RAM, because by default it tries to reserve 1 GB of VRAM for other uses (--fit-target 1024), which in this case would massively slow down generation.
Due-Project-7507@reddit (OP)
Yes, I only use it to silence the message about fit. You can also try using --fit-target instead of setting the context length. The default --fit-target is conservative; lower it and test with long context until you get a CUDA OOM crash. I think -fa is also not needed anymore, it is automatic now.
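If you want to try the --fit-target route, the idea is roughly this (untested sketch; 512 is just an example starting value and the right number depends on your machine):
  # Let --fit size the context itself, but reserve less VRAM than the default 1024 MB;
  # lower the value step by step and test long prompts until you hit a CUDA OOM crash.
  build/bin/llama-server --model qwen3.6-27b-IQ4_XS-pure.gguf --alias qwen3.6-27b -np 1 -ctk turbo3_tcq -ctv turbo3_tcq --fit-target 512 -ngl 999 --no-mmap --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0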
LoSboccacc@reddit
Nice! Would you be able to measure KLD at 2048 context and at 65536 context, to see if all quants stay coherent down the road?
Due-Project-7507@reddit (OP)
I can try it next week.
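For reference, the standard llama.cpp route for this (assuming the fork keeps llama-perplexity, and with testdata.txt as a placeholder for whatever evaluation text is used) is to regenerate the reference logits at each context size and compare the quant against them:
  # Reference logits from the unquantized model at 2048 context:
  llama-perplexity -m qwen3.6-27b-f16.gguf -f testdata.txt -c 2048 --kl-divergence-base logits-2048.kld
  # Mean KLD of the quant against those logits:
  llama-perplexity -m qwen3.6-27b-IQ4_XS-pure.gguf -f testdata.txt -c 2048 --kl-divergence-base logits-2048.kld --kl-divergence
  # Repeat with -c 65536 and a new base file for the long-context case.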
logic_prevails@reddit
Looks nice thx