Qwen3.6-27B 4.256bpw in full VRAM on a 5070 Ti with 50000 q4_0 context - not turbo!

Posted by Decivox@reddit | LocalLLaMA

Hugging Face link here.

I've been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs, as his Qwen 3.5 quant was my GGUF of choice. I tried cHunter789's Qwen3.6-27B-i1-IQ4_XS-GGUF that was posted yesterday, but could only fit a 30000-token context window while staying entirely in VRAM.

With the same launch settings, I am able to achieve a 50000-token context window with this GGUF, which is quite the increase.

The Hugging Face model card shows that this quant is the most VRAM-efficient option at just 4.256 BPW (~13.3 GB), with average perplexity nearly identical to the others (6.99 vs ~6.95–7.02). The fidelity metrics do show it has measurably higher probability distortion (RMS Δp ~6.7% vs ~4.3%, top-p match ~90.3% vs ~94%), but these gaps are modest and typical of aggressive 4-bit compression rather than a severe downgrade.
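If you want to sanity-check that ~13.3 GB number yourself, it's just bits-per-weight times parameter count (a rough estimate that ignores the handful of tensors kept at higher precision, and assumes the card is reporting GiB):

```bash
# 27e9 params * 4.256 bits / 8 bits-per-byte = ~14.36e9 bytes, i.e. ~13.4 GiB
awk 'BEGIN { printf "%.2f GiB\n", 27e9 * 4.256 / 8 / 1024^3 }'
```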

I've posted my launch arguments here if you want to take a look.
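In case that link dies, the command has roughly this shape (standard llama.cpp llama-server flags; the model filename is illustrative, and these are not my exact arguments):

```bash
# -ngl 99          offload all layers to the GPU
# -c 50000         50k context window
# -fa              flash attention (newer builds spell it "-fa on");
#                  required for a quantized V cache
# -ctk/-ctv q4_0   quantize the K and V caches to q4_0
./llama-server -m Qwen3.6-27B-4.256bpw.gguf -ngl 99 -c 50000 -fa \
  -ctk q4_0 -ctv q4_0
```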

Does anyone know if I'd be better off sticking with Qwen3.6-35B-A3B Q6_K over this lower quant of a dense model? The MoE has the advantage of a larger usable context window, since spilling its expert weights into system RAM doesn't destroy performance the way it does for a dense model (sketch below).
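For anyone unfamiliar with the RAM-spillage trick: llama.cpp lets you keep the attention/shared weights on the GPU and push just the MoE expert tensors to system RAM via a tensor override. The regex below is the commonly used pattern and the filename is illustrative; tensor names can differ per model:

```bash
# Keep attention/shared weights on the GPU, push the MoE expert FFN tensors
# to system RAM; only ~3B params are active per token, so this stays usable.
./llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 -c 50000 \
  --override-tensor "\.ffn_.*_exps\.=CPU"
```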

Also, they made a Qwen3.6-27B-GGUF-5.076bpw for 24 GB cards if anyone wants to give that a look.