Qwen3.6-27B 4.256bpw in full VRAM on a 5070 Ti with 50000 q4_0 context - not turbo!
Posted by Decivox@reddit | LocalLLaMA | 35 comments
I've been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs, as his Qwen 3.5 was my GGUF of choice. I tried cHunter789's Qwen3.6-27B-i1-IQ4_XS-GGUF that was posted yesterday, but could only achieve a context window of 30000 while staying in VRAM.
With the same launch settings, I am able to achieve a 50000 context window with this GGUF, which is quite the increase.
The Hugging Face model card shows that this quant is the most VRAM-efficient option at just 4.256 BPW (~13.3 GB), with average perplexity nearly identical to the others (6.99 vs ~6.95–7.02). The fidelity metrics do show it has measurably higher probability distortion (RMS Δp ~6.7% vs ~4.3%, top-p match ~90.3% vs ~94%), but these gaps are modest and typical of aggressive 4-bit compression rather than a severe downgrade.
I've posted my launch arguments here if you want to take a look.
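For reference, the command looks roughly like this (illustrative sketch only; the model path and numbers below are placeholders, not my exact arguments):

# Sketch only - adjust the model path and values for your own setup.
# -ngl 99 offloads every layer, and the q4_0 KV cache is what lets 50000 context fit in 16 GB.
llama-server --model ./Qwen3.6-27B-i1-IQ4_XS.gguf -ngl 99 --ctx-size 50000 \
  --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn on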
Does anyone know if I'd be better off sticking with Qwen3.6-35B-A3B Q6_K over this lower quant of a dense model? The MoE has the advantage of a larger context window, since spilling into system RAM doesn't destroy its performance.
Also, they made a Qwen3.6-27B-GGUF-5.076bpw for 24 GB cards if anyone wants to give that a look.
ea_man@reddit
Guys, you should use Linux for this: headless takes 50 MB of VRAM, ~250 MB with LXQt at 4K, ~450 MB with KDE with Firefox open.
It means that I can run:
- Qwen3.6-27B.i1-IQ4_XS.gguf, Context: 51712 with graphics + Firefox
- For 12GB: https://www.reddit.com/r/LocalLLaMA/comments/1ssnfdb/comment/ohp9x1n/?context=3
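Quick way to check how much the desktop itself is holding (sketch; nvidia-smi for NVIDIA cards, the amdgpu sysfs counter for AMD, and the card index may differ on your box):

# NVIDIA: per-process VRAM usage is listed at the bottom of the output
nvidia-smi
# AMD (amdgpu driver): VRAM currently in use, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_used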
zakkord@reddit
Just switch to the iGPU and you'll have 0 MB used, Windows or Linux.
(But not really, since the Nvidia 50 series requires the open drivers with GSP, which takes up ~500 MB, so CUDA won't let you allocate the full VRAM; it will spill into the shared pool and your performance will tank, and you won't see it in nvidia-smi.)
samuraiogc@reddit
How do I force the iGPU on Windows?
And how do I use the dedicated GPU only for gaming and AI?
zakkord@reddit
If you have a default Windows install, just switching your DisplayPort/HDMI cable to the iGPU output will automatically use it for the browser etc.; you can also manually select the GPU for each process under "Graphics" in the Windows Settings menu.
Contrary to myths, the performance loss/latency is non-existent. Every notebook works this way today.
samuraiogc@reddit
Oh, I did not know that. How does it work in a multi-monitor setup?
ea_man@reddit
You want me to buy a new CPU or laptop just to switch to an iGPU? I'd rather dual boot into Linux and waste less VRAM, makes much more sense ;P
tomByrer@reddit
Or use an iGPU or a 2nd GPU if you have one.
But thanks for the reminder that the screen does take VRAM; it seems some forget that here...
ea_man@reddit
On Linux you should also stop the compositor and any 3D eye candy, and possibly use plain X11 with a light desktop if you want to stay within ~200 MB.
BTW, testing max context length headless on a 16 GB GPU now:
InsensitiveClown@reddit
You can even do better than that. If you have a CPU with onboard graphics, drive the main display with the onboard graphics, and leave the RTX as a dedicated CUDA card - the display is driven by the onboard CPU/GPU, not the discrete RTX. Whenever you want to actually use the RTX for gaming you can always use NVidia's PRIME.
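For example, with NVIDIA's standard PRIME render offload variables you can push a single program to the RTX while the desktop stays on the iGPU (sketch, assuming the stock offload setup):

# run one app on the discrete NVIDIA GPU; everything else stays on the iGPU
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep "OpenGL renderer"
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia steam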
ea_man@reddit
Yup, with Linux you can just stop the graphics server, so no VRAM is needed; in fact you could even do without any kind of graphics at all as long as you have Ethernet.
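On a systemd distro that's usually just the following (sketch; the display manager and target names can differ per setup):

# drop to a text console, freeing the VRAM the desktop was using
sudo systemctl isolate multi-user.target
# bring the desktop back later
sudo systemctl isolate graphical.target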
MmmmMorphine@reddit
Ugh, I simply can't stand X11; I spent an hour or two trying to set it up and it drove me crazy now that GNOME doesn't use it.
I'm sure I'd get used to it eventually and have it set up the way I like, but it was a surprisingly high bar.
It definitely saves quite a bit of VRAM vs Wayland though.
Or I might not know what I'm talking about.
ea_man@reddit
Well, just use LXQt instead of GNOME; that runs with zero hassle, at least here on Debian.
Long_comment_san@reddit
Nvidia did us dirty with their failure to launch an 18 GB 5070 Super and a 24 GB 5070 Ti Super.
We would have been absolutely loving them by now.
MmmmMorphine@reddit
Yeah, the perfect sizing is just barely out of reach. I have a 4070 Ti Super.
But more importantly, why are so many models just barely out of reach no matter what mad quant or KV cache compression you run (mostly)? It's maddening!
Long_comment_san@reddit
True. We should have also started making 4 GB GDDR7 chips by now, yet there is zero news on that front.
WoodYouIfYouCould@reddit
The question is what this guy can do with some success on a 4060 Ti 16 GB.
Currently running 35B-A3B with "ease", Unsloth Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf.
Running the 27B at 4-bit is very slow (8 tk/s) vs the 35B-A3B (52 tk/s).
Current flags:
--model /root/models/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
--chat-template-kwargs '{"preserve_thinking":true}'
--alias "qwen3.6-35b-a3b"
--ctx-size 98304
--ctx-checkpoints 3
-ngl 99
--n-cpu-moe 20
--cache-type-k q8_0
--cache-type-v q8_0
--flash-attn on
--batch-size 2048
--ubatch-size 768
--threads 12
--threads-batch 12
--jinja
--reasoning-budget 2048
--metrics
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--presence-penalty 0.0
--repeat-penalty 1.0
--parallel 1
--host 0.0.0.0
--port 10000
wil_is_cool@reddit
You need to use 3.6 27B at IQ3_XXS or similar to get reasonable context.
Existing_Director_48@reddit
Guys, try Qwen 3.6 TQ3_4S with a custom llama.cpp. Here I got 140 tk/s with reasonable quality on an RTX 4070 Ti Super. I'm using this (look at the readme for the install). 100% worth it for fast use in my use cases. Here I got a very good context size with the K cache at q4 and the V cache at q3: 128k+ context, all inside VRAM.
https://huggingface.co/YTan2000/Qwen3.6-35B-A3B-TQ3_4S/tree/main
For higher quality I use the Qwen 27B TQ3 too; less context, but high speed as well.
loudsound-org@reddit
Oh sweet, I've been looking for options for 27B on my 4070 Ti Super. What kind of speeds are you getting? Using Unsloth I can only get 6 t/s, compared to 60 with the 35B and 65k context. Everything I've read says the 27B is better for coding, but at that speed difference it's hard for me to even want to try.
Decivox@reddit (OP)
37 tokens per second
Glittering-Call8746@reddit
Radeon AI PRO R9700
AMD Ryzen 5 5500
Linux
Q4_K_XL, Context: 8,192. So the R9700 is slow due to... ?
ea_man@reddit
bad config?
FYI: I get 23.41 t/s with a 6800 on Vulkan with Qwen3.6-27B.i1-IQ4_XS.gguf.
loudsound-org@reddit
Thanks! I'll give it a shot this weekend hopefully.
Orbiting_Monstrosity@reddit
I have the same card, and am getting around 38 t/s using this model in LM Studio with ~50k context and Q8 KV cache quantization.
jd52wtf@reddit
With my 4070 Ti Super I'm getting about 10 t/s. I've also got a lot of RAM and CPU cores, which help a bit.
Great output results if you don't mind waiting.
mikelowski@reddit
Same GPU. Forget it, this dense model does not fit there.
YourNightmar31@reddit
Is there an option to run this with vision capabilities?
nmkd@reddit
Just load the mmproj, should work
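Something like this with llama-server (file names here are just placeholders; grab the matching mmproj GGUF from the same repo if the author published one):

# the mmproj projector has to match the main model
llama-server --model Qwen3.6-27B-IQ4_XS.gguf --mmproj mmproj-Qwen3.6-27B-f16.gguf -ngl 99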
RanklesTheOtter@reddit
Thanks, I was gonna set up 27B this week on my 5060 Ti.
This will be perfect.
Dartix1@reddit
Really cool, I have a 16 GB GPU too (7800 XT), but I had to settle for smaller models because of slow PP speeds. What are your PP speeds using this setup?
redblood252@reddit
Interested to compare dense vs moe at higher quant
Nyghtbynger@reddit
I waited eagerly for this one. I really like the 3.5 version on my 7800 XT; that's my daily driver. I achieved a 70K context and 28-30 tok/sec.
Performance is similar to DeepSeek, Kimi, Sonnet, or the bigger Qwens in terms of planning, with only GLM flying high in the sky. In terms of code precision, that's where it can be lacking compared to the other models. I'm slowly moving away from autonomous coding, so I don't really care.
Sadly I can't use my 7800 XT lately and will be back on my 6600 for the time being, so the 35B will be my new girlfriend.
StorageHungry8380@reddit
I was going to comment that I seem to prefer the dense over the MoE variant, at least for coding: better tool calling, better questions, and better code. But then I recalled I might have enabled `preserve_thinking` on the dense variant but not the MoE, so now I feel uncertain whether that could be it.
I ended up rolling with the dense model for a multi-phase implementation and have been pleasantly surprised so far. Then again, I've got a 5090, so Q5 UD and 128k FP16 context; not sure how transferable that is to your setup.
OneSlash137@reddit
The fully unquantized version is trash. Why try to run this as copium?
Decivox@reddit (OP)
Based on your comment history, I'm not even going to try... lol