Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context
Posted by ComfyUser48@reddit | LocalLLaMA | 20 comments
What would you choose if you were in my shoes? How viable is 125k for agentic coding really? is "compact" really good enough, or would you go with Q6_K 125k?
I am getting around 165-170 tok/sec with either config with my 5090.
1337_mk3@reddit
Keep the think blocks persistent in the jinja template and cap it at 125k. Trust.
simracerman@reddit
Stick with the 125k. It's the sweet spot before context rot starts piling up.
Chilangosta@reddit
Agreed. Especially if you combine it with a good agentic memory routine/system prompt that compresses context automatically and stashes it, you typically don't need the extra context as much as you benefit from the quant improvement, IMO.
hay-yo@reddit
Q5 is working well for me, but I've only hit the 125k limit a few times, so perhaps Q6 is fine. What you could do, since it's so fast: use Q6, and when you get close to the limit, swap to Q5 to finish.
ComfyUser48@reddit (OP)
That's an interesting idea!
FriendlyTitan@reddit
I am wondering if a Q8_0 KV cache would help here. That could double your context window, and I've heard the degradation is minimal.
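As a rough sanity check on the "double your context" claim, here's a back-of-the-envelope KV-cache size calculation. The model shape (layer count, KV heads, head dim) is an assumption based on earlier Qwen3 A3B models, not a confirmed spec for 3.6 35B:

```shell
# KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/value * ctx
# Shape below is a guess (Qwen3-30B-A3B-like), not confirmed for 3.6 35B.
layers=48 kv_heads=4 head_dim=128 ctx=131072

f16_bytes=$((2 * layers * kv_heads * head_dim * 2 * ctx))   # f16 = 2 bytes/value
echo "f16 KV @ ${ctx} ctx: $((f16_bytes / 1024 / 1024 / 1024)) GiB"   # → 12 GiB

# q8_0 stores ~8.5 bits/value, so roughly half the f16 footprint:
q8_bytes=$((f16_bytes * 17 / 32))
echo "q8_0 KV @ ${ctx} ctx: ~$((q8_bytes / 1024 / 1024 / 1024)) GiB"  # → ~6 GiB
```

So under these assumptions, going from f16 to q8_0 frees roughly 6 GiB at 131072 context, which is why the same VRAM can hold about twice the window.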
popoppypoppylovelove@reddit
For unsloth's quants, something's a little weird with Q6_K. According to their benchmarks at https://www.reddit.com/r/LocalLLaMA/comments/1so5nrl/qwen36_gguf_benchmarks/, Q5_K_XL actually does better than Q6_K.
Holiday_Bowler_2097@reddit
Qwen3.6-35B-A3B-UD-Q6_K.gguf:

```
-c 262144 -ctk q8_0 -ctv q8_0 --no-mmproj-offload
```

30.6 GiB VRAM used on my RTX 5090 (Vulkan).
My monitors are connected to the iGPU on Strix Halo, though, so in your case the full 262144 context might be too much.
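For reference, those flags slot into a llama-server launch roughly like this. Only `-c`, `-ctk`/`-ctv`, and `--no-mmproj-offload` come from the comment above; the model path, `-ngl` value, and port are placeholders:

```shell
# Sketch of a llama-server launch with a quantized KV cache.
# Model path, -ngl, and --port are placeholders, not from the thread.
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf \
  -ngl 99 \
  -c 262144 \
  -ctk q8_0 -ctv q8_0 \
  --no-mmproj-offload \
  --port 8080
```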
ComfyUser48@reddit (OP)
Hmm, will try that. Maybe it's the ctk and ctv?
Holiday_Bowler_2097@reddit
It is. And the image-processing layers (mmproj) stay in RAM. Since the cache-rotation merge in llama.cpp, I don't use f16 anymore.
ComfyUser48@reddit (OP)
Got it working with 200k context, thank you! 256k isn't working for me with this on a 5090, though.
ComfyUser48@reddit (OP)
Hmm, just tested it, and I hit the first loop I've ever gotten with this model since I started using it. With these flags, I mean.
Holiday_Bowler_2097@reddit
With up-to-date llama.cpp? It refactored a big legacy codebase overnight in opencode without problems. Q8_0 was indistinguishable from f16 on the 27B for a long time, so I assumed the 3.6 35B inherited the same behavior. Well, I guess I'll spend some time testing and investigating.
laser50@reddit
After reading up on the 3.5 35B model, I found that bf16 was better than fp16 for the cache quantization. Apparently these models were trained in BF16, and FP16 gave some weird results. Perhaps worth a shot! I've been on bf16 for a while, it's worked perfectly, and it's only a slightly less compressed variant than Q8.
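If you want to try that, the cache type is selectable the same way as q8_0. Whether bf16 cache types are accepted depends on your llama.cpp build and backend, so treat this as a sketch:

```shell
# bf16 KV cache instead of the default f16.
# Support for bf16 cache types varies by build/backend; paths are placeholders.
llama-server -m ./model.gguf -c 131072 -ctk bf16 -ctv bf16
```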
FoxiPanda@reddit
My guess is that Q5_K_XL is good enough that you could run the higher context, and the degradation in response quality between Q6_K and Q5_K_XL would not be serious, maybe not even noticeable.
And ultimately, it will depend on your use case - if you write a bunch of short scripts and keep your context in check regularly, you probably won't see a difference. If you are running long context tasks in a single session over hours, you might notice a difference between having 200K native context vs 125k native context + compaction.
Either way though, you should be in good shape running that model at 165 tok/s (blazing fast); with a few decent system-prompt guardrails and self-test prompting techniques, it should iterate its way to good working code.
FullMetalGooseYT@reddit
I think speculative decoding doesn't work on 3.5. I did it on 2.5 Instruct, but on 3.5 I used the MoE with `--n-cpu-moe 40` on a 16 GB VRAM card.
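For what it's worth, the two setups look roughly like this in llama.cpp. Model filenames and the layer count are placeholders, and whether a vocab-compatible draft model exists for 3.5/3.6 is exactly the open question here:

```shell
# MoE expert offload: keep 40 layers' experts in system RAM on a 16 GB card.
# Model path is a placeholder.
llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf --n-cpu-moe 40

# Speculative decoding: needs a small draft model with a matching vocab.
# Both model paths and the draft limits are placeholders.
llama-server -m ./big-model.gguf -md ./small-draft.gguf --draft-max 16 --draft-min 1
```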
FoxiPanda@reddit
Hmm, I'm pretty sure I have it working currently, but let me check my hit rates. I set it up on a separate node and never really looked into how the hit rates were doing lol...it doesn't crash though and I get responses, so it's not dead dead at least :D
vevi33@reddit
I am trying to decide between these as well, but no matter how hard I try, Q6 feels better and I get better results :/
Radiant_Condition861@reddit
It's the shallow-and-wide bucket vs. the narrow-and-deep bucket. It depends on the work you're doing and how nuanced your projects' requirements are, but it's really splitting hairs at this point. I'd go with the faster one: Q5 and 125k.