Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context
Posted by ComfyUser48@reddit | LocalLLaMA | 20 comments
What would you choose if you were in my shoes? How viable is 125k for agentic coding really? is "compact" really good enough, or would you go with Q6_K 125k?
I am getting around 165-170 tok/sec with either config with my 5090.
1337_mk3@reddit
Keep the think blocks persistent in the jinja template and cap it at 125k. Trust.
simracerman@reddit
Stick with the 125k. It's the sweet spot before context rot starts piling up.
Chilangosta@reddit
Agreed. Especially if you combine it with a good agentic memory routine/system prompt that compresses context automatically and stashes it, you typically don't need the extra context as much as you benefit from the quant improvement, IMO.
hay-yo@reddit
Q5 is working well for me, but I've only hit the 125k limit a few times, so perhaps Q6 is fine. What you could do, since it's so fast: use Q6, and when you get close to the limit, swap to Q5 to finish.
ComfyUser48@reddit (OP)
That's an interesting idea!
FriendlyTitan@reddit
I am wondering if a Q8_0 KV cache would help here. That could double your context window, and I've heard the degradation is minimal.
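As a rough sanity check on the "double your context" claim, here's a back-of-the-envelope KV-cache size calculation. The model shape (layer count, KV heads, head dim) is an assumption based on earlier Qwen3 A3B models, not a confirmed spec for 3.6 35B:

```shell
# KV-cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/value * ctx
# Shape below is a guess (Qwen3-30B-A3B-like), not confirmed for 3.6 35B.
layers=48 kv_heads=4 head_dim=128 ctx=131072

f16_bytes=$((2 * layers * kv_heads * head_dim * 2 * ctx))   # f16 = 2 bytes/value
echo "f16 KV @ ${ctx} ctx: $((f16_bytes / 1024 / 1024 / 1024)) GiB"   # → 12 GiB

# q8_0 stores ~8.5 bits/value, so roughly half the f16 footprint:
q8_bytes=$((f16_bytes * 17 / 32))
echo "q8_0 KV @ ${ctx} ctx: ~$((q8_bytes / 1024 / 1024 / 1024)) GiB"  # → ~6 GiB
```

So under these assumptions, going from f16 to q8_0 frees roughly 6 GiB at 131072 context, which is why the same VRAM can hold about twice the window.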
popoppypoppylovelove@reddit
For unsloth's quants, something's a little weird with Q6_K. According to their benchmarks at https://www.reddit.com/r/LocalLLaMA/comments/1so5nrl/qwen36_gguf_benchmarks/, Q5_K_XL actually does better than Q6_K.
Holiday_Bowler_2097@reddit
Qwen3.6-35B-A3B-UD-Q6_K.gguf:

```
-c 262144 -ctk q8_0 -ctv q8_0 --no-mmproj-offload
```

30.6 GiB VRAM used on my RTX 5090 (Vulkan).
My monitors are connected to the iGPU on Strix Halo, though, so in your case the full 262144 context might be too much.
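For reference, those flags slot into a llama-server launch roughly like this. Only `-c`, `-ctk`/`-ctv`, and `--no-mmproj-offload` come from the comment above; the model path, `-ngl` value, and port are placeholders:

```shell
# Sketch of a llama-server launch with a quantized KV cache.
# Model path, -ngl, and --port are placeholders, not from the thread.
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf \
  -ngl 99 \
  -c 262144 \
  -ctk q8_0 -ctv q8_0 \
  --no-mmproj-offload \
  --port 8080
```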
ComfyUser48@reddit (OP)
Hmm, will try that. Maybe it's the ctk and ctv?
Holiday_Bowler_2097@reddit
It is. And the image-processing layers (mmproj) stay in RAM. Since the cache-rotation merge in llama.cpp, I don't use f16 anymore.
ComfyUser48@reddit (OP)
Got it working with 200k context, thank you! 256k isn't working for me with this on a 5090, though.
ComfyUser48@reddit (OP)
Hmm, just tested it, and I hit the first loop I've ever gotten with this model since I started using it. With these flags, I mean.
Holiday_Bowler_2097@reddit
With up-to-date llama.cpp? It refactored a big legacy codebase overnight in opencode without problems. Q8_0 was indistinguishable from f16 on the 27B for a long time, so I assumed the 3.6 35B inherited the same behavior. Well, I guess I'll spend some time testing and investigating.
laser50@reddit
After reading up on the 3.5 35B model, I found that bf16 was better than fp16 for the cache quantization. Apparently these models were trained in BF16, and FP16 gave some weird results. Perhaps worth a shot! I've been on bf16 for a while, it's worked perfectly, and it's only a slightly less compressed variant than Q8.
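If you want to try that, the cache type is selectable the same way as q8_0. Whether bf16 cache types are accepted depends on your llama.cpp build and backend, so treat this as a sketch:

```shell
# bf16 KV cache instead of the default f16.
# Support for bf16 cache types varies by build/backend; paths are placeholders.
llama-server -m ./model.gguf -c 131072 -ctk bf16 -ctv bf16
```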
FoxiPanda@reddit
My guess is that Q5_K_XL is good enough that you could run the higher context, and the degradation in response quality between Q6_K and Q5_K_XL would not be serious, maybe not even noticeable.
And ultimately, it will depend on your use case - if you write a bunch of short scripts and keep your context in check regularly, you probably won't see a difference. If you are running long context tasks in a single session over hours, you might notice a difference between having 200K native context vs 125k native context + compaction.
Either way though, you should be in good shape running that model at 165 tok/s (blazing fast); with a few decent system-prompt guardrails and self-test prompting techniques, it should iterate its way to good working code.
FullMetalGooseYT@reddit
I think speculative decoding doesn't work on 3.5. I did it on 2.5 Instruct, but on 3.5 I used the MoE with `--n-cpu-moe 40` on a 16 GB VRAM card.
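For what it's worth, the two setups look roughly like this in llama.cpp. Model filenames and the layer count are placeholders, and whether a vocab-compatible draft model exists for 3.5/3.6 is exactly the open question here:

```shell
# MoE expert offload: keep 40 layers' experts in system RAM on a 16 GB card.
# Model path is a placeholder.
llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf --n-cpu-moe 40

# Speculative decoding: needs a small draft model with a matching vocab.
# Both model paths and the draft limits are placeholders.
llama-server -m ./big-model.gguf -md ./small-draft.gguf --draft-max 16 --draft-min 1
```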
FoxiPanda@reddit
Hmm, I'm pretty sure I have it working currently, but let me check my hit rates. I set it up on a separate node and never really looked into how the hit rates were doing lol...it doesn't crash though and I get responses, so it's not dead dead at least :D
vevi33@reddit
I am trying to decide between these as well, but no matter how hard I try, Q6 feels better and I get better results :/
Radiant_Condition861@reddit
It's the shallow-and-wide bucket vs. the narrow-and-deep bucket. It depends on the work you're doing and how nuanced your projects' requirements are, but it's really splitting hairs at this point. I'd go with the faster one: Q5 and 125k.