Qwen3.6 27B seems to struggle at 90k on a 128k ctx window
Posted by dodistyo@reddit | LocalLLaMA | 45 comments
I have an RX 7900 XTX, running Qwen3.6 27B Q4_K_XL. I get 400-ish pp and 30-ish tps. Everything below 64k is incredible and it spits out good quality code.
But when I tried to push it further to work on some fairly complex devops-related tasks, it fails at tool calling around 90k ctx.
I use opencode as my harness and here is the llama.cpp command I ran:
llama-server -ctv q8_0 -ctk q8_0 -c 128000 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 --fit on
what's your experience?
Easy_Werewolf7903@reddit
A single data point, but around 70k tokens the model would stop generating text midway in opencode. I'm using FP8 with 260k context.
Maleficent-Ad5999@reddit
Yes. I figured out the best way is to use sub-agents in opencode, so that each task is delegated to a subagent that is quick enough to perform it and report back.
For example, I have multiple subagents for web research, codebase search, breaking down tasks, code implementation, and validating the change, all coordinated by the main agent. This way, each task starts with a smaller context and gets a faster response, and the main chat window keeps its context small.
This setup feels like a great boost with Qwen3.6 27B; as I go longer into the chat I still consume only around 30k tokens.
Djagatahel@reddit
How do you manage prefix caching efficiently? I use llama.cpp with RAM prefix caching and it's very hit or miss, mostly miss. That makes the experience with subagents pretty bad.
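For what it's worth, two llama-server flags are relevant here. This is a rough sketch, worth checking against your llama.cpp build, and whether it actually fixes the subagent case is untested; the model path is a placeholder:
llama-server -m model.gguf -c 128000 --parallel 2 --cache-reuse 256
--parallel gives each concurrent request its own slot with its own prefix cache (though the total context is split across slots), and --cache-reuse lets the server reuse matching cached chunks even when the new prompt isn't an exact prefix match.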
Raredisarray@reddit
How do you use sub agents with local LLM, like does a new context spawn and that agent reports its context back to your main agent in the terminal window that stays open?
Maleficent-Ad5999@reddit
Exactly. Sub-agents are just markdown files with distinct instructions on what a subagent should do. OpenCode manages the context window effectively: the LLM is given instructions on which subagent to choose based on the request, and it creates a new session with a new context to solve the query.
We never needed this with frontier models, as they are so good and have huge context windows.
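For illustration, a minimal sketch of what one of those subagent files could look like. The path and frontmatter keys below are assumptions based on OpenCode's agent docs, so verify them against your version.
.opencode/agent/codebase-search.md:
---
description: Read-only agent that searches the codebase and reports back relevant files and snippets
mode: subagent
tools:
  write: false
  edit: false
---
Find the files and snippets relevant to the request and report back concisely so the main agent's context stays small.
The frontmatter marks it as a subagent and limits its tools; the body underneath acts as that agent's system prompt.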
Raredisarray@reddit
Thanks for the explanation ! Very cool, I’ll have to give it a shot.
Hot_Turnip_3309@reddit
it's your quant. it works fine for me. don't use the unsloth quant for 3.6
juaps@reddit
In my experience, going from 80k with 8-bit KV to 220k with TurboQuant gguf was night and day; now I can manage my 1GB HTML project and actually work.
AvidCyclist250@reddit
A game changer. Currently using buun llama.cpp. Thanks 4 the context, brah
juaps@reddit
np dude. For me, it was the key to finally starting to work, so I'm really happy now. Glad it helped u too
Glittering-Call8746@reddit
Yea use it or lose it .. u using 3x compression or 4x ?
AvidCyclist250@reddit
I think actual compression is way more than that
juaps@reddit
Looks like 4x, but sorry, not so sure. I'm using -ctv turbo3, and the startup log says turbo3 using 4-mag LUT. K cache is q8_0, V cache is turbo3.
dodistyo@reddit (OP)
mind to share the exact model?
juaps@reddit
yeah sure, i went from LMstudio to TurboQuant Plus (turboquant-plus-tqp-v0.1.1).
Model: Qwen3.6-27B-Q6_K.gguf
Launch: -ctk q8_0 -ctv turbo3 -c 222144 -fa on -ngl 99 --reasoning on
So K cache is q8_0, V cache is turbo3, 222k ctx, flash attention on.
Glittering-Call8746@reddit
So Windows WSL2, which ROCm version, and so on?
Sixstringsickness@reddit
This is literally the nature of LLMs: degradation in performance can start at as low as 10% of context window usage. You can see this even with SOTA models such as Opus 4.7; go past 50% and they become nearly useless.
Context Rot: How Increasing Input Tokens Impacts LLM Performance
https://www.trychroma.com/research/context-rot
Automatic-Arm8153@reddit
Correct. For local models I don't recommend going past 64k. If your work has to go past 64k, your workflow is not efficient enough.
With Claude I sometimes push it up to 140k, but from 120k its ability is already degraded. That's not to say it won't solve problems, but you will be wasting tokens as its problem-solving ability is worse.
Compact often. Compact always.
IMO the best way to use LLMs is with no memory features and a precise prompt. Sometimes I have a chat just to make the perfect prompt/context for another chat, so I'm not wasting tokens in the problem-solving chat.
Sirius_Sec_@reddit
I do the same thing. I've been wanting to try caveman as well.
michaelsoft__binbows@reddit
Did some experiments with opencode and my local 27B configured with a max context of 60k tokens. It would autocompact itself to smithereens. Experimenting with other harnesses now...
Automatic-Arm8153@reddit
This is funny because I see all these people wanting to TurboQuant or Q8/Q4 the KV cache just to hit a model's max token window.
But these max token windows are useless flexes. PP and TG speeds become much slower and you waste time hand-guiding the LLMs through simple problems at high context.
Sixstringsickness@reddit
Yes, TG comes to a grinding halt very rapidly.
iezhy@reddit
It's more that 100k is a semi-hard limit for all models; they go to "dum dum" mode fairly quickly after that.
Sirius_Sec_@reddit
That's what I've noticed as well. The 2 million context window advertised by xAI and others is complete bs. This is why it's important to create batches of tasks and keep the context as short as possible.
FlyingDogCatcher@reddit
Yeah, people have complained for a long time about GitHub Copilot capping most of their models at 128k, and I was always like "You don't need or want more than that".
Sixstringsickness@reddit
This depends. I can push Opus 4.7's (1M) window into the 500k range before noticing substantial degradation (context dependent, no pun intended). If you have a cohesive plan and it runs the context up through 500k, it's usually fine; just don't start a new task when it's done.
Maximum-Wishbone5616@reddit
KV cache precision is important for that.
wren6991@reddit
This is where you sadly realise that the people calling Q4_K "lossless" have only been trying it on short-context tasks.
Maximum-Wishbone5616@reddit
KV at 16-bit and it runs to 262k context without any issues.
RealPjotr@reddit
Try flash attention.
dodistyo@reddit (OP)
How far can you go with the context size? I mean, what's the highest usable context?
Raredisarray@reddit
I went up to 262k on cline CLI with the 27B at q8_0 yesterday and it seems to be good.
dodistyo@reddit (OP)
what's your hardware?
Raredisarray@reddit
2x MI50 32GB
ambient_temp_xeno@reddit
Maybe it's the untested quant rotation. You could try turning it off using the environment variable (whatever that even is) and see if it's better.
vevi33@reddit
Unfortunately it is much better on longer context without Q8 KV cache quantization (i.e. BF16). I tested it on my project; there is a noticeable difference :/
Prudent-Ad4509@reddit
I’ve seen reports that this is normal for Q4 weights and especially for quantized kv.
ieatdownvotes4food@reddit
Hmm. I'd say focus on running via Linux + vLLM first, then skip gguf and use the model as released. That by itself is going to resolve a lot of issues.
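A minimal sketch of what that could look like, assuming vLLM is installed with working ROCm/CUDA support; the repo id is a placeholder for whatever the official release is actually called on Hugging Face:
vllm serve Qwen/Qwen3.6-27B --max-model-len 131072 --gpu-memory-utilization 0.90
vLLM exposes an OpenAI-compatible endpoint (port 8000 by default), so a harness like opencode can point at it the same way it points at llama-server.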
tomByrer@reddit
Tried this AMD-specific inference engine?
https://www.reddit.com/r/LocalLLaMA/comments/1swpsv0/amd_hipfire_a_new_inference_engine_optimized_for/
Though in general, it seems almost all models, be they local or hosted, start tanking at about 80% of context fill.
Glittering-Call8746@reddit
Does ROCm or HIP have TurboQuant support?
dodistyo@reddit (OP)
I use Vulkan. Vulkan is faster than ROCm in my experience.
max-mcp@reddit
Have you tried lowering the context to around 80k to see if it's more stable? I've noticed most Q4 quants start getting wonky past 70-80k even with proper cache quanting, might be worth testing with Q5_K_M if you can squeeze it in.
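As a rough sketch, that test would just be OP's command with the context dropped and the quant swapped (the filename is a placeholder):
llama-server -m Qwen3.6-27B-Q5_K_M.gguf -c 80000 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0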
dodistyo@reddit (OP)
Yeah, I just want to see how far it can go. Now we know what to expect from a model of that size.
Thrumpwart@reddit
Try a q5 with lower context?
Ok-Measurement-1575@reddit
Tried removing repeat pen?