Qwen3.6 27B seems to struggle at 90k on a 128k ctx window
Posted by dodistyo@reddit | LocalLLaMA | 45 comments
I have an RX 7900 XTX, running Qwen3.6 27B Q4_K_XL. I get 400-ish pp and 30-ish tps. Everything below 64k is incredible and it spits out good quality code.
But when I tried to push it further to work on some fairly complex devops-related tasks, it fails at tool calling around 90k ctx.
I use opencode as my harness and here is the llama.cpp command I ran:
llama-server -ctv q8_0 -ctk q8_0 -c 128000 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 --fit on
what's your experience?
Easy_Werewolf7903@reddit
A single data point, but around 70k tokens the model would stop generating text midway in opencode. I'm using FP8 with 260k context.
Maleficent-Ad5999@reddit
Yes. I figured out the best way is to use sub-agents in opencode, so that each task is delegated to a subagent that is quick enough to perform it and report back.
For example, I have multiple subagents for web research, codebase search, breaking down tasks, code implementation, and validating the change, all coordinated by the main agent. This way, each task starts with a smaller context and gets a faster response, and the main chat window keeps its context small.
This setup feels like a great boost with Qwen3.6 27B; as I go longer into the chat I still consume only around 30k tokens.
Djagatahel@reddit
How do you manage prefix caching efficiently? I use llama.cpp with RAM prefix caching and it's very hit or miss, mostly miss. That makes the experience with subagents pretty bad.
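For what it's worth, two llama-server flags are relevant here. This is a rough sketch, worth checking against your llama.cpp build, and whether it actually fixes the subagent case is untested; the model path is a placeholder:
llama-server -m model.gguf -c 128000 --parallel 2 --cache-reuse 256
--parallel gives each concurrent request its own slot with its own prefix cache (though the total context is split across slots), and --cache-reuse lets the server reuse matching cached chunks even when the new prompt isn't an exact prefix match.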
Raredisarray@reddit
How do you use sub agents with local LLM, like does a new context spawn and that agent reports its context back to your main agent in the terminal window that stays open?
Maleficent-Ad5999@reddit
Exactly. Sub-agents are just markdown files with distinct instructions on what a subagent should do. OpenCode manages the context window effectively: the LLM is given instructions on which subagent to choose based on the request, and it creates a new session with a new context to solve the query.
We never needed this with frontier models, as they are so good and have huge context windows.
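For illustration, a minimal sketch of what one of those subagent files could look like. The path and frontmatter keys below are assumptions based on OpenCode's agent docs, so verify them against your version.
.opencode/agent/codebase-search.md:
---
description: Read-only agent that searches the codebase and reports back relevant files and snippets
mode: subagent
tools:
  write: false
  edit: false
---
Find the files and snippets relevant to the request and report back concisely so the main agent's context stays small.
The frontmatter marks it as a subagent and limits its tools; the body underneath acts as that agent's system prompt.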
Raredisarray@reddit
Thanks for the explanation ! Very cool, I’ll have to give it a shot.
Hot_Turnip_3309@reddit
it's your quant. it works fine for me. don't use the unsloth quant for 3.6
juaps@reddit
In my experience, going from 80k with 8-bit KV to 220k with TurboQuant gguf was night and day; now I can manage my 1GB HTML project and actually work.
AvidCyclist250@reddit
A game changer. Currently using buun llama.cpp. Thanks 4 the context, brah
juaps@reddit
np dude. For me, it was the key to finally starting to work, so I'm really happy now. Glad it helped u too
Glittering-Call8746@reddit
Yea use it or lose it .. u using 3x compression or 4x ?
AvidCyclist250@reddit
I think actual compression is way more than that
juaps@reddit
Looks like 4x, but sorry, not so sure. I'm using -ctv turbo3, and the startup log says turbo3 using 4-mag LUT. K cache is q8_0, V cache is turbo3.
dodistyo@reddit (OP)
mind to share the exact model?
juaps@reddit
yeah sure, i went from LMstudio to TurboQuant Plus (turboquant-plus-tqp-v0.1.1).
Model: Qwen3.6-27B-Q6_K.gguf
Launch: -ctk q8_0 -ctv turbo3 -c 222144 -fa on -ngl 99 --reasoning on
So K cache is q8_0, V cache is turbo3, 222k ctx, flash attention on.
Glittering-Call8746@reddit
So Windows WSL2, which ROCm version, and so on?
Sixstringsickness@reddit
This is literally the nature of LLMs: degradation in performance can start at as low as 10% of context window usage. You can see this even with SOTA models such as Opus 4.7; go past 50% and they become nearly useless.
Context Rot: How Increasing Input Tokens Impacts LLM Performance
https://www.trychroma.com/research/context-rot
Automatic-Arm8153@reddit
Correct. For local models I don't recommend going past 64k. If your work has to go past 64k, your workflow is not efficient enough.
With Claude I sometimes push it up to 140k, but from 120k its ability is already degraded. That's not to say it won't solve problems, but you will be wasting tokens as its problem-solving ability is worse.
Compact often. Compact always.
IMO the best way to use LLMs is with no memory features and a precise prompt. Sometimes I have a chat just to make the perfect prompt/context for another chat, so I'm not wasting tokens in the problem-solving chat.
Sirius_Sec_@reddit
I do the same thing. I've been wanting to try caveman as well.
michaelsoft__binbows@reddit
Did some experiments with opencode and my local 27B configured with a max context of 60k tokens. It would autocompact itself to smithereens. Experimenting with other harnesses now...
Automatic-Arm8153@reddit
This is funny because I see all these people wanting to TurboQuant or Q8/Q4 the KV cache just to hit a model's max token window.
But these max token windows are useless flexes. PP and TG speeds become much slower and you waste time hand-guiding the LLMs through simple problems at high context.
Sixstringsickness@reddit
Yes, TG comes to a grinding halt very rapidly.
iezhy@reddit
It's more that 100k is a semi-hard limit for all models; they go to "dum dum" mode fairly quickly after that.
Sirius_Sec_@reddit
That's what I've noticed as well. The 2 million context window advertised by xAI and others is complete bs. This is why it's important to create batches of tasks and keep the context as short as possible.
FlyingDogCatcher@reddit
Yeah, people have complained for a long time about GitHub Copilot capping most of their models at 128k, and I was always like "You don't need or want more than that".
Sixstringsickness@reddit
This depends. I can push Opus 4.7's (1M) window into the 500k range before noticing substantial degradation (context dependent, no pun intended). If you have a cohesive plan and it runs the context up through 500k, it's usually fine; just don't start a new task when it's done.
Maximum-Wishbone5616@reddit
KV cache precision is important for that.
wren6991@reddit
This is where you sadly realise that the people calling Q4_K "lossless" have only been trying it on short-context tasks.
Maximum-Wishbone5616@reddit
KV at 16-bit and it runs to 262k context without any issues.
RealPjotr@reddit
Try flash attention.
dodistyo@reddit (OP)
How far can you go with the context size? I mean, what's the highest usable context?
Raredisarray@reddit
I went up to 262k on cline CLI with the 27B at q8_0 yesterday and it seems to be good.
dodistyo@reddit (OP)
what's your hardware?
Raredisarray@reddit
2x MI50 32GB
ambient_temp_xeno@reddit
Maybe it's the untested quant rotation. You could try turning it off using the environment variable (whatever that even is) and see if it's better.
vevi33@reddit
Unfortunately it is much better on longer context without Q8 KV cache quantization (i.e. BF16). I tested it on my project; there is a noticeable difference :/
Prudent-Ad4509@reddit
I’ve seen reports that this is normal for Q4 weights and especially for quantized kv.
ieatdownvotes4food@reddit
Hmm. I'd say focus on running via Linux + vLLM first, then skip gguf and use the model as released. That by itself is going to resolve a lot of issues.
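A minimal sketch of what that could look like, assuming vLLM is installed with working ROCm/CUDA support; the repo id is a placeholder for whatever the official release is actually called on Hugging Face:
vllm serve Qwen/Qwen3.6-27B --max-model-len 131072 --gpu-memory-utilization 0.90
vLLM exposes an OpenAI-compatible endpoint (port 8000 by default), so a harness like opencode can point at it the same way it points at llama-server.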
tomByrer@reddit
Tried this AMD-specific inference engine?
https://www.reddit.com/r/LocalLLaMA/comments/1swpsv0/amd_hipfire_a_new_inference_engine_optimized_for/
Though in general, it seems almost all models, be they local or hosted, start tanking at about 80% of context fill.
Glittering-Call8746@reddit
Does ROCm or HIP have TurboQuant support?
dodistyo@reddit (OP)
I use Vulkan. Vulkan is faster than ROCm in my experience.
max-mcp@reddit
Have you tried lowering the context to around 80k to see if it's more stable? I've noticed most Q4 quants start getting wonky past 70-80k even with proper cache quanting, might be worth testing with Q5_K_M if you can squeeze it in.
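As a rough sketch, that test would just be OP's command with the context dropped and the quant swapped (the filename is a placeholder):
llama-server -m Qwen3.6-27B-Q5_K_M.gguf -c 80000 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0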
dodistyo@reddit (OP)
Yeah, I just want to see how far it can go. Now we know what to expect from a model of that size.
Thrumpwart@reddit
Try a q5 with lower context?
Ok-Measurement-1575@reddit
Tried removing repeat pen?