Is there a way to mitigate performance loss as context grows?
Posted by WhatererBlah555@reddit | LocalLLaMA | 19 comments
In my local LLM setup I get from 30 to 80 t/s generation at the beginning, but it drops quite a lot as context grows.
I use llama.cpp/Vulkan with an MI50 and a V100. Are there some command-line flags that can improve this? Or some good practice other than restarting the chat after some time?
cstocks@reddit
Compaction is used heavily by Claude, so I guess you can implement something similar at the harness level.
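Something like this, with summarize() as a stand-in for whatever model call you use (a minimal sketch, not a real harness):

```python
# Minimal sketch of harness-level compaction. summarize() is hypothetical:
# it would ask the model (or a cheaper one) to condense the old turns.
def compact(messages, keep_last=4, max_messages=30):
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)  # hypothetical call into your model/endpoint
    return [{"role": "system",
             "content": "Summary of earlier conversation: " + summary}] + recent
```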
MLExpert000@reddit
That’s exactly where local breaks.
Finanzamt_Endgegner@reddit
That's where everything breaks. This is a fundamental O(n²) attention problem. Qwen Next tries to fix that with linear attention; DeepSeek's older model did something similar, and V4 is even better.
MLExpert000@reddit
A few things help a bit:
But honestly it's mostly a fundamental thing: context grows, it just slows down. Most people just break it up or use retrieval instead of pushing one long convo forever.
Finanzamt_Endgegner@reddit
Qwen3 Next, so for example 3.6, is not that bad in that regard; it barely slows down at 100k context.
MLExpert000@reddit
That's really impressive. What kind of hardware setup have you got?
Finanzamt_Endgegner@reddit
4070 Ti + 2070
HopePupal@reddit
nope, quadratic attention costs with growing context are still a fundamental limitation of transformers. newish model architectures mix in some fraction of sliding-window, deltanet, Mamba, etc. layers that have fixed state size and don't do that, but that just slows the growth down by a constant, it doesn't change the scaling law.
unless you really really need your whole input as context (like you're trying to summarize a novel in one go), it's better to start a new context for each question/feature/change/etc.
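Toy numbers to see the scaling difference (made-up figures, not benchmarks):

```python
# Full attention: each new token attends to all tokens before it, so total
# work over n tokens is 1 + 2 + ... + n = n(n+1)/2, i.e. quadratic.
# A fixed-state layer (sliding window / Mamba / deltanet) does a constant
# amount of work per token, so total work is linear.
def attention_ops(n_tokens: int) -> int:
    return n_tokens * (n_tokens + 1) // 2

def fixed_state_ops(n_tokens: int, state_size: int = 4096) -> int:
    return n_tokens * state_size

for n in (1_000, 10_000, 100_000):
    print(n, attention_ops(n), fixed_state_ops(n))
```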
ea_man@reddit
Usually Vulkan gives you fast performance while context is short, then degrades hard when it fills up; ROCm is the ~opposite, as it starts slower yet stays stable when your prompt (read too) grows over ~35k.
On the application layer, if your problem is (like for most local users) that prompts / context get BIG fast and tank the performance ->
* avoid bloated harnesses like Qwencode / opencode that boot up each prompt with 11k of context
* use harnesses meant to reduce that: Late, Aider, Pi
* create summaries, load them into context early, and reset context as soon as you can after each operation. If that starting prompt won't change, it should be cached (see the sketch below).
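Sketch of the caching bit, assuming llama-server on localhost:8080; cache_prompt asks the server to reuse the KV cache for the common prompt prefix, so a static summary block at the top isn't re-prefilled every request (summaries.md is a hypothetical file):

```python
# Rough sketch: keep a static prefix (system prompt + summaries) at the
# top of every request and let llama-server cache its KV prefix.
import requests

STATIC_PREFIX = open("summaries.md").read()  # hypothetical summary file

def ask(question: str) -> str:
    r = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": STATIC_PREFIX + "\n\nUser: " + question + "\nAssistant:",
            "n_predict": 256,
            "cache_prompt": True,  # reuse cached KV for the shared prefix
        },
        timeout=600,
    )
    return r.json()["content"]

print(ask("What did we decide about the API layout?"))
```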
txgsync@reddit
If prefill is what's killing you -- and at large context with multi-turn chat, it almost always is -- llama.cpp's slot save/restore is the single biggest win. Persist the KV cache to disk per slot and you skip re-prefilling history every turn. First token comes back fast instead of after a 30-second-to-five-minute stare-at-the-screen pause. Decode will still get a bit slower as context grows, but you're no longer paying the quadratic prefill tax on every turn.
Beyond that, model architecture matters a lot at long context. Take a hard look at Nemotron 3 Nano (30B total, 3.2B active, released December). It's a hybrid Mamba-Transformer MoE: most layers are Mamba-2 with fixed-size state, with attention layers sprinkled in periodically, and MoE on top so only ~3B params are active per token. That gets you near-linear sequence scaling across the bulk of the stack and a tiny KV cache footprint, since only the attention layers contribute to KV. 1M context window, and llama.cpp supports it. Should run great on your hardware. If you're regularly working past 100k tokens, switching to a hybrid SSM is probably a bigger lever than any flag tuning.
Other knobs on llama.cpp include:
* --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0 if you're brave) to cut memory bandwidth on decode.

It's a legitimately hard problem to preserve performance with large context in attention-based Transformers. The answer of the big labs is "throw more hardware at the problem", which is the exact reason for the massive datacenter boom in the USA right now.
WhatererBlah555@reddit (OP)
Thank you for the exhaustive answer :)
I already use --cache-type-k q8_0 --cache-type-v q8_0, maybe I'll try something more aggressive and see if it helps.
I tried the Nemotron Nano (Q4) but I found it unsatisfactory; maybe I'll try it again with a less aggressive quantization.
For the prompt caching you're referring to the --slots parameter of llama.cpp ? https://github.com/ggml-org/llama.cpp/discussions/13606
Can you explain more the GQA vs MHA models thing?
Thanks!
txgsync@reddit
I apologize for the AI-ness of my response below. I've been goofing around a lot with turboquant on Apple Silicon and it was just easier to have Claude pull the details from my projects than to write it up from scratch :)
---
Glad it was useful!
On slots: Yes, that discussion is a good starting point, but the actual mechanism you want is the /slots REST endpoint with slot save/restore via --slot-save-path. You launch llama-server with --slot-save-path /some/dir, then POST to /slots/{id}?action=save with a filename to persist that slot's KV cache, and ?action=restore to load it back. The server README has the API. There's also --prompt-cache for the older CLI tools if you're not using the server.

On Nemotron 3 Nano at Q4: Your instinct to try a less aggressive quant is reasonable. Hybrid Mamba-Transformer MoE models tend to be more quantization-sensitive than pure Transformers; the Mamba state dynamics and the MoE routing both compound quantization error in ways that pure attention doesn't. Q5_K_M or Q6_K might be the sweet spot for Nemotron models on your gear; Q8_0 if you have the VRAM (/u/txgsync's injection on top of Claude here: yeah, q8 if you can. The quality loss is "only a few percent" for smaller quants, but that few percent can multiply if quantized at the wrong layers!). The active param count is only 3.2B so the quant tax in HBM memory terms is small. Mostly a lot of load/unload time of experts in your GPU if the model is larger than VRAM.
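Roughly, the save/restore flow described above looks like this (a sketch assuming llama-server on localhost:8080 launched with --slot-save-path; check the server README for the exact fields):

```python
# Persist a slot's KV cache to disk, then restore it later so the server
# doesn't have to re-prefill the whole conversation history.
import requests

BASE = "http://localhost:8080"

def save_slot(slot_id: int, filename: str):
    # filename is relative to the directory given in --slot-save-path
    return requests.post(f"{BASE}/slots/{slot_id}?action=save",
                         json={"filename": filename}).json()

def restore_slot(slot_id: int, filename: str):
    return requests.post(f"{BASE}/slots/{slot_id}?action=restore",
                         json={"filename": filename}).json()

save_slot(0, "chat-session.bin")
# ... come back tomorrow, or after a server restart ...
restore_slot(0, "chat-session.bin")
```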
On GQA vs MHA: Multi-Head Attention gives every attention head its own K and V projections — if you have 32 heads, you store 32 separate K tensors and 32 separate V tensors per layer per token. Grouped-Query Attention shares K and V across groups of heads — e.g., 32 query heads but only 8 KV heads, so 4 query heads share each KV pair. The Q projections stay per-head (so expressivity is largely preserved), but the KV cache shrinks by the grouping factor. With 32→8 GQA, your KV cache is 4x smaller, which means 4x less memory bandwidth per decode step and 4x more context fits in the same VRAM. MQA (Multi-Query Attention) is the extreme case — one shared KV head — even smaller cache, slightly more quality cost.
Practically every modern model uses GQA now (Llama 3/4, Qwen 2.5/3, Mistral, etc.). (/u/txgsync jumping in again: this means if you're using recent 2026-era models? You can almost certainly ignore MHA completely.) With Nemotron 3 it's almost a non-issue because most layers aren't attention layers at all. Gemma-4's E4B and E2B models similarly swap some attention layers for lookup tables... and it works nicely.
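Back-of-the-envelope, with made-up but plausible numbers (32 layers, head_dim 128, fp16), you can see where the 4x comes from:

```python
# KV cache size: K and V each store n_kv_heads * head_dim values
# per layer per token. The only difference between MHA and GQA here
# is the number of KV heads.
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

ctx = 32_768
mha = kv_cache_bytes(ctx, n_kv_heads=32)  # full multi-head attention
gqa = kv_cache_bytes(ctx, n_kv_heads=8)   # 32 query heads sharing 8 KV heads
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4x smaller
```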
You may wanna consider "Turboquant" too. Google dropped this at ICLR 2026 for KV cache compression. Uses a random orthogonal rotation to gaussianize the distribution, then applies Lloyd-Max optimal scalar quantization per coordinate. Results in about ~5x compression at 3-bit with way less quality loss than naive q3 KV would give you. I normally run at either Q4 or Q8 depending on how much aliasing I'm willing to stomach at larger contexts (and pragmatic RAM limitations, LOL).
Anyway, I use Turboquant heavily on my Mac and it works well. The benefits compound the longer the context is. But I've noticed fuzziness at 4 bits and below beyond 128K context on 256K models that I don't see when KV isn't quantized. So I mostly use TQ4 for long-horizon planning, discussion, strategy, general chat, etc. but either use TQ8 or none at all for coding tasks where perfect recall is very important.
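If you want a feel for the rotate-then-quantize idea, here's a conceptual numpy sketch (not the actual Turboquant implementation; plain uniform quantization stands in for Lloyd-Max):

```python
# Conceptual sketch: a random orthogonal rotation spreads the KV values
# into a roughly Gaussian shape, then each coordinate gets a cheap scalar
# quantizer. Rotate back after dequantizing.
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(dim: int) -> np.ndarray:
    # QR of a Gaussian matrix gives a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize(x: np.ndarray, bits: int = 3):
    # simple symmetric uniform scalar quantizer (stand-in for Lloyd-Max)
    levels = 2 ** bits - 1
    scale = np.abs(x).max() / (levels / 2)
    return np.round(x / scale), scale

# fake "KV cache" block with uneven per-dimension scales: (tokens, head_dim)
kv = rng.standard_normal((1024, 128)) * np.linspace(0.1, 3.0, 128)
R = random_orthogonal(kv.shape[1])

q, scale = quantize(kv @ R, bits=3)     # rotate, then quantize
recovered = (q * scale) @ R.T           # dequantize, rotate back
print("mean abs error:", np.abs(recovered - kv).mean())
```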
For llama.cpp there are a few forks: --cache-type-k turbo3 --cache-type-v turbo3 (turbo2/turbo3/turbo4 variants). Validated on Metal; CUDA path exists but mixed-precision parity isn't fully verified, so test carefully on your V100. Probably rougher on the MI50 ROCm side.

Performance ballpark from TheTom's benchmarks: turbo3 gives roughly +3.6% PPL on a 104B at 128K context, +11.4% on a 70B. Bigger models absorb the quantization stacking better. One important nuance: K precision matters way more than V because K drives attention routing through softmax. If you're running a Q4 weight model, the recommended config is asymmetric — -ctk q8_0 -ctv turbo4 keeps K precision high while compressing V hard (but some models bomb out if you use mixed-precision KV quantization! Buyer beware, LOL.). Symmetric turbo3 really only shines on Q8+ weights. The fork also includes a "Sparse V" decode optimization that's orthogonal to turbo but stacks well — up to +22.8% decode speed at 32K context.

---
Right now there are so many really good models -- yet so many of them shit the bed with mixed-precision KV cache quantization -- that keeping straight what works well with what could become a full-time job if I let it :)
Accomplished_Snow_78@reddit
Compact, plus a Handover.md, at 30-40% of the context window.
shing3232@reddit
Maybe use hybrid attention, or something close to linear attention like the new DSV4F.
tmvr@reddit
That's how things work unfortunately. Some models drop less, some drop more. Qwen3.6 35B A3B on an RTX4090 for example starts at 169 tok/s at depth 0 and ends at 72 at depth 256K. At 128K it still does 104 tok/s. The Qwen3.6 27B on the same card starts at 44 tok/s and at 128K depth it is down to 31 tok/s.
Expensive-Paint-9490@reddit
Speed decreases mainly because the 1st Q vector only has to be multiplied against a single K vector, but the 10,000th Q vector has to be multiplied against 10,000 K vectors, so the work per token keeps growing. The speed decrease is inherent in the transformer design.
Far_Cat9782@reddit
Yes, in my harness I have tools the AI can use, like /flush, which flushes the context except for the last 2 or 3 messages. It also clears VRAM automatically after image and song generation in ComfyUI. I still have it running 24/7, and at night it "dreams": it compacts and embeds the relevant information on a cron job at 3:00am. This all keeps my token speed massive. Also, when coding I have a tool to edit only the sections that need changing without having to rewrite or load the whole code into memory, a bit like copy and paste for the AI. It really, really helps in dramatically increasing usable context. I would post it here since it's on GitHub, but I don't want to hear vibe-coded naysayers.
Herr_Drosselmeyer@reddit
Nope, it's unavoidable. There are various approaches to reduce context size, like using summarization or RAG, but those only postpone it. More context means more tokens that need to be calculated.
OsmanthusBloom@reddit
Restarting the discussion often makes sense anyway, as both generation speed and quality degrade with longer context.
Depending on what you are doing, ngram-mod speculative decoding might help boost tg speeds. It helps in cases where the model often has to repeat what was already said (e.g. file editing).
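The idea behind the n-gram lookup, as a toy sketch (not llama.cpp's actual implementation): if the last few tokens already appeared earlier in the context, propose whatever followed them last time as draft tokens for the model to verify cheaply in one pass.

```python
# Toy n-gram drafting: search the history for the most recent earlier
# occurrence of the current tail n-gram and return the tokens that
# followed it as a speculative draft.
def ngram_draft(tokens, n=3, max_draft=8):
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):  # skip the tail itself
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []

print(ngram_draft(list("the cat sat on the mat and the cat s")))  # drafts "at on th"
```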