Is there a way to mitigate performance loss as context grows?
Posted by WhatererBlah555@reddit | LocalLLaMA | 19 comments
In my local LLM setup I get from 30 to 80 t/s generation at the beginning, but it drops quite a lot as context grows.
I use llama.cpp/Vulkan with an MI50 and a V100. Are there some command-line flags that can improve this? Or some good practice other than restarting the chat after some time?
cstocks@reddit
Compaction is used heavily by Claude, so I guess you can implement something similar at the harness level.
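Something like this, with summarize() as a stand-in for whatever model call you use (a minimal sketch, not a real harness):

```python
# Minimal sketch of harness-level compaction. summarize() is hypothetical:
# it would ask the model (or a cheaper one) to condense the old turns.
def compact(messages, keep_last=4, max_messages=30):
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)  # hypothetical call into your model/endpoint
    return [{"role": "system",
             "content": "Summary of earlier conversation: " + summary}] + recent
```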
MLExpert000@reddit
That’s exactly where local breaks.
Finanzamt_Endgegner@reddit
That's where everything breaks. This is a fundamental O(n²) attention problem. Qwen Next tries to fix that with linear attention; DeepSeek's older model did something similar, and V4 is even better.
MLExpert000@reddit
A few things help a bit:
But honestly it's mostly a fundamental thing: context grows, it just slows down. Most people just break it up or use retrieval instead of pushing one long convo forever.
Finanzamt_Endgegner@reddit
Qwen3 Next, so for example 3.6, is not that bad in that regard; it barely slows down at 100k context.
MLExpert000@reddit
That's really impressive. What kind of hardware setup have you got?
Finanzamt_Endgegner@reddit
4070 Ti + 2070
HopePupal@reddit
nope, quadratic attention costs with growing context are still a fundamental limitation of transformers. newish model architectures mix in some fraction of sliding-window, deltanet, Mamba, etc. layers that have fixed state size and don't do that, but that just slows the growth down by a constant, it doesn't change the scaling law.
unless you really really need your whole input as context (like you're trying to summarize a novel in one go), it's better to start a new context for each question/feature/change/etc.
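Toy numbers to see the scaling difference (made-up figures, not benchmarks):

```python
# Full attention: each new token attends to all tokens before it, so total
# work over n tokens is 1 + 2 + ... + n = n(n+1)/2, i.e. quadratic.
# A fixed-state layer (sliding window / Mamba / deltanet) does a constant
# amount of work per token, so total work is linear.
def attention_ops(n_tokens: int) -> int:
    return n_tokens * (n_tokens + 1) // 2

def fixed_state_ops(n_tokens: int, state_size: int = 4096) -> int:
    return n_tokens * state_size

for n in (1_000, 10_000, 100_000):
    print(n, attention_ops(n), fixed_state_ops(n))
```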
ea_man@reddit
Usually Vulkan gives you fast performance while context is short, then degrades hard when it fills up; ROCm is the ~opposite, as it starts slower yet stays stable when your prompt (read too) grows over ~35k.
On the application layer, if your problem is (like for most local users) that prompts / context get BIG fast and tank the performance ->
* avoid bloated harnesses like Qwencode / opencode that boot up each prompt with 11k of context
* use harnesses meant to reduce that: Late, Aider, Pi
* create summaries, load them into context early, and reset context as soon as you can after each operation. If that starting prompt won't change, it should be cached (see the sketch below).
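Sketch of the caching bit, assuming llama-server on localhost:8080; cache_prompt asks the server to reuse the KV cache for the common prompt prefix, so a static summary block at the top isn't re-prefilled every request (summaries.md is a hypothetical file):

```python
# Rough sketch: keep a static prefix (system prompt + summaries) at the
# top of every request and let llama-server cache its KV prefix.
import requests

STATIC_PREFIX = open("summaries.md").read()  # hypothetical summary file

def ask(question: str) -> str:
    r = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": STATIC_PREFIX + "\n\nUser: " + question + "\nAssistant:",
            "n_predict": 256,
            "cache_prompt": True,  # reuse cached KV for the shared prefix
        },
        timeout=600,
    )
    return r.json()["content"]

print(ask("What did we decide about the API layout?"))
```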
txgsync@reddit
If prefill is what's killing you -- and at large context with multi-turn chat, it almost always is -- llama.cpp's slot save/restore is the single biggest win. Persist the KV cache to disk per slot and you skip re-prefilling history every turn. First token comes back fast instead of after a 30-second-to-five-minute stare-at-the-screen pause. Decode will still get a bit slower as context grows, but you're no longer paying the quadratic prefill tax on every turn.
Beyond that, model architecture matters a lot at long context. Take a hard look at Nemotron 3 Nano (30B total, 3.2B active, released December). It's a hybrid Mamba-Transformer MoE: most layers are Mamba-2 with fixed-size state, with attention layers sprinkled in periodically, and MoE on top so only ~3B params are active per token. That gets you near-linear sequence scaling across the bulk of the stack and a tiny KV cache footprint, since only the attention layers contribute to KV. 1M context window, and llama.cpp supports it. Should run great on your hardware. If you're regularly working past 100k tokens, switching to a hybrid SSM is probably a bigger lever than any flag tuning.
Other knobs on llama.cpp include:
* --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0 if you're brave) to cut memory bandwidth on decode.

It's a legitimately hard problem to preserve performance with large context in attention-based Transformers. The answer of the big labs is "throw more hardware at the problem", which is the exact reason for the massive datacenter boom in the USA right now.
WhatererBlah555@reddit (OP)
Thank you for the exhaustive answer :)
I already use --cache-type-k q8_0 --cache-type-v q8_0, maybe I'll try something more aggressive and see if it helps.
I tried the Nemotron Nano (Q4) but I found it unsatisfactory; maybe I'll try it again with a less aggressive quantization.
For the prompt caching you're referring to the --slots parameter of llama.cpp ? https://github.com/ggml-org/llama.cpp/discussions/13606
Can you explain more the GQA vs MHA models thing?
Thanks!
txgsync@reddit
I apologize for the AI-ness of my response below. I've been goofing around a lot with turboquant on Apple Silicon and it was just easier to have Claude pull the details from my projects than to write it up from scratch :)
---
Glad it was useful!
On slots: Yes, that discussion is a good starting point, but the actual mechanism you want is the /slots REST endpoint with slot save/restore via --slot-save-path. You launch llama-server with --slot-save-path /some/dir, then POST to /slots/{id}?action=save with a filename to persist that slot's KV cache, and ?action=restore to load it back. The server README has the API. There's also --prompt-cache for the older CLI tools if you're not using the server.

On Nemotron 3 Nano at Q4: Your instinct to try a less aggressive quant is reasonable. Hybrid Mamba-Transformer MoE models tend to be more quantization-sensitive than pure Transformers; the Mamba state dynamics and the MoE routing both compound quantization error in ways that pure attention doesn't. Q5_K_M or Q6_K might be the sweet spot for Nemotron models on your gear; Q8_0 if you have the VRAM (/u/txgsync's injection on top of Claude here: yeah, q8 if you can. The quality loss is "only a few percent" for smaller quants, but that few percent can multiply if quantized at the wrong layers!). The active param count is only 3.2B so the quant tax in HBM memory terms is small. Mostly a lot of load/unload time of experts in your GPU if the model is larger than VRAM.
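Roughly, the save/restore flow described above looks like this (a sketch assuming llama-server on localhost:8080 launched with --slot-save-path; check the server README for the exact fields):

```python
# Persist a slot's KV cache to disk, then restore it later so the server
# doesn't have to re-prefill the whole conversation history.
import requests

BASE = "http://localhost:8080"

def save_slot(slot_id: int, filename: str):
    # filename is relative to the directory given in --slot-save-path
    return requests.post(f"{BASE}/slots/{slot_id}?action=save",
                         json={"filename": filename}).json()

def restore_slot(slot_id: int, filename: str):
    return requests.post(f"{BASE}/slots/{slot_id}?action=restore",
                         json={"filename": filename}).json()

save_slot(0, "chat-session.bin")
# ... come back tomorrow, or after a server restart ...
restore_slot(0, "chat-session.bin")
```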
On GQA vs MHA: Multi-Head Attention gives every attention head its own K and V projections — if you have 32 heads, you store 32 separate K tensors and 32 separate V tensors per layer per token. Grouped-Query Attention shares K and V across groups of heads — e.g., 32 query heads but only 8 KV heads, so 4 query heads share each KV pair. The Q projections stay per-head (so expressivity is largely preserved), but the KV cache shrinks by the grouping factor. With 32→8 GQA, your KV cache is 4x smaller, which means 4x less memory bandwidth per decode step and 4x more context fits in the same VRAM. MQA (Multi-Query Attention) is the extreme case — one shared KV head — even smaller cache, slightly more quality cost.
Practically every modern model uses GQA now (Llama 3/4, Qwen 2.5/3, Mistral, etc.). (/u/txgsync jumping in again: this means if you're using recent 2026-era models? You can almost certainly ignore MHA completely.) With Nemotron 3 it's almost a non-issue because most layers aren't attention layers at all. Gemma-4's E4B and E2B models similarly swap some attention layers for lookup tables... and it works nicely.
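Back-of-the-envelope, with made-up but plausible numbers (32 layers, head_dim 128, fp16), you can see where the 4x comes from:

```python
# KV cache size: K and V each store n_kv_heads * head_dim values
# per layer per token. The only difference between MHA and GQA here
# is the number of KV heads.
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

ctx = 32_768
mha = kv_cache_bytes(ctx, n_kv_heads=32)  # full multi-head attention
gqa = kv_cache_bytes(ctx, n_kv_heads=8)   # 32 query heads sharing 8 KV heads
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4x smaller
```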
You may wanna consider "Turboquant" too. Google dropped this at ICLR 2026 for KV cache compression. Uses a random orthogonal rotation to gaussianize the distribution, then applies Lloyd-Max optimal scalar quantization per coordinate. Results in about ~5x compression at 3-bit with way less quality loss than naive q3 KV would give you. I normally run at either Q4 or Q8 depending on how much aliasing I'm willing to stomach at larger contexts (and pragmatic RAM limitations, LOL).
Anyway, I use Turboquant heavily on my Mac and it works well. The benefits compound the longer the context is. But I've noticed fuzziness at 4 bits and below beyond 128K context on 256K models that I don't see when KV isn't quantized. So I mostly use TQ4 for long-horizon planning, discussion, strategy, general chat, etc. but either use TQ8 or none at all for coding tasks where perfect recall is very important.
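If you want a feel for the rotate-then-quantize idea, here's a conceptual numpy sketch (not the actual Turboquant implementation; plain uniform quantization stands in for Lloyd-Max):

```python
# Conceptual sketch: a random orthogonal rotation spreads the KV values
# into a roughly Gaussian shape, then each coordinate gets a cheap scalar
# quantizer. Rotate back after dequantizing.
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(dim: int) -> np.ndarray:
    # QR of a Gaussian matrix gives a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize(x: np.ndarray, bits: int = 3):
    # simple symmetric uniform scalar quantizer (stand-in for Lloyd-Max)
    levels = 2 ** bits - 1
    scale = np.abs(x).max() / (levels / 2)
    return np.round(x / scale), scale

# fake "KV cache" block with uneven per-dimension scales: (tokens, head_dim)
kv = rng.standard_normal((1024, 128)) * np.linspace(0.1, 3.0, 128)
R = random_orthogonal(kv.shape[1])

q, scale = quantize(kv @ R, bits=3)     # rotate, then quantize
recovered = (q * scale) @ R.T           # dequantize, rotate back
print("mean abs error:", np.abs(recovered - kv).mean())
```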
For llama.cpp there are a few forks: --cache-type-k turbo3 --cache-type-v turbo3 (turbo2/turbo3/turbo4 variants). Validated on Metal; CUDA path exists but mixed-precision parity isn't fully verified, so test carefully on your V100. Probably rougher on the MI50 ROCm side.

Performance ballpark from TheTom's benchmarks: turbo3 gives roughly +3.6% PPL on a 104B at 128K context, +11.4% on a 70B. Bigger models absorb the quantization stacking better. One important nuance: K precision matters way more than V because K drives attention routing through softmax. If you're running a Q4 weight model, the recommended config is asymmetric — -ctk q8_0 -ctv turbo4 keeps K precision high while compressing V hard (but some models bomb out if you use mixed-precision KV quantization! Buyer beware, LOL.). Symmetric turbo3 really only shines on Q8+ weights. The fork also includes a "Sparse V" decode optimization that's orthogonal to turbo but stacks well — up to +22.8% decode speed at 32K context.

---
Right now there are so many really good models -- yet so many of them shit the bed with mixed-precision KV cache quantization -- that keeping straight what works well with what could become a full-time job if I let it :)
Accomplished_Snow_78@reddit
Compact, plus a Handover.md, at 30-40% of the context window.
shing3232@reddit
Maybe use hybrid attention, or something close to linear attention like the new DSV4F.
tmvr@reddit
That's how things work unfortunately. Some models drop less, some drop more. Qwen3.6 35B A3B on an RTX4090 for example starts at 169 tok/s at depth 0 and ends at 72 at depth 256K. At 128K it still does 104 tok/s. The Qwen3.6 27B on the same card starts at 44 tok/s and at 128K depth it is down to 31 tok/s.
Expensive-Paint-9490@reddit
Speed decreases mainly because the 1st Q vector only has to be multiplied against a single K vector, but the 10,000th Q vector has to be multiplied against 10,000 K vectors, so the work per token keeps growing. The speed decrease is inherent in the transformer design.
Far_Cat9782@reddit
Yes, in my harness I have tools the AI can use, like /flush, which flushes the context except for the last 2 or 3 messages. It also clears VRAM automatically after image and song generation in ComfyUI. I still have it running 24/7, and at night it "dreams": it compacts and embeds the relevant information on a cron job at 3:00am. This all keeps my token speed massive. Also, when coding I have a tool to edit only the sections that need changing without having to rewrite or load the whole code into memory, a bit like copy and paste for the AI. It really, really helps in dramatically increasing usable context. I would post it here since it's on GitHub, but I don't want to hear vibe-coded naysayers.
Herr_Drosselmeyer@reddit
Nope, it's unavoidable. There are various approaches to reduce context size, like using summarization or RAG, but those only postpone it. More context means more tokens that need to be calculated.
OsmanthusBloom@reddit
Restarting the discussion often makes sense anyway, as both generation speed and quality degrade with longer context.
Depending on what you are doing, ngram-mod speculative decoding might help boost tg speeds. It helps in cases where the model often has to repeat what was already said (e.g. file editing).
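The idea behind the n-gram lookup, as a toy sketch (not llama.cpp's actual implementation): if the last few tokens already appeared earlier in the context, propose whatever followed them last time as draft tokens for the model to verify cheaply in one pass.

```python
# Toy n-gram drafting: search the history for the most recent earlier
# occurrence of the current tail n-gram and return the tokens that
# followed it as a speculative draft.
def ngram_draft(tokens, n=3, max_draft=8):
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):  # skip the tail itself
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + max_draft]
    return []

print(ngram_draft(list("the cat sat on the mat and the cat s")))  # drafts "at on th"
```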