Has anyone else noticed small models falling apart well before their context limit? Seeing consistent degradation at 12-15K on Mistral 8B/14B despite 128K training context.
Posted by Nice_Willingness_367@reddit | LocalLLaMA | 10 comments
I've been running 8-14B models from the Mistral family (among others) - Ministral 3 8B/14B Reasoning/Instruct - for agentic tool-calling workflows on local hardware. Training context is 128K, and I'm running with 40-77K context windows. But I'm hitting soft degradation somewhere around 12-15K tokens consumed in cache.
I've seen this now in 2 different workloads, similar pattern.
In a home assistant (intent routing + tool calling), the model starts claiming it performed actions it didn't, or garbling canned responses from sub-agents. Outputs that should be straightforward copy-paste from tool results get mangled.
In a coding assistant (multi-step file editing), the model spirals when context gets heavy. The same task that completes in 5-6 steps when reads come in under budget will spiral for 30-60 steps once context crosses the threshold - nonsensical tool calls, modifying unrelated files, losing track of the task entirely. No clear pattern in which task type triggers it (bug fixes, refactors, and feature additions all hit it), but the likelihood of a spiral clearly correlates with context length.
Both workloads use the same serving backend (llama-server with native FC). Q4_K_M or Q8_0 quantization. Cache quant at default or Q8_0.
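For reference, here's a hypothetical llama-server launch matching a setup like this (flags are from llama.cpp's server; the model filename, port, and context size are placeholders, not necessarily what OP ran):

```shell
# Hypothetical launch: Q4_K_M weights, Q8_0 KV cache, ~64K context.
# --jinja applies the model's chat template, which llama-server uses
# for native function calling; V-cache quantization needs flash attention.
llama-server \
  -m ministral-3-8b-instruct-Q4_K_M.gguf \
  --ctx-size 65536 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --port 8080
```

One thing worth checking in a setup like this: dropping `--cache-type-k`/`--cache-type-v` back to the f16 default isolates whether KV-cache quantization is contributing to the degradation.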
I don't have a clear quantitative assessment yet, but enough of a qualitative one to be here wondering if others have come across this and how they resolved it.
Has anyone measured effective attention vs advertised context window for small models? Is this a known quantization effect, a KV cache behavior, or something else? Curious if this is Mistral-specific or general to the 8B-14B class.
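One crude way to measure this (a needle-in-a-haystack style probe, not a standard benchmark - the filler text, needle, and ~4 chars/token estimate are all assumptions): bury a fact at a random depth in filler of increasing length, ask the model to retrieve it, and plot accuracy per context-length bucket. Only the prompt construction is sketched here; wiring it to a local endpoint is left out.

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog. "

def make_probe(total_tokens_approx, needle="The magic number is 7421."):
    """Build a haystack prompt of roughly `total_tokens_approx` tokens
    (estimated at ~4 chars per token) with the needle inserted at a
    random depth, followed by a retrieval question."""
    n_chars = total_tokens_approx * 4
    haystack = (FILLER * (n_chars // len(FILLER) + 1))[:n_chars]
    pos = random.randint(0, len(haystack))
    prompt = haystack[:pos] + " " + needle + " " + haystack[pos:]
    return prompt + "\n\nWhat is the magic number?"

# e.g. probe at 4K, 8K, 16K, 32K tokens and record retrieval accuracy
# per bucket to find where recall falls off a cliff.
```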
Nice_Willingness_367@reddit (OP)
I might have figured this out - at least in my context.
Tool calls and results were being kept in the conversation history as the agent was doing tasks. That's good, because it can see what it already did, and it gets results to use for next steps...that's the whole point. I also had some custom retry messages/nudges when it failed a tool call to get it back on track. No issues on single workflows really.
But once the workflow completes...that's a lot less useful. And the degradation was really most apparent in multi-turn conversations. Think a code-assistant where you fire off one build task that executes, then waits for your next instructions.
So I started collapsing all the tool calls and results into very minimal "[tool] did the thing" style messages, and I cut the retry nudges from the conversation once a "turn" completed (back to user input). Helped immensely.
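The compaction idea above can be sketched roughly like this (not OP's actual code - the OpenAI-style message dicts and the `is_retry_nudge` flag are assumptions about how the history is stored):

```python
def compact_turn(messages):
    """Collapse tool traffic from a finished turn: replace assistant
    tool-call messages with one-line summaries, drop tool results and
    injected retry nudges, keep everything else."""
    compacted = []
    for msg in messages:
        if msg.get("is_retry_nudge"):   # hypothetical flag on retry messages
            continue                    # drop nudges entirely
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            names = [tc["function"]["name"] for tc in msg["tool_calls"]]
            compacted.append({"role": "assistant",
                              "content": f"[tool] called {', '.join(names)}"})
        elif msg.get("role") == "tool":
            continue                    # the summary above stands in for results
        else:
            compacted.append(msg)
    return compacted
```

Run once per completed turn, so the full tool results stay visible while the agent is mid-task and only get collapsed after control returns to the user.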
So I think it might not be attention degradation per se but something more like context pollution. I'm sure bigger models are more resilient - but thought I'd share here in case someone else runs into this.
Middle_Bullfrog_6173@reddit
Might be a quantization artifact. Borderline ok quants tend to do fine with short context lengths, but fall apart later. I've only used Ministral for short context stuff though.
In general my experience is that most models will not perform as well even at 50% of max (trained) context, but 10% should be fine.
Nice_Willingness_367@reddit (OP)
Gotcha. I'll keep an eye on any effect quant has on patterns.
p_235615@reddit
I used ministral-3 8b instruct with 64k context with vscode + cline, worked fine even close to full context.
Nice_Willingness_367@reddit (OP)
Good to know there's a world where it worked - I'll dig around and see if I spot something I'm doing.
Intelligent-Glass840@reddit
i think it’s less about them falling behind and more about the benchmarks finally catching up to how we actually use them. a 1.5b or 3b model from 2024 is definitely a toy now, but the new qwen 3.5 9b is literally outperforming old 70b models on logic tasks. the real issue is just the vibes: once you get used to the zero latency of a tiny model, the thinking pauses in the bigger reasoning models feel like forever haha. for simple rag or classification, the small guys are still king for speed.
Radiant-Video7257@reddit
How small are we talking? Qwen3.5-9b does ok and the bigger models like Gemma-4-31b do well too.
Nice_Willingness_367@reddit (OP)
I was working with Ministral 3 8B and 14B. It does incredibly well with small context workflows (even complex ones), but then loses the thread after a while.
Clear-Ad-9312@reddit
yeah, I find different models handle context differently. ministral just doesn't like large context. great as a sub-agent that does one task and retires, bad for long-term contextual usage.
Nice_Willingness_367@reddit (OP)
Interesting! Glad I'm not the only one seeing this. That's what I saw as well, great subagent with limited context. I'll see what I can do to clean it up a bit and give it more breathing room.