Notes on what actually breaks when you run a coding agent on small local models
Posted by BestSeaworthiness283@reddit | LocalLLaMA | 12 comments
I've spent the last few weeks running real multi-file coding tasks through small local models and small cloud models on free tiers. I wanted to share the failure points that came up consistently, since some of them surprised me and they might help someone in the community.
Markdown fences are the most common failure across every small model I tested.
You can put "output only raw code, no markdown formatting" in the system prompt. The model agrees. The model also wraps its response in triple backticks anyway, especially when the request involves anything that looks like explaining code. Qwen3.5:9b and gemma4:e4b are the most consistent at following the instruction but still slip occasionally. Other models I tested fail this rule frequently enough that you basically have to assume the fences will be there.
The fix isn't better prompting. It's stripping fences in post-processing as a default. Any code-editing tool using small models has to do this.
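Roughly what the stripping looks like (a minimal sketch; the regex is illustrative, not the exact one from my harness):

```python
import re

def strip_fences(text: str) -> str:
    """Remove a wrapping markdown fence the model added despite instructions."""
    text = text.strip()
    # Optional language tag after the opening backticks, e.g. ```python
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text
```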
From my testing, structured output is unreliable below 7B parameters.
If your agent needs the model to return JSON for task lists (like in my case), action types, or anything machine-parseable, small models fail at this far more often than benchmarks suggest. The benchmarks measure whether the model can produce valid JSON. They don't measure whether it produces valid JSON when given a complex multi-step instruction with edge cases.
In my testing, Gemma4:e4b is the most reliable for structured output among the local models I tried. Qwen3.5:9B is close behind. Codellama (although old) struggles. On the cloud side, Llama 3.3 70B on Groq is rock solid for structured output (the most consistent of everything I tried). Other models, from OpenRouter for example, had some quirks: Nemotron 3 Super was very good, but it stopped responding on OpenRouter after hitting 100k tokens of usage.
The practical workaround is to validate the JSON, retry once with an even more explicit instruction, then fall back to a permissive parser that can extract JSON from prose-wrapped responses.
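In code the chain looks roughly like this (a sketch; call_model is a stand-in for whatever client you use):

```python
import json
import re

def get_json(prompt: str, call_model) -> dict:
    """Strict parse, one explicit retry, then a permissive fallback."""
    raw = ""
    for p in (prompt, prompt + "\nReturn ONLY a JSON object. No prose, no fences."):
        raw = call_model(p)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue
    # Permissive fallback: pull the first {...} span out of a prose-wrapped reply.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("model never produced parseable JSON")
```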
Models will edit the wrong file if you let them.
Give a small model a task that mentions a function name, a project map listing similar function names, and a request like "rename validateToken to verifyToken" (a real example from my testing). It might rename validateToken correctly. It might also rename validateUser, or modify a comment that mentions the function, or apply the rename to the wrong file entirely. The model treats the project map as suggestions, not constraints.
The fix is at the orchestration layer, not the prompt. Validate that file paths the model mentions actually exist. Validate that function names it claims to be operating on are actually in the files it claims they're in. Throw clear errors when there's a mismatch. Small models lie confidently and the agent has to not trust them.
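The checks are deliberately dumb. Something like (sketch):

```python
from pathlib import Path

def validate_edit_target(file_path: str, function_name: str, root: str = ".") -> None:
    """Reject the model's edit plan when the path or symbol doesn't exist."""
    path = Path(root) / file_path
    if not path.is_file():
        raise FileNotFoundError(f"model referenced a non-existent file: {file_path}")
    if function_name not in path.read_text():
        raise ValueError(f"{function_name} not found in {file_path}; model is guessing")
```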
Question vs action classification is harder than it sounds.
Asking "how many lines does utils.js have" should be a read-only operation. But if your executor only has one mode — edit this file — it will dutifully edit the file to contain the answer to your question, because the model interprets the request through the only action it knows.
The fix is having the planner classify requests into action types before any execution. Read-only queries route to a separate code path that never touches disk. Without this, a casual question can delete your file.
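The routing itself is tiny (sketch; the action names and handlers here are just illustrative placeholders):

```python
# Placeholder handlers; real ones would inspect or edit files.
def answer_query(request: str) -> str:
    return f"(read-only) would answer: {request}"

def apply_edit(request: str) -> str:
    return f"(write path) would edit for: {request}"

READ_ONLY_ACTIONS = {"question", "explain", "count"}

def execute(action_type: str, request: str) -> str:
    # Read-only requests never reach the code path that touches disk.
    if action_type in READ_ONLY_ACTIONS:
        return answer_query(request)
    if action_type == "edit":
        return apply_edit(request)
    raise ValueError(f"unknown action type: {action_type}")
```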
What works better than I expected
Token budget enforcement in code, before every call. Small models have no concept of context limits. If you trust them to be brief, they will not be brief. Counting tokens in your own code and refusing to send a too-large request is the only way to actually stay under the limit.
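The guard is a few lines (sketch; the 4-characters-per-token estimate is a crude heuristic, use the model's real tokenizer if you have one):

```python
MAX_CONTEXT_TOKENS = 8192   # whatever your model's window is
RESERVED_FOR_REPLY = 1024   # leave room for the answer

def estimate_tokens(text: str) -> int:
    # Rough approximation: ~4 characters per token for English-ish text.
    return len(text) // 4

def guard_request(prompt: str) -> None:
    budget = MAX_CONTEXT_TOKENS - RESERVED_FOR_REPLY
    used = estimate_tokens(prompt)
    if used > budget:
        raise ValueError(f"prompt is ~{used} tokens, budget is {budget}; trim it first")
```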
Per-file isolation. Sending one file at a time to the model is dramatically more reliable than sending two. Two files in the same call confuses small models surprisingly often. They mix up which fix goes where.
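In code this just means the loop goes file by file, never batched (sketch, same call_model stand-in as above):

```python
def fix_files(files: dict[str, str], task: str, call_model) -> dict[str, str]:
    # One file per call: the model never sees two files in the same context.
    results = {}
    for path, content in files.items():
        results[path] = call_model(f"Task: {task}\nFile: {path}\n\n{content}")
    return results
```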
Synthesis-style memory. Storing what the model did last time as a one-sentence summary, not the full task list, gives enough context for the model to handle "undo" and "also add X" requests on the next turn. Doesn't need to be sophisticated.
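Mine is basically this (sketch):

```python
class TurnMemory:
    """Store a one-sentence summary of the last turn, not the full task list."""

    def __init__(self) -> None:
        self.last_summary = ""

    def remember(self, summary: str) -> None:
        # e.g. "Renamed validateToken to verifyToken in auth.js"
        self.last_summary = summary

    def prompt_prefix(self) -> str:
        return f"Previous turn: {self.last_summary}\n" if self.last_summary else ""
```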
What I'm still figuring out
Whether any local model under 7B is actually viable for an agent role, or if 7B is the practical floor. I haven't found a smaller model that doesn't fail at structured output frequently enough to be unusable. Curious if anyone has had luck with smaller models fine-tuned specifically for tool use or JSON output.
I open sourced the test harness if anyone wants to look or contribute: github.com/razvanneculai/litecode
Any help is highly appreciated, and I would love any kind of feedback.
As a disclaimer: yes, I used AI to reformat some of my text, because English is not my first language. I think the information is interesting and it might help someone out.
ai_without_borders@reddit
the markdown and JSON failures at high context are the same problem: small models have weak 'instruction following at distance'. they can follow formatting rules when the instruction is fresh but the signal decays with token count. frontier models have enough capacity to maintain instruction state through long contexts; 8-14b models typically don't. two things that actually work: constrained decoding (llama.cpp grammar, vllm guided decoding) instead of prompting for format -- the sampler enforces it regardless of context length. for infinite loops: inject 'you already tried X and it failed' explicitly into the next turn. the model needs the failure history in-context, not just in your orchestration state.
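rough sketch with llama-cpp-python (untested; the grammar and model path are illustrative):

```python
from llama_cpp import Llama, LlamaGrammar

# tiny GBNF grammar: a flat JSON object with string keys and values.
# the sampler can only emit tokens the grammar allows, so format
# compliance no longer decays with context length.
GBNF = r'''
root   ::= "{" ws pair ("," ws pair)* ws "}"
pair   ::= ws string ws ":" ws string
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="model.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GBNF)
out = llm("Return the next action as JSON:", grammar=grammar, max_tokens=128)
print(out["choices"][0]["text"])
```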
BestSeaworthiness283@reddit (OP)
Thank you for the info!
Exact_Guarantee4695@reddit
yeah, this matches the annoying version of small-model agent work: the model can be smart enough to know the edit, but not reliable enough to package the edit. the biggest improvement i’ve seen is making the model produce intent plus a tiny patch plan, then having boring code own the actual file writes, fence stripping, path checks, and format validation. direct “write the final file” feels good in demos, but one bad fence or invented path turns the whole run into cleanup. do you track retry reasons by model, or just pass/fail per task?
BestSeaworthiness283@reddit (OP)
No, I don't track those, but it's a good idea.
synw_@reddit
About structured output for small models, I recommend using xml over json: it's easier for the model to manage, with fewer formatting rules. Using shots helps small models a lot to stay on track.
BestSeaworthiness283@reddit (OP)
I'm sorry, I'm not as informed, but what is a shot? Also thank you very very much for the recommendation.
Also, I remembered a post where people said something about AI understanding json the easiest. Is that true?
synw_@reddit
A shot is a user/assistant history turn. You provide several history turns with examples of well formed outputs, and the model will follow this pattern, making it easier for it to get it right.
About json: understanding and writing are different things. A small model can easily make formatting errors writing json; xml, being simpler, puts less pressure on the model and keeps its attention more focused. I've pushed Qwen 4b quite far with xml structured output.
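For example, two shots in the chat history before the real request (a sketch; the exact xml schema is up to you):

```python
# Few-shot history: fake user/assistant turns showing the exact xml shape,
# then the real request at the end.
messages = [
    {"role": "system", "content": "Reply with a single <action> block, nothing else."},
    # shot 1
    {"role": "user", "content": "rename foo to bar in utils.js"},
    {"role": "assistant",
     "content": "<action><type>rename</type><file>utils.js</file>"
                "<from>foo</from><to>bar</to></action>"},
    # shot 2
    {"role": "user", "content": "how many lines does app.js have"},
    {"role": "assistant",
     "content": "<action><type>read</type><file>app.js</file></action>"},
    # the real request
    {"role": "user", "content": "rename validateToken to verifyToken in auth.js"},
]
```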
BestSeaworthiness283@reddit (OP)
Thank you very very much for the xml idea.
And tell me if I got this right: so a shot is like an action I have asked it to do in the past?
IrfanZahoor_950@reddit
This is a good reminder that prompts aren’t the contract, the orchestration layer is. small models can be useful, but only if you validate paths, classify actions, check outputs, and never let the model directly decide what gets written to disk.
BestSeaworthiness283@reddit (OP)
Exactly, it's spot on!
Party-Log-1084@reddit
For me it's always the infinite loops. An 8b or 14b model will try to fix a bug, fail, and then just keep applying the exact same broken fix forever. Also they almost always lose the ability to output valid JSON for tool calls as soon as the context gets a bit crowded.
BestSeaworthiness283@reddit (OP)
Yes. That's why I made the memory, so it can see what it fixed last time; if it already made a fix, it's told to try a different one.