What actually breaks when you run a coding agent on small local models — notes from 3 weeks of testing

Posted by BestSeaworthiness283@reddit | LocalLLaMA

I've been building a CLI coding agent designed around the 8k context constraint, which means I've spent a lot of time running real tasks through small local models (qwen2.5-coder:7b, deepseek-coder:6.7b, codellama:7b) and small cloud models (Llama 3.3 70B on Groq, the :free-suffixed models on OpenRouter). Wanted to share what consistently breaks, because some of it surprised me.

Markdown fence pollution is the #1 failure across all small models.

Tell the model "output only raw code, no markdown formatting" in the system prompt. It will agree. It will also wrap the response in triple backticks anyway, especially when the request involves anything that looks like code being explained. Qwen2.5-Coder is the most consistent at following the instruction, but even it slips ~5% of the time. Deepseek-Coder is similar. Codellama:7b breaks this rule frequently enough that you basically have to assume the fences will be there.

The fix isn't better prompting. It's stripping fences in post-processing as a default. Every code-editing tool that uses small models needs this.
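A minimal sketch of what that post-processing step can look like (my own illustration, not code from the linked repo): strip exactly one wrapping fence, tolerate an optional language tag, and leave clean output alone.

```python
import re

def strip_fences(text: str) -> str:
    """Remove a single wrapping markdown code fence, if present.

    Handles an optional language tag after the opening ``` and
    leaves unfenced output untouched.
    """
    stripped = text.strip()
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```$", stripped, re.DOTALL)
    return match.group(1) if match else stripped
```

Running this unconditionally on every model response is cheap, and it means the 5% (or 50%) fence-pollution rate stops mattering.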

Structured output is hit or miss with anything under 7B.

If your agent needs the model to return JSON (task lists, action types, anything machine-parseable), small models fail at this far more often than benchmarks suggest. The benchmarks measure "did it produce valid JSON" — they don't measure "did it produce valid JSON when given a complex multi-step instruction with edge cases."

In my testing, Deepseek-Coder:6.7b is the most reliable for structured output among local models. Qwen2.5-Coder:7b is close. Codellama struggles. Among cloud free tiers, Llama 3.3 70B on Groq is rock solid; the smaller :free models on OpenRouter vary wildly by provider.

The practical workaround: validate the JSON, retry once with a more explicit "respond ONLY with valid JSON, no prose" instruction, then fall back to a permissive parser that extracts JSON from any wrapping text.
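That three-stage fallback can be sketched like this, assuming a `call_model(prompt) -> str` function standing in for whatever client you use (the helper name and retry wording are mine, not from the repo):

```python
import json
import re

def parse_json_response(call_model, prompt: str) -> dict:
    """Validate, retry once with a stricter instruction, then fall back
    to extracting the first JSON object from any surrounding prose."""
    raw = ""
    for attempt_prompt in (
        prompt,
        prompt + "\n\nRespond ONLY with valid JSON, no prose.",
    ):
        raw = call_model(attempt_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            pass
    # Permissive fallback: grab the outermost {...} from the last reply.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("model never produced parseable JSON")
```

The fallback regex is deliberately greedy so nested objects survive; it will still choke on genuinely malformed JSON, at which point failing loudly is the right behavior.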

Models will happily edit the wrong file if you let them.

Give a small model a task that mentions a function name, a project map listing similar function names, and a request like "rename validateToken to verifyToken." It might rename validateToken. It might also rename validateUser, or rename a comment that mentions the function, or add the rename to the wrong file entirely. The model treats the project map as suggestions, not constraints.

The fix is at the orchestration layer, not the prompt. Validate that file paths the model mentions actually exist. Validate that the function names it claims to be operating on are actually in the files it claims they're in. Throw clear errors when there's a mismatch instead of letting the edit through. Small models lie confidently; the agent has to not trust them.
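A sketch of that orchestration-layer check, under the assumption that the agent extracts a file path and a symbol name from the model's plan before applying anything (the function name is hypothetical):

```python
from pathlib import Path

def validate_edit(file_path: str, symbol: str, project_root: str = ".") -> None:
    """Refuse an edit when the model's claims don't match reality:
    the file must exist, and the symbol it says it is editing must
    actually appear in that file."""
    path = Path(project_root) / file_path
    if not path.is_file():
        raise FileNotFoundError(f"model referenced nonexistent file: {file_path}")
    if symbol not in path.read_text():
        raise ValueError(f"{symbol!r} not found in {file_path}; refusing edit")
```

A substring check is crude (it will pass on a comment mentioning the name), but even this level of distrust catches the "edited the wrong file entirely" class of failure outright.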

"Question vs action" is harder than it sounds.

Asking "how many lines does utils.js have?" should be a read-only operation. But if your agent's executor has only one mode — "edit this file" — it will dutifully edit the file to contain the answer to your question, because the model interprets the request through the only action it knows.

The fix is having the planner classify requests into action types before any execution. Read-only queries route to a separate code path that never touches disk. Without this, a casual question can corrupt your file.
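The routing itself is trivial once classification exists; the point is the hard separation. A sketch, with the planner's output reduced to an enum and hypothetical handlers (the "edit" branch is stubbed):

```python
from enum import Enum

class ActionType(Enum):
    READ_ONLY = "read_only"  # answer from file contents, never write
    EDIT = "edit"            # modify files on disk

def route(action: ActionType, path: str) -> str:
    """Dispatch after the planner has classified the request.

    The read-only branch opens files for reading only, so a
    misclassified casual question can at worst give a wrong answer,
    never a corrupted file.
    """
    if action is ActionType.READ_ONLY:
        with open(path) as f:
            return f"{path} has {len(f.readlines())} lines"
    raise NotImplementedError("edits go through the editor pipeline")
```

The classification step itself can be a cheap structured-output call to the same small model; the safety comes from the fact that the executor, not the model, decides which branch can write.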

What works better than I expected:

- Token budget enforcement in code, before every call. Small models don't have any concept of context limits. If you trust them to "be brief," they will not be brief. Counting tokens in your own code and refusing to send a too-large request is the only way to actually stay under the limit.

- Per-file isolation. Sending one file at a time to the model is dramatically more reliable than sending two. Two files in one call confuses small models surprisingly often — they'll mix up which fix goes where.

- Synthesis-style memory. Storing "what the model did last time" as a one-sentence summary (not the full task list) is enough context for the model to handle "undo" and "also add X" requests. Doesn't need to be sophisticated.
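The token budget point is the easiest of the three to get wrong by trusting the prompt. A sketch of enforcing it in code instead, using a rough chars/4 heuristic as a stand-in for a real tokenizer (swap in something like tiktoken for accuracy; the helper name and 8192 default are my assumptions):

```python
def enforce_budget(messages: list[str], max_tokens: int = 8192) -> int:
    """Estimate request size and refuse to send it if it would blow
    the context window. The check runs in code, before every call;
    asking the model to 'be brief' does nothing."""
    estimated = sum(len(m) for m in messages) // 4  # ~4 chars per token
    if estimated > max_tokens:
        raise ValueError(
            f"request ~{estimated} tokens, exceeds budget of {max_tokens}"
        )
    return estimated
```

Raising instead of silently truncating is deliberate: the caller (the orchestrator) then has to shrink the context explicitly, e.g. by dropping to per-file isolation.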

What I'm still figuring out:

Whether any local model under 7B is actually viable for an agent role, or if 7B is the practical floor. Every sub-7B model I've tried fails at structured output often enough to be unusable. Curious if anyone has had luck with the smaller fine-tunes.

Open-sourced the test harness if anyone wants to look or contribute:

https://github.com/razvanneculai/litecode