Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models
Posted by Creative-Regular6799@reddit | LocalLLaMA | View on Reddit | 41 comments
I spent the past week testing a simple question:
Small local models often look weak inside coding agents. But how much of that is actually model weakness, and how much is scaffold mismatch?
So I held the model fixed and changed only the scaffold.
Same Qwen3.5-9B Q4 weights in both conditions.
Same Aider Polyglot benchmark.
Full 225 exercises.
Results:
- vanilla Aider: 19.11%
- little-coder: 45.56% mean pass@2 across two full runs
little-coder is not a new model. It is a scaffold I adapted to the behavioral profile of a ~10B local model: bounded reasoning budget, a Write guard that refuses to overwrite existing files, explicit workspace discovery, and small per-turn skill injections instead of one huge static preamble.
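To give a flavor of the Write guard piece, the shape is roughly this (a simplified sketch, not the actual little-coder code; the function name and return messages are made up for illustration):

```python
from pathlib import Path

def guarded_write(path: str, content: str, allow_overwrite: bool = False) -> str:
    """Refuse to overwrite existing files unless explicitly allowed.

    Returns a message the agent loop feeds back to the model, so the
    model learns the file already exists instead of silently clobbering it.
    """
    target = Path(path)
    if target.exists() and not allow_overwrite:
        return (f"REFUSED: {path} already exists. "
                f"Read it first, then use an edit tool instead of Write.")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"Wrote {len(content)} bytes to {path}"
```

The point is that small models overwrite files far more often than big ones, so the guard turns a silent destructive action into a recoverable tool error.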
This is not a conference paper. There are obvious things a proper paper would still want:
- more replications
- component ablations
- more model families
- maybe a second benchmark
But the effect size was large enough that I thought it was worth sharing now (I don’t have time to do the above unfortunately).
My takeaway is fairly narrow:
at this scale, coding-agent benchmark results are not just properties of model weights. They are also properties of scaffold–model fit.
I suspect sub-10B local models may have been written off too early in coding-agent evaluation.
Full write-up, code, and numbers here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
Would be very interested in replication attempts, failure cases, or reasons you think this would not generalize.
_-_David@reddit
"This is not a conference paper."
"But"
I love the fuck out of this post.
Far-Low-4705@reddit
Don't use a reasoning budget; if it ever hits the budget, its performance is far worse than if you had just used instruct mode.
I'd suggest just leaving reasoning untouched and unbounded.
look@reddit
Ha. And this excerpt from an analysis that just finished running against Qwen 3.6 Plus:
DefNattyBoii@reddit
Do you have more info on this? I was using this in my conf.ini:
reasoning-budget = 4096
reasoning-budget-message = "...\n Considering the limited time by the user, I have to give the solution based on the thinking directly now."
look@reddit
The example above was from generating training data for a sequence classifier. Going from full thinking to none had an 8% accuracy drop on my test set. Giving it a truncating (no stop message) 256 budget recovered that 8%.
I’ve since run a larger test set at 256, 512, 1024, and full, and found it got to 99.9% with just 512. I’m now running the full dataset at 512.
This was a fairly specialized use, without a stop message at all, but I find that helps with more general tasks. The most important thing I’ve found with the stop message (for small Qwen 3.5 models at least) is to add a newline at the end of your message.
The rest of the message itself doesn’t seem to matter all that much, but the newline had a significant impact. I use something like this:
(I’ll look up my exact message later. On my phone at the moment.)
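The truncate-then-conclude flow I'm describing can be sketched like this (a hypothetical helper, not llama.cpp's built-in budget feature; `generate` stands in for any completion API, and the stop message wording is illustrative):

```python
def budgeted_think(generate, prompt: str, budget: int) -> str:
    """Run reasoning with a hard token budget, then force the answer.

    `generate(text, max_tokens, stop)` is any completion function
    (e.g. a wrapper around a llama.cpp /completion call).
    """
    stop_message = "\nConsidering the limited time, I will answer directly now.\n"
    # Let the model think, but hard-cap the number of generated tokens.
    thinking = generate(prompt + "<think>\n", max_tokens=budget,
                       stop=["</think>"])
    # Truncated or not, close the thinking block with the stop message;
    # the trailing newline cleanly ends a possibly mid-sentence thought.
    prefix = prompt + "<think>\n" + thinking + stop_message + "</think>\n"
    return generate(prefix, max_tokens=1024, stop=None)
```

The trailing newline on the stop message is the detail that mattered most in my tests.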
Ell2509@reddit
This work is very useful. Thanks for sharing.
look@reddit
Hmm. Is there data to back that up? Mine is anecdotal, but I see improved performance on Qwen3.5 0.8b with reasoning but a small budget that it nearly always hits.
Far-Low-4705@reddit
yes, if you look at the PR in llama.cpp for the reasoning-budget feature, they did performance benchmarks and it absolutely tanked reasoning performance.
look@reddit
I’m familiar with the results mentioned in https://github.com/ggml-org/llama.cpp/issues/20632 but that is about graceful termination with budgets vs the truncated termination.
And I use truncated termination in an application on Qwen3.5, and it definitely benefits from a short, truncated reasoning over no reasoning at all. My case might be an exception, but I doubt it is that rare.
I did find that the message you inject at the end matters a great deal, though. I’d not be shocked if the other results you’ve seen were using an ineffective conclusion message.
Far-Low-4705@reddit
I still think it's a sign that this is a hacky solution: suboptimal at best, and it can result in unexpected behaviors.
At the very least, it’s going to completely mess with tool calling.
Best to just use it as it was natively trained imo
look@reddit
That’s fair. I just use truncation with LLM-as-classifier type applications, not any agent application that would be tool calling. The model is more like the tool I am using in that scenario, and I’m often just reading off the first-token logprobs directly, not even the actual output text.
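For reference, the logprob-reading part has roughly this shape (a sketch; the dict format stands in for whatever top-logprobs structure your server returns, and the label tokens are examples):

```python
import math

def classify_from_logprobs(top_logprobs: dict[str, float],
                           labels: tuple[str, ...] = ("yes", "no")) -> str:
    """Pick a label by comparing first-token logprobs.

    `top_logprobs` maps candidate first tokens to their logprobs, as
    returned per position by llama.cpp / OpenAI-style APIs. Labels not
    present in the top-k are treated as impossibly unlikely.
    """
    scores = {lab: top_logprobs.get(lab, -math.inf) for lab in labels}
    return max(scores, key=scores.get)
```

No output text ever gets parsed; the first token's distribution is the classification.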
metmelo@reddit
Great job! I wonder why people don't optimize more harnesses for small models.
vatta-kai@reddit
I’m building one! A browser agent with a custom-built scaffold that can work with small local models. I tested against Llama 4 Scout 17B 16E (old, I know) and even much smaller ones like Gemma E4B. It needs refinement, but it consistently performs well even on complex tasks at a fraction of the cost.
I sincerely believe local models with custom scaffolding will be very very useful.
ArtfulGenie69@reddit
It's more frustrating hehe
thrownawaymane@reddit
How robust is the non-Ollama support? I'd wager most who are going to try this out or contribute to the project are running something more robust
Creative-Regular6799@reddit (OP)
Just added llama.cpp support! Thanks again for the tip
TitwitMuffbiscuit@reddit
Using llama.cpp on Windows, I don't get the right context.
Little-coder shows:
I think the relevant code is https://github.com/ggml-org/llama.cpp/blob/master/tools/server/server-context.cpp
Anyway:
/props gives the max context as set by -c, but also exposes a whole lot more, like the whole Jinja chat template.
/slots also shows the max context but it's the current state:
Hope this is helpful.
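If it helps, here's a minimal sketch of reading both endpoints from Python (the key names reflect recent llama.cpp server builds and can change between versions, so treat them as assumptions):

```python
import json
import urllib.request

def fetch_json(url: str):
    """GET a JSON endpoint, e.g. the llama.cpp server's /props or /slots."""
    with urllib.request.urlopen(url) as r:
        return json.loads(r.read())

def extract_ctx(props: dict, slots: list) -> dict:
    """Pull context sizes out of /props and /slots responses.

    /props holds static server config (including the chat template);
    /slots holds live per-slot state.
    """
    return {
        "max_ctx": props.get("default_generation_settings", {}).get("n_ctx"),
        "slot_ctx": [s.get("n_ctx") for s in slots],
    }
```

Usage would be something like `extract_ctx(fetch_json(base + "/props"), fetch_json(base + "/slots"))` against a running server.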
Creative-Regular6799@reddit (OP)
Unfortunately I only wrote it for Ollama, but I can add support for others as well
swfsql@reddit
Cool discovery! Perhaps when a turn ends, you could remove the previous turn's skill injection, even if this means doing a little prefill? This should save context and help the model not focus on things that should no longer be important.
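Something like this, roughly (the `"skill"` tag convention here is invented for the sketch, not little-coder's actual message format):

```python
def prune_skill_injections(messages: list[dict]) -> list[dict]:
    """Drop skill-injection messages from earlier turns, keeping only
    the most recent one.

    Assumes each injection was tagged with a "skill" flag when appended.
    Removing old ones changes the prompt prefix, so the server has to
    re-prefill from the first changed position — that's the cost paid
    for the smaller, more focused context.
    """
    skill_idx = [i for i, m in enumerate(messages) if m.get("skill")]
    stale = set(skill_idx[:-1])  # everything but the latest injection
    return [m for i, m in enumerate(messages) if i not in stale]
```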
Creative-Regular6799@reddit (OP)
That is a cool idea! Will try it out during the weekend (you can fork and try yourself if you get to it before me)
swfsql@reddit
Thanks, please let me know if you manage to test this.
I apologize but I don't have enough total RAM to run this model, not even a Q3 variant.
I was thinking back to this, and I think "erasing past cache" from the Gated Delta Net states may not be as easy as it is for attention. In theory it is possible to "reverse-forward" and recover previous states, but you'd most likely need to back up the state you intend to "roll back into" (restore), i.e. make a restoration point for the GDN states before injecting something that is intended to be evicted; only then can you "move the clean states forward" with the prefill.
Creative-Regular6799@reddit (OP)
No need to apologize at all! Will try it out. BTW, I ran little-coder with an extremely small model (9B parameters, <8GB RAM), so maybe it will fit your hardware?
New_Comfortable7240@reddit
So I ran the Aider benchmark limited to C++ with Qwen3.5 35B and indeed got better numbers
Creative-Regular6799@reddit (OP)
Now running it with qwen3.6 35B, very curious to see the results
New_Comfortable7240@reddit
Well, in my case it went well with Qwen 3.6 35B.
I tweaked some of the options a bit and got 21/26 in C++.
here is my llama.cpp script if useful
Tailored to my 3060, I got 30-40 tps (tg). The only downside is TTFT, around 20s, but only the first time a session starts (from llama.cpp's point of view, all the activity by little-coder is one session); after that it works really well.
Creative-Regular6799@reddit (OP)
That’s a shocker! I wonder if more models benefit from this kind of coding agent
rarogcmex@reddit
Have you tried any bigger models with little-coder (the special scaffold)? Is there less of a difference?
Creative-Regular6799@reddit (OP)
I thought about it, and it might be that I am onto a secret sauce here (though very unlikely). Honestly just didn’t have time to test it yet. Will try to get to it by the end of the week if nobody else tries before that
Taenk@reddit
This tracks with newer research showing that the harness may matter more than the model itself, or rather that the harness explains more variance in performance than model choice.
Have you compared the performance of larger or even frontier models in your harness vs vanilla harnesses? I’m curious whether and how much larger models benefit from more "sophisticated" harnesses, or whether they just benefit from more breathing room.
More generally, I've noticed halfway decent prompting really levels up smaller models. I haven't benchmarked specific skill files though; there is conflicting data on their effectiveness.
Creative-Regular6799@reddit (OP)
Thank you for the comment. I didn’t test it with larger models yet, that is a natural next step
fragment_me@reddit
Do I understand it right that you used two different temperature settings, one for your little-coder and the other for the regular setup? If so, doesn't that skew results?
Creative-Regular6799@reddit (OP)
That’s a great question, and my answer is that it might, although I observed no qualitative difference.
I initially ran Aider with the same temperature of 0.3 that I have set in little-coder, and it degraded performance (not on the Polyglot benchmark, but on my own examples and experimentation). I figured it wouldn’t be fair to change Aider’s configuration and then test it, so I accepted the difference in temperature.
Another example of this: I found that for the Aider baseline, litellm times out and resets if the response takes too long, so I made the timeout longer; that way I won’t count these as Aider failures for no good reason.
So yes, the difference in temperature really is there, but I found it would be less of a confound to leave the temperatures as they are
jadbox@reddit
How about against OpenCode?
Creative-Regular6799@reddit (OP)
Great question. I can put it against that as well
dtdisapointingresult@reddit
Impressive, very nice.
Any chance you could try it with https://huggingface.co/agentscope-ai/QwenPaw-Flash-9B so we have a comparison? It's a finetune of Qwen 3.5 9B by a different Alibaba team (the ones making their OpenClaw-style assistant QwenPaw), designed for better agentic performance.
Ok-Measurement-1575@reddit
Nice. Where's the github?
SadBBTumblrPizza@reddit
Nobody clicks links anymore, do they? It's at the bottom of the article.
lannistersstark@reddit
Then you have to click through to the article first.
thrownawaymane@reddit
https://github.com/itayinbarr/little-coder/tree/main
SourceCodeplz@reddit
Great write-up! As it happens, I am actually working on a coding agent, and this was really helpful and encouraging!
tett_works@reddit
Very impressive results! This approach makes so much sense that I wouldn't be surprised if the big AI companies already discovered it internally, but kept it quiet to keep everyone dependent on their larger, more expensive models.