How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)

Posted by HomoAgens1@reddit | LocalLLaMA | 3 comments

I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends."

It does depend. So let me split it into two jobs:

(a) Heavy one-shot generation — write a 400-line module, refactor a big file. That wants a big model, no argument. In my setup I route this to a dedicated coding model; I don't ask the loop model to do it.

(b) The orchestration loop itself — read this, decide which tool, call it with the right arguments, look at the result, react. This post is only about (b).

For (b): how small can that model get before the loop stops being trustworthy? My balance point right now is Qwen3.6-35B-A3B (MoE, ~3B active) — the lightest setup where the loop holds up, still fine on a 12GB card with 30 expert offload (running 40 t/s prompt gen). Below that it degrades, and I've been trying to pin down what degrades first.

It isn't reasoning. It's tool-call discipline. The model gets the intent right and then botches the call. Examples from smaller models I tested:

passes overwrite=true to an append_file tool that has no such parameter
calls grep_search with an output_mode arg that doesn't exist — it generalized it from a different tool
tries to invoke a conclusion "tool" that was never a tool, because finishing the task feels like an action
passes overwrite again to yet another tool, having "learned" the wrong lesson from an earlier call

Over-generalized or invented parameters. The 35B-A3B does this rarely; small dense models do it constantly.
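A minimal sketch of catching these failures before a tool runs, assuming the tools are plain Python functions. The tool stubs, the `TOOLS` registry, and `check_call` are hypothetical names for illustration, not taken from the repo. The error text names the valid parameters, so the model gets something specific to correct against.

```python
import inspect
from typing import Optional

def append_file(path: str, text: str) -> str:
    """Hypothetical tool: append text to a file (stubbed for illustration)."""
    return f"appended {len(text)} chars to {path}"

def grep_search(pattern: str, path: str = ".") -> str:
    """Hypothetical tool: search for a pattern (stubbed for illustration)."""
    return f"searched {path} for {pattern!r}"

# Registry of declared tools, keyed by function name.
TOOLS = {fn.__name__: fn for fn in (append_file, grep_search)}

def check_call(name: str, args: dict) -> Optional[str]:
    """Return None if the call matches a declared tool, else an error message."""
    fn = TOOLS.get(name)
    if fn is None:
        # Covers invented tools such as "conclusion".
        return f"Unknown tool {name!r}. Available tools: {', '.join(sorted(TOOLS))}."
    params = inspect.signature(fn).parameters
    unknown = sorted(set(args) - set(params))
    if unknown:
        # Covers invented or over-generalized parameters such as "overwrite".
        return (f"{name} has no parameter(s) {', '.join(unknown)}. "
                f"Valid parameters: {', '.join(params)}.")
    missing = [p for p, spec in params.items()
               if spec.default is inspect.Parameter.empty and p not in args]
    if missing:
        return f"{name} is missing required parameter(s): {', '.join(missing)}."
    return None
```

Feeding the returned message back as the observation, instead of a raw exception, is one way to keep the correction specific to the mistake.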

Two things I tried to push the floor lower:

Exposing the exact tool signature in the system prompt — generated tool_name(arg1, arg2, opt=default) straight from the function, next to each tool, so the model sees the precise parameter list and, by omission, which parameters do NOT exist. Subjectively it helped a lot; not measured rigorously yet.
Repetition watchdogs — small models get stuck repeating the same failing (tool, args) call while the observation keeps erroring; their model of the state has drifted. I fingerprint recent actions and inject a "stop, change strategy" hint after N identical failures. Works, but it's a band-aid.
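Both ideas above can be sketched in a few lines, under some assumptions: the tools are Python functions, "N identical failures" means N consecutive failures with the same (tool, args) fingerprint, and any success resets the count. `render_signature`, `RepetitionWatchdog`, and the `grep_search` stub are hypothetical names for illustration, not the repo's actual code.

```python
import inspect
import json
from collections import deque

def render_signature(fn) -> str:
    """Render tool_name(arg1, arg2, opt=default) from a function's signature."""
    parts = []
    for name, param in inspect.signature(fn).parameters.items():
        if param.default is inspect.Parameter.empty:
            parts.append(name)                        # required argument
        else:
            parts.append(f"{name}={param.default!r}")  # optional, with its default
    return f"{fn.__name__}({', '.join(parts)})"

def grep_search(pattern, path=".", max_results=50):
    """Hypothetical tool, present only to demonstrate rendering."""

HINT = "Stop: this exact call keeps failing. Change strategy or arguments."

class RepetitionWatchdog:
    """Flags N consecutive identical failing (tool, args) calls."""

    def __init__(self, max_repeats: int = 3, window: int = 8):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # window should be >= max_repeats

    def observe(self, tool: str, args: dict, failed: bool):
        """Record one step; return a hint string when stuck, else None."""
        if not failed:
            self.recent.clear()  # assumption: a success resets the streak
            return None
        # Canonical JSON so argument order does not change the fingerprint.
        fingerprint = (tool, json.dumps(args, sort_keys=True, default=str))
        self.recent.append(fingerprint)
        streak = 0
        for seen in reversed(self.recent):
            if seen != fingerprint:
                break
            streak += 1
        if streak >= self.max_repeats:
            self.recent.clear()  # avoid repeating the hint on every later step
            return HINT
        return None
```

For the stub above, `render_signature(grep_search)` gives `grep_search(pattern, path='.', max_results=50)`, which is the line that would sit next to the tool's description in the system prompt.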

What I'm after:

For the orchestration role specifically — smallest model you actually trust in a loop?
Is tool-call discipline the first thing that breaks for you too, or does something else go first?
Better ways to make small models viable here — stricter tool schemas, light fine-tuning?

Repo's here if useful — still rough: https://github.com/homoagens/pragma

You can probably go smaller than people think — if you fix tool-call discipline instead of just reaching for a bigger model.

ikkiho@reddit

Tool-call discipline broke first for me too. The thing that helped most: grammar-constrained decoding pointed at the actual JSON schema (llama.cpp --grammar with JSON-schema-to-GBNF). Invented args can't leave the decoder, so 'overwrite=true on every tool' just stops. It doesn't fix the 'invents a conclusion tool that was never declared' problem, though; that one feels like it needs an extra eval step or just a smaller tool surface.
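A rough sketch of the kind of schema this implies, assuming a tool call is emitted as a JSON object with `tool` and `args` fields. That shape, the tool declarations, and `tool_call_schema` are all hypothetical. Each branch pins one tool name and sets `additionalProperties` to false, so undeclared arguments are not valid under the schema. Recent llama.cpp builds are understood to accept a JSON schema directly (a `json_schema` request field on the server, `--json-schema` on the CLI) as well as a converted GBNF grammar; verify against your version.

```python
import json

# Hypothetical tool declarations: parameter schemas and required names.
TOOL_PARAMS = {
    "append_file": {"path": {"type": "string"}, "text": {"type": "string"}},
    "grep_search": {"pattern": {"type": "string"}, "path": {"type": "string"}},
}
REQUIRED = {"append_file": ["path", "text"], "grep_search": ["pattern"]}

def tool_call_schema(tool_params: dict, required: dict) -> dict:
    """Build one schema branch per tool; only declared args are allowed."""
    branches = []
    for name, props in tool_params.items():
        branches.append({
            "type": "object",
            "properties": {
                "tool": {"const": name},            # tool name pinned per branch
                "args": {
                    "type": "object",
                    "properties": props,
                    "required": required.get(name, []),
                    "additionalProperties": False,  # rejects invented args
                },
            },
            "required": ["tool", "args"],
            "additionalProperties": False,
        })
    return {"oneOf": branches}

schema = tool_call_schema(TOOL_PARAMS, REQUIRED)
schema_json = json.dumps(schema)  # string form, ready to hand to the backend
```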

Legal-Pop-1330@reddit

I noticed the same issues you mention and also stumbled across similar ideas. This ultimately led to a very-much-WIP https://github.com/rekursiv-ai/sagent

One of the features it has -- which seems to help -- is to use bashlexer to analyze the command and suggest a better one. Making the response _specific to the failure mode_ seems to help mitigate similar mistakes later.

synw_@reddit

Qwen3.6-35B-A3B does a good job at orchestrating things for me too. What small models did you test exactly? Qwen 4b may be able to do the job if well prompted with a small prompt, but one crucial thing with small models is to limit the number of tools: they get confused very quickly when there are many.

Prefer a limited set of tools, use skills, or use tool routing. I have a concept of tool-router tasks: the main model sees only one tool and calls it with what it wants; the request is passed to a second model that has several tools and focuses only on picking the right one; the chosen tool is then executed and its result is passed directly back to the main model as its own tool call result.
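One possible shape for that router, as a sketch only. The backend tools are stubs, and `fake_router_model` stands in for the second model with keyword matching instead of an LLM call. Every name here is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def read_file(path: str) -> str:
    """Hypothetical backend tool (stub)."""
    return f"<contents of {path}>"

def list_dir(path: str) -> str:
    """Hypothetical backend tool (stub)."""
    return f"<entries in {path}>"

BACKEND_TOOLS: Dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "list_dir": list_dir,
}

# The router model maps (request, tool names) to (chosen tool, arguments).
RouterModel = Callable[[str, List[str]], Tuple[str, dict]]

def make_router_tool(router_model: RouterModel,
                     tools: Dict[str, Callable[..., str]]) -> Callable[[str], str]:
    """Build the single tool that the main model is allowed to see."""
    def do(request: str) -> str:
        name, args = router_model(request, sorted(tools))
        if name not in tools:
            return f"Router picked unknown tool {name!r}; rephrase the request."
        # The result goes straight back to the main model as its tool result.
        return tools[name](**args)
    return do

def fake_router_model(request: str, tool_names: List[str]) -> Tuple[str, dict]:
    """Stand-in for the second model: keyword matching, not a real LLM."""
    if "list" in request.lower():
        return "list_dir", {"path": "."}
    return "read_file", {"path": "README.md"}

do = make_router_tool(fake_router_model, BACKEND_TOOLS)
```

The main model's prompt then declares only `do(request)`, so the tool surface it has to keep straight stays at one signature however many backend tools exist.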