How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)

Posted by HomoAgens1@reddit | LocalLLaMA | View on Reddit | 3 comments

I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends."

It does depend. So let me split it into two jobs:

(a) Heavy one-shot generation — write a 400-line module, refactor a big file. That wants a big model, no argument. In my setup I route this to a dedicated coding model; I don't ask the loop model to do it.

(b) The orchestration loop itself — read this, decide which tool, call it with the right arguments, look at the result, react. This post is only about (b).

For (b): how small can that model get before the loop stops being trustworthy? My balance point right now is Qwen3.6-35B-A3B (MoE, ~3B active) — the lightest setup where the loop holds up, still fine on a 12GB card with 30 expert offload (running 40 t/s prompt gen). Below that it degrades, and I've been trying to pin down what degrades first.

It isn't reasoning. It's tool-call discipline. The model gets the intent right and then botches the call. Examples from smaller models I tested:

Over-generalized or invented parameters. The 35B-A3B does this rarely; small dense models do it constantly.

Two things I tried to push the floor lower:

  1. Exposing the exact tool signature in the system prompt — generated tool_name(arg1, arg2, opt=default) straight from the function, next to each tool, so the model sees the precise parameter list and, by omission, which parameters do NOT exist. Subjectively it helped a lot; not measured rigorously yet.

  2. Repetition watchdogs — small models get stuck repeating the same failing (tool, args) call while the observation keeps erroring; their model of the state has drifted. I fingerprint recent actions and inject a "stop, change strategy" hint after N identical failures. Works, but it's a band-aid.

What I'm after:

Repo's here if useful — still rough: https://github.com/homoagens/pragma

You can probably go smaller than people think — if you fix tool-call discipline instead of just reaching for a bigger model.