Reviewer agent on local Qwen 3 8B, architect on DeepSeek thinking model: per-agent ledger from a TS pipeline (M1 16GB)

Posted by JackChen02@reddit | LocalLLaMA | 5 comments

I built a 3-agent TypeScript pipeline (architect → developer → reviewer) where the reviewer runs locally and the other two run on a cheap cloud provider. Posting the per-agent ledger because I haven't seen this exact shape laid out for the TS side.

Quick honest framing: the run below uses DeepSeek for the two cloud agents because I'm on a budget and don't have direct API keys for Anthropic or OpenAI. The same setup runs on those providers too, just different provider + model strings. That's the model-agnostic point.

The framework I used is open-multi-agent, TypeScript-native. (I'll drop the repo link in a comment so this post stays link-clean.)

Each agent declares its own provider + model + baseURL. Cloud and local sit in one team config, one field per agent:

const architect = {
  name: 'architect',
  provider: 'deepseek',
  model: 'deepseek-reasoner',           // thinking mode for design work
  temperature: 0.2,
  systemPrompt: '...',
}

const developer = {
  name: 'developer',
  provider: 'deepseek',
  model: 'deepseek-chat',               // non-thinking, cheaper
  temperature: 0.2,
  systemPrompt: '...',
}

const reviewer = {
  name: 'reviewer',
  provider: 'openai',                   // reuse openai adapter
  model: 'qwen3:8b',
  baseURL: 'http://localhost:11434/v1', // Ollama OpenAI-compat endpoint
  apiKey: 'ollama',                     // SDK validates non-empty, server ignores
  temperature: 0.1,
  systemPrompt: 'You are a code reviewer. Flag bugs, edge cases the developer missed, and questionable abstractions. Keep your review under 200 words.',
}

The OpenAI SDK validates that apiKey is non-empty even when the local server ignores the value. Pass a placeholder. People keep tripping on this.
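A minimal sketch of that placeholder rule, assuming a hypothetical `AgentConfig` shape modelled on the configs above. The helpers `isLocalEndpoint` and `resolveApiKey` are illustrative only and are not part of open-multi-agent or the OpenAI SDK:

```typescript
// Hypothetical config shape mirroring the fields used in the post.
interface AgentConfig {
  name: string;
  provider: string;
  model: string;
  baseURL?: string;
  apiKey?: string;
}

// Assumption: "local" means the endpoint host is localhost or 127.0.0.1.
function isLocalEndpoint(baseURL?: string): boolean {
  if (!baseURL) return false;
  const host = new URL(baseURL).hostname;
  return host === 'localhost' || host === '127.0.0.1';
}

// Always return a non-empty key: the configured one if present,
// a placeholder for local servers, otherwise fail loudly.
function resolveApiKey(cfg: AgentConfig): string {
  if (cfg.apiKey && cfg.apiKey.trim() !== '') return cfg.apiKey;
  if (isLocalEndpoint(cfg.baseURL)) return 'ollama'; // placeholder, the local server ignores it
  throw new Error(`Agent "${cfg.name}" needs an apiKey for provider ${cfg.provider}`);
}
```

With a helper like this, a reviewer config that omits `apiKey` still gets a non-empty value, while a cloud agent with a missing key fails at startup instead of mid-run.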

I used orchestrator.runTasks(team, [...]) with an explicit DAG (architect → developer → reviewer) instead of runTeam(goal). Reason: the goal-driven path lets the coordinator decide which agents to skip, and in my testing it kept routing the review work to the developer instead of dispatching to the reviewer agent. If you want the local reviewer to actually run, runTasks is the reliable path. The framework supports both modes.
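To show what "explicit DAG" buys you, here is a rough sketch of dependency-ordered scheduling. The `Task` shape and `executionOrder` function are hypothetical and are not the open-multi-agent API; the point is only that every declared task appears in the order, so nothing can be silently skipped:

```typescript
// Hypothetical task shape: each task names its agent and its prerequisites.
interface Task {
  id: string;
  agent: string;
  dependsOn: string[];
}

// Kahn's algorithm: repeatedly emit tasks whose dependencies are all done.
function executionOrder(tasks: Task[]): string[] {
  const remaining = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const t of tasks) {
    remaining.set(t.id, t.dependsOn.length);
    for (const dep of t.dependsOn) {
      dependents.set(dep, [...(dependents.get(dep) ?? []), t.id]);
    }
  }
  const ready = tasks.filter((t) => t.dependsOn.length === 0).map((t) => t.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const left = (remaining.get(next) ?? 0) - 1;
      remaining.set(next, left);
      if (left === 0) ready.push(next);
    }
  }
  // A shorter order means a cycle or a dependency on an undeclared task.
  if (order.length !== tasks.length) throw new Error('cycle or unknown dependency');
  return order;
}

const pipeline: Task[] = [
  { id: 'review', agent: 'reviewer', dependsOn: ['implement'] },
  { id: 'design', agent: 'architect', dependsOn: [] },
  { id: 'implement', agent: 'developer', dependsOn: ['design'] },
];

console.log(executionOrder(pipeline)); // [ 'design', 'implement', 'review' ]
```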

Per-agent ledger (single run, single workload, M1 / 16GB unified memory, Ollama 0.20.2):

agent       | model              | latency  | tokens in/out  | cost
------------+--------------------+----------+----------------+--------
architect   | deepseek-reasoner  |    25.3s |   1612/  2450  | $0.0009
developer   | deepseek-chat      |    68.1s | 108219/ 10408  | $0.0181
reviewer    | qwen3:8b           |   208.5s |   1432/   696  | $0 (local)

Grand total: $0.0190 USD
Wall total : 5:03

Pricing snapshot 2026-05-22: deepseek-chat / deepseek-reasoner both at $0.14 / $0.28 per MTok input/output.
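
The cost column can be reproduced from the token counts and the pricing snapshot. A small sketch (the function and constant names are mine, not the framework's):

```typescript
// USD per million tokens, taken from the pricing snapshot above.
const PRICE_PER_MTOK = { input: 0.14, output: 0.28 };

function costUSD(tokensIn: number, tokensOut: number): number {
  return (tokensIn * PRICE_PER_MTOK.input + tokensOut * PRICE_PER_MTOK.output) / 1e6;
}

const architectCost = costUSD(1612, 2450);
const developerCost = costUSD(108219, 10408);

console.log(architectCost.toFixed(4)); // 0.0009
console.log(developerCost.toFixed(4)); // 0.0181
console.log((architectCost + developerCost).toFixed(4)); // 0.0190
```

The developer dominates the bill because of its 108K input tokens; the reviewer adds nothing since it runs locally.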

A few honest observations for this sub:

1. The reviewer is the slowest agent. 208s on M1 16GB for ~1.4K input + ~700 output through qwen3:8b (Q4 default quantization on Ollama). For comparison, the cloud architect took 25s on a similar shape (1.6K in, 2.4K out), and the developer took 68s on a much larger 108K-token input. The local-reviewer trade is "zero marginal cloud cost vs minutes of wall time": fine for async/batch workflows, wrong for anyone polling on a request.

2. Don't reach for the biggest local model you can pull. Also: turn thinking off for review-shaped work. I tried the same DAG with qwen3.5:9b-mlx (MLX GPU-accelerated, 8.9GB, thinking mode default ON) as reviewer. It hit 1347s (22 minutes) for the reviewer alone. Then I re-ran the same DAG with /no_think in the reviewer task description (Qwen 3 series workaround for attenuating thinking). Reviewer dropped to 554s (9 minutes), a raw 2.4x speedup, even though the developer fed it 82% more input that run (38K tokens vs 21K). Output tokens dropped 64% (1566 to 568), confirming thinking content was the bulk of generation. Net of input variance, I estimate the pure thinking-mode effect on this M1 16GB at 4-5x reviewer latency.

Caveat on /no_think: it only partially attenuates thinking. The real switch is Ollama's native think: false flag, but as far as I can tell that flag gets ignored through an OpenAI-compatible endpoint (the OpenAI schema has no such field). Fully off needs Ollama's native /api/chat with think: false, which runs the same prompt in 6.7s. Takeaway: match the local model and thinking setting to the actual task, not the biggest thing you can pull.

3. Choosing runTasks over runTeam(goal) matters when local agents are involved. If you let the coordinator decide and it skips your local agent, you lose the cost saving and the architectural intent. Explicit DAG is the right default when one role is intentionally pinned to local.
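
For the caveat under point 2, this is roughly what a native request with thinking disabled could look like. It is a sketch: `buildReviewRequest` and the interface names are hypothetical, and it bypasses the framework's OpenAI adapter entirely. The `think` and `stream` fields are the ones Ollama's native /api/chat accepts:

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Subset of the native /api/chat request body used here.
interface OllamaChatRequest {
  model: string;
  messages: ChatMessage[];
  think: boolean;
  stream: boolean;
}

// Build a review request with thinking switched off.
function buildReviewRequest(model: string, systemPrompt: string, code: string): OllamaChatRequest {
  return {
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: code },
    ],
    think: false, // native flag; not part of the OpenAI chat schema
    stream: false, // ask for one JSON response instead of a stream
  };
}

// Usage (needs a running Ollama server, so it is left commented out):
// const res = await fetch('http://localhost:11434/api/chat', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildReviewRequest('qwen3:8b', 'You are a code reviewer.', source)),
// });
// const review = (await res.json()).message.content;
```

The downside is the one noted above: the reviewer call no longer goes through the shared adapter, so the per-agent ledger has to be recorded separately for it.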

Curious what reviewer-grade local models people are running. Specifically interested in:

- 7B/8B non-thinking vs 9B+ thinking for code-review tasks (my data: 4-5x latency from thinking mode net of input variance)
- whether anyone's gotten Ollama's think: false through an OpenAI-compatible client without bypassing the framework. I might be wrong about that limitation

Drop your setup if you've solved this.

Conscious_Chapter_93@reddit

This is the right shape — per-agent ledger is what makes a multi-agent setup actually debuggable. The thing we'd add: keep the ledger at the tool layer, not the agent layer. Same idea (what was approved, what ran, what was blocked) but attached to the call, not the chat. Means the ledger survives when the developer agent gets swapped for a different one, and you can answer 'which tool actually did X' without digging through three conversation histories. Same five fields work: source, observed_at, write_reason, last_used_at, dependency ids. Ledger on the wire, not the prompt.

fmlitscometothis@reddit

What is your experience with using a "less intelligent" agent for review? I'd hate to have a junior review my code at work :)

JackChen02@reddit (OP)

the 8B's the junior reviewer, not the senior one. you wouldn't merge on its say-so, but having it clear the obvious stuff first takes load off whoever does the real review.

ComplexType568@reddit

not sure if this is your solution... But

Coding model that's quick and smart under 10b = Qwen3.5 9B or 4B if you want to play risky.

Ollama weird quirks? That's... Common. Go with llama.cpp, if you feel like the UI (or lack thereof) is too complicated, try hexllama, LM Studio (I don't recommend because it's not as customizable) or Catapult.

JackChen02@reddit (OP)

yeah 9b's worth a shot. I ran 8b here and only tested 9b in its mlx + thinking form, which blew up to ~22min on the review step (M1). /no_think only pulled it back partway, haven't benchmarked non-thinking 9b yet. Ollama's definitely been quirky too, haven't tried llama.cpp. catapult's new to me, will look.