Reviewer agent on local Qwen 3 8B, architect on DeepSeek thinking model: per-agent ledger from a TS pipeline (M1 16GB)

Posted by JackChen02 in LocalLLaMA

I built a 3-agent TypeScript pipeline (architect → developer → reviewer) where the reviewer runs locally and the other two run on a cheap cloud provider. Posting the per-agent ledger because I haven't seen this exact shape laid out for the TS side.

Quick honest framing: the run below uses DeepSeek for the two cloud agents because I'm on a budget and don't have direct API keys for Anthropic or OpenAI. The same setup runs on those providers too, just different provider + model strings. That's the model-agnostic point.

The framework I used is open-multi-agent, TypeScript-native. (I'll drop the repo link in a comment so this post stays link-clean.)

Each agent declares its own provider + model + baseURL. Cloud and local sit in one team config, one field per agent:

const architect = {
  name: 'architect',
  provider: 'deepseek',
  model: 'deepseek-reasoner',           // thinking mode for design work
  temperature: 0.2,
  systemPrompt: '...',
}

const developer = {
  name: 'developer',
  provider: 'deepseek',
  model: 'deepseek-chat',               // non-thinking, cheaper
  temperature: 0.2,
  systemPrompt: '...',
}

const reviewer = {
  name: 'reviewer',
  provider: 'openai',                   // reuse openai adapter
  model: 'qwen3:8b',
  baseURL: 'http://localhost:11434/v1', // Ollama OpenAI-compat endpoint
  apiKey: 'ollama',                     // SDK validates non-empty, server ignores
  temperature: 0.1,
  systemPrompt: 'You are a code reviewer. Flag bugs, edge cases the developer missed, and questionable abstractions. Keep your review under 200 words.',
}

The OpenAI SDK validates that apiKey is non-empty even when the local server ignores the value. Pass a placeholder. People keep tripping on this.
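A minimal sketch of guarding against the empty-key trap before the config reaches the SDK. The `AgentConfig` interface and the `withLocalDefaults` helper are hypothetical names made up for illustration, not open-multi-agent's actual types; the only facts taken from the post are the field names and the 'ollama' placeholder.

```typescript
// Hypothetical shape for illustration; not the framework's real type.
interface AgentConfig {
  name: string
  provider: string
  model: string
  baseURL?: string
  apiKey?: string
  temperature?: number
  systemPrompt: string
}

// Matches http(s)://localhost or 127.0.0.1, with an optional port.
const LOCAL_URL = /^https?:\/\/(localhost|127\.0\.0\.1)(:\d+)?(\/|$)/

// If an agent points at a local OpenAI-compatible server and has no key,
// fill in a placeholder so the SDK's non-empty check passes.
function withLocalDefaults(agent: AgentConfig): AgentConfig {
  const isLocal = agent.baseURL !== undefined && LOCAL_URL.test(agent.baseURL)
  if (isLocal && !agent.apiKey) {
    return { ...agent, apiKey: 'ollama' }
  }
  return agent
}

const localReviewer = withLocalDefaults({
  name: 'reviewer',
  provider: 'openai',
  model: 'qwen3:8b',
  baseURL: 'http://localhost:11434/v1',
  systemPrompt: '...',
})
console.log(localReviewer.apiKey) // prints "ollama"
```

Cloud agents pass through untouched, so a missing real key still fails loudly instead of being silently replaced.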

I used orchestrator.runTasks(team, [...]) with an explicit DAG (architect → developer → reviewer) instead of runTeam(goal). Reason: the goal-driven path lets the coordinator decide which agents to skip, and in my testing it kept routing the review work to the developer instead of dispatching to the reviewer agent. If you want the local reviewer to actually run, runTasks is the reliable path. The framework supports both modes.
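To make the "explicit DAG" idea concrete, here is a rough standalone sketch of dependency-ordered scheduling. It is not the framework's API: the `Task` shape, the `dependsOn` field, and `executionOrder` are assumptions for illustration. The point it shows is that with declared dependencies every task, including the reviewer's, gets scheduled; nothing is left to a coordinator's discretion.

```typescript
// Hypothetical task shape for illustration; not open-multi-agent's type.
interface Task {
  id: string
  agent: string
  dependsOn: string[]
}

// Simple topological ordering: a task becomes runnable once every
// dependency has completed. Throws if nothing can make progress.
function executionOrder(tasks: Task[]): string[] {
  const done = new Set<string>()
  const order: string[] = []
  let pending = [...tasks]
  while (pending.length > 0) {
    const ready = pending.filter(t => t.dependsOn.every(d => done.has(d)))
    if (ready.length === 0) {
      throw new Error('cycle or missing dependency in task graph')
    }
    for (const t of ready) {
      done.add(t.id)
      order.push(t.id)
    }
    pending = pending.filter(t => !done.has(t.id))
  }
  return order
}

// Declared out of order on purpose; the dependencies decide the sequence.
const pipeline: Task[] = [
  { id: 'review', agent: 'reviewer', dependsOn: ['implement'] },
  { id: 'implement', agent: 'developer', dependsOn: ['design'] },
  { id: 'design', agent: 'architect', dependsOn: [] },
]

console.log(executionOrder(pipeline).join(' -> ')) // prints "design -> implement -> review"
```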

Per-agent ledger (single run, single workload, M1 / 16GB unified memory, Ollama 0.20.2):

agent       | model              | latency  | tokens in/out  | cost
------------+--------------------+----------+----------------+--------
architect   | deepseek-reasoner  |    25.3s |   1612/  2450  | $0.0009
developer   | deepseek-chat      |    68.1s | 108219/ 10408  | $0.0181
reviewer    | qwen3:8b           |   208.5s |   1432/   696  | $0 (local)

Grand total: $0.0190 USD
Wall total : 5:03

Pricing snapshot 2026-05-22: deepseek-chat / deepseek-reasoner both at $0.14 / $0.28 per MTok input/output.
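For anyone who wants to check the ledger, the cost column follows directly from the token counts and the pricing snapshot above. A small sketch; the type and function names are mine, the numbers are from the table:

```typescript
interface Usage {
  inputTokens: number
  outputTokens: number
}

interface Price {
  inputPerMTok: number  // USD per million input tokens
  outputPerMTok: number // USD per million output tokens
}

// Cost in USD for one agent's token usage at a given price.
function costUSD(u: Usage, p: Price): number {
  return (u.inputTokens * p.inputPerMTok + u.outputTokens * p.outputPerMTok) / 1e6
}

// Same price for deepseek-chat and deepseek-reasoner in this snapshot.
const deepseek: Price = { inputPerMTok: 0.14, outputPerMTok: 0.28 }

const architectCost = costUSD({ inputTokens: 1612, outputTokens: 2450 }, deepseek)
const developerCost = costUSD({ inputTokens: 108219, outputTokens: 10408 }, deepseek)
const totalCost = architectCost + developerCost // reviewer is local, so $0

console.log(architectCost.toFixed(4), developerCost.toFixed(4), totalCost.toFixed(4))
// prints "0.0009 0.0181 0.0190"
```

Most of the bill is the developer's 108K input tokens, not its output.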

A few honest observations for this sub:

1. The reviewer is the slowest agent. 208s on M1 16GB for ~1.4K input + ~700 output through qwen3:8b (Q4 default quantization on Ollama). Cloud agents finish in 25-68s for similar shapes. Local-reviewer trade is "zero marginal cloud cost vs minutes of wall time": fine for async/batch workflows, wrong for anyone polling on a request.

2. Don't reach for the biggest local model you can pull. Also: turn thinking off for review-shaped work. I tried the same DAG with qwen3.5:9b-mlx (MLX GPU-accelerated, 8.9GB, thinking mode default ON) as reviewer. It hit 1347s (22 minutes) for the reviewer alone. Then I re-ran the same DAG with /no_think in the reviewer task description (Qwen 3 series workaround for attenuating thinking). Reviewer dropped to 554s (9 minutes), even though the developer fed it 82% more input that run (38K tokens vs 21K). Output tokens dropped 63% (1566 to 568), confirming thinking content was the bulk of generation. Net of input variance, the pure thinking-mode effect on this M1 16GB is an estimated 4-5x reviewer latency.

Caveat on /no_think: it only partially attenuates thinking. Through the OpenAI-compatible endpoint, Ollama's native think: false flag gets ignored (the OpenAI schema has no such field), which leaves the in-prompt /no_think as the only lever there. Turning thinking fully off needs Ollama's native /api/chat with think: false, which ran the same prompt in 6.7s. Takeaway: match the local model and thinking setting to the actual task, not the biggest thing you can pull.
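A hedged sketch of what the native call could look like, bypassing the OpenAI-compatible layer. The `/api/chat` path and the `think`, `stream`, and `options` fields are Ollama's native request fields; the helper names, the default host, and the temperature value are assumptions carried over from the reviewer config above. It needs a runtime with global `fetch` (Node 18+).

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

// Build the body for Ollama's native /api/chat with thinking disabled.
function buildNativeChatRequest(model: string, messages: ChatMessage[]) {
  return {
    model,
    messages,
    think: false,  // native flag; not expressible through the OpenAI schema
    stream: false, // one JSON response instead of a chunk stream
    options: { temperature: 0.1 },
  }
}

// Send a review request straight to the local Ollama server.
async function reviewNatively(
  messages: ChatMessage[],
  host = 'http://localhost:11434', // assumed default Ollama address
): Promise<string> {
  const res = await fetch(`${host}/api/chat`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildNativeChatRequest('qwen3:8b', messages)),
  })
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`)
  }
  const data = (await res.json()) as { message?: { content?: string } }
  return data.message?.content ?? ''
}
```

The catch is that this path sits outside the framework's openai adapter, so wiring it in as the reviewer would need a custom adapter or a small proxy.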

3. Choosing runTasks over runTeam(goal) matters when local agents are involved. If you let the coordinator decide and it skips your local agent, you lose the cost saving and the architectural intent. Explicit DAG is the right default when one role is intentionally pinned to local.

Curious what reviewer-grade local models people are running. Specifically interested in:

Drop your setup if you've solved this.