The power of structured workflows and small local models

Posted by DeltaSqueezer@reddit | LocalLLaMA | View on Reddit | 20 comments

A month ago, I experimented with a very basic home-rolled agent loop with a handful of tools and found it worked surprisingly well in spite of how crude it was:

https://www.reddit.com/r/LocalLLaMA/comments/1sl7f8e/homerolled_loop_agent_is_surprisingly_effective/

Later, I wrote about how I addictive developing your own agent loop is, esp. once you reach the point that the agent loop is capable of editing itself:

https://www.reddit.com/r/LocalLLaMA/comments/1sq7cie/warning_do_not_write_your_own_ai_agent_if_you/

Well, 28 days later, it's been getting out of hand. I've been working until 5am on it as it was so addictive.

Once you have a good agentic setup, you quickly realise that you, as the human, is the main bottleneck. You have a massive todo list, but the agent is sitting idle, waiting your your approvals and reviews.

Not only that, since I am using Qwen3.5 9B as the model, the model has limited intelligence and context. I can't just dump hundreds of data files onto it and expect it to crunch it all, so then I thought to manage the context limits through a map-reduce pattern, breaking tasks down into smaller chunks that can be run in parallel to extract maximum FLOPs out of the GPU while staying within context limits.

Enforcing structured outputs also helps to reduce LLM variability and make a smooth reduce step.

Lastly, it is helpful to have a database to monitor and track workflows. Managed to get it up and running today and happy that small local models can handle this task.

[-]

Imaginary-Unit-3267@reddit

Would you mind writing up in detail how all this works and what you've built somewhere and linking it for noobs like me who just use the llama-server web ui and mcp tools to read? (Or at least pointing me to some writeup like that which already exists somewhere?)

[-]

DinoAmino@reddit

It's amazing what can be done locally when you drop the whole fantasy of zero-shotting everything and just use best practices.

[-]

AlistairMarr@reddit

Well, the problem is no one talks about best practices in depth.

Reddit feedback is all "Your holding it wrong" or "Works on my machine" in the large AI subs.

[-]

DinoAmino@reddit

I feel like there are many people with solid knowledge trying to explain things here, but people ignore good advice and even downvote such comments because they don't like to hear truth or be told some things take extra effort.

[-]

dataexception@reddit

My God. Thank you for saying it out loud. Some people take such offense when you suggest anything less than that they are perfect as they are, and whatever they do is great.

I'm curious how they would fare in the actual workforce. They certainly wouldn't be able to handle code reviews gracefully.

[-]

haragon@reddit

AWQ is quite the throwback. I'm not super familiar with it, why did you choose that over a gguf quant?

[-]

DeltaSqueezer@reddit (OP)

I use vLLM for batched throughput and GGUFs are not well supported on vLLM. I think AWQ is still quite commonly used.

I'm not actually using AWQ now. My old Qwen3 configuration used AWQ and I didn't change the model name to avoid having to change model name on all clients.

I'm currently using unquantized 9B.

[-]

MatlowAI@reddit

Just curious what made you opt for bf16 9b over 27b at a quant? Also nice to see other people plinking away at custom hobby agents!

[-]

haragon@reddit

Good to know, thanks!

[-]

argenkiwi@reddit

The approach I have been taking to make the most out of the local LLMs I can run on my own hardware, which are Qwen 3.6 27b and 35b-a3b as well as Gemma 4 27b and 31b (Mac M2 Pro 32GB, is to create minimal frameworks (see AmblerTS and Arch26) that consist of a small amount of code for structure and a comprehensive but focus set of agent skills to scaffold these projects.

I would love to delve into tying that up with development workflow automation, but I want to make sure it doesn't get out of hand as you put it. One of the things I would like to achieve is for the agent to identify repetitive deterministic tasks and create its own tools, using the frameworks I provide, to automate them for itself. Do you think it is achievable?

[-]

DeltaSqueezer@reddit (OP)

I had some thoughts and ideas in this direction but haven't implemented anything yet. There's a lot that can be done with simple hooks, triggers and scheduled jobs. While system could come up with new tasks, that's something I'd like to keep a HITL for rather than letting AI run riot.

[-]

Silver-Champion-4846@reddit

Is this just for code? I'm blind, can't see images.

[-]

zanar97862@reddit

The images show the model creating a workflow for retrieving and analysing recent git commits from a repo then formatting the outputs in markdown.

[-]

DonnaPollson@reddit

This is the part a lot of people miss: once the model is no longer being asked to do everything in one giant prompt, small models suddenly look much smarter. Decomposition, structured outputs, checkpointing, and parallel map-reduce are not “extra scaffolding,” they’re the actual system design. The funny thing is that this is basically how good ops teams work too — you stop worshipping raw intelligence and start designing reliable workflows.

[-]

Danmoreng@reddit

Something similar was my long weekend project: my old gaming notebook (Aero 15X 2018, 32GB RAM, GTX 1070 8GB) setup as Ubuntu server with a local agent running, by now simply to experiment. I am currently running Qwen3.6 35B Q4 with llama.cpp, that works pretty well on mixed CPU + GPU. I get an average of Prefill/s 129.0 Tokens/s 15.22

Build a whole nice management UI (mainly with Codex GPT 5.5 though). Currently I let Codex write the specifications for tasks and test out, how good Qwen3.6 handles them - with review from Codex again. Works suprisingly well, small changes get implemented quite decent. I chose https://github.com/earendil-works/pi as the agent runtime, and just built ontop of that. For 3 days really nice results, but there is so much improvement possible...the pipeline is endless. And testing if the functionality works correctly must be done by a human, the AI creates really weird bugs.

[-]

DeltaSqueezer@reddit (OP)

Python workflow generated in the above example looks like this:

#!/usr/bin/env python3
"""
Workflow: Commit Analysis
1. Get the 6 most recent git commit short IDs.
2. For each commit, run a 2-stage pipeline:
   Stage 1 (tools: git_query): Examine the commit details.
   Stage 2 (structured output): Return structured JSON with classification.
3. Reduce: Combine all JSON results into a markdown table (pure Python).
"""
import json
import subprocess
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

from agent.workflow import (
    agent_call,
    map_step,
    reduce_step,
    run_workflow,
    finish_workflow,
    StepResult,
)
from agent.config import OutputFormat

COMMIT_SCHEMA = {
    "type": "object",
    "properties": {
        "short_id": {"type": "string"},
        "files_modified": {"type": "integer"},
        "classification": {"type": "string", "enum": ["bugfix", "feature", "other"]},
        "description": {"type": "string"},
    },
    "required": ["short_id", "files_modified", "classification", "description"],
}


def get_last_n_commits(n: int = 6) -> list[str]:
    """Get the last N git commit short IDs."""
    result = subprocess.run(
        ["git", "log", f"-{n}", "--format=%h"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"git log failed: {result.stderr}")
    return result.stdout.strip().splitlines()


def analyze_commit(short_id: str) -> StepResult:
    """Two-stage pipeline: tools → structured output."""
    # Stage 1: gather data with tools
    gather = agent_call(
        f"Examine git commit {short_id}.\n"
        f"Use git_query with subcommand='show' and ref='{short_id}' to get the commit details.\n"
        f"Then use git_query with subcommand='show_full' and ref='{short_id}' to see the full diff.\n\n"
        f"Report:\n"
        f"1. The total number of files changed\n"
        f"2. The commit message subject line\n"
        f"3. A brief summary of what the code changes do",
        tools=["git_query"],
        step_name=f"examine_{short_id}",
    )
    if not gather.ok:
        return StepResult(text="", ok=False, error=f"Stage 1 failed: {gather.error}")

    # Stage 2: structured JSON (tools stripped by output_format)
    return agent_call(
        f"Convert this commit analysis into structured JSON.\n\n"
        f"Commit short ID: {short_id}\n"
        f"Analysis:\n{gather.text}\n\n"
        f"JSON fields:\n"
        f'- short_id: the git short hash exactly "{short_id}"\n'
        f"- files_modified: count of files changed as an integer\n"
        f'- classification: "bugfix" if the commit fixes a bug, "feature" if it adds new functionality, "other" otherwise\n'
        f"- description: a one-sentence terse description starting with a verb (e.g., 'Fixes login issue', 'Adds user profile', 'Refactors validation')",
        tools=[],
        output_format=OutputFormat(json_schema=COMMIT_SCHEMA),
        step_name=f"classify_{short_id}",
    )


def build_table(texts: list[str]) -> str:
    """Pure Python reduce: JSON results → markdown table."""
    rows = [json.loads(t) for t in texts]
    header = "| Short ID | Files | Classification | Description |"
    sep = "|----------|-------|----------------|-------------|"
    body = []
    for r in rows:
        badge = "🐛" if r["classification"] == "bugfix" else "✨" if r["classification"] == "feature" else "🔧"
        body.append(f"| `{r['short_id']}` | {r['files_modified']} | {r['classification']} | {r['description']} |")
    return "\n".join([header, sep] + body)


def main():
    run_id = run_workflow("commit_analysis", {})
    try:
        commits = get_last_n_commits(6)
        print(f"Analyzing {len(commits)} commits: {', '.join(commits)}\n")

        # Fan out: each worker runs a 2-stage pipeline in parallel
        results = map_step(
            commits,
            worker_fn=analyze_commit,
            concurrency=5,
            run_id=run_id,
            step_name="audit",
        )

        # Print individual results
        for r in results:
            if not r.ok:
                print(f"  ⚠️  failed: {r.error}")
                continue
            data = json.loads(r.text)
            badge = "🐛" if data["classification"] == "bugfix" else "✨" if data["classification"] == "feature" else "🔧"
            print(f"  {badge} {data['short_id']} — {data['description']}")

        # Reduce: markdown table (pure Python, no LLM call)
        table = reduce_step(results, python_fn=build_table, run_id=run_id)
        print(f"\n{table.text}")
        finish_workflow(run_id, summary=f"Analyzed {len(commits)} commits")

    except Exception as e:
        finish_workflow(run_id, error=str(e))
        raise


if __name__ == "__main__":
    main()

This is just an example to demonstrate the map-reduce patter, the ability for workers to make tool calls, chain steps, contstrain outputs to a JSON schema.

If registered the backend can monitor workers and detect failed workers to recover.

[-]

DeltaSqueezer@reddit (OP)

Setup is vLLM running Qwen3.5 9B. The agent is a custom one that isn't released yet, but I hope to open source at some point in the future.