The power of structured workflows and small local models
Posted by DeltaSqueezer@reddit | LocalLLaMA | 20 comments
A month ago, I experimented with a very basic home-rolled agent loop with a handful of tools and found it worked surprisingly well in spite of how crude it was:
https://www.reddit.com/r/LocalLLaMA/comments/1sl7f8e/homerolled_loop_agent_is_surprisingly_effective/
Later, I wrote about how addictive developing your own agent loop is, especially once you reach the point where the agent loop is capable of editing itself:
https://www.reddit.com/r/LocalLLaMA/comments/1sq7cie/warning_do_not_write_your_own_ai_agent_if_you/
Well, 28 days later, it's been getting out of hand. I've been working until 5am on it as it was so addictive.
Once you have a good agentic setup, you quickly realise that you, as the human, are the main bottleneck. You have a massive todo list, but the agent is sitting idle, waiting for your approvals and reviews.
Not only that, since I am using Qwen3.5 9B, the model has limited intelligence and context. I can't just dump hundreds of data files onto it and expect it to crunch them all, so I decided to manage the context limits with a map-reduce pattern: break tasks down into smaller chunks that can be run in parallel, extracting maximum FLOPs out of the GPU while staying within context limits.
Enforcing structured outputs also helps to reduce LLM variability and makes for a smooth reduce step.
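To make the map-reduce idea concrete, here is a minimal sketch, not the OP's actual code. `fake_model` is a hypothetical stand-in for a real LLM call, and the field names `count` and `total_chars` are made up for illustration. The point is that each chunk stays small, the chunks run in parallel, and every map result is validated against a fixed shape before the reduce step touches it.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Expected shape of every map-step result (hypothetical fields).
REQUIRED = {"count": int, "total_chars": int}


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_model(chunk_items):
    """Stand-in for an LLM call; returns a JSON string like a constrained model would."""
    return json.dumps({
        "count": len(chunk_items),
        "total_chars": sum(len(s) for s in chunk_items),
    })


def parse_structured(raw):
    """Parse one map result and reject it if a field is missing or mistyped."""
    data = json.loads(raw)
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data


def map_reduce(items, chunk_size=2, max_workers=4, model=fake_model):
    """Map: run the model on each chunk in parallel. Reduce: sum the validated fields."""
    chunks = chunk(items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        raw_results = list(pool.map(model, chunks))
    parsed = [parse_structured(r) for r in raw_results]
    return {
        "count": sum(p["count"] for p in parsed),
        "total_chars": sum(p["total_chars"] for p in parsed),
    }


if __name__ == "__main__":
    print(map_reduce(["a.txt", "bb.txt", "ccc.txt"]))
```

Because every map result has the same validated shape, the reduce step is plain arithmetic rather than another round of free-text parsing.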
Lastly, it is helpful to have a database to monitor and track workflows. Managed to get it up and running today and happy that small local models can handle this task.
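As a rough illustration of tracking workflows in a database, the sketch below uses Python's built-in `sqlite3`. The OP does not say which database or schema is used, so the `tasks` table, its columns, and the helper names are all assumptions.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the tracking database and create the (hypothetical) tasks table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, "
        "workflow TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "result TEXT)"
    )
    return conn


def add_task(conn, workflow):
    """Register a new pending task and return its id."""
    cur = conn.execute("INSERT INTO tasks (workflow) VALUES (?)", (workflow,))
    conn.commit()
    return cur.lastrowid


def set_status(conn, task_id, status, result=None):
    """Record a status change (e.g. 'running', 'done', 'failed') for one task."""
    conn.execute(
        "UPDATE tasks SET status = ?, result = ? WHERE id = ?",
        (status, result, task_id),
    )
    conn.commit()


def summary(conn, workflow):
    """Return a {status: count} mapping for one workflow."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM tasks WHERE workflow = ? GROUP BY status",
        (workflow,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = open_db()
    first = add_task(conn, "git-summary")
    add_task(conn, "git-summary")
    set_status(conn, first, "done", '{"commits": 3}')
    print(summary(conn, "git-summary"))
```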
Imaginary-Unit-3267@reddit
Would you mind writing up in detail how all this works and what you've built somewhere and linking it for noobs like me who just use the llama-server web ui and mcp tools to read? (Or at least pointing me to some writeup like that which already exists somewhere?)
DinoAmino@reddit
It's amazing what can be done locally when you drop the whole fantasy of zero-shotting everything and just use best practices.
AlistairMarr@reddit
Well, the problem is no one talks about best practices in depth.
Reddit feedback is all "You're holding it wrong" or "Works on my machine" in the large AI subs.
DinoAmino@reddit
I feel like there are many people with solid knowledge trying to explain things here, but people ignore good advice and even downvote such comments because they don't like to hear the truth or be told that some things take extra effort.
dataexception@reddit
My God. Thank you for saying it out loud. Some people take such offense when you suggest anything less than that they are perfect as they are, and whatever they do is great.
I'm curious how they would fare in the actual workforce. They certainly wouldn't be able to handle code reviews gracefully.
haragon@reddit
AWQ is quite the throwback. I'm not super familiar with it, why did you choose that over a gguf quant?
DeltaSqueezer@reddit (OP)
I use vLLM for batched throughput and GGUFs are not well supported on vLLM. I think AWQ is still quite commonly used.
I'm not actually using AWQ now. My old Qwen3 configuration used AWQ and I didn't change the model name to avoid having to change model name on all clients.
I'm currently using unquantized 9B.
MatlowAI@reddit
Just curious what made you opt for bf16 9b over 27b at a quant? Also nice to see other people plinking away at custom hobby agents!
haragon@reddit
Good to know, thanks!
argenkiwi@reddit
The approach I have been taking to make the most out of the local LLMs I can run on my own hardware, which are Qwen 3.6 27b and 35b-a3b as well as Gemma 4 27b and 31b (Mac M2 Pro 32GB), is to create minimal frameworks (see AmblerTS and Arch26) that consist of a small amount of code for structure and a comprehensive but focused set of agent skills to scaffold these projects.
I would love to delve into tying that up with development workflow automation, but I want to make sure it doesn't get out of hand as you put it. One of the things I would like to achieve is for the agent to identify repetitive deterministic tasks and create its own tools, using the frameworks I provide, to automate them for itself. Do you think it is achievable?
DeltaSqueezer@reddit (OP)
I had some thoughts and ideas in this direction but haven't implemented anything yet. There's a lot that can be done with simple hooks, triggers and scheduled jobs. While the system could come up with new tasks, that's something I'd like to keep a human in the loop (HITL) for rather than letting AI run riot.
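One way to picture that human-in-the-loop gate is sketched below. Since the OP hasn't implemented anything here, all of it is hypothetical, including the `Proposal` and `ApprovalQueue` names. The agent may propose new tasks, but nothing reaches the scheduled list until a human approves it.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """A task the agent would like to automate (hypothetical structure)."""
    name: str
    command: str
    approved: bool = False


@dataclass
class ApprovalQueue:
    """Holds agent proposals until a human approves or rejects them."""
    pending: list = field(default_factory=list)
    scheduled: list = field(default_factory=list)

    def propose(self, name, command):
        """Called by the agent: queue a proposal without running anything."""
        proposal = Proposal(name, command)
        self.pending.append(proposal)
        return proposal

    def review(self, name, approve):
        """Called by the human: move a proposal to `scheduled` or drop it."""
        for proposal in list(self.pending):
            if proposal.name == name:
                self.pending.remove(proposal)
                if approve:
                    proposal.approved = True
                    self.scheduled.append(proposal)
                return proposal
        raise KeyError(name)


if __name__ == "__main__":
    queue = ApprovalQueue()
    queue.propose("nightly-lint", "run linter at 02:00")
    queue.propose("auto-merge", "merge any green PR")
    queue.review("nightly-lint", approve=True)
    queue.review("auto-merge", approve=False)
    print([p.name for p in queue.scheduled])
```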
Silver-Champion-4846@reddit
Is this just for code? I'm blind, can't see images.
zanar97862@reddit
The images show the model creating a workflow for retrieving and analysing recent git commits from a repo then formatting the outputs in markdown.
DonnaPollson@reddit
This is the part a lot of people miss: once the model is no longer being asked to do everything in one giant prompt, small models suddenly look much smarter. Decomposition, structured outputs, checkpointing, and parallel map-reduce are not “extra scaffolding,” they’re the actual system design. The funny thing is that this is basically how good ops teams work too — you stop worshipping raw intelligence and start designing reliable workflows.
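Checkpointing in this setting can be as simple as writing each chunk's result to disk, so a rerun skips whatever already finished. The sketch below assumes results are JSON-serialisable. The function name, file naming scheme and `work` callback are illustrative, not taken from anyone's setup in this thread.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(chunks, work, checkpoint_dir):
    """Run `work` on each chunk, reusing saved results from earlier runs."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, chunk in enumerate(chunks):
        path = checkpoint_dir / f"chunk_{index}.json"
        if path.exists():
            # Already computed in a previous run: load instead of recomputing.
            results.append(json.loads(path.read_text()))
            continue
        result = work(chunk)
        path.write_text(json.dumps(result))
        results.append(result)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        chunks = [["a", "b"], ["c"]]
        print(run_with_checkpoints(chunks, lambda c: {"n": len(c)}, tmp))
```

Note that the checkpoint key here is just the chunk index, so this only works if the chunking is the same between runs.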
Danmoreng@reddit
Something similar was my long weekend project: my old gaming notebook (Aero 15X 2018, 32GB RAM, GTX 1070 8GB) set up as an Ubuntu server with a local agent running, for now simply to experiment. I am currently running Qwen3.6 35B Q4 with llama.cpp, which works pretty well on mixed CPU + GPU. I get an average prefill speed of 129.0 tokens/s and a generation speed of 15.22 tokens/s.
I built a whole nice management UI (mainly with Codex GPT 5.5 though). Currently I let Codex write the specifications for tasks and test how well Qwen3.6 handles them, with review from Codex again. It works surprisingly well; small changes get implemented quite decently. I chose https://github.com/earendil-works/pi as the agent runtime and just built on top of that. For 3 days of work the results are really nice, but there is so much improvement possible... the pipeline is endless. And testing whether the functionality works correctly must be done by a human, because the AI creates really weird bugs.
DeltaSqueezer@reddit (OP)
Python workflow generated in the above example looks like this:
This is just an example to demonstrate the map-reduce pattern and the ability for workers to make tool calls, chain steps, and constrain outputs to a JSON schema.
If registered, the backend can monitor workers and detect failed ones so that they can be recovered.
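How the backend detects failed workers isn't described, so the following is only one common approach, sketched under that assumption: each registered worker sends a heartbeat, and any worker that has been silent for longer than a timeout is treated as failed. `WorkerRegistry` and its methods are hypothetical names.

```python
import time


class WorkerRegistry:
    """Tracks the last heartbeat per worker and flags the ones that went quiet."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def heartbeat(self, worker_id):
        """Called by a worker (or on its behalf) to signal it is still alive."""
        self.last_seen[worker_id] = self.clock()

    def failed_workers(self):
        """Return the ids of workers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    registry = WorkerRegistry(timeout_s=0.1)
    registry.heartbeat("worker-1")
    time.sleep(0.2)
    registry.heartbeat("worker-2")
    print(registry.failed_workers())
```

The tasks of a failed worker could then be reset to pending and picked up by another worker.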
Borkato@reddit
This is cool; are there other prompts you used? Would love more ideas for best practices
Nnyan@reddit
Looking forward to release
Mattthhdp@reddit
Mind sharing your setup / agent? I would like to try that ^^
DeltaSqueezer@reddit (OP)
Setup is vLLM running Qwen3.5 9B. The agent is a custom one that isn't released yet, but I hope to open source it at some point in the future.