We have sub-agents at home

Posted by sisyphus-cycle@reddit | LocalLLaMA

At work I get unfettered access to GPT 5.4 and Sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks.

At home I am VRAM poor and like to run models locally for my own enjoyment. Almost every sub-agent extension/implementation fails to account for the restrictions imposed by having 10 GB of VRAM and a single slot for a KV cache (that's already quantized).

I already work as a developer, so qwen3.6-35b-a3b and I tag-teamed a partially vibe-coded fork of an existing sub-agent repository for the pi coding agent.

This is really only relevant if you:

Repo is here, feel free to use it or fork it, idc. I'm also interested in how others around here have dealt with sub-agents on a purely local, VRAM-constrained setup. I was also planning to add the ability for sub-agents to be spawned with no previous context, and to manage saving and restoring the main context via `--slot-save-path` and the `slots` endpoint. But the `.bin` files produced from that are pretty fat lol
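
For anyone curious, here is a rough Python sketch of what that save / erase / restore flow could look like against llama-server. It is not what the fork does today. It assumes the server was started with `--slot-save-path` and exposes `POST /slots/{id}?action=save|restore|erase` as described in the llama.cpp server README, so check that against your build. `BASE_URL`, the `main_ctx.bin` filename, and the helper names are all made up for illustration.

```python
import json
import urllib.request
from typing import Callable, Optional

# Assumption: llama-server listening on its usual local host/port.
BASE_URL = "http://127.0.0.1:8080"


def slot_request(action: str, slot_id: int = 0,
                 filename: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for the slots endpoint."""
    if action not in {"save", "restore", "erase"}:
        raise ValueError(f"unsupported action: {action}")
    if action in {"save", "restore"} and not filename:
        raise ValueError(f"action '{action}' needs a filename")
    # The filename is resolved by the server relative to --slot-save-path.
    payload = {"filename": filename} if filename else {}
    return urllib.request.Request(
        url=f"{BASE_URL}/slots/{slot_id}?action={action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON reply."""
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())


def run_with_clean_slot(run_subagent: Callable[[], str], slot_id: int = 0,
                        filename: str = "main_ctx.bin",
                        send_fn: Callable[[urllib.request.Request], dict] = send) -> str:
    """Park the main context on disk, run a sub-agent on an empty slot,
    then restore the main context even if the sub-agent raises."""
    send_fn(slot_request("save", slot_id, filename))
    send_fn(slot_request("erase", slot_id))
    try:
        return run_subagent()
    finally:
        send_fn(slot_request("restore", slot_id, filename))
```

`send_fn` is injectable so the ordering can be dry-run without a server. The `finally` is there so a crashed sub-agent doesn't leave the main agent without its context.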

Last thing, I've really been enjoying MTP in the main llama.cpp branch and have been getting pretty solid performance from the Apex Qwen variant. I'm able to run at 175-200k context with a q8_0 KV cache, getting 200-300 t/s prompt processing and 25-40 t/s generation depending on draft hit rates.