We have sub-agents at home
Posted by sisyphus-cycle@reddit | LocalLLaMA | 26 comments
At work I get unfettered access to gpt 5.4 and sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks.
At home I am VRAM poor and like to run the models locally for my own enjoyment. Almost every single sub-agent extension/implementation does not account for any of the restrictions imposed by having 10 GB of VRAM and a single slot for a KV cache (that's already quantized).
I already work as a developer, so qwen3.6-35b-a3b and I tag-teamed a partially vibe-coded fork of an existing sub-agent repository for pi coding agent.
This is really only relevant if you:
- Use pi coding agent as your harness
- Can only run a single LLM at a time with 1 slot via llama.cpp server
- Want to use sub-agents without fully reprocessing your prompts after the sub-agent is done
Repo is here, feel free to use it or fork it idc. I am also interested in how others around here have dealt with sub-agents on a purely local and VRAM constrained setup. I was also planning to add the ability for sub-agents to be spawned with no previous context, and to manage saving and restoring the main context via `--slot-save-path` and the `slots` endpoint. But the `.bin` files produced from that are pretty fat lol
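For anyone curious what the slot save/restore idea could look like, here is a minimal sketch, not the extension's actual code. It assumes llama-server was started with `--slot-save-path` and is listening on the default `127.0.0.1:8080`; the helper names (`slot_request`, `send`, `run_isolated_subagent`) and the file name `main_session.bin` are made up for illustration.

```python
import json
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://127.0.0.1:8080"  # assumption: default llama-server address


def slot_request(action: str, slot_id: int = 0,
                 filename: Optional[str] = None) -> urllib.request.Request:
    """Build a POST /slots/{id}?action=... request without sending it."""
    if action not in {"save", "restore", "erase"}:
        raise ValueError(f"unsupported slot action: {action}")
    # save/restore take a file name relative to --slot-save-path
    body = {} if filename is None else {"filename": filename}
    return urllib.request.Request(
        url=f"{BASE_URL}/slots/{slot_id}?action={action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request to a running server and decode the JSON reply."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_isolated_subagent(run_task: Callable[[], str], slot_id: int = 0,
                          filename: str = "main_session.bin") -> str:
    """Save the main context, run a fresh-context task, then restore it."""
    send(slot_request("save", slot_id, filename))
    try:
        return run_task()  # hypothetical callable that drives the sub-agent
    finally:
        # Restore even if the task fails, so the main session keeps its cache.
        send(slot_request("restore", slot_id, filename))
```

Only `run_isolated_subagent` touches the network; `slot_request` can be inspected offline to check the URL and payload before pointing it at a real server.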
Last thing, I've really been enjoying MTP in the main llama.cpp branch and have been getting pretty solid performance from the Apex Qwen variant. Able to run at 175-200k context with q_8 kv. Getting 200-300 pp and 25-40 tps depending on draft hit rates.
regunakyle@reddit
Can you add install instructions to the README? I am not sure, but maybe `pi install github.com/BenjaminBilbro/pi-subagent` would work
sisyphus-cycle@reddit (OP)
Haha lmk if it works. I’ve actually been using it with Gemma recently bc I can’t fit as much context as qwen.
PaceZealousideal6091@reddit
Thanks for posting this. It's worth exploring for creatures like us with vram poor setup. Quick question about APEX quants. I wanted to check them out but then unsloth released the kld chart comparing the kld of different quants. It showed APEX getting big hits in the kld for the file size it occupied as compared to the UD quants of similar sizes. So, I decided it's not worth exploring. Have you compared them for your workflow?
sisyphus-cycle@reddit (OP)
Honestly have only been using it for a day or 2 after I had a bad time with UD q_6 with MTP being kind of dumb. Task was to take a grid search based sub process spawner that modified 4 values of llama.cpp server flags and convert it to particle swarm. The code it made was just wrong and did not work. Will test apex on that same exact input prompt/test.
But I don’t have any real numbers, or examples just yet. How have you found the unsloth mtp quants?
PaceZealousideal6091@reddit
Well, I have been using UD q_4 km with mtp and it's been pretty good. I don't know whose quant you were using for APEX. But when I planned to download Mudler's quants, even his model card showed much worse kld means for Apex as compared to UD. (https://huggingface.co/mudler/Qwen3.6-35B-A3B-APEX-GGUF) So, on paper I find it a hard sell. Since you are using it anyways, I'd love to know what you find from your tests. For me, the UD q4 km at 21GB will always trump the 22GB APEX I-Quality unless there's a serious advantage. And I am finding it hard to justify if there are any. Which apex quant are you using btw?
sisyphus-cycle@reddit (OP)
Yeah I don’t have enough time with this apex model yet to really make a judgement call. I’m using the I-balanced one. It’s weird because yes the overall mean/median KLD does not beat unsloth (they’re still close), but the KLD max is less than half of unsloth. So it might be more robust for worst case divergence scenarios, where it just goes off the rails with the wrong answer.
I usually prefer bartowski, but he hasn’t released an MTP gguf yet, so I’ve been meaning to graft the mtp layer onto it myself.
Either way, I can’t say for certain if it’s actually better or worse yet, but when I get home later I’ll test it out with a more focused test and let you know!
PaceZealousideal6091@reddit
Yeah... About the kld max. I am not sure I would bother about it. Q8_0 has double the kld max of the top apex quants. That makes no sense. Anyways, those are outliers. Statistically, it's better to pay attention to median and mean. And most importantly real life results. So, looking forward to yours.
sisyphus-cycle@reddit (OP)
Running the tests now. One on apex and one on unsloth q6_K. I’m letting it run via pi so they can test and iterate code solutions. Repo for testing is here. It’s an exact issue we had to solve at my job 2 years ago. Apex seemed stuck for a long time but eventually grinded out a solution after using 40,775 tokens (95% of them reasoning). Running the unsloth one now. I’ll probably make a post with more details
sisyphus-cycle@reddit (OP)
Yeah I agree. It’s always a crapshoot looking at just statistics. Sometimes I’ll use the best model of all time on paper and it just doesn’t work the way I expect. I’ll do the PSO bench and a weird leetcode benchmark I have that measures total tokens used for solution
korino11@reddit
Why the hell has everyone started calling a CLI IDE a harness?!?! A real harness is a tool/method/logical schema to keep the model inside your topology of rules... but a CLI IDE is just an execution source
sisyphus-cycle@reddit (OP)
Yes and no. Pi by all definitions is a harness, albeit a simple one by default. Without pi you just have an inference server. Pi gives you tools, runtime env, and a way to “harness” LLMs. My specific pi setup is a harness, with hooks, design pattern specs, rules, etc. It’s semantics to argue whether pi or opencode or whatever is a harness.
Borkato@reddit
Can someone explain what a sub agent is/does compared to just asking it to fix stuff in one thread
sisyphus-cycle@reddit (OP)
So this specific implementation is focused on my restrictions. I only have 1 kv cache slot, I’m only running one model locally, and I am not offloading anything to cloud models. Anytime I’ve used sub agents for one off tasks it breaks my llama.cpp kv cache and reprocesses everything.
It specifically allows me to run a fork of the current session using the existing context to have the model perform a task. Then when that fork is done, the original session gets restored without any prompt reprocessing and any context the sub agent/fork added does not persist in the main session.
This isn’t all that different than just doing everything in the main model, but I can’t spend 10 minutes waiting for 125k+ tokens to get processed if I swap between contexts.
Basically:
- Main agent has 20k context of reading files and tool calls
- I want to add a new feature but it would require a bunch of web search and docs reading
- spawn sub agent to go learn everything, and propose detailed implementation guide (could add like 50-60k tokens)
- sub agent completes and shares condensed and relevant findings to main agent
- main agent still at 20k context but has a solid implementation plan with informed docs
Idk if I did a good job at explaining, but it’s useful for me locally to have context informed tasks be completed without adding additional context token overhead.
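The flow in that list can be sketched as a toy token-accounting model. This is only an illustration under assumptions: the `Session` class, the `run_subagent` helper, and every token count (200 for the task prompt, 55,000 for research, 1,500 for the summary) are invented, and it does not talk to pi or llama.cpp.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy stand-in for an agent session: a list of (text, token_count)."""
    messages: list = field(default_factory=list)

    def add(self, text: str, tokens: int) -> None:
        self.messages.append((text, tokens))

    @property
    def tokens(self) -> int:
        return sum(n for _, n in self.messages)

    def fork(self) -> "Session":
        # The fork shares the main session's prefix, so a prefix-matching
        # KV cache would not need to reprocess those messages.
        return Session(messages=list(self.messages))


def run_subagent(main: Session, task: str, work_tokens: int,
                 summary_tokens: int) -> Session:
    """Do the heavy work in a fork; hand only a condensed result back."""
    sub = main.fork()
    sub.add(f"task: {task}", 200)                      # hypothetical size
    sub.add("web search + docs reading", work_tokens)  # stays in the fork
    main.add(f"sub-agent findings for: {task}", summary_tokens)
    return sub


main = Session()
main.add("file reads and tool calls", 20_000)
sub = run_subagent(main, "research new feature",
                   work_tokens=55_000, summary_tokens=1_500)
print(main.tokens, sub.tokens)
```

With these made-up numbers the fork grows to about 75k tokens while the main session only grows by the size of the summary, which is the whole point of the pattern.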
laul_pogan@reddit
The `.bin` size scales directly with `n_ctx * n_layers * kv_dtype_bytes`. At 175k context with q8 KV on a 35B MoE, you're looking at multi-GB per saved slot. Switching to `--cache-type-k q4_0 --cache-type-v q4_0` cuts that roughly in half with negligible quality delta at those context lengths. Worth setting before you build out the slot-save orchestration so the files are manageable from the start.
sisyphus-cycle@reddit (OP)
Tbh I never really even tried running with q4_0 kv quants after seeing many comments saying it was really bad. I know qwen is specifically robust to kv quantization, but yeah.
Do you use q4?
laul_pogan@reddit
Qwen specifically is documented as robust to KV quant via its GQA + attention-head structure. At long context the quality delta between q4_0 and q8_0 is usually inside your eval-noise floor. Most "q4 KV is bad" comments come from non-Qwen models where it actually hurts. Worth A/B on your own eval rubric before rolling it out, but I'd be surprised if it bites you on Qwen3.
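To put rough numbers on the "cuts that roughly in half" estimate from earlier in this sub-thread, here is a back-of-the-envelope calculator. The model dimensions below (`n_attn_layers=40`, `n_kv_heads=4`, `head_dim=128`) are placeholders, not the real Qwen config, so read the actual values from your GGUF metadata. The per-element sizes follow from the ggml block layouts: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Bytes per element for common llama.cpp KV cache types.
# q8_0 / q4_0 blocks hold 32 values plus a 2-byte scale (34 and 18 bytes).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_tokens: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, k_type: str, v_type: str) -> float:
    """Estimate KV cache size for n_tokens cached tokens."""
    elements = n_kv_heads * head_dim  # per token, per layer, for K (and for V)
    k_bytes = elements * BYTES_PER_ELEMENT[k_type]
    v_bytes = elements * BYTES_PER_ELEMENT[v_type]
    return n_tokens * n_attn_layers * (k_bytes + v_bytes)


# Hypothetical dimensions, for illustration only.
shape = dict(n_tokens=175_000, n_attn_layers=40, n_kv_heads=4, head_dim=128)
q8 = kv_cache_bytes(**shape, k_type="q8_0", v_type="q8_0")
q4 = kv_cache_bytes(**shape, k_type="q4_0", v_type="q4_0")
print(f"q8_0: {q8 / 2**30:.2f} GiB, q4_0: {q4 / 2**30:.2f} GiB, "
      f"ratio: {q4 / q8:.2f}")
```

Whatever the real dimensions are, the q4_0 to q8_0 ratio is 18/34, about 0.53, so "roughly half" holds. The absolute size depends on how many layers actually keep a KV cache, which this sketch does not try to detect.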
StandardLovers@reddit
How safe is it to quant KV cache to q8 or q4 for long context agentic work? Are you experiencing memory tripping, hallucinations or invalid tool calls?
sisyphus-cycle@reddit (OP)
Never tried q4 kv cache, just q8. I’ve never had it make invalid tool calls even at essentially full context. Memory tripping is a bit more of an issue near the end, but if you remind it, the recovery is clean.
I’ve been happy w q8 for agentic stuff, qwen is a beast even when you’re pushing 200k context. I also tend to code by using a lot of detailed task.md files with a technical write-up and focused “win-conditions”. I’ve found qwen to be awesome if you load it with enough directions. It’s how I made this extension, only took me examining the outgoing messages from pi to llama for sessions to see where exactly the system prompt/tool array was changing.
But even if you were using bf16 for kv cache it’s not like gpt 5.5 where I just give it a 2 sentence task and it just knows. Smaller models need bigger directions lol.
StandardLovers@reddit
GPT 5.5 ? I thought you used a local model since you posted in /LocalLLaMA
m3umax@reddit
When I built my delegate extension, I added the ability to "anchor" a message ID to spawn an agent from, which could be any arbitrary message in the tree from the beginning to the current message.
So the prefix of the subagent matches exactly with what's in the cache.
When spawning, my extension can use either the saved anchor message or the full current context to pass to the sub agent.
sisyphus-cycle@reddit (OP)
That’s smart. You would lose some efficiency gains the further back you go right? Like if you spawn an agent using message 0 -> message 10 but you actually have 100 messages. When the agent finishes you now have:
message 0 -> message 100 -> agent result.
Depending on if llama saved the previous prefix to ram (-cram I think?) it might be able to load previous prefix/state up to message 100 and not do prompt reprocessing. But not guaranteed I think.
I do like the concept of choosing how much context an agent gets though
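The trade-off in the message 0 -> 10 versus 0 -> 100 example can be sketched with a simple longest-common-prefix count. This is a simplified model under assumptions: message IDs stand in for tokens, there is a single slot, and it ignores any extra prefix the server might recover from a RAM cache. The function names are illustrative.

```python
def common_prefix_len(cached: list, prompt: list) -> int:
    """Number of leading items the cache and the new prompt share."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def items_to_reprocess(cached: list, prompt: list) -> int:
    """Items of the new prompt not covered by the cached prefix."""
    return len(prompt) - common_prefix_len(cached, prompt)


history = list(range(101))  # messages 0..100 in the main session

# Case 1: sub-agent anchored at message 10. Afterwards the single slot
# holds messages 0..10 plus the sub-agent's own work.
cache_after_anchor = history[:11] + ["task", "work"]
# Case 2: sub-agent forked from the full current context.
cache_after_fork = history + ["task", "work"]

main_resume = history + ["agent result"]
print(items_to_reprocess(cache_after_anchor, main_resume))
print(items_to_reprocess(cache_after_fork, main_resume))
```

In this model, resuming after the anchored spawn has to reprocess everything past message 10, while resuming after the full-context fork only has to process the new result message.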
m3umax@reddit
Honestly, it needs experimentation to know for sure what is and isn't hitting the cache. That should be my next extension. A basic telemetry display showing cached tokens etc.
I'm using oMLX on a Mac which keeps KV cache persisted on SSD for as many sessions as you have disk space for, so my results would likely be very different to yours running just llama.cpp.
Asleep-Land-3914@reddit
I tried to use a fork extension for this. The funniest thing is when the forked session doesn't realize it's forked and commits a fork bomb by continuously forking itself further. That said thanks for posting, I'll try the solution
arbv@reddit
You can't rely on LLM for this. The extension should disable itself for the subagent.
sisyphus-cycle@reddit (OP)
Dude yeah, I spent a good bit of time trying to figure out how to explain to the LLM via system prompt the flow. I actually even considered changing the name from “sub-agent” to “focused-task” or “one-shot-mode”. Most LLMs are trained with data about sub agents, and get confused at the possibility of it being a sub agent lmao
It’s been working for me pretty well with the included system prompt. Also I explicitly blocked any invocation of spawning another sub agent from within a sub agent, so it’s always depth 1.
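A depth-1 guard like the one described here can be enforced in code rather than by prompt alone. The sketch below is one possible shape, not the extension's implementation; `AgentContext`, `SubagentDepthError`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass

MAX_DEPTH = 1  # the main agent is depth 0; sub-agents may not spawn further


class SubagentDepthError(RuntimeError):
    """Raised when a sub-agent tries to spawn another sub-agent."""


@dataclass(frozen=True)
class AgentContext:
    depth: int = 0

    def tools(self) -> list:
        base = ["read", "edit", "bash"]  # hypothetical tool names
        if self.depth < MAX_DEPTH:
            # Only agents that are still allowed to spawn see the tool at all.
            return base + ["spawn_subagent"]
        return base

    def spawn(self) -> "AgentContext":
        # Hard check in code, in case the model calls the tool anyway.
        if self.depth >= MAX_DEPTH:
            raise SubagentDepthError("sub-agents cannot spawn sub-agents")
        return AgentContext(depth=self.depth + 1)


root = AgentContext()
child = root.spawn()
print(root.tools(), child.tools())
```

Hiding the tool from the sub-agent and also rejecting the call covers both failure modes: the model never sees the option, and a stray call still cannot recurse.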
Danmoreng@reddit
Sounds great. My weekend project was putting Ubuntu server on my old gaming notebook (32GB ram 8GB vram) and building some sort of local agent platform with pi at its core and Qwen35b Q4 as the model. Getting around 150 pp/s and between 15 and 20 tg/s. Task queue, give it some tasks to run over the night, come back in the morning to see some of it properly implemented but also some just breaking stuff. It’s addictive.