What do you consider to be the minimum performance (t/s) for local Agent workflows?
Posted by MexInAbu@reddit | LocalLLaMA | 61 comments
What would you say is the minimum amount of tokens per second you would tolerate for your local agent workflows?
I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context on an RTX A6000. I get about 26 t/s and it is surprisingly usable, about the same user experience I get with Claude Code connected to Anthropic. But I have just been fooling around with relatively simple prompts so far. I'm trying out the Brave Search API.
bigh-aus@reddit
I think it depends on the use case, but in yours, where you're waiting for feedback, 26+ would be nice.
A second use case exists for more "background work", where I think it would be OK to let it be slower, say 14. This is where you have an agent working on a backlog, so you're not necessarily waiting for a response.
ConferenceMountain72@reddit
"What's the minimum prompt processing speed" would be a better question in terms of local agentic workflows. I haven't tried local agentic workflows, but I did try heavy tool usage, I have around 350-400 pp with 122b model, it is working well enough, i'd say.
UncleRedz@reddit
The 122b model, is that Qwen 3.5/3.6 or something else? How does it compare to the smaller models?
ConferenceMountain72@reddit
It is Qwen 3.5 122B. Waiting for Qwen 3.6 to drop to replace it with. For the comparison: I use these models in heavy tool-usage scenarios where world knowledge is the second most important thing, after being able to use tools of course. So the 3.6 27B model was not enough for me, as it lacked the other 95 billion parameters of world knowledge. I use it for all sorts of things, including real-life tasks or random things I wonder about, so it's nice to have this wide knowledge.
What I also noticed is: the lower the parameter count, the stronger the default "You are a helpful assistant" answer tone, even with strict system prompts, especially after some context builds up.
As for bigger vs. smaller models: I switched to this model from Qwen3-Next-80B, and genuinely that model was undercooked and unpolished; it was obviously an experiment. This 122B model feels like the "perfected" version of it. But since 3.6 will be an obvious improvement, it's still not truly perfected, so to speak.
Ok-Internal9317@reddit
I would say 2000 tok/s PP is the minimum for it to be usable for me. I frequently go from 20K to 200K context, and waiting 10 seconds per request is already pretty painful. APIs are still my only option for now. TG? TG can be 10 tok/s, I really don't care.
MexInAbu@reddit (OP)
True, I hadn't thought about that. I guess because I use strong Nvidia GPUs, prompt processing speed is much less of a concern for me than for those using Mac, AMD, or CPU.
Bootes-sphere@reddit
Depends entirely on your agent's decision loop. If it's mostly waiting on external API calls (tool use, web searches), you can get away with 5-10 t/s. The LLM just isn't the bottleneck.
But if you're doing heavy reasoning chains or multi-turn planning, anything below 15-20 t/s starts feeling sluggish. You're watching the agent "think" in real-time, and it kills UX.
Qwen 27B at Q6_K_L should comfortably hit 20-30 t/s on decent hardware. The real question: are you compute-bound or memory-bound? If you're fully on GPU, memory bandwidth is your limiting factor for generation. If you're spilling into system RAM, quantization helps, but you might want to test a smaller model first.
What's your hardware setup? That changes the answer significantly.
Bootes-sphere@reddit
26 t/s is genuinely solid for local agent work; that's the sweet spot where you get acceptable latency without crazy hardware. For most agentic tasks with tool calling and reasoning steps, I'd say anything above 15-20 t/s feels usable, but your setup sounds ideal. The Qwen model choice is smart too; it handles function calling well at that size. If you ever need to compare against cloud inference as a fallback option, the newer Qwen and Llama models on public APIs are pretty cheap these days (starting around $0.01 per 1M tokens), so you could run your agent as a hybrid depending on workload complexity.
ai_guy_nerd@reddit
Tokens per second matter most when the AI is blocking a human's workflow, like in a live chat or real-time coding. For those cases, 20-30 t/s feels snappy and keeps the flow.
However, for autonomous agent workflows, the priority shifts from speed to reliability and context handling. When an agent is running in the background, performing research or managing a homelab, even 5-10 t/s is perfectly acceptable because the human isn't waiting on the screen.
Systems like OpenClaw prove that the "think and execute" loop is more important than raw speed. If the agent can reliably handle a large context window and execute the right shell command, a slightly slower generation speed is a fair trade-off for better reasoning.
Main-Confidence7777@reddit
The t/s question is the wrong axis for agent workflows. What kills usability isn't speed, it's capability degradation mid-loop. A 27B model at 26 t/s will confidently hallucinate a tool call on turn 8 of a 20-step agentic task, and you wouldn't catch it at 3 t/s either.
For interactive chat, 15 t/s is fine. For real agent loops (multi-file edits, bash feedback cycles, 50+ turn context), the threshold is: can the model recover from its own mistakes? That's a parameter count and RLHF question, not a t/s question.
I run Claude Code for the same reason you'd expect a local-first person not to: no local 27B clears that bar reliably yet on complex tool chaining. Qwen3 is the closest contender but still drops the ball on error recovery past turn 15 or so. Speed is a solved problem at A6000 class hardware. Capability isn't.
Caffdy@reddit
can you give us an example of some of these tools?
Main-Confidence7777@reddit
A few from this week:
- Refactor a function signature across ~14 files: update the def, all call sites, the tests, and the type stubs. The model has to track which sites take kwargs vs positional args and not corrupt the ones that already match.
- Debug loop: pytest fails → read traceback → grep for the assertion source → realize the actual bug is in a fixture two levels up in conftest.py → patch → rerun → notice the patch broke an unrelated test that depended on the old fixture behavior → decide whether to revert or fix the dependent test.
- Feature behind a flag: edit the config schema, write the migration, update the API handler, wire it through the frontend, add the e2e test, run it, then fix the typo the model itself introduced in step 2.
The thing that breaks local 27Bs isn't any single step. It's holding coherent intent across 15+ tool calls when half the intermediate results are noise, failed bash exits, irrelevant grep hits, files that don't exist, stale stdout from a previous attempt. Qwen3 starts 'fixing' things that aren't broken around turn 12. Sonnet stays on the original task and ignores the noise.
Imaginary-Unit-3267@reddit
Isn't this what trees of subagents are for?
MexInAbu@reddit (OP)
I guess I have limited experience with truly autonomous agent workflows. I don't trust them for my day job; whenever I've given them tasks without supervision, it has come back to bite me.
That experience has coloured the way I work with them on my personal projects, which is with me always in the loop.
Willing-Toe1942@reddit
the trick isn't the speed. for me 500 PP is enough because prefix caching is the real magic
I have a system that logs tokens and generates statistics. daily I use 23M tokens and almost 98% is handled through the cache
Djagatahel@reddit
You still need to process those 23M tokens once though, and that's still ~3 hours at 2000 tk/s
Obviously it would be way way worse without caching
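Quick back-of-envelope with those numbers (treating cache hits as free, which is optimistic; the speeds are assumptions):

```python
# Rough daily prompt-processing time, using the numbers above.
daily_tokens = 23_000_000   # from the comment above
hit_rate = 0.98             # ditto
pp_speed = 2000             # tokens/sec, assumed

cold = daily_tokens / pp_speed                    # no cache: every token processed
warm = daily_tokens * (1 - hit_rate) / pp_speed   # only the ~2% of misses processed

print(f"no cache:   {cold / 3600:.1f} h")    # ~3.2 h
print(f"with cache: {warm / 60:.1f} min")    # ~3.8 min
```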
rpkarma@reddit
20tps is as low as I can go
Chinmay101202@reddit
it's never enough honestly.
masterlafontaine@reddit
2tps. I have patience
giant3@reddit
2 tps is tolerable on non-reasoning models.
On reasoning models, the reasoning tokens by themselves easily exceed 2 or 3k even for simple queries, so you are looking at waiting for ~25 minutes.
AykutSek@reddit
t/s alone is misleading imo.
if a task has 6-8 tool calls, you pay that latency 6-8 times. so even 30-40 tg drags.
my floor is 25 tg, 35+ is leave-it-running territory.
honestly pp matters more than tg for agent stuff. past ~50k context pp craters and the loop just stalls between turns regardless of how fast you generate. been splitting into smaller subtasks and trimming stale tool outputs to keep context lean. made way more difference than any speed tuning i tried tbh.
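the trimming part is nothing fancy, roughly this (openai-style message dicts assumed, names made up):

```python
def trim_stale_tool_outputs(messages, keep_last=3, stub="[tool output trimmed]"):
    """replace tool outputs older than the last `keep_last` with a short stub.
    old tool results (failed bash exits, irrelevant grep hits) are usually
    noise once the agent has moved on."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [{**m, "content": stub} if i in stale else m
            for i, m in enumerate(messages)]
```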
AvidCyclist250@reddit
30
audioen@reddit
I've been tolerating ~11-17 tokens per second all day; that's what I get out of a GB10 chip and Qwen3.6-27B-FP8 with vLLM when speculating 2 tokens. I found someone's recipe which I adopted almost 1:1, except I had to remove the FP8-quantized KV cache. With it, the model was clearly confused, seemingly thinking I kept instructing it to do stuff when the message with the instructions was already long in the past; as soon as I removed it, I saw no problems with the model's understanding of my messages and their ordering.
When reciting code etc. it has basically 100% accurate MTP prediction and spews around 17 tokens per second; otherwise it varies. Because the GB10 has compute, just not RAM bandwidth, I can actually run multiple parallel agents at once. Each gets about the same generation speed and about half the prompt processing throughput, which runs at around 1000 tok/s. I think it would keep scaling up to about 4 parallel streams; beyond that, overall token generation isn't going to improve. However, I find that just running 2 agents in parallel is enough for me, because they already keep me busy.
-dysangel-@reddit
Yeah - as long as prompt processing time is decent and the agent is smart, 20tps is pretty workable.
You could also use stuff like the caveman skill to make more effective use of the tokens that you are outputting.
Caffdy@reddit
can you give us a rundown of what skills are? new to this
robogame_dev@reddit
A skill is literally a markdown file and a sentence of description.
The model gets prompted with a description of all the skills, and if one sounds applicable to the task, it uses a tool call to read the skill file.
Caffdy@reddit
ok now I'm curious, how does this work? where are normally skills stored, and what are these tools that can summon/read/interact with them? can you give me an example, if it's not much to ask
MexInAbu@reddit (OP)
It's much simpler than it seems.
Look, LLMs "understand" text. So "RAG", "tools", "agents" and "skills" are just different types of text you add to the prompt of an LLM.
Tool usage is as easy as adding to your prompt "if the user wants to know the weather, call function 'getWeather()'". So a skill is as simple as adding to your prompt "write like a caveman". Increase the text if you want the behaviour to be more precise. And having a set of different agents is as simple as having different LLM contexts that were prompted with different "skills" and "tools".
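Stripped down to a toy sketch (no particular framework; getWeather and the prompt format here are made up):

```python
# Toy example: the "tool" and the "skill" are literally just strings in the
# prompt. get_weather is a stub standing in for a real API.

def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"  # stub

SKILL = "Write like a caveman: short words, no filler."
TOOL = 'If the user asks about weather, answer only: CALL getWeather("<city>")'

def run_turn(llm, user_msg: str) -> str:
    prompt = f"{SKILL}\n{TOOL}\nUser: {user_msg}\nAssistant:"
    reply = llm(prompt)
    # the "agent" part is just scanning the reply for the magic string
    # and feeding the tool result back into the context
    if 'CALL getWeather(' in reply:
        city = reply.split('"')[1]
        reply = llm(f"{prompt} {reply}\nTool result: "
                    f"{get_weather(city)}\nAssistant:")
    return reply
```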
robogame_dev@reddit
Skills only exist within model harnesses - they're a protocol: if you set up a folder of skills in a project, you can open that project in Claude Code or Kilo Code or Cursor etc. and they'll all look at the skill files, extract the descriptions, and give the model a tool to access the full skill.
What exactly that tool looks like is up to the harness itself. You could do it just by giving the model a general-purpose “read_file(path)” and the path to the skills, or you could give it a special “check_skill(name)” tool that resolves it internally, etc. The point is that everything important lives in the markdown file and isn't locked to any particular harness / software.
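A bare-bones sketch of the harness side, assuming one markdown file per skill with the description on its first line (the layout and function names are made up; real harnesses differ):

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # e.g. skills/caveman.md, skills/code-review.md

def skill_index() -> str:
    """One line per skill for the system prompt: name plus its description."""
    lines = [f"- {f.stem}: {f.read_text().splitlines()[0]}"
             for f in sorted(SKILLS_DIR.glob("*.md"))]
    return "Available skills (load one with check_skill(name)):\n" + "\n".join(lines)

def check_skill(name: str) -> str:
    """The tool the model calls when a skill's description sounds relevant."""
    path = SKILLS_DIR / f"{name}.md"
    if not path.is_file():
        return f"unknown skill: {name}"  # let the model see the error and recover
    return path.read_text()
```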
-dysangel-@reddit
Skills are kind of like different system prompts that you can bring out when you want the agent to change mode to act in a specific way. The caveman skill just asks the model to be more succinct. I remember there was some good research on this stuff a couple of years ago showing that you could maintain technical performance while cutting down on token usage just by prompting the model. I had never actually got around to trying it until the caveman skill though. And of course it still just feels kind of gimmicky, even though it works. I think a similar skill but with grammatically correct language would feel like something I could use every day.
Pyros-SD-Models@reddit
lol this skill is amazing. And genius.
iamn0@reddit
20 output tokens/sec is perfectly fine, but prompt processing can get annoying. I tested Qwen3.6-27B as Q5 in opencode (on a system with 4x RTX 3090 cards). Up to ~50k context everything is great, but once the context window exceeds ~100k, you really notice how annoying the wait becomes, which is why I start a new session at ~120k context at the latest.
Far-Low-4705@reddit
Do u use instruct mode? The thinking would be too long no?
iamn0@reddit
thinking mode. It thinks a lot but good.
Makers7886@reddit
I highly recommend you try: TheHouseOfTheDude/Qwen3.6-27B-INT8 via vLLM with those 4x3090s. That will blow away every metric that matters vs a Q5 gguf on the same hardware - including your 100k+ speeds.
psyclik@reddit
Do you have a “go-to way” to get vLLM running (preferably through containers)? It's always a pain for me to get it working in a clean and reproducible manner, and I always fall back to llamacpp/ik.
laterbreh@reddit
If you run this on vLLM you'll notice maybe a 10% slowdown in processing and inference speed once you start packing the context, but it stays fast and doesn't have the context baggage that llamacpp has.
iamn0@reddit
Thanks for the tip! What's important to me is that the model is uncensored. I tested Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf while waiting for an AWQ version for vllm. I also wanted to try pi.dev instead of opencode.
Makers7886@reddit
Ahh yeah not enough people modding for vLLM-compatible quants. Hard to beat that vLLM + pi combo on that hardware. Don't sleep on that, it's worth making it happen because it's a huge unlock.
FullstackSensei@reddit
-sm row tends to have worse overall performance on 4 cards for such small models in my experience. -sm tensor is great but I find it's not yet very stable. If you have more than one 3090, just go for Q8. The savings in model size will be mostly negated by the unaligned memory access when unpacking those tensors.
Express_Quail_1493@reddit
I think he's looking for the "feel" of different speeds. Well, if you are, then:
- 5 tok/s is like watching a caveman solve calculus.
- 10-14 tok/s feels like you are pair programming side by side with the model. Still tolerable.
- 30 tok/s is the spot where you can step away and come back to a good amount of work done.
- 60 tok/s and above: kinda hard to notice the difference for me past 60 tok/s, but expect to get what you want at the snap of a finger. at 60 you will keep having to prompt more and more because there is little to no waiting!!!
Chinmay101202@reddit
what is the use case? depends entirely on that.
Yes_but_I_think@reddit
40 t/s
JLeonsarmiento@reddit
50 t/s.
virtualicex@reddit
for a thinking model, under 40 t/s becomes too slow
my_name_isnt_clever@reddit
I was happy with ~120b models getting >20 t/s at high context, but now I'm spoiled by Qwen 3.6 35b. With that model I don't get less than 40 t/s.
It's been great for agentic use. Low generation speed isn't that bad for direct chats, but when the model has to do 10 iterations behind the scenes before you get any output, it really drags. Right now it feels like I'm using a cloud model, it's that smooth and effective.
robberviet@reddit
Min is 20. Nothing works at 10.
woolcoxm@reddit
20 tok/s is an OK speed, usable. It's prompt processing that eats all the time for me; usually after about 50k context, prompt processing becomes long with locally hosted models. If you are only getting 20 tok/s, then prompt processing will be the long wait after about 50k context.
GregoryfromtheHood@reddit
Somewhere around 1000 t/s; 500 is a bit slow. I ran 10k t/s on a 5090 on a smaller MoE and that feels pretty good. Specifically talking prompt processing speeds. For token generation I don't really see much difference for agentic stuff: 20 t/s and 80 t/s feel pretty similar when it's only actually generating a few lines here and there but processing thousands of tokens of context in between the tool calls.
TheRenegadeKaladian@reddit
Anything above 20 t/s is workable for me. I figured out a way to run different versions of qwen3.6-35b-a3b (from unsloth, tq3) on ik_llama, llama, and TurboQuant llamacpp builds, and ended up getting 40 t/s (40 to 45) now. Happy with it. Way more usable.
suicidaleggroll@reddit
For me, about 500 pp and 40 tg is the minimum, but if you can get up to around 2000 pp and 70 tg it’s much nicer.
pj-frey@reddit
Difficult question; you have to look at the details.
I get about 70 t/s with Qwen 3.6 35B and 20 t/s with Qwen 3.6 27B. Yes, the MoE model is much faster, but it also thinks three times more than the dense model. And in the end, the slower model gives better answers. So speed is not everything.
To answer directly: Anything below 20 tokens per second feels unusable somehow.
jacek2023@reddit
I use Gemma 26B and my speed is 50-90 t/s (depending on the context). Dense models are a little too slow for me (closer to 10-30 t/s); it will probably make sense to use dense with tensor split and a self-speculative workflow.
triplebits@reddit
For agent loops the metric that bites is time to first token and variance, not peak t/s. If your agent runs 12 sequential tool calls, even 30 t/s starts feeling slow because each call carries its own evaluation and sampling overhead.
Rough floors:
The bigger lever is usually cutting round-trips, not pushing hardware. Batching decisions, keeping system prompts short, and passing compact structured output between steps slash more latency than going from 30 to 50 t/s on the same rig.
If your workflows are I/O bound (web fetches, file ops), you can tolerate much lower t/s than reasoning-heavy chains.
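Rough model of why round-trips dominate (every number below is an assumption, not a benchmark):

```python
def turn_seconds(ctx_tokens, out_tokens, pp=500, tg=30, overhead=0.5):
    # one agent turn: evaluate the uncached part of the context, then generate;
    # `overhead` lumps in sampling/tool-dispatch cost
    return ctx_tokens / pp + out_tokens / tg + overhead

# 12 sequential tool calls, ~2k uncached tokens and ~150 output tokens each
loop = sum(turn_seconds(2_000, 150) for _ in range(12))
fast = sum(turn_seconds(2_000, 150, tg=50) for _ in range(12))
print(f"{loop:.0f}s vs {fast:.0f}s")  # ~114s vs ~90s: going 30->50 tg buys little;
                                      # halving the number of turns saves ~57s
```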
GoingOnYourTomb@reddit
Currently I run Qwen 3.6 36B at 29 t/s, but this is just on the first question, not agent coding. I know once I start to pile up the context this number will drop. I haven't fully used it for an actual project yet, but I plan to; hopefully all goes well. I'm running it on llama.cpp with an Intel Arc B580.
Exact_Guarantee4695@reddit
agree on prompt processing being the actual bottleneck. output speed at 20-25 t/s feels fine because you're reading anyway, but if pp drops below ~1k tokens/sec on long context you start watching paint dry between turns.
ran a similar setup with Qwen3.5-32B and the cliff hits around 60-80k context where pp basically halves. now i keep agent context under 40k and let the orchestrator summarize older turns. saves more than chasing higher tps.
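the orchestrator step is roughly this (a sketch; assumes an openai-compatible endpoint like llama.cpp's server and a crude 4-chars-per-token estimate):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def approx_tokens(messages):
    return sum(len(m["content"]) for m in messages) // 4  # crude estimate

def compact(messages, budget=40_000, keep_last=6):
    """summarize everything except the system prompt and the last few turns."""
    if approx_tokens(messages) < budget:
        return messages
    head, old, recent = messages[:1], messages[1:-keep_last], messages[-keep_last:]
    summary = client.chat.completions.create(
        model="local",
        messages=old + [{"role": "user",
                         "content": "Summarize the turns above in under 500 words. "
                                    "Keep file paths, decisions made, and open TODOs."}],
    ).choices[0].message.content
    return head + [{"role": "user",
                    "content": f"[summary of earlier turns]\n{summary}"}] + recent
```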
maschayana@reddit
40 t/s
RoroTitiFR@reddit
IMO, the minimum sweet spot is 25-30 tps, for coding for example
milpster@reddit
i consider everything above 100pp/1tg usable and everything above 200pp/10tg fast.
Impressive_Chain6039@reddit
40