why do agents still fail in multi-step workflows even when each step works fine?
Posted by weoraage@reddit | LocalLLaMA | 13 comments
testing a few agent setups lately and something keeps bothering me. individually, each step usually works: calling tools, generating outputs, even simple reasoning. but once you chain them into a real workflow, things start breaking in weird ways. it either loses track halfway, doesn’t recover from a small failure, or just stops without finishing the task
it feels like the problem isn’t capability anymore, but consistency across steps. like there’s no real notion of finishing the job, just executing pieces of it. curious if others here have found a setup that actually handles multi-step workflows reliably, esp when something goes wrong mid-way
Yes-Scale-9723@reddit
Agentic workflows require very strong models specifically trained for agentic workflows and tool use.
Otherwise they will start to slowly fall apart.
From my experience: Claude and Deepseek are mandatory. Smaller models will make you lose hours and hours debugging errors while the models I mentioned will work "out of the box" most of the time.
I tried local models (qwen3.5 27 and 35, and gemma4) but they fail to properly use tools and agentic templates.
alphapussycat@reddit
Haven't gotten into tools yet, but are you using the distill with qwen? All the distill does is significantly reduce the thinking, but the output gets significantly worse. So if you're using the distill models, that might be the reason for the failed tool calling.
Yes-Scale-9723@reddit
I'm using qwen3.5 27 and 35, and gemma4. Also tried qwen coder, but hit the same problems.
They are great if you use them as a typical chatbot coding assistant (you ask something, it gives you a script, you try it, and so on).
But the moment you plug them into Cline, Open Coder or Roo Code they will make a lot of errors. By errors I mean making the wrong decisions in the planning phase, forgetting stuff, breaking the code and things like that. And the worst problem is the tool for editing big scripts without having to rewrite all of them. There is a tool that lets LLMs search for a part of the script and then update only that part. It saves a lot of tokens and time, and cuts error rates. The problem is even flagship models will fail to use that tool from time to time.
qwen3.5 27 and 35, gemma4 and qwen coder will fail like 80% of the time, so they need to rewrite the entire file every time.
From my experience even deepseek3.2 (which is a 685B-parameter model) will fail to use that tool occasionally (maybe 2% of the time), and it also makes mistakes now and then.
That said, I love local LLMs and I use them whenever I can but it's just not realistic to expect 32B models to perform like a 685B model.
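To make the failure mode concrete, the search-and-replace edit tool described above boils down to something like this sketch (the function name and error handling are illustrative, not any particular agent's implementation):

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a targeted edit: the model emits only the block to find and its
    replacement, instead of rewriting the whole file. The strict uniqueness
    check is exactly where smaller models trip up: their search block often
    doesn't match the file byte-for-byte."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"search block matched {count} times, expected exactly 1")
    return source.replace(search, replace)

patched = apply_search_replace("a = 1\nb = 2\n", "b = 2\n", "b = 3\n")
```

When the model hallucinates even one character of the search block, the edit fails and the agent falls back to rewriting the whole file, which is where the token and error costs come from.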
alphapussycat@reddit
I tried roo code once, on tiny models, and it could do nothing. But it seemed like it was basically running one long prompt, essentially trying to one-shot the task with breaks in between.
I don't think that's the way to do it.
My plan is to have one model fully understand my criteria and propose a spec sheet, then create a .md file for the next agent. That agent just plans how the work should be done, looking up relevant documentation, and compresses that into the different functions and classes to be written, plus a task list in .md files with requirements and documentation for a coding model to implement.
Reviewed by AI for mistakes, and reviewed by me at almost every step, batched so it doesn't need my attention every few minutes.
That's my plan at least. I haven't found any app that does it yet, so I think I'll make one with langgraph.
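That staged hand-off could be sketched roughly like this (framework-agnostic Python; the agent functions are placeholders for model calls, and the file names are illustrative):

```python
from pathlib import Path

def pipeline(criteria, spec_agent, plan_agent, code_agent, review_agent, workdir):
    # Each stage writes a .md artifact for the next, so no stage needs the
    # full conversation history in its context window.
    workdir = Path(workdir)
    spec = spec_agent(criteria)                     # criteria -> spec sheet
    (workdir / "spec.md").write_text(spec)
    tasks = plan_agent(spec)                        # spec -> task list with docs
    (workdir / "tasks.md").write_text("\n".join(tasks))
    results = [code_agent(task) for task in tasks]  # one isolated context per task
    return review_agent(results)                    # batched review at the end
```

The point of the .md artifacts is that each agent starts from a small, explicit document rather than inheriting a bloated context.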
Yes-Scale-9723@reddit
Cline does a very similar thing. It doesn't one-shot, because it verifies things before flagging activities as completed.
The problem is small models will eventually make mistakes and those errors will stack up until they break everything. And those errors increase rapidly with context. After 100k tokens many models become unable to do meaningful work. Filling that context is surprisingly fast if you have a big codebase.
After months of intensive coding my conclusion is that small models can't handle complex problems. Even flagship Claude model will fail from time to time.
You can try to build your solution but you'll eventually realize the core problem.
alphapussycat@reddit
But the context shouldn't fill up, since each agent just takes one prompt, and then maybe gets asked the same prompt again. The context would always be flushed, and only short-form tasks presented.
The planners would deal with bigger things, but not bloat themselves with code snippets.
DigIndependent7488@reddit
Yeah, this happens a lot once workflows get longer: things look fine step by step but fall apart mid-run. We ran into the same issue and switched to something like Kognitos so each step actually tracks state and doesn't just rely on context. What other alternatives have you tried so far?
Comfortable_Cut6866@reddit
ngl yeah this is exactly where most agents break, not the steps themselves, but the lack of continuity between them. its like they can do tasks, but don’t really own the workflow end-to-end, so once something small goes off, there’s no recovery or sense of finishing
the only thing that felt a bit different for me was treating it less like a toolchain and more like an AI worker with memory across tasks (been trying this Autonomous Intern). it actually keeps shared context and reuses how you’ve done things before, so multi-step flows feel less fragile, closer to delegating to someone who remembers, not restarting a script every time
audioen@reddit
I don't know. I only know that Qwen3.5 seems to have fairly solid multi-step workflow understanding and it's pretty stubborn and doesn't give up easily, so it gets more done than most before I have to tell it to check again. Often, it only finishes when the task is actually finished and everything is built, documented and implementation tested. Every other agent I've been able to try locally before Qwen3.5 has claimed the work is done when it's been like 20% done, or still has compilation errors, or whatever. It feels like these agents just randomly generate a completion message as a plausible ending to an agentic task, and they aren't penalized severely enough during training for claiming a task is complete before it actually is.
Status_Record_1839@reddit
The core issue is what I'd call "state coherence" — agents don't have a reliable mental model of where they are in the workflow. Each step works because the context is fresh; chains fail because accumulated context becomes noisy and the model loses the thread.
A few patterns that help a lot in practice:
**Explicit state objects:** Instead of relying on the LLM to track progress implicitly through conversation history, pass a structured state object at each step: `{task: ..., completed_steps: [...], current_step: ..., remaining: [...]}`. Force the model to read and update it explicitly. This dramatically reduces drift.
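A minimal sketch of that in Python (the field names follow the example above; everything else is illustrative, not from any framework):

```python
import json

def make_state(task, steps):
    # Structured state passed to the model at every step, instead of relying
    # on conversation history to carry progress implicitly.
    return {"task": task, "completed_steps": [],
            "current_step": steps[0], "remaining": list(steps[1:])}

def advance(state, result):
    # The loop (or the model, via a tool call) updates state explicitly, so
    # progress survives even when old messages are trimmed from context.
    state["completed_steps"].append({"step": state["current_step"], "result": result})
    state["current_step"] = state["remaining"].pop(0) if state["remaining"] else None
    return state

state = make_state("refactor auth module", ["read files", "edit", "run tests"])
state = advance(state, "ok")
prompt_header = json.dumps(state)  # prepended to every model call
```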
**Step checkpointing with verification:** After each step, add a verification sub-call: "Did this step complete successfully? What was the output?" Before proceeding, confirm the previous step's output is what you expect. This catches failures early instead of propagating them.
**Shorter, more atomic steps:** The more granular each step, the less the model has to hold in working memory. A 10-step workflow with tiny steps usually beats a 3-step workflow with complex ones.
**Recovery prompts:** Include a failure mode in your agent loop: if a step fails or produces unexpected output, route to a recovery prompt that explicitly re-reads state and decides whether to retry, skip, or abort. Most agent frameworks stop at failure; recovery logic is what separates reliable from unreliable agents.
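The checkpointing and recovery patterns combine into a small driver loop. A sketch, assuming `execute` and `verify` wrap your model/tool calls (both hypothetical stand-ins here), with a deliberately simple abort-on-exhausted-retries policy:

```python
def run_workflow(steps, execute, verify, max_retries=2):
    # Checkpoint after every step: verify the output before moving on, retry
    # on failure, and record everything in a journal the caller can resume from.
    journal = []
    for step in steps:
        for _ in range(max_retries + 1):
            output = execute(step)
            if verify(step, output):                 # checkpoint passed
                journal.append((step, "ok", output))
                break
        else:
            journal.append((step, "failed", None))   # recovery point: a real
            return journal, False                    # loop might skip or re-plan
    return journal, True
```

The `for/else` makes the recovery branch explicit: it only runs when every retry failed, which is the case most agent frameworks silently paper over.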
For local models specifically, context window management is critical — truncating early steps to keep total context under ~50% of the window prevents degradation in later steps.
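One way to enforce that budget, as a sketch (the token counter is a stand-in; a real one would use the model's tokenizer):

```python
def trim_history(messages, budget_tokens, count_tokens):
    # Keep the first message (the task) and the most recent steps; drop the
    # oldest intermediate steps until we're back under budget.
    kept = list(messages)
    while len(kept) > 2 and sum(count_tokens(m) for m in kept) > budget_tokens:
        kept.pop(1)
    return kept

history = ["task: refactor module",
           "step 1 output with lots of old detail",
           "step 2 output",
           "step 3 output"]
trimmed = trim_history(history, budget_tokens=12,
                       count_tokens=lambda m: len(m.split()))
```

This pairs naturally with an explicit state object: the trimmed steps are gone from context, but their results survive in the state.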
DaLyon92x@reddit
the consistency problem is real. what helped me was separating monitoring from execution. use stateless steps but keep a journal of what happened so far outside the agent's context. that way when a step fails you can pick up from the journal instead of replaying everything.
the other thing is agents have no concept of "done." they'll keep going or stop randomly. explicit exit conditions for each step fixed most of my reliability issues. basically treat it like a state machine, not a conversation.
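A sketch of that shape (names illustrative): each step is stateless, reads what it needs from the journal, and has an explicit done-predicate instead of letting the model decide when to stop:

```python
def run_step(name, action, done, journal, max_attempts=3):
    # `action` rebuilds its prompt from the journal (not from chat history);
    # `done` is the explicit exit condition for this step.
    for _ in range(max_attempts):
        result = action(journal)
        if done(result):
            journal.append({"step": name, "result": result})
            return True
    return False  # caller can resume later by replaying the journal

journal = []
ok = run_step("fetch", lambda j: "data", lambda r: r == "data", journal)
```

Because the journal lives outside the model's context, a crashed run restarts from the last recorded step rather than replaying the whole conversation.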
IsThisStillAIIs2@reddit
it’s because each step “working” doesn’t mean the system as a whole has state, goals, or guarantees, you’re basically chaining stateless decisions and hoping coherence emerges. most failures come from missing control flow like explicit state tracking, checkpoints, retries, and validation between steps, not from the model itself. the setups that feel reliable usually look less like open-ended agents and more like workflows with strict structure where the llm fills in specific parts. once you treat it like a distributed system instead of a smart assistant, the failure modes start making a lot more sense.
AurumDaemonHD@reddit
Try r/AIagents, you might have more luck there.