Anthropic just showed how to make AI agents work on long projects without falling apart
Posted by purealgo@reddit | LocalLLaMA | 81 comments
Most AI agents forget everything between sessions, which means they completely lose track of long tasks. Anthropic’s new article shows a surprisingly practical fix. Instead of giving an agent one giant goal like “build a web app,” they wrap it in a simple harness that forces structure, memory, and accountability.
First, an initializer agent sets up the project. It creates a full feature list, marks everything as failing, initializes git, and writes a progress log. Then each later session uses a coding agent that reads the log and git history, picks exactly one unfinished feature, implements it, tests it, commits the changes, and updates the log. No guessing, no drift, no forgetting.
The result is an AI that can stop, restart, and keep improving a project across many independent runs. It behaves more like a disciplined engineer than a clever autocomplete. It also shows that the real unlock for long-running agents may not be smarter models, but better scaffolding.
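The loop described above can be sketched in a few functions. This is a hypothetical reconstruction, not Anthropic's actual harness: the file names (`features.json`, `progress.md`), the feature-status format, and the helper names are my own assumptions.

```python
# Hypothetical sketch of the harness: an initializer sets up state, and each
# later session picks exactly one failing feature, then records the result.
import json
import subprocess
from pathlib import Path

FEATURES = Path("features.json")  # assumed state file: feature -> "failing"/"passing"
LOG = Path("progress.md")         # assumed progress log the agent appends to

def initialize(feature_names):
    """Initializer agent: full feature list (all failing), git repo, progress log."""
    FEATURES.write_text(json.dumps({name: "failing" for name in feature_names}, indent=2))
    LOG.write_text("# Progress log\n")
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "initialize harness"], check=True)

def next_feature():
    """Pick exactly one unfinished feature for this session, or None if done."""
    features = json.loads(FEATURES.read_text())
    return next((name for name, status in features.items() if status == "failing"), None)

def record_pass(name):
    """After tests pass: mark the feature done, append to the log, commit."""
    features = json.loads(FEATURES.read_text())
    features[name] = "passing"
    FEATURES.write_text(json.dumps(features, indent=2))
    with LOG.open("a") as log:
        log.write(f"- implemented and tested: {name}\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"feat: {name}"], check=True)
```

Each session would call `next_feature()`, hand that one feature to the coding agent, and only call `record_pass()` once the tests succeed.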
Read the article here:
https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
crusoe@reddit
Who knew I was cutting edge when I did exactly this a month and a half ago to land a job...
TheLexoPlexx@reddit
I don't get what's new about this either.
1776FreeAmerica@reddit
Seriously isn't this just basic agent design? Task Decomposition is the fundamental key to agents. I was doing this at least 8 months ago.
Mawuena16@reddit
Task decomposition is definitely a key concept, but Anthropic’s approach adds a layer of structured memory and accountability that can make a big difference in maintaining context over multiple sessions. It’s not just about breaking tasks down; it’s about how you keep track of progress and decisions along the way.
pxan@reddit
Forget agents, task decomposition is how software is written. Feels intuitive.
Environmental-Metal9@reddit
And that is why a skilled project manager or a disciplined and experienced software dev are worth their weight in gold. Decomposing tasks is at least 60% of the actual day to day work in software.
AsparagusDirect9@reddit
AI can do this?
Environmental-Metal9@reddit
I mean… my child can do this. Doesn’t mean they are doing it well, or to the expectations of the people spending the money on it. At least for now anyways. AI isn’t very good at understanding people, even if it “sounds” like it is because it confidently asserts things (mostly because humans are famously inconsistent and confusing in our communications)
pier4r@reddit
no. Too many words.
Less words do job.
Less thinking. Efficient.
"build web app, fast, plox".
Agent stupid.
No_Bake6681@reddit
Ok Kevin
_lindt_@reddit
Did this in 2024, I didn't know I was a pioneer
B-lovedWanderer@reddit
Everyone is dunking on this for being "just" good software engineering, but I think that misses the point of the article, which is that this is being accomplished by a long running agent, and it's much more lightweight than existing approaches.
The real insight here isn't the task decomposition itself, but using git history and a simple progress log as the state management layer instead of some complex vector database or RAG pipeline. It effectively treats the commit log as the agent's long-term memory, which is exactly how human devs onboard to a new codebase.
This suggests that the bottleneck for agents isn't context window size, but context curation. I'd be curious to see if a similar harness could make dumber/smaller local models perform strictly better than Claude 4.1 Opus just by forcing this feedback loop.
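The "commit log as long-term memory" idea can be sketched as a small context builder. The prompt wording, the 20-commit budget, and the function names are my own assumptions, offered only to make the pattern concrete.

```python
# Minimal sketch: curate a fresh session's context from git history and a
# progress file, instead of retrieving from a vector database or RAG pipeline.
import subprocess

def recent_commits(max_commits=20):
    """One-line summaries of the latest commits, newest first."""
    return subprocess.run(
        ["git", "log", f"--max-count={max_commits}", "--oneline"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

def build_session_context(commits, progress_log):
    """Assemble the small, curated prompt a fresh session starts from."""
    return (
        "You are resuming work on this project.\n\n"
        "Recent commits:\n" + "\n".join(commits) + "\n\n"
        "Progress log:\n" + progress_log + "\n\n"
        "Pick exactly one unfinished item, implement it, test it, commit."
    )
```

The point is context curation: a fresh session gets a few kilobytes of high-signal state rather than the whole history.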
ascendant23@reddit
Right, it’s not like the general idea is revolutionary, but it’s always great to see really refined and concise iterations on this that have been thoroughly tested by prominent teams
Limp-Huckleberry8008@reddit
This blog describes exactly what RooCode and RooRoo mode do. I have been using it for ages. RooCode you should know; RooRoo is a mode you load on the plugin in the workspace. You then configure the models per persona and start with the Navigator.
findingmike@reddit
Yep, we did this 2 years ago with various AI projects. I haven't seen any foundational improvement in LLMs since around then.
Cute_Obligation2944@reddit
We have only begun to see the real effects of true ingenuity in this area. AI was incubating in academia for over 40 years and now everyone in the world has access to it. The sheer scale of good ideas being introduced is only drowned out by the quantity of hype, which will eventually collapse, and folks like you will look upon the flaming heap and say "well... duh."
crusoe@reddit
I literally just told it to build a high-level design spec, then break out the implementation into stages, then create a low-level design for each stage, then implement the design.
thetaFAANG@reddit
It’s really just product managers that are cooked
If anyone gets replaced, it’s them
We have our agents turn those same requirements into tickets, and the commit message or pull request manages the status of the ticket
SkyFeistyLlama8@reddit
This is what you can already do with smaller local LLMs as long as you're the orchestrator, also known as software engineer, who's making sure everything is up to spec.
I'm more interested in the ability to stop and resume complete LLM state. If you're working with multiple features, there should be a way to roll back any LLM-generated screwups or to force it to choose another option.
count023@reddit
see this is the bit I haven't been able to do yet. I run as orchestrator, but I vet everything. I haven't gotten used to the idea of letting AI sub-agents run around doing whatever they want, because I don't trust the raw vibe code to come back good. If I do it as an orchestrator but validate as it goes, while it's slower, I can get proper code out instead.
I guess I need to get used to the idea of letting go of that control.
SirRece@reddit
This exists in claude
geenob@reddit
It also exists in the beta copilot
Zeikos@reddit
You can also automate a lot of the checking by having the agent adhere to strict standards and do relevant testing and cross-checking.
Honestly I genuinely think that local models are better suited for that since it'd have prohibitive costs with APIs.
Do the APIs even allow looking at the embedding decoding distribution and acting on those?
It just occurred to me that I never checked, I just assumed it'd be outrageously expensive.
ThirdMover@reddit
Usually when the short cliff notes summary of a new development sounds totally obvious that just means the cliff notes summary leaves out the important bit.
waiting_for_zban@reddit
There have been plenty of such projects like taskmaster that do exactly that, so I guess we all need to wait for an Anthropic blog to amplify it.
gscjj@reddit
Just out of curiosity, what did you use to create your agents? Like you, I attempted this months ago, but I was writing it in Go layered over a Kubernetes-type operator. It became super cumbersome; I'm thinking of reattempting, just a little simpler this time.
InterstellarReddit@reddit
Apply to Anthropic right now, they're paying like 1.5 million to engineers. That includes bonus and stock options, so base salary has to be at least 800k.
Spaduf@reddit
Those jobs are going to the CEOs friends silly.
zhambe@reddit
I mean, yeah -- this is how an experienced person would approach any larger project.
kaeptnphlop@reddit
I've been doing that using GitHub Copilot Custom Agents (Chatmode) for months now. You can give quite detailed instructions and workflows, and Claude 4.0 Sonnet is very adept at working through tasks. All based on a session doc the bot updates that lays out what needs to be done, with phases and tasks to complete. In between tasks (esp. if they're big) I just open a new chat, reference the session file and say "continue" (basically). It reviews previous work that it has done, finds the next task, implements and tests it, writes a git commit and then hands everything back to the user.
Actually pretty cool that this works that well!
SilentKnightOwl@reddit
Isn't this just how Claude Code has always worked?
LoaderD@reddit
Steps to making a LocalLLaMA thread:
1. Read an article about common practice, but it’s from a for-profit lab
2. Jizz all over yourself
3. Clean up
4. Get an LLM to read the article and create a summary
5. Post to /r/localllama
blackkettle@reddit
This is how we’ve been building customer service agents for ages - it’s nothing new.
Analytics-Maken@reddit
Agree, it's just simple ways to manage context. I took a similar approach using MCP servers for analytics: I fed business data into Claude via the Windsor ai MCP, and the code quality jumped while development time dropped.
megadonkeyx@reddit
Letting agents process on their own for hours will just lead to massive architectural issues as they make decisions you would never accept.
There's just no getting around the need to hand-hold even the best AI model every step of the way.
aeroumbria@reddit
So, basically:
big list of goals & requirements = fail
small, precise context > big, comprehensive context
Coldaine@reddit
This is just how you scope projects.... In general?
SickPixels257@reddit
I plan with the super smart models, then if the plan is detailed enough and phased, I can use a mega cheap Chinese model to do the edits, tools, all the agentic stuff. If the small model gets confused or stuck, give it a tool to call a big one or even an ensemble of them.
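The plan-big, execute-cheap pattern with an escalation tool can be sketched like this. The model callables and the "confusion marker" heuristic are stand-ins, not any real provider's API.

```python
# Hedged sketch of tiered routing: a cheap model does the work, and the
# same prompt is escalated to a bigger model if the reply looks stuck.
CONFUSION_MARKERS = ("i'm not sure", "i am stuck", "cannot determine")

def run_step(prompt, cheap_model, big_model):
    """Try the cheap model first; escalate to the big model if the
    reply looks confused or stuck."""
    reply = cheap_model(prompt)
    if any(marker in reply.lower() for marker in CONFUSION_MARKERS):
        reply = big_model(prompt)  # the escalation "tool call"
    return reply
```

In practice the escalation check could also be a dedicated tool the small model calls itself, as the comment suggests, rather than a keyword heuristic.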
LouB0O@reddit
Can someone explain this to me like I'm 5? I just woke up. I don't use AI for coding, more so business intelligence, so formulas and some SQL. How would I apply this to my work?
Will love you forever if you do.
sleeping-in-crypto@reddit
Explain to Claude you want to build something. Tell it what and outline requirements. Tell it to generate a spec document and a tasks document and optionally an architecture and testing document to record technical decisions if you’re coding.
Tell it to create a summary document of the above that it can use to pick up where it left off between sessions and ensure it records everything another LLM will need to pick up the work.
Start a new session. Tell it to consume the summary and associated documents and to deeply understand them and then you’ll continue work.
Tell it to complete some section of the tasks and update the tasks document to mark them as complete. Optionally do a validation pass to ask it to check the work before marking as complete. You can even start a new session for the validation step, repeat the above, and ask the new session to mark as complete instead of the original session doing it. This ensures you get some redundancy and a separate context to validate.
I’ve been using a workflow similar to this for months and it works out well. It also generates documentation you can keep around if needed.
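The multi-session workflow above, including the separate validation session, can be sketched as follows. Here `new_session` stands in for however you open a fresh LLM chat, and the `send()` interface is an assumption for illustration.

```python
# Sketch of the spec-doc workflow: each pass runs in a fresh context that
# reads the summary document, and validation happens in a separate context.
def work_pass(new_session, summary_doc, tasks):
    """Fresh context reads the summary, then completes one todo task."""
    chat = new_session()
    chat.send("Read this summary, then continue the work:\n" + summary_doc)
    task = next(t for t, status in tasks.items() if status == "todo")
    chat.send("Complete this task and update the task doc: " + task)
    tasks[task] = "needs-validation"
    return task

def validate_pass(new_session, summary_doc, tasks, task):
    """A separate fresh context checks the work before marking complete,
    giving the redundancy the comment describes."""
    chat = new_session()
    verdict = chat.send(
        "Read this summary:\n" + summary_doc +
        "\nCheck the work for task '" + task + "'. Reply PASS or FAIL."
    )
    tasks[task] = "done" if "PASS" in verdict else "todo"
```

The key design choice is that the validator never shares context with the worker, so it cannot inherit the worker's mistaken assumptions.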
LouB0O@reddit
Thank you so very much
ouroborus777@reddit
So... The solution to "breaks on long runs" is "use short runs"?
Antique-Ad1012@reddit
You can already do this behind the scenes, but it doesn't solve the fundamental problem, or even any problem, in most large enterprise projects, where even the description of all the components will saturate the context window.
Other than that it's nice that we don't have to do this ourselves anymore.
I would love to see how it handles syncs with the main branch; it can't actually know what changed unless it ingests all the changes.
MatlowAI@reddit
Reading that Anthropic post, I assumed anyone using Claude for code was already doing a janky version of this.
My setups usually end up with a ridiculous number of markdown files during planning.. indexes, “lessons learned” buckets for bugs it’s seen once and will absolutely see again, etc. Even with all that structure, it still tries to drift or “cheat” to pass tests sometimes, so half the time I’m just watching for bad behavior and doing visual debug out of habit from the token scarcity days when cc got nerfed.
I’ve been adding OTEL on the frontend as well as the backend lately, and that’s actually helped a lot with efficiency and spotting where it goes off the rails. But it (even opus 4.5) still “forgets” about little things like uv no matter how many times I spell it out in the instructions as context grows.
Not letting it use /compact and forcing it to write retrospectives + immediate plans to MD files once context starts ballooning has helped. You can be more deliberate, and Opus 4.5 does noticeably better at staying on task when the context swells past ~120k up into the 180k range than most models I've tried, though it still hiccups.
I've also experimented with a small "supervisor" agent whose only job is to run checks.
That runs on a narrower context, but even that does weird stuff I still need to tune. And by the time it's dialed in, a new model drops and changes the behavior anyway, it feels like 😅
Honestly, it feels like if we captured all the steering logic people already do for a given stack/framework/language, you could fine-tune a smaller LLM into a dedicated steering agent for that niche. Then you’d let the big model focus on the heavy lifting and need way less human intervention.
megawhop@reddit
Check out the vibe-check, Cipher, and Conport MCPs. Helps on these pain points.
MatlowAI@reddit
Thanks, they look good. I will say one upside of all the md files is there's a really good log of what went well and poorly in git. It looks like there's export-to-md functionality buried in Conport; I'll have to give it a try on something mid-sized and see how it does.
GFrings@reddit
For the love of all that is holy, please nobody tell the PMs that the answer to AGI was SCRUM all along.
brunoha@reddit
hey, if that mindset stays stuck at the AI, I'm fine, just don't start shoving it down to humans too.
cms2307@reddit
I think that no matter how many ways you combine multiple agents and tools, you won't get around fundamental intelligence issues with the models. If we have a coding agent and a planning agent, or some other combination of agents, assuming they're the same model, I don't see why that unlocks any special capabilities vs. just making better use of your context. And multi-agent systems are hardly useful for local models when most people can hardly run just one at a time. I see another variation of multi-agent frameworks just about every day on here, but I've never seen anyone actually use them in a useful way in production. I would love to be proven wrong though.
Zeikos@reddit
I think people overestimate how much intelligence you actually need.
If I were to count how many times I "hallucinate" in a given day I'd be appalled; the average LLM hallucinates less than my stupid brain.
What makes the difference is that I immediately catch myself in the hallucination, call myself an idiot, and course-correct.
Western_Objective209@reddit
I think it boils down to if you have a fleet of agents working independently, what they think they are working on and what you actually want them to work on can be pretty different, and over time their goals do drift.
Opus 4.5 seems to be the best so far at picking up subtleties in what I'm looking for, but still, if I'm not asking it to confirm its assumptions and asking for clarifications, it will just crank out something that kind of looks like what I want and burn $25 doing it.
Zeikos@reddit
I think anybody who has worked more than a year for a consultancy firm knows what happens if you don't hound the client for all the information they just take for granted.
There are firms that are just yes-men and they produce code that's hardly distinguishable from vibe coded garbage.
The main distinguishing factor often is that it's less readable than what the average LLM produces.
Western_Objective209@reddit
Yep, I've joined projects for large consultancies in the back half, trying to get the projects back on track. It's always a mess; vibe-coded garbage is generally higher quality than what they were pumping out.
Boomfrag@reddit
I think the difference between a human's hallucination and an LLM's is that people are much worse at hiding it. But your point is well taken, even the best programmers make plenty of mistakes.
krileon@reddit
The difference is a human can learn from its mistakes. An LLM cannot. It can, will, and more often than not repeats the same mistakes over and over no matter how much context you cram down its throat. A Junior can be taught and turned into a Senior. An LLM is perpetually an employee with dementia.
no_witty_username@reddit
Using multiple agents solves a lot of problems, like context management and engineering. Because the other agent has its own context, it doesn't pollute the human-facing agent's context. Thus the human-facing agent is able to perform indefinitely without degradation, which also reduces hallucinations and errors.
cms2307@reddit
You don’t need multiple agents for that though, just look at ChatGPT/GPT-OSS with the harmony chat format and the multiple response channels. My point is that having the model act a certain way, then changing the context and prompt to make it act a different way, doesn’t make the model any smarter, and you can achieve the same result with efficient context management with one instance of the model acting just one way. A simple example that people use here is not including thinking tokens in the next prompt but including the answer tokens.
And for the case of having two instances of the model running, I don’t think that would increase intelligence either, it just increases your speed because you have two models, even if it seems like it’s making them more intelligent.
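The context-management trick mentioned above (carry answer tokens forward, drop thinking tokens) can be sketched like this. The message shape, dicts with an optional `"thinking"` field, is an assumption for illustration, not any particular chat format.

```python
# Sketch: keep each assistant turn's answer in the rolling history, but
# strip reasoning segments before they go back into the next prompt.
def prune_history(messages):
    """Copy the chat history with reasoning segments removed, so the next
    prompt only pays for the answers, not the thinking tokens."""
    pruned = []
    for msg in messages:
        msg = dict(msg)            # shallow copy; don't mutate the caller's history
        msg.pop("thinking", None)  # drop reasoning before re-prompting
        pruned.append(msg)
    return pruned
```

This is the single-model version of the comment's point: you get most of the context-isolation benefit without running a second agent.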
phido3000@reddit
> And multi agent systems are hardly useful for local models when most people can hardly run just one at a time.
Looks at my local cluster... Hmm... Won't be a problem. I have R1 70b (one server) being the creator and Qwen 35b (on a different server) being the arbiter. And I still have a dedicated server for tools and image generation.
It's not production-ready, it's in a development state, but it's a reasonable possibility now. However, the workflow is significantly different from what people are used to, with just a single AI thinking it out and a human slowly watching generated text output.
I think that is the best advantage of local LLM. You can build something pretty unique. Something that can be adaptable.
In periods when I am sleeping, it can go and review all its output, test the two models against each other, and create a ranking table for the different types of tasks one is known to be better at. It can try to formulate a better solution, develop prompts to enhance output, and change which model is the creator and which is the checker. Once you have invested in the hardware, it's really just electricity costs. It doesn't have to do this all the time, but when you aren't getting good enough results, it can. It doesn't have to be stateless.
Something most subscription models won't do: they don't want one user tying up an entire cluster just so it can produce better output. But why have a single mind in your garage when you can have multiple working tirelessly to improve output?
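The overnight creator/arbiter swap described above can be sketched as a small ranking loop. The models and the judge are stand-in callables, and the scoring is illustrative only.

```python
# Sketch: run both local models on a task set, let a judge pick winners,
# and use the tally to decide which model is creator vs. checker next run.
def rank_models(tasks, model_a, model_b, judge):
    """Return win counts per model over the task set."""
    wins = {"a": 0, "b": 0}
    for task in tasks:
        out_a, out_b = model_a(task), model_b(task)
        wins[judge(task, out_a, out_b)] += 1  # judge returns "a" or "b"
    return wins

def assign_roles(wins):
    """The model with more wins becomes the creator; the other, the checker."""
    creator = "a" if wins["a"] >= wins["b"] else "b"
    return {"creator": creator, "checker": "b" if creator == "a" else "a"}
```

The judge could be the arbiter model itself, or a test suite; per-task-type tables just mean keeping one `wins` dict per category.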
kripper-de@reddit
Multi-agentic doesn't necessarily mean combining different LLMs. You can use the same LLM for all agent instances. It works because each agent is given a different role or context. Think of it like tackling a problem from multiple perspectives: as a coder, a software architect, a business owner, and so on. It won’t solve all intelligence limitations, of course, but it can lead to better results.
awitod@reddit
The thing I most dislike about Anthropic is how often they ‘solve’ problems and claim they are first and pretend there was no solution before their ‘invention’.
They lead people down the wrong path who don’t know any better.
MitsotakiShogun@reddit
Is that different from what most coding helpers (e.g. Cline/Roo) already did? How?
Spaduf@reddit
No. This post is an ad you see.
claythearc@reddit
I think almost everyone in the space has a janky version of this. The new knowledge contributions seem to be, first, some nuggets about the common failure cases they saw, to pre-seed your mind with things to look out for, and second, a first-party solution in their SDK.
MitsotakiShogun@reddit
Thanks!
Jimmy2Bags@reddit
Claude code superpowers. https://github.com/obra/superpowers
I invoke it for even simple projects now.
KrypXern@reddit
Augment and Juni both do something like this, I think it's fairly common
genobobeno_va@reddit
This is just Agile and Project Management. Who’d a thunk?
GatePorters@reddit
??
Maybe soon they will figure out how to get the AI to call tools. I wonder what will happen if we mix language and image models into some kind of model with multiple modalities. . . They could call it multimodacular models.
I am intellinget
xChooChooKazam@reddit
Spec driven development is the easiest way to increase the quality output of Claude. AgentOS has some fantastic out of the box agents for this if you haven’t tried it: https://github.com/buildermethods/agent-os
CuriouslyCultured@reddit
Imagine following basic SDLC practices with agents and calling it research.
if47@reddit
yet another Anthropic blog shit
IrisColt@reddit
We are not impressed.
Quiark@reddit
So like using the beads tool which was made for this
selund1@reddit
Work stealing agents? Are we taking old concepts of managing work and tasks and reapplying them to call it innovation or am I missing something here?
lombwolf@reddit
Who would've thought that giving an agent structure would result in it picking back up previous tasks w/o hassle! Mind Blown! /s
Noiselexer@reddit
So a task list?
molbal@reddit
Tldr: use sub-agents to manage context
Aroochacha@reddit
Maximin-M2 does this with CLine working locally
sstainsby@reddit
I already have scaffolding that breaks projects down into components and capabilities (includes features), and tasks and uses templates.
tindalos@reddit
Funny, Anthropic is one of us apparently.
edgyversion@reddit
Was it really them or Chinese hackers who found this capability