Anthropic just showed how to make AI agents work on long projects without falling apart
Posted by purealgo@reddit | LocalLLaMA | 81 comments
Most AI agents forget everything between sessions, which means they completely lose track of long tasks. Anthropic’s new article shows a surprisingly practical fix. Instead of giving an agent one giant goal like “build a web app,” they wrap it in a simple harness that forces structure, memory, and accountability.
First, an initializer agent sets up the project. It creates a full feature list, marks everything as failing, initializes git, and writes a progress log. Then each later session uses a coding agent that reads the log and git history, picks exactly one unfinished feature, implements it, tests it, commits the changes, and updates the log. No guessing, no drift, no forgetting.
The result is an AI that can stop, restart, and keep improving a project across many independent runs. It behaves more like a disciplined engineer than a clever autocomplete. It also shows that the real unlock for long-running agents may not be smarter models, but better scaffolding.
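The loop described above can be sketched in a few functions. This is a hypothetical reconstruction, not Anthropic's actual harness: the file names (`features.json`, `progress.md`), the feature-status format, and the helper names are my own assumptions.

```python
# Hypothetical sketch of the harness: an initializer sets up state, and each
# later session picks exactly one failing feature, then records the result.
import json
import subprocess
from pathlib import Path

FEATURES = Path("features.json")  # assumed state file: feature -> "failing"/"passing"
LOG = Path("progress.md")         # assumed progress log the agent appends to

def initialize(feature_names):
    """Initializer agent: full feature list (all failing), git repo, progress log."""
    FEATURES.write_text(json.dumps({name: "failing" for name in feature_names}, indent=2))
    LOG.write_text("# Progress log\n")
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", "initialize harness"], check=True)

def next_feature():
    """Pick exactly one unfinished feature for this session, or None if done."""
    features = json.loads(FEATURES.read_text())
    return next((name for name, status in features.items() if status == "failing"), None)

def record_pass(name):
    """After tests pass: mark the feature done, append to the log, commit."""
    features = json.loads(FEATURES.read_text())
    features[name] = "passing"
    FEATURES.write_text(json.dumps(features, indent=2))
    with LOG.open("a") as log:
        log.write(f"- implemented and tested: {name}\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"feat: {name}"], check=True)
```

Each session would call `next_feature()`, hand that one feature to the coding agent, and only call `record_pass()` once the tests succeed.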
Read the article here:
https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
crusoe@reddit
Who knew I was cutting edge when I did exactly this a month and a half ago to land a job...
TheLexoPlexx@reddit
I don't get what's new about this either.
1776FreeAmerica@reddit
Seriously isn't this just basic agent design? Task Decomposition is the fundamental key to agents. I was doing this at least 8 months ago.
Mawuena16@reddit
Task decomposition is definitely a key concept, but Anthropic’s approach adds a layer of structured memory and accountability that can make a big difference in maintaining context over multiple sessions. It’s not just about breaking tasks down; it’s about how you keep track of progress and decisions along the way.
pxan@reddit
Forget agents, task decomposition is how software is written. Feels intuitive.
Environmental-Metal9@reddit
And that is why a skilled project manager or a disciplined and experienced software dev are worth their weight in gold. Decomposing tasks is at least 60% of the actual day to day work in software.
AsparagusDirect9@reddit
AI can do this?
Environmental-Metal9@reddit
I mean… my child can do this. Doesn’t mean they are doing it well, or to the expectations of the people spending the money on it. At least for now anyways. AI isn’t very good at understanding people, even if it “sounds” like it is because it confidently asserts things (mostly because humans are famously inconsistent and confusing in our communications)
pier4r@reddit
no. Too many words.
Less words do job.
Less thinking. Efficient.
"build web app, fast, plox".
Agent stupid.
No_Bake6681@reddit
Ok Kevin
_lindt_@reddit
Did this in 2024, I didn't know I was a pioneer
B-lovedWanderer@reddit
Everyone is dunking on this for being "just" good software engineering, but I think that misses the point of the article, which is that this is being accomplished by a long running agent, and it's much more lightweight than existing approaches.
The real insight here isn't the task decomposition itself, but using git history and a simple progress log as the state management layer instead of some complex vector database or RAG pipeline. It effectively treats the commit log as the agent's long-term memory, which is exactly how human devs onboard to a new codebase.
This suggests that the bottleneck for agents isn't context window size, but context curation. I'd be curious to see if a similar harness could make dumber/smaller local models perform strictly better than Claude 4.1 Opus just by forcing this feedback loop.
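The "commit log as long-term memory" idea can be sketched as a small context builder. The prompt wording, the 20-commit budget, and the function names are my own assumptions, offered only to make the pattern concrete.

```python
# Minimal sketch: curate a fresh session's context from git history and a
# progress file, instead of retrieving from a vector database or RAG pipeline.
import subprocess

def recent_commits(max_commits=20):
    """One-line summaries of the latest commits, newest first."""
    return subprocess.run(
        ["git", "log", f"--max-count={max_commits}", "--oneline"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

def build_session_context(commits, progress_log):
    """Assemble the small, curated prompt a fresh session starts from."""
    return (
        "You are resuming work on this project.\n\n"
        "Recent commits:\n" + "\n".join(commits) + "\n\n"
        "Progress log:\n" + progress_log + "\n\n"
        "Pick exactly one unfinished item, implement it, test it, commit."
    )
```

The point is context curation: a fresh session gets a few kilobytes of high-signal state rather than the whole history.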
ascendant23@reddit
Right, it’s not like the general idea is revolutionary, but it’s always great to see really refined and concise iterations on this that have been thoroughly tested by prominent teams
Limp-Huckleberry8008@reddit
This blog describes exactly what RooCode and RooRoo mode do. I have been using it for ages. RooCode you should know; RooRoo is a mode you load on the plugin in the workspace. You then configure the models per persona and start with the Navigator.
findingmike@reddit
Yep, we did this 2 years ago with various AI projects. I haven't seen any foundational improvement in LLMs since around then.
Cute_Obligation2944@reddit
We have only begun to see the real effects of true ingenuity in this area. AI was incubating in academia for over 40 years and now everyone in the world has access to it. The sheer scale of good ideas being introduced is only drowned out by the quantity of hype, which will eventually collapse, and folks like you will look upon the flaming heap and say "well... duh."
crusoe@reddit
I literally just told it to build a high-level design spec, then break out the implementation into stages, then create a low-level design for each stage, then implement the design.
thetaFAANG@reddit
It’s really just product managers that are cooked
If anyone gets replaced, it’s them
We have our agents turn those same requirements into tickets, and the commit message or pull request manages the status of the ticket
SkyFeistyLlama8@reddit
This is what you can already do with smaller local LLMs as long as you're the orchestrator, also known as software engineer, who's making sure everything is up to spec.
I'm more interested in the ability to stop and resume complete LLM state. If you're working with multiple features, there should be a way to roll back any LLM-generated screwups or to force it to choose another option.
count023@reddit
see this is the bit I haven't been able to do yet. I run as orchestrator, but I vet everything. I haven't gotten used to the idea of letting AI sub-agents run around doing whatever they want, because I don't trust the raw vibe code to come back good. If I do it as an orchestrator but validate as it goes, while it's slower, I can get proper code out instead.
I guess I need to get used to the idea of letting go of that control.
SirRece@reddit
This exists in claude
geenob@reddit
It also exists in the beta copilot
Zeikos@reddit
You can also automate a lot of the checking by having the agent adhere to strict standards and do relevant testing and cross-checking.
Honestly I genuinely think that local models are better suited for that since it'd have prohibitive costs with APIs.
Do the APIs even allow looking at the embedding decoding distribution and acting on those?
It just occurred to me that I never checked, I just assumed it'd be outrageously expensive.
ThirdMover@reddit
Usually when the short cliff notes summary of a new development sounds totally obvious that just means the cliff notes summary leaves out the important bit.
waiting_for_zban@reddit
There have been plenty of such projects like taskmaster that do exactly that, so I guess we all need to wait for an Anthropic blog to amplify it.
gscjj@reddit
Just out of curiosity, what did you use to create your agents? Like you, I attempted this months ago, but I was writing it in Go layered over a Kubernetes-type operator. It became super cumbersome; I'm thinking of reattempting, just a little simpler this time.
InterstellarReddit@reddit
Apply to Anthropic right now, they're paying like 1.5 million to engineers. That includes bonus and stock options, so base salary has to be at least 800k.
Spaduf@reddit
Those jobs are going to the CEOs friends silly.
zhambe@reddit
I mean, yeah -- this is how an experienced person would approach any larger project.
kaeptnphlop@reddit
I've been doing that using GitHub Copilot Custom Agents (Chatmode) for months now. You can give quite detailed instructions and workflows, and Claude 4.0 Sonnet is very adept at working through tasks. All based on a session doc the bot updates that lays out what needs to be done, with phases and tasks to complete. In between tasks (esp. if they're big) I just open a new chat, reference the session file and say "continue" (basically). It reviews previous work that it has done, finds the next task, implements and tests it, writes a git commit and then hands everything back to the user.
Actually pretty cool that this works that well!
SilentKnightOwl@reddit
Isn't this just how Claude Code has always worked?
LoaderD@reddit
Steps to making a LocalLLaMA thread:
1. Read an article about common practice, but it’s from a for-profit lab
2. Jizz all over yourself
3. Clean up
4. Get an LLM to read the article and create a summary
5. Post to /r/localllama
blackkettle@reddit
This is how we’ve been building customer service agents for ages - it’s nothing new.
Analytics-Maken@reddit
Agree, it's just simple ways to manage context. I took a similar approach using MCP servers for analytics: I fed business data into Claude via the Windsor ai MCP, and the code quality jumped while development time dropped.
megadonkeyx@reddit
Letting agents process on their own for hours will just lead to massive architectural issues as they make decisions you would never accept.
There's just no getting around the need to hand-hold even the best AI model every step of the way.
aeroumbria@reddit
So, basically:
big list of goals & requirements = fail
small, precise context > big, comprehensive context
Coldaine@reddit
This is just how you scope projects.... In general?
SickPixels257@reddit
I plan with the super smart models, then if the plan is detailed enough and phased, I can use a mega cheap Chinese model to do the edits, tools, all the agentic stuff. If the small model gets confused or stuck, give it a tool to call a big one or even an ensemble of them.
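The plan-big, execute-cheap pattern with an escalation tool can be sketched like this. The model callables and the "confusion marker" heuristic are stand-ins, not any real provider's API.

```python
# Hedged sketch of tiered routing: a cheap model does the work, and the
# same prompt is escalated to a bigger model if the reply looks stuck.
CONFUSION_MARKERS = ("i'm not sure", "i am stuck", "cannot determine")

def run_step(prompt, cheap_model, big_model):
    """Try the cheap model first; escalate to the big model if the
    reply looks confused or stuck."""
    reply = cheap_model(prompt)
    if any(marker in reply.lower() for marker in CONFUSION_MARKERS):
        reply = big_model(prompt)  # the escalation "tool call"
    return reply
```

In practice the escalation check could also be a dedicated tool the small model calls itself, as the comment suggests, rather than a keyword heuristic.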
LouB0O@reddit
Can someone explain this to me like I'm 5? I just woke up. I don't use AI for coding, more so business intelligence, so formulas and some SQL. How would I apply this to my work?
Will love you forever if you do.
sleeping-in-crypto@reddit
Explain to Claude you want to build something. Tell it what and outline requirements. Tell it to generate a spec document and a tasks document and optionally an architecture and testing document to record technical decisions if you’re coding.
Tell it to create a summary document of the above that it can use to pick up where it left off between sessions and ensure it records everything another LLM will need to pick up the work.
Start a new session. Tell it to consume the summary and associated documents and to deeply understand them and then you’ll continue work.
Tell it to complete some section of the tasks and update the tasks document to mark them as complete. Optionally do a validation pass to ask it to check the work before marking as complete. You can even start a new session for the validation step, repeat the above, and ask the new session to mark as complete instead of the original session doing it. This ensures you get some redundancy and a separate context to validate.
I’ve been using a workflow similar to this for months and it works out well. It also generates documentation you can keep around if needed.
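The multi-session workflow above, including the separate validation session, can be sketched as follows. Here `new_session` stands in for however you open a fresh LLM chat, and the `send()` interface is an assumption for illustration.

```python
# Sketch of the spec-doc workflow: each pass runs in a fresh context that
# reads the summary document, and validation happens in a separate context.
def work_pass(new_session, summary_doc, tasks):
    """Fresh context reads the summary, then completes one todo task."""
    chat = new_session()
    chat.send("Read this summary, then continue the work:\n" + summary_doc)
    task = next(t for t, status in tasks.items() if status == "todo")
    chat.send("Complete this task and update the task doc: " + task)
    tasks[task] = "needs-validation"
    return task

def validate_pass(new_session, summary_doc, tasks, task):
    """A separate fresh context checks the work before marking complete,
    giving the redundancy the comment describes."""
    chat = new_session()
    verdict = chat.send(
        "Read this summary:\n" + summary_doc +
        "\nCheck the work for task '" + task + "'. Reply PASS or FAIL."
    )
    tasks[task] = "done" if "PASS" in verdict else "todo"
```

The key design choice is that the validator never shares context with the worker, so it cannot inherit the worker's mistaken assumptions.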
LouB0O@reddit
Thank you so very much
ouroborus777@reddit
So... The solution to "breaks on long runs" is "use short runs"?
Antique-Ad1012@reddit
You can already do this behind the scenes, but it doesn't solve the fundamental problem, or even any problem, in most large enterprise projects, where even the description of all the components will saturate the context window.
Other than that it's nice that we don't have to do this ourselves anymore.
I would love to see how it handles syncs with the main branch; it can't actually know what changed unless it ingests all the changes.
MatlowAI@reddit
Reading that Anthropic post, I assumed anyone using Claude for code was already doing a janky version of this.
My setups usually end up with a ridiculous number of markdown files during planning.. indexes, “lessons learned” buckets for bugs it’s seen once and will absolutely see again, etc. Even with all that structure, it still tries to drift or “cheat” to pass tests sometimes, so half the time I’m just watching for bad behavior and doing visual debug out of habit from the token scarcity days when cc got nerfed.
I’ve been adding OTEL on the frontend as well as the backend lately, and that’s actually helped a lot with efficiency and spotting where it goes off the rails. But it (even opus 4.5) still “forgets” about little things like uv no matter how many times I spell it out in the instructions as context grows.
Not letting it use /compact and forcing it to write retrospectives + immediate plans to MD files once context starts ballooning has helped. You can be more deliberate, and Opus 4.5 does noticeably better at staying on task when the context swells past ~120k up into the 180k range than most models I've tried, though it still hiccups.
I've also experimented with a small "supervisor" agent whose only job is to run checks.
That runs on a narrower context, but even that does weird stuff I still need to tune. And by the time it's dialed in, a new model drops and changes the behavior anyway, it feels like 😅
Honestly, it feels like if we captured all the steering logic people already do for a given stack/framework/language, you could fine-tune a smaller LLM into a dedicated steering agent for that niche. Then you’d let the big model focus on the heavy lifting and need way less human intervention.
megawhop@reddit
Check out the vibe-check, Cipher, and Conport MCPs. Helps on these pain points.
MatlowAI@reddit
Thanks, they look good. I will say one upside of all the md files is there's a really good log of what went well and poorly in git. It looks like there's export-to-md functionality buried in Conport; I'll have to give it a try on something mid-sized and see how it does.
GFrings@reddit
For the love of all that is holy, please nobody tell the PMs that the answer to AGI was SCRUM all along.
brunoha@reddit
hey, if that mindset stays stuck at the AI, I'm fine, just don't start shoving it down to humans too.
cms2307@reddit
I think that no matter how many ways you combine multiple agents and tools, you won't get around fundamental intelligence issues with the models. If we have a coding agent and a planning agent, or some other combination of agents, assuming they're the same model, I don't see why that unlocks any special capabilities vs. just making better use of your context. And multi-agent systems are hardly useful for local models when most people can hardly run just one at a time. I see another variation of multi-agent frameworks just about every day on here, but I've never seen anyone actually use them in a useful way in production. I would love to be proven wrong though.
Zeikos@reddit
I think people overestimate how much intelligence you actually need.
If I were to count how many times I "hallucinate" in a given day I'd be appalled; the average LLM hallucinates less than my stupid brain.
What makes the difference is that I immediately catch myself in the hallucination, call myself an idiot, and course-correct.
Western_Objective209@reddit
I think it boils down to if you have a fleet of agents working independently, what they think they are working on and what you actually want them to work on can be pretty different, and over time their goals do drift.
Opus 4.5 seems to be the best so far at picking up subtleties in what I'm looking for, but still, if I'm not asking it to confirm its assumptions and asking for clarifications, it will just crank out something that kind of looks like what I want and burn $25 doing it.
Zeikos@reddit
I think anybody who has worked more than a year for a consultancy firm knows what happens if you don't hound the client for all the information they just take for granted.
There are firms that are just yes-men and they produce code that's hardly distinguishable from vibe coded garbage.
The main distinguishing factor often is that it's less readable than what the average LLM produces.
Western_Objective209@reddit
Yep, I've joined projects for large consultancies in the back half, trying to get the projects back on track. It's always a mess; vibe-coded garbage is generally higher quality than what they were pumping out.
Boomfrag@reddit
I think the difference between a human's hallucination and an LLM's is that people are much worse at hiding it. But your point is well taken, even the best programmers make plenty of mistakes.
krileon@reddit
The difference is a human can learn from its mistakes. An LLM cannot. It can, will, and more often than not repeats the same mistakes over and over no matter how much context you cram down its throat. A Junior can be taught and turned into a Senior. An LLM is perpetually an employee with dementia.
no_witty_username@reddit
Using multiple agents solves a lot of problems, like context management and engineering. Because the other agent has its own context, it doesn't pollute the human-facing agent's context. Thus the human-facing agent is able to perform indefinitely without degradation, which also reduces hallucinations and errors.
cms2307@reddit
You don’t need multiple agents for that though, just look at ChatGPT/GPT-OSS with the harmony chat format and the multiple response channels. My point is that having the model act a certain way, then changing the context and prompt to make it act a different way, doesn’t make the model any smarter, and you can achieve the same result with efficient context management with one instance of the model acting just one way. A simple example that people use here is not including thinking tokens in the next prompt but including the answer tokens.
And for the case of having two instances of the model running, I don’t think that would increase intelligence either, it just increases your speed because you have two models, even if it seems like it’s making them more intelligent.
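The context-management trick mentioned above (carry answer tokens forward, drop thinking tokens) can be sketched like this. The message shape, dicts with an optional `"thinking"` field, is an assumption for illustration, not any particular chat format.

```python
# Sketch: keep each assistant turn's answer in the rolling history, but
# strip reasoning segments before they go back into the next prompt.
def prune_history(messages):
    """Copy the chat history with reasoning segments removed, so the next
    prompt only pays for the answers, not the thinking tokens."""
    pruned = []
    for msg in messages:
        msg = dict(msg)            # shallow copy; don't mutate the caller's history
        msg.pop("thinking", None)  # drop reasoning before re-prompting
        pruned.append(msg)
    return pruned
```

This is the single-model version of the comment's point: you get most of the context-isolation benefit without running a second agent.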
phido3000@reddit
> And multi agent systems are hardly useful for local models when most people can hardly run just one at a time.
Looks at my local cluster... Hmm... Won't be a problem. I have R1 70b (one server) being the creator and Qwen 35b (on a different server) being the arbiter. And I still have a dedicated server for tools and image generation.
It's not production-ready, it's in a development state, but it's a reasonable possibility now. However, the workflow is significantly different from what people are used to, with just a single AI thinking it out and a human slowly watching generated text output.
I think that is the best advantage of local LLM. You can build something pretty unique. Something that can be adaptable.
In periods when I am sleeping, it can go and review all its output, test the two models against each other, and create a ranking table for the different types of tasks one is known to be better at. It can try to formulate a better solution, develop prompts to enhance output, and change which model is the creator and which is the checker. Once you have invested in the hardware, it's really just electricity costs. It doesn't have to do this all the time, but when you aren't getting good enough results, it can. It doesn't have to be stateless.
Something most subscription models won't do: they don't want one user tying up an entire cluster just so it can produce better output. But why have a single mind in your garage when you can have multiple working tirelessly to improve output?
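The overnight creator/arbiter swap described above can be sketched as a small ranking loop. The models and the judge are stand-in callables, and the scoring is illustrative only.

```python
# Sketch: run both local models on a task set, let a judge pick winners,
# and use the tally to decide which model is creator vs. checker next run.
def rank_models(tasks, model_a, model_b, judge):
    """Return win counts per model over the task set."""
    wins = {"a": 0, "b": 0}
    for task in tasks:
        out_a, out_b = model_a(task), model_b(task)
        wins[judge(task, out_a, out_b)] += 1  # judge returns "a" or "b"
    return wins

def assign_roles(wins):
    """The model with more wins becomes the creator; the other, the checker."""
    creator = "a" if wins["a"] >= wins["b"] else "b"
    return {"creator": creator, "checker": "b" if creator == "a" else "a"}
```

The judge could be the arbiter model itself, or a test suite; per-task-type tables just mean keeping one `wins` dict per category.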
kripper-de@reddit
Multi-agentic doesn't necessarily mean combining different LLMs. You can use the same LLM for all agent instances. It works because each agent is given a different role or context. Think of it like tackling a problem from multiple perspectives: as a coder, a software architect, a business owner, and so on. It won’t solve all intelligence limitations, of course, but it can lead to better results.
awitod@reddit
The thing I most dislike about Anthropic is how often they ‘solve’ problems and claim they are first and pretend there was no solution before their ‘invention’.
They lead people down the wrong path who don’t know any better.
MitsotakiShogun@reddit
Is that different from what most coding helpers (e.g. Cline/Roo) already did? How?
Spaduf@reddit
No. This post is an ad you see.
claythearc@reddit
I think almost everyone in the space has a janky version of this. The new knowledge contributions seem to be, first, some nuggets about the common failure cases they saw, to pre-seed your mind with things to look out for, and second, a first-party solution in their SDK.
MitsotakiShogun@reddit
Thanks!
Jimmy2Bags@reddit
Claude code superpowers. https://github.com/obra/superpowers
I invoke it for even simple projects now.
KrypXern@reddit
Augment and Juni both do something like this, I think it's fairly common
genobobeno_va@reddit
This is just Agile and Project Management. Who’d a thunk?
GatePorters@reddit
??
Maybe soon they will figure out how to get the AI to call tools. I wonder what will happen if we mix language and image models into some kind of model with multiple modalities. . . They could call it multimodacular models.
I am intellinget
xChooChooKazam@reddit
Spec driven development is the easiest way to increase the quality output of Claude. AgentOS has some fantastic out of the box agents for this if you haven’t tried it: https://github.com/buildermethods/agent-os
CuriouslyCultured@reddit
Imagine following basic SDLC practices with agents and calling it research.
if47@reddit
yet another Anthropic blog shit
IrisColt@reddit
We are not impressed.
Quiark@reddit
So like using the beads tool which was made for this
selund1@reddit
Work stealing agents? Are we taking old concepts of managing work and tasks and reapplying them to call it innovation or am I missing something here?
lombwolf@reddit
Who would've thought that giving an agent structure would result in it picking back up previous tasks w/o hassle! Mind Blown! /s
Noiselexer@reddit
So a task list?
molbal@reddit
Tldr: use sub-agents to manage context
Aroochacha@reddit
Maximin-M2 does this with CLine working locally
sstainsby@reddit
I already have scaffolding that breaks projects down into components and capabilities (includes features), and tasks and uses templates.
tindalos@reddit
Funny, Anthropic is one of us apparently.
edgyversion@reddit
Was it really them or Chinese hackers who found this capability