Claude Autonomous Coding: Discussion

[-]

ReflectedImage@reddit

Well I've only been using it for 2 weeks (ignoring simple line completion obviously).

You give it really small tasks, it then goes and completes the task, then you check if it actually did the task right and if so move on to the next small task. Reviewing the work of a hyperactive junior dev, basically.

I've seen another dev attempt to do the whole autonomous agent thing. Results are questionable. The whole break it down to small tasks and it does the coding thing seems to have only started working this year.

I'm not sure why people expect the many autonomous agent thing to actually work. Surely, that's the next thing they intend to get working?

[-]

Challseus@reddit

I think the most important thing people need to state when discussing these subjects is the programming language. I'm using Python. I've been using it since 2012, and have been programming professionally since 2004 (previously Java and Ruby). It's important, because the amount of training data LLMs have for Python is far greater than for any other language, except JavaScript/TypeScript.

The Python thing surely clouds my opinion here.

Anyway, I run one of the simplest setups out there with Claude:

- plugin for the Python AST, so it finds files in large repos quick
- plugin for deferring stuff to codex for implementation

What's working best for me?

- I know exactly what I am building
- 80% of the time, I'm just chatting with Claude, going over architecture, discussing edge cases, making sure the code is consistent, creating issues and milestones in GitHub
- 15% is Claude doing the work while I come to Reddit
- 5% is me going over things with a fine-tooth comb before committing
- my first job as a junior was with a company that did Extreme Programming, which included the heavy use of TDD, which I hated. Now I use it exclusively when building
- small, bite-sized tasks, taken directly from GitHub issues. People expecting to one-shot advanced architecture is still humorous to me.

I don't use subagents or anything like that. No complex workflows. No MCP tools.

Modular, clean, DRY code. Tests everywhere. Type checking and auto-formatting. CI/CD. Consistent patterns everywhere. You actually know what you're doing. That's how I'm able to use Claude Code without issues for about a year now, maybe less.

And again, I use Python, so that gives me a huge advantage.

[-]

kuntakinteke@reddit

Well well well. I was reviewing a code base that was vibe coded purely in Python and it was a hot mess. Cyclic imports, util duplication, lazy imports in almost every function to break the cyclic imports. Genuinely horrible code base, but I guess it works.

[-]

po-handz3@reddit

Yes.... but that's kinda meaningless without knowing the model used to code, the dev's eng xp, level of implementation plan review, etc.

Like who knows maybe that repo was built by a high schooler with chat 3.5 turbo...

[-]

Challseus@reddit

How does the code work if there are cyclic imports AND lazy imports? Curious indeed… Is it possible it has something to do with anything else in the codebase, templates perhaps or something? Maybe there are conditional reasons based on when to use the imports, dependent on how the code is sliced up?

Did you also specifically and directly tell this person what was wrong with their code so they can improve it?

We’re all trying to get better here, right?

[-]

kuntakinteke@reddit

Well, because the cyclic imports were broken using lazy imports inside functions. Which works but is fucking bad.
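
A minimal sketch of the pattern being described, assuming two hypothetical modules (`orders` and `customers`, invented for illustration) that import each other. It writes both modules to a temporary directory and shows that the top-level version fails to import, while the function-level ("lazy") import hides the cycle rather than removing it:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# All module and function names below are hypothetical, invented for this sketch.
ORDERS = '''
from customers import customer_name  # top-level import: one half of the cycle

def describe_order(order_id):
    return f"order {order_id} for {customer_name(order_id)}"

def order_count(customer_id):
    return 2
'''

CUSTOMERS_EAGER = '''
from orders import order_count  # top-level import: closes the cycle

def customer_name(order_id):
    return "alice"

def customer_summary(customer_id):
    return f"{customer_name(0)} has {order_count(customer_id)} orders"
'''

CUSTOMERS_LAZY = '''
def customer_name(order_id):
    return "alice"

def customer_summary(customer_id):
    # Deferred until the first call, by which time `orders` has finished loading.
    from orders import order_count
    return f"{customer_name(0)} has {order_count(customer_id)} orders"
'''


def try_import(customers_source: str) -> str:
    """Write both modules to a temp dir, import `orders`, and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "orders.py").write_text(ORDERS)
        Path(tmp, "customers.py").write_text(customers_source)
        for name in ("orders", "customers"):
            sys.modules.pop(name, None)  # force a fresh import each time
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("orders")
            customers = importlib.import_module("customers")
            return customers.customer_summary(1)
        except ImportError:
            return "ImportError"
        finally:
            sys.path.remove(tmp)
            for name in ("orders", "customers"):
                sys.modules.pop(name, None)


print(try_import(CUSTOMERS_EAGER))
print(try_import(CUSTOMERS_LAZY))
```

The usual cleaner fix is to move the shared piece (here `order_count`) into a third module that both sides import, so neither module depends on the other.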

[-]

Practical-Poet8288@reddit

The upper limit of Python is low. Since it is AI coding, a statically typed language should be chosen.

[-]

Challseus@reddit

I have `ty` for type checking, and it runs on every commit via `pre-commit`. Claude also knows to run all the type checks (along with formatting and other things) after making changes, so it's worked pretty good for me.

[-]

njinja10@reddit (OP)

nice - I've surely seen a difference, with widely known/used languages working much better due to better training in the models. Love the simplicity you explained here

[-]

new2bay@reddit

I’ve noticed the exact same thing. I started working on a personal project just to test the limits of Codex, and I let it use Python. The difference is absolutely night and day versus even a slightly obscure language. It can actually debug the type of small failures that occur when it makes changes, and fix them. When not using Python, it has a tendency to flail around a lot.

[-]

CockConfidentCole@reddit

This is excellent and I share a similar conclusion as you.

[-]

po-handz3@reddit

I use a dual agent setup, I suppose - Claude in the cloud and I stay high level: requirements, design, architecture, core use cases, and necessary tests. We iterate and eventually create a 5-10 page implementation doc for Claude Code. Then I download the imp doc into the repo and have CC /plan the dev work with max effort. Then I take the dev plan back to cloud Claude for review, give any changes to CC, and kick off the coding.

Note: this is my setup for full feature build outs.

Also note: this is only really possible with a Claude Max sub and Opus 4.6+. If you're not using Claude and/or you're not using a 2+ agent review phase, you're prob just generating slop.

Also, if you HAVE NOT CODED with a multi-phase review AND Claude, then, imo, you really have not agentically coded and your opinion is of no value.

[-]

Fickle-Tomatillo-657@reddit

Definitely agree with using it to pair program and NOT for agent workflows. I’ve seen people get lost in that and still not show much productivity.

[-]

depthfirstleaning@reddit

well the one-shot prod-ready code thing is mostly hype but you can reduce the amount of back and forth significantly, which in turn boosts your productivity. I barely open my ide anymore.

The first mistake people make is usually not loading up the right context and just asking AI to build something. The interesting thing about LLMs is that they do know TDD, clean code, hexagonal architecture; they can tell you about every design pattern in the Gang of Four book, but they won’t use any of it unless you put them in the right headspace for it. Similarly, a powerful way to get decent results is to show it your codebase so it can copy existing practices instead of making up its own based on whatever garbage it was trained on.

Also following the research -> high level design -> low level design -> implementation loop is very important with human review at every step.

[-]

kagato87@reddit

I've been using it about 6 months and I've yet to see it get things right, cover all the gaps, and meet all the needs.

Heck I'll spend two days just in "spec" mode. Then pass it off to a fresh context with a generic "review this for consistency, issues, and gaps" and it always finds tons of stuff.

It definitely is a "pair" thing. A human needs to keep the assumptions it will make in check, because it's even worse than that one cowboy engineer who made that one decision 20 years ago that we're still paying for today...

[-]

OpenJolt@reddit

It never will. It’s endless patches and follow-ups. The LLMs cannot gauge regression properly. They have no judgement and cannot analyze scenarios properly. They will create duplication all over the place and create a massive technical debt nightmare. It’s endless patching and refactoring forever. This is why they need the proper oversight of an experienced engineer and tasks should be done incrementally and surgically.

[-]

njinja10@reddit (OP)

two days in spec mode - legend!

I feel planning is absolutely necessary, but maintaining the plan as free-form text (.md files) is not cutting it. We need deterministic tools to ensure we are on track, like others have mentioned here. Highly resonate with "it definitely is a pair thing"

[-]

Bayakoo@reddit

Just use Claude and prompt it. Forget sub agents skills etc. just have one Claude instance and use it in a project - implement x,y,z.

[-]

detroitsongbird@reddit

I tried a home project where I wrote no code: use plan mode as much as possible to nail down the requirements, use sub agents to build it, unit tests, etc.

I did pretty good, though I have 10x commits for the ui vs the backend.

https://detroit.games/euchre.

The problem? Despite CLAUDE.md rules, architecture guidelines, etc. to build a pro game engine that scales (I’m generalizing here), it still painted itself into a corner.

I did get the server to scale to handle 4k users under load (no wait times, unlike human users that would actually have to read, think, and respond), but I can’t get past that.

When I was brainstorming the problem it suggested a solution, which is the right one. The problem is it didn’t do it that way from the beginning. The core engine needed a rewrite to move to a lock free design.

This time I’m writing the code but have Claude do the code reviews. The results are much better but take longer.

I was all in, now I’m using it like a pair programmer. It’ll offer suggestions but I’m writing the code.

I’ve been programming since before Java.

It’s great if you ask it exactly the right question at the right time. But it’ll easily, confidently build you something that works but is full of tech debt.

[-]

writesCommentsHigh@reddit

Yeah I do that except I reverse what you do.

[-]

_vertexE_@reddit

I think you really have to know exactly what you want when you ask Claude / codex to build out a feature. I notice it works really well when I know how to solve a problem. Once I start asking for work I’m unsure of then that ambiguity creates slop.

I’m leaning now toward hands-on coding / research into a problem space so I know what I want. Once I have a solution / architecture in mind, codex will write out the final solution. And if it’s simple problems I know how to solve, then I only do work through codex.

Example, I really didn’t know how to properly implement short keys in my tauri app, the solution codex came up with was a mess. I really had to figure it out myself on a small scale before asking codex to finish the feature.

[-]

detroitsongbird@reddit

I was very specific, I spent a lot of time in plan mode, a lot of time with “how does this line up with what pro game engines do?” , “how can this be better”, etc.

I just didn’t say the magic words “lock free game state”. Doh!

There was way more detail and hours spent in the design before I had it build things. I was trying to see if it could actually do what some people are saying they’re doing, agents for everything.

I could easily have it build the new design for the core of the engine, but I’m not. I’m building it by hand with Claude as the code reviewer. In the end it’ll be a solid foundation that’ll scale and I’ll have learned a lot.

With the rebuilt version I’ll have learned a lot. For the version that was entirely built by agents I really didn’t learn anything “about the code”, which sucked when I did actually try to debug it. It was an unfamiliar code base.

In the end I think it can be a productivity booster, but unless you’re ok with constantly doing major refactors, potentially disastrous results, it can’t replace programmers.

[-]

tenthousandants44@reddit

There can't be many pro game engine sources in its training set. Like come on what are you even talking about

[-]

detroitsongbird@reddit

Well, it regularly would rattle off how three or four different engines did the things we were chatting about, so some engines are certainly in its memory

[-]

detroitsongbird@reddit

Six months from now I may sing a different tune, since the LLMs and frameworks around them are advancing fast.

[-]

MulberryExisting5007@reddit

Agree. I got into agentic 9 months ago and it was hit and miss. The past three months performance has been really good.

[-]

new2bay@reddit

I agree with you, if by “really good,” you mean “barely acceptable, if you give it a lot of time to think.”

[-]

MulberryExisting5007@reddit

lol I meant that one-shot requests started returning relatively positive results as opposed to a bunch of back and forth.

[-]

itix@reddit

LLMs are fairly good but they don't know the exact winner in advance.

[-]

Whitchorence@reddit

> The problem? Despite CLAUDE.md rules, architecture guidelines, etc to build a pro game engine that scales (I’m generalizing here) it still painted itself into a corner.

CLAUDE.md rules often get ignored, especially if you have a lot of them. It works a lot better if you can express the rules you want it to follow in deterministic build steps (though of course you can use AI itself to help you write scripts doing those assertions).
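
As one illustration of turning a written rule into a deterministic build step, here is a minimal sketch using Python's standard `ast` module. The rule itself ("no imports inside function bodies") and the helper names `find_function_level_imports` and `check_files` are assumptions picked for this example, not something from the thread:

```python
import ast
from pathlib import Path


class _FunctionImportFinder(ast.NodeVisitor):
    """Collects (line, function name) for every import nested in a function."""

    def __init__(self) -> None:
        self._functions: list[str] = []
        self.violations: list[tuple[int, str]] = []

    def visit_FunctionDef(self, node):
        self._functions.append(node.name)
        self.generic_visit(node)
        self._functions.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Import(self, node):
        if self._functions:  # only flag imports that sit inside a function
            self.violations.append((node.lineno, self._functions[-1]))

    visit_ImportFrom = visit_Import


def find_function_level_imports(source: str) -> list[tuple[int, str]]:
    finder = _FunctionImportFinder()
    finder.visit(ast.parse(source))
    return finder.violations


def check_files(paths: list[str]) -> int:
    """Print violations and return an exit code (0 = clean, 1 = rule broken)."""
    failed = False
    for path in paths:
        for line, func in find_function_level_imports(Path(path).read_text()):
            print(f"{path}:{line}: import inside function '{func}'")
            failed = True
    return 1 if failed else 0


SAMPLE = '''
import os

def load():
    from orders import order_count
    return order_count(1)
'''

print(find_function_level_imports(SAMPLE))
```

A CI job or pre-commit hook could call `check_files` on the changed files and fail the build on a non-zero result, so the rule is enforced whether or not the agent read the instructions.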

[-]

njinja10@reddit (OP)

building deterministic tools like you described has been a game changer with AI tools. Nice example!

[-]

ronakg@reddit

> I was all in, now I’m using it like a pair programmer.

That's how I've been using Claude since start. It's like a junior engineer who's crazy fast at reading and writing, but lacks nuance and experience. So you can do back and forth really quickly. Trying different approaches is now considerably cheaper than before.

Have 2 or 3 of these things running at the same time.

[-]

Mornar@reddit

I've been trying to explain to some people: AI isn't smarter than a senior developer, it's "not completely stupid", very, very fast.

[-]

matjam@reddit

You really have to know what the end result looks like, in terms of architecture. The way I've been framing it to people is that if you're building a POC, then that's fine - you can do that without any background. But otherwise you should avoid trying to build something you couldn't build without AI if it's going to be a production thing.

Vibe coding can only get you so far, imho pretty much small apps or POCs are fine, anything that requires real scaling, security, etc - will need someone who knows what they are doing to guide it.

[-]

CandidateNo2580@reddit

That's largely been my experience as well. I'm designing a greenfield project right now and I'm working through organizing the repo in a way that is friendly to these sorts of problems. Ie, I'll code the hot path myself, and give the AI a clear platform to codify domain rules somewhere else where it can't accidentally mess with the infrastructure.

[-]

CmdrSausageSucker@reddit

What I discovered as well was that architectures that heavily adhere to a "feature-centred" paradigm seem to be much more readable / predictable for an AI. But honestly: building, say, a web API is nowhere near the complexity of a game engine as you described.

I am excited though that the LLM keeps on giving by accelerating tasks like: how does this syntax work again? Or code reviews, which bring out some good ideas more often than not.

[-]

DeepHomeostasis@reddit

Local conventions vary in ways the agent can't pick up in a few tool calls. Plans built on the assumption that the rest of the code looks like the sampled part stop fitting around the joins.

[-]

joe0418@reddit

Honestly - lots and lots of up front planning, requirements, and research ... To the point where it's not really autonomous.

I am finding that I prefer to guide the AI in chunks, letting it write the code, but I'm driving the architecture and verifying along the way. I also religiously use adversarial sub agents to check for specific coding principles that I would normally look for in a code review.

[-]

forestsloth@reddit

Another vote for copilot. I have been working on a huge legacy code base and I find that if I write out my flow in comments in the source files, copilot is great about turning those comments to code.

So I get to drive the architecture and flow of the new features but copilot does the coding.

[-]

njinja10@reddit (OP)

stunning! do you follow TDD or other forms of execution?

[-]

forestsloth@reddit

We don’t do TDD but we do require comprehensive unit tests written before you consider the feature/bug fix complete. Our Jenkins jobs run thousands of unit tests any time they detect a code change.

[-]

defenistrat3d@reddit

You can have it stop after every change it makes to allow you to be in the loop as much as you want.

[-]

njinja10@reddit (OP)

Indeed - this is what I do with plan mode to break up a milestone into tasks, and I use https://github.com/gastownhall/beads to track individual tasks/PRs. I review every PR it generates. However, after a point I do feel lost, repeatedly asking Claude to recap where I am in the project. The PRs are bite-sized and easy to reason about; the overall project drift, not so much.

[-]

bradsk88@reddit

Slice the work up small enough that you can still reasonably review it and understand the changeset. After all, you'll expect that from your peers when you share the PR right?

[-]

njinja10@reddit (OP)

Indeed - this is what I do with plan mode to break up a milestone into tasks, and I use https://github.com/gastownhall/beads. However, after a point I do feel lost, repeatedly asking Claude to recap where I am in the project. The PRs are bite-sized and easy to reason about; the overall project drift, not so much.

[-]

AssistFinancial684@reddit

Slow is smooth. Smooth is fast.

[-]

njinja10@reddit (OP)

phil dunphy - is that you?

[-]

nvtrev@reddit

> At the expense of appearing anti-AI

Why is it so bad to appear anti-AI? There's much to critique.

[-]

njinja10@reddit (OP)

It's not bad. I really like AI for what it's worth. I'm massively confounded by the hype I read on X and feel I must be doing something off since I can't get to autonomous coding. Moreover, the popularity of tools like beads and gas-town makes me feel I'm out of the loop. It's basically FOMO for me right now.

[-]

ProButterscotch@reddit

Same thing. Slow is fast, especially on older projects.

[-]

njinja10@reddit (OP)

the best way to leverage AI for older/legacy projects is writing different variants of tests: unit tests, characterization tests, mutation tests, property tests. They keep the existing source code as the oracle (source of truth) and add hardening tests around it.
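
A minimal sketch of the "existing code as oracle" idea, assuming a made-up legacy function (`legacy_shipping_cents`) and a refactored candidate. All names and the pricing rule are hypothetical; the point is only that the old behavior, not a spec, decides what counts as correct:

```python
import random


def legacy_shipping_cents(weight_g: int, express: bool) -> int:
    """Stand-in for existing legacy code; its current behavior is the oracle."""
    cost = 500
    if weight_g > 10_000:
        cost += (weight_g - 10_000) // 20
    if express:
        cost *= 2
    return cost


def refactored_shipping_cents(weight_g: int, express: bool) -> int:
    """Candidate rewrite (e.g. produced by an agent) that must match the oracle."""
    surcharge = max(0, weight_g - 10_000) // 20
    return (500 + surcharge) * (2 if express else 1)


def buggy_shipping_cents(weight_g: int, express: bool) -> int:
    """A rewrite that silently drops the express doubling."""
    return 500 + max(0, weight_g - 10_000) // 20


def find_mismatches(oracle, candidate, trials: int = 1000, seed: int = 0):
    """Compare candidate against oracle on seeded random inputs."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    mismatches = []
    for _ in range(trials):
        weight_g = rng.randint(0, 50_000)
        express = rng.random() < 0.5
        if oracle(weight_g, express) != candidate(weight_g, express):
            mismatches.append((weight_g, express))
    return mismatches


print(len(find_mismatches(legacy_shipping_cents, refactored_shipping_cents)))
print(len(find_mismatches(legacy_shipping_cents, buggy_shipping_cents)) > 0)
```

A property-testing library could replace the hand-rolled random loop; the standard library version is used here only to keep the sketch self-contained.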

[-]

EarthquakeBass@reddit

try codex. claude is braindead

[-]

eloel-@reddit

I have fully adapted to never opening an IDE, and having 2-3 agents go at it.

I use planning a lot, and I will make very specific suggestions on file structures, on refactors, utils and all, but I don't write code. I treat it as having several junior engineers that work very quickly but often suboptimally that I need to guide around.

It's a lot more tiring than just doing things myself, but also a lot faster per unit of work

[-]

new2bay@reddit

That’s actually working for you? Using Codex, the thing it seems to be teaching me is that just because you can shit out 10k LoC in a day, that doesn’t mean you should.

I’ve been working on a project to implement a rules engine for Magic the Gathering in Python. It gets to use a language with lots of data in its training set, and I already have a detailed spec prewritten for me in the form of the Comprehensive Rules. I made some assumptions to simplify the task, and it still does a lot of wonky stuff. I’d call it more like having a couple of interns around who don’t get tired and can type really fast, but are completely clueless about architecture, or anything other than shitting out code.

After my previous experiences with it, I don’t know if I should be impressed or disappointed.

[-]

eloel-@reddit

It has been working fine for me and my team. We have been sticking to small, iterative changes where we also make and give it the architectural decisions we want.

Planning mode has it give me 3-4 options where I pick the one I think is the best fit. Often, but not always, it recommends the correct one, but needs that extra push.

It takes guidance better than interns do, and I've been persistent in getting it to update docs for itself as decisions are made, so we have mostly been able to keep the code consistent and well-reused

[-]

Whitchorence@reddit

> The whole narrative around, autonomous agents where you have one that plans, breaks down tasks, implement those tasks, test harness agent and a critique agent. How has your success been around such practices. I seem to be faring very poorly.

I have noticed some variance in how well it works based on the project, so I don't want to oversell, but honestly this is like 80 to 90% of the way I do actual technical work now: spending a lot of time volleying on the details of the plan, then firing it off and letting it churn away for 30m-1h, and then probably just kicking off another task or doing some other work while that's going.

[-]

capitalsigma@reddit

I never enjoyed ML auto complete but I'm in about the same boat as you. I think of it as more of a reading/typing assistant than an autonomous developer. I've tried to use it to do more complex solo tasks on a few occasions and I find that basically every single time I eventually hit a wall where I need to throw it out and start from scratch. For the throwaway stuff I trust it with, maybe 60% of the time I eventually end up digging into it and discovering something subtly wrong

It's fantastic as a smart search tool and I get real value out of the typing that I can offload to it. I probably have at least 1 agent running for at least 20-30% of my working hours. But I feel like it's a very tricky balance to strike in terms of identifying what it can handle without undermining my understanding of the codebase too much, and I worry that juniors (who have never had the experience of really understanding the project that they work on) are likely to lean on it too much and get burned.

[-]

RandomPantsAppear@reddit

I find that because of the vibe coding community, most claude based agents that exist are focused around writing more, with more capabilities. Or at best, emulating (poorly) the best practices that exist like testing and debugging.

What I am trying to emphasize is the opposite:

- Pushing for more human in the loop.
- Documenting user-stated intent in different ways that don’t get compressed.
- Translating my stylistic preferences into verbiage that the AI can understand.
- Turning my refinement of the code (I do not let AI structure how it pleases) into actionable, re-usable restraints.
- Forcing documentation and comments to be about why something is the way it is and not how it functions.

I think the nature of documentation is changing, and by default all of the documentation Claude has been trained on is documentation written by humans, for humans to consume - what are the arguments, return result, how does it work, etc.

This is no longer what is required. The AI can easily infer what humans typically document, and for humans less familiar with the code and structure, that documentation is largely unhelpful as-is.

My theory is that by capturing intent - what a feature does, why it is specifically that way, and what the goal of its existence was - you can leave substantial enough breadcrumbs that don’t get lost every time a session restarts or memory gets compacted. This helps the AI, and us coming in later having to understand.
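
A small sketch of what an intent-first docstring might look like. The function, the constants, and the "billing API" reason given are all invented for illustration; the shape (why, constraint, what not to change) is the part that matters:

```python
MAX_DELAY_SECONDS = 30  # hypothetical cap, chosen for this example


def retry_delay_seconds(attempt: int) -> int:
    """Wait time before retry number `attempt` (1-based).

    Why: the upstream billing API (hypothetical) rejects bursts, so each
    retry waits twice as long as the previous one instead of retrying at
    a fixed interval.
    Constraint: the delay is capped because callers are user-facing
    requests and a longer wait would look like a hang.
    Do not change: switching to a fixed delay would bring back the burst
    failures this function was written to avoid.
    """
    return min(MAX_DELAY_SECONDS, 2 ** attempt)


print([retry_delay_seconds(n) for n in (1, 2, 5)])
```

Nothing in the docstring repeats what the one-line body already says; it records the reasoning a later session (human or agent) could not recover from the code alone.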

The code is not my quality, and it won’t be my quality. But I think it can be maintainable and understandable.

I have seen what AI does running wild and at the very least, this doesn’t resemble that.

[-]

timabell@reddit

That is a good insight there - I can def see a different style of code comments really helping the AI getting the next change right much more often.

[-]

metaphorm@reddit

tight coding harness. lots of tools and connectors for gathering context. for example, a Notion connector that can query an internal knowledge base full of thoroughly written Notion pages. custom agent skills to teach the coding agent how to interact with the code base in the preferred ways, in terms of workflow, idioms, and validations. and a pretty thorough code review and QA process with both human and LLM reviewers involved.

I find a very good workflow that can deliver good-enough quality agent written code looks something like this:

- write up a ticket in your project management software (we use Linear). the ticket defines background, business value, and acceptance criteria. the ticket also includes links to relevant knowledge base pages.
- feed the ticket into the coding agent and start a chat about the design for the implementation, focusing on things still ambiguous from the ticket, and get an implementation plan from it. save the implementation plan as an addendum to the ticket. this part is important for debugging. future agent sessions will see the implementation plan and immediately have much more context.
- let the agent write the implementation. put up the PR and get review bots (we use coderabbit) to do initial review while you manually QA the change in your local dev env. simultaneously it will run CI and get unit tests and integration tests running as you manually QA it. feed back review comments and QA discoveries into the agent until it's acceptable.
- final review gets human eyes on it. if it passes, merge it and deploy to canary for e2e testing. if it passes canary testing it gets shipped.

[-]

TonyAtReddit1@reddit

The "spicy auto-complete" Copilot model of LLM-assisted development is going to be the status quo when the dust settles: when Anthropic and OpenAI can no longer defraud investors into believing they can close a gap of 8 cents to the penny on offering these services, and clients can't cover this gap.

Small local models running auto-complete was how this started and how it will end.

I've only ever found this way of working productive in a way that doesn't produce crap. Funnily enough, when I tell people this, the response is that they don't consider me to "really be using AI".

[-]

hibikir_40k@reddit

The reviewer right now is going to be lacking in context, because you still know better. It's not that agents couldn't figure it all out, but the cost in tokens is just unreasonable. For now, you still have to be at least a secondary reviewer.

"Slow is fast" is not true here though, because you are not accounting for the parallelization. I am often running 4, 5 changes at once. I might finish 1 faster when I am more in the loop (and hell, I often am intervening in one of them), but the others are still advancing, so I finish the whole lot well ahead.

If you really want to see slow, just have half your team with no expertise in a different timezone. Now that's slow and expensive.

[-]

Worldline_AI@reddit

It's a counter-intuitive approach but there is wisdom in keeping the human in the loop. The distinction most teams miss: reviewing the diff is not the same as reviewing what produced the diff. One is a code review. The other is a session review. We have good practices for the first. Almost none for the second.

[-]

NotMyRealNameObv@reddit

I've built a whole harness around kiro-cli/Claude Opus 4.6, which includes an autonomous loop for periodically doing stuff like finding and implementing modernization opportunities or bug fixes. My rate of commits delivered to master (arguably not really a good metric, but it's the best I have at the moment) has increased by ~750%. The bottlenecks are, in decreasing order: getting external reviews, manually pushing the code for review/submitting after review, and personally doing my own self-review after the agent harness declares the change to be ready.

The workflow includes separate steps for writing an implementation spec, actually implementing the spec, building the binaries, verifying all tests pass, verifying no static code analysis issues have been introduced, doing a review of the finished code, and retriggering fixes if the review found issues. All of this happens autonomously, of course.
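
The shape of such a loop can be sketched roughly as follows. This is not the commenter's harness: `run_change`, the gate names, and the stub agent are all hypothetical, and a real setup would replace the stubs with calls to the agent CLI, the build, and the test runner:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    ok: bool
    notes: str = ""


@dataclass
class Outcome:
    status: str
    rounds: int
    change: str


Gate = tuple[str, Callable[[str], GateResult]]


def run_change(task: str, implement, gates: list[Gate], max_rounds: int = 3) -> Outcome:
    """Implement, run the gates in order, and feed the first failure back in."""
    feedback: list[str] = []
    change = ""
    for round_no in range(1, max_rounds + 1):
        change = implement(task, feedback)
        feedback = []
        for name, check in gates:
            result = check(change)
            if not result.ok:
                feedback.append(f"{name}: {result.notes}")
                break  # later gates are pointless until this one passes
        if not feedback:
            return Outcome("ready_for_human_review", round_no, change)
    return Outcome("needs_human_help", max_rounds, change)


def fake_implement(task: str, feedback: list[str]) -> str:
    # Stand-in for an agent call: pretends to fix whatever feedback it was given.
    return f"{task} (fixes applied: {len(feedback)})"


def fake_build(change: str) -> GateResult:
    return GateResult(ok=True)


def fake_tests(change: str) -> GateResult:
    # Fails until the stub agent has seen one round of feedback.
    return GateResult(ok="fixes applied: 1" in change, notes="2 unit tests failed")


outcome = run_change("rename config loader", fake_implement,
                     [("build", fake_build), ("tests", fake_tests)])
print(outcome.status, outcome.rounds)
```

The round cap is the design choice worth noting: when the gates keep failing, the loop stops and hands the change to a human instead of retrying indefinitely.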

Sadly, it doesn't work nearly as well for more complex tasks, but it still helps.

[-]

Ok-Hospital-5076@reddit

My workflow looks like this
Prompt -> Review -> Feedback -> Review -> Test -> Accept
- small sessions
- tight scope
- clear requirements
- clear commits and tags
Keep the build clean, code maintainable

[-]

Nataliashayk@reddit

Honestly, my experience has been similar. Full “autonomous agent” setups sound great on paper but break down fast in real codebases. I’ve had way more success keeping humans firmly in the loop and using AI more like a very fast pair programmer.

What works best for me:

- Small, clearly scoped tasks.
- Let the model write or refactor one thing at a time.
- Always run tests / lint manually after.
- Use AI for explanation and exploration, not decision-making.

Every time I’ve tried planner → executor → critic agent chains, I spend more time correcting drift than writing code. Slower, controlled iteration has felt way more reliable. Slow is fast, like you said.

[-]

SlightlyLethalDev@reddit

The drift problem is so real. Even small drift, which seems to always happen, compounds over many PRs. It's easy to overlook small things in a single PR but then agents will pick that up as an example and drift further and further until the code is just a mess. I've found that agents don't typically think of or plan code changes to scale cleanly, be maintainable, etc. So we end up with an extremely narrow and brittle codebase, and then when some new feature comes in it's very hard to add it.

[-]

EnderMB@reddit

I've played with a mixture of Ralph loops, hive managers, and a few UI tools that orchestrate agents to do my bidding.

I'd say that where they've been useful is in research tasks, where I'm building something new and I want a bunch of agents to go out into the world and do thorough research across a huge amount of parallelizable documentation.

For coding tasks, they're as good as the weakest agent, and that almost certainly means a fundamental mistake has resulted in something piss poor being delivered - even for really basic tasks.

[-]

Polite_Jello_377@reddit

Yep another person who finds the copilot mode productive but doubts the autonomous agent swarm approach

[-]

If_I_Could_Just@reddit

I have a Claude routine that runs once a week and does a deep search on latest findings for what works regarding agent orchestration. Then I have it create the set of skills to mimic what it learned. It created a “risk gated workflow” recently that works pretty well. Agents to review as it goes, check for drift, run a separate list of codebase-specific skills I keep at various times. Not perfect, but improving.

[-]

03263@reddit

I'm so sick of hearing about AI, just shut up

[-]

ibraaaaaaaaaaaaaa@reddit

I agree.

I usually write code by hand when it comes to critical pieces.

Although the firm I work at provides me with the latest Claude models, I usually use the tokens on stuff where I don’t care if it's not written well or don’t mind if it breaks, so that I can focus more on the critical parts.