Tips for babysitting a vibecoded app?
Posted by ImYoric@reddit | ExperiencedDevs | 27 comments
A Principal Engineer in my team has concocted a very sophisticated vibecoded app, initially as a research project. I've seen his harness, and it's impressive: development guided by automated metrics, thousands of tests imported from a well-trusted data source, automated reviews, etc. And it's Rust + clippy, which he hoped would ensure at least a reasonable baseline of code quality. As a research project, it's extremely cool!
Now, the powers that be have decided to adopt this app in our team and release it in production soon. So we have this big source dump of AI-generated code, which we're now supposed to take control of, and somehow evolve as we gain real-world experience, user feedback and new requirements.
We're currently brainstorming the how. At this stage, it sounds like we're going to use AI (and almost entirely AI) to evolve the product.
Has anybody in this sub encountered such a situation? How did it go? How did you minimize the blast radius of agents going haywire or gaming/rewriting metrics?
viktorianer4life@reddit
The automated metrics are the trap. If the agent is optimizing against them, it will eventually learn to move the metric without moving the thing it measures. On a batch of a few hundred Claude sessions migrating a Rails test suite, the coverage numbers looked green the whole time, until I added a deterministic check outside the model that grepped the diff for skipped tests and tautological assertions. Agents do not game checks they cannot see.
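That kind of out-of-band check can be tiny. A sketch in Python (the patterns are illustrative guesses for a Rust codebase, not the actual script from that migration):

```python
import re

# Patterns that suggest a test was neutered rather than the code fixed.
# Illustrative guesses for a Rust codebase, not an exhaustive list.
SUSPICIOUS = [
    re.compile(r"#\[ignore\]"),                # test marked as skipped
    re.compile(r"assert!\(true\)"),            # tautological assertion
    re.compile(r"assert_eq!\((\w+),\s*\1\)"),  # x == x, always passes
]

def flag_suspicious_additions(diff_text: str) -> list[str]:
    """Return added lines of a unified diff matching a suspicious pattern."""
    hits = []
    for line in diff_text.splitlines():
        # '+' marks an added line; '+++' is the file header, skip it.
        if line.startswith("+") and not line.startswith("+++"):
            body = line[1:]
            if any(p.search(body) for p in SUSPICIOUS):
                hits.append(body.strip())
    return hits
```

The point is that this runs as a plain CI step on the diff, with no model in the loop, so the agent has nothing to optimize against.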
Two things that helped with blast radius: scope every task narrowly (one file, one function, never "evolve the module"), and keep a frozen copy of the research codebase so you can bisect which agent run introduced a regression. Evolving an AI-generated codebase with AI is workable if the unit of change is small enough that you can still read each diff in under a minute.
demosthenesss@reddit
What would you do if it was some legacy codebase?
Do that.
ImYoric@reddit (OP)
Sure, that's part of what we're discussing.
But, say, legacy codebases don't tend to rewrite their own tests to remove the ones they can't pass.
demosthenesss@reddit
Are you telling AI to force push?
Even if you're fully using AI the problem you just said is simple to address - just look at the test diffs before pushing/committing.
ImYoric@reddit (OP)
Well, that's how I caught the issue.
How well do you think this is going to scale to a codebase that was entirely vibe-coded, with no human being having had any say in which tests were written, or having checked what the tests actually check?
BitNumerous5302@reddit
Are you not allowed to read these tests now?
ImYoric@reddit (OP)
So, do I understand correctly that your practical suggestion is "use your eyeballs"?
Not entirely certain how well this is going to scale to this codebase.
arguskay@reddit
You can try to enforce code coverage. Also read about mutation tests. They semi-randomly switch stuff in your code to see if your tests fail properly.
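Mutation testing in miniature (a hand-rolled toy in Python; real runners like cargo-mutants for Rust or mutmut for Python automate this across a whole codebase):

```python
# A toy mutation test: mutate one operator in a function's source,
# re-run the test against the mutant, and check that the test now fails.
# A mutant that survives means the test wasn't really checking anything.

def run_test(add):
    # Stand-in for a real test suite.
    return add(2, 3) == 5

original_src = "def add(a, b):\n    return a + b\n"
mutant_src = original_src.replace("a + b", "a - b")  # the mutation

def passes(src):
    ns = {}
    exec(src, ns)                 # load the (possibly mutated) code
    return run_test(ns["add"])

assert passes(original_src)       # the test passes on the real code
assert not passes(mutant_src)     # a good test kills the mutant
```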
ImYoric@reddit (OP)
Mutation tests (and generally speaking any form of chaos testing) sound like a very good idea. They need to be out of reach of the agent, though.
On the other hand, given how I've seen agents game metrics, I wouldn't trust code coverage.
BitNumerous5302@reddit
You can't automate what you don't know how to do manually.
I don't know the scale of your codebase or your time constraints, but yes, reading every test to make sure it does something sensible is a possible thing. This is usually done in small increments (code review) but now you have a backlog. It'll take time.
You can definitely get some AI assistance there ("scan tests, sanity check them, create a report on what is tested and how") but building confidence in that system will take time, too. Maybe you can review a few dozen tests manually to identify common mistakes and automate from there.
Ultimately though there are time/risk tradeoffs to communicate to stakeholders. You don't have a mature product and you don't have a mature agentic product maturation pipeline. Either or both of those things will take time.
demosthenesss@reddit
Are you just wanting me to say “Tod sucks, quit and find a different job”?
Because that’s what it feels like.
You’re a staff engineer. Treat this as a constraint you didn’t get to pick and work with it the best you can.
ImYoric@reddit (OP)
Looks like this conversation is going nowhere.
Thank you for your time.
gefahr@reddit
You got tons of practical suggestions. You're just complaining now.
leftsaidtim@reddit
I see you and I haven’t worked with the same junior engineers. I wish I had a nickel for every time I’ve seen someone dismantle a crucial test case because they thought they could just change it at a whim.
ImYoric@reddit (OP)
Fair enough.
But then, with agents, I can see this happening several times per day, in the middle of refactorings tens of thousands of lines long. Which will make it much harder to catch.
leftsaidtim@reddit
Agreed. I almost never let my LLMs write their own tests, and when I do I heavily go back and edit them afterwards so they read like my own voice.
There’s so much leverage in testing hygiene nowadays it’s silly.
soundman32@reddit
Neither do well-behaved AIs. What are the instruction files like? Why not add a rule: "Fix code, don't delete unit tests"?
Cube00@reddit
Plenty of examples of LLMs starting to ignore those guardrails once the context grows large enough.
Adept_Carpet@reddit
Or ignoring the context once the guardrails grow large enough.
ImYoric@reddit (OP)
FWIW, I get both issues right now on a much smaller project.
ImYoric@reddit (OP)
I am currently working on another app with Claude Code + Opus. During the last 2 weeks, it has regularly ignored all of the sacred instructions of AGENTS.md. Also, it has regularly "commented out" tests during refactorings, then failed to return them to their original state.
Why would this one be different?
soundman32@reddit
It looks like, as of March 2026, Claude doesn't natively support AGENTS.md. https://github.com/anthropics/claude-code/issues/34235
Are you doing the workaround? Does CLAUDE.md work better for you (as a temporary workaround, maybe)?
ImYoric@reddit (OP)
My bad, I was using OpenCode + Opus.
LateToTheParty013@reddit
dead internet theory
pydry@reddit
This sounds like a trap. He cornered the glory and he managed to saddle you with the fallout when it starts to flake out in prod.
morswinb@reddit
Old trick, existed even before vibe code.
You can pull it off even with a slideshow. The more sophisticated the stakeholders, the more likely they are to fall for it.
Early_Rooster7579@reddit
Man knows how to play the SWE game properly. Limp it across the line and make it someone else's problem.