Tips for babysitting a vibecoded app?
Posted by ImYoric@reddit | ExperiencedDevs | 27 comments
A Principal Engineer in my team has concocted a very sophisticated vibecoded app, initially as a research project. I've seen his harness, and it's impressive: development guided by automated metrics, thousands of tests imported from a well-trusted data source, automated reviews, etc. And it's Rust + clippy, which he hoped would ensure at least a reasonable baseline of code quality. As a research project, it's extremely cool!
Now, the powers that be have decided to adopt this app in our team and release it in production soon. So we have this big source dump of AI-generated code, which we're now supposed to take control of, and somehow evolve as we gain real-world experience, user feedback and new requirements.
We're currently brainstorming the how. At this stage, it sounds like we're going to use AI (and almost entirely AI) to evolve the product.
Has anybody in this sub encountered such a situation? How did it go? How did you minimize the blast radius of agents going haywire or gaming/rewriting metrics?
viktorianer4life@reddit
The automated metrics are the trap. If the agent is optimizing against them, it will eventually learn to move the metric without moving the thing it measures. On a batch of a few hundred Claude sessions migrating a Rails test suite, the coverage numbers looked green the whole time, until I added a deterministic check outside the model that grepped the diff for skipped tests and tautological assertions. Agents do not game checks they cannot see.
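That kind of out-of-band check can be tiny. A sketch in Python (the patterns are illustrative guesses for a Rust codebase, not the actual script from that migration):

```python
import re

# Patterns that suggest a test was neutered rather than the code fixed.
# Illustrative guesses for a Rust codebase, not an exhaustive list.
SUSPICIOUS = [
    re.compile(r"#\[ignore\]"),                # test marked as skipped
    re.compile(r"assert!\(true\)"),            # tautological assertion
    re.compile(r"assert_eq!\((\w+),\s*\1\)"),  # x == x, always passes
]

def flag_suspicious_additions(diff_text: str) -> list[str]:
    """Return added lines of a unified diff matching a suspicious pattern."""
    hits = []
    for line in diff_text.splitlines():
        # '+' marks an added line; '+++' is the file header, skip it.
        if line.startswith("+") and not line.startswith("+++"):
            body = line[1:]
            if any(p.search(body) for p in SUSPICIOUS):
                hits.append(body.strip())
    return hits
```

The point is that this runs as a plain CI step on the diff, with no model in the loop, so the agent has nothing to optimize against.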
Two things that helped with blast radius: scope every task narrowly (one file, one function, never "evolve the module"), and keep a frozen copy of the research codebase so you can bisect which agent run introduced a regression. Evolving an AI-generated codebase with AI is workable if the unit of change is small enough that you can still read each diff in under a minute.
demosthenesss@reddit
What would you do if it was some legacy codebase?
Do that.
ImYoric@reddit (OP)
Sure, that's part of what we're discussing.
But, say, legacy codebases don't tend to rewrite their own tests to remove the ones they can't pass.
demosthenesss@reddit
Are you telling AI to force push?
Even if you're fully using AI the problem you just said is simple to address - just look at the test diffs before pushing/committing.
ImYoric@reddit (OP)
Well, that's how I caught the issue.
How well do you think this is going to scale to a codebase that was entirely vibe-coded, with no human being having had any say in which tests were written, or having checked what the tests actually check?
BitNumerous5302@reddit
Are you not allowed to read these tests now?
ImYoric@reddit (OP)
So, do I understand correctly that your practical suggestion is "use your eyeballs"?
Not entirely certain how well this is going to scale to this codebase.
arguskay@reddit
You can try to enforce code coverage. Also read about mutation tests. They semi-randomly switch stuff in your code to see if your tests fail properly.
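Mutation testing in miniature (a hand-rolled toy in Python; real runners like cargo-mutants for Rust or mutmut for Python automate this across a whole codebase):

```python
# A toy mutation test: mutate one operator in a function's source,
# re-run the test against the mutant, and check that the test now fails.
# A mutant that survives means the test wasn't really checking anything.

def run_test(add):
    # Stand-in for a real test suite.
    return add(2, 3) == 5

original_src = "def add(a, b):\n    return a + b\n"
mutant_src = original_src.replace("a + b", "a - b")  # the mutation

def passes(src):
    ns = {}
    exec(src, ns)                 # load the (possibly mutated) code
    return run_test(ns["add"])

assert passes(original_src)       # the test passes on the real code
assert not passes(mutant_src)     # a good test kills the mutant
```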
ImYoric@reddit (OP)
Mutation tests (and generally speaking any form of chaos testing) sound like a very good idea. They need to be out of reach of the agent, though.
On the other hand, given how I've seen agents game metrics, I wouldn't trust code coverage.
BitNumerous5302@reddit
You can't automate what you don't know how to do manually.
I don't know the scale of your codebase or your time constraints, but yes, reading every test to make sure it does something sensible is a possible thing. This is usually done in small increments (code review) but now you have a backlog. It'll take time.
You can definitely get some AI assistance there ("scan tests, sanity check them, create a report on what is tested and how") but building confidence in that system will take time, too. Maybe you can review a few dozen tests manually to identify common mistakes and automate from there.
Ultimately though there are time/risk tradeoffs to communicate to stakeholders. You don't have a mature product and you don't have a mature agentic product maturation pipeline. Either or both of those things will take time.
demosthenesss@reddit
Are you just wanting me to say “Tod sucks, quit and find a different job”?
Because that’s what it feels like.
You’re a staff engineer. Treat this as a constraint you didn’t get to pick and work with it the best you can.
ImYoric@reddit (OP)
Looks like this conversation is going nowhere.
Thank you for your time.
gefahr@reddit
You got tons of practical suggestions. You're just complaining now.
leftsaidtim@reddit
I see you and I haven’t worked with the same junior engineers. I wish I had a nickel for every time I’ve seen someone dismantle a crucial test case because they thought they could just change it at a whim.
ImYoric@reddit (OP)
Fair enough.
But then, with agents, I can see this happening several times per day, in the middle of refactorings tens of thousands of lines long. Which will make it much harder to catch.
leftsaidtim@reddit
Agreed. I almost never let my LLMs write their own tests, and when I do I heavily go back and edit them afterwards so they read like my own voice.
There’s so much leverage in testing hygiene nowadays it’s silly.
soundman32@reddit
Neither do well-behaved AIs. What are the instruction files like? Why not add a rule: "Fix code, don't delete unit tests"?
Cube00@reddit
Plenty of examples of LLMs starting to ignore those guardrails once the context grows large enough.
Adept_Carpet@reddit
Or ignoring the context once the guardrails grow large enough.
ImYoric@reddit (OP)
FWIW, I get both issues right now on a much smaller project.
ImYoric@reddit (OP)
I am currently working on another app with Claude Code + Opus. During the last 2 weeks, it has regularly ignored all of the sacred instructions of AGENTS.md. Also, it has regularly "commented out" tests during refactorings, then failed to return them to their original state.
Why would this one be different?
soundman32@reddit
It looks like, as of March 2026, Claude doesn't natively support AGENTS.md. https://github.com/anthropics/claude-code/issues/34235
Are you doing the workaround? Does CLAUDE.md work better for you (as a temporary workaround, maybe)?
ImYoric@reddit (OP)
My bad, I was using OpenCode + Opus.
LateToTheParty013@reddit
dead internet theory
pydry@reddit
This sounds like a trap. He cornered the glory and he managed to saddle you with the fallout when it starts to flake out in prod.
morswinb@reddit
Old trick, existed even before vibe code.
You can pull it off even with a slideshow. The more sophisticated the stakeholders, the more likely they are to fall for it.
Early_Rooster7579@reddit
Man knows how to play the SWE game properly. Limp it across the line and make it someone else's problem.