AI-generated scripts are passing every review and then paging me at 3am. The problem is never the code.
Posted by Ambitious-Garbage-73@reddit | sysadmin | 6 comments
Third on-call incident this month traced back to the same root cause: an automation script that passed review, ran fine in staging, and then did something unexpected under real load or in a specific sequence that nobody thought to test.
All three scripts were AI-generated. Not by juniors hiding it — by senior engineers who reviewed the output, understood what it did, and approved it. The code was correct. The logic was fine. The scripts did exactly what they said they did.
What they didn't do was what we actually needed them to do in the edge case that matters at 3am.
The pattern I keep seeing: AI is excellent at generating code that handles the specified case. It's bad at knowing what cases you forgot to specify. When a human writes a script from scratch, they think about failure modes while writing — what if the API is down, what if the response is malformed, what if this runs twice. When you describe a task to AI and it writes the script, it handles the failure modes you mentioned and silently ignores the ones you didn't.
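To make that concrete, here's a rough sketch of the checks a human tends to add without being asked: a timeout, response validation, and a guard against the script running twice. The endpoint and lock path are invented for illustration, not from any of the actual scripts.

```python
# Hypothetical sketch of the defensive checks a human tends to add unprompted:
# a timeout, response validation, and a "what if this runs twice" guard.
# The endpoint and lock path are made up for illustration.
import fcntl
import sys
import requests

LOCK_PATH = "/var/run/sync_job.lock"            # assumption: one instance per host
API_URL = "https://example.internal/api/items"  # placeholder endpoint

def main() -> int:
    # Guard against concurrent or duplicate runs (cron overlap, manual re-run).
    lock = open(LOCK_PATH, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print("another run is in progress, exiting", file=sys.stderr)
        return 0

    # Don't hang forever if the API is down or slow.
    try:
        resp = requests.get(API_URL, timeout=10)
        resp.raise_for_status()
        payload = resp.json()
    except (requests.RequestException, ValueError) as exc:
        print(f"API unreachable or malformed response: {exc}", file=sys.stderr)
        return 1

    # Validate the shape before acting on it, not after.
    if not isinstance(payload, list):
        print("unexpected payload shape, refusing to continue", file=sys.stderr)
        return 1

    # ... actual work would go here ...
    return 0

if __name__ == "__main__":
    sys.exit(main())
```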
The review problem is that by the time you're reviewing AI output, the cognitive frame is "does this do what it says" not "what did we forget to ask for." Those are different questions and we're only asking one of them.
Incident one: backup rotation script that worked perfectly unless the previous backup had already been partially deleted. The AI handled "backup exists" and "backup doesn't exist" but not "backup exists but is corrupted." Nobody specified that case, so it wasn't handled.
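Roughly what the missing branch looks like. The paths and the checksum convention here are invented for the sketch; the point is that "exists" and "usable" are different states.

```python
# A sketch of the missing third case: "backup exists but is corrupted."
# Paths and the checksum convention are assumptions for illustration.
import gzip
import hashlib
import zlib
from pathlib import Path

def backup_state(path: Path) -> str:
    """Classify a backup as missing, corrupt, or ok -- not just present/absent."""
    if not path.exists():
        return "missing"
    checksum_file = path.with_suffix(path.suffix + ".sha256")
    if checksum_file.exists():
        expected = checksum_file.read_text().split()[0]
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            return "corrupt"
    else:
        # No checksum recorded: at least verify the gzip stream is readable.
        try:
            with gzip.open(path, "rb") as fh:
                while fh.read(1024 * 1024):
                    pass
        except (OSError, EOFError, zlib.error):
            return "corrupt"
    return "ok"

# Rotation should only delete the old backup once the new one verifies as "ok",
# and should treat "corrupt" as its own branch instead of lumping it in with "exists".
```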
Incident two: log shipping script that assumed timestamps would always be in UTC. They almost always were. Not during a DST edge case in one region.
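A sketch of the safer behavior, with an invented timestamp source: parse the offset that's actually on the line and refuse to guess when it's missing, instead of assuming UTC.

```python
# Sketch of the assumption that bit us: treating every timestamp as UTC.
# The log format is made up; the point is parsing offsets instead of assuming one.
from datetime import datetime, timezone

def parse_log_ts(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp and normalize to UTC.

    Raises if the timestamp carries no offset, instead of silently assuming UTC.
    """
    ts = datetime.fromisoformat(raw)
    if ts.tzinfo is None:
        # This is where "almost always UTC" turns into a 3am page during DST.
        raise ValueError(f"timestamp without offset, refusing to guess: {raw!r}")
    return ts.astimezone(timezone.utc)

print(parse_log_ts("2024-03-10T01:30:00-05:00"))  # -> 2024-03-10 06:30:00+00:00
```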
Incident three: health check script that returned healthy if the service responded, without checking if the response was actually valid. Service was responding with cached 200s while the underlying process was hung.
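A sketch of a less trusting health check. The URL, field names, and freshness window are made up; the idea is to require something only a live process can produce, so a cached 200 fails.

```python
# Sketch of a health check that validates the response instead of trusting any 200.
# The URL, expected fields, and freshness window are assumptions for illustration.
import time
import requests

HEALTH_URL = "http://localhost:8080/healthz"    # placeholder endpoint
MAX_AGE_SECONDS = 30                            # reject stale or cached answers

def is_healthy() -> bool:
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
    except requests.RequestException:
        return False
    if resp.status_code != 200:
        return False
    try:
        body = resp.json()
    except ValueError:
        return False  # a 200 with garbage in it is not "healthy"
    # Require a recent heartbeat the service itself reports, so a hung process
    # behind a cache can't keep passing the check.
    heartbeat = body.get("heartbeat_epoch")
    if not isinstance(heartbeat, (int, float)):
        return False
    return (time.time() - heartbeat) <= MAX_AGE_SECONDS
```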
None of these are hard bugs to catch if you're thinking adversarially. But AI output tends to make reviewers less adversarial, not more. It looks complete. It looks thought through. The confidence of the output creates confidence in the reviewer.
Curious if others are dealing with this and what process changes actually help. We've tried adding explicit "failure mode" sections to prompts, with some improvement, but it feels like a band-aid.
VA_Network_Nerd@reddit
Sorry, it seems this comment or thread has violated a sub-reddit rule and has been removed by a moderator.
Do Not Conduct Marketing Operations Within This Community.
Your content may be better suited for our companion sub-reddit: /r/SysAdminBlogs
If you wish to appeal this action please don't hesitate to message the moderation team.
accidentlife@reddit
What are you selling and why are you using AI to sell a post about AI problems?
llDemonll@reddit
I read two paragraphs then realized it was a bot shilling something.
If senior engineers are “writing” this code and testing it then putting it in production and the code doesn’t account for basic dependencies, like “is the API available”, then you have some shitty engineers and both the code and the engineers who “reviewed it” are the issue.
But it’s still a crappy bot post.
itfosho@reddit
Too long didn’t read but sounds like it’s not passing a real review nor are you doing enough testing to validate it. If it needs error handling then you have to write that in there. AI doesn’t solve everything.
malventano@reddit
…except they didn’t, else you would have no complaint.
Festernd@reddit
It's like hiring a cheap contract coder -- they will do exactly what you ask, with no plans for maintainability and no handling of anything you didn't ask for, with the added bonus of hallucination.
Until we can write prompts way better and more clearly than PMs/clients can write specs, this is the inevitable result.