AI-generated scripts are passing every review and then paging me at 3am. The problem is never the code.

Posted by Ambitious-Garbage-73@reddit | sysadmin | 6 comments

Third oncall incident this month traced back to the same root cause: an automation script that passed review, ran fine in staging, and then did something unexpected under real load, or in a specific sequence that nobody thought to test.

All three scripts were AI-generated. Not by juniors hiding it — by senior engineers who reviewed the output, understood what it did, and approved it. The code was correct. The logic was fine. The scripts did exactly what they said they did.

What they didn't do was what we actually needed them to do in the edge case that matters at 3am.

The pattern I keep seeing: AI is excellent at generating code that handles the specified case. It's bad at knowing what cases you forgot to specify. When a human writes a script from scratch, they think about failure modes while writing — what if the API is down, what if the response is malformed, what if this runs twice. When you describe a task to AI and it writes the script, it handles the failure modes you mentioned and silently ignores the ones you didn't.

The review problem is that by the time you're reviewing AI output, the cognitive frame is "does this do what it says," not "what did we forget to ask for." Those are different questions, and we're only asking one of them.

Incident one: backup rotation script that worked perfectly unless the previous backup had already been partially deleted. The AI handled "backup exists" and "backup doesn't exist" but not "backup exists but is corrupted." Nobody specified that case, so it wasn't handled.

Incident two: log shipping script that assumed timestamps would always be in UTC. They almost always were. Not during a DST edge case in one region.
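The defensive version is short: refuse naive timestamps instead of silently assuming UTC. A sketch of what I mean, using Python's stdlib `datetime`:

```python
from datetime import datetime, timezone

def parse_log_ts(ts):
    """Parse an ISO-8601 timestamp, rejecting naive (offset-less)
    values instead of assuming UTC. Around a DST transition a
    local-time stamp without an offset is ambiguous, so it's an
    error, not a guess."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return dt.astimezone(timezone.utc)
```

"Almost always UTC" is the kind of invariant a human writing from scratch might question; described as "ship the logs with their timestamps," the AI has no reason to.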

Incident three: health check script that returned healthy if the service responded, without checking if the response was actually valid. Service was responding with cached 200s while the underlying process was hung.
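The lesson there: a 200 is not health, it's reachability. A sketch of the stricter check — require the body to parse and to carry a recent heartbeat, so a cache replaying stale 200s fails the check (field name and threshold are hypothetical):

```python
import json
import time

def is_healthy(status_code, body, max_age_s=30, now=None):
    """A frontend or cache can keep returning 200s while the process
    behind it is hung. Require the body to parse AND to contain a
    heartbeat timestamp newer than max_age_s (field name is made up
    for this sketch)."""
    if status_code != 200:
        return False
    try:
        data = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return False
    now = time.time() if now is None else now
    hb = data.get("heartbeat_epoch")
    return isinstance(hb, (int, float)) and (now - hb) <= max_age_s
```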

None of these are hard bugs to catch if you're thinking adversarially. But AI output tends to make reviewers less adversarial, not more. It looks complete. It looks thought through. The confidence of the output creates confidence in the reviewer.

Curious if others are dealing with this and what process changes actually help. We've tried adding explicit "failure mode" sections to prompts with some improvement, but it feels like a band-aid.