3 production incidents we traced back to Copilot-generated code — and what they had in common
Posted by Ok_Stretch_6623@reddit | Python | 7 comments
Incident 1: Stripe signature not verified → double charges
Incident 2: Token expiry used >= instead of > → session bypass
Incident 3: Exception swallowed silently in auth path → failures invisible
What all 3 had in common:
— All in auth or payments path
— All looked correct on review
— All passed existing tests
— All were AI-written with no human writing equivalent code nearby
What we changed afterward... [continue the story]
Python-ModTeam@reddit
Your post was removed for violating Rule #2. All posts must be directly related to the Python programming language. Posts pertaining to programming in general are not permitted. You may want to try posting in /r/programming instead.
wRAR_@reddit
(If you want to know what they are selling, check their post history.)
Salfiiii@reddit
It's a topic that will be discussed a lot in the future.
Ok_Stretch_6623@reddit (OP)
Honestly, I don’t think the issue is AI vs human. A good engineer could still make these mistakes — especially under time pressure. The difference is volume and confidence: AI produces “clean-looking” code fast, which lowers our guard during review.
On your questions:
Would a human have written it correctly?
Maybe — but not guaranteed. The tricky part is that humans usually leave context (comments, discussions, incremental commits). AI often drops in a “complete-looking” solution without that trail, which makes it harder to question.
How is the review process?
That's where things break. Reviews tend to focus on readability and logic flow, not adversarial thinking. In auth/payments paths, reviewers should be asking adversarial questions: what does each branch let an attacker do?
Were tests AI-written too?
In many cases, yes — or at least influenced by the same assumptions. So tests end up validating the same flawed logic, not the real-world edge cases. That’s why everything “passes” but still fails in production.
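A hypothetical illustration of that failure mode, reusing the expiry bug from incident 2: tests generated from the same assumption exercise only the clear-cut cases, so the suite is green while the boundary case ships.

```python
# The code under test, with the subtle bug: >= accepts a token at the
# exact expiry instant (hypothetical reconstruction of incident 2).
def token_is_valid(expires_at: float, now: float) -> bool:
    return expires_at >= now

# Tests written from the same assumption check only the obvious cases,
# so both pass against the buggy implementation:
def test_clearly_expired_token_rejected():
    assert not token_is_valid(expires_at=100.0, now=200.0)

def test_clearly_fresh_token_accepted():
    assert token_is_valid(expires_at=200.0, now=100.0)

# The adversarial case nobody wrote; with >= it would fail:
# def test_token_invalid_at_exact_expiry():
#     assert not token_is_valid(expires_at=100.0, now=100.0)
```

Coverage tools report 100% line coverage here, which is part of why "everything passes" is such a weak signal on critical paths.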
What I’m starting to believe:
AI doesn’t introduce new classes of bugs — it amplifies subtle ones and makes them easier to ship.
The real gap isn’t code quality — it’s risk-aware review and test design, especially for critical paths like auth and billing.
Salfiiii@reddit
Makes sense.
In the end it's probably cognitive overload: code is generated so fast that thorough review becomes the bottleneck, and most people don't enjoy reviewing code, especially when it's written by an AI, because you can't teach the AI to do better next time the way you can with a good junior (at least right now). There's no reward for reviewing it; it's just tedious work.
Do you personally like the current way of working, where your part is more specification, planning, and reviewing than actually writing code?
Ok_Stretch_6623@reddit (OP)
I actually love coding. But these days my perspective is shifting, mostly because around 90% of code is being written with AI assistance.
So instead of writing everything from scratch, my role is becoming more about guiding, reviewing, and refining. It’s more convenient and efficient, but I still enjoy the part where I get to think deeply and write critical pieces myself.
Obvious-Web9763@reddit
AI-generated comments in the “AI fucked up” post, have we learned nothing here?