I reviewed 6 months of AI-generated Python across 3 projects. The code was fine. The Python was not.

Posted by Ambitious-Garbage-73@reddit | Python | 16 comments

Did a code review sweep across three internal projects that have been using Copilot and Claude heavily for the past 6 months. Wanted to understand what we actually have before doing a bigger refactor.

The logic was mostly fine. The tests passed. Nothing was catastrophically wrong.

But the Python specifically had accumulated some patterns that I don't think would have emerged from developers writing it themselves.

A few things I noticed:

Functions were longer than they needed to be. Not by a lot, but consistently. The AI seems to prefer a single function that handles the full flow rather than decomposing into smaller pieces. When I asked the team why a 60-line function wasn't broken up, nobody had a strong answer. It wasn't a conscious decision — it was just the shape the AI produced.
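To illustrate (hypothetical example, not from the actual codebase): the AI-produced shape is one function that validates, transforms, and summarizes in a single pass. Decomposed, each step is testable on its own:

```python
def _validate(records):
    """Drop records missing the required fields."""
    return [r for r in records if "id" in r and "value" in r]

def _normalize(records):
    """Scale values into the 0-1 range relative to the peak."""
    peak = max(r["value"] for r in records)
    return [{**r, "value": r["value"] / peak} for r in records]

def process(records):
    """The full flow, composed from small, individually testable steps.

    The AI-generated version tended to inline all of this into one
    60-line body instead.
    """
    return _normalize(_validate(records))
```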

Exception handling was copy-pasted in weird ways. Broad except clauses in places where specific ones made more sense. Error messages that were clearly generated for a generic context, not for the actual failure mode. In one case, an except Exception as e block that logged the error and then continued silently — exactly the pattern you don't want in a data pipeline where a silent failure corrupts downstream results.
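A sketch of the pattern, with invented names, next to a narrower version. The bad variant returns None on any failure, which is exactly how a data pipeline corrupts downstream results without anyone noticing:

```python
import logging

logger = logging.getLogger(__name__)

def parse_row_bad(row):
    """The pattern found in review: log and continue silently."""
    try:
        return int(row["count"])
    except Exception as e:  # broad catch swallows bugs and bad data alike
        logger.warning("error processing row: %s", e)
        return None  # silent failure flows into downstream aggregates

def parse_row(row):
    """Narrower version: catch only the expected failures, fail loudly."""
    try:
        return int(row["count"])
    except (KeyError, ValueError) as e:
        # Re-raise with context so the pipeline stops at the bad record
        raise ValueError(f"malformed row: {row!r}") from e
```

The difference is small in code and large in operations: the first version makes a typo in an upstream schema invisible, the second makes it a stack trace.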

Import hygiene was bad. Libraries imported that weren't used (probably from earlier versions of the prompt that got regenerated). Heavy dependencies pulled in for single utility functions that the standard library handles fine. One service had pandas imported in a module that used exactly one DataFrame operation that could have been a list comprehension.
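The pandas case looked roughly like this (reconstructed illustration, names invented): the entire dependency existed to filter one column.

```python
# Before (sketch of the generated code):
#   import pandas as pd
#   df = pd.DataFrame(rows)
#   active = df[df["status"] == "active"]["name"].tolist()

def active_names(rows):
    """Same filter as the DataFrame version, no dependency needed."""
    return [r["name"] for r in rows if r["status"] == "active"]
```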

Type hints were inconsistently applied. Some functions had full annotations, others had none. The pattern matched prompt-level context: if the original request mentioned types, they were there. If not, they weren't.
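A made-up example of what consistent annotation buys: once the signature is typed, mypy or pyright can check every call site, instead of only the functions that happened to come from a types-mentioning prompt.

```python
def merge_counts(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Merge two count dictionaries, summing overlapping keys."""
    out = dict(a)
    for key, n in b.items():
        out[key] = out.get(key, 0) + n
    return out
```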

None of this is catastrophic. All of it is the kind of thing that accumulates and becomes a problem when the codebase grows or when someone new has to work in it.

The part that surprised me: when I asked the engineers about specific decisions, they often couldn't explain them. Not because they hadn't reviewed the code — they had. But the review frame was "does this work" not "is this good Python." Those are different questions and we were only asking the first one.

Curious whether others are seeing similar patterns in AI-assisted Python codebases, and what (if anything) you've done about it. We're thinking about adding a Pythonic style checklist to PR reviews but I'm not sure that addresses the root cause.
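For what it's worth, a few of the specific patterns above can be caught mechanically rather than via checklist. A sketch of a ruff configuration in pyproject.toml that flags them (rule selection is my suggestion, not something we've rolled out):

```toml
[tool.ruff.lint]
select = [
    "F401",   # unused imports
    "BLE001", # blind `except Exception`
    "ANN",    # missing type annotations
    "C901",   # overly complex functions
]

[tool.ruff.lint.mccabe]
max-complexity = 10
```

This doesn't address the root cause either (the review frame), but it at least stops the mechanical symptoms from accumulating silently.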