I reviewed 6 months of AI-generated Python across 3 projects. The code was fine. The Python was not.

Posted by Ambitious-Garbage-73@reddit | Python | 16 comments

Did a code review sweep across three internal projects that have been using Copilot and Claude heavily for the past 6 months. Wanted to understand what we actually have before doing a bigger refactor.

The logic was mostly fine. The tests passed. Nothing was catastrophically wrong.

But the Python specifically had accumulated some patterns that I don't think would have emerged from developers writing it themselves.

A few things I noticed:

Functions were longer than they needed to be. Not by a lot, but consistently. The AI seems to prefer a single function that handles the full flow rather than decomposing into smaller pieces. When I asked the team why a 60-line function wasn't broken up, nobody had a strong answer. It wasn't a conscious decision — it was just the shape the AI produced.
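To illustrate (hypothetical example, not from the actual codebase): the AI-produced shape is one function that validates, transforms, and summarizes in a single pass. Decomposed, each step is testable on its own:

```python
def _validate(records):
    """Drop records missing the required fields."""
    return [r for r in records if "id" in r and "value" in r]

def _normalize(records):
    """Scale values into the 0-1 range relative to the peak."""
    peak = max(r["value"] for r in records)
    return [{**r, "value": r["value"] / peak} for r in records]

def process(records):
    """The full flow, composed from small, individually testable steps.

    The AI-generated version tended to inline all of this into one
    60-line body instead.
    """
    return _normalize(_validate(records))
```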

Exception handling was copy-pasted in weird ways. Broad except clauses in places where specific ones made more sense. Error messages that were clearly generated for a generic context, not for the actual failure mode. In one case, an except Exception as e block that logged the error and then continued silently — exactly the pattern you don't want in a data pipeline where a silent failure corrupts downstream results.
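A sketch of the pattern, with invented names, next to a narrower version. The bad variant returns None on any failure, which is exactly how a data pipeline corrupts downstream results without anyone noticing:

```python
import logging

logger = logging.getLogger(__name__)

def parse_row_bad(row):
    """The pattern found in review: log and continue silently."""
    try:
        return int(row["count"])
    except Exception as e:  # broad catch swallows bugs and bad data alike
        logger.warning("error processing row: %s", e)
        return None  # silent failure flows into downstream aggregates

def parse_row(row):
    """Narrower version: catch only the expected failures, fail loudly."""
    try:
        return int(row["count"])
    except (KeyError, ValueError) as e:
        # Re-raise with context so the pipeline stops at the bad record
        raise ValueError(f"malformed row: {row!r}") from e
```

The difference is small in code and large in operations: the first version makes a typo in an upstream schema invisible, the second makes it a stack trace.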

Import hygiene was bad. Libraries imported that weren't used (probably from earlier versions of the prompt that got regenerated). Heavy dependencies pulled in for single utility functions that the standard library handles fine. One service had pandas imported in a module that used exactly one DataFrame operation that could have been a list comprehension.
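The pandas case looked roughly like this (reconstructed illustration, names invented): the entire dependency existed to filter one column.

```python
# Before (sketch of the generated code):
#   import pandas as pd
#   df = pd.DataFrame(rows)
#   active = df[df["status"] == "active"]["name"].tolist()

def active_names(rows):
    """Same filter as the DataFrame version, no dependency needed."""
    return [r["name"] for r in rows if r["status"] == "active"]
```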

Type hints were inconsistently applied. Some functions had full annotations, others had none. The pattern matched prompt-level context: if the original request mentioned types, they were there. If not, they weren't.
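A made-up example of what consistent annotation buys: once the signature is typed, mypy or pyright can check every call site, instead of only the functions that happened to come from a types-mentioning prompt.

```python
def merge_counts(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Merge two count dictionaries, summing overlapping keys."""
    out = dict(a)
    for key, n in b.items():
        out[key] = out.get(key, 0) + n
    return out
```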

None of this is catastrophic. All of it is the kind of thing that accumulates and becomes a problem when the codebase grows or when someone new has to work in it.

The part that surprised me: when I asked the engineers about specific decisions, they often couldn't explain them. Not because they hadn't reviewed the code — they had. But the review frame was "does this work" not "is this good Python." Those are different questions and we were only asking the first one.

Curious whether others are seeing similar patterns in AI-assisted Python codebases, and what (if anything) you've done about it. We're thinking about adding a Pythonic style checklist to PR reviews but I'm not sure that addresses the root cause.
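For what it's worth, a few of the specific patterns above can be caught mechanically rather than via checklist. A sketch of a ruff configuration in pyproject.toml that flags them (rule selection is my suggestion, not something we've rolled out):

```toml
[tool.ruff.lint]
select = [
    "F401",   # unused imports
    "BLE001", # blind `except Exception`
    "ANN",    # missing type annotations
    "C901",   # overly complex functions
]

[tool.ruff.lint.mccabe]
max-complexity = 10
```

This doesn't address the root cause either (the review frame), but it at least stops the mechanical symptoms from accumulating silently.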