How much time do you spend reviewing AI-generated Python code before pushing?
Posted by Desperate_Crew1775@reddit | Python | View on Reddit | 66 comments
Genuine question — not selling anything.
I've been using Cursor/Copilot heavily for the last few months and noticed I spend almost as much time reviewing and fixing the AI's output as I would have writing it myself.
Specifically for Python and FastAPI:
- Missing auth checks on routes
- Pydantic models that don't validate edge cases
- Async functions that look correct but have subtle race conditions
- Exception handling that swallows errors silently
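For illustration, the async race-condition item usually has this shape (a minimal stdlib-only sketch with a hypothetical counter, not code from a real project): an `await` sitting between a read and a write.

```python
import asyncio

state = {"n": 0}

async def unsafe_increment():
    current = state["n"]
    await asyncio.sleep(0)   # any await between read and write opens a race window
    state["n"] = current + 1

async def safe_increment(lock):
    async with lock:         # serialize the whole read-modify-write
        current = state["n"]
        await asyncio.sleep(0)
        state["n"] = current + 1

async def main():
    state["n"] = 0
    await asyncio.gather(*(unsafe_increment() for _ in range(100)))
    unsafe_result = state["n"]

    state["n"] = 0
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(lock) for _ in range(100)))
    return unsafe_result, state["n"]

print(asyncio.run(main()))  # (1, 100): the unsafe version lost 99 updates
```

Both versions look equally "correct" at a glance, which is exactly why this class survives review.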
My background is network operations (20 years) so I'm paranoid about production failures. I'm curious whether this is just me or a wider pattern.
Questions:
1. How long do you spend reviewing AI code before each commit?
2. What's the most common class of bug AI code introduces in your experience?
3. Have you tried any automated review tools (CodeRabbit, Qodo etc) — were they useful or too noisy?
Not looking to pitch anything — genuinely trying to understand the problem space before deciding whether to build something.
GXWT@reddit
I sidestep this by not generating AI code.
odimdavid@reddit
Instead of giving the AI a prompt, I give it the steps I would use and ask it to generate code line by line. That way I'm debugging it and debugging myself, instead of having to go through 100 lines of code looking for errors. It has worked for me.
Desperate_Crew1775@reddit (OP)
Interesting — what was the experience that made you stop? Was it bugs in production or just too much time reviewing?
GXWT@reddit
I never started. I've tested and played with it, sure. Never used it (or wanted to use it) in any meaningful capacity because ~~I'm not a gimp~~ I'm not just writing boilerplate code; the work I do is novel and LLMs by definition cannot help me.
Don't feel obliged to reply to me again if you're not going to use your own words and thoughts. Gimp.
error1954@reddit
Why is this so hostile
GXWT@reddit
It's a public forum in which I am able to freely express my opinions on the topic.
Why does my opinion have to be neutral or even positive? Is a public forum only for agreeing?
Of course you may disagree and downvote, as others have done, but that does not mean I have to be flaccid with my opinion.
mattl33@reddit
Where do you work? I'd like to ensure I avoid you.
error1954@reddit
You can disagree without being a total twat
GXWT@reddit
But then I wouldn't really get the strength or full reality of my opinion across.
Zouden@reddit
No one's paying attention to your opinion now.
GXWT@reddit
That's ok. Such is the wonder of a public forum.
Another wonder is that you can freely make this functionally useless comment. What did it hope to achieve?
error1954@reddit
Ah okay then mission accomplished
Desperate_Crew1775@reddit (OP)
yeah that tracks — the review overhead kills the time saving. how long does a typical review take you?
Zouden@reddit
Why do your responses all have em dashes?
GXWT@reddit
Are the people downvoting me for public forums being used in this manner...? Bleak.
Black_Magic100@reddit
The code you work on is novel? And you are using a high-level language like Python? Yea ok.. 😅
GXWT@reddit
What? I truly don't know if this is satire amongst the other unrest.
You cannot fathom using Python to do something that hasn't been done before?
Black_Magic100@reddit
Something so novel that an LLM can't write it is absolute nonsense. You act like LLMs can only regurgitate previous thoughts, but it's not that shallow. Yea it's not true AGI, but somebody can create a video of something that never existed before.
What the hell are you writing in Python that an AI can't help you? That is straight boomer talk.
OwnTension6771@reddit
The language of choice is not what he is saying the LLM cannot assist in, it is likely the domain. If you work in high impact classification spaces, your entire domain could be unknown to any LLM as the LLM has no access to this domain knowledge.
And yes, it is this shallow. It literally feeds off of existing data to "create" something that is nothing more than probability and retrieval at a scale beyond human capacity, not deep introspective synthesis.
GXWT@reddit
Oh my god this wasn't satire.
doocheymama@reddit
It takes a real talent to be so unpleasant. Bravo
GXWT@reddit
ta
AssociateEmotional11@reddit
Coming from a background in AI output evaluation and heavily utilizing LLMs for my own workflows, I completely resonate with your pain points. To answer your questions:
The most common class of bug I see is unsafe defaults: `DEBUG=True` left on, swallowing exceptions with `except Exception: pass`, missing CSRF tokens, or trusting frontend validation without server-side checks. It generates code that runs, but isn't safe.
Because of this exact frustration, and seeing that you're contemplating building something in this space, I actually started an open-source project to tackle this directly. It's called Pyneat.
Instead of just formatting, I'm currently migrating its core to Rust (using AST parsing) to create an automated "Structural Guardrail" specifically for AI-generated code. We're building out a feature to scan for exactly the types of logical/security flaws you mentioned, with the speed needed to handle massive, automated code outputs.
Since you have 20 years of network operations experience and are paranoid about production failures (rightfully so!), I’d honestly value your harsh technical feedback on the idea. I'm not selling anything either—it's an open-source repo. If you're interested in the problem space, maybe we could exchange notes on what specific strict rules an "AI code reviewer" tool actually needs to have?
Desperate_Crew1775@reddit (OP)
this is exactly the problem space i'm in too — would be good to compare notes. what's the repo link for Pyneat?
AssociateEmotional11@reddit
https://github.com/khanhnam-nathan/Pyneat
this is Pyneat's link sir, and I will support more languages in the future
Desperate_Crew1775@reddit (OP)
checked it out, interesting approach. mine is more focused on semantic/logic bugs that AST rules can't catch. would be good to talk, dm me
AssociateEmotional11@reddit
my project is still in beta so if you encounter any bugs just dm me via email or reddit
thanks for your attention!
Chroiche@reddit
Why are people spending time discussing with an llm in this thread, are they stupid?
Desperate_Crew1775@reddit (OP)
no one here is discussing with an LLM, rather they are discussing about LLMs
Chroiche@reddit
Your responses are all from an LLM...
GXWT@reddit
AutoModerator@reddit
Your submission has been automatically queued for manual review by the moderation team because it has been reported too many times.
Please wait until the moderation team reviews your post.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
aloobhujiyaay@reddit
not just you. i probably spend ~70% of the time reviewing / fixing and 30% actually generating. feels faster overall, but yeah nowhere near write once and ship
genman@reddit
AI is for when you don't know how to do it. If you can already write it it's nice to see what it comes up with but it usually is not exactly what you want.
Rafcdk@reddit
Absolutely not. No one should trust AI with something they can't comprehend themselves. My advice is to actually learn software development and coding and just avoid AI altogether. It will just hinder your personal development even if it were faster.
genman@reddit
It's not about trust. It's about having something to look at. It's not that different than copying an example out of a library's documentation.
Rafcdk@reddit
Library documentation doesn't hallucinate, and it is usually community driven. It's completely different. If something is hallucinated you can even get into a self-sustaining loop, and if you don't know how to do it, you don't know what is wrong and will be stuck in a fixing loop.
This video exemplifies one of those loops. Like yes, it's trivial for us to understand there is no S in chatGPT, but imagine someone learning the western alphabet from 0 with AI and not knowing.
https://youtu.be/ZMP8_jD-y0s
spuds_in_town@reddit
I disagree. If I am adding a new API endpoint, particularly with patterns following the numerous endpoints in our codebase, AI does a good job and much faster than I can do it. And me doing it myself adds no value, I'm not learning anything, I'm just doing it slower. That's just one example.
I'd also say it's very good at exploring new ideas, pointing out flaws, refactoring with minimum effort, compiling info taken from documentation or web searches. It can write good tickets, good PRs, provide decent comments on commits, and other things that I certainly can do, but it does it better and faster.
You certainly do need guard rails though; we have a lot of clanker guidance docs for adding api endpoints, use of postgres, redis, kafka etc, transaction management, coding style, common patterns, unit test writing and coverage, appropriate levels of detail for commits, tickets and PRs etc etc.
GiveMoreMoney@reddit
Your question came at the right time, as I am questioning everything about agentic coding.
I had a big project I was working on with over 100K lines of code. I had started the project without AI assistance, and then when I used LLMs I had to understand/learn a lot of new stuff and, in some cases, keep correcting the LLM for every class and every method. End result: fantastic, super correct code, virtually bug free.
That gave me the confidence to start another project, letting the LLMs do most of the work. Everything looked fine for a month, then I realised the original design needed refactoring, and that is not the LLMs' fault; the project grew.
Now I am finally paying more attention. The code is fully working and clean... but I would NEVER have written this kind of deceiving code myself. The amount of mistakes in design, the amount of hidden bugs, especially around multithreading, is astonishing.
In the first project, I had started it so the LLM was assisting me to move forward and perfect it. If it wasn’t for the LLM I would never have finished it, and even if I had finished it (in let’s say additional 2 years of work), it would never be as good. I can admit I would not have done it without LLMs.
But now I realised that it worked because I was pair programming with the LLMs. On the second project I see that I need another month to try to shape the code and overall design up to my standards and I should never have left the LLMs to do most of the work.
So the way I see it at the moment, if you let the LLM do the work and you just “look” at the code, it is not enough. That is, if you want to create good, maintainable code and overall design. If you just want to complete Jiras and then go home, LLMs are the perfect tool. When it all goes up in smoke next year, though, a lot of people are going to be searching for a new job and a different career from the looks of it.
If you use the LLM as a junior dev you are working with, and at the same time as an advisor for coding patterns etc. the results are amazing. No matter what the hype says, they cannot program, they can only code.
AssociateEmotional11@reddit
That quote "they cannot program, they can only code" from u/GiveMoreMoney is probably the most accurate description of the current LLM landscape I've read.
To answer your question about tooling: In my experience, traditional linters (Flake8, Pylint, even MyPy) are practically blind to these "deceiving" AI bugs. The LLM writes syntactically perfect Python, so the linters give it a green light. But structurally, it's a house of cards—especially with asyncio race conditions, improper thread locking, or silently swallowed exceptions just to make the execution pass.
Right now, catching these is almost entirely reliant on manual debugging and painful post-mortems.
This massive tooling gap is exactly what pushed me to start building Pyneat — an open-source AST analyzer. Standard regex-based linters won't cut it anymore. We need tools that can parse the actual syntax tree (which is why I'm migrating its core to Rust for speed) to specifically hunt for these AI-generated "vibe coding" anti-patterns. If the AI is acting as the junior dev writing the code, we need an automated "Senior Tech Lead" tool to ruthlessly audit the architecture before it hits production.
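To make the AST idea concrete, here is a rough sketch of what one such rule looks like using Python's stdlib `ast` module. This is not Pyneat's actual code, just an illustration of the technique (the sample snippet and function name are hypothetical):

```python
import ast

# A typical AI-generated snippet: runs fine, hides every failure.
GENERATED = '''
def fetch(url):
    try:
        return open(url).read()
    except Exception:
        pass
'''

def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of handlers whose body is a bare `pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                hits.append(node.lineno)
    return hits

print(find_swallowed_exceptions(GENERATED))  # [5]: the `except Exception:` line
```

Because the rule matches tree shapes rather than text, it is immune to formatting tricks that defeat regex-based linters.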
Desperate_Crew1775@reddit (OP)
the pair programming observation is spot on — the multithreading issues especially. that's the exact class of bug i keep seeing too. did you find any tooling that helped catch those or was it purely manual debugging?
GiveMoreMoney@reddit
The way I am dealing with that is to create a mini framework around controller objects' lifecycle and threading patterns, e.g. single thread, thread pool, scheduler thread. Then I stick to those patterns and watch for any deviations from the LLM.
There are no tools that can really give you validation on async issues that are logical bugs rather than trivial stuff. You have to be the one that thinks 10 steps ahead.
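The single-thread controller pattern mentioned above can be sketched with the stdlib like this (an illustrative class with hypothetical names, not the actual framework):

```python
import queue
import threading

class SingleThreadController:
    """One owner thread performs every state mutation, so callers
    never need locks. Work is submitted as closures via a queue."""

    def __init__(self):
        self._q = queue.Queue()
        self._t = threading.Thread(target=self._run, daemon=True)
        self._t.start()

    def _run(self):
        while True:
            fn, done = self._q.get()
            if fn is None:          # sentinel: shut down the worker
                break
            fn()
            done.set()

    def submit(self, fn):
        """Queue a closure for the owner thread; returns an Event to wait on."""
        done = threading.Event()
        self._q.put((fn, done))
        return done

    def stop(self):
        self._q.put((None, threading.Event()))
        self._t.join()

# usage: every mutation of `state` happens on the controller's thread
state = []
c = SingleThreadController()
c.submit(lambda: state.append(1)).wait()
c.stop()
print(state)  # [1]
```

The value for LLM review is that any generated code touching shared state outside `submit` is an immediately visible deviation.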
I also noticed it with some complex (state-wise) async SQL stored procs: if I ask Opus 4.6 to evaluate the project from a race-conditions perspective, it will give me 11 bugs... one of those bugs may be real, the rest are not. When I explain to it how things are supposed to work, it agrees with me that everything is fine. Of course, finding 1 single real bug is super valuable.
So funnily enough only LLMs can detect those, but you need to give them a good explanation of what you are after and then discuss with them the results. Then you will start discussing the possible solutions with them and come up with something great.
They create the problem, they help you fix the problem...
OwnTension6771@reddit
I find it easier to spend time developing, reviewing, and documenting my tests and then let the AI run (vibe) until it passes, then use a different model to assess and test.
And it goes without saying: the more time you invest up front on design plans, specs, standards, and conventions, the less likely the AI is to spit out complete nonsense later.
Certain-Business-472@reddit
I make it generate small snippets to incorporate it into my code. I don't make it generate entire changes on a codebase. Sometimes I'll ask it for alternative ways of doing something to understand it better, which I've found to have accelerated my learning about the language I'm working with.
My prompts are more "write me a fastapi endpoint to get customer data with proper pydantic models" which I will then modify to fit into existing code and less "write me a customer service application". During rewrites another model will suggest local changes on the fly which often corrects what the initial output got wrong.
I feel like the more you ask of it, the less quality you get. More mistakes, more rewrites, and just a net loss of time dealing with hallucinations and non-working code. It cannot seem to write correct Python async code or multithreaded code well enough to justify its existence. Wondering if others have the same experience in that department?
Mediocre-Movie-5812@reddit
ote it from scratch. It's like having a very fast but slightly distracted intern haha. I always double check the edge cases in Pydantic models because LLMs love to skip those sometimes.
jhonyrod@reddit
I just use AI to get a grip about unknown unknowns, meaning, if I want to dive into (the specifics of) a topic I know nothing or very little about I use it to try to gather best practices, agreed solutions… stuff like that. I might then ask for concrete examples, but I never use AI generated code directly; I need to understand the code I'm using thoroughly and as such 99% of the time I rewrite it myself, probably using the generated code as a suggestion. The other 1% I use the code directly only if I think that what I'd write would be practically the same and I don't feel like typing it out myself.
So yeah, I use AI as a knowledge assistant rather than a thinking assistant.
Alex--91@reddit
Extremely comprehensive test suites and very strong linting, formatting, and type checking seem to be the only effective methods at the minute (shift left).
In a lot of ways, using AI to generate Rust code is far more reliable and enjoyable than Python, just because the Rust compiler catches so many things earlier (shift left) and gives a great error message the model can then use to fix its original mistake.
There’s definitely different magnitudes of ROI on using AI depending on the codebase and on the feature/bug you’re working on. It’s great for little obvious bugs, assign to GitHub CoPilot directly in the UI when making the issue and let it solve it and add some test cases and then do a quick review and confirm that’s how you would have done it and then merge and move on. It also can be great for performance optimizations if you have great test cases and great profiling tooling - same input same output but now faster, easy to review. It’s not always great at new features if you end up spending as long reviewing the code as you would have writing it and if you need a lot of context and experience to know the common footguns etc.
The review bottleneck is a real thing. Making code fail earlier, before a human reviewer looks at it, is key.
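The "same input, same output, but faster" review pattern described above can be as simple as pinning behavior against a reference implementation (a minimal sketch; `dedupe` is a hypothetical function):

```python
def dedupe_reference(items):
    """Slow but obviously correct: first occurrence wins, order is kept."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def dedupe_optimized(items):
    """The AI-proposed 'faster' candidate under review."""
    return list(dict.fromkeys(items))

# the assertions catch any semantic drift the optimization introduced
for case in ([], [1, 1, 2], list("abca"), list(range(5)) * 3):
    assert dedupe_optimized(case) == dedupe_reference(case)
print("behavior preserved")
```

With the reference pinned, reviewing the optimization reduces to reading one small function and trusting the test suite, which is exactly the cheap-review case.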
stibbons_@reddit
If the harness is correctly configured around it, I do not look at the code anymore
jah_broni@reddit
This + proper planning is the answer to OP's pain points. You want it to do auth in a certain way for FastAPI? Put that in a configuration file. Use planning mode to talk through implementation details, edge cases, etc.
Use the tool properly and it works well, who would have thought.
Desperate_Crew1775@reddit (OP)
the config approach makes sense — do you version control those rules so the whole team uses them?
jah_broni@reddit
Yes when they are repo wide rules they are committed. But I can’t commit how to write good prompts and do a proper planning session.
cosmicomical23@reddit
Good for you
Desperate_Crew1775@reddit (OP)
what does your harness setup look like — pytest with coverage thresholds or something more custom?
spuds_in_town@reddit
Everything, carefully. Seems like any time I fail to pay attention, something slips through that I regret.
Desperate_Crew1775@reddit (OP)
async python + kafka is exactly where i'd be most paranoid too — any specific failure pattern you've hit more than once?
spuds_in_town@reddit
I find two main issues.
First, Claude sometimes needs to be reeled in. Often a problem it thinks it has exists because it made a bad design decision, and it then seems unable to correct that decision itself, just dealing with the consequences by generating stupid workarounds. I ask it why it did a thing, get it to describe the root cause, and help it to simplify.
Second, by default it goes overboard with unit tests, and doesn't always follow my instructions to only create meaningful tests, test behaviour not implementation etc. But I do find if I tell it to run an agent to compare what it generated against the unit test development guidelines, it does a good job of paring back the cruft.
netspherecyborg@reddit
I review every single letter the AI puts out. 0 gets past me without validation.
Desperate_Crew1775@reddit (OP)
reviewing every letter sounds exhausting — how long does a typical PR take you to review this way?
rosentmoh@reddit
It's hard enough reading other people's code and checking it, sure as hell not gonna spend time reading and checking some AI's code. Much prefer writing my own from scratch and taking full responsibility.
Acceptable_Durian868@reddit
I find the same. I spend the same amount of time on AI-generated code as I do writing it by hand. I've measured this. It doesn't increase my productivity.
Desperate_Crew1775@reddit (OP)
This is exactly what I'm trying to solve — the time saved writing is lost in reviewing. Since you've measured it, what does your review process actually look like? Do you have a checklist or is it just reading line by line?
m33-m33@reddit
People around me : Just enough time to tell the AI to write the commit message 😩
Desperate_Crew1775@reddit (OP)
Honestly that's the scary part — the code ships but nobody reviewed it. Has that caused real issues in your team?
ddxv@reddit
It depends what. My ai volume for frontend is huge and I often high-level check it.
For Python backend I'm careful and often force it to be very narrow in its changes.
For SQL I usually deconstruct and work through anything I don't understand. I always get burned by AI with SQL. Every time: wow, it one-shotted this! Hmm, interesting, it used LATERAL or some other concept I loosely understand. Then a month later I'm always debugging it for some subtle issue.
Desperate_Crew1775@reddit (OP)
The SQL example is exactly the pattern I'm worried about — looks perfect, subtle issue surfaces a month later. Do you have any automated check in your workflow or is it purely manual review?