Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%.
Posted by kalpitdixit@reddit | Python | View on Reddit | 30 comments
We had a problem with AI-generated tests. They'd look right - good structure, decent coverage, edge cases covered - but when we injected small bugs into the code, a third of them went undetected. The tests verified the code worked. They didn't verify what would happen if the code broke.
We wanted to measure this properly, so we set up an experiment. 27 Python functions from real open-source projects, each one mutated in small ways - < swapped to <=, + changed to -, return True flipped to return False, 255 nudged to 256. The score: what fraction of those injected bugs does the test suite actually catch?
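For anyone who wants to see the metric concretely, here's a toy harness in the same spirit. This is illustrative only - it mutates by naive text substitution rather than the AST mutation we actually used, and `is_adult` and both test suites are made up for the example:

```python
# Toy mutation-score harness. A mutant "survives" if the test suite still
# passes against it, and is "killed" if some assertion fails.

SRC = "def is_adult(age):\n    return age >= 18\n"

# (original, replacement) pairs in the spirit of the mutations in the post
MUTATIONS = [(">=", ">"), (">=", "<="), ("18", "19")]

def make_fn(src):
    ns = {}
    exec(compile(src, "<mutant>", "exec"), ns)
    return ns["is_adult"]

def score(test_suite):
    killed = 0
    for old, new in MUTATIONS:
        mutant = make_fn(SRC.replace(old, new, 1))
        try:
            test_suite(mutant)      # tests pass -> mutant survived
        except AssertionError:
            killed += 1             # a failing test kills the mutant
    return killed / len(MUTATIONS)

def weak_suite(fn):      # looks reasonable, never probes the boundary
    assert fn(30) is True
    assert fn(10) is False

def strong_suite(fn):    # mutation-aware: hits the 17/18 boundary
    assert fn(30) is True
    assert fn(10) is False
    assert fn(18) is True
    assert fn(17) is False

print(score(weak_suite))    # boundary mutants survive the weak suite
print(score(strong_suite))  # every mutant is caught
```

Same function, same mutants - the only difference is whether the tests sit on the boundaries the mutations move.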
A coding agent (Gemini Flash 3) with a standard "write thorough tests" prompt scored 0.63. Looks professional. Misses more than a third of bugs.
Then we pointed the same agent at research papers on test generation. It found a technique called mutation-aware prompting - from two papers, MuTAP (2023) and MUTGEN (2025).
The core idea: stop asking for "good tests." Instead, walk the function's AST, enumerate every operator, comparison, constant, and return value that could be mutated, then write a test to kill each mutation specifically.
The original MuTAP paper does this with a feedback loop - generate tests, run the mutant, check if it's caught, regenerate. Our agent couldn't execute tests during generation, so it adapted on its own: enumerate all mutations statically from the AST upfront, include the full list in the prompt, one pass. Same targeting, no execution required.

The prompt went from:

"Write thorough tests for `validate_ipv4`"

to:

"The comparison `<` on line 12 could become `<=`. The constant `0` on line 15 could become `1`. The `return True` on line 23 could become `return False`. Write a test that catches each one."
Score: 0.87. Same model, same functions, under $1 total API cost.
50 lines of Python for the AST enumeration. The hard part was knowing to do it in the first place. The agent always knew how to write targeted tests - it just didn't know what to target until it read the research.
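For a sense of what that enumeration looks like, here's a cut-down sketch. The mutation tables and message format below are assumptions for illustration, not our actual code:

```python
import ast

# Which comparison operators and binary operators to flag, and what each
# could be mutated into. (Illustrative subset, not an exhaustive table.)
MUTABLE_CMPS = {ast.Lt: ("<", "<="), ast.LtE: ("<=", "<"),
                ast.Gt: (">", ">="), ast.GtE: (">=", ">"),
                ast.Eq: ("==", "!="), ast.NotEq: ("!=", "==")}
MUTABLE_OPS = {ast.Add: ("+", "-"), ast.Sub: ("-", "+"),
               ast.Mult: ("*", "//")}

def enumerate_mutations(source):
    """Yield human-readable mutation sites to paste into the prompt."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            for op in node.ops:
                if type(op) in MUTABLE_CMPS:
                    a, b = MUTABLE_CMPS[type(op)]
                    yield f"comparison {a} on line {node.lineno} could become {b}"
        elif isinstance(node, ast.BinOp) and type(node.op) in MUTABLE_OPS:
            a, b = MUTABLE_OPS[type(node.op)]
            yield f"operator {a} on line {node.lineno} could become {b}"
        elif isinstance(node, ast.Constant) and isinstance(node.value, bool):
            yield f"constant {node.value} on line {node.lineno} could become {not node.value}"
        elif isinstance(node, ast.Constant) and isinstance(node.value, int):
            yield f"constant {node.value} on line {node.lineno} could become {node.value + 1}"

src = "def clamp(x):\n    if x > 255:\n        return 255\n    return x\n"
for m in enumerate_mutations(src):
    print(m)
```

The output lines go straight into the prompt as the list of mutations to kill.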
We used Paper Lantern to surface the papers - it's a research search tool for coding agents. This is one of 9 experiments we ran, all open source. Happy to share links in the comments if anyone wants to dig into the code or prompts.
Feeling_Ad_2729@reddit
13% residual is still notable — mostly from mutations that produce semantically-different-but-still-passing outputs, right? e.g. off-by-one that happens to match the test's example, or branch-swap that re-enters the same code path.
two things I've found orthogonal to mutation-aware prompting:
make the agent write tests for the NEGATIVE space first — what inputs should NOT return X. negative assertions catch the off-by-one class that positive assertions miss.
force a property-based test alongside example-based (hypothesis). agents default to 3-5 hand-picked examples; property-based forces them to think about invariants, which naturally surfaces the mutations they'd otherwise miss.
does your harness let you compose mutation-aware + property-based prompting, or does one wash out the other?
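rough illustration of the property-based idea using only the stdlib - hypothesis does this properly with strategies and shrinking via `@given`, and `clamp_byte` here is just a stand-in function:

```python
import random

# Hand-rolled stand-in for a hypothesis-style property test.

def clamp_byte(x):
    return max(0, min(255, x))

def check_property(prop, gen, trials=500, seed=0):
    """Check an invariant over many generated inputs, not 3-5 examples."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        assert prop(x), f"property failed for input {x!r}"
    return True

gen = lambda rng: rng.randint(-1000, 1000)

# Invariants instead of hand-picked examples: output always in range,
# and in-range inputs pass through unchanged (catches off-by-one mutants).
check_property(lambda x: 0 <= clamp_byte(x) <= 255, gen)
check_property(lambda x: clamp_byte(x) == x or x < 0 or x > 255, gen)
print("properties hold")
```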
Issueofinnocence@reddit
What is mutation testing?
wRAR_@reddit
Aha, so /u/paperlantern-ai is indeed named after a thing you guys are trying to promote.
paperlantern-ai@reddit
kinda... we're more trying to understand what's worth making that would help python users and software engineers in general. so if we create something that helps many users here - it'll help them and help guide us too.
wRAR_@reddit
I despise spam and I especially despise accounts that post bot comments in order to look less spammy, like you. I'm not interested in engaging with the stuff you are promoting.
slowpush@reddit
Because you are using a crappy agent.
Also, the workflow I use is to have 5.4/opus write the tests, have the other model review them, and feed the changes back in to the model.
kalpitdixit@reddit (OP)
I am using Opus 4.6, so I think the agent itself was probably the best available - I guess I should've mentioned it
slowpush@reddit
Your post literally says gemini flash.
paperlantern-ai@reddit
sorry - I should have clarified. the coding agent is opus 4.6 and when its job is to create some prompt for a production system, it creates a prompt for a gemini flash 3 api call
slowpush@reddit
Correct. Your entire architecture is flawed and this is just a post to promote your service.
paperlantern-ai@reddit
i think the architecture is very well set up, in fact. using opus 4.6 is perfect for code agents. For serving production use-cases, most teams use something like Flash 3 to serve their customers, not opus ....
slowpush@reddit
Not really. It’s about 3-6 months old.
Good luck with your service though! I did offer suggestions on my first comment.
RepresentativeFill26@reddit
AI-written tests make testing feel like quite the afterthought. TDD is, in my opinion, a bit of overkill, but I do agree that best practice should be writing tests, or at least thinking about what you are going to test, in advance.
aistranin@reddit
Fully agree. Tbh for me, TDD is still the only way to go fast with AI. Shared some thoughts on this here if someone is curious: The ONLY Way Python Coding with AI Works https://youtu.be/Mj-72y4Omik
kalpitdixit@reddit (OP)
I agree - TDD for me is sometimes too-much-work but I think generally a good idea... maybe with AI Coding Agents we should be writing more tests since it's easier to write them.
Here, what I found is that using the research backed test writing ideas from Paper Lantern made it trivially easy to improve the tests that the AI agent (Opus 4.6) was writing.
honcler@reddit
It would be a good idea in principle.
If society subsidized medical school in exchange for a few years in underserved areas, you would lower debt, widen access, and improve physician distribution. The catch is that the doctor shortage is not just controlled by med school admissions — it is also limited by residency slots, training capacity, and funding. So opening admissions alone would not solve it, but pairing cheaper education with service obligations and more residency positions would be a very strong model. On a side note, I was involved in a rear end crash at a red light in fall of 2021. I was at a red light and a drunk driver hit me from behind. The car was not drivable off scene and geico handled the tow of the car from the scene and I never saw it again. I sent I'm my title later and was paid out for the total loss. Fast forward to this last monday, someone came to my house and served me papers summoning me for a court case for an accident occurring December 2024, where me and the new apparent driver of the car are being sued for just shy of 10k. In the case it states the driver was driving with permission of me and was at fault for a variety of reason. I haven't seen the car in almost 5 years and hadn't for over 3 years at the time of the accident. I am going to have to send something into the court for dismissal but am curious if geico is at all responsible for this mismanagement of ownership and if i should be reaching out to request legal help from them, as well if they are liable to provide me that legal help.
RepresentativeFill26@reddit
What?
mr_frpdo@reddit
Looks like fest is an automated tool to help create the mutation tests
kalpitdixit@reddit (OP)
fest ?
mr_frpdo@reddit
https://github.com/sakost/fest
kalpitdixit@reddit (OP)
thanks for the pointer - i guess these are complementary - fest creates the mutation tests and here Paper Lantern (https://www.paperlantern.ai/code) helped create unit tests to catch those errors
Henry_old@reddit
ai tests always hallucinate coverage. mutation testing is the only way to verify real logic. 13 percent is still high but better than trust-me-bro code. ai is just lazy dev fuel
kalpitdixit@reddit (OP)
True - 13% is still high - just wanted to share that simply giving it access to papers through that tool gave a boost for \~free
Henry_old@reddit
papers won't stop hallucinations. mutation testing is the only real proof. stay skeptical of ai output
kalpitdixit@reddit (OP)
yes - i think this is in the same light of being skeptical of ai output - hence having some human-done, research-backed methods to tell the ai exactly what to do was helpful
really_not_unreal@reddit
"here is a spot where a bug might be. Test for that bug specifically" doing a good job at finding that specific bug doesn't seem all that impressive.
kalpitdixit@reddit (OP)
true - i think what the paper did though is highlight multiple places where the bug might be - so going from say hundreds of options to 10-20 options of where the bug might be. so ultimately, the tests that focus on those bug-location options have a higher chance of catching the bugs
j01101111sh@reddit
Maybe I just need more detail but I don't get how the AI is supposed to know the correct outputs if you're changing the functions beforehand. What am I missing? Can you walk through the < to <= swap?
kalpitdixit@reddit (OP)
yes - it's a bit convoluted how mutation testing is measured - but it's a powerful tool.
the idea is that several versions of the target function are created with small changes, and the tests are judged based on their ability to differentiate between them, i.e. the original correct function and the mutated versions.
so that, in the future, when another code edit changes the target function, any errors in it are caught by the tests.