Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%.
Posted by kalpitdixit@reddit | Python | View on Reddit | 30 comments
We had a problem with AI-generated tests. They'd look right - good structure, decent coverage, edge cases covered - but when we injected small bugs into the code, a third of them went undetected. The tests verified the code worked. They didn't verify what would happen if the code broke.
We wanted to measure this properly, so we set up an experiment. 27 Python functions from real open-source projects, each one mutated in small ways - < swapped to <=, + changed to -, return True flipped to return False, 255 nudged to 256. The score: what fraction of those injected bugs does the test suite actually catch?
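For anyone who wants to see the metric concretely, here's a toy harness in the same spirit. This is illustrative only - it mutates by naive text substitution rather than the AST mutation we actually used, and `is_adult` and both test suites are made up for the example:

```python
# Toy mutation-score harness. A mutant "survives" if the test suite still
# passes against it, and is "killed" if some assertion fails.

SRC = "def is_adult(age):\n    return age >= 18\n"

# (original, replacement) pairs in the spirit of the mutations in the post
MUTATIONS = [(">=", ">"), (">=", "<="), ("18", "19")]

def make_fn(src):
    ns = {}
    exec(compile(src, "<mutant>", "exec"), ns)
    return ns["is_adult"]

def score(test_suite):
    killed = 0
    for old, new in MUTATIONS:
        mutant = make_fn(SRC.replace(old, new, 1))
        try:
            test_suite(mutant)      # tests pass -> mutant survived
        except AssertionError:
            killed += 1             # a failing test kills the mutant
    return killed / len(MUTATIONS)

def weak_suite(fn):      # looks reasonable, never probes the boundary
    assert fn(30) is True
    assert fn(10) is False

def strong_suite(fn):    # mutation-aware: hits the 17/18 boundary
    assert fn(30) is True
    assert fn(10) is False
    assert fn(18) is True
    assert fn(17) is False

print(score(weak_suite))    # boundary mutants survive the weak suite
print(score(strong_suite))  # every mutant is caught
```

Same function, same mutants - the only difference is whether the tests sit on the boundaries the mutations move.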
A coding agent (Gemini Flash 3) with a standard "write thorough tests" prompt scored 0.63. Looks professional. Misses more than a third of bugs.
Then we pointed the same agent at research papers on test generation. It found a technique called mutation-aware prompting - from two papers, MuTAP (2023) and MUTGEN (2025).
The core idea: stop asking for "good tests." Instead, walk the function's AST, enumerate every operator, comparison, constant, and return value that could be mutated, then write a test to kill each mutation specifically.
The original MuTAP paper does this with a feedback loop - generate tests, run the mutant, check if it's caught, regenerate. Our agent couldn't execute tests during generation, so it adapted on its own: enumerate all mutations statically from the AST upfront, include the full list in the prompt, one pass. Same targeting, no execution required.

The prompt went from:

"Write thorough tests for `validate_ipv4`"

to:

"The comparison `<` on line 12 could become `<=`. The constant `0` on line 15 could become `1`. The `return True` on line 23 could become `return False`. Write a test that catches each one."
Score: 0.87. Same model, same functions, under $1 total API cost.
50 lines of Python for the AST enumeration. The hard part was knowing to do it in the first place. The agent always knew how to write targeted tests - it just didn't know what to target until it read the research.
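For a sense of what that enumeration looks like, here's a cut-down sketch. The mutation tables and message format below are assumptions for illustration, not our actual code:

```python
import ast

# Which comparison operators and binary operators to flag, and what each
# could be mutated into. (Illustrative subset, not an exhaustive table.)
MUTABLE_CMPS = {ast.Lt: ("<", "<="), ast.LtE: ("<=", "<"),
                ast.Gt: (">", ">="), ast.GtE: (">=", ">"),
                ast.Eq: ("==", "!="), ast.NotEq: ("!=", "==")}
MUTABLE_OPS = {ast.Add: ("+", "-"), ast.Sub: ("-", "+"),
               ast.Mult: ("*", "//")}

def enumerate_mutations(source):
    """Yield human-readable mutation sites to paste into the prompt."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            for op in node.ops:
                if type(op) in MUTABLE_CMPS:
                    a, b = MUTABLE_CMPS[type(op)]
                    yield f"comparison {a} on line {node.lineno} could become {b}"
        elif isinstance(node, ast.BinOp) and type(node.op) in MUTABLE_OPS:
            a, b = MUTABLE_OPS[type(node.op)]
            yield f"operator {a} on line {node.lineno} could become {b}"
        elif isinstance(node, ast.Constant) and isinstance(node.value, bool):
            yield f"constant {node.value} on line {node.lineno} could become {not node.value}"
        elif isinstance(node, ast.Constant) and isinstance(node.value, int):
            yield f"constant {node.value} on line {node.lineno} could become {node.value + 1}"

src = "def clamp(x):\n    if x > 255:\n        return 255\n    return x\n"
for m in enumerate_mutations(src):
    print(m)
```

The output lines go straight into the prompt as the list of mutations to kill.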
We used Paper Lantern to surface the papers - it's a research search tool for coding agents. This is one of 9 experiments we ran, all open source. Happy to share links in the comments if anyone wants to dig into the code or prompts.
Feeling_Ad_2729@reddit
13% residual is still notable — mostly from mutations that produce semantically-different-but-still-passing outputs, right? e.g. off-by-one that happens to match the test's example, or branch-swap that re-enters the same code path.
two things I've found orthogonal to mutation-aware prompting:
make the agent write tests for the NEGATIVE space first — what inputs should NOT return X. negative assertions catch the off-by-one class that positive assertions miss.
force a property-based test alongside example-based (hypothesis). agents default to 3-5 hand-picked examples; property-based forces them to think about invariants, which naturally surfaces the mutations they'd otherwise miss.
does your harness let you compose mutation-aware + property-based prompting, or does one wash out the other?
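rough illustration of the property-based idea using only the stdlib - hypothesis does this properly with strategies and shrinking via `@given`, and `clamp_byte` here is just a stand-in function:

```python
import random

# Hand-rolled stand-in for a hypothesis-style property test.

def clamp_byte(x):
    return max(0, min(255, x))

def check_property(prop, gen, trials=500, seed=0):
    """Check an invariant over many generated inputs, not 3-5 examples."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        assert prop(x), f"property failed for input {x!r}"
    return True

gen = lambda rng: rng.randint(-1000, 1000)

# Invariants instead of hand-picked examples: output always in range,
# and in-range inputs pass through unchanged (catches off-by-one mutants).
check_property(lambda x: 0 <= clamp_byte(x) <= 255, gen)
check_property(lambda x: clamp_byte(x) == x or x < 0 or x > 255, gen)
print("properties hold")
```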
Issueofinnocence@reddit
What is mutation testing?
wRAR_@reddit
Aha, so /u/paperlantern-ai is indeed named after a thing you guys are trying to promote.
paperlantern-ai@reddit
kinda... we're more trying to understand what's worth making that would help python users and software engineers in general. so if we create something that helps many users here - it'll help them and help guide us too.
wRAR_@reddit
I despise spam and I especially despise accounts that post bot comments in order to look less spammy, like you. I'm not interested in engaging with the stuff you are promoting.
slowpush@reddit
Because you are using a crappy agent.
Also, the workflow I use is to have 5.4/opus write the tests, have the other model review them, and feed the changes back in to the model.
kalpitdixit@reddit (OP)
I am using Opus 4.6, so I think the agent itself was probably the best available - I guess I should've mentioned it
slowpush@reddit
Your post literally says gemini flash.
paperlantern-ai@reddit
sorry - I should have clarified. the coding agent is opus 4.6 and when its job is to create some prompt for a production system, it creates a prompt for a gemini flash 3 api call
slowpush@reddit
Correct. Your entire architecture is flawed and this is just a post to promote your service.
paperlantern-ai@reddit
i think the architecture is very well set up, in fact. using opus 4.6 is perfect for code agents. For serving production use-cases, most teams use something like Flash 3 to serve their customers, not opus ....
slowpush@reddit
Not really. It’s about 3-6 months old.
Good luck with your service though! I did offer suggestions on my first comment.
RepresentativeFill26@reddit
AI-written tests make testing feel like quite the afterthought. TDD is, in my opinion, a bit of overkill, but I do agree that best practice should be writing tests, or at least thinking about what you are going to test, in advance.
aistranin@reddit
Fully agree. Tbh for me, TDD is still the only way to go fast with AI. Shared some thoughts on this here if someone is curious: The ONLY Way Python Coding with AI Works https://youtu.be/Mj-72y4Omik
kalpitdixit@reddit (OP)
I agree - TDD for me is sometimes too-much-work but I think generally a good idea... maybe with AI Coding Agents we should be writing more tests since it's easier to write them.
Here, what I found is that using the research backed test writing ideas from Paper Lantern made it trivially easy to improve the tests that the AI agent (Opus 4.6) was writing.
honcler@reddit
It would be a good idea in principle.
If society subsidized medical school in exchange for a few years in underserved areas, you would lower debt, widen access, and improve physician distribution. The catch is that the doctor shortage is not just controlled by med school admissions — it is also limited by residency slots, training capacity, and funding. So opening admissions alone would not solve it, but pairing cheaper education with service obligations and more residency positions would be a very strong model. On a side note, I was involved in a rear end crash at a red light in fall of 2021. I was at a red light and a drunk driver hit me from behind. The car was not drivable off scene and geico handled the tow of the car from the scene and I never saw it again. I sent I'm my title later and was paid out for the total loss. Fast forward to this last monday, someone came to my house and served me papers summoning me for a court case for an accident occurring December 2024, where me and the new apparent driver of the car are being sued for just shy of 10k. In the case it states the driver was driving with permission of me and was at fault for a variety of reason. I haven't seen the car in almost 5 years and hadn't for over 3 years at the time of the accident. I am going to have to send something into the court for dismissal but am curious if geico is at all responsible for this mismanagement of ownership and if i should be reaching out to request legal help from them, as well if they are liable to provide me that legal help.
RepresentativeFill26@reddit
What?
mr_frpdo@reddit
Looks like fest is an automated tool to help create the mutation tests
kalpitdixit@reddit (OP)
fest ?
mr_frpdo@reddit
https://github.com/sakost/fest
kalpitdixit@reddit (OP)
thanks for the pointer - i guess these are complementary - fest creates the mutation tests and here Paper Lantern (https://www.paperlantern.ai/code) helped create unit tests to catch those errors
Henry_old@reddit
ai tests always hallucinate coverage. mutation testing is the only way to verify real logic. 13 percent is still high but better than trust-me-bro code. ai is just lazy dev fuel
kalpitdixit@reddit (OP)
True - 13% is still high - just wanted to share that simply giving it access to papers through that tool gave a boost for \~free
Henry_old@reddit
papers won't stop hallucinations. mutation testing is the only real proof. stay skeptical of ai output
kalpitdixit@reddit (OP)
yes - i think this is in the same light of being skeptical of ai output - hence having some human-done, research-backed methods to tell the ai exactly what to do was helpful
really_not_unreal@reddit
"here is a spot where a bug might be. Test for that bug specifically" doing a good job at finding that specific bug doesn't seem all that impressive.
kalpitdixit@reddit (OP)
true - i think what the paper did though is highlight multiple places where the bug might be - so going from say hundreds of options to 10-20 options of where the bug might be. so ultimately, the tests that focus on those bug-location options have a higher chance of catching the bugs
j01101111sh@reddit
Maybe I just need more detail but I don't get how the AI is supposed to know the correct outputs if you're changing the functions beforehand. What am I missing? Can you walk through the < to <= swap?
kalpitdixit@reddit (OP)
yes - it's a bit convoluted how mutation testing is measured - but it's a powerful tool.
the idea is that several versions of the target function are created with small changes, and the tests are judged based on their ability to differentiate between them, i.e. the original correct function and the mutated versions.
so that, in the future, when another code edit changes the target function, any errors in it are caught by the tests.