Agent-written tests missed 37% of injected bugs. Mutation-aware prompting dropped that to 13%.

Posted by kalpitdixit@reddit | Python

We had a problem with AI-generated tests. They'd look right - good structure, decent coverage, edge cases covered - but when we injected small bugs into the code, more than a third of those bugs went undetected. The tests verified that the code worked. They didn't verify what would happen if the code broke.

We wanted to measure this properly, so we set up an experiment: 27 Python functions from real open-source projects, each one mutated in small ways - < swapped to <=, + changed to -, return True flipped to return False, 255 nudged to 256. The metric is the standard mutation score: what fraction of those injected bugs does the test suite actually catch?
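The scoring loop itself is simple. Here's a minimal, self-contained sketch of the idea - the helper names (`apply_mutation`, `mutation_score`) and the `exec`-based harness are illustrative, not the experiment's actual runner:

```python
def apply_mutation(source: str, old: str, new: str) -> str:
    """Produce one mutant by swapping the first occurrence of a token."""
    return source.replace(old, new, 1)

def mutation_score(source, mutations, run_tests) -> float:
    """Fraction of mutants killed, i.e. mutants on which the tests fail."""
    killed = 0
    for old, new in mutations:
        ns = {}
        exec(apply_mutation(source, old, new), ns)  # load the mutated code
        try:
            run_tests(ns)
        except AssertionError:
            killed += 1  # a failing test means the mutant was caught
    return killed / len(mutations)

SOURCE = "def is_adult(age):\n    return age >= 18\n"
MUTATIONS = [(">=", ">"), ("18", "19")]

def weak_tests(ns):
    assert ns["is_adult"](30) is True      # far from the boundary

def strong_tests(ns):
    assert ns["is_adult"](18) is True      # kills >= -> >
    assert ns["is_adult"](17) is False     # kills 18 -> 19

print(mutation_score(SOURCE, MUTATIONS, weak_tests))    # 0.0: both mutants survive
print(mutation_score(SOURCE, MUTATIONS, strong_tests))  # 1.0: both killed
```

The weak test passes on the original code and looks reasonable, but never exercises the boundary - exactly the failure mode described above.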

A coding agent (Gemini Flash 3) with a standard "write thorough tests" prompt scored 0.63. Looks professional. Misses more than a third of bugs.

Then we pointed the same agent at research papers on test generation. It found a technique called mutation-aware prompting - from two papers, MuTAP (2023) and MUTGEN (2025).

The core idea: stop asking for "good tests." Instead, walk the function's AST, enumerate every operator, comparison, constant, and return value that could be mutated, then write a test to kill each mutation specifically.

The original MuTAP paper does this with a feedback loop - generate tests, run the mutant, check if it's caught, regenerate. Our agent couldn't execute tests during generation, so it adapted on its own: enumerate all mutations statically from the AST upfront, include the full list in the prompt, one pass. Same targeting, no execution required.
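The static enumeration step can be done entirely with the stdlib `ast` module. A hedged sketch - the mutation sites it reports (comparison operators, binary operators, numeric constants, bare True/False returns) follow the list above, but the names and exact wording are illustrative, not the experiment's 50-line version:

```python
import ast

# One candidate replacement per operator class (illustrative subset).
FLIP = {ast.Lt: "<=", ast.LtE: "<", ast.Gt: ">=", ast.GtE: ">",
        ast.Eq: "!=", ast.NotEq: "==", ast.Add: "-", ast.Sub: "+",
        ast.Mult: "/", ast.Div: "*"}

def enumerate_mutations(source: str) -> list[str]:
    """Walk the AST and describe one candidate mutation per site."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            for op in node.ops:
                if type(op) in FLIP:
                    sites.append(f"line {node.lineno}: comparison "
                                 f"{ast.unparse(node)!r} could use {FLIP[type(op)]}")
        elif isinstance(node, ast.BinOp) and type(node.op) in FLIP:
            sites.append(f"line {node.lineno}: operator in {ast.unparse(node)!r} "
                         f"could become {FLIP[type(node.op)]}")
        elif (isinstance(node, ast.Constant)
              and isinstance(node.value, (int, float))
              and not isinstance(node.value, bool)):  # bools are ints in Python
            sites.append(f"line {node.lineno}: constant {node.value} "
                         f"could become {node.value + 1}")
        elif (isinstance(node, ast.Return)
              and isinstance(node.value, ast.Constant)
              and isinstance(node.value.value, bool)):
            sites.append(f"line {node.lineno}: return {node.value.value} "
                         f"could become {not node.value.value}")
    return sites

src = """\
def validate_byte(n):
    if n < 0:
        return False
    return n <= 255
"""
for site in enumerate_mutations(src):
    print(site)
```

For `validate_byte` this reports the two comparisons, the constants 0 and 255, and the `return False` - the full hit list the prompt needs, without ever running the code.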

The prompt went from:

"Write thorough tests for validate_ipv4"

to:

"The comparison < on line 12 could become <=. The constant 0 on line 15 could become 1. The return True on line 23 could become False. Write a test that catches each one."
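Turning the enumerated list into that one-pass prompt is a string-formatting exercise. A sketch, assuming a list of human-readable site descriptions - the wording and helper name are assumptions, not the experiment's exact prompt:

```python
def build_prompt(func_name: str, mutation_sites: list[str]) -> str:
    """Assemble a mutation-aware test-generation prompt from enumerated sites."""
    bullet_list = "\n".join(f"- {site}" for site in mutation_sites)
    return (
        f"Write pytest tests for {func_name}. Each of the following "
        f"single-token mutations must be caught by at least one failing test:\n"
        f"{bullet_list}\n"
        f"For every mutation, write an assertion that distinguishes the "
        f"original behavior from the mutated behavior."
    )

print(build_prompt("validate_ipv4", [
    "the comparison < on line 12 could become <=",
    "the constant 0 on line 15 could become 1",
    "the return True on line 23 could become False",
]))
```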

Score: 0.87. Same model, same functions, under $1 total API cost.

50 lines of Python for the AST enumeration. The hard part was knowing to do it in the first place. The agent always knew how to write targeted tests - it just didn't know what to target until it read the research.

We used Paper Lantern to surface the papers - it's a research search tool for coding agents. This is one of 9 experiments we ran, all open source. Happy to share links in the comments if anyone wants to dig into the code or prompts.