TDD enforcement primitive for coding agents — Tests the agent literally can't modify (formal spec + 120-line Python impl, MIT)

Posted by S-J-Rau@reddit | LocalLLaMA | View on Reddit | 0 comments

TDD enforcement primitive for coding agents — Tests the agent literally can't modify (formal spec + 120-line Python impl, MIT)

Spent the last months building AI agent frameworks locally after running into

the same failure mode over and over: the agent writes code, tests fail, and

instead of fixing code it "fixes" the test. AutoGen, LangChain, CrewAI — none

of them prevent this structurally. It's not a model-quality issue.

So I extracted a standalone primitive from the larger framework I've been

validating and wrote it up with formal invariants. Diagram attached.

Four primitives:

1. Blueprint Layer — plans tests in a context the agent can't see

2. Test Queue — append-only, ordered

3. TestLock — SHA-256 seal at commit time; agent gets the hash, never the source

4. Gate Condition — code gen only if a sealed test is RED

Three invariants (falsifiable):

I₁ TEMPORAL: commit(t) < generate(c)

I₂ STRUCTURAL: locked(t) → ¬modifiable(agent, t)

I₃ BEHAVIORAL: generate(c) iff ∃ t: locked(t) ∧ fails(t)

Feedback loop: Test fails → agent retries CODE, never tests. Tests are reality.

Works with any local LLM — I've tested the larger framework across 9 actors

(Qwen, DeepSeek, Claude, Codex, Gemini, Nemotron) with score variance under

15 points, which is the actor-exchangeability claim. Reference impl is Python

stdlib only.

Paper: https://doi.org/10.5281/zenodo.19393854 (STP standalone)

Context: https://doi.org/10.5281/zenodo.19378044 (Triple-A Thesis)

Code: https://github.com/SebazzProductions/sealed-test-paradigm.git

Happy to answer questions about the invariants or why this matters more for

local/autonomous agents than cloud ones.