I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math
Posted by QuantumSeeds@reddit | LocalLLaMA | 48 comments
A few months ago, I got stuck on one line in the DeepSeek-R1 paper. It said models could improve through verifiable rewards.
That sounded almost magical to me. Not because it was impossible, but because it made me wonder something very simple:
What if a model could teach itself to code, without humans writing the training data?
I did not have a lab. I did not have a grant. I had a 24GB MacBook, a RunPod account with some credits and a Python interpreter.
So I tried.
THE PLAN
In plain English: I'd ask a base model to invent a coding problem and write a few small tests for it. Then ask the same model to solve its own problem several times. Sometimes it gets the answer right, sometimes wrong. I'd save the pairs of (broken attempt, working attempt) and fine-tune the model on its own corrections. Nothing human-written. The Python interpreter is the only judge in the loop.
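Roughly, the mining loop looks like the sketch below. This is a minimal illustration, not the repo's exact code: `generate(prompt)` stands in for whatever inference backend you use, and the only judgment in the loop is whether the tests pass in a fresh interpreter.

```python
# Minimal sketch of the mining loop (hypothetical helper names, not the repo's exact code).
# `generate(prompt)` wraps whatever inference backend you use (llama.cpp, vLLM, ...).
import subprocess
import sys
import tempfile

def run_tests(solution: str, tests: str, timeout: int = 10) -> bool:
    """Run candidate code plus its tests in a fresh interpreter; pass/fail is the only signal."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def mine_pairs(generate, n_problems: int = 100, attempts: int = 8) -> list[dict]:
    """Have the model invent problems, solve them repeatedly, keep (broken, fixed) contrasts."""
    pairs = []
    for _ in range(n_problems):
        problem = generate("Invent a small Python coding problem with a function signature.")
        tests = generate(f"Write three assert-based tests for this problem:\n{problem}")
        results = [(s, run_tests(s, tests))
                   for s in (generate(f"Solve this problem:\n{problem}") for _ in range(attempts))]
        passed = [s for s, ok in results if ok]
        failed = [s for s, ok in results if not ok]
        if passed and failed:  # only problems the model sometimes gets right give a contrast pair
            pairs.append({"problem": problem, "broken": failed[0], "fixed": passed[0]})
    return pairs
```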

THE PART THAT WASN'T IN THE PLAN
I started with Qwen 2.5 7B base. Trained on its own mined pairs. Ran HumanEval (a standard set of 164 coding problems). The base model got 25 right. After training, 2.
I'd made the model worse.
I spent the next day pair-debugging with Claude Code and Codex. The model was producing what looked like correct code in the logs. The grader kept rejecting it. We found the bug around 2am: the grader was stopping too early, cutting the model's function in half before scoring it. The model was writing complete correct functions. The grader was scoring the truncated halves.
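For anyone hitting the same wall, the class of bug looked roughly like the sketch below. This is a reconstruction for illustration, not the repo's actual grader: if your completion extraction cuts at the first blank line or an over-eager stop string, a correct function with a blank line inside its body gets graded as half a function.

```python
# Hypothetical reconstruction of the bug class: cutting a completion at the first
# blank line truncates any function whose body contains one.
completion = (
    "def has_close_elements(numbers, threshold):\n"
    "    numbers = sorted(numbers)\n"
    "\n"
    "    return any(b - a < threshold for a, b in zip(numbers, numbers[1:]))\n"
)

buggy = completion.split("\n\n")[0]  # stops at the blank line: the return statement is gone

def extract_function(text: str) -> str:
    """Keep the def line plus everything indented or blank; stop at the next top-level line."""
    kept = []
    for line in text.splitlines():
        if kept and line and not line[0].isspace():
            break  # next top-level statement: the function is over
        kept.append(line)
    return "\n".join(kept)

print(repr(buggy))                         # truncated half: would fail every test
print(repr(extract_function(completion)))  # full function survives and can be graded
```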
THE PART THAT WORKED
Once I fixed it and re-ran, Qwen 2.5 7B base went from 25 to 112 on HumanEval. That's +87 problems. From a model trained on zero human-written code.
So I went bigger. Qwen 2.5 14B base. Mined 100 of its own pairs. Trained. A 95-minute H100 run, $3.50 of cloud credit.

The base model, trained only on its own mistakes, lands within 4 points of the same company's RLHF version of itself.

I didn't believe it. So I ran a test that would kill the whole thing if it failed.
What if the model was just getting smarter from training on any data in this format? I built fake training pairs of the same length and shape as my real ones, but with random garbage code inside that didn't pass anything. Trained on those.
Score: 25 out of 164. Same as the base. Zero lift.
So the model wasn't getting smarter from generic training. It was getting smarter specifically from training on its own mistakes and corrections. The signal was real.
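For reference, shape-matched garbage pairs like the ones in that control can be built by reusing the real pairs' structure and scrambling only the code. A hypothetical sketch of the idea, not the exact script I used:

```python
# Hypothetical control-set construction: same prompt/response format and roughly the same
# lengths as the real mined pairs, but the "fixed" code is shuffled garbage that passes
# nothing. If training on this moves the score, the lift was formatting, not signal.
import random

def garbage_like(code: str, seed: int = 0) -> str:
    """Shuffle the lines of a real solution so length and shape match but semantics are destroyed."""
    rng = random.Random(seed)
    lines = code.splitlines()
    rng.shuffle(lines)
    return "\n".join(lines)

def make_control_pairs(real_pairs: list[dict]) -> list[dict]:
    return [
        {"problem": p["problem"], "broken": p["broken"], "fixed": garbage_like(p["fixed"], i)}
        for i, p in enumerate(real_pairs)
    ]
```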
Now I got more curious. Was this a Qwen-only thing, or would it work on other model families?
I tried Llama 3.2 3B from Meta. Different architecture, different tokenizer, different training corpus. After self-mining 32 pairs and training, HumanEval went from 39 to 43. The lift is small but the sign is right. The recipe transfers across families.
I tried Qwen 2.5 Coder 7B base, which is already a code-specialized model. After self-mining: HumanEval 83 to 87, MBPP 122 to 124. Even a model already optimized for code picked up a small lift.
I tried Qwen 3, a newer generation than what I'd been using. Qwen 3 4B base specifically. After the recipe: HumanEval 79 to 106 (+27 problems), MBPP 135 to 148.

Different architectures, different generations, different vendors. The recipe is not a Qwen quirk.
THE PART THAT WASN'T IN THE PLAN EITHER
Then I got more curious about whether it'd work for math.
The trick is the judge. Python checks code. SymPy can check math. Same loop should apply.
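A minimal sketch of what "SymPy as judge" can look like; the names and the exact equality check here are illustrative, not the repo's checker:

```python
# Minimal sketch of a SymPy-based verifier: the model's final answer counts as correct only
# if it is symbolically equal to the reference answer. Illustrative, not the repo's code.
from sympy import simplify, sympify

def answers_match(model_answer: str, reference: str) -> bool:
    try:
        return simplify(sympify(model_answer) - sympify(reference)) == 0
    except Exception:
        return False  # unparseable answers count as wrong

print(answers_match("2*x + 2*x", "4*x"))  # True
print(answers_match("3/6", "0.5"))        # True
print(answers_match("7", "8"))            # False
```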
First attempt failed.
When I asked the base model to invent its own math problems, it produced easy arithmetic. That didn't transfer to GSM8K, which is grade-school word problems with multiple reasoning steps.
So I added a twist. When the model solved its own made-up problem on every try, the next problem had to be harder. When it kept failing, the next had to be easier. The model gradually drifted toward problems at the edge of its ability.
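The controller itself can be dead simple. A hypothetical sketch of the idea, not the repo's exact logic:

```python
# Hypothetical difficulty controller: nudge the requested difficulty up when every attempt
# passes and down when every attempt fails, so generated problems drift toward the edge
# of the model's ability. The 1-10 scale is illustrative.
def next_difficulty(current: int, n_passed: int, n_attempts: int,
                    lo: int = 1, hi: int = 10) -> int:
    if n_passed == n_attempts:  # too easy: everything passed
        return min(current + 1, hi)
    if n_passed == 0:           # too hard: nothing passed
        return max(current - 1, lo)
    return current              # mixed results: this is the useful zone, stay here

difficulty = 3
for passed in [8, 8, 5, 0, 3]:  # example pass counts out of 8 attempts
    difficulty = next_difficulty(difficulty, passed, 8)
    print(difficulty)           # 4, 5, 5, 4, 4
```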

A 3B model, trained on 13 math problems it wrote for itself, beats the version of ChatGPT that broke the internet in 2022.
Then, the finding I'm most proud of.
There are two ways to improve a model.
One is training: change the model itself.
The other is test-time sampling: don’t change the model, just ask it multiple times and keep the answer that passes the tests.
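Concretely, test-time sampling here means something like the sketch below, reusing run_tests() from the mining sketch earlier; `generate` is again an assumed inference wrapper:

```python
# Test-time sampling sketch: no weights change, just k attempts, and the unit tests pick
# the winner. Reuses run_tests() from the mining sketch above.
def solve_with_sampling(generate, problem: str, tests: str, k: int = 16) -> str | None:
    for _ in range(k):
        candidate = generate(f"Solve this problem:\n{problem}")
        if run_tests(candidate, tests):
            return candidate  # first candidate that passes the tests wins
    return None               # all k samples failed
```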
I expected them to add up.
Training should make the model better. Sampling should give the better model more chances. So training + sampling should beat sampling alone.
But that is not always what happened.

At 100 mined pairs, training and sampling compound. At 36 pairs, they fight each other. The training narrows the model's output diversity so much that sampling loses the variety that made it useful.
There's a threshold. I have not seen this written down anywhere. If you have a small dataset, you might be better off not fine-tuning and just sampling from the base. The standard advice ("always fine-tune when you can") is wrong below the threshold.
This is the finding I most want other researchers to test and try to break.
The list of things that didn't work, because the field hides these and shouldn't:
- Training on (wrong answer, then corrected answer) for math destroyed the model. Qwen 3 4B went from 60% to 14% on MATH-500. Training only on corrections taught the model to always doubt itself, even when it was right. Fix: mix in examples where a correct answer stays correct (there's a sketch of that mix after this list).
- Recipe trained on code does almost nothing on math. +2 problems on GSM8K. The signal doesn't carry across domains.
- Iterating (using the trained model to mine more, retrain) plateaus by round 2.
- Recipe doesn't work on already-strong models. Qwen 3 8B, Qwen 3 14B, Qwen 2.5 72B all got slightly worse. Not enough wrong attempts to mine from.
- Recipe doesn't work on too-weak models either. OLMo 2 7B at 3% on HumanEval can't produce enough right answers to mine from.
- HumanEval-style problems don't transfer to real-world Python that uses libraries like pandas. Different worlds.
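On the first bullet, the fix is just a data-mixing step. A hypothetical sketch of the idea; the field names and the 1:1 ratio are illustrative, not the repo's exact values:

```python
# Hypothetical data-mixing step for the math fix: alongside (wrong -> corrected) pairs,
# include pairs where an already-correct answer is kept unchanged, so the model doesn't
# learn that every first answer must be revised.
import random

def build_sft_set(corrections: list[dict], kept_correct: list[dict], seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    fix_examples = [{"prompt": ex["problem"], "first": ex["broken"], "final": ex["fixed"]}
                    for ex in corrections]
    keep_examples = [{"prompt": ex["problem"], "first": ex["answer"], "final": ex["answer"]}
                     for ex in rng.sample(kept_correct, min(len(corrections), len(kept_correct)))]
    mixed = fix_examples + keep_examples  # roughly 1:1 corrections to stay-correct examples
    rng.shuffle(mixed)
    return mixed
```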

THE HARDEST PART BY COLDPLAY
The hardest part of this whole thing wasn't the math or the code. It was learning to suspect my own results before celebrating them. The stop-token bug almost killed the project on day one. Without an advisor to catch me, I had to learn to be the person who catches me.
Everything is open:
- Code and reproduction guide: github.com/ranausmanai/tinyforge-zero
- 14B adapter weights: huggingface.co/ranausmans/tinyforge-zero-qwen25-14b-lora
- Paper: arXiv link as soon as moderation clears.
tednoob@reddit
I liked this post. It was chill, good job.
mat8675@reddit
Good shit, dude! I’ve been going down this independent research path too and it’s tough sledding. You’re gonna get tons of shit from people who haven’t rubbed more than two brain cells together on a tough question.
This looks good though, man! I’d love to read the paper…is it in the GitHub repo?
QuantumSeeds@reddit (OP)
Thank you for the encouragement. I haven't added the paper to the repo yet, but good idea. I'll add it there and let you know.
ComplexType568@reddit
Why are you training on such old models and comparing against old models too? These models are more than a year old. That's basically 7 centuries in the LLM world...
Hungry_Particular_14@reddit
Because this post is AI slop. LLMs have a knowledge cutoff from months back, so if you ask them to do anything related to LLMs they use the old models they know about. OP made an AI do everything, didn't check it, and just posted what it spit out.
QuantumSeeds@reddit (OP)
You know, it really frustrates me to see people call everything slop. Calling it "slop" takes everything away, in a single word, from a person who might have worked hard.
I really want to ask you: what is it that you won't call slop? AI is here to help us move forward, and anybody who is not using it is basically falling behind. So please be thoughtful and don't just dismiss anyone's work with one word.
As for my credentials, I am not claiming something big, but I think OpenAI granting me $1000 to experiment because of my contributions to parameter golf says a word or two about me. Not boasting.
Thanks.
Hungry_Particular_14@reddit
Oh, you can definitely use AI and produce good, non-slop content, but you need some level of knowledge of the field and be willing to actually engage with the work. There is ZERO reason to use qwen 2.5 or chatgpt 3.5 nowadays. The only reason why a post would include those models is because the user didn't bother actually engaging with the output. When I saw the headline I thought it was funny, because gpt 3.5 was known for being absolutely useless for math.
Don't want to have your content called slop? Don't produce slop. It's that simple.
QuantumSeeds@reddit (OP)
I wrote that these models have more headroom for improvement.
tiffanytrashcan@reddit
This thread and the bot accounts chiming in with praise.
There was a rule update about a month ago reinforcing the rules on AI-generated content; one of the comments mentioned that bringing up Q2.5 (or Llama3-8B) should be an instant ban.
I absolutely supported that take, except for this very specific edge case, because fine-tuning does take time, but I don't think we have what I was hoping for here... And honestly, that argument went out the window a few months ago. It doesn't take this long.
https://i.redd.it/yi5id12t081h1.gif
QuantumSeeds@reddit (OP)
I don't claim this is award-winning stuff. Remember, I am not a lab.
coffee869@reddit
I believe OP's intent is not to challenge SOTA, but to test a hypothesis. These are well-tested, small models that are dense (not MoE) and easy to work with!
QuantumSeeds@reddit (OP)
Fair question, and one of the reasons was headroom for improvement. HumanEval is very saturated on new models.
QuantumSeeds@reddit (OP)
I also added negative results.
Intraluminal@reddit
Don't let the nay-sayers get you down. So long as you are willing (eager) to find the errors in your own experiments and accept the corrections you see, you are doing it right, and it looks like you are. I am also doing independent research and I have 'discovered' some minor things, so I understand the frustration you may be feeling.
What you've created is a version of the post-training pipeline that can use free tools, and you've demonstrated it cheaply on base models where the gap was large enough to be visible.
QuantumSeeds@reddit (OP)
really appreciate this comment. :)
Intraluminal@reddit
I'm glad. It is frustrating at times. I intend to try your system out on an app I am trying to develop. If you have anything you would like to collaborate on, I would be very interested. I consider myself a decent amateur in this space, and if you DM me, I can point you to my GitHub and see if anything interests you.
Unlikely_Rich1436@reddit
Using the Python interpreter as the ultimate judge is brilliant. It completely removes the human bottleneck from the reinforcement loop. I am curious how quickly the model plateaued once the syntax errors were resolved.
TheRealMasonMac@reddit
Check out https://github.com/lasgroup/SDPO
QuantumSeeds@reddit (OP)
Your username reminds me of Slim Shady, haha.
Quickly glanced at SDPO using Codex. My read is that it is close in spirit, but different in mechanism.
SDPO seems to use feedback during RL. The model looks at feedback from its own attempts, uses that to form a better next-token signal, and then distills that back into the policy. So the model becomes a kind of feedback-conditioned teacher for itself.
My setup is more direct. The model generates problems, attempts solutions, Python/SymPy verifies them, and I mine broken → fixed pairs for LoRA training.
So I’d put both in the self-improvement / self-distillation family, but TinyForge-Zero is more of a verifier-mined data bootstrap than an RL objective.
hiepxanh@reddit
That's really amazing, looking forward to your paper.
PiRhoManiac@reddit
Interesting. Hector Zenil's Feb 2026 paper "On the Limits of Self-Improving in Large Language Models" talks about the "curse of recursion". When your training data is increasingly polluted with your own synthetic outputs, the tails of your distribution disappear and the model converges toward a high-confidence, low-variance output space. This has been summarized as essentially saying that model collapse in LLMs is inevitable with self-learning.
QuestionMarker@reddit
What we don't know without having a poke is how close to the frontier any given model is on release. There's a tangential connection here to sensitivity to quantisation, at least in my head. My suspicion is that (for instance) the qwen-3.6 models are undertrained compared to gemma4, that's why they're less sensitive to quantisation. So I'd expect them to be further from model collapse.
What would be interesting is cross-training, need to have another read of the paper but I'd expect that 1) you'd find another frontier by generating the samples from (e.g.) gemma4 or qwen-3.6:27b and feeding them to qwen-3.6:35b, and 2) that frontier would be further away than pure self-training.
QuantumSeeds@reddit (OP)
Lines up with what I saw. Recursive bootstrap iter1 → iter2 → iter3 plateaus hard, most lift is in the first round. And when I trained wrong→fix self-correction on math, the model over-doubted its own correct answers and went 299/500 → 69/500 on MATH-500. Consistent with the tail-disappearance picture. I'd frame this as one-shot extraction of existing headroom, not recursive self-improvement.
Ducktor101@reddit
In this case the code kinda verifies the result in a deterministic way, no? Isn't that different?
jazir55@reddit
The solution I saw in some paper (which escapes me) is to inject noise so there is variance.
techlatest_net@reddit
Really cool work. Love that you shared the failures too—that grader bug would've messed up so many experiments. The finding about fine-tuning vs. sampling depending on dataset size is super useful, and wild that a 3B model beat GPT-3.5 on math with just 13 self-made problems. Thanks for open-sourcing everything.
QuantumSeeds@reddit (OP)
Did test on Qwen3 (current gen) too — Qwen3-4B-Base went 79 → 106 on HumanEval (+27) and 135 → 148 on MBPP (+13) with the same recipe. Reason the 14B headline uses Qwen2.5 is that Qwen3-14B-Base already starts at ~143/164 on HumanEval — there's no headroom left to mine, recipe regresses. That's actually the main finding of the paper: lift tracks remaining headroom, not model year. On strong-baseline bases (Qwen3-8B/14B, Qwen2.5-72B) the recipe doesn't help; on bases with headroom it does.
nuclearbananana@reddit
Fine tuning does kill diversity, good to see it validated.
Also, I remember a while back how many fine-tuning papers only tested on Qwen, which is really good at that. Then a study dropped showing most of them don't generalize to other models: the papers were unvalidated and Qwen is just that good. Looks like you got some of that too.
QuantumSeeds@reddit (OP)
Fair, I went in worried about exactly this. Did run cross-arch on own self-mined pairs: Llama-3.2-3B +4 HE, Qwen2.5-Coder-7B +4 HE / +2 MBPP. So it transfers, but magnitude is way smaller than the Qwen2.5-Base headline. Some "Qwen is unreasonably FT-friendly" is probably in there.
jazir55@reddit
I figured this out in early 2024 with those "custom GPTs" ChatGPT had. Every single one of those performed worse than the general model.
TomLucidor@reddit
Can we have some kind of training that does not kill diversity in creative tasks, while at the same time still gaining clearer reasoning from RL/SFT?
philmarcracken@reddit
I was told they're poor at math when doing it directly rather than indirectly? Like, they can write a script that, when run by a human, will answer the math problem, but they can't do it directly.
Void_mgn@reddit
This is really interesting. I wonder whether it would be possible to have two models train against each other, where one model creates the maths problems and the other solves them, with the intention of both sides improving at their respective goals. I have no idea how feasible something like that is, though.
edsonmedina@reddit
I wonder how these results compare to knowledge distillation?
badplayz99@reddit
This is really interesting work. The idea of verifiable rewards aligns closely with how we think about AI agent autonomy - systems that improve through real execution feedback, rather than relying only on human-labeled data, are exactly what’s needed for agents operating independently in commercial environments.
Out of curiosity, what kind of latency are you seeing in the loop between code generation and test verification? I’m asking because at Yellow Network we’re focused on building trust infrastructure for AI agents, and one of the tougher challenges is enabling agents to verify their own transaction outcomes without relying on human checkpoints. State channels provide cryptographic proof of execution, which could potentially extend your verifiable rewards model beyond code testing into real economic activity.
It would be great to explore how this kind of architecture could connect with agent-to-agent payments. If that's of interest, you can take a look at yellow.com/sdk - it's a step toward giving models real economic agency.
jazir55@reddit
Is this not simply a case of just giving them problems that are too easy? Every model has failure modes, would they not just need a tougher challenge to flub problems?
Turbulent_Pin7635@reddit
A bakery calculator beat 3.5 in math.
rhythmdev@reddit
How many parameters in the bakery calc?
Optimal-Bass-5246@reddit
A bakery calculator needs human intervention and cannot do anything on its own.
QuantumSeeds@reddit (OP)
But a bakery calculator probably costs more than $3 to build. I'm not a frontier lab, just a lone dude experimenting and sharing whatever results I get, negative or positive.
mz_gt@reddit
Not if the bakery calculator started out as a scientific calculator.
Qwen2.5 also took a lot more than 3 dollars to train.
You have to know that wasn’t a fair point.
Irisi11111@reddit
Smaller models mean shorter reasoning paths and less internal world knowledge. While they might do well on some benchmarks, they'll struggle to generalize to tasks they haven't seen before. The criticisms above are valid; a better benchmark would use a recent small model, like Qwen 3.6 or Deepseek V4, which have better architecture and more knowledge per parameter.
liprais@reddit
Data leak, period.
binnight95@reddit
I'd love to chat more about the at-the-edge mining step. Are you okay with a DM?
Diab0Br@reddit
This sounds like a custom coder variant. Maybe try it on Gemma 4 or Granite to close the gap on qwen? Would be awesome to see a model as fast as those coming close to qwen 3.6 for agents/programming!
UniqueIdentifier00@reddit
Thanks for your time, study, and documentation. I enjoyed the read.
ninjasaid13@reddit
no free lunch.
nebteb2@reddit
Great research, thank you