Apple: Embarrassingly Simple Self-Distillation Improves Code Generation
Posted by Mike_mi@reddit | LocalLLaMA | View on Reddit | 61 comments
JackLikesDev@reddit
I love these charts. How do they make such beautiful charts?
m0j0m0j@reddit
There was other research showing that LLMs actually get dumber when fed their own content back. How is that contradiction resolved with this new article?
Bakoro@reddit
We are past the inflection point where models are "good enough" that they can put out work, and as long as there is anything like ground truth, the models get a tiny bit better, just by tightening up their existing distributions.
With coding, you can often get high-quality deterministic feedback which tells you exactly what the problem is, you can get benchmarks, you can get performance reports, and you can keep building increasingly complicated things while remaining in deterministically-verifiable, scorable territory.
That means a fully automated process where there doesn't have to be a human in the training loop, and no more human data is needed.
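The "deterministic feedback" idea above can be sketched in a few lines: execute a generated solution against known test cases and score it pass/fail, with no human in the loop. Everything here (the `solve` candidates, the test cases) is a made-up toy example, not anything from the paper.

```python
# Hypothetical sketch: scoring model-generated code with deterministic
# feedback (unit tests). Candidates and tests are made up for illustration.

def run_candidate(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    """Exec a candidate `solve` function and check it against test cases."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # compile and define solve()
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False  # crashes and bad definitions count as failures

# Two made-up candidates for "return the square of x"
good = "def solve(x):\n    return x * x"
bad  = "def solve(x):\n    return x + x"
tests = [((3,), 9), ((0,), 0), ((-2,), 4)]

print(run_candidate(good, tests))  # True
print(run_candidate(bad, tests))   # False
```

The pass/fail bit is exactly the kind of ground-truth signal that doesn't need a human annotator.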
Due-Memory-6957@reddit
That's just a myth that people on Reddit who don't understand anything about LLMs spread as cope for their anti-AI tendencies. The reality is that AI has been trained on AI data since at least Llama 2, and models have only improved from doing so.
damhack@reddit
The reality is that there are hundreds of thousands of contractors working for Scale Labs and its subsidiaries (like Outlier) manually annotating and providing reasoning traces based on AI-generated prompts and responses. The idea that LLMs are trained on synthetic data they generated themselves is only the visible half of the story. LLM pre- and post-training is still dependent on the Mechanical Turk principle from the early days of LLMs. SOTA LLMs still need datasets of curated information. The industry's dirty little (not so) secret.
Due-Memory-6957@reddit
Actually, Deepseek did that, and it's one of the reasons American companies whined about them being unsafe while asking for censorship.
damhack@reddit
Yeah, there was some hypocrisy in US companies calling out Deepseek when they themselves are the biggest users of Scale Labs’ curated datasets for RL post-training.
__some__guy@reddit
Since Llama 2, the creative writing ability of LLMs has been completely stagnant, and often gotten worse.
Synthslopping increases benchmark scores and knowledge recall.
It doesn't make them any smarter.
Ryoonya@reddit
LOL, nah, opus 4.6 writes more creatively than any legacy model.
__some__guy@reddit
Well, I mean creativity per parameter.
I can imagine Claude writes better when it is 10x bigger than Goliath.
That's just brute forcing it.
Due-Memory-6957@reddit
Go check your old logs with OG Llama, or even better, spin it up and use it. You're suffering from a malignant mental disorder called nostalgia.
Orolol@reddit
Because this is RL, not classic training. You don't train on your own data, you train on the reward signal from your own data.
TheRealMasonMac@reddit
Yes and no. LLMs perform better based on certain structural patterns unique to them compared to how humans output data. Training a model on human-written reasoning performs no better than the non-reasoning baseline model.
But you have to curate the data, so the model will end up learning a different distribution than its existing distribution.
arg_max@reddit
There's a big difference between pre-training on some random generated trash and training after filtering for high quality.
LLMs don't magically get dumber when trained on AI-generated content. Rejection sampling and distillation have been an absolute staple for years. A big reason why Chinese labs are so good is that they distilled at massive scale from Anthropic (see Anthropic's blog post for more info). In large-scale pre-training, we've also had some recent papers showing that rewriting the data and training on both the rewrites and the originals can help extend the data horizon, since huge models are increasingly limited by data scarcity.
The real issue is that when you scrape the web, there's a big chance you encounter shitty generations from old models that are much lower quality than what we can generate nowadays.
But when you can filter for the good data, you can absolutely improve the model by training on synthetic data.
The_frozen_one@reddit
They aren’t feeding content back, they are selectively training the best possible tokens based on a heuristic that seemingly works.
At each token selection, the model is pointing to a location in a very high-dimensional space. Imagine you follow directions in Home Depot to find a tool I've asked you to get for me: you arrive at the correct aisle and the correct spot in that aisle, but it's stocked by "Jorvick Assemblies," whose selection of tools makes no intuitive sense to you. It sounds like they are optimizing the shelves for people who are just going to reach out and grab one of the 5 closest tools. Of course there's still some intentional randomness in the process (you might be taller or shorter, so "closest" can mean different things), so it's not about optimizing for one right answer but for a set of good answers (without being boring and converging on a single one).
And because of the way token generation actually works, improving selection means later choices will be better as well.
At least that's my pre-coffee understanding of it.
HorriblyGood@reddit
From reading the abstract, they are using their own model’s output (self distillation) which is different from just feeding other random LLMs output as training data.
Through the lens of on-policy/off-policy RL, I'm guessing that in their case, since it's using the model's own outputs, it's on-policy, so it's getting learning signals from itself to be more precise on coding tasks but more creative on writing tasks. It doesn't have to change how it works or thinks to match other LLMs' outputs.
My intuition is that it's kinda like learning to code by copying other people's code, versus having someone show you what's wrong with your own code so you can learn to improve.
Majinsei@reddit
I think it's more of a pipeline that takes the code, has the same model analyze it, find errors, etc., so it can correct itself~
Which produces a dataset like: create a sudoku
A: This code has flaws in..., it doesn't compile because of x, and it's confusing z. The improved code is:
This lets the same model think of ways to improve itself~ It's like a writer who takes a break from what they wrote and the next day sits down to edit it~ finding thousands of mistakes they decide to polish~ improving the novel with the same talent~
This is something we all do a lot: you ask for the code, it builds the working MVP, and then I put it through rounds of improvements for code design, organization, bugs, security, etc.~
FoxTimes4@reddit
They did mention it, and as best I can understand, it's because the problem has "forks" that allow the model to explore more.
Thrumpwart@reddit
I believe this method lets an LLM learn why a rollout was good or bad, thus offering a better negative reward signal. I may be way off.
davikrehalt@reddit
Depending on the loss, I GUESS (?) training on your own outputs probably either doesn't change the distribution at all or just sharpens it?
SlopTopZ@reddit
The approach here is elegant — using the model's own correct solutions as training signal rather than requiring external teachers or complex reward models. Self-distillation at this level essentially lets the model bootstrap quality from its own distribution. The fact that it's "embarrassingly simple" is the best part, because it means it's straightforward to apply on top of existing open models. Would love to see this combined with Qwen3.5 or Gemma 4 fine-tunes to see how much headroom there still is on coding benchmarks.
Odd-Ordinary-5922@reddit
imagine the community works together on this and gets a huge dataset of ssd responses and we train a monster of a model like qwen3.5 27b
DigiDecode_@reddit
For the proposed method, you need the original data that was used to train the model, so this new dataset would be sprinkled on top of the original dataset; on its own, this dataset would likely cause the model to collapse.
eat_my_ass_n_balls@reddit
It’s a feedback loop. We just gotta do a Kovarex enrichment process loop and sprinkle in some U-238
woct0rdho@reddit
We're already collecting data. Let me introduce DataClaw https://github.com/peteromallet/dataclaw
grisly256@reddit
You need to reply with a plan.
ZeroCool2u@reddit
/plan
NCpoorStudent@reddit
> Keep using Claude? You've reached your plan's message limit. You can wait until it resets at the scheduled time, or continue now:
JohnMason6504@reddit
Self-distillation is underrated for local deployment. You get most of the teacher's quality at a fraction of the parameter count and memory footprint. The real win is running the distilled model on-device, where every byte of VRAM matters.
JohnMason6504@reddit
Self-distillation is practically free compared to pretraining. Generate N samples, filter by pass rate, fine-tune on winners. No teacher model needed. For local inference this is huge because you can iterate on a 27B model with just one GPU for generation and a second for the fine-tune step. The cost-per-quality-gain ratio is absurd.
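The generate → filter-by-pass-rate → fine-tune-on-winners loop described above might look roughly like this; `sample_model` and `passes_tests` are hypothetical stand-ins for a real sampler and a real test harness.

```python
# Sketch of the self-distillation data pipeline: sample N completions,
# keep the ones that pass verification, fine-tune on the winners.
# `sample_model` and `passes_tests` are made-up stand-ins.
import random

def sample_model(prompt: str, n: int, seed: int = 0) -> list[str]:
    # Stand-in for N stochastic generations from the current model.
    rng = random.Random(seed)
    return [f"{prompt}::attempt{rng.randint(0, 9)}" for _ in range(n)]

def passes_tests(sample: str) -> bool:
    # Stand-in verifier: pretend attempts ending in an even digit pass.
    return int(sample[-1]) % 2 == 0

def build_sft_set(prompts: list[str], n: int = 8) -> list[tuple[str, str]]:
    winners = []
    for p in prompts:
        for s in sample_model(p, n):
            if passes_tests(s):
                winners.append((p, s))  # (prompt, passing completion) pair
    return winners

sft = build_sft_set(["reverse a list", "parse json"])
print(len(sft))  # every retained pair passed the verifier
```

The fine-tune step itself would then just be ordinary SFT on the `(prompt, completion)` pairs.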
Constant-Bonus-7168@reddit
The on-policy learning signal is genuinely different from distillation. Curious if you can iterate this or if gains plateau.
DOAMOD@reddit
I am creating a 10k dataset following this method, we could create a bigger one together if necessary.
[01:29:39] 54/10000 (0.5%) |
so slow for local but...
Negative_Flight3856@reddit
There’s always a Zhang
DOAMOD@reddit
Yes
ghulamalchik@reddit
Zheng Zhang.
DetouristCollective@reddit
Almost like practicing..
Dany0@reddit
They used rStar-coder, an embarrassingly shitty dataset. I bet you could improve on this even more just by using a better dataset (or selecting a better subset than they did).
I could make a 27B SSD Coder over the weekend.
LocoMod@reddit
That was a wild ride. Eagerly awaiting the sequel.
Dany0@reddit
Even if nothing comes of this, I learned a lot today
ryebrye@reddit
It uses the output from the evaluation runs at low temperature / high truncation in the supervised fine-tuning stage. It's effectively taking what the model was already confident in and making it more confident in that.
Then when you crank up the temperature later, the things that were baked in more via this approach are less likely to branch off and the exploration is focused on other areas.
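The temperature effect being described is easy to see numerically: dividing the logits by a small temperature before the softmax concentrates probability on the tokens the model was already most confident in. The logits below are made up.

```python
# Numerical sketch: low temperature sharpens the next-token distribution,
# high temperature flattens it. Logits are made-up example values.
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]           # hypothetical next-token logits
p_hot  = softmax(logits, temperature=1.5)
p_cold = softmax(logits, temperature=0.3)

# The top token's share grows as temperature drops:
print(max(p_hot), max(p_cold))
```

So training on low-temperature outputs is, loosely, training on the sharpened version of the model's own distribution.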
de4dee@reddit
isn't this GRPO?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Eyelbee@reddit
The model already had more useful coding ability inside it than its normal decoding could reliably express, and this helped set it straight. This can be a useful technique for unlocking the full capability of a model.
Traditional-Gap-3313@reddit
well...
And..
It seems there's something there...
-dysangel-@reddit
It feels probably related to how training on outputs from that model that really liked owls caused the target model to like owls, even when owls were never mentioned.
-dysangel-@reddit
SSD: while you were RLHFing, I studied the blade
Specialist_Golf8133@reddit
Wait, this is actually kind of a big deal. If you can just run a model against itself and get meaningful improvement without any external labels, that changes the economics of model training pretty dramatically. The whole "we need human annotations" bottleneck just got way smaller. Curious if this holds up at different model sizes or if there's a sweet spot where it breaks down.
grumd@reddit
Gemini explained it like this. It's interesting, this basically feels like "baking-in" top-k/top-p into the model weights themselves, improving both precision and diversity of tokens in the fine-tuned model. Sounds quite simple and brilliant tbh
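A rough illustration of that "baking in top-p" intuition: truncate the model's next-token distribution nucleus-style, renormalize, and treat the result as the distillation target. The probabilities below are made up.

```python
# Hypothetical sketch: build a top-p (nucleus) truncated distribution that
# could serve as a distillation target. Probabilities are made-up values.
def top_p_target(probs: list[float], p: float = 0.9) -> list[float]:
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = set(), 0.0
    for i in order:                     # take tokens in descending probability
        kept.add(i)
        total += probs[i]
        if total >= p:                  # stop once cumulative mass reaches p
            break
    z = sum(probs[i] for i in kept)
    return [probs[i] / z if i in kept else 0.0 for i in range(len(probs))]

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
target = top_p_target(probs, p=0.9)
print(target)   # low-probability tail zeroed out, the rest renormalized
```

Training toward this truncated target would move the weights themselves toward what top-p sampling does at decode time.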
Myrkkeijanuan@reddit
Wow, your username resurfaced memories from ten years ago. Nice to see you here.
TheThoccnessMonster@reddit
Right, almost like we keep re-learning containerized parts of the bitter lesson over and over. Show it everything, not frozen interpretations of the settings we think "perform best," so that it works well no matter what we set them to.
Haxtore@reddit
someone needs to try freezing the bottom layers or making a LoRA variant
CondiMesmer@reddit
Sounds exactly like dspy? I can't tell the difference.
Dany0@reddit
Literally nothing like dspy, are you a bot?
CondiMesmer@reddit
No...?
They both rely on updating based on the quality of the output, so how are they nothing alike?
DSPy is just a Python framework that formalizes this into functions.
r4in311@reddit
Sounds like a big deal... and really unintuitive at first. If I get this right, we should be able to benefit from this effect right away by generating multiple candidate solutions for coding problems at high and low temperature values, then aggregating the candidates to avoid the precision <-> exploration conflict described there...
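A sketch of that idea, assuming stand-in `generate` and `score` functions (a real version would call the model at the two temperatures and score by tests passed): pool candidates sampled at low and high temperature and keep the best.

```python
# Hypothetical sketch: sample candidates at low AND high temperature, pool
# them, pick the best by an external score. `generate` and `score` are
# made-up stand-ins, not a real model API.
import random

def generate(prompt: str, temperature: float, n: int) -> list[str]:
    rng = random.Random(int(temperature * 100))  # deterministic stand-in
    spread = max(1, int(temperature * 10))       # hotter -> more varied output
    return [f"{prompt}#v{rng.randint(0, spread)}" for _ in range(n)]

def score(candidate: str) -> int:
    # Stand-in for "number of tests passed" by this candidate.
    return int(candidate.rsplit("#v", 1)[1])

def best_of_mixed(prompt: str) -> str:
    # Low temp covers the precise/confident region, high temp explores.
    pool = generate(prompt, 0.2, 4) + generate(prompt, 1.0, 4)
    return max(pool, key=score)

print(best_of_mixed("two-sum"))
```

The low-temperature batch covers the precise side, the high-temperature batch the exploratory side, and the verifier arbitrates.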
Live-Crab3086@reddit
ssd qwen3.5 wen?
Reddit_User_Original@reddit
ESSD?
Live-Crab3086@reddit
in the paper, they use SSD
Ok-Scarcity-7875@reddit
That makes so much sense. Imagine you could never learn from your own mistakes because you can't remember your own thoughts, only other people's. With this method, the model can use its own intelligence to see its flaws and the way it thinks, and internally adapt its own ways to get better.