Apple: Embarrassingly Simple Self-Distillation Improves Code Generation
Posted by Mike_mi@reddit | LocalLLaMA | View on Reddit | 61 comments
JackLikesDev@reddit
I love these charts. How do they make such beautiful charts?
m0j0m0j@reddit
There was other research showing that LLMs actually get dumber when fed their own content back. How is that contradiction resolved with this new article?
Bakoro@reddit
We are past the inflection point where models are "good enough" that they can put out work, and as long as there is anything like ground truth, the models get a tiny bit better, just by tightening up their existing distributions.
With coding, you can often get high-quality deterministic feedback which tells you exactly what the problem is, you can get benchmarks, you can get performance reports, and you can keep building increasingly complicated things while remaining in deterministically-verifiable, scorable territory.
That means a fully automated process where there doesn't have to be a human in the training loop, and no more human data is needed.
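The "deterministic feedback" idea above can be sketched in a few lines: execute a generated solution against known test cases and score it pass/fail, with no human in the loop. Everything here (the `solve` candidates, the test cases) is a made-up toy example, not anything from the paper.

```python
# Hypothetical sketch: scoring model-generated code with deterministic
# feedback (unit tests). Candidates and tests are made up for illustration.

def run_candidate(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    """Exec a candidate `solve` function and check it against test cases."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # compile and define solve()
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False  # crashes and bad definitions count as failures

# Two made-up candidates for "return the square of x"
good = "def solve(x):\n    return x * x"
bad  = "def solve(x):\n    return x + x"
tests = [((3,), 9), ((0,), 0), ((-2,), 4)]

print(run_candidate(good, tests))  # True
print(run_candidate(bad, tests))   # False
```

The pass/fail bit is exactly the kind of ground-truth signal that doesn't need a human annotator.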
Due-Memory-6957@reddit
That's just a myth that people on Reddit who don't understand anything about LLMs spread as cope for their anti-AI tendencies. The reality is that AI has been trained on AI data since at least Llama 2, and models have only improved from doing so.
damhack@reddit
The reality is that there are hundreds of thousands of contractors working for Scale Labs and its subsidiaries (like Outlier) manually annotating and providing reasoning traces based on AI-generated prompts and responses. The idea that LLMs are trained on synthetic data they generated themselves is only the visible half of the story. LLM pre- and post-training is still dependent on the Mechanical Turk principle from the early days of LLMs. SOTA LLMs still need datasets of curated information. The industry's dirty little (not so) secret.
Due-Memory-6957@reddit
Actually, Deepseek did that, and it's one of the reasons American companies whined about them being unsafe while asking for censorship.
damhack@reddit
Yeah, there was some hypocrisy in US companies calling out Deepseek when they themselves are the biggest users of Scale Labs’ curated datasets for RL post-training.
__some__guy@reddit
Since Llama 2, the creative writing ability of LLMs has been completely stagnant, and often gotten worse.
Synthslopping increases benchmark scores and knowledge recall.
It doesn't make them any smarter.
Ryoonya@reddit
LOL, nah, opus 4.6 writes more creatively than any legacy model.
__some__guy@reddit
Well, I mean creativity per parameter.
I can imagine Claude writes better when it is 10x bigger than Goliath.
That's just brute forcing it.
Due-Memory-6957@reddit
Go check your old logs with OG Llama, or even better, spin it up and use it. You're suffering from a malignant mental disorder called nostalgia.
Orolol@reddit
Because this is RL, not classic training. You don't train on your own data, you train on the reward signal from your own data.
TheRealMasonMac@reddit
Yes and no. LLMs perform better based on certain structural patterns unique to them compared to how humans output data. Training a model on human-written reasoning performs no better than the non-reasoning baseline model.
But you have to curate the data, so the model will end up learning a different distribution than its existing distribution.
arg_max@reddit
There's a big difference between pre-training on some random generated trash and training after filtering for high quality.
LLMs don't magically get dumber when trained on AI-generated content. Rejection sampling and distillation have been an absolute staple for years. A big reason why Chinese labs are so good is that they distilled at massive scale from Anthropic (see Anthropic's blog post for more info). In large-scale pre-training, we've also had some recent papers showing that rewriting the data and training on both the rewrites and the originals can help extend the data horizon, since huge models are increasingly limited by data scarcity.
The real issue is that when you scrape the web, there's a big chance you encounter shitty generations from old models that are much lower quality than what we can generate nowadays.
But when you can filter for the good data, you can absolutely improve the model by training on synthetic data.
The_frozen_one@reddit
They aren’t feeding content back, they are selectively training the best possible tokens based on a heuristic that seemingly works.
At each token selection, the model is pointing to a location in a very high-dimensional space. Imagine you follow directions in Home Depot to find a tool I've asked you to get for me: you arrive at the correct aisle and the correct spot in that aisle, but it's stocked by "Jorvick Assemblies," whose selection of tools makes no intuitive sense to you. It sounds like they are optimizing the shelves for people who are just going to reach out and grab one of the 5 closest tools. Of course there's still some intentional randomness in the process (you might be taller or shorter, so "closest" can mean different things), so it's not about optimizing for one right answer but for a set of good answers (without being boring and converging on a single one).
And because of the way token generation actually works, improving selection means later choices will be better as well.
At least that's my pre-coffee understanding of it.
HorriblyGood@reddit
From reading the abstract, they are using their own model’s output (self distillation) which is different from just feeding other random LLMs output as training data.
Through the lens of on-policy/off-policy RL, I'm guessing that in their case, since it's using the model's own outputs, it's on-policy, so it's getting learning signals from itself to be more precise on coding tasks but more creative on writing tasks. It doesn't have to change how it works or thinks to match other LLMs' outputs.
My intuition is that it's kinda like learning to code by copying other people's code, versus having someone show you what's wrong with your own code so you can learn to improve.
Majinsei@reddit
I think it's more of a pipeline that takes the code, has the same model analyze it, find errors, etc., so it can correct itself~
Which produces a dataset like: create a sudoku
A: This code has flaws in..., it doesn't compile because of x, and it's confusing z. The improved code is:
This lets the same model think of ways to improve itself~ It's like a writer who takes a break from what they wrote and the next day sits down to edit it~ finding thousands of mistakes they decide to polish~ improving the novel with the same talent~
This is something we all do a lot: you ask for the code, it builds the working MVP, and then I put it through rounds of improvements for code design, organization, bugs, security, etc.~
FoxTimes4@reddit
They did mention it, and as best I can understand, it's because the problem has "forks" that allow the model to explore more.
Thrumpwart@reddit
I believe this method lets an LLM learn why a rollout was good or bad, thus offering a better negative reward signal. I may be way off.
davikrehalt@reddit
Depending on the loss, I GUESS (?) training on your own outputs probably either doesn't change the distribution at all or just sharpens it?
SlopTopZ@reddit
The approach here is elegant — using the model's own correct solutions as training signal rather than requiring external teachers or complex reward models. Self-distillation at this level essentially lets the model bootstrap quality from its own distribution. The fact that it's "embarrassingly simple" is the best part, because it means it's straightforward to apply on top of existing open models. Would love to see this combined with Qwen3.5 or Gemma 4 fine-tunes to see how much headroom there still is on coding benchmarks.
Odd-Ordinary-5922@reddit
imagine the community works together on this and gets a huge dataset of ssd responses and we train a monster of a model like qwen3.5 27b
DigiDecode_@reddit
For the proposed method, you need the original data that was used to train the model, so this new dataset would be sprinkled on top of the original dataset; on its own, this dataset would likely cause the model to collapse.
eat_my_ass_n_balls@reddit
It’s a feedback loop. We just gotta do a Kovarex enrichment process loop and sprinkle in some U-238
woct0rdho@reddit
We're already collecting data. Let me introduce DataClaw https://github.com/peteromallet/dataclaw
grisly256@reddit
You need to reply with a plan.
ZeroCool2u@reddit
/plan
NCpoorStudent@reddit
> Keep using Claude? You've reached your plan's message limit. You can wait until it resets at the scheduled time, or continue now:
JohnMason6504@reddit
Self-distillation is underrated for local deployment. You get most of the teacher's quality at a fraction of the parameter count and memory footprint. The real win is running the distilled model on-device, where every byte of VRAM matters.
JohnMason6504@reddit
Self-distillation is practically free compared to pretraining. Generate N samples, filter by pass rate, fine-tune on winners. No teacher model needed. For local inference this is huge because you can iterate on a 27B model with just one GPU for generation and a second for the fine-tune step. The cost-per-quality-gain ratio is absurd.
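The generate → filter-by-pass-rate → fine-tune-on-winners loop described above might look roughly like this; `sample_model` and `passes_tests` are hypothetical stand-ins for a real sampler and a real test harness.

```python
# Sketch of the self-distillation data pipeline: sample N completions,
# keep the ones that pass verification, fine-tune on the winners.
# `sample_model` and `passes_tests` are made-up stand-ins.
import random

def sample_model(prompt: str, n: int, seed: int = 0) -> list[str]:
    # Stand-in for N stochastic generations from the current model.
    rng = random.Random(seed)
    return [f"{prompt}::attempt{rng.randint(0, 9)}" for _ in range(n)]

def passes_tests(sample: str) -> bool:
    # Stand-in verifier: pretend attempts ending in an even digit pass.
    return int(sample[-1]) % 2 == 0

def build_sft_set(prompts: list[str], n: int = 8) -> list[tuple[str, str]]:
    winners = []
    for p in prompts:
        for s in sample_model(p, n):
            if passes_tests(s):
                winners.append((p, s))  # (prompt, passing completion) pair
    return winners

sft = build_sft_set(["reverse a list", "parse json"])
print(len(sft))  # every retained pair passed the verifier
```

The fine-tune step itself would then just be ordinary SFT on the `(prompt, completion)` pairs.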
Constant-Bonus-7168@reddit
The on-policy learning signal is genuinely different from distillation. Curious if you can iterate this or if gains plateau.
DOAMOD@reddit
I am creating a 10k dataset following this method, we could create a bigger one together if necessary.
[01:29:39] 54/10000 (0.5%) |
so slow for local but...
Negative_Flight3856@reddit
There’s always a Zhang
DOAMOD@reddit
Yes
ghulamalchik@reddit
Zheng Zhang.
DetouristCollective@reddit
Almost like practicing..
Dany0@reddit
They used rStar-coder, an embarrassingly shitty dataset. I bet you could improve on this even more just by using a better dataset (or selecting a better subset than they did).
I could make a 27B SSD Coder over the weekend.
LocoMod@reddit
That was a wild ride. Eagerly awaiting the sequel.
Dany0@reddit
Even if nothing comes of this, I learned a lot today
ryebrye@reddit
It uses the output from the evaluation runs at low temperature / high truncation in the supervised fine-tuning stage. It's effectively taking what the model was already confident in and making it more confident in that.
Then when you crank up the temperature later, the things that were baked in more via this approach are less likely to branch off and the exploration is focused on other areas.
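The temperature effect being described is easy to see numerically: dividing the logits by a small temperature before the softmax concentrates probability on the tokens the model was already most confident in. The logits below are made up.

```python
# Numerical sketch: low temperature sharpens the next-token distribution,
# high temperature flattens it. Logits are made-up example values.
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]           # hypothetical next-token logits
p_hot  = softmax(logits, temperature=1.5)
p_cold = softmax(logits, temperature=0.3)

# The top token's share grows as temperature drops:
print(max(p_hot), max(p_cold))
```

So training on low-temperature outputs is, loosely, training on the sharpened version of the model's own distribution.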
de4dee@reddit
isn't this GRPO?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Eyelbee@reddit
The model already had more useful coding ability inside it than its normal decoding could reliably express, and this helped set it straight. This can be a useful technique for unlocking the full capability of a model.
Traditional-Gap-3313@reddit
well...
And..
It seems there's something there...
-dysangel-@reddit
It feels probably related to how training on outputs from that model that really liked owls caused the target model to like owls, even when owls were never mentioned.
-dysangel-@reddit
SSD: while you were RLHFing, I studied the blade
Specialist_Golf8133@reddit
Wait, this is actually kind of a big deal. If you can just run a model against itself and get meaningful improvement without any external labels, that changes the economics of model training pretty dramatically. The whole "we need human annotations" bottleneck just got way smaller. Curious if this holds up at different model sizes or if there's a sweet spot where it breaks down.
grumd@reddit
Gemini explained it like this. It's interesting, this basically feels like "baking-in" top-k/top-p into the model weights themselves, improving both precision and diversity of tokens in the fine-tuned model. Sounds quite simple and brilliant tbh
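A rough illustration of that "baking in top-p" intuition: truncate the model's next-token distribution nucleus-style, renormalize, and treat the result as the distillation target. The probabilities below are made up.

```python
# Hypothetical sketch: build a top-p (nucleus) truncated distribution that
# could serve as a distillation target. Probabilities are made-up values.
def top_p_target(probs: list[float], p: float = 0.9) -> list[float]:
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = set(), 0.0
    for i in order:                     # take tokens in descending probability
        kept.add(i)
        total += probs[i]
        if total >= p:                  # stop once cumulative mass reaches p
            break
    z = sum(probs[i] for i in kept)
    return [probs[i] / z if i in kept else 0.0 for i in range(len(probs))]

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
target = top_p_target(probs, p=0.9)
print(target)   # low-probability tail zeroed out, the rest renormalized
```

Training toward this truncated target would move the weights themselves toward what top-p sampling does at decode time.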
Myrkkeijanuan@reddit
Wow, your username resurfaced memories from ten years ago. Nice to see you here.
TheThoccnessMonster@reddit
Right, almost like we keep re-learning containerized parts of the bitter lesson over and over. Show it everything, not frozen interpretations of the settings we think "perform best," so that it works well no matter what we set them to.
Haxtore@reddit
someone needs to try freezing the bottom layers or making a LoRA variant
CondiMesmer@reddit
Sounds exactly like dspy? I can't tell the difference.
Dany0@reddit
Literally nothing like dspy, are you a bot?
CondiMesmer@reddit
No...?
They both rely on updating based on the quality of the output, so how are they nothing alike?
DSPy is just a Python framework that formalizes this into functions.
r4in311@reddit
Sounds like a big deal... and really unintuitive at first. If I get this right, we should be able to benefit from this effect right away by generating multiple candidate solutions for coding problems at high and low temperature values, then aggregating the candidates to avoid the precision <-> exploration conflict described there...
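A sketch of that idea, assuming stand-in `generate` and `score` functions (a real version would call the model at the two temperatures and score by tests passed): pool candidates sampled at low and high temperature and keep the best.

```python
# Hypothetical sketch: sample candidates at low AND high temperature, pool
# them, pick the best by an external score. `generate` and `score` are
# made-up stand-ins, not a real model API.
import random

def generate(prompt: str, temperature: float, n: int) -> list[str]:
    rng = random.Random(int(temperature * 100))  # deterministic stand-in
    spread = max(1, int(temperature * 10))       # hotter -> more varied output
    return [f"{prompt}#v{rng.randint(0, spread)}" for _ in range(n)]

def score(candidate: str) -> int:
    # Stand-in for "number of tests passed" by this candidate.
    return int(candidate.rsplit("#v", 1)[1])

def best_of_mixed(prompt: str) -> str:
    # Low temp covers the precise/confident region, high temp explores.
    pool = generate(prompt, 0.2, 4) + generate(prompt, 1.0, 4)
    return max(pool, key=score)

print(best_of_mixed("two-sum"))
```

The low-temperature batch covers the precise side, the high-temperature batch the exploratory side, and the verifier arbitrates.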
Live-Crab3086@reddit
ssd qwen3.5 wen?
Reddit_User_Original@reddit
ESSD?
Live-Crab3086@reddit
in the paper, they use SSD
Ok-Scarcity-7875@reddit
That makes so much sense. Imagine you could never learn from your own mistakes because you can't remember your own thoughts, only other people's. With this method, the model can use its own intelligence to see its flaws and the way it thinks, and internally adapt its own ways to get better.