DISTILLATION is so underrated. I spent an hour and got a neat improvement in accuracy while keeping the costs low
Posted by Ambitious_Anybody855@reddit | LocalLLaMA | 42 comments

mimirium_@reddit
It's so funny when people just assume that OP doesn't know what distillation and fine-tuning are
5lipperySausage@reddit
Standard Reddit logic
RevolutionaryLime758@reddit
Confused op
az226@reddit
Fine tuning isn’t the same as distillation.
Distillation is taking outputs (with or without logits) from a large model to continue training/tuning a smaller model.
Fine tuning keeps the model the same size. It’s just about aligning outputs (usually done supervised, but can also be reinforcement learned).
Are you conflating concepts?
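To make the distinction concrete: the classic "distillation" loss trains the student against the teacher's softened output distribution rather than hard labels. A minimal pure-Python sketch (temperature value and function names are illustrative, not from any library in the thread):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's
    soft targets. This is minimized when the student matches the teacher."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))
```

With matching logits the loss bottoms out at the teacher distribution's entropy; the more the student's logits diverge, the larger the loss. Fine-tuning on plain labels is the same training loop, just with one-hot targets instead of the teacher's distribution.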
polytique@reddit
Distillation is a type of fine tuning.
fauxfeliscatus@reddit
I assume they mean they are fine-tuning on the soft labels.
Leelaah_saiee@reddit
They use hard targets also to make it more robust
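Mixing the two targets is usually done as a weighted sum of a hard-label cross-entropy term and a soft-target term. A tiny sketch, assuming you already have probability vectors in hand (the `alpha` weighting is a common convention, not something specified in this thread):

```python
import math

def mixed_loss(student_probs, teacher_probs, hard_label, alpha=0.5):
    """Blend hard-target cross-entropy with soft-target cross-entropy.
    alpha weights the hard (ground-truth) term; 1 - alpha weights the
    teacher's soft targets."""
    hard = -math.log(student_probs[hard_label])  # CE against the one-hot label
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return alpha * hard + (1 - alpha) * soft
```

Setting `alpha=1.0` recovers plain supervised fine-tuning; `alpha=0.0` is pure soft-label distillation.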
_yustaguy_@reddit
I think the nomenclature is getting vague across the whole industry in general. Just look at OpenAI's "distillation" API.
V0dros@reddit
You can do distillation to fine-tune a model on the output of a bigger model
Ambitious_Anybody855@reddit (OP)
That's right
KillerQF@reddit
Are you testing on your training input?
Ambitious_Anybody855@reddit (OP)
Pretty standard split: 90% training, 10% for testing
Harrycognito@reddit
That is definitely not a standard split brother.
coldrolledpotmetal@reddit
90-10 is totally a pretty standard split
r1str3tto@reddit
I agree with you. It’s not about the percentage split between the train and test sets - it’s about how large the test set is in absolute terms. It needs to be large enough to form a representative sample of the distribution you are modeling.
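The point about absolute test-set size can be seen with a quick sketch of a 90/10 split (the helper name and seed are illustrative):

```python
import random

def train_test_split(examples, test_frac=0.10, seed=42):
    """Shuffle and split; 90/10 is a common ratio, but what matters for a
    reliable accuracy estimate is the test set's absolute size."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))
train, test = train_test_split(data)
# With 1,000 examples a 10% test set is only 100 samples; the uncertainty on a
# measured accuracy shrinks roughly like 1/sqrt(n), so small test sets are noisy
# regardless of the percentage split.
```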
waiting_for_zban@reddit
You could ask your local LLM this question, but to save you a few minutes: with such splits, a common, well-known problem like overfitting arises.
coldrolledpotmetal@reddit
Yeah I'm aware that overfitting can be a problem, but splits can range anywhere from 50-50 to 95-5. Some LLM tasks even require a bit of overfitting anyways, if you really want to reduce hallucinations. OP shouldn't be getting downvoted so hard for saying something that isn't outlandish at all
Ambitious_Anybody855@reddit (OP)
Thanks u/coldrolledpotmetal for having my back <3
Su1tz@reddit
Soo, synthetic data fine tuning?
V0dros@reddit
But also has to be from a big model to a smaller one to be considered distillation
dp3471@reddit
knowledge distillation != model distillation != distillation
bad op
Ambitious_Anybody855@reddit (OP)
What should I interpret from this? Yes this is knowledge distillation but does that make the results incorrect or change anything?
SirRece@reddit
Cool work. I love how everyone just assumes blindly you don't know what distillation is when you clearly do. Love seeing homegrown stuff like this. 😂
ColorlessCrowfeet@reddit
Distillation == Fine-tuning?
Ambitious_Anybody855@reddit (OP)
Use cases are different for each. Distillation ensures a smaller model performs on par with a much larger model; it's 14x cheaper in my example.
Finetuning is more about improving a model's performance on a specific task/domain, and isn't always done for a cost benefit.
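The workflow OP is describing boils down to two steps: an expensive teacher annotates raw data, then a cheap student is tuned on those annotations. A deliberately toy sketch (both "models" here are stand-ins; the real pipeline would call an LLM and run supervised fine-tuning):

```python
def teacher_label(text):
    """Stand-in for an expensive large-model call: a hypothetical
    keyword rule playing the role of the teacher."""
    return "positive" if "good" in text else "negative"

def finetune_student(corpus):
    """Stand-in for supervised fine-tuning: the 'student' here simply
    memorizes the teacher's label for each input it has seen."""
    return {text: teacher_label(text) for text in corpus}

corpus = ["good movie", "bad movie", "good food"]
student = finetune_student(corpus)
```

The cost saving comes from running the teacher once per training example, then serving the cheap student for all future traffic.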
Psychological_Cry920@reddit
How did this explanation get so many downvotes?
SirRece@reddit
Big distillation
ShadowbanRevival@reddit
If it is ensured why not distill the distilled model on and on until you get AGI in your basement?
Ambitious_Anybody855@reddit (OP)
Hahah! Spare me, lord, English is my second language
eleqtriq@reddit
You can combine the processes. You could distill domain knowledge into the smaller model, too.
getmevodka@reddit
do you have some video tutorials on the process so I can learn it? I'd love to create some distilled versions of bigger models on my m3 ultra :)
Ambitious_Anybody855@reddit (OP)
Not a video but a detailed step by step guide. Check my colab notebook for sentiment analysis here: https://github.com/bespokelabsai/curator
Tell me how it works out!!
getmevodka@reddit
hey thanks ! ill take a look, but i need to finish a feature at my own program first xD
verbari_dev@reddit
Distillation is specifically fine tuning a smaller model with a larger model's outputs
Unlucky_Lecture_7606@reddit
Do you have RAFT vs RAG vs Base comparison anywhere?
Ambitious_Anybody855@reddit (OP)
I don't have it but interesting idea
ReadyAndSalted@reddit
wait how did distillation give you an improvement in accuracy? The new smaller model should be worse than the original larger model... When you say "improvement in accuracy", what are you comparing your new small model against?
Ambitious_Anybody855@reddit (OP)
I am comparing the base small model with the finetuned small model. Annotations from the large model are treated as ground truth. In essence, I am able to replicate the performance of the large model via the finetuned model at 92% accuracy (all while being 14x cheaper than the large model).
Hope this helps
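Under this setup, "accuracy" is just agreement with the teacher's annotations. A minimal sketch of that metric (function name is illustrative):

```python
def agreement(student_preds, teacher_preds):
    """Fraction of examples where the student matches the teacher,
    treating the large model's annotations as ground truth, as OP describes."""
    matches = sum(s == t for s, t in zip(student_preds, teacher_preds))
    return matches / len(teacher_preds)
```

So OP's 92% means the cheap student reproduces the teacher's label on 92 of every 100 held-out examples, not that it beats the teacher on an external benchmark.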
You_Wen_AzzHu@reddit
Those accuracy rates of 90% and above seem almost too good to believe, tbh.
Ambitious_Anybody855@reddit (OP)
Not so much if the base model already had 82.5% accuracy, right? Here's my colab notebook if you'd like to check where I could have gone wrong. https://colab.research.google.com/drive/1Zfl3g7POsqqYQqkzXdyhYRSAymLhZugn?usp=sharing
You_Wen_AzzHu@reddit
Thank you brother, I would love to know a way to improve my model like what you said.
Ambitious_Anybody855@reddit (OP)
Colab notebook added on my Github: https://github.com/bespokelabsai/curator