Baguettotron, a 321-million-parameter generalist Small Reasoning Model (80 layers deep)
Posted by Balance-@reddit | LocalLLaMA | 26 comments
Baguettotron is a 321-million-parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
Despite being trained on considerably less data, Baguettotron outperforms most SLMs in the same size range on non-code industry benchmarks, providing an unprecedented balance of memory, general reasoning, math, and retrieval performance.
The name is a nod both to its French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range.
TomieNW@reddit
Can't even say hello back :(
New_Cartographer9998@reddit
That's because it's not a conversational model. Check this using the RAG format (temp 0.4)
SrijSriv211@reddit
I'm curious why you decided to make it so deep?
Pojiku@reddit
Not part of the team here but I am also interested after seeing the Mixture of Recursions paper (apologies if that's not what it's actually called).
The question is whether, for SLMs, we can get reasoning gains from depth as a trade-off against semantic gains from width.
JChataigne@reddit
I had an intuition of this but couldn't put it into words; this is well said.
SrijSriv211@reddit
I don't think the authors of this model are using MoR. Mixture of Recursions is where the layers of the model are re-used, and it also uses a dynamic token-routing mechanism that helps make it more efficient.
Also, I'm not sure that depth and width bring reasoning and semantic gains respectively. I think as long as your model (either deep or wide) is properly able to capture and represent the data, it should be able to become good at both semantic and reasoning tasks.
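For readers who haven't seen the paper, here is a rough, hypothetical PyTorch sketch of the idea as described above: one shared block reused recursively, plus a per-token router. The module and hyperparameters are invented for illustration; this is not the published MoR implementation (which trains the router and skips computation for tokens that have finished recursing).

```python
import torch
import torch.nn as nn


class ToyRecursiveBlock(nn.Module):
    """Toy sketch of the Mixture-of-Recursions idea: one shared transformer
    block is applied several times, and a small router decides how many
    recursion steps each token receives. Illustrative only."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, max_recursions: int = 4):
        super().__init__()
        # A single block whose weights are reused at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Argmax routing is a stand-in; the real method uses a trained router.
        self.router = nn.Linear(d_model, max_recursions)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        depths = self.router(x).argmax(dim=-1) + 1      # per-token recursion depth
        h = x
        for step in range(self.max_recursions):
            update = self.shared_block(h)               # same weights, reused
            active = (depths > step).unsqueeze(-1)      # tokens still recursing
            h = torch.where(active, update, h)          # finished tokens keep their state
        return h


if __name__ == "__main__":
    block = ToyRecursiveBlock()
    print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

The takeaway is just that "depth" can come from reusing the same weights several times rather than stacking new layers, with the router deciding how much recursion each token gets.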
Dorialexandre@reddit
Hi. So, very empirically: we had the intuition that a deeper architecture could be more beneficial for intense reasoning tasks. And since we designed a fully generalist synthetic dataset (SYNTH) that made full model training much less costly, we simply tested it.
Overall we have seen the biggest improvements on math, but also smaller ones everywhere (memorization, query adherence, etc.). The main trade-off is training time/FLOPs (easily 1.5x) and inference time, though it should parallelize well.
We're going to test more systematically for the paper coming in a few weeks.
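As a back-of-the-envelope illustration of the depth-versus-width trade-off being discussed: under the usual dense-transformer approximation of roughly 12·d_model² parameters per block (ignoring embeddings), extra depth at a fixed budget has to be paid for with a narrower hidden size. The shapes below are invented and are not the actual Baguettotron configuration.

```python
def block_params(d_model: int) -> int:
    """Rough cost of one dense transformer block: ~4*d^2 for the attention
    projections plus ~8*d^2 for a 4x-wide MLP."""
    return 12 * d_model * d_model


def total_params_millions(n_layers: int, d_model: int) -> float:
    """Approximate non-embedding parameter count, in millions."""
    return n_layers * block_params(d_model) / 1e6


# Hypothetical shapes at roughly comparable non-embedding budgets (~300-340M).
for n_layers, d_model in [(24, 1024), (48, 768), (80, 576)]:
    print(f"{n_layers:>2} layers @ d_model={d_model}: "
          f"~{total_params_millions(n_layers, d_model):.0f}M params")
```

Roughly, doubling the layer count at a fixed parameter budget means shrinking d_model by about √2.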
SrijSriv211@reddit
Yeah, I saw that training time/FLOPs and inference time trade-off coming. I personally think your dataset is good enough to achieve similar results with a wider model as well, but anyway, it's still cool that you guys tried such a different approach.
I think your intuition might be correct, because someone in this thread posted a link to a research paper (I haven't read it). Here it is in case you want to give it a read: https://arxiv.org/abs/2503.03961
Looking forward to a more detailed paper :D
Dorialexandre@reddit
Yes, exactly. It also helped that it was a relatively effortless major change on the code side (just a few lines in a YAML). But now I look forward to more controlled experiments with synthetic data, similar to what Physics of Language Models did with transformers/SSMs, etc.
SrijSriv211@reddit
Cool! I'd also love to learn more about Monad.
eztrendar@reddit
Curious too. Is there any benefit to this?
logicchains@reddit
Without chain of thought, an 80-layer model can do 80 non-parallelisable state-tracking operations when generating a single token, making it much better at challenges that involve that type of problem, e.g. tracking parity or brace nesting.
SrijSriv211@reddit
I don't understand, can you elaborate please?
logicchains@reddit
For a given input sequence length, more depth allows solving a wider class of problems: https://arxiv.org/abs/2503.03961
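To make the state-tracking examples concrete, here is a small Python illustration (not from the paper): parity and brace nesting are naturally written as left folds where every update reads the result of the previous one, which is the kind of running state a fixed-depth model has to carry internally when it answers in a single token rather than through a chain of thought.

```python
def braces_balanced(s: str) -> bool:
    """Track nesting depth one character at a time; each update depends on
    the previous depth, so the steps form a single sequential chain."""
    depth = 0
    for ch in s:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:        # a closing brace with nothing open
                return False
    return depth == 0


def parity(bits: str) -> int:
    """Running XOR over a bit string: another running-state computation."""
    state = 0
    for b in bits:
        state ^= int(b)
    return state


print(braces_balanced("{{}{}}"))  # True
print(braces_balanced("{}}{"))    # False
print(parity("1101"))             # 1
```

With a chain of thought, the model can write that running state out as text and update it token by token; without one, the number of dependent updates available for a single output token is bounded by the network's depth, which is the intuition behind the comment above and the linked paper.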
SrijSriv211@reddit
Thank you :D
SrijSriv211@reddit
I know that wider models perform better and are easier to run than deeper models, so I can't really see any substantial benefit.
No_Afternoon_4260@reddit
Because it looks like a baguette
Dorialexandre@reddit
Answer also correct :D
MoffKalast@reddit
Honhonhon
Temporary-Roof2867@reddit
🤣🤣🤣🤣
SrijSriv211@reddit
It's not about looks.
limapedro@reddit
Deep Learning*
SrijSriv211@reddit
Well, your reply makes me feel my comment should be marked as NSFW. lol!
BalorNG@reddit
Men will build an 80-layer SLM instead of going to thera... creating a proper recursive model!
-p-e-w-@reddit
80 layers is astonishingly many for such a small model. For comparison, gpt-oss-20b has only 24 layers, despite having 60 times the parameter count of this model. The difference is so stark that it’s basically a different architecture.
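A quick sanity check on those ratios (the gpt-oss-20b total parameter count below is approximate):

```python
baguettotron_params = 321e6   # from the post
gpt_oss_20b_params = 21e9     # approximate published total for gpt-oss-20b

print(f"parameter ratio: ~{gpt_oss_20b_params / baguettotron_params:.0f}x")  # ~65x
print(f"depth ratio:      {80 / 24:.1f}x more layers (80 vs 24)")            # ~3.3x
```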