Baguettotron, a 321-million-parameter generalist Small Reasoning Model (80 layers deep)
Posted by Balance-@reddit | LocalLLaMA | 26 comments
Baguettotron is a 321-million-parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
Despite being trained on considerably less data, Baguettotron outperforms most SLMs in the same size range on non-code industry benchmarks, providing an unprecedented balance of memory, general reasoning, math, and retrieval performance.
The name is a nod both to its French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range.
TomieNW@reddit
Can't even say hello back :(
New_Cartographer9998@reddit
That's because it's not a conversational model. Check this using the RAG format (temp 0.4)
SrijSriv211@reddit
I'm curious why you decided to make it so deep?
Pojiku@reddit
Not part of the team here but I am also interested after seeing the Mixture of Recursions paper (apologies if that's not what it's actually called).
The question is whether, for SLMs, we can get reasoning gains from depth as a trade-off against semantic gains from width.
JChataigne@reddit
I had an intuition of this but couldn't put it into words; this is well said.
SrijSriv211@reddit
I don't think the authors of this model are using MoR. Mixture of Recursions is where the layers of the model are re-used, and it also uses a dynamic token-routing mechanism that helps make it more efficient.
Also, I'm not sure that depth and width bring reasoning and semantic gains respectively. I think as long as your model (either deep or wide) is properly able to capture and represent the data, it should be able to become good at both semantic and reasoning tasks.
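For readers who haven't seen the paper, here is a rough, hypothetical PyTorch sketch of the idea as described above: one shared block reused recursively, plus a per-token router. The module and hyperparameters are invented for illustration; this is not the published MoR implementation (which trains the router and skips computation for tokens that have finished recursing).

```python
import torch
import torch.nn as nn


class ToyRecursiveBlock(nn.Module):
    """Toy sketch of the Mixture-of-Recursions idea: one shared transformer
    block is applied several times, and a small router decides how many
    recursion steps each token receives. Illustrative only."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, max_recursions: int = 4):
        super().__init__()
        # A single block whose weights are reused at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Argmax routing is a stand-in; the real method uses a trained router.
        self.router = nn.Linear(d_model, max_recursions)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        depths = self.router(x).argmax(dim=-1) + 1      # per-token recursion depth
        h = x
        for step in range(self.max_recursions):
            update = self.shared_block(h)               # same weights, reused
            active = (depths > step).unsqueeze(-1)      # tokens still recursing
            h = torch.where(active, update, h)          # finished tokens keep their state
        return h


if __name__ == "__main__":
    block = ToyRecursiveBlock()
    print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

The takeaway is just that "depth" can come from reusing the same weights several times rather than stacking new layers, with the router deciding how much recursion each token gets.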
Dorialexandre@reddit
Hi. So, very empirically: we had the intuition that a deeper architecture could be more beneficial for intense reasoning tasks. And since we designed a fully generalist synthetic dataset (SYNTH) that made full model training much less costly, we simply tested it.
Overall we have seen the biggest improvements on math, but also smaller ones everywhere (memorization, query adherence, etc.). The main trade-off is training time/FLOPs (easily 1.5x) and inference time, though it should parallelize well.
We're going to test more systematically for the paper coming in a few weeks.
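As a back-of-the-envelope illustration of the depth-versus-width trade-off being discussed: under the usual dense-transformer approximation of roughly 12·d_model² parameters per block (ignoring embeddings), extra depth at a fixed budget has to be paid for with a narrower hidden size. The shapes below are invented and are not the actual Baguettotron configuration.

```python
def block_params(d_model: int) -> int:
    """Rough cost of one dense transformer block: ~4*d^2 for the attention
    projections plus ~8*d^2 for a 4x-wide MLP."""
    return 12 * d_model * d_model


def total_params_millions(n_layers: int, d_model: int) -> float:
    """Approximate non-embedding parameter count, in millions."""
    return n_layers * block_params(d_model) / 1e6


# Hypothetical shapes at roughly comparable non-embedding budgets (~300-340M).
for n_layers, d_model in [(24, 1024), (48, 768), (80, 576)]:
    print(f"{n_layers:>2} layers @ d_model={d_model}: "
          f"~{total_params_millions(n_layers, d_model):.0f}M params")
```

Roughly, doubling the layer count at a fixed parameter budget means shrinking d_model by about √2.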
SrijSriv211@reddit
Yeah, I saw that training time/FLOPs and inference time trade-off coming. I personally think your dataset is good enough to achieve similar results with a wider model as well, but anyway, it's still cool that you guys tried such a different approach.
I think your intuition might be correct, because someone in this thread posted a link to a research paper (I haven't read it). Here it is in case you want to give it a read: https://arxiv.org/abs/2503.03961
Looking forward to a more detailed paper :D
Dorialexandre@reddit
Yes, exactly. It also helped that it was a relatively effortless major change on the code side (just a few lines in a YAML). But now I look forward to more controlled experiments with synthetic data, similar to what Physics of Language Models did with transformers/SSMs, etc.
SrijSriv211@reddit
Cool! I'd also love to learn more about Monad.
eztrendar@reddit
Curious too. Is there any benefit to this?
logicchains@reddit
Without chain of thought, an 80-layer model can do 80 non-parallelisable state-tracking operations when generating a single token, making it much better at challenges that involve that type of problem, e.g. tracking parity or brace nesting.
SrijSriv211@reddit
I don't understand, can you elaborate please?
logicchains@reddit
For a given input sequence length, more depth allows solving a wider class of problems: https://arxiv.org/abs/2503.03961
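To make the state-tracking examples concrete, here is a small Python illustration (not from the paper): parity and brace nesting are naturally written as left folds where every update reads the result of the previous one, which is the kind of running state a fixed-depth model has to carry internally when it answers in a single token rather than through a chain of thought.

```python
def braces_balanced(s: str) -> bool:
    """Track nesting depth one character at a time; each update depends on
    the previous depth, so the steps form a single sequential chain."""
    depth = 0
    for ch in s:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:        # a closing brace with nothing open
                return False
    return depth == 0


def parity(bits: str) -> int:
    """Running XOR over a bit string: another running-state computation."""
    state = 0
    for b in bits:
        state ^= int(b)
    return state


print(braces_balanced("{{}{}}"))  # True
print(braces_balanced("{}}{"))    # False
print(parity("1101"))             # 1
```

With a chain of thought, the model can write that running state out as text and update it token by token; without one, the number of dependent updates available for a single output token is bounded by the network's depth, which is the intuition behind the comment above and the linked paper.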
SrijSriv211@reddit
Thank you :D
SrijSriv211@reddit
I know that wider models perform better and are easier to run than deeper models, so I can't really see any substantial benefit.
No_Afternoon_4260@reddit
Because it looks like a baguette
Dorialexandre@reddit
Answer also correct :D
MoffKalast@reddit
Honhonhon
Temporary-Roof2867@reddit
🤣🤣🤣🤣
SrijSriv211@reddit
It's not about looks.
limapedro@reddit
Deep Learning*
SrijSriv211@reddit
Well, your reply makes me feel my comment should be marked as NSFW. lol!
BalorNG@reddit
Men will build an 80-layer SLM instead of going to thera... creating a proper recursive model!
-p-e-w-@reddit
80 layers is astonishingly many for such a small model. For comparison, gpt-oss-20b has only 24 layers, despite having 60 times the parameter count of this model. The difference is so stark that it’s basically a different architecture.
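A quick sanity check on those ratios (the gpt-oss-20b total parameter count below is approximate):

```python
baguettotron_params = 321e6   # from the post
gpt_oss_20b_params = 21e9     # approximate published total for gpt-oss-20b

print(f"parameter ratio: ~{gpt_oss_20b_params / baguettotron_params:.0f}x")  # ~65x
print(f"depth ratio:      {80 / 24:.1f}x more layers (80 vs 24)")            # ~3.3x
```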