The joy and pain of training an LLM from scratch
Posted by kazzus78@reddit | LocalLLaMA | 20 comments
mii-llm just released a detailed technical report on the development of the Zagreus and Nesso model families: a set of 0.4B parameter language models trained from scratch with a focus on edge deployment, multilingual capability, and European languages.
The report documents the full pipeline behind a family of small language models designed for Italian, Spanish, French, and Portuguese, with bilingual pretraining centered on English + target language settings.
Released models
- Zagreus-0.4B-ita — English/Italian base model
- Zagreus-0.4B-spa — English/Spanish base model
- Zagreus-0.4B-fra — English/French base model
- Zagreus-0.4B-por — English/Portuguese base model
- Nesso-0.4B-instruct — post-trained for conversational use
- Nesso-0.4B-agentic — post-trained for structured / agentic tasks
- Open-Zagreus-0.4B — fully open variant built with open data and open recipes
Training setup
According to the report, the project used:
- 64 NVIDIA A100 GPUs
- ~1 trillion tokens
- Datatrove for tokenization
- Hugging Face Nanotron for pretraining
- Axolotl for post-training
- Slurm for multi-node orchestration
The report also explains why a dense 0.4B architecture was selected instead of MoE, arguing that in the sub-1B regime, stability and utilization can matter more than sparse efficiency.
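The report doesn't reproduce the launch scripts, but a multi-node Nanotron pretraining job under Slurm typically looks something like the sketch below. Node counts, paths, time limits, and the config filename are illustrative assumptions, not the authors' actual setup; `run_train.py --config-file` is Nanotron's documented entrypoint.

```shell
#!/bin/bash
#SBATCH --job-name=zagreus-0.4b
#SBATCH --nodes=8                 # 8 nodes x 8 A100s = 64 GPUs
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --time=7-00:00:00

# Rendezvous endpoint for torch.distributed: first node in the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# Nanotron reads the parallelism layout (DP/TP/PP) from its YAML config;
# torchrun only handles process launch and rendezvous.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  run_train.py --config-file config_zagreus_0.4b.yaml
```

The one-launcher-per-node pattern (`--ntasks-per-node=1` plus `torchrun --nproc_per_node=8`) is the usual way to combine Slurm with torch elastic launch.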
Why this is interesting
A lot of current discussion focuses on frontier-scale models, but this report is a useful example of the opposite direction: small models trained from scratch for practical multilingual edge scenarios.
Some points that stand out:
- small multilingual models can still be competitive if the pipeline is well engineered
- post-training has a major effect on usability
- model behavior differs significantly across Italian and English tasks
- open pipelines can still produce meaningful results in this size class
- small models still show clear weaknesses in arithmetic, factual recall, repetition, and domain-specific knowledge
Benchmark notes
The report includes comparisons against Qwen3-0.6B and Qwen3.5-0.8B, along with multilingual evaluations and task-by-task analysis.
A few interesting takeaways:
- Nesso-0.4B-agentic appears especially strong and consistent on Italian tasks
- Qwen3.5-0.8B performs better on several English generative tasks
- Qwen3-0.6B stands out on logic / reasoning-style tasks
- the fully open variant still achieves competitive results in several settings
Figures
The report includes four figures:
- LLM-as-judge comparison
- Classical benchmark results
- Italian benchmark results
- English benchmark results
Main takeaway
This is a solid case study on what it actually looks like to train a small multilingual LLM from scratch in 2026: tokenization, storage, Slurm orchestration, distributed training, post-training, evaluation, and model release.
For anyone interested in small language models, multilingual training, edge deployment, or open LLM engineering, the report is worth a read.
Eyelbee@reddit
64 A100s for just 0.4B is insane. That destroys my plans to train a small model.
Clean_Hyena7172@reddit
The reality of training models from scratch is brutal.
Enthu-Cutlet-1337@reddit
64 A100s for 0.4B is the real story here, not the params. At that scale, data quality, sequence packing, and optimizer stability dominate; one bad token mix or LR schedule and you burn weeks for a model that still regresses on Italian.
Borkato@reddit
WTF this is a bot comment too??
Enthu-Cutlet-1337@reddit
Why do you think so?
Borkato@reddit
Summarized point in one line is the real story here. Leading phrase, x, y, z does q; quippy statement here
Enthu-Cutlet-1337@reddit
Thanks for letting me know. Next time onwards I won't take the time to summarise and shorten the comment.
Borkato@reddit
Notice that this comment has a completely different feel than your original one
Enthu-Cutlet-1337@reddit
Here I am not trying to summarise my thoughts. Rather I am trying to speak up against the “the account is a bot” claim.
One false claim like this, even if it has no actual proof, unfairly affects the user on Reddit.
Borkato@reddit
Can you please just admit you used an ai.
Enthu-Cutlet-1337@reddit
So, having a structured thought process and using semicolons now qualifies as 'bot cadence'? If summarizing a technical point about A100s and LR schedules is 'slop,' then the bar for human discourse is getting depressingly low.
I'm not going to admit to using AI just to make your 'comment cop' hunch feel valid. What's next? Are you going to start checking my comments for the proper usage of em and en dashes?
Borkato@reddit
You don’t have to admit it, I can’t make you do anything, but it’s just sad you’d lie like this.
Enthu-Cutlet-1337@reddit
It's a strange hill to die on. You're dismissing actual technical insight because you're obsessed with sentence structure. If you want to keep 'policing' users based on vibes instead of content, that's on you, but it’s definitely not the service to the community you think it is.
Borkato@reddit
Sure.
Enthu-Cutlet-1337@reddit
The usage of “x, y, z does q” is usually a great strategy to summarise thoughts.
Borkato@reddit
I edited it for clarity, it’s the overall cadence.
Constant-Simple-1234@reddit
I looked up the A100: it has 19.5 TFLOPS for FP32, 78 TFLOPS for FP16 and BF16, and 312 TFLOPS for dense tensor-core FP16. So in this work, looking at the code, which capability is used? I'm interested in estimating the compute needed. And maybe they did not have the code developed to use more of the cards' capacity? Does anyone know?
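For a rough answer: pretraining throughput on A100s is usually measured against the 312 TFLOPS BF16/FP16 tensor-core peak, and the standard back-of-envelope for transformer training cost is ~6·N·D FLOPs (N parameters, D tokens). A sketch of the estimate, where the MFU (model FLOPs utilization) figure is an assumption typical of well-tuned runs, not a number from the report:

```python
# Back-of-envelope training compute for a 0.4B model on ~1T tokens,
# using the common ~6*N*D FLOPs approximation for transformers.
N = 0.4e9          # parameters
D = 1.0e12         # training tokens
total_flops = 6 * N * D

peak_bf16 = 312e12  # A100 dense BF16 tensor-core peak, FLOP/s
mfu = 0.40          # assumed model-FLOPs utilization (30-50% is typical)
gpus = 64

seconds = total_flops / (gpus * peak_bf16 * mfu)
days = seconds / 86400
print(f"total FLOPs: {total_flops:.2e}")
print(f"ideal wall clock on 64 A100s at {mfu:.0%} MFU: {days:.1f} days")
```

At these assumptions the raw training run comes out to only a few days of wall clock; real runs take longer once data loading, checkpointing, restarts, and evaluation are included.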
GroundbreakingMall54@reddit
0.4B params with actual multilingual focus on european languages is really cool. most people only train english or english+chinese. the bilingual pretraining approach sounds way more practical than trying to cram 20 languages into one tiny model
Borkato@reddit
Bot comment
phira@reddit
That's a great read, I'm glad they worked hard to contribute everything openly (except the post-training dataset but they gave a very reasonable explanation for that one)