The joy and pain of training an LLM from scratch
Posted by kazzus78@reddit | LocalLLaMA | 20 comments
mii-llm just released a detailed technical report on the development of the Zagreus and Nesso model families: a set of 0.4B parameter language models trained from scratch with a focus on edge deployment, multilingual capability, and European languages.
The report documents the full pipeline behind a family of small language models designed for Italian, Spanish, French, and Portuguese, with bilingual pretraining centered on English + target language settings.
Released models
- Zagreus-0.4B-ita — English/Italian base model
- Zagreus-0.4B-spa — English/Spanish base model
- Zagreus-0.4B-fra — English/French base model
- Zagreus-0.4B-por — English/Portuguese base model
- Nesso-0.4B-instruct — post-trained for conversational use
- Nesso-0.4B-agentic — post-trained for structured / agentic tasks
- Open-Zagreus-0.4B — fully open variant built with open data and open recipes
Training setup
According to the report, the project used:
- 64 NVIDIA A100 GPUs
- ~1 trillion tokens
- Datatrove for tokenization
- Hugging Face Nanotron for pretraining
- Axolotl for post-training
- Slurm for multi-node orchestration
The report also explains why a dense 0.4B architecture was selected instead of MoE, arguing that in the sub-1B regime, stability and utilization can matter more than sparse efficiency.
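The report doesn't reproduce the launch scripts, but a multi-node Nanotron pretraining job under Slurm typically looks something like the sketch below. Node counts, paths, time limits, and the config filename are illustrative assumptions, not the authors' actual setup; `run_train.py --config-file` is Nanotron's documented entrypoint.

```shell
#!/bin/bash
#SBATCH --job-name=zagreus-0.4b
#SBATCH --nodes=8                 # 8 nodes x 8 A100s = 64 GPUs
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --time=7-00:00:00

# Rendezvous endpoint for torch.distributed: first node in the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# Nanotron reads the parallelism layout (DP/TP/PP) from its YAML config;
# torchrun only handles process launch and rendezvous.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  run_train.py --config-file config_zagreus_0.4b.yaml
```

The one-launcher-per-node pattern (`--ntasks-per-node=1` plus `torchrun --nproc_per_node=8`) is the usual way to combine Slurm with torch elastic launch.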
Why this is interesting
A lot of current discussion focuses on frontier-scale models, but this report is a useful example of the opposite direction: small models trained from scratch for practical multilingual edge scenarios.
Some points that stand out:
- small multilingual models can still be competitive if the pipeline is well engineered
- post-training has a major effect on usability
- model behavior differs significantly across Italian and English tasks
- open pipelines can still produce meaningful results in this size class
- small models still show clear weaknesses in arithmetic, factual recall, repetition, and domain-specific knowledge
Benchmark notes
The report includes comparisons against Qwen3-0.6B and Qwen3.5-0.8B, along with multilingual evaluations and task-by-task analysis.
A few interesting takeaways:
- Nesso-0.4B-agentic appears especially strong and consistent on Italian tasks
- Qwen3.5-0.8B performs better on several English generative tasks
- Qwen3-0.6B stands out on logic / reasoning-style tasks
- the fully open variant still achieves competitive results in several settings
Figures
The report includes four figures:
- LLM-as-judge comparison
- Classical benchmark results
- Italian benchmark results
- English benchmark results
Main takeaway
This is a solid case study on what it actually looks like to train a small multilingual LLM from scratch in 2026: tokenization, storage, Slurm orchestration, distributed training, post-training, evaluation, and model release.
For anyone interested in small language models, multilingual training, edge deployment, or open LLM engineering, the report is worth a read.
Eyelbee@reddit
64 A100s for just 0.4B is insane. That destroys my plans to train a small model.
Clean_Hyena7172@reddit
The reality of training models from scratch is brutal.
Enthu-Cutlet-1337@reddit
64 A100s for 0.4B is the real story here, not the params. At that scale, data quality, sequence packing, and optimizer stability dominate; one bad token mix or LR schedule and you burn weeks for a model that still regresses on Italian.
Borkato@reddit
WTF this is a bot comment too??
Enthu-Cutlet-1337@reddit
Why do you think so?
Borkato@reddit
Summarized point in one line is the real story here. Leading phrase, x, y, z does q; quippy statement here
Enthu-Cutlet-1337@reddit
Thanks for letting me know. Next time onwards I won't take the time to summarise and shorten the comment.
Borkato@reddit
Notice that this comment has a completely different feel than your original one
Enthu-Cutlet-1337@reddit
Here I am not trying to summarise my thoughts. Rather I am trying to speak up against the “the account is a bot” claim.
One false claim like this, even if it has no actual proof, unfairly affects the user on Reddit.
Borkato@reddit
Can you please just admit you used an ai.
Enthu-Cutlet-1337@reddit
So, having a structured thought process and using semicolons now qualifies as 'bot cadence'? If summarizing a technical point about A100s and LR schedules is 'slop,' then the bar for human discourse is getting depressingly low.
I'm not going to admit to using AI just to make your 'comment cop' hunch feel valid. What's next? Are you going to start checking my comments for the proper usage of em and en dashes?
Borkato@reddit
You don’t have to admit it, I can’t make you do anything, but it’s just sad you’d lie like this.
Enthu-Cutlet-1337@reddit
It's a strange hill to die on. You're dismissing actual technical insight because you're obsessed with sentence structure. If you want to keep 'policing' users based on vibes instead of content, that's on you, but it’s definitely not the service to the community you think it is.
Borkato@reddit
Sure.
Enthu-Cutlet-1337@reddit
The usage of “x, y, z does q” is usually a great strategy to summarise thoughts.
Borkato@reddit
I edited it for clarity, it’s the overall cadence.
Constant-Simple-1234@reddit
I looked up the A100: it has 19.5 TFLOPS for FP32, 78 TFLOPS for FP16 and BF16, and 312 TFLOPS for dense tensor-core FP16. So in this work, looking at the code, which capability is used? I'm interested in estimating the compute needed. And maybe they did not have the code developed to use more of the cards' capacity? Does anyone know?
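For a rough answer: pretraining throughput on A100s is usually measured against the 312 TFLOPS BF16/FP16 tensor-core peak, and the standard back-of-envelope for transformer training cost is ~6·N·D FLOPs (N parameters, D tokens). A sketch of the estimate, where the MFU (model FLOPs utilization) figure is an assumption typical of well-tuned runs, not a number from the report:

```python
# Back-of-envelope training compute for a 0.4B model on ~1T tokens,
# using the common ~6*N*D FLOPs approximation for transformers.
N = 0.4e9          # parameters
D = 1.0e12         # training tokens
total_flops = 6 * N * D

peak_bf16 = 312e12  # A100 dense BF16 tensor-core peak, FLOP/s
mfu = 0.40          # assumed model-FLOPs utilization (30-50% is typical)
gpus = 64

seconds = total_flops / (gpus * peak_bf16 * mfu)
days = seconds / 86400
print(f"total FLOPs: {total_flops:.2e}")
print(f"ideal wall clock on 64 A100s at {mfu:.0%} MFU: {days:.1f} days")
```

At these assumptions the raw training run comes out to only a few days of wall clock; real runs take longer once data loading, checkpointing, restarts, and evaluation are included.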
GroundbreakingMall54@reddit
0.4B params with actual multilingual focus on european languages is really cool. most people only train english or english+chinese. the bilingual pretraining approach sounds way more practical than trying to cram 20 languages into one tiny model
Borkato@reddit
Bot comment
phira@reddit
That's a great read, I'm glad they worked hard to contribute everything openly (except the post-training dataset but they gave a very reasonable explanation for that one)