Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
Posted by angeletti89@reddit | LocalLLaMA | 46 comments
The problem
If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome.
I decided to fix this from the ground up.
What is Dante-2B
A 2.1B parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.
Architecture:
- LLaMA-style with GQA (20 query heads, 4 KV heads — 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight-tied embeddings, no MoE — all 2.1B params active per token
- Custom 64K BPE tokenizer built specifically for Italian + English + code
Why the tokenizer matters
This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza — 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead.
Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units — they're always single tokens, not two bytes glued together by luck.
Small detail, massive impact on efficiency and quality for Italian text.
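To make the contraction handling concrete, here's a toy reproduction of the difference. This is a deliberately simplified pattern (lowercase only), not Dante's actual regex, which is more involved:

```python
import re

# Simplified illustration only: an English-style pre-tokenizer splits on
# the apostrophe, while an Italian-aware pattern keeps elided articles
# attached to the following word. Lowercase-only classes keep it short.
english_style = re.compile(r"[a-zàèéìòù]+|'|[^\sa-zàèéìòù]+")
italian_aware = re.compile(r"[a-zàèéìòù]+'[a-zàèéìòù]+|[a-zàèéìòù]+|[^\sa-zàèéìòù]+")

text = "l'intelligenza"
print(english_style.findall(text))  # ['l', "'", 'intelligenza'] -> 3 pre-tokens
print(italian_aware.findall(text))  # ["l'intelligenza"] -> 1 pre-token
```

The BPE merges then learn on top of whatever pre-tokens the regex produces, which is why this one decision propagates through the whole vocabulary.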
Training setup
Data: ~300B token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code. Everything pre-tokenized into uint16 binary with quality tiers.
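Side note on the binary format: a 64K vocab is exactly what makes uint16 work, since every token id fits in two bytes. A minimal sketch of a shard (the real layout also carries quality-tier metadata, which I'm glossing over here):

```python
from array import array
import os, tempfile

# Token ids from a 64K-vocab tokenizer all fit in unsigned 16 bits
# ('H' = uint16); 65_535 is the largest representable id.
tokens = [17, 42, 65_535]

path = os.path.join(tempfile.gettempdir(), "shard.bin")
with open(path, "wb") as f:
    array("H", tokens).tofile(f)   # 2 bytes per token on disk

with open(path, "rb") as f:
    back = array("H")
    back.frombytes(f.read())

assert list(back) == tokens
```

In a real pipeline you'd memory-map the shard rather than read it whole, so the dataloader touches only the sequences it needs.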
Phase 1 (just completed): 90B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid — no NaN events, no OOM, consistent 28% MFU.
Phase 2 (in progress): Extending to 4096 context with 30B more tokens at reduced LR. Should take ~4-7 more days.
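For reference, the warmup + cosine schedule from Phase 1 as a standalone function (the total step count here is illustrative; the real value falls out of tokens and batch size):

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total_steps=90_000):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

assert abs(lr_at(2000) - 3e-4) < 1e-12        # warmup ends at peak LR
assert abs(lr_at(90_000) - 3e-5) < 1e-12      # decays to the 10% floor
```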
What it can do right now
After Phase 1 the model already generates coherent Italian text — proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale.
I'll share samples after Phase 2, when the model has full 4K context.
What's next
- Phase 2 completion (est. ~1 week)
- HuggingFace release of the base model — weights, tokenizer, config, full model card
- SFT phase for instruction following (Phase 3)
- Community benchmarks — I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes
Why I'm posting now
I want to know what you'd actually find useful. A few questions for the community:
- Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
- What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
- Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model — should I release it separately?
About me
I'm a researcher and entrepreneur based in Rome. PhD in Computer Engineering, I teach AI and emerging tech at university, and I run an innovation company that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch — you need good data, a clean architecture, and patience.
Everything will be open-sourced. The whole pipeline — from corpus download to tokenizer training to pretraining scripts — will be on GitHub.
Happy to answer any questions. 🇮🇹
Party-Special-5177@reddit
Although none of this matches your questions, I physically could not stop myself from commenting so I apologize.
Back of the napkin looks like 32.5k tps per card, which isn't excellent for that card vs your model size and precision (I suspect you should be able to just slightly exceed those speeds at bf16). It's more front-end work, but it might be worth (having AI help you) rolling your own PyTorch loop so you can actually dive in and make some hand optimizations. I've never used DeepSpeed before and don't actually know how much control you have there.
What are you getting at here? There is an argument to be made for fertility, but splitting on contractions is desired behavior for small(ish) models. I have some work I’m releasing in about a month where I’ll be making the argument that that first attention layer actually isn’t part of the model at all, it’s part of the embeddings. It does a ton of things (including healing bad token splits), but in your case it is a form of lemmatization, where the split actually helps the model generalize that pattern faster.
As a result, the model gets to learn the “l’” and “intelli…” primitives separately, rather than having to rediscover that “l’intelli…” has a similar meaning to “intelli…”. You do lose some context window, but you gain back token efficiency during training (this one is huge) and generalizability during test time.
Run it both ways if you can - I’d bet coin it is just your training mix doing the heavy lifting.
Whyyyy? My guy, you can extend your context 4 to 8x with YaRN or LongRoPE and fully heal for around 1.2 tokens per parameter.
Please double-check your plan, it really looks like you are burning a lot of money for no reason.
Llama style you said, so you have both rmsnorm and qk norm - it isn’t mathematically possible for your model to NaN. It’s crazy how much of a guardrail those 2 norms are, you can run it balls out (in terms of LR) and it still won’t explode, it just won’t converge. I’ve been experimenting with relaxing the norms actually as I suspect they are over constraining.
I’m dying, send help. Please optimize your code I’m begging you, this physically pains me to read
angeletti89@reddit (OP)
No need to apologize, this is exactly the kind of pushback I was hoping for. Let me go through it:
MFU at 28%: I feel the pain too, trust me. For context: that 28% is on 2 GPUs (not 8) with ZeRO-2 communication overhead, torch.compile with reduce-overhead mode, and FP8 via torchao on all linear layers except lm_head. The key tradeoff: I'm running with zero activation checkpointing. That means no recomputation during backward -> pure speed per step, at the cost of higher VRAM. That's the reason for the H200s (more on that below). With activation checkpointing on, VRAM drops but you're recomputing activations and step time goes up. I chose wall-clock speed over MFU percentage. That said, 28% is where I landed after JIT compilation, FP8 conversion, fused attention via SDPA/Flash Attention, and quite a bit of profiling. I'm sure there's more juice to squeeze, genuinely curious what MFU you're hitting on similar setups.
Tokenizer and contraction splitting: This is a really interesting argument and I'd love to read your work when it's out. You're right that splitting lets the model learn primitives independently, e.g. the "l'" prefix becomes a reusable component. My counter-argument is efficiency at inference time: Italian text is ~40% apostrophe contractions by frequency. At 2B params with a 4K context window, those extra tokens add up fast and you're effectively seeing less content per forward pass. But I'll be honest: I optimized for fertility and haven't run a controlled A/B on downstream quality. If your work shows the split helps generalization at small scale, I'd seriously consider it for v2. Fair challenge.
Phase 2 context extension vs YaRN/LongRope: Genuine question. At 2048→4096 (only 2× extension), is YaRN actually better than continued pretraining? My understanding is that YaRN and LongRope shine at large extension ratios (4-8×) where continued pretraining would be prohibitively expensive. At 2× I figured the model benefits from actually seeing longer documents during training rather than just interpolating positions it's never attended over. But I'm open to being wrong here, if you have references showing YaRN fully heals at 2× with ~2.5B tokens of warmup, I'd genuinely like to see them.
NaN and norms: Fair point. RMSNorm is a strong guardrail. Though I should clarify: no QK norm in this architecture, just pre-norm RMSNorm on attention and FFN inputs plus a final norm. The "no NaN" note in the post was more for people who've had unstable runs, not claiming it was a heroic achievement.
H200 vs H100: I'm on 2×H200, not 8×. The cost comparison shifts significantly at that scale. The extra VRAM is what lets me run no activation checkpointing with micro_batch=24 at seq_len=2048 (and micro_batch=12 at 4096 in Phase 2). On 80GB H100s I'd have to either checkpoint activations (slower steps) or cut batch size (more gradient accumulation steps, worse throughput). The VRAM headroom is buying me training speed, not sitting idle. Also tried Liger fused cross-entropy early on, it's incompatible with torch.compile (internal .item() call breaks the graph → CUDA illegal memory access). Ended up with a non-chunked F.cross_entropy which is simple but works cleanly with JIT.
On the cost overall: Is this the most cost-efficient way to train a 2B? Probably not. But no course, no tutorial, and no paper taught me as much about LLM training as actually doing it end-to-end. From corpus assembly to tokenizer design to debugging NCCL hangs at 2am. The "waste" is the education, and I'm sharing everything so others can skip the expensive mistakes.
Seriously though: when your tokenizer work drops, link me. The lemmatization argument is compelling and I want to test it man!
Party-Special-5177@reddit
Re MFU: GPU count doesn't affect MFU, it's more of a dimensionless ratio of compute used / compute available. Sorry if you already knew that. This ties into H100 vs H200, as training always kinda comes back to usable FLOPs per dollar-hour. If you are paying $4.50 an hour for 2x H200, and $9.00 gets you 8x H100, you can gain a 4x speedup for 2x cost just by getting under that 80GB threshold. This rough math will hold for tiny-model, compute-bound loads like yours. In my experience, the tps penalty to move a setup from 8xH200 to 8xH100 (holding all other factors equal) is about 6%, which doesn't even begin to eat into your cost and time savings.
Re grad checkpointing: to be clear, you really don’t need it to fit a 2B model in h100s, trust me, but again, if you can’t find any other method to fit on H100s, you should turn it on anyway and eat the 20-30% penalty, as you’ll still come out ahead.
You really should be able to get there just with gradient accumulation - your ‘worse throughput’ comment reeks of you using AIs to advise your training strategy. Ignore this comment if I’m wrong, but if I’m right: bots generally have the right logic but have no intuition of magnitude, as here. It is right that grad accum causes bus traffic and, e.g. a 4x accumulation loop adds 4 PyTorch loops for every otherwise-on-VRAM step.
….however, the penalty turns out to be negligible, especially considering the other performance bits you’re leaving on the table. Generally, if a bot tells you to do a thing with hyperbole (e.g. ‘…and if you use grad accum, then PyTorch loops will shred your GPU and your H200s will be STARVED while they sit there doing nothing waiting on the GIL…’ - I know what this looks like as I’ve seen all the big bots do it XD), you say ‘makes sense, thanks’ and then go try it anyway.
In this case, I think you will find you won't notice any loss in TPS, you can keep DDP and no grad checkpointing, and finish your run in 5 days or less rather than 16. I have never noticed a change in TPS from accum on/off that stood out from the environmental cluster-to-cluster background noise.
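For anyone following along, a bare-bones accumulation loop looks like this (generic PyTorch sketch, not OP's DeepSpeed setup):

```python
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum = 4                                       # micro-batches per optimizer step

for step, batch in enumerate(torch.randn(8, 4, 16)):   # 8 micro-batches of 4
    loss = model(batch).pow(2).mean() / accum   # scale so grads average
    loss.backward()                             # grads accumulate in .grad
    if (step + 1) % accum == 0:
        opt.step()          # one weight update per `accum` micro-batches
        opt.zero_grad()
```

The extra Python iterations are trivially cheap next to the forward/backward kernels at 2B scale, which is the whole point above.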
Context extension: in all cases yarn/longrope is better, or you chose your RoPE frequencies poorly lol. To clarify why, if you have space in your lowest frequencies that you aren’t using, then you are asking your model to discern long range dependencies with less range on the signal than your attention could be using. Without getting too in the weeds, strong signals train the model faster, and help robustness on downstream tasks (especially in your case considering everything you are doing is low rank, your heads are low rank, you are training in 8 bit precision, etc).
You should do context extension after pretraining completes. It doesn’t take much for your attention to learn the new frequencies, again like 1.2 tokens per parameter. Your model will see longer documents, during your SFT phase. Cites are mostly the main papers themselves, plus it is industry standard now:
angeletti89@reddit (OP)
Quick update since you put so much effort into your feedback.
SFT is now running. 540K conversations (40% Italian, 11 datasets), 1 epoch on 2×H200 NVL. Current numbers:
So ~89k tok/s sustained, 33.5% MFU with the corrected H200 NVL peak (1,671 TFLOPS, not 1,979), ETA ~2 hours to completion. Config: micro_batch=12, reduce-overhead, no activation checkpointing, FP8 ON. You'll notice the MFU is exactly what I recalculated after your comment. Your pushback literally fixed my reporting.
On Liger FLCE: I have Liger 0.7.0 installed now and the torch.compile compatibility docs look promising. However during SFT testing I found that Liger FLCE + FP8 + CUDA graphs diverges to NaN. Specifically the combination of torchao float8 training and Liger's fused kernel doesn't play nice when reduce-overhead mode enables CUDA graphs. I had to fall back to standard F.cross_entropy. I suspect it's a precision issue in the fused backward pass when the linear weights are already in float8 format. If you have experience mixing fused CE kernels with FP8 training I'd love to hear how you handle it.
On FA3: You and u/FullOf_Bad_Ideas both recommended it, so I tried. Short version: it was a nightmare on my setup. The flash-attn package (2.8.4) detects Hopper and reports FA3 as available, but actually building and running FA3 training kernels requires a specific build from source with CUDA 12.8 and the right sm_90a flags. The pre-built wheel ships FA2 kernels that route through SDPA automatically, which is what I've been using. Building FA3 from source failed repeatedly with compiler errors on my setup, and after burning a few hours of GPU time debugging the build I gave up and stuck with FA2 via SDPA. It works, it's stable, and the MFU delta wasn't worth the build hell. Maybe I'll revisit when the FA3 pip install just works out of the box.
On YaRN: Haven't implemented it yet but I've read the paper and the Cerebras blog. Plan is to use it for future context extensions (4096 → 16K+) after SFT, since modifying only the RoPE frequencies and fine-tuning for ~2.5B tokens is obviously better than burning 30B tokens on continued pretraining. You saved me real money with that callout.
This thread keeps paying dividends. Will post again when the SFT model is ready for eval.
FullOf_Bad_Ideas@reddit
you have a really cool and substantive convo going on here, it's a pleasure to read it all!
I don't see where the other commenter mentioned FA3 in this thread, but nonetheless, here's how I install FA3 on H100s (jammy); maybe it'll help you decide to revisit it
FA3 install procedure is weird witchcraft; it just worked for me after I did the above, and I definitely did not have FA2 installed at the time, only FA3, as confirmed by looking at an old `pip list` generated by Megatron-LM.
I can post a gist with the rest of the setup too (apex, megatron-lm) specific to my model if that would be any help.
I think FP8 makes it much harder to get 40%+ MFU, it does not look bad at all to me.
If you want to reduce your training costs 10x in the future and you pay out of your own pocket, you can try renting consumer GPUs like 8x 3090, 8x 4090 or 8x 5090s. My local rig is barely cheaper to run electricity-wise than renting is, and I calculated that I get a TFLOP about 10x cheaper than when I was renting an 8x H100 SXM node for 23 euro/hr (I was getting piss-poor MFU on H100s tho). 3090 Tis don't have FP8 but 4090 and 5090 do, and it's actually somehow competitive in my experience despite slow interconnects that you'd expect would kill the training speed. I had about 40% MFU on 3090 Tis and I paid about 220 euro in electricity to train my model on 28B tokens. You'd need to be more patient with it and maybe do more checkpointing to HF, but if time is not of the essence and the project isn't a big proprietary secret, it saves money. I doubt that either H100 or H200 comes close to that price, community-hosted GPUs on Vast are really affordable and usually stable for week-long jobs.
I'm curious, how are you handling sample packing and cross-sample contamination during your SFT training now?
angeletti89@reddit (OP)
FA3: Thanks for the exact install steps, saving this for next time. I did try building from the hopper/ subdirectory but kept hitting compiler issues with my CUDA 12.8 + sm_90a setup. Probably a version mismatch somewhere that I didn't have the patience to hunt down mid-project. For the next model I'll start fresh with FA3 from day one instead of trying to retrofit it into an existing stack. Your note about not having FA2 installed at the same time is useful, I had both coexisting which might have been part of the problem.
FullOf_Bad_Ideas@reddit
yeah, a lot of time spent on training is just debugging those kinds of package issues here or there.
here's a gist of how I was setting up my H100 instances for training. It's not cleaned up outside of removing api keys, but you could easily do it if you want, maybe it will be helpful to you in a future - https://gist.github.com/adamo1139/2065ada54233dcce0cb88cbd2d68191b
angeletti89@reddit (OP)
It seems that I tend to learn more from mistakes than anything else. I am evaluating the model right now, trying to figure out if it is good or bad. Next model will train in a better optimized env, saved your gist!
angeletti89@reddit (OP)
Totally agree. For better and for worse, it seems I tend to learn more from mistakes than anything else.
I am trying to figure out why the model is performing so badly right now. I think I will have to build a new one from scratch. Look at that
angeletti89@reddit (OP)
This is a really good question. Right now I'm doing offline pre-packing, not packed attention. I'll explain below, but honestly I was so excited to see the model talking that for the SFT I took a shortcut; I'll fix it in a later SFT round. So, back to the topic ->
The SFT pipeline normalizes everything to a ChatML-style format, tokenizes each conversation with explicit boundaries (`<|begin_of_text|>`, role headers, `<|eot|>`, `<|end_of_text|>`), masks user turns out of the loss, then greedily concatenates multiple conversations into fixed 4096-token blocks before saving them to disk. The trainer just loads those pre-packed `input_ids`/`labels` tensors directly.
So in the strict sense: yes, cross-sample contamination exists right now. I'm not building a block-diagonal attention mask or resetting attention per packed-sample boundary. In the training code the model is fed the packed sequence as-is, and the layer forward path is called with `mask=None`, so attention can flow across conversation boundaries inside the packed block.
What I am doing to soften that:
So the current setup is basically: packing for throughput, boundaries for damage control, but not true contamination-free packed training.
It was a conscious speed tradeoff for this run because I wanted the zero-overhead pre-packed path and the throughput bump. For the next revision, the “proper” fix would be either segment-aware/block-diagonal masks or a packed-sequence attention kernel that respects sample boundaries.
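For the record, the segment-aware mask is only a few lines if you go the SDPA boolean-mask route. A sketch (untested at scale; a fused varlen attention kernel is the faster production path):

```python
import torch

def packed_attn_mask(seq_ids: torch.Tensor) -> torch.Tensor:
    # seq_ids: (T,) sample id per position after greedy packing,
    # e.g. [0, 0, 0, 1, 1, 2, ...]. Returns a (T, T) bool mask where
    # True = "may attend": causal AND within the same packed sample.
    T = seq_ids.shape[0]
    same_sample = seq_ids.unsqueeze(0) == seq_ids.unsqueeze(1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return same_sample & causal

mask = packed_attn_mask(torch.tensor([0, 0, 1, 1, 1]))
assert not mask[2, 0]   # start of sample 1 can't see sample 0
assert mask[4, 2]       # ...but can see earlier positions of sample 1
```

A boolean mask like this drops straight into `F.scaled_dot_product_attention`'s `attn_mask` argument (True = participate), at the cost of materializing the (T, T) mask.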
If you’ve got a packed-attention implementation you like that plays nicely with SDPA / FA2 / FA3 in a simple PyTorch setup, I’m all ears. In the meantime, as soon as the SFT finishes (10 minutes left), I'll try your setup to install FA3, crossing all the fingers I have
FullOf_Bad_Ideas@reddit
Nope, I am just using off the shelf training frameworks that handle all of that for me. LLaMa-Factory for SFT and whatever packing they have on by default.
I'd be curious to know how big of an SFT dataset you've gathered and how you did it.
I tend to not mask user turns, mainly out of an assumption that in a data-scarce environment, all data is good. Plus it's a safer route to average performance when you don't quite know whether prompts or completions will be disproportionately longer. The ideal approach would be to give less weight to the user turn rather than mask it completely. Here's a good read about masking user turns - https://towardsdatascience.com/to-mask-or-not-to-mask-the-effect-of-prompt-tokens-on-instruction-tuning-016f85fd67f4/
Party-Special-5177@reddit
I really appreciate the circle around - I’m rooting for you, seriously. Happy to hear your successes.
89k tps: is that per card? Also, what are you using as your optimizer? Also are your activations also in fp8 (w8a8)?
If you post your codebase, I’d love to take a look. I train in bf16, and am of the general opinion that fp8 is usually not worth the trouble, excepting when used inference only on a teacher model (and even then I do w8a16). Caveat emptor I usually run a strange setup with a lot of custom modules and can’t be bothered to make my stuff play nice with fp8.
Can’t wait to see your output!
angeletti89@reddit (OP)
Thanks man, really appreciate it! To your questions:
89k tok/s: That's total across both cards, so ~44.5k per GPU. Not great per-card, but with ZeRO-2 all-reduce overhead on only 2 GPUs the communication-to-compute ratio isn't ideal. Should scale better with more GPUs.
Optimizer: AdamW via DeepSpeed, β1=0.9, β2=0.95, weight decay 0.1, eps 1e-8. Nothing fancy. Cosine schedule with 10% floor.
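For completeness, the same knobs in DeepSpeed's JSON-config shape (field names follow DeepSpeed's standard optimizer section; the lr shown is the pretraining peak from earlier in the thread, so treat this as a sketch, not my exact config file):

```python
# Sketch of the optimizer section of a DeepSpeed config using the
# hyperparameters quoted above.
ds_config = {
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-4,             # pretraining peak; cosine floor is 10% of this
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 0.1,
        },
    }
}
```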
FP8: Not full w8a8. The torchao `convert_to_float8_training` converts eligible linear layers (everything except lm_head) to do their forward pass in fp8, but activations flowing between layers stay in bf16. So it's more like w8a16 for the matmuls, with a bf16 residual stream. Honestly, after all the headaches (Liger incompatibility, CUDA graph issues, debugging time), I'm starting to agree with your take that fp8 for training is often not worth the trouble. The throughput gain exists but it's maybe 10-15%, and the debugging surface area it adds is significant. For v2 I might just go pure bf16 and spend the optimization budget on things that don't break other parts of the stack.
Codebase: Will be fully open. The entire pipeline, from corpus download to tokenizer training to pretraining to SFT, every script numbered 1-9. Should be up on GitHub pretty soon; I just have to clean up the repo. I'll tag you when it drops.
I'm also writing a Medium article series documenting the entire engineering journey from corpus assembly to these benchmarks. Will share links when the first articles drop.
Weights will be on HuggingFace alongside the repo. Stay tuned ;)
angeletti89@reddit (OP)
As always, I'll try to go point by point.
MFU and GPU count: You're right, MFU is dimensionless. I mentioned the 2 GPUs not as an excuse for the percentage but to clarify the communication overhead context (ZeRO-2 all-reduce on 2 GPUs vs 8). Fair correction though, the number is the number. Actually, while writing this I went back and checked my MFU calculation and found a mistake. MFU is simply the ratio of achieved throughput to the theoretical hardware peak:
MFU = achieved FLOPs/sec ÷ peak hardware FLOPs/sec
I was using 1,979 TFLOPS as peak, which is the H200 SXM variant (700W). But my setup is 2×H200 NVL (the PCIe variant with NVLink-C2C, rated 600W, actually running at ~580W). The correct BF16 peak for H200 NVL is 1,671 TFLOPS. So the real MFU is closer to ~33%, not 28%. Still not great, but less painful than I reported. Thanks for making me double-check this.
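Plugging in round numbers with the usual 6·N·D approximation for training FLOPs per token (back-of-envelope, not a profiler readout; throughput taken from the later SFT run):

```python
n_params = 2.1e9            # model size
tok_per_s = 89e3            # total throughput across both GPUs
achieved_tflops = 6 * n_params * tok_per_s / 1e12   # ~1121 TFLOP/s
peak_tflops = 2 * 1671      # 2x H200 NVL, BF16 dense peak each
mfu = achieved_tflops / peak_tflops
print(f"MFU ~ {mfu:.1%}")   # lands around the corrected ~33% figure
```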
After Dante-2B I'll probably try a MoE with 7B/9B total and 2-3B active. In that case I'll need to scale well beyond 2 GPUs and a bigger cluster is unavoidable, hoping to find some funding for that one.
H100 vs H200 cost math: This is a compelling argument and I'll be honest, the H200 choice wasn't a carefully optimized decision. I got a good deal on a 2×H200 NVL node that brought the price close to what 2×H100 would have cost elsewhere, so I took the extra VRAM and ran with it. You're absolutely right that at full price an 8×H100 node would have been the smarter play. 4× GPU count at \~2× cost with only 6% tps penalty is hard to argue against. For Phase 1 this ship has sailed, but for a bigger model I'd have no choice anyway. Good callout.
Gradient accumulation: Fair enough, I'll own this one. I assumed the throughput penalty would be larger than it actually is without properly benchmarking it. You're right that the overhead from accumulation steps is negligible compared to the actual forward/backward compute at this model size. Lesson learned: measure, don't assume. I'll profile accum on vs off properly next time.
Context extension / YaRN: OK this is the most valuable part of your comment IMO. The Cerebras result (Llama 8B to 1M context on 10B tokens) is wild and I hadn't seen it. You've convinced me, for a 2× extension, continued pretraining with 30B tokens is almost certainly overkill when YaRN can heal it with ~2.5B tokens. I'm mid-Phase 2 so I'll finish this run, but the YaRN approach is clearly the right call for future context extensions. The RoPE frequency argument makes physical sense too. If you're not using the full frequency range, you're leaving signal on the table.
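To make the frequency argument concrete, here's plain position interpolation in a few lines (the baseline that YaRN refines by rescaling only the low-frequency bands instead of all of them):

```python
def rope_angles(pos, d_head=128, base=10000.0, scale=1.0):
    # Rotation angle per RoPE frequency band at a given position.
    # Dividing positions by `scale` is plain position interpolation:
    # extended positions reuse angles the model already trained on.
    return [(pos / scale) * base ** (-2 * i / d_head) for i in range(d_head // 2)]

# A 2x extension: with scale=2, position 4094 sees exactly the angles the
# model learned at position 2047 during 2048-context pretraining.
assert rope_angles(4094, scale=2.0) == rope_angles(2047)
```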
Liger CE: The issue is specifically `torch.compile` with `mode='reduce-overhead'`. Liger's `LigerFusedLinearCrossEntropyLoss` calls `.item()` internally, which forces a graph break, and in reduce-overhead mode that triggers a CUDA illegal memory access. It works fine in eager mode or with `torch.compile(mode='default')`. Could be a version-specific thing; I was on torchao 0.5.x at the time. If you ever get time to poke at it I'd genuinely appreciate it, because that 24GB saving would be significant.
Real talk: this whole thread has been one of the most useful learning experiences of the project. The YaRN point alone will save me thousands of dollars on future runs. This is exactly why I posted before finishing. Getting this kind of feedback mid-project is worth way more than posting a polished model card after the fact.
angeletti89@reddit (OP)
Update #2: SFT complete, first ITA-Bench results are in.
SFT training finished. 540K conversations (40% Italian, 11 datasets), 1 epoch, ~3.5 hours on 2×H200 NVL. Now let's talk numbers. Ran ITA-Bench (Sapienza NLP's Italian evaluation suite) against comparable instruct models in the 1B-3B range. Here are the results:
What the benchmarks show:
Dante-2B-Instruct is competitive with established instruct models in its weight class, including Granite-3.2-2B, Granite-3.3-2B, OLMo-2-1B, and Qwen2.5-1.5B. On Math (GSM8K) it holds its own at ~0.50, on par with models that had far more resources behind them. WiC and Other categories are tightly packed across all models.
SmoLLM3-3B leads in most categories, which is expected since it's 50% larger. The more interesting comparison is against models at or below 2B, where Dante is consistently in the mix despite being trained from scratch on 2 GPUs with a fraction of the budget.
Where it's weak: MMLU (~0.27) is the floor of the group. Not surprising for a 2B trained on ~120B tokens total. World knowledge scales with both parameters and data, and we're at the low end of both. This should improve with more training data or a larger v2.
The real story isn't that Dante-2B beats anything. It's that a from-scratch model with a native Italian tokenizer, trained by one person on 2 GPUs, lands in the same ballpark as models built by IBM (Granite), Allen AI (OLMo), Alibaba (Qwen), and HuggingFace (SmolLM). The tokenizer and data quality are doing a lot of heavy lifting here.
mrtrly@reddit
The tokenizer choice is everything here. Italian morphology is dense enough that a generic bpe vocab wastes space fast, and you're fighting it through the whole training run. Did you build a custom tokenizer for the bilingual split or stick with something existing and retrain it? That decision alone probably saved or cost you days.
angeletti89@reddit (OP)
Custom from scratch, no existing tokenizer reused. Trained a 64K BPE on a character-balanced mix (~42% Italian, 36% English, 22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact (l'intelligenza, dell'algoritmo stay as single pre-tokens instead of splitting on the apostrophe). Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units in the initial alphabet so they're always single tokens, not two bytes glued together by luck.
Fertility results:
For reference, English-first tokenizers like LLaMA's score 1.8-2.5 on Italian. So roughly 30% fewer tokens on Italian text with zero English regression.
You're right that the decision compounds through the entire training run. Every batch sees more actual content per sequence, the model learns Italian morphology from clean representations instead of fighting tokenizer artifacts, and at inference time you fit more Italian text in the context window. It's the highest-ROI piece of the whole pipeline and I'd recommend it to anyone building a non-English model.
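For anyone who wants to reproduce the fertility comparison on their own text, the metric itself is one line (any `encode` callable works, e.g. a HF tokenizer's `encode`; the whitespace word split is the usual simplification):

```python
def fertility(encode, text):
    # Fertility = tokens per whitespace-delimited word; lower is better.
    return len(encode(text)) / len(text.split())

# Toy check with a character-level "tokenizer" (list -> one token per char):
assert fertility(list, "ciao mondo") == 5.0
```

Run it over a held-out sample of each language to get per-language numbers comparable to the ones above.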
simmessa@reddit
Great work, thanks! Luckily there are a few Italians hanging around this community!
angeletti89@reddit (OP)
Thanks! We may be few, but we're good. When the model is ready I'll tag you in the release post, it needs feedback from native speakers!
FullOf_Bad_Ideas@reddit
Cool. I'm doing something similar for Polish. 4B MoE, I moved training to local machine recently but I started on 8x H100 node.
I took a pause there but once I'll get bigger SFT dataset I should be able to move it across the finish line. All intermediate data is open source already though, I called it Poziomka.
What made you choose this size and dense architecture? What pre-training framework are you using? Do you use FA2 or FA3? How are you sourcing your Instruct SFT dataset?
angeletti89@reddit (OP)
Nice! Polish is another language that gets the short end of the stick with English-first tokenizers. Will check out Poziomka, curious about your MoE setup at 4B. To your questions:
Size and dense: 2B dense was a deliberate constraint. I wanted to prove the "native bilingual tokenizer + clean data + from-scratch training" thesis at the smallest scale where it's still meaningful. Dense keeps the architecture and debugging simple so no routing headaches, no load balancing, no expert collapse to worry about. If the approach works at 2B dense, scaling up or adding MoE is a clear next step with known tradeoffs. How's your experience been with MoE at 4B?
Framework: DeepSpeed ZeRO-2 for sharding, torch.compile with reduce-overhead mode, and FP8 via torchao on the linear layers (excluding lm_head). Nothing exotic, I wanted a stack I could debug myself on 2 GPUs without fighting framework issues.
Attention: FA2 through PyTorch's SDPA backend that kicks in automatically with the right head dims (d_head=128, power of 2). Haven't tried FA3 yet, is it stable enough for training in your experience?
SFT dataset: Haven't started sourcing yet; the model is still finishing Phase 2. My current plan is a mix of translated instruction datasets (probably OpenHermes/Orca-style, filtered and translated) plus native Italian instruction data if I can find or build enough of it. This is honestly the part I'm least sure about. How are you approaching it for Polish?
FullOf_Bad_Ideas@reddit
I did run into router and expert collapse when I tried to use an 8-bit optimizer at the start, and I was a bit underwhelmed by how the scaling laws and the effective leverage over dense played out in practice. MFU on the H100s was sub-par, though the model got a lot better quickly when I changed the active expert count from 4/128 to 16/128 mid-training, and MFU on my local 3090 Tis is great. In theory dense becomes less efficient per unit of compute rather quickly, probably around 100B tokens for a 0.5B model, and MoE is definitely better once you patch issues like low MFU, but it's not smooth sailing.
I used FA3 with Megatron-LM; it was totally stable, and AFAIR I got significantly better MFU with it than with FA2.
I translated about 200M tokens of open instruct data using Seed-X-PPO-7B; I couldn't source a significant amount of real Polish instruct data. It's very low quality, but the model works anyway. It can't rhyme, because translated poems/lyrics don't preserve rhymes, and it often uses USD as its base currency; culturally, it feels like it lives in the USA but speaks Polish. It doesn't know Polish literature and its symbolism well, for example. I plan to generate more SFT data with Magpie and translate more instruct datasets to get to about 2B tokens, then SFT on that. After that, some preference optimization, since I always find it beneficial for performance. I want to train a reasoning version too, just to see it generate CoT in Polish.
angeletti89@reddit (OP)
The MoE journey sounds familiar, the kind of thing that looks clean in papers but gets messy in practice. Interesting that 16/128 active experts made a real difference over 4/128. Makes sense that the model needs enough routing diversity to actually specialize, but that's exactly the kind of thing you only learn by running into it.
Good to know FA3 is stable with Megatron-LM, I'll seriously consider it for a future run. If the MFU gain is significant it could offset a lot of the cost concerns that another commenter in this thread raised.
Your SFT experience is exactly what I'm afraid of for Italian. The "speaks Polish but lives in the USA" problem is spot on: translated instruct data inherits the cultural frame of the source language. Italian has the same issue: translated datasets will reference dollars, Thanksgiving, and the FDA instead of euros, Ferragosto, and AIFA. The rhyming problem is even worse for Italian poetry, since the whole metric structure breaks in translation.
Of the ideas I'm considering after reading your experience, the reasoning/CoT in Polish one is the most fascinating: CoT in a non-English language is genuinely underexplored territory. Same plan here eventually for Italian.
Let's stay in touch, we're basically running the same experiment in parallel. Would be great to compare notes on the SFT phase when we both get there.
FullOf_Bad_Ideas@reddit
I think routing diversity was fine with the 4/128 setup, though I actually didn't measure it (I probably should...). I think it was just not enough parameters and I was hitting some floor. With 4 active experts it was 4B-A0.3B, so very sparse.
Seed-X-PPO is actually one of the better models for translation. It's a dedicated translation model with performance similar to DeepSeek R1 and Gemini 2.5 Pro. You'd probably need frontier closed models like GPT 5.4 or Gemini 3/3.1 Pro to significantly beat it at scale, and that could get prohibitively expensive if you want at least 100M tokens, and probably a few times more. I want to do most of my training work locally, so Seed-X-PPO was the best option. I translated those 200M tokens locally, and the last 28B tokens of training were also done locally. It made for good heating as we were coming out of winter.
I think DataFlow (from Llama-Factory team) has a fleshed out workflow for doing this, I'll be looking into that for Polish too.
angeletti89@reddit (OP)
I didn't know about Seed-X-PPO, nor was I aware it performed at that level for translation. That could change my calculus significantly; I'll benchmark it against Qwen-72B on some Italian test pairs before committing to a pipeline. Running it locally is a huge plus over burning API credits on frontier models. Maybe I can use my DGX Spark for that, or, since it's only 7B, even the 5080.
DataFlow from the Llama-Factory team is a great shout; I hadn't looked into it yet, will check it out. If it can handle the "real native content → instruction pairs" pipeline cleanly, that could solve the cultural grounding problem for both of us.
And the local GPU heating benefit is genuinely underrated. My H200s are rented so I don't even get free heating out of this, seems you're ahead on ROI already!
smflx@reddit
Thanks a lot for sharing your valuable experience. I'm also going to build a small bilingual LLM, but Korean/English. Great to hear it took only 16 days; that's faster than I feared. I will learn a lot from your trace!
angeletti89@reddit (OP)
Thanks! Korean is a great candidate for the same approach, since the efficiency gap between English-first tokenizers and a native Korean one should be even bigger than for Italian, especially with the Hangul syllable blocks.
A few things that saved me time in case they help: rely on upstream dataset quality as much as possible (FineWeb-2 and similar are already well-deduplicated), balance your tokenizer training data by character count not document count, and don't underestimate how much a clean tokenizer improves downstream quality. It's the highest-ROI piece of the whole pipeline.
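The character-count balancing is easy to get wrong, so here's roughly what I mean in code. This is an illustrative sketch (function and corpus names are made up, not my actual script): compute each language's current character share, then derive oversampling weights toward the target mix.

```python
from collections import defaultdict

def char_balanced_weights(corpus, targets):
    """Per-language sampling weights so each language contributes to
    tokenizer training by *character* share, not document share.

    corpus:  list of (lang, text) pairs
    targets: desired character share per language, e.g. {"it": 0.42, "en": 0.36, "code": 0.22}
    """
    chars = defaultdict(int)
    for lang, text in corpus:
        chars[lang] += len(text)
    total = sum(chars.values())
    # weight = target share / current share; > 1 means oversample that language
    return {lang: targets[lang] / (chars[lang] / total) for lang in chars}
```

Balancing by documents instead would over-represent whichever language has shorter documents, which is exactly the failure mode for morphologically rich languages.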
Happy to share notes when the repo goes public. Good luck with the Korean model and ping me when you have something running!
smflx@reddit
Thank you for the detailed response. I appreciate it!
Yes, token inefficiency is big for Korean. Actually, there was an approach where a 10B model was retrained after a tokenizer change.
And using high-quality data is the same approach I'm after. My plan is 500B tokens. Good to hear FineWeb-2 is already deduplicated. I had thought of rewriting with an LLM to make the dataset denser; maybe that's not needed.
Thanks so much for your advice; the tokenizer is another way to improve quality and density. Best of luck with your Italian model.
angeletti89@reddit (OP)
500B tokens for a 10B-class model sounds like a solid plan, well over the Chinchilla ratio. On the LLM rewriting though, I'd be cautious. It can improve consistency but you risk homogenizing the style and losing the natural distribution of the language. Probably better to invest that compute in better filtering than in rewriting.
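For a quick sanity check on that ratio (just back-of-the-envelope arithmetic, using the usual ~20 tokens-per-parameter Chinchilla rule of thumb):

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Tokens-per-parameter ratio; Chinchilla suggests ~20 as compute-optimal."""
    return tokens / params

ratio_korean = tokens_per_param(500e9, 10e9)  # 50 tokens/param, 2.5x Chinchilla
ratio_dante = tokens_per_param(300e9, 2.1e9)  # ~143 tokens/param for Dante-2B
```

Both runs are well past compute-optimal, which is the right call for small models you actually want to deploy: extra tokens keep buying quality long after the Chinchilla point.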
Good luck with the Korean model. If you're up for it, I'd love to compare tokenizer fertility gains across languages when we both have results. Keep me posted!
smflx@reddit
You're right. Thanks for the thoughtful advice. I actually started this as a specialized model for story writing and role playing, so the style bias could be OK.
But after finding your project, a general model is also of interest. Yes, I will try to get more compute; otherwise, the bottom line is to optimize and minimize everything to fit my compute. I hope to follow your trace well!
angeletti89@reddit (OP)
Update: Phase 2 mid-training sample (step 15750/~28600)
Tested an intermediate checkpoint. Prompt: "Il futuro della tecnologia e della scienza" ("The future of technology and science"): 503 tokens, temp 0.7, top_p 0.9, repetition penalty 1.15.
Full 503 tokens, no repetition loops, coherent structure throughout. 131 tok/s inference on a single GPU.
The good: Grammar, syntax, article usage, complex subordinate clauses, all solid. It's writing structured Italian with technical vocabulary at 2B params and only 55% through Phase 2.
The expected: It hallucinates everything (the "Neural Learning Robot", Prof. James Martin, the IEEE conference). This is normal for a base model with no instruction tuning, factual grounding comes with SFT in Phase 3.
For non-Italian speakers: the output reads like a well-written Italian science article. Native fluency, not "translated English."
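For anyone curious what those sampling settings actually do, here's a plain-Python sketch of temperature + repetition penalty + nucleus (top-p) sampling. This is a generic illustration of the standard algorithm, not my inference code.

```python
import math, random

def sample_next(logits, prev_ids, temp=0.7, top_p=0.9, rep_penalty=1.15):
    """Pick a token id from raw logits: apply repetition penalty to already
    seen tokens, temperature-scale, softmax, then nucleus (top-p) filter."""
    logits = list(logits)
    for tid in set(prev_ids):  # repetition penalty: push seen tokens down
        logits[tid] = (logits[tid] / rep_penalty if logits[tid] > 0
                       else logits[tid] * rep_penalty)
    logits = [l / temp for l in logits]          # temperature scaling
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]    # numerically stable softmax
    z = sum(probs)
    probs = [p / z for p in probs]
    # keep the smallest set of tokens whose cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return random.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

The repetition penalty of 1.15 is what keeps a 2B base model from looping; without it, small models tend to degenerate into repeated phrases well before 503 tokens.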
beneath_steel_sky@reddit
Pretty cool - I'm looking forward to trying the model when it's ready!
angeletti89@reddit (OP)
Thanks! Will post an update here when the weights are on HuggingFace.
silentus8378@reddit
How much did you spend so far?
angeletti89@reddit (OP)
Roughly $5-6k in GPU rental for the actual training runs (Phase 1 + Phase 2 so far). But honestly the real cost was the iteration time before that: debugging the pipeline, fixing tokenizer edge cases, smoke tests that fail at step 200, all while the GPUs are running and billing. If I count everything, probably closer to $8-10k total. Not cheap, but doable for a solo project. The key is getting everything bulletproof before you start the real run.
MadLabMan@reddit
This is very interesting! Considering the way you’ve trained the model, could this serve as a good translation/study tool for learning Italian?
angeletti89@reddit (OP)
Interesting idea! Right now Dante-2B is a base model: it generates text, but doesn't follow instructions yet. So you can't say "translate this to Italian" and get a clean result.
After the SFT phase (instruction tuning, Phase 3), it could potentially work for that use case. The native Italian tokenizer gives it a real advantage since it actually understands Italian morphology rather than treating it as mangled English. Things like contractions (l'intelligenza, dell'algoritmo) and accented forms are handled natively.
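To show what "handled natively" means at the pre-tokenization level, here's a minimal regex illustration. This is not Dante's actual pattern, just the idea: keep elided articles like l' and dell' attached to their apostrophe instead of splitting the apostrophe off as a lone token.

```python
import re

# Illustrative pre-tokenization pattern (not the real one): a word followed
# by an apostrophe stays together, otherwise split into words / punctuation.
PRETOK = re.compile(r"\w+'|\w+|[^\w\s]")

tokens = PRETOK.findall("l'intelligenza dell'algoritmo")
# -> ["l'", "intelligenza", "dell'", "algoritmo"]
```

An English-centric pre-tokenizer splits the same phrase into l / ' / intelligenza, and the BPE merges learned on top of that never recover the contraction as a unit.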
That said, at 2B params it won't compete with larger models on complex translation. Where it could shine is as a lightweight tool for simple translations, vocabulary in context, or generating example sentences. Basically the kind of thing you'd want running locally and fast, not waiting on an API.
I'll keep this use case in mind when designing the SFT dataset. Thanks for the suggestion!
MadLabMan@reddit
Ah, got it! As an Italian-American who grew up speaking dialect, I'm really interested in finding a model I can run myself that can effectively help me learn formal Italian.
So I speak Italian (for the most part) but not 100%, so I'm trying to improve!
Thanks for sharing this with us! :)
angeletti89@reddit (OP)
Your Italian is already excellent!
The dialect-to-formal gap is actually a fascinating use case I hadn't considered. Indeed the training corpus includes a lot of formal Italian (Gazzetta Ufficiale, EuroParl, Wikipedia) so the model has a strong bias toward standard Italian. Could be genuinely useful for someone in your position.
Good luck with your Italian. I'll update you when the model is ready!
MadLabMan@reddit
Thanks a million! I'll be waiting for your update! 🇮🇹
Dany0@reddit
AI slop phrase in the title makes me think a clanker built an LLM from scratch and you're just here to what, pretend? Can't even write your own titles
angeletti89@reddit (OP)
Fair enough, English isn't my first language and the model's Italian is already better than my Reddit titles. Code's on GitHub when it drops, judge that instead.
ForTheDankMemes@reddit
Cool stuff. I might bug you a lot in the future. Out of curiosity, what pre-processing (if any) did you do, what are the quality filters, and how do you schedule the data?
angeletti89@reddit (OP)
Good question! Three layers to this:
Pre-processing: Deliberately minimal. I rely heavily on upstream quality — FineWeb-2 IT is already globally MinHash-deduplicated, FineWeb-Edu is pre-filtered for educational content. On my side, I apply a min character threshold (100 chars for text, 20 for code) to drop stubs and junk, and EOS tokens separate documents in the binary stream. No custom heuristic filters beyond that — I'd rather trust HuggingFace's dedup pipelines than reinvent them at my scale.
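That filtering step is simple enough to show inline. Illustrative sketch only: the EOS string and function names are made up, and the real pipeline joins documents as token ids, not strings.

```python
MIN_CHARS = {"text": 100, "code": 20}
EOS = "<|eos|>"  # illustrative; the real pipeline writes an EOS token id

def keep(doc: str, kind: str = "text") -> bool:
    """Drop stubs and junk below the per-type minimum character threshold."""
    return len(doc.strip()) >= MIN_CHARS[kind]

def pack(docs, kind="text"):
    """Concatenate surviving documents, separated by the EOS marker."""
    return EOS.join(d for d in docs if keep(d, kind))
```

That's genuinely the whole custom filtering layer; everything heavier (MinHash dedup, educational scoring) is inherited from the upstream datasets.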
Quality tiers: At tokenization time, every binary shard gets a tier prefix marking its quality level, from tier 1 (educational/curated) down to tier 3 (raw bulk).
Data scheduling: A custom TieredMemmapDataset samples with weighted probabilities: tier 1 gets 3× the sampling rate of tier 3, tier 2 gets 2×. So the model sees educational and curated content much more often than raw bulk data, but still gets exposure to the full corpus for vocabulary coverage. Within each tier, sampling is uniform random across all shards.
The whole tokenization + tier assignment pipeline will be in the repo when I release. It's a single script that runs on CPU only.
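The weighted tier sampling boils down to a few lines. Sketch only (shard names are illustrative, and the real TieredMemmapDataset obviously does memmapped reads, not string returns):

```python
import random

# Tier -> relative sampling weight, as described above: tier 1 is sampled
# 3x as often as tier 3, tier 2 2x as often.
TIER_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}

def sample_shard(shards_by_tier, rng=random):
    """Pick a tier by weight, then a shard uniformly within that tier."""
    tiers = [t for t in shards_by_tier if shards_by_tier[t]]
    weights = [TIER_WEIGHTS[t] for t in tiers]
    tier = rng.choices(tiers, weights=weights, k=1)[0]
    return rng.choice(shards_by_tier[tier])
```

Two-stage sampling (tier first, then shard) keeps the tier mix stable even when tiers hold very different shard counts, which a flat weighted list over all shards would not.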
FusionCow@reddit
that's pretty cool man
angeletti89@reddit (OP)
Thanks! Excited to share the weights once Phase 2 wraps up. Stay tuned 🇮🇹