Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
Posted by angeletti89@reddit | LocalLLaMA | 46 comments
The problem
If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome.
I decided to fix this from the ground up.
What is Dante-2B
A 2.1B parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs.
Architecture:
- LLaMA-style with GQA (20 query heads, 4 KV heads — 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight-tied embeddings, no MoE — all 2.1B params active per token
- Custom 64K BPE tokenizer built specifically for Italian + English + code
Why the tokenizer matters
This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza — 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead.
Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units — they're always single tokens, not two bytes glued together by luck.
Small detail, massive impact on efficiency and quality for Italian text.
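To make the contraction handling concrete, here's a toy reproduction of the difference. This is a deliberately simplified pattern (lowercase only), not Dante's actual regex, which is more involved:

```python
import re

# Simplified illustration only: an English-style pre-tokenizer splits on
# the apostrophe, while an Italian-aware pattern keeps elided articles
# attached to the following word. Lowercase-only classes keep it short.
english_style = re.compile(r"[a-zàèéìòù]+|'|[^\sa-zàèéìòù]+")
italian_aware = re.compile(r"[a-zàèéìòù]+'[a-zàèéìòù]+|[a-zàèéìòù]+|[^\sa-zàèéìòù]+")

text = "l'intelligenza"
print(english_style.findall(text))  # ['l', "'", 'intelligenza'] -> 3 pre-tokens
print(italian_aware.findall(text))  # ["l'intelligenza"] -> 1 pre-token
```

The BPE merges then learn on top of whatever pre-tokens the regex produces, which is why this one decision propagates through the whole vocabulary.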
Training setup
Data: ~300B token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code. Everything pre-tokenized into uint16 binary with quality tiers.
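Side note on the binary format: a 64K vocab is exactly what makes uint16 work, since every token id fits in two bytes. A minimal sketch of a shard (the real layout also carries quality-tier metadata, which I'm glossing over here):

```python
from array import array
import os, tempfile

# Token ids from a 64K-vocab tokenizer all fit in unsigned 16 bits
# ('H' = uint16); 65_535 is the largest representable id.
tokens = [17, 42, 65_535]

path = os.path.join(tempfile.gettempdir(), "shard.bin")
with open(path, "wb") as f:
    array("H", tokens).tofile(f)   # 2 bytes per token on disk

with open(path, "rb") as f:
    back = array("H")
    back.frombytes(f.read())

assert list(back) == tokens
```

In a real pipeline you'd memory-map the shard rather than read it whole, so the dataloader touches only the sequences it needs.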
Phase 1 (just completed): 90B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid — no NaN events, no OOM, consistent 28% MFU.
Phase 2 (in progress): Extending to 4096 context with 30B more tokens at reduced LR. Should take ~4-7 more days.
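For reference, the warmup + cosine schedule from Phase 1 as a standalone function (the total step count here is illustrative; the real value falls out of tokens and batch size):

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total_steps=90_000):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

assert abs(lr_at(2000) - 3e-4) < 1e-12        # warmup ends at peak LR
assert abs(lr_at(90_000) - 3e-5) < 1e-12      # decays to the 10% floor
```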
What it can do right now
After Phase 1 the model already generates coherent Italian text — proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale.
I'll share samples after Phase 2, when the model has full 4K context.
What's next
- Phase 2 completion (est. ~1 week)
- HuggingFace release of the base model — weights, tokenizer, config, full model card
- SFT phase for instruction following (Phase 3)
- Community benchmarks — I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes
Why I'm posting now
I want to know what you'd actually find useful. A few questions for the community:
- Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
- What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
- Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model — should I release it separately?
About me
I'm a researcher and entrepreneur based in Rome. PhD in Computer Engineering, I teach AI and emerging tech at university, and I run an innovation company that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch — you need good data, a clean architecture, and patience.
Everything will be open-sourced. The whole pipeline — from corpus download to tokenizer training to pretraining scripts — will be on GitHub.
Happy to answer any questions. 🇮🇹
Party-Special-5177@reddit
Although none of this matches your questions, I physically could not stop myself from commenting so I apologize.
Back of the napkin looks like 32.5k tps per card, which isn't excellent for that card vs your model size and precision (I suspect you should be able to just slightly exceed those speeds at bf16). It's more front-end work, but it might be worth (having AI help you) rolling your own PyTorch loop so you can actually dive in and make some hand optimizations. I've never used DeepSpeed before and don't actually know how much control you have there.
What are you getting at here? There is an argument to be made for fertility, but splitting on contractions is desired behavior for small(ish) models. I have some work I’m releasing in about a month where I’ll be making the argument that that first attention layer actually isn’t part of the model at all, it’s part of the embeddings. It does a ton of things (including healing bad token splits), but in your case it is a form of lemmatization, where the split actually helps the model generalize that pattern faster.
As a result, the model gets to learn the “l’” and “intelli…” primitives separately, rather than having to rediscover that “l’intelli…” has a similar meaning to “intelli…”. You do lose some context window, but you gain back token efficiency during training (this one is huge) and generalizability during test time.
Run it both ways if you can - I’d bet coin it is just your training mix doing the heavy lifting.
Whyyyy? My guy, you can extend your context 4 to 8x with YaRN or LongRoPE and fully heal for around 1.2 tokens per parameter.
Please double-check your plan, it really looks like you are burning a lot of money for no reason.
Llama style you said, so you have both rmsnorm and qk norm - it isn’t mathematically possible for your model to NaN. It’s crazy how much of a guardrail those 2 norms are, you can run it balls out (in terms of LR) and it still won’t explode, it just won’t converge. I’ve been experimenting with relaxing the norms actually as I suspect they are over constraining.
I’m dying, send help. Please optimize your code I’m begging you, this physically pains me to read
angeletti89@reddit (OP)
No need to apologize, this is exactly the kind of pushback I was hoping for. Let me go through it:
MFU at 28%: I feel the pain too, trust me. For context: that 28% is on 2 GPUs (not 8) with ZeRO-2 communication overhead, torch.compile with reduce-overhead mode, and FP8 via torchao on all linear layers except lm_head. The key tradeoff: I'm running with zero activation checkpointing. That means no recomputation during backward -> pure speed per step, at the cost of higher VRAM. That's the reason for the H200s (more on that below). With activation checkpointing on, VRAM drops but you're recomputing activations and step time goes up. I chose wall-clock speed over MFU percentage. That said, 28% is where I landed after JIT compilation, FP8 conversion, fused attention via SDPA/Flash Attention, and quite a bit of profiling. I'm sure there's more juice to squeeze, genuinely curious what MFU you're hitting on similar setups.
Tokenizer and contraction splitting: This is a really interesting argument and I'd love to read your work when it's out. You're right that splitting lets the model learn primitives independently, e.g. the "l'" prefix becomes a reusable component. My counter-argument is efficiency at inference time: Italian text is ~40% apostrophe contractions by frequency. At 2B params with a 4K context window, those extra tokens add up fast and you're effectively seeing less content per forward pass. But I'll be honest: I optimized for fertility and haven't run a controlled A/B on downstream quality. If your work shows the split helps generalization at small scale, I'd seriously consider it for v2. Fair challenge.
Phase 2 context extension vs YaRN/LongRope: Genuine question. At 2048→4096 (only 2× extension), is YaRN actually better than continued pretraining? My understanding is that YaRN and LongRope shine at large extension ratios (4-8×) where continued pretraining would be prohibitively expensive. At 2× I figured the model benefits from actually seeing longer documents during training rather than just interpolating positions it's never attended over. But I'm open to being wrong here, if you have references showing YaRN fully heals at 2× with ~2.5B tokens of warmup, I'd genuinely like to see them.
NaN and norms: Fair point. RMSNorm is a strong guardrail. Though I should clarify: no QK norm in this architecture, just pre-norm RMSNorm on attention and FFN inputs plus a final norm. The "no NaN" note in the post was more for people who've had unstable runs, not claiming it was a heroic achievement.
H200 vs H100: I'm on 2×H200, not 8×. The cost comparison shifts significantly at that scale. The extra VRAM is what lets me run no activation checkpointing with micro_batch=24 at seq_len=2048 (and micro_batch=12 at 4096 in Phase 2). On 80GB H100s I'd have to either checkpoint activations (slower steps) or cut batch size (more gradient accumulation steps, worse throughput). The VRAM headroom is buying me training speed, not sitting idle. Also tried Liger fused cross-entropy early on, it's incompatible with torch.compile (internal .item() call breaks the graph → CUDA illegal memory access). Ended up with a non-chunked F.cross_entropy which is simple but works cleanly with JIT.
On the cost overall: Is this the most cost-efficient way to train a 2B? Probably not. But no course, no tutorial, and no paper taught me as much about LLM training as actually doing it end-to-end. From corpus assembly to tokenizer design to debugging NCCL hangs at 2am. The "waste" is the education, and I'm sharing everything so others can skip the expensive mistakes.
Seriously though: when your tokenizer work drops, link me. The lemmatization argument is compelling and I want to test it man!
Party-Special-5177@reddit
Re MFU: GPU count doesn't affect MFU, it's more of a dimensionless ratio of compute used / compute available. Sorry if you already knew that. This ties into H100 vs H200, as training always kinda comes back to usable FLOPs per dollar-hour. If you are paying $4.50 an hour for 2x H200, and $9.00 gets you 8x H100, you can gain a 4x speedup for 2x cost just by getting under that 80GB threshold. This rough math will hold for tiny-model, compute-bound loads like yours. In my experience, the tps penalty to move a setup from 8xH200 to 8xH100 (holding all other factors equal) is about 6%, which doesn't even begin to eat into your cost and time savings.
Re grad checkpointing: to be clear, you really don’t need it to fit a 2B model in h100s, trust me, but again, if you can’t find any other method to fit on H100s, you should turn it on anyway and eat the 20-30% penalty, as you’ll still come out ahead.
You really should be able to get there just with gradient accumulation - your ‘worse throughput’ comment reeks of you using AIs to advise your training strategy. Ignore this comment if I’m wrong, but if I’m right: bots generally have the right logic but have no intuition of magnitude, as here. It is right that grad accum causes bus traffic and, e.g. a 4x accumulation loop adds 4 PyTorch loops for every otherwise-on-VRAM step.
….however, the penalty turns out to be negligible, especially considering the other performance bits you’re leaving on the table. Generally, if a bot tells you to do a thing with hyperbole (e.g. ‘…and if you use grad accum, then PyTorch loops will shred your GPU and your H200s will be STARVED while they sit there doing nothing waiting on the GIL…’ - I know what this looks like as I’ve seen all the big bots do it XD), you say ‘makes sense, thanks’ and then go try it anyway.
In this case, I think you will find you won't notice any loss in TPS, you can keep DDP and no grad checkpointing, and finish your run in 5 days or less rather than 16. I have never noticed a change in TPS from accum on/off that stood out from the environmental cluster-to-cluster background noise.
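For anyone following along, a bare-bones accumulation loop looks like this (generic PyTorch sketch, not OP's DeepSpeed setup):

```python
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum = 4                                       # micro-batches per optimizer step

for step, batch in enumerate(torch.randn(8, 4, 16)):   # 8 micro-batches of 4
    loss = model(batch).pow(2).mean() / accum   # scale so grads average
    loss.backward()                             # grads accumulate in .grad
    if (step + 1) % accum == 0:
        opt.step()          # one weight update per `accum` micro-batches
        opt.zero_grad()
```

The extra Python iterations are trivially cheap next to the forward/backward kernels at 2B scale, which is the whole point above.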
Context extension: in all cases yarn/longrope is better, or you chose your RoPE frequencies poorly lol. To clarify why, if you have space in your lowest frequencies that you aren’t using, then you are asking your model to discern long range dependencies with less range on the signal than your attention could be using. Without getting too in the weeds, strong signals train the model faster, and help robustness on downstream tasks (especially in your case considering everything you are doing is low rank, your heads are low rank, you are training in 8 bit precision, etc).
You should do context extension after pretraining completes. It doesn’t take much for your attention to learn the new frequencies, again like 1.2 tokens per parameter. Your model will see longer documents, during your SFT phase. Cites are mostly the main papers themselves, plus it is industry standard now:
angeletti89@reddit (OP)
Quick update since you put so much effort into your feedback.
SFT is now running. 540K conversations (40% Italian, 11 datasets), 1 epoch on 2×H200 NVL. Current numbers:
So ~89k tok/s sustained, 33.5% MFU with the corrected H200 NVL peak (1,671 TFLOPS, not 1,979), ETA ~2 hours to completion. Config: micro_batch=12, reduce-overhead, no activation checkpointing, FP8 ON. You'll notice the MFU is exactly what I recalculated after your comment. Your pushback literally fixed my reporting.
On Liger FLCE: I have Liger 0.7.0 installed now and the torch.compile compatibility docs look promising. However during SFT testing I found that Liger FLCE + FP8 + CUDA graphs diverges to NaN. Specifically the combination of torchao float8 training and Liger's fused kernel doesn't play nice when reduce-overhead mode enables CUDA graphs. I had to fall back to standard F.cross_entropy. I suspect it's a precision issue in the fused backward pass when the linear weights are already in float8 format. If you have experience mixing fused CE kernels with FP8 training I'd love to hear how you handle it.
On FA3: You and u/FullOf_Bad_Ideas both recommended it, so I tried. Short version: it was a nightmare on my setup. The flash-attn package (2.8.4) detects Hopper and reports FA3 as available, but actually building and running FA3 training kernels requires a specific build from source with CUDA 12.8 and the right sm_90a flags. The pre-built wheel ships FA2 kernels that route through SDPA automatically, which is what I've been using. Building FA3 from source failed repeatedly with compiler errors on my setup, and after burning a few hours of GPU time debugging the build I gave up and stuck with FA2 via SDPA. It works, it's stable, and the MFU delta wasn't worth the build hell. Maybe I'll revisit when the FA3 pip install just works out of the box.
On YaRN: Haven't implemented it yet but I've read the paper and the Cerebras blog. Plan is to use it for future context extensions (4096 → 16K+) after SFT, since modifying only the RoPE frequencies and fine-tuning for ~2.5B tokens is obviously better than burning 30B tokens on continued pretraining. You saved me real money with that callout.
This thread keeps paying dividends. Will post again when the SFT model is ready for eval.
FullOf_Bad_Ideas@reddit
you have a really cool and substantive convo going on here, it's a pleasure to read it all!
I don't see where the other commenter mentioned FA3 in this thread, but nonetheless, here's how I install FA3 on H100s (jammy); maybe it'll help you decide to revisit it
FA3 install procedure is weird witchcraft; it just worked for me after I did the above, and I definitely did not have FA2 installed at the time, only FA3, as confirmed by looking at an old `pip list` generated by Megatron-LM.
I can post a gist with the rest of the setup too (apex, megatron-lm) specific to my model if that would be any help.
I think FP8 makes it much harder to get 40%+ MFU, it does not look bad at all to me.
If you want to reduce your training costs 10x in the future and you pay out of your own pocket, you can try renting consumer GPUs like 8x 3090, 8x 4090 or 8x 5090s. My local rig is barely cheaper to run electricity-wise than renting is, and I calculated that I get a TFLOP about 10x cheaper than when I was renting an 8x H100 SXM node for 23 euro/hr (I was getting piss-poor MFU on H100s tho). 3090 Tis don't have FP8 but 4090 and 5090 do, and it's actually somehow competitive in my experience despite slow interconnects that you'd expect would kill the training speed. I had about 40% MFU on 3090 Tis and I paid about 220 euro in electricity to train my model on 28B tokens. You'd need to be more patient with it and maybe do more checkpointing to HF, but if time is not of the essence and the project isn't a big proprietary secret, it saves money. I doubt that either H100 or H200 comes close to that price, community-hosted GPUs on Vast are really affordable and usually stable for week-long jobs.
I'm curious, how are you handling sample packing and cross-sample contamination during your SFT training now?
angeletti89@reddit (OP)
FA3: Thanks for the exact install steps, saving this for next time. I did try building from the hopper/ subdirectory but kept hitting compiler issues with my CUDA 12.8 + sm_90a setup. Probably a version mismatch somewhere that I didn't have the patience to hunt down mid-project. For the next model I'll start fresh with FA3 from day one instead of trying to retrofit it into an existing stack. Your note about not having FA2 installed at the same time is useful, I had both coexisting which might have been part of the problem.
FullOf_Bad_Ideas@reddit
yeah, a lot of time spent on training is just debugging those kinds of package issues here or there.
here's a gist of how I was setting up my H100 instances for training. It's not cleaned up outside of removing api keys, but you could easily do it if you want, maybe it will be helpful to you in a future - https://gist.github.com/adamo1139/2065ada54233dcce0cb88cbd2d68191b
angeletti89@reddit (OP)
It seems that I tend to learn more from mistakes than anything else. I am evaluating the model right now, trying to figure out if it is good or bad. Next model will train in a better optimized env, saved your gist!
angeletti89@reddit (OP)
Totally agree. For better and for worse, it seems I tend to learn more from mistakes than anything else.
I am trying to figure out why the model is performing so badly right now. I think I will have to build a new one from scratch. Look at that
angeletti89@reddit (OP)
This is a really good question. Right now I'm doing offline pre-packing, not packed attention. I'll explain below, but honestly I was so excited to see the model talking that for the SFT I took a shortcut; I'll fix it in a later SFT round. So, back to the topic ->
The SFT pipeline normalizes everything to a ChatML-style format, tokenizes each conversation with explicit boundaries (`<|begin_of_text|>`, role headers, `<|eot|>`, `<|end_of_text|>`), masks user turns out of the loss, then greedily concatenates multiple conversations into fixed 4096-token blocks before saving them to disk. The trainer just loads those pre-packed `input_ids`/`labels` tensors directly.
So in the strict sense: yes, cross-sample contamination exists right now. I'm not building a block-diagonal attention mask or resetting attention per packed-sample boundary. In the training code the model is fed the packed sequence as-is, and the layer forward path is called with `mask=None`, so attention can flow across conversation boundaries inside the packed block.
What I am doing to soften that:
So the current setup is basically: packing for throughput, boundaries for damage control, but not true contamination-free packed training.
It was a conscious speed tradeoff for this run because I wanted the zero-overhead pre-packed path and the throughput bump. For the next revision, the “proper” fix would be either segment-aware/block-diagonal masks or a packed-sequence attention kernel that respects sample boundaries.
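For the record, the segment-aware mask is only a few lines if you go the SDPA boolean-mask route. A sketch (untested at scale; a fused varlen attention kernel is the faster production path):

```python
import torch

def packed_attn_mask(seq_ids: torch.Tensor) -> torch.Tensor:
    # seq_ids: (T,) sample id per position after greedy packing,
    # e.g. [0, 0, 0, 1, 1, 2, ...]. Returns a (T, T) bool mask where
    # True = "may attend": causal AND within the same packed sample.
    T = seq_ids.shape[0]
    same_sample = seq_ids.unsqueeze(0) == seq_ids.unsqueeze(1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return same_sample & causal

mask = packed_attn_mask(torch.tensor([0, 0, 1, 1, 1]))
assert not mask[2, 0]   # start of sample 1 can't see sample 0
assert mask[4, 2]       # ...but can see earlier positions of sample 1
```

A boolean mask like this drops straight into `F.scaled_dot_product_attention`'s `attn_mask` argument (True = participate), at the cost of materializing the (T, T) mask.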
If you’ve got a packed-attention implementation you like that plays nicely with SDPA / FA2 / FA3 in a simple PyTorch setup, I’m all ears. In the meantime, as soon as the SFT finishes (10 minutes left), I'll try your setup to install FA3, crossing all the fingers I have
FullOf_Bad_Ideas@reddit
Nope, I am just using off the shelf training frameworks that handle all of that for me. LLaMa-Factory for SFT and whatever packing they have on by default.
I'd be curious to know how big of an SFT dataset you've gathered and how you did it.
I tend to not mask user turns, mainly out of an assumption that in a data-scarce environment, all data is good. Plus it's a safer route to average performance when you don't quite know whether prompts or completions will be disproportionately longer. The ideal approach would be to give less weight to the user turn rather than mask it completely. Here's a good read about masking user turns - https://towardsdatascience.com/to-mask-or-not-to-mask-the-effect-of-prompt-tokens-on-instruction-tuning-016f85fd67f4/
Party-Special-5177@reddit
I really appreciate the circle around - I’m rooting for you, seriously. Happy to hear your successes.
89k tps: is that per card? Also, what are you using as your optimizer? Also are your activations also in fp8 (w8a8)?
If you post your codebase, I’d love to take a look. I train in bf16, and am of the general opinion that fp8 is usually not worth the trouble, excepting when used inference only on a teacher model (and even then I do w8a16). Caveat emptor I usually run a strange setup with a lot of custom modules and can’t be bothered to make my stuff play nice with fp8.
Can’t wait to see your output!
angeletti89@reddit (OP)
Thanks man, really appreciate it! To your questions:
89k tok/s: That's total across both cards, so ~44.5k per GPU. Not great per-card, but with ZeRO-2 all-reduce overhead on only 2 GPUs the communication-to-compute ratio isn't ideal. Should scale better with more GPUs.
Optimizer: AdamW via DeepSpeed, β1=0.9, β2=0.95, weight decay 0.1, eps 1e-8. Nothing fancy. Cosine schedule with 10% floor.
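For completeness, the same knobs in DeepSpeed's JSON-config shape (field names follow DeepSpeed's standard optimizer section; the lr shown is the pretraining peak from earlier in the thread, so treat this as a sketch, not my exact config file):

```python
# Sketch of the optimizer section of a DeepSpeed config using the
# hyperparameters quoted above.
ds_config = {
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-4,             # pretraining peak; cosine floor is 10% of this
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 0.1,
        },
    }
}
```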
FP8: Not full w8a8. The torchao `convert_to_float8_training` converts eligible linear layers (everything except lm_head) to do their forward pass in fp8, but activations flowing between layers stay in bf16. So it's more like w8a16 for the matmuls, with a bf16 residual stream. Honestly, after all the headaches (Liger incompatibility, CUDA graph issues, debugging time), I'm starting to agree with your take that fp8 for training is often not worth the trouble. The throughput gain exists but it's maybe 10-15%, and the debugging surface area it adds is significant. For v2 I might just go pure bf16 and spend the optimization budget on things that don't break other parts of the stack.
Codebase: Will be fully open. The entire pipeline, from corpus download to tokenizer training to pretraining to SFT, every script numbered 1-9. Should be up on GitHub pretty soon; I just have to clean up the repo. I'll tag you when it drops.
I'm also writing a Medium article series documenting the entire engineering journey from corpus assembly to these benchmarks. Will share links when the first articles drop.
Weights will be on HuggingFace alongside the repo. Stay tuned ;)
angeletti89@reddit (OP)
As always, I'll try to go point by point.
MFU and GPU count: You're right, MFU is dimensionless. I mentioned the 2 GPUs not as an excuse for the percentage but to clarify the communication overhead context (ZeRO-2 all-reduce on 2 GPUs vs 8). Fair correction though, the number is the number. Actually, while writing this I went back and checked my MFU calculation and found a mistake. MFU is simply the ratio of achieved throughput to the theoretical hardware peak:
MFU = achieved FLOPs/sec ÷ peak hardware FLOPs/sec
I was using 1,979 TFLOPS as peak, which is the H200 SXM variant (700W). But my setup is 2×H200 NVL (the PCIe variant with NVLink-C2C, rated 600W, actually running at ~580W). The correct BF16 peak for H200 NVL is 1,671 TFLOPS. So the real MFU is closer to ~33%, not 28%. Still not great, but less painful than I reported. Thanks for making me double-check this.
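Plugging in round numbers with the usual 6·N·D approximation for training FLOPs per token (back-of-envelope, not a profiler readout; throughput taken from the later SFT run):

```python
n_params = 2.1e9            # model size
tok_per_s = 89e3            # total throughput across both GPUs
achieved_tflops = 6 * n_params * tok_per_s / 1e12   # ~1121 TFLOP/s
peak_tflops = 2 * 1671      # 2x H200 NVL, BF16 dense peak each
mfu = achieved_tflops / peak_tflops
print(f"MFU ~ {mfu:.1%}")   # lands around the corrected ~33% figure
```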
After Dante-2B I'll probably try a MoE with 7B/9B total and 2-3B active. In that case I'll need to scale well beyond 2 GPUs and a bigger cluster is unavoidable, hoping to find some funding for that one.
H100 vs H200 cost math: This is a compelling argument and I'll be honest, the H200 choice wasn't a carefully optimized decision. I got a good deal on a 2×H200 NVL node that brought the price close to what 2×H100 would have cost elsewhere, so I took the extra VRAM and ran with it. You're absolutely right that at full price an 8×H100 node would have been the smarter play. 4× GPU count at \~2× cost with only 6% tps penalty is hard to argue against. For Phase 1 this ship has sailed, but for a bigger model I'd have no choice anyway. Good callout.
Gradient accumulation: Fair enough, I'll own this one. I assumed the throughput penalty would be larger than it actually is without properly benchmarking it. You're right that the overhead from accumulation steps is negligible compared to the actual forward/backward compute at this model size. Lesson learned: measure, don't assume. I'll profile accum on vs off properly next time.
Context extension / YaRN: OK this is the most valuable part of your comment IMO. The Cerebras result (Llama 8B to 1M context on 10B tokens) is wild and I hadn't seen it. You've convinced me, for a 2× extension, continued pretraining with 30B tokens is almost certainly overkill when YaRN can heal it with ~2.5B tokens. I'm mid-Phase 2 so I'll finish this run, but the YaRN approach is clearly the right call for future context extensions. The RoPE frequency argument makes physical sense too. If you're not using the full frequency range, you're leaving signal on the table.
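To make the frequency argument concrete, here's plain position interpolation in a few lines (the baseline that YaRN refines by rescaling only the low-frequency bands instead of all of them):

```python
def rope_angles(pos, d_head=128, base=10000.0, scale=1.0):
    # Rotation angle per RoPE frequency band at a given position.
    # Dividing positions by `scale` is plain position interpolation:
    # extended positions reuse angles the model already trained on.
    return [(pos / scale) * base ** (-2 * i / d_head) for i in range(d_head // 2)]

# A 2x extension: with scale=2, position 4094 sees exactly the angles the
# model learned at position 2047 during 2048-context pretraining.
assert rope_angles(4094, scale=2.0) == rope_angles(2047)
```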
Liger CE: The issue is specifically `torch.compile` with `mode='reduce-overhead'`. Liger's `LigerFusedLinearCrossEntropyLoss` calls `.item()` internally, which forces a graph break, and in reduce-overhead mode that triggers a CUDA illegal memory access. It works fine in eager mode or with `torch.compile(mode='default')`. Could be a version-specific thing; I was on torchao 0.5.x at the time. If you ever get time to poke at it I'd genuinely appreciate it, because that 24GB saving would be significant.
Real talk: this whole thread has been one of the most useful learning experiences of the project. The YaRN point alone will save me thousands of dollars on future runs. This is exactly why I posted before finishing. Getting this kind of feedback mid-project is worth way more than posting a polished model card after the fact.
angeletti89@reddit (OP)
Update #2: SFT complete, first ITA-Bench results are in.
SFT training finished. 540K conversations (40% Italian, 11 datasets), 1 epoch, ~3.5 hours on 2×H200 NVL. Now let's talk numbers. Ran ITA-Bench (Sapienza NLP's Italian evaluation suite) against comparable instruct models in the 1B-3B range. Here are the results:
What the benchmarks show:
Dante-2B-Instruct is competitive with established instruct models in its weight class, including Granite-3.2-2B, Granite-3.3-2B, OLMo-2-1B, and Qwen2.5-1.5B. On Math (GSM8K) it holds its own at ~0.50, on par with models that had far more resources behind them. WiC and Other categories are tightly packed across all models.
SmoLLM3-3B leads in most categories, which is expected since it's 50% larger. The more interesting comparison is against models at or below 2B, where Dante is consistently in the mix despite being trained from scratch on 2 GPUs with a fraction of the budget.
Where it's weak: MMLU (~0.27) is the floor of the group. Not surprising for a 2B trained on ~120B tokens total. World knowledge scales with both parameters and data, and we're at the low end of both. This should improve with more training data or a larger v2.
The real story isn't that Dante-2B beats anything. It's that a from-scratch model with a native Italian tokenizer, trained by one person on 2 GPUs, lands in the same ballpark as models built by IBM (Granite), Allen AI (OLMo), Alibaba (Qwen), and HuggingFace (SmolLM). The tokenizer and data quality are doing a lot of heavy lifting here.
mrtrly@reddit
The tokenizer choice is everything here. Italian morphology is dense enough that a generic bpe vocab wastes space fast, and you're fighting it through the whole training run. Did you build a custom tokenizer for the bilingual split or stick with something existing and retrain it? That decision alone probably saved or cost you days.
angeletti89@reddit (OP)
Custom from scratch, no existing tokenizer reused. Trained a 64K BPE on a character-balanced mix (~42% Italian, 36% English, 22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact (l'intelligenza, dell'algoritmo stay as single pre-tokens instead of splitting on the apostrophe). Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units in the initial alphabet so they're always single tokens, not two bytes glued together by luck.
Fertility results:
For reference, English-first tokenizers like LLaMA's score 1.8-2.5 on Italian. So roughly 30% fewer tokens on Italian text with zero English regression.
You're right that the decision compounds through the entire training run. Every batch sees more actual content per sequence, the model learns Italian morphology from clean representations instead of fighting tokenizer artifacts, and at inference time you fit more Italian text in the context window. It's the highest-ROI piece of the whole pipeline and I'd recommend it to anyone building a non-English model.
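For anyone who wants to reproduce the fertility comparison on their own text, the metric itself is one line (any `encode` callable works, e.g. a HF tokenizer's `encode`; the whitespace word split is the usual simplification):

```python
def fertility(encode, text):
    # Fertility = tokens per whitespace-delimited word; lower is better.
    return len(encode(text)) / len(text.split())

# Toy check with a character-level "tokenizer" (list -> one token per char):
assert fertility(list, "ciao mondo") == 5.0
```

Run it over a held-out sample of each language to get per-language numbers comparable to the ones above.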
simmessa@reddit
Great work, thanks! Luckily there are a few Italians hanging around this community!
angeletti89@reddit (OP)
Thanks! We may be few, but we're good. When the model is ready I'll tag you in the release post, it needs feedback from native speakers!
FullOf_Bad_Ideas@reddit
Cool. I'm doing something similar for Polish. 4B MoE, I moved training to local machine recently but I started on 8x H100 node.
I took a pause there but once I'll get bigger SFT dataset I should be able to move it across the finish line. All intermediate data is open source already though, I called it Poziomka.
What made you choose this size and dense architecture? What pre-training framework are you using? Do you use FA2 or FA3? How are you sourcing your Instruct SFT dataset?
angeletti89@reddit (OP)
Nice! Polish is another language that gets the short end of the stick with English-first tokenizers. Will check out Poziomka, curious about your MoE setup at 4B. To your questions:
Size and dense: 2B dense was a deliberate constraint. I wanted to prove the "native bilingual tokenizer + clean data + from-scratch training" thesis at the smallest scale where it's still meaningful. Dense keeps the architecture and debugging simple so no routing headaches, no load balancing, no expert collapse to worry about. If the approach works at 2B dense, scaling up or adding MoE is a clear next step with known tradeoffs. How's your experience been with MoE at 4B?
Framework: DeepSpeed ZeRO-2 for sharding, torch.compile with reduce-overhead mode, and FP8 via torchao on the linear layers (excluding lm_head). Nothing exotic, I wanted a stack I could debug myself on 2 GPUs without fighting framework issues.
Attention: FA2 through PyTorch's SDPA backend that kicks in automatically with the right head dims (d_head=128, power of 2). Haven't tried FA3 yet, is it stable enough for training in your experience?
SFT dataset: Haven't started sourcing yet; the model is still finishing Phase 2. My current plan is a mix of translated instruction datasets (probably OpenHermes/Orca-style, filtered and translated) plus native Italian instruction data if I can find or build enough of it. This is honestly the part I'm least sure about. How are you approaching it for Polish?
FullOf_Bad_Ideas@reddit
I did run into router and expert collapse when I tried to use an 8-bit optimizer at the start, and I was a bit underwhelmed by how the scaling laws and the effective leverage over dense played out in practice. MFU on the H100s was sub-par, though the model got a lot better quickly when I changed the active expert count from 4/128 to 16/128 mid-training, and MFU on my local 3090 Tis is great. In theory dense becomes less efficient per unit of compute rather quickly, probably around 100B tokens for a 0.5B model, and MoE is definitely better once you patch issues like low MFU, but it's not smooth sailing.
I used FA3 with Megatron-LM; it was totally stable, and AFAIR I got significantly better MFU with it than with FA2.
I translated about 200M tokens of open instruct data using Seed-X-PPO-7B; I couldn't source a significant amount of real Polish instruct data. It's very low quality, but the model works anyway. It can't rhyme, because translated poems/lyrics don't preserve rhymes, and it often uses USD as its base currency; culturally, it feels like it lives in the USA but speaks Polish. It doesn't know Polish literature and its symbolism well, for example. I plan to generate more SFT data with Magpie and translate more instruct datasets to get to about 2B tokens, then SFT on that. After that, some preference optimization, since I always find it beneficial for performance. I want to train a reasoning version too, just to see it generate CoT in Polish.
angeletti89@reddit (OP)
The MoE journey sounds familiar, the kind of thing that looks clean in papers but gets messy in practice. Interesting that 16/128 active experts made a real difference over 4/128. Makes sense that the model needs enough routing diversity to actually specialize, but that's exactly the kind of thing you only learn by running into it.
Good to know FA3 is stable with Megatron-LM, I'll seriously consider it for a future run. If the MFU gain is significant it could offset a lot of the cost concerns that another commenter in this thread raised.
Your SFT experience is exactly what I'm afraid of for Italian. The "speaks Polish but lives in the USA" problem is spot on: translated instruct data inherits the cultural frame of the source language. Italian has the same issue: translated datasets will reference dollars, Thanksgiving, and the FDA instead of euros, Ferragosto, and AIFA. The rhyming problem is even worse for Italian poetry, since the whole metric structure breaks in translation.
Of the ideas I'm considering after reading your experience, the reasoning/CoT in Polish one is the most fascinating: CoT in a non-English language is genuinely underexplored territory. Same plan here eventually for Italian.
Let's stay in touch, we're basically running the same experiment in parallel. Would be great to compare notes on the SFT phase when we both get there.
FullOf_Bad_Ideas@reddit
I think routing diversity was fine with the 4/128 setup, though I actually didn't measure it (I probably should...). I think it was just not enough parameters and I was hitting some floor. With 4 active experts it was 4B-A0.3B, so very sparse.
Seed-X-PPO is actually one of the better models for translation. It's a dedicated translation model with performance similar to DeepSeek R1 and Gemini 2.5 Pro. You'd probably need frontier closed models like GPT 5.4 or Gemini 3/3.1 Pro to significantly beat it at scale, and that could get prohibitively expensive if you want at least 100M tokens, and probably a few times more. I want to do most of my training work locally, so Seed-X-PPO was the best option. I translated those 200M tokens locally, and the last 28B tokens of training were also done locally. It made for good heating as we were coming out of winter.
I think DataFlow (from Llama-Factory team) has a fleshed out workflow for doing this, I'll be looking into that for Polish too.
angeletti89@reddit (OP)
I didn't know about Seed-X-PPO, nor was I aware it performed at that level for translation. That could change my calculus significantly; I'll benchmark it against Qwen-72B on some Italian test pairs before committing to a pipeline. Running it locally is a huge plus over burning API credits on frontier models. Maybe I can use my DGX Spark for that, or, since it's only 7B, even the 5080.
DataFlow from the Llama-Factory team is a great shout; I hadn't looked into it yet, will check it out. If it can handle the "real native content → instruction pairs" pipeline cleanly, that could solve the cultural grounding problem for both of us.
And the local GPU heating benefit is genuinely underrated. My H200s are rented so I don't even get free heating out of this, seems you're ahead on ROI already!
smflx@reddit
Thanks a lot for sharing your valuable experience. I'm also going to build a small bilingual LLM, but Korean/English. Great to hear it took only 16 days; that's faster than I feared. I will learn a lot from your trace!
angeletti89@reddit (OP)
Thanks! Korean is a great candidate for the same approach, since the efficiency gap between English-first tokenizers and a native Korean one should be even bigger than for Italian, especially with the Hangul syllable blocks.
A few things that saved me time in case they help: rely on upstream dataset quality as much as possible (FineWeb-2 and similar are already well-deduplicated), balance your tokenizer training data by character count not document count, and don't underestimate how much a clean tokenizer improves downstream quality. It's the highest-ROI piece of the whole pipeline.
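The character-count balancing is easy to get wrong, so here's roughly what I mean in code. This is an illustrative sketch (function and corpus names are made up, not my actual script): compute each language's current character share, then derive oversampling weights toward the target mix.

```python
from collections import defaultdict

def char_balanced_weights(corpus, targets):
    """Per-language sampling weights so each language contributes to
    tokenizer training by *character* share, not document share.

    corpus:  list of (lang, text) pairs
    targets: desired character share per language, e.g. {"it": 0.42, "en": 0.36, "code": 0.22}
    """
    chars = defaultdict(int)
    for lang, text in corpus:
        chars[lang] += len(text)
    total = sum(chars.values())
    # weight = target share / current share; > 1 means oversample that language
    return {lang: targets[lang] / (chars[lang] / total) for lang in chars}
```

Balancing by documents instead would over-represent whichever language has shorter documents, which is exactly the failure mode for morphologically rich languages.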
Happy to share notes when the repo goes public. Good luck with the Korean model and ping me when you have something running!
smflx@reddit
Thank you for the detailed response. I appreciate it!
Yes, token inefficiency is big for Korean. Actually, there was an approach where a 10B model was retrained after a tokenizer change.
And using high-quality data is the same approach I'm after. My plan is 500B tokens. Good to hear FineWeb-2 is already deduplicated. I had thought of rewriting with an LLM to make the dataset denser; maybe that's not needed.
Thanks so much for your advice; the tokenizer is another way to improve quality and density. Best of luck with your Italian model.
angeletti89@reddit (OP)
500B tokens for a 10B-class model sounds like a solid plan, well over the Chinchilla ratio. On the LLM rewriting though, I'd be cautious. It can improve consistency but you risk homogenizing the style and losing the natural distribution of the language. Probably better to invest that compute in better filtering than in rewriting.
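For a quick sanity check on that ratio (just back-of-the-envelope arithmetic, using the usual ~20 tokens-per-parameter Chinchilla rule of thumb):

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Tokens-per-parameter ratio; Chinchilla suggests ~20 as compute-optimal."""
    return tokens / params

ratio_korean = tokens_per_param(500e9, 10e9)  # 50 tokens/param, 2.5x Chinchilla
ratio_dante = tokens_per_param(300e9, 2.1e9)  # ~143 tokens/param for Dante-2B
```

Both runs are well past compute-optimal, which is the right call for small models you actually want to deploy: extra tokens keep buying quality long after the Chinchilla point.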
Good luck with the Korean model. If you're up for it, I'd love to compare tokenizer fertility gains across languages when we both have results. Keep me posted!
smflx@reddit
You're right. Thanks for the thoughtful advice. I actually started this as a specialized model for story writing and role playing, so the style bias could be OK.
But after finding your project, a general model is also of interest. Yes, I will try to get more compute; otherwise, the bottom line is to optimize and minimize everything to fit my compute. I hope to follow your trace well!
angeletti89@reddit (OP)
Update: Phase 2 mid-training sample (step 15750/~28600)
Tested an intermediate checkpoint. Prompt: "Il futuro della tecnologia e della scienza" ("The future of technology and science"): 503 tokens, temp 0.7, top_p 0.9, repetition penalty 1.15.
Full 503 tokens, no repetition loops, coherent structure throughout. 131 tok/s inference on a single GPU.
The good: Grammar, syntax, article usage, complex subordinate clauses, all solid. It's writing structured Italian with technical vocabulary at 2B params and only 55% through Phase 2.
The expected: It hallucinates everything (the "Neural Learning Robot", Prof. James Martin, the IEEE conference). This is normal for a base model with no instruction tuning, factual grounding comes with SFT in Phase 3.
For non-Italian speakers: the output reads like a well-written Italian science article. Native fluency, not "translated English."
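For anyone curious what those sampling settings actually do, here's a plain-Python sketch of temperature + repetition penalty + nucleus (top-p) sampling. This is a generic illustration of the standard algorithm, not my inference code.

```python
import math, random

def sample_next(logits, prev_ids, temp=0.7, top_p=0.9, rep_penalty=1.15):
    """Pick a token id from raw logits: apply repetition penalty to already
    seen tokens, temperature-scale, softmax, then nucleus (top-p) filter."""
    logits = list(logits)
    for tid in set(prev_ids):  # repetition penalty: push seen tokens down
        logits[tid] = (logits[tid] / rep_penalty if logits[tid] > 0
                       else logits[tid] * rep_penalty)
    logits = [l / temp for l in logits]          # temperature scaling
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]    # numerically stable softmax
    z = sum(probs)
    probs = [p / z for p in probs]
    # keep the smallest set of tokens whose cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return random.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

The repetition penalty of 1.15 is what keeps a 2B base model from looping; without it, small models tend to degenerate into repeated phrases well before 503 tokens.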
beneath_steel_sky@reddit
Pretty cool - I'm looking forward to trying the model when it's ready!
angeletti89@reddit (OP)
Thanks! Will post an update here when the weights are on HuggingFace.
silentus8378@reddit
How much did you spend so far?
angeletti89@reddit (OP)
Roughly $5-6k in GPU rental for the actual training runs (Phase 1 + Phase 2 so far). But honestly the real cost was the iteration time before that: debugging the pipeline, fixing tokenizer edge cases, smoke tests that fail at step 200, all while the GPUs are running and billing. If I count everything, probably closer to $8-10k total. Not cheap, but doable for a solo project. The key is getting everything bulletproof before you start the real run.
MadLabMan@reddit
This is very interesting! Considering the way you’ve trained the model, could this serve as a good translation/study tool for learning Italian?
angeletti89@reddit (OP)
Interesting idea! Right now Dante-2B is a base model: it generates text, but doesn't follow instructions yet. So you can't say "translate this to Italian" and get a clean result.
After the SFT phase (instruction tuning, Phase 3), it could potentially work for that use case. The native Italian tokenizer gives it a real advantage since it actually understands Italian morphology rather than treating it as mangled English. Things like contractions (l'intelligenza, dell'algoritmo) and accented forms are handled natively.
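To show what "handled natively" means at the pre-tokenization level, here's a minimal regex illustration. This is not Dante's actual pattern, just the idea: keep elided articles like l' and dell' attached to their apostrophe instead of splitting the apostrophe off as a lone token.

```python
import re

# Illustrative pre-tokenization pattern (not the real one): a word followed
# by an apostrophe stays together, otherwise split into words / punctuation.
PRETOK = re.compile(r"\w+'|\w+|[^\w\s]")

tokens = PRETOK.findall("l'intelligenza dell'algoritmo")
# -> ["l'", "intelligenza", "dell'", "algoritmo"]
```

An English-centric pre-tokenizer splits the same phrase into l / ' / intelligenza, and the BPE merges learned on top of that never recover the contraction as a unit.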
That said, at 2B params it won't compete with larger models on complex translation. Where it could shine is as a lightweight tool for simple translations, vocabulary in context, or generating example sentences. Basically the kind of thing you'd want running locally and fast, not waiting on an API.
I'll keep this use case in mind when designing the SFT dataset. Thanks for the suggestion!
MadLabMan@reddit
Ah, got it! As an Italian-American who grew up speaking dialect, I'm really interested in finding a model I can run myself that can effectively help me learn formal Italian.
So I speak Italian (for the most part) but not 100%, so I'm trying to improve!
Thanks for sharing this with us! :)
angeletti89@reddit (OP)
Your Italian is already excellent!
The dialect-to-formal gap is actually a fascinating use case I hadn't considered. Indeed the training corpus includes a lot of formal Italian (Gazzetta Ufficiale, EuroParl, Wikipedia) so the model has a strong bias toward standard Italian. Could be genuinely useful for someone in your position.
Good luck with your Italian. I'll update you when the model is ready!
MadLabMan@reddit
Thanks a million! I'll be waiting for your update! 🇮🇹
Dany0@reddit
AI slop phrase in the title makes me think a clanker built an LLM from scratch and you're just here to what, pretend? Can't even write your own titles
angeletti89@reddit (OP)
Fair enough, English isn't my first language and the model's Italian is already better than my Reddit titles. Code's on GitHub when it drops, judge that instead.
ForTheDankMemes@reddit
Cool stuff. I might bug you a lot in the future. Out of curiosity, what pre-processing (if any) did you do, what are the quality filters, and how do you schedule the data?
angeletti89@reddit (OP)
Good question! Three layers to this:
Pre-processing: Deliberately minimal. I rely heavily on upstream quality — FineWeb-2 IT is already globally MinHash-deduplicated, FineWeb-Edu is pre-filtered for educational content. On my side, I apply a min character threshold (100 chars for text, 20 for code) to drop stubs and junk, and EOS tokens separate documents in the binary stream. No custom heuristic filters beyond that — I'd rather trust HuggingFace's dedup pipelines than reinvent them at my scale.
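That filtering step is simple enough to show inline. Illustrative sketch only: the EOS string and function names are made up, and the real pipeline joins documents as token ids, not strings.

```python
MIN_CHARS = {"text": 100, "code": 20}
EOS = "<|eos|>"  # illustrative; the real pipeline writes an EOS token id

def keep(doc: str, kind: str = "text") -> bool:
    """Drop stubs and junk below the per-type minimum character threshold."""
    return len(doc.strip()) >= MIN_CHARS[kind]

def pack(docs, kind="text"):
    """Concatenate surviving documents, separated by the EOS marker."""
    return EOS.join(d for d in docs if keep(d, kind))
```

That's genuinely the whole custom filtering layer; everything heavier (MinHash dedup, educational scoring) is inherited from the upstream datasets.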
Quality tiers: At tokenization time, every binary shard gets a tier prefix marking its quality level, from tier 1 (educational/curated) down to tier 3 (raw bulk).
Data scheduling: A custom TieredMemmapDataset samples with weighted probabilities: tier 1 gets 3× the sampling rate of tier 3, tier 2 gets 2×. So the model sees educational and curated content much more often than raw bulk data, but still gets exposure to the full corpus for vocabulary coverage. Within each tier, sampling is uniform random across all shards.
The whole tokenization + tier assignment pipeline will be in the repo when I release. It's a single script that runs on CPU only.
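The weighted tier sampling boils down to a few lines. Sketch only (shard names are illustrative, and the real TieredMemmapDataset obviously does memmapped reads, not string returns):

```python
import random

# Tier -> relative sampling weight, as described above: tier 1 is sampled
# 3x as often as tier 3, tier 2 2x as often.
TIER_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}

def sample_shard(shards_by_tier, rng=random):
    """Pick a tier by weight, then a shard uniformly within that tier."""
    tiers = [t for t in shards_by_tier if shards_by_tier[t]]
    weights = [TIER_WEIGHTS[t] for t in tiers]
    tier = rng.choices(tiers, weights=weights, k=1)[0]
    return rng.choice(shards_by_tier[tier])
```

Two-stage sampling (tier first, then shard) keeps the tier mix stable even when tiers hold very different shard counts, which a flat weighted list over all shards would not.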
FusionCow@reddit
that's pretty cool man
angeletti89@reddit (OP)
Thanks! Excited to share the weights once Phase 2 wraps up. Stay tuned 🇮🇹