Gemma 4 is a huge improvement in many European languages, including Danish, Dutch, French and Italian
Posted by Balance-@reddit | LocalLLaMA | 56 comments
The benchmarks look really impressive for such small models. Even in general, they stand up well. Gemma 4 31B is (of all tested models):
- 3rd on Dutch
- 2nd on Danish
- 3rd on English
- 1st on Finnish
- 2nd on French
- 5th on German
- 2nd on Italian
- 3rd on Swedish
Curious if real-world experience matches that.
Source: https://euroeval.com/leaderboards/
AffectionateHome3113@reddit
Only Gemma 4‑31b solved my own smol benchmark (a German exercise from a book). The other local models I tried all failed:
- Qwen 3.5‑122B (Q4_K_XL)
- Qwen 3.5‑35B (Q8)
- Qwen 3.5‑27B (Q8)
- Gemma 4‑26B (Q8)
Inevitable-Name-1701@reddit
The butchering of the Hungarian language really disappointed me.
drillmast3r@reddit
I tried to see if there had been any improvement in the Hungarian language compared to the previous model, but unfortunately, I don’t think so. And yet I was really looking forward to this model.
AssOverflow12@reddit
Shame about that :(
drillmast3r@reddit
Yes. Gemma 3 was almost good enough for me, and I had high hopes for 4, but alas...
EntertainmentOne7897@reddit
Which task isn't it good enough for? I tried the racka and puli models, but didn't have much success with them. Gemma is probably better than they are.
Healthy-Nebula-3603@reddit
Have you tested the Aya 31B model (trained for translations)?
drillmast3r@reddit
I will try it, thank you!
alamacra@reddit
Are there any models that are any good in Hungarian, even of the really large ones?
drillmast3r@reddit
I don't know the big ones. I’ve tried these: qwen3.5-35b - 27b, gpt-oss-20b, gemma-3-27b-it, glm-4.7-flash, and gemma-4-26b - 31b. In my opinion, gemma-3 produces the best Hungarian text out of these. But unfortunately, it’s not flawless either.
PunnyPandora@reddit
gemini is pretty good at them but it's not open source
pip25hu@reddit
Isn't it in second place for Hungarian, right after the latest Gemini?
Barbaricliberal@reddit
I've found Gemma 4 to be surprisingly good for Farsi/Persian translations and support.
From E4B upwards it's good (E2B leaves a lot to be desired).
ZeitgeistArchive@reddit
Is it functional now in LM Studio?
Ok_Fish_39@reddit
In one small European language, gemma-3-27b is much better than gemma-4-31B. For starters, Gemma 3 starts the answer right away in the same language, while Gemma 4 reasons in English and then translates it poorly.
That_Country_7682@reddit
1st on finnish is actually wild. small models doing multilingual this well was not on my 2026 bingo card.
FinBenton@reddit
I did some stuff with it in Finnish. It's the best I've seen, but it does make a lot of mistakes.
drallcom3@reddit
What kind of mistakes does it make? I'm curious. Does it get words wrong? Or is it more in long texts?
FinBenton@reddit
It makes pretty obvious mistakes, the kind someone who immigrated to Finland would keep making for the rest of their life. There are like 100 different variations of each word, as it's a pretty unusual language.
koloved@reddit
Is there a website that includes all languages, rather than just the ones that made the list for political reasons?
EmsMTN@reddit
Right! The second most commonly used language on the internet is conveniently absent.
tahini001@reddit
All 4000 languages?
draconisx4@reddit
Solid benchmarks on Gemma 4, but real-world testing is crucial to ensure these models don't introduce biases in different languages; that's where proper oversight pays off. What's your experience with cultural safeguards?
anotheruser323@reddit
Non-professional translation is one of the things I think LLMs are actually good for. And google seems to be the best at it currently.
PreciselyWrong@reddit
Google Translate, on the other hand, has become utter garbage compared to deepl and such
Mrfrednot@reddit
What model should I use for Ancient Greek? Is there one that is specifically good for old texts?
ambient_temp_xeno@reddit
They really just gave us a SOTA translation model.
LoafyLemon@reddit
31B can actually roleplay too, without extra fine tuning or decensoring! I am actually amazed and baffled at the same time. It really sticks to character descriptions, so if you do DnD RP, villains are actually villainous.
It also has a lot of knowledge about fantasy worlds, which is fun.
ambient_temp_xeno@reddit
Gemma 4 really showed the weaknesses of the Chinese models in terms of their censorship (not just the Tiananmen square kind, the ass-slapping kind) and lack of datasets (compared to Google).
Mistral of course is even worse.
pol_phil@reddit
The translation evals can be misleading. After testing on some lower resource EU languages for scientific document translation, Gemma4 can lose coherence and start outputting random Chinese/Hindi/Arabic.
ambient_temp_xeno@reddit
Is gemma 4 working right in what you're testing it on though? Gemma 4 was broken for me in llama.cpp in an insidious way until b8648
Mark__27@reddit
Is there a similar eval to this for Arabic/Hindi?
arbv@reddit
Unfortunately, it is worse at Ukrainian. Gemma 3 27B was near perfect, second only to Google's cloud models.
madsheepPL@reddit
I don't see qwen 3.5 27B in there... It's been a top performer for me.
Icy-Degree6161@reddit
For European languages specifically?
madsheepPL@reddit
yeah, translations from different languages into english. I'll have to review it though, going through this made me realize I need better benchmarks.
windozeFanboi@reddit
Qwen 3.5 was the first Qwen to actually be usable for many EU languages, but it's closer to Gemma 3 than 4...
I mean, as far as the less popular languages I speak go.
FlamaVadim@reddit
unfortunately qwen is not so great in translations
Tenerezza@reddit
Not for me, that's for sure. I can only verify Swedish, Norwegian and Danish to an extent, and every Qwen model is worse than even Gemma 3 when it comes to translations. And the few times I needed to use Finnish, it was basically just crap. Gemma is by far much better at it, and Gemma 4 is actually one of the best yet, even better than Claude so far.
FinBenton@reddit
At least in Finnish, Qwen is just horrible compared to Gemma.
Moreh@reddit
Many requests below are asking for similar benchmarks for non-european languages, does anyone know if such a thing exists? I know google is the best for most languages, but i am interested whether it beats qwen for asian languages like Indonesian.
Cold_Tree190@reddit
Has anyone tested it with Japanese? How well does it perform if so?
unskilledexplorer@reddit
what does the rank mean? average position of a model across various tasks? so if a model is rank 1.34, it is only good relative to other models, right? so if all models are bad at a particular language, then...
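A fractional rank like 1.34 is consistent with averaging a model's per-task placement, which does make the number purely relative to the other models tested. A minimal sketch, assuming mean-rank aggregation (the model names, tasks, and scores below are invented for illustration; EuroEval's actual method may differ):

```python
# Hypothetical illustration: a fractional "rank" such as 1.34 can arise
# from averaging a model's 1-based position across a language's tasks.
# All task names and scores here are made up for the example.

def mean_rank(scores_by_task: dict[str, dict[str, float]], model: str) -> float:
    """Average the model's position across tasks (higher score = better)."""
    positions = []
    for task, scores in scores_by_task.items():
        # Sort model names by score, best first; position is 1-based index.
        ranking = sorted(scores, key=scores.get, reverse=True)
        positions.append(ranking.index(model) + 1)
    return sum(positions) / len(positions)

scores = {
    "sentiment": {"model-a": 0.91, "model-b": 0.88, "model-c": 0.70},
    "ner":       {"model-a": 0.80, "model-b": 0.85, "model-c": 0.60},
    "qa":        {"model-a": 0.77, "model-b": 0.75, "model-c": 0.74},
}

# model-a places 1st, 2nd, 1st -> mean rank of about 1.33
print(mean_rank(scores, "model-a"))
```

Under this scheme a rank near 1.0 only says the model usually beats the others on the list, not that its absolute scores are good; if every model is bad at a language, someone still ends up ranked first.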
Mark__27@reddit
What about Arabic/Hindi?
HigherConfusion@reddit
Thanks. It confirms my own experience that Gemma 3 12B is still the best model at Danish that my machine can handle. It feels like Gemma 4 left a big gap between E4B and 26B-A4B.
alexx_kidd@reddit
Greek.. worse
Fluxx1001@reddit
Interesting leaderboard. However, it's strange that the Mistral models are way behind in this benchmark, although they are explicitly trained to be multilingual and European.
bonobomaster@reddit
As a German, this is still a good reminder to only talk to any LLM in the language it was trained on the most.
Icy-Degree6161@reddit
Is this about generic language interaction or translation specifically?
In the translation space for these languages I found TranslateGemma and EuroLLM to be great.
Mashic@reddit
In my tests for English > Arabic translation, Gemma 4 blows TranslateGemma out of the water.
Mashic@reddit
For English to Arabic too. I've really been impressed by its accuracy compared to TranslateGemma.
phido3000@reddit
One day they will develop an AI that can understand Australian.
Available_Load_5334@reddit
https://millionaire-bench.referi.de/
A benchmark using questions from the German version of "Who Wants to Be a Millionaire?".
BrightRestaurant5401@reddit
Of the local models, Gemma has always been the most impressive in Dutch, and this time is no exception.
I must say, however, that the most surprising thing to me is that Claude Sonnet has the number 1 spot in this ranking.
Middle_Bullfrog_6173@reddit
Doesn't reach the performance of the closed models on some of the smaller languages, but probably the open SOTA. Matches my experience in practice.
Zestyclose-Ad-6147@reddit
Wow, impressive! Also, nice site, I didn't know this one.