An actual example of "If you don't run it, you don't own it" — Gemma 4 beats both ChatGPT and Gemini Chat
Posted by ThisGonBHard@reddit | LocalLLaMA | View on Reddit | 40 comments
A bit of an interesting story of model degradation and censorship.
So, one of my use cases for AI has been translating and reading a Chinese novel as it is released, chapter by chapter.
Because some characters have secret-identity plot points, the AI has to follow context clues when translating (and for consistency too), so I had to prompt it to watch for these and choose the correct name in the translation.
When I originally started, the main available models were GPT OSS 120B (slow), Qwen 3 Max, and the free ChatGPT 4o.
I tried GPT OSS 120B first; it failed, consistently mixing up names and sometimes inventing new ones.
Then I used Qwen 3 Max. Better, but it still had around a 20% fail rate. Then it consistently started getting hit by the censorship filter (despite no NSFW content).
Then I tried the free ChatGPT version at the time, 4o, and it was by far the best. Names were correct all the time, and the translation quality itself was top notch.
Some time later, with the 5.2 updates, it started failing on 20% of the queries. Then I noticed A/B testing, with one of the versions consistently failing the translations by choosing the wrong name.
Now, with GPT 5.3, the A/B testing seems done, and they deployed the worse version to users, to the point it is comparable to the old Qwen 3 Max.
Now, this made me curious to retest the current state-of-the-art local models for translation. To my surprise, Gemma 4 31B wipes the floor with the closed models; quality is very similar to peak GPT 4o.
So I reran the same prompt and chapter on some of the open and closed models. The results are positive for us:
| Model | PASS/FAIL | INFO |
|---|---|---|
| GPT OSS 120B | FAIL | Merges character names |
| Qwen 3 Max | FAIL (CENSORED) | OK writing, but the output got censored and auto-deleted |
| Qwen 3.6 Plus | FAIL (CENSORED) | Good writing, but the output got censored and auto-deleted |
| ChatGPT 5.3 | FAIL | Messes up character names; translation feels unnatural |
| Gemma 4 31B | PASS | Good translation, feels natural, and is fast |
| Qwen 3.5 27B | PARTIAL PASS | Similar to Gemma 4, a bit less natural sounding, and mixes up character pronouns (calls a Lady a Lord) |
| Gemini Chat | PARTIAL PASS | Surprisingly worse than Gemma 4: a bit less natural sounding, and mixes up character pronouns (calls a Lady a Lord) |
Holy moly, I did the test AFTER I started writing this post. How the hell does Gemma 4 at Q4 beat both Gemini and GPT 5.3? Is the Gemini model Google is serving really worse than Gemma? wtf?!
Uncle___Marty@reddit
Google did something with the language abilities of Gemma 4 which really puts it in a class of its own. I've seen SO many posts praising gemma 4 for this. I was a little disappointed with Gemma at first because it felt like Qwen 3.5 was better than it, but it turns out that they're both amazing little models which excel in different areas.
And you know what? The really mind blowing part is people like us are just being handed these models for free. I have SO much love for the gemma team and alibaba.
Ell2509@reddit
Right haha. But realize, we are testing their products and giving feedback. 99% of the population is not capable of doing this kind of product testing; they lack the technical skills to get started.
vogelvogelvogelvogel@reddit
I am indeed also very thankful to Alibaba; Qwen 3.6 35B is amazing. Haven't tested Gemma 4 yet, though I'm very happy they released another one.
somerandomperson313@reddit
Without Alibaba, I think we would hardly get any good open models anymore. They keep releasing and forcing others to do better.
mikael110@reddit
Yeah they really did. I use LLMs for translation a lot, and thus have been testing a lot of models, both open and proprietary for translation over the years.
I've actually found that translation is one of the areas where models don't universally improve from version to version. For instance, Claude 4 and later are actually worse at translation than earlier Claude models in my tests.
But generally speaking, I always found the big three proprietary models beating local LLMs, until now. Gemma 4 is truly a huge step up. It's the first time I've found a local LLM that is genuinely amazing at translation, not just matching but in many cases beating the larger proprietary LLMs. I really can't say enough how happy I've been with the Gemma 4 release.
Goldkoron@reddit
Gemma 4 might be the best model for translation out there right now period. I previously found the google gemini models to be the best personally, with 2.5 pro being my favorite (3.1 pro was a downgrade). Gemma-4 31B beats them both.
Altruistic_Heat_9531@reddit
Google, after all, is the mother lab of all LLMs: T5, PaLM, and now Gemma. Side note: T5 was created as a machine-translation model, so no weird business there. About Gemini Chat: is OP using the free version or not? It might be routing to a more "flash"-like variant to handle simpler requests.
TheRealMasonMac@reddit
It exceeds Gemini in a lot of respects, IMO. Maybe because Gemini is so heavily quantized? But I trust Gemma 4 with NLP in a way I don't trust any other model.
Borkato@reddit
100%. I’m waiting for someone to somehow distill the creative writing into qwen and boom we’ll have everything lol
Salt-Willingness-513@reddit
Agreed. I've never seen such a tiny model be so capable at writing and transcribing Swiss German (and actually 80% of the SOTA models can't do that).
CheatCodesOfLife@reddit
Do you send it one isolated chapter at a time with "Translate this from Chinese", or do you need something more complicated with history / summaries of the plot, etc.?
Could you explain this? My understanding was that anything published in China (fiction) is already censored by the publishers / platforms, and Qwen would have the same censorship rules. So how does something published in China end up with its translation censored by the Chinese LLM?
ProfessionalSpend589@reddit
Yep, great model, comrade. I use the 26B MoE to have fun with reading Russian propaganda.
WhoRoger@reddit
Qwen is awesome for translating from Chinese, but I guess you have to query it well, because it's such a complex language with a ton of nuance. I sometimes need to translate stuff and then ask for details, and even 3.5 4B always explains stuff so well, I'm quite amazed.
That said... If you use Qwen in its regular censored form, you're just setting yourself up to get burned. I keep harping about censors, but with Qwen it's particularly critical to use decensored models for any kind of creative work. You absolutely will run into guardrails, as you can see.
ThisGonBHard@reddit (OP)
The guardrails here are different from the censorship in the local model. The online models have a second censor model deleting their output.
The local one did the task with no issue raised; it was just worse than Gemma 4.
vlodia@reddit
"beats" based on theoretical, r u high?
ThisGonBHard@reddit (OP)
Seems you are, mate. Or did you not read the post? It is based on actual tasks, not theoretical ones.
positive_mango@reddit
I'm interested in your usecase. Could you please share your workflow. Where do you get the chinese source?
ThisGonBHard@reddit (OP)
Chapters come from SFACG, then get put through Qwen (split into 5 images) or a similarly good vision model to extract the text (the chapters are formatted as images to stop piracy).
Then I make a prompt with the correct character names, place names, etc., and give it to Gemma.
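The name-pinning step OP describes can be sketched as a small prompt builder. This is my own illustration, not OP's actual prompt; the function name, wording, and the glossary entries are all hypothetical:

```python
# Sketch of pinning character/place names in the translation prompt so the
# model keeps them consistent across chapters (names here are made up).

def build_translation_prompt(chapter_text: str, glossary: dict[str, str]) -> str:
    """Build a prompt that fixes character/place names for the model."""
    glossary_lines = "\n".join(f"- {cn} -> {en}" for cn, en in glossary.items())
    return (
        "Translate the following Chinese web-novel chapter into English.\n"
        "Always use these exact names (some characters have secret "
        "identities, so rely on context to pick the right one):\n"
        f"{glossary_lines}\n\n"
        f"Chapter:\n{chapter_text}"
    )

# Example with a hypothetical glossary entry for a character and their alias.
prompt = build_translation_prompt(
    "某章节原文……",
    {"林风": "Lin Feng", "夜影": "Night Shadow (Lin Feng's alias)"},
)
```

The resulting string is what you would paste (or send via API) to Gemma along with each new chapter.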
positive_mango@reddit
Thanks a lot. Do you capture the pic manually? Because I imagine if the text is long, you would have to capture a lot. Or do you just zoom out and then capture everything at once?
ThisGonBHard@reddit (OP)
You can force-re-enable right click in Firefox, open the chapter as an image, and then take a full-page screenshot (this works better than "save image as").
You split that screenshot into 5 in Photoshop or a similar program, and give the pieces to Qwen.
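The manual Photoshop step could also be scripted. A minimal sketch, assuming Pillow is installed; the function name and file names are illustrative:

```python
# Slice a tall full-page screenshot into 5 stacked bands, top to bottom,
# so each piece fits comfortably into the vision model's input.
from PIL import Image

def slice_page(img: Image.Image, parts: int = 5) -> list[Image.Image]:
    """Cut a tall screenshot into `parts` horizontal bands."""
    width, height = img.size
    band = height // parts
    slices = []
    for i in range(parts):
        top = i * band
        # The last band absorbs any rounding remainder.
        bottom = height if i == parts - 1 else (i + 1) * band
        slices.append(img.crop((0, top, width, bottom)))
    return slices

# Example usage (file names hypothetical):
# for i, part in enumerate(slice_page(Image.open("chapter.png"))):
#     part.save(f"chapter_part_{i}.png")
```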
positive_mango@reddit
Thanks a lot
mineyevfan@reddit
Good taste!?? Although the quality has declined over the years :/
Mashic@reddit
I had the same experience translating English to Arabic and Korean to English. Gemma 31B and 26B don't make any mistakes, maybe one unpolished sentence every 3 paragraphs.
The only caveat is that it sometimes confuses pronoun genders at the beginning; adding a "the narrator is male/female" note at the start fixes it.
Lone_void@reddit
Inspired by your comment, I tried translating a mathematically heavy text into Egyptian Arabic, and oh god, it is incredible. I have never seen an LLM that can write Egyptian Arabic like a native would. Before this, LLMs would mix Egyptian Arabic with Standard Arabic, with Standard Arabic appearing more often even when prompted not to.
sine120@reddit
Gemini Flash has tanked recently. I don't know if Google held back a larger Gemma, but if they released it, it'd definitely beat flash now.
Spirited_Neck1858@reddit
tanked means?
Independent_Plum_489@reddit
Gemma 4’s language handling is honestly on another level right now. I’ve been seeing the same thing across multiple use cases.
ranting80@reddit
Gemma 4 has me using SillyTavern a lot now. I used to dabble in it for fun, but man, it's so good. I have a Star Trek roleplay I do, and it makes up anomalies and random plot lines, holds conversations with multiple characters, understands nuance, and even makes my android secondary character not use contractions. I've never had a model do so well, and it's completely local.
Ardalok@reddit
Yeah, 4o was really good at fiction. For some reason, modern closed models can't really pull that off anymore. Maybe too much synthetic data? But Gemma 4 is actually cool - the 31B is at least on par with Flash, if not better.
Worried-Squirrel2023@reddit
this is exactly why local matters even when it's a few percent worse than frontier. the model i pin today won't suddenly refuse to do my task next month because some PM decided to add a guardrail. cloud models are rented, not owned, and the rental terms can change overnight.
weiyong1024@reddit
Same principle applies to agent runtimes too, not just the models themselves. I was running OpenClaw agents on my ChatGPT subscription via Codex OAuth, then OpenAI added Cloudflare protection and killed the whole path overnight. If the vendor controls your runtime access they can pull the rug anytime.
edward-dev@reddit
Any thoughts on gemma 4 26b-a4b? Have you tested it out too?
ttkciar@reddit
I have a hypothesis that the Gemma models are the beta-test releases of the next version of Gemini. Gemma 3 was similarly a step up from Gemini 2.
Google might be using traces logged from API users to make last-minute improvements to their mid- and/or post-training data before hitting Gemini 4 with it.
Certainly the way they botched Gemma 4 tool-calling has a beta-test "smell" to it. Presumably they've noticed and are taking steps (internally, at least) to address it.
the320x200@reddit
I mean, maybe, but having an open release of any aspect of your upcoming paid model (so all your competitors can dissect it) just to get the feedback of a few users from localllama would be an odd move to make for sure...
ttkciar@reddit
I don't think they care about us much at all. Like I said, they would be getting traces from their API users, not local inference enthusiasts.
cibernox@reddit
I am just creating https://meetwillow.app and for most of the AI features I'm using Gemma 4 26B-A4B and it's very good.
I have strong guardrails to force it to always use the RAG tools to look up factual information instead of winging it, but it still maintains character while obeying guidelines, and it can find the information it needs and compose accurate recommendations.
I was using Qwen 3.5 before, but I found Gemma 4 to be both cheaper to run and nicer to talk to. By a lot.
Maybe Qwen was a bit more thorough, in that it called more tools and gathered more information before answering; Gemma 4 is more prone to call one or two tools and decide it has enough information to answer. Most of the time it's correct, and that's not a bad thing, because it saves me context.
I did try GPT-OSS 120B too, and Gemma 4 36B was better across the board.
Mashic@reddit
Google has all the data on how people search for things on the internet in multiple languages. This definitely helps them tune their models to give human-like responses, while Alibaba doesn't have that dataset.
Sevenos@reddit
I love Gemma 4 as a (German) chat bot as well, so good at formulating and structuring responses and very very few language mistakes.
It's the first one I actually use and not just toy around with.
Really hope they release a larger version of it, but I already prefer it over Gemini in some cases, so they're probably holding it back.
Ok-Measurement-1575@reddit
There is something quite special about 3.6 35b.
Maybe try it.
Potential-Gold5298@reddit
You are not the only one who came to these conclusions:
https://dubesor.de/benchtable
https://foodtruckbench.com/#leaderboard
The RP community has been stuck with Mistral Nemo and Mistral Small (both two years old) simply because there was nothing better of comparable size. Now, they finally have a decent model for RP/creative writing.
Gemma 4 is an incredible model with many talents.