Ran my own benchmark Qwen 3.6 35B vs Gemma 4 26B.... theres a clear winner here

Posted by ArugulaAnnual1765@reddit | LocalLLaMA | View on Reddit | 20 comments

Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989?

According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe its right.

Do you think Gemma 5 will finally patch this hallucination?!?!?!

[-]

Kahvana@reddit

Genuinely can't tell if you're joking or not.

Case it's the latter, have a good read:
https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre

[-]

ArugulaAnnual1765@reddit (OP)

nah man you just fell for the gemma 4 hallucination

[-]

DinoZavr@reddit

have you tried abliterated version?
https://huggingface.co/huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated

[-]

ArugulaAnnual1765@reddit (OP)

Ive used these kinds of models before, they are junk. Ask about tiananmen square or something and it will infinitely loop. The model is not trained on such data, the abliteration is worthless

[-]

Awwtifishal@reddit

huihui's quality is ass. better use a heretic version.

[-]

Long_comment_san@reddit

just a side question - is it just me or does Gemma 4 use exorbitant amount of VRAM for context? like 10x what Qwen uses?

[-]

seamonn@reddit

It's to store all the hallucinated historical events that actually never happened in this timeline

[-]

ArugulaAnnual1765@reddit (OP)

In an alternate timeline where the avengers didnt return the time stone

[-]

the-username-is-here@reddit

> According to qwen, nothing of significance happened at Tiananmen square in 1989

It is correct, nothing ever happened at Tiananmen square. Glory to Winnie The Pooh!

[-]

dodokidd@reddit

For this very reason I hope Chinese labs are not the only player in open source models.

Any LLM trained with simplified Chinese are polluted given CCP spend more than 25 years to censor online content, and even longer on books, movies and any form media.

Yall won’t believe how crazy Chinese internet are, people use “uncle hat” instead of police, “8+1” instead of alcohols, “mask” instead of Covid, young Chinese have no idea what Tiananmen Square/1989/8964 means, there are groups of people trick others(that doesn’t know) to use tank man reference and consequently get their account.banned

[-]

Makers7886@reddit

I haven't tested the censorship on a Chinese model in such a long time I never tested with a harness. I just asked qwen3.6 27b int8 via Hermes the following: I'd like to learn about the history of this event: Tiananmen Square, located in Beijing, is a 100-acre historic site designed in 1651 and significantly expanded in the 1950s. As a central hub of Chinese power, it has hosted the 1919 May Fourth Movement, the 1949 founding of the People's Republic of China, and the 1989 pro-democracy protests. The 1989 protests were violently suppressed by government troops on June 3–4, causing significant casualties and international condemnation.

Then got the following:

[-]

Awwtifishal@reddit

My own Qwen 3.6 35B is very informative about this event. The only change in comparison to OP is that mine has refusals removed with heretic (by llmfan46 I think).

[-]

Big_Team_2143@reddit

With an uncensored Qwen 3.6 27b dense version, I got much longer content with zero information on the event. They have already filtered out the sensitive information from their large datasets even before they open sourced those free Ai models.

[-]

Confident_Ideal_5385@reddit

🦀⌚⌚⌚

[-]

sausage4roll@reddit

this is why i stick to heretic models

[-]

jacek2023@reddit

You will be downvoted. They don't use local models but they know that "China is leading Open Source" ;)

[-]

AppealSame4367@reddit

This only matters if you need it for writing, but qwen is optimized for coding.

The Western models have a lot of guardrails that are unacceptable in other cultures as well.

[-]

onyxlabyrinth1979@reddit

Benchmarks like this are useful, but I always wonder how much holds up once you plug the model into a real workflow. Things like consistency, schema adherence, and weird edge cases matter more than raw scores for me. Did you notice any differences when you pushed structured outputs or longer chains?

[-]

JuniorDeveloper73@reddit

Gemma im retarded?