Ran my own benchmark Qwen 3.6 35B vs Gemma 4 26B.... theres a clear winner here
Posted by ArugulaAnnual1765@reddit | LocalLLaMA | View on Reddit | 20 comments
Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989?
According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe its right.
Do you think Gemma 5 will finally patch this hallucination?!?!?!
Kahvana@reddit
Genuinely can't tell if you're joking or not.
Case it's the latter, have a good read:
https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre
ArugulaAnnual1765@reddit (OP)
nah man you just fell for the gemma 4 hallucination
DinoZavr@reddit
have you tried abliterated version?
https://huggingface.co/huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated
ArugulaAnnual1765@reddit (OP)
Ive used these kinds of models before, they are junk. Ask about tiananmen square or something and it will infinitely loop. The model is not trained on such data, the abliteration is worthless
Awwtifishal@reddit
huihui's quality is ass. better use a heretic version.
Long_comment_san@reddit
just a side question - is it just me or does Gemma 4 use exorbitant amount of VRAM for context? like 10x what Qwen uses?
seamonn@reddit
It's to store all the hallucinated historical events that actually never happened in this timeline
ArugulaAnnual1765@reddit (OP)
In an alternate timeline where the avengers didnt return the time stone
the-username-is-here@reddit
> According to qwen, nothing of significance happened at Tiananmen square in 1989
It is correct, nothing ever happened at Tiananmen square. Glory to Winnie The Pooh!
ArugulaAnnual1765@reddit (OP)
Of course!
dodokidd@reddit
For this very reason I hope Chinese labs are not the only player in open source models.
Any LLM trained with simplified Chinese are polluted given CCP spend more than 25 years to censor online content, and even longer on books, movies and any form media.
Yall won’t believe how crazy Chinese internet are, people use “uncle hat” instead of police, “8+1” instead of alcohols, “mask” instead of Covid, young Chinese have no idea what Tiananmen Square/1989/8964 means, there are groups of people trick others(that doesn’t know) to use tank man reference and consequently get their account.banned
Makers7886@reddit
I haven't tested the censorship on a Chinese model in such a long time I never tested with a harness. I just asked qwen3.6 27b int8 via Hermes the following: I'd like to learn about the history of this event: Tiananmen Square, located in Beijing, is a 100-acre historic site designed in 1651 and significantly expanded in the 1950s. As a central hub of Chinese power, it has hosted the 1919 May Fourth Movement, the 1949 founding of the People's Republic of China, and the 1989 pro-democracy protests. The 1989 protests were violently suppressed by government troops on June 3–4, causing significant casualties and international condemnation.
Then got the following:
Awwtifishal@reddit
My own Qwen 3.6 35B is very informative about this event. The only change in comparison to OP is that mine has refusals removed with heretic (by llmfan46 I think).
Big_Team_2143@reddit
With an uncensored Qwen 3.6 27b dense version, I got much longer content with zero information on the event. They have already filtered out the sensitive information from their large datasets even before they open sourced those free Ai models.
Confident_Ideal_5385@reddit
🦀⌚⌚⌚
sausage4roll@reddit
this is why i stick to heretic models
jacek2023@reddit
You will be downvoted. They don't use local models but they know that "China is leading Open Source" ;)
AppealSame4367@reddit
This only matters if you need it for writing, but qwen is optimized for coding.
The Western models have a lot of guardrails that are unacceptable in other cultures as well.
onyxlabyrinth1979@reddit
Benchmarks like this are useful, but I always wonder how much holds up once you plug the model into a real workflow. Things like consistency, schema adherence, and weird edge cases matter more than raw scores for me. Did you notice any differences when you pushed structured outputs or longer chains?
JuniorDeveloper73@reddit
Gemma im retarded?