GLM 5.1 sits alongside frontier models in my social reasoning benchmark
Posted by cjami@reddit | LocalLLaMA | View on Reddit | 17 comments
Still need more matches for reliable data, but GLM 5.1 looks very competitive with other frontier models.
This uses a benchmark I made that pits LLMs against each other in autonomous games of Blood on the Clocktower (a complex social deduction game) - last screenshot shows GLM 5.1 playing as the evil team (red).
For contrast: Claude Opus 4.6 costs $3.69 per game, while GLM 5.1 costs $0.92 per game, with a 0% tool error rate. Very impressive.
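For anyone curious how a harness like this hangs together, here's a minimal sketch of the core loop. To be clear, this is not my actual code: the `call_model` helper, player counts, and phase structure are all illustrative stand-ins.

```python
import random

# Illustrative sketch of a social-deduction benchmark loop;
# not the actual harness behind the numbers above.

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion API call."""
    return f"[{model}'s reply to: {prompt[:40]}...]"

def play_game(models: list[str], days: int = 5) -> dict:
    # Seat each contestant several times so the teams are mixed.
    players = {f"player_{i}": m for i, m in enumerate(models * 3)}
    evil = set(random.sample(sorted(players), k=2))  # hidden evil team
    transcript = []
    for _ in range(days):
        for name, model in players.items():
            role = "evil" if name in evil else "good"
            reply = call_model(model, f"You are {name} ({role}). "
                                      f"Public log: {transcript}. Speak.")
            transcript.append((name, reply))
        # Nomination and execution votes would go here (omitted).
    return {"transcript": transcript, "evil": evil}

result = play_game(["glm-5.1", "claude-opus-4.6"])
print(f"{len(result['transcript'])} utterances; evil team was {result['evil']}")
```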
Specter_Origin@reddit
Shame they increased their price so much, was hoping to buy their entry level plan as backup.
havenoammo@reddit
The coding plan is pointless after Feb 12th (when they introduced weekly limits); the cost of their weekly limit is roughly the same as pay-per-token prices on OpenRouter. On top of that, the z.ai service is horrendous: check their Discord and you'll see endless complaints about the service breaking around the 60-70k context limit. They also serve heavily quantized versions. The GLM model is good, but the z.ai service is bad. So if you don't want to get scammed, get your service from another provider.
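If you want to sanity-check the "roughly the same as pay-per-token" claim yourself, here's a quick back-of-the-envelope comparison; every figure below is a placeholder, not z.ai's or OpenRouter's actual pricing:

```python
# Back-of-the-envelope: subscription vs pay-per-token.
# All figures are placeholders; substitute real rates before drawing conclusions.
plan_price_per_week = 5.00        # hypothetical weekly share of the plan cost
weekly_token_limit = 20_000_000   # hypothetical weekly limit, in tokens
openrouter_price = 0.25           # hypothetical $/1M tokens, pay-as-you-go

plan_effective_price = plan_price_per_week / (weekly_token_limit / 1_000_000)
print(f"plan: ${plan_effective_price:.2f}/1M tokens "
      f"vs OpenRouter: ${openrouter_price:.2f}/1M tokens")
# If the two come out close, any unused weekly allowance
# makes the plan strictly worse than paying per token.
```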
Altruistic_Heat_9531@reddit
I really want to support Zhipu, and running GLM 5.1 or any 200B+ model on my rig is out of the question. But yeah, their 5.1 is slow... so I mostly use 4.7, mainly for code sweeps and docs on their plan, and the Flash model locally.
havenoammo@reddit
It's not a matter of being slow; it's about being broken and unusable. I am okay with it being slow, but not with it being broken and not worth the money. They take your money but can't deliver on context usage over 60-70k; all tool calls, etc., start failing.

Besides, their coding plan hasn't been worth it since the weekly limits were introduced. If you subscribed before the Feb 12th changes and your context usage stays below 60-70k, sure, it makes sense to keep your subscription. But new subscribers are better off getting it from another provider. Even putting money on OpenRouter and still using Zhipu makes more sense, because if you end up not using their service, you can at least spend the credit on something else. Otherwise, your weekly limit just disappears. I guess that's how they want to make money: unused subscription tokens are pure profit for them now, on top of their inference profits.

Also, just a reminder: companies exist for profit and are not our friends. I understand the desire to support open models, but they are not doing it out of altruism. I am sure they are well-supported by their government anyway.

And their consumer support is non-existent. I asked for a refund a month ago and haven't received a reply; I had to go through my bank.
Altruistic_Heat_9531@reddit
I haven't hit the broken 70K ceiling because it's so slow; I already get multiple API timeouts, etc., around 30K-40K tokens. The weekly limit personally isn't a problem for me, since even at max usage there is still 30-45% remaining each week. To be fair, any SOTA model is kinda overkill for my use cases: mostly code sweeps, checks for bleeding business logic and smelly code, boilerplate code, and of course repo search. Again, I just pay them, mostly as a "donation" if you will.
Cosmicdev_058@reddit
$0.92 vs $3.69 for comparable performance makes it a lot easier to justify running evaluation loops you would normally skip, because burning Opus credits to iterate on game logic or agent behavior feels wasteful.
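Put concretely, using just the per-game figures from the post:

```python
# How many benchmark games a fixed eval budget buys at each price point,
# using the per-game costs quoted in the post.
budget = 100.00
for model, cost_per_game in [("Claude Opus 4.6", 3.69), ("GLM 5.1", 0.92)]:
    print(f"{model}: {budget / cost_per_game:.0f} games per ${budget:.0f}")
# GLM 5.1 stretches the same budget roughly 4x further (3.69 / 0.92 ≈ 4.0).
```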
styles01@reddit
I am using Openclaw and Claude Code (via Ai-Run) exclusively with GLM 5.1 on Ollama's max tier. No need for anything else. It's amazing.
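For anyone wanting to try something similar without those exact tools, a minimal sketch: Ollama exposes an OpenAI-compatible endpoint locally, so any OpenAI-style client can drive the model. The `glm-5.1` model tag is my assumption; use whatever `ollama list` reports on your machine.

```python
# Minimal sketch: point any OpenAI-compatible client at Ollama's local endpoint.
# The model tag "glm-5.1" is an assumption; check `ollama list` for the real one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Summarize this repo's structure."}],
)
print(resp.choices[0].message.content)
```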
thargabon@reddit
GLM 5.1 sits just below Opus 4.6, above Sonnet 4.6.
Excellent! I hooked it up with OLLAMA in my terminal.
blueredscreen@reddit
Without knowing what your benchmark is, this is incredibly useless.
Embarrassed_Soup_279@reddit
This is really cool. Do you have any plans to test the top smaller models like Gemma 4 and Qwen 3.5? I am interested in seeing Gemma 4 31B and 26B, and Qwen 3.5 27B and 35B, since Gemma 4 scored quite high on the EQBench v3 leaderboard as well.
cjami@reddit (OP)
I have an issue with Gemma 4's reasoning that I've posted about here. The reasoning gets suppressed on complex prompts, so I'm waiting until that's addressed or a workaround turns up. I could test it without reasoning, but I feel that would be unfair. I'll keep trying and will get it up there when I can.
Regarding Qwen 3.5, I do have the bigger 397B A17B on the full leaderboard; the smaller variants had tool call error rates that were too high.
There are also some Mistral models on there, but they aren't doing too well.
I do have an interest in small models and cost efficiency, so yes will keep an eye on them.
FoxiPanda@reddit
Gemma-4 has seen an insane amount of development in the last 100 hours. Have you tried re-downloading the new GGUFs, updating/building llama.cpp from source to pick up b8766+, and using the new chat template explicitly?
Once I got all those ducks in a row, I found Gemma-4 to be exceptionally good.
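If you're driving it from Python instead of the llama.cpp CLI, the same checklist applies. Here's a sketch via llama-cpp-python, where the GGUF filename and the `chat_format` value are my assumptions, and the bindings need to be recent enough to carry the b8766+ fixes:

```python
# Sketch: loading a freshly re-downloaded Gemma-4 GGUF with an explicit
# chat template via llama-cpp-python. The model path and chat_format name
# are assumptions; use bindings new enough to include the b8766+ fixes.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-4-26b-Q5_K_M.gguf",  # hypothetical filename
    n_ctx=8192,
    chat_format="gemma",  # force the template instead of relying on autodetect
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who are the evil players likely to be?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```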
cjami@reddit (OP)
Awesome, thanks for letting me know, I'll check it out!
pantalooniedoon@reddit
Super interesting post! Are you planning more of these benchmarks?
cjami@reddit (OP)
Thanks! I'll probably add notable models to the board as they come along so the rankings will evolve over time.
NoFaithlessness951@reddit
The future is open
cjami@reddit (OP)
Full game transcripts and more stats here:
https://clocktower-radio.com/