GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost
Posted by zylskysniper@reddit | LocalLLaMA | View on Reddit | 93 comments

I wanted to know whether GLM is another benchmark-optimized model or actually useful in agents like OpenClaw, so I tested GLM 5.1 in our agentic benchmark.
Turns out it reaches Opus 4.6-level performance at just 1/3 of the cost (~$0.4 per run vs ~$1.2 per run) based on my tests. It outperforms all the other models tested and pushes the cost-effectiveness frontier quite a bit.
I don't quite trust static benchmarks; I've seen many models optimized for them, ranking high on those leaderboards but not working well in real agentic tasks. So we use OpenClaw to test the agentic performance of models in a real environment with real tasks (user submitted): Chatbot Arena/LMArena-style battles, LLM as judge.
Based on the result, I would say GLM 5.1 is one of the top models for OpenClaw type of agents now.
Qwen 3.6 also did a good job, but it does not support prompt caching yet (on openrouter) so the current price is inflated. With prompt caching I expect it to reach minimax m2.7-level cost per run and become another great choice for cost-effectiveness.
Full leaderboard, cost-effectiveness analysis, and methodology can be found at https://app.uniclaw.ai/arena?via=reddit . Strongly recommend submitting your own tasks and seeing how different models perform on them.
LittleYouth4954@reddit
Glm 5.1 is all I need for my use cases and works wonderfully via the z.ai coding plan. Qwen 3.6 is next in line and performs really well. It is currently free in qwen cli.
Kappalonia@reddit
Uuuh, the tokens/s are very slow for me. Are you using some config different from the setup on the claude code documentation page?
LittleYouth4954@reddit
Mine is relatively slow too, but nothing crazy. I use the opencode app, not the cli, with no mcp servers.
angelarose210@reddit
Which zai plan do you have?
LittleYouth4954@reddit
I have a legacy lite plan with the 5 h limit only and a pro plan with both 5 h and weekly limits. The lite plan I use for the autoclaw tool only. I have been coding hard this last week and reached 80% of the pro plan before my quota was renewed.
inthesearchof@reddit
GLM 5.1 seems like the current holy grail for those that are running the largest local llm setups.
zylskysniper@reddit (OP)
even with api/coding plan it's still pretty affordable compared to claude
ttkciar@reddit
Yes, but this is LocalLLaMA.
Due-Memory-6957@reddit
Interesting choice in what you made bold. Let's just accept that the subreddit has outgrown its "running Llama locally" origins.
ttkciar@reddit
That's not "outgrowing" anything; it's off-topic content polluting the river.
LocalLLaMA is for locally hosted LLM technology, whether that is for inference or training.
Recoil42@reddit
Local Llama is for Llama, the Large Language model by Meta, Inc.
If you don't like it, gtfo.
llama-impersonator@reddit
the local part is far more important than the llama part
Recoil42@reddit
Oh so you're saying we've outgrown the llama part
How interesting
llama-impersonator@reddit
if meta released llama 5, we would happily talk about it, but they got wanged.
Recoil42@reddit
So you're saying we've outgrown the llama part
How fascinating
llama-impersonator@reddit
it's not like we didn't talk about falcon when this place was new. as it is not particularly difficult to understand my point without further elaboration, i'll leave it at that
thrownawaymane@reddit
we've outgrown your attitude, get a new one
Recoil42@reddit
Ah, I see we've arrived at the ad hominems already.
knightgod1177@reddit
That’s not an ad hominem, he’s just pointing out your attitude
Recoil42@reddit
knightgod1177@reddit
Idk if you really wanna go down this path, considering you’re using logical fallacies yourself: “Again: So you’re saying we’ve outgrown the llama part”. Strawman argument, in its most classic form I might add. You really wanna throw stones when you’re in a glass house?
Recoil42@reddit
Go ahead, explain the strawman argument.
knightgod1177@reddit
You put that argument in his mouth without him ever saying that. He made a clear and distinct argument about the local aspect, and you unilaterally repositioned his argument as “we’ve outgrown the llama part”. Not once did he ever say that nor angle that argument. Factually, Meta dropped the bag big time, leaving everyone to resort to other models. It might be colloquially true that we’ve outgrown the llama part, but that dude wasn’t saying that. You are.
Recoil42@reddit
I'm representing the argument as it literally fucking is, champ. That's the argument! The Llama part is no longer important! We've outgrown it! That's the whole fucking argument!
knightgod1177@reddit
You’re resorting almost exclusively to strawman arguments. That typically happens when someone is so emotionally invested in an argument that they lose all reason, and try to win on emotion alone. Your outbursts are evidence of this. I asked you if you wanted to go down this path because you’re using logical fallacies to prop up your arguments, while simultaneously blaming him for using logical fallacies. It’s a case of the pot calling the kettle black. In response you used more strawman arguments to try and reshape and reframe my argument. It doesn’t work, primarily because you used literally no other legit argument tactics to support your position. All you seem to be able to do is lash out emotionally. It’s the weakest position to try and win an argument from, as evidenced by the fact you’re the only one getting angry.
ttkciar@reddit
"llama" has become a general term for a model architecture. Many models even have "llama" as their GGUF architecture value, like K2-V2-Instruct (by LLM360, who have nothing whatsoever to do with Meta):
Nowadays it's understood to mean transformer model architectures compatible with or derived from the original llama.
As a rule of thumb, if it's compatible with llama.cpp, or if support for it is expected to come to llama.cpp, it's okay on this sub.
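For illustration, a quick sketch of checking that metadata key with the `gguf` Python package from the llama.cpp repo (the file path is hypothetical and field-access details can vary between package versions):

```python
# Hypothetical sketch: read the "general.architecture" key from a GGUF file.
# Requires the `gguf` package from the llama.cpp repo (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("some-model.gguf")  # hypothetical path
field = reader.fields["general.architecture"]

# String fields store their bytes in `parts`, indexed via `data`; decode them
# to get the architecture name (often just "llama", even for non-Meta models).
arch = bytes(field.parts[field.data[0]]).decode("utf-8")
print(arch)
```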
Recoil42@reddit
Because we've outgrown the scope of LLaMA, the Large Language Model from Meta AI.
How am I being forced to explain this twice? Jfc.
Because we've outgrown the scope of LLaMA, the Large Language Model from Meta AI.
Three times now.
Firm-Fix-5946@reddit
and ages ago too. as much as some whiners won't accept it, the name of this sub was accurate for like three days at the start. thankfully.
Due-Memory-6957@reddit
From the start we discussed cloud models too, because they're also a point of comparison.
MerePotato@reddit
It does rank pretty poorly on privacy via the official API compared to Claude, but that's not so much of a concern with third-party providers and local deployments.
Automatic-Arm8153@reddit
What do you mean it ranks poorly on privacy compared to Claude? Did something happen?
MerePotato@reddit
They operate under direct obligations to the Chinese state, whereas Anthropic has famously had an enormous falling out with the American state
Top-Rub-4670@reddit
And whilst that was distracting you in the news, they've made billions of dollars' worth of contracts with other government departments...
Automatic-Arm8153@reddit
Wow.. that’s your reasoning?
Every business is under the obligations of the state in which it resides. Don’t be foolish, man.
Eyelbee@reddit
It's not really possible to run locally
-dysangel-@reddit
habachilles@reddit
Is this q4 or q2
-dysangel-@reddit
IQ2_XXS UD
habachilles@reddit
Do you know how big a q4 is?
-dysangel-@reddit
Q4s are between 360-465 GB
habachilles@reddit
Amazing. One 512 could do it.
Negative-Web8619@reddit
How useful is that at 2 bit?
-dysangel-@reddit
Very useful for chat. Always produces the best outputs when I compare against other things. For example I asked a bunch of models to make a Mario-like platformer, and the level of detail in GLM 5.1's attempt was a cut above the rest.
https://i.redd.it/h1hhrvbdpfug1.gif
Obviously too slow on my Mac for general agentic/pair programming type work (around 18tps and drops as context grows), but if all cloud services disappeared tomorrow, it's what I'd be running when I need the best quality I can get.
ticoneva@reddit
I manage my organization's GPU cluster and we run GLM-5.1. It's not your typical homelab, but it's local, and I am sure I am not alone.
Negative-Web8619@reddit
Why?
Eyelbee@reddit
My bad, for some reason I forgot these were MoE models. It would nonetheless be slow and would still require 10k+ USD, but it's runnable.
miniocz@reddit
You can run it from SSD and it would run. Well, "run": 1 t/s, but it would work.
thawizard@reddit
More like walk.
kaggleqrdl@reddit
crawl
miniocz@reddit
GLM 5.1 is real. For me it could be the only LLM I need and replace all the cloud ones. If only it could run at more than 1-1.5 t/s on my hardware. As Q3...
sleepy_roger@reddit
Give Q2 a try, it's surprisingly good! I'd never tried any Q2 models but figured I might as well since I could get 15 t/s on it. It definitely thinks A LOT, but it's pretty damn good. In a pinch I could use it.
Worried_Drama151@reddit
Fake as CodeC model, can’t pass the real benchmark
a_beautiful_rhind@reddit
To be fair, GLM got a lot of opus and gemini in her :P
zeke780@reddit
Praying in 1 year we see this sort of perf in something I can run.
ThePaSch@reddit
Easily, probably. Gemma 4 is a 31B model that smokes 4o in benchmarks. 4o was rumored to be around a trillion params.
mindwip@reddit
Or better hardware at better prices!
Come on 512gb lpddrx6 wide bandwidth systems!
zeke780@reddit
I agree but that seems like a pipedream.
If gemma4 can run on your average macbook and gets you model performance from last year, maybe we'll have an Opus-level model on my laptop one day.
mindwip@reddit
2027 strix halo leaks suggest lpddr6x and wider bandwidth. So not a pipedream. Just have a longer wait. I just hope amd does not cap memory at 128gb or even 256gb, hoping 3xx or 5xx something.
SnooPaintings8639@reddit
1/3 of the opus cost is still helluva lot of $$$
I'll stick with MiniMax m2.7, which I am surprised to see has lower score than Qwen3.5 27b on your graph.
dalhaze@reddit
It’s more than 80% cheaper than Opus. Not sure what this person is talking about. Output tokens are less than $5 where Opus is $25.
zylskysniper@reddit (OP)
GLM uses about 2x the tokens per task compared to Opus (on the same task) based on our benchmark. That's why the final cost per task is closer to 1/3 of Opus rather than 1/5.
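For a rough sanity check (using the ~$25/Mtok vs ~$5/Mtok output pricing mentioned in this thread and the per-task token counts further down; real per-run costs also depend on the input/output mix and cache hits):

```python
# Back-of-the-envelope only: GLM's per-token price is roughly 1/5 of Opus, but
# it burns ~2x the tokens per task, so the per-task cost ratio lands near 1/3.
opus_price_per_mtok = 25.0   # ~$25 / Mtok, Opus output pricing from this thread
glm_price_per_mtok = 5.0     # ~$5 / Mtok, GLM output pricing from this thread

opus_mtok_per_task = 0.66    # from the tokens-per-task stats below
glm_mtok_per_task = 1.2      # roughly 2x what Opus uses

ratio = (glm_price_per_mtok * glm_mtok_per_task) / (opus_price_per_mtok * opus_mtok_per_task)
print(ratio)  # ~0.36, i.e. closer to 1/3 than to 1/5
```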
DeepOrangeSky@reddit
Yea, I was going to say, given how extremely expensive Opus is, even 1/3rd the cost is still pretty expensive. Even in the fairly recent past, weren't the top Chinese open LLM models more like 1/20th or even 1/100th the cost of the top U.S. frontier models (with a somewhat similar strength disparity to this, or maybe ever so slightly wider)? 1/3rd is nowhere near as good of a gap as some of the price gaps I saw even just a few months ago, I think.
Presumably some of that is because it is so recent and the buzz around it is so high, so the price gap will probably widen in the coming weeks/months.
AppealSame4367@reddit
I'm not surprised. I'm on the $10 plan and it's really not that great anymore. According to benchmarks I should get Opus 4.5 or at least Sonnet 4.5 level power from it, and my tests in OpenRouter were good. But with their plan it misses a lot in planning _and_ in coding mode, even if a smarter model did the planning. It felt better when it was new, so I guess they had to quantize it a lot.
Beginning_Bed_9059@reddit
Yeah that is kind of shocking
zylskysniper@reddit (OP)
I'm surprised too tbh. I bet Qwen 3.6 will dominate that price range once the 27b version is open-sourced or the current version supports prompt caching
Shingikai@reddit
Arena-style evals catch things static benchmarks miss, but they have their own failure modes.
LLM judges tend to favor outputs that match their stylistic patterns, responses that are longer, and answers that hedge in ways that read as "thoughtful." In agentic contexts specifically, that last tendency is dangerous. "Sounds confident and complete" and "actually finished the task correctly" can come apart, and a judge model that conflates the two will systematically reward the wrong thing.
The 21-battle sample is the obvious concern (already flagged in this thread). There's a subtler one: what's the task distribution? If user-submitted tasks skew toward OpenClaw-style workflows, you're measuring "good at what OpenClaw users care about," which may or may not match your use case. Domain-specific evals are actually the right design when your goal is a specific workflow. But then "beats Opus at 1/3 the cost" is a bigger claim than the current data fully supports.
zylskysniper@reddit (OP)
Thanks for the insightful feedback!
> LLM judges tend to favor outputs that match their stylistic patterns, responses that are longer, and answers that hedge in ways that read as "thoughtful." In agentic contexts specifically, that last tendency is dangerous. "Sounds confident and complete" and "actually finished the task correctly" can come apart, and a judge model that conflates the two will systematically reward the wrong thing.
Agreed. I tried to mitigate that in a few ways:
- we have a set of judge models, and all self-judging is excluded from the ranking calculation
- the tasks I bootstrapped (which are most of the current tasks) are mostly about tool calls + producing artifacts that are mostly verifiable. For example, producing a pptx that includes certain content and formatting, or using the browser to get some data and put it in a spreadsheet. The judge evaluates whether the agent got things done, mainly completeness and quality, not what the model claims it did.
- in the task description I explicitly ask agents to produce output as artifacts
Those won't fix the problem completely, but it's much better than comparing two text outputs and deciding which is better.
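A minimal sketch of what the self-judge filtering looks like (names and data layout are illustrative, not the actual arena code):

```python
# Hypothetical sketch: drop any battle where the judge model is also a
# contestant, then compute win rates from whatever verdicts remain.
from collections import defaultdict

battles = [
    {"model_a": "glm-5.1", "model_b": "opus-4.6", "judge": "qwen-3.6", "winner": "a"},
    {"model_a": "glm-5.1", "model_b": "opus-4.6", "judge": "opus-4.6", "winner": "b"},  # self-judge, excluded
]

wins, games = defaultdict(int), defaultdict(int)
for b in battles:
    if b["judge"] in (b["model_a"], b["model_b"]):
        continue  # exclude self-judging verdicts from the ranking calculation
    games[b["model_a"]] += 1
    games[b["model_b"]] += 1
    if b["winner"] == "a":
        wins[b["model_a"]] += 1
    elif b["winner"] == "b":
        wins[b["model_b"]] += 1

win_rate = {m: wins[m] / games[m] for m in games}
print(win_rate)
```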
> The 21-battle sample is the obvious concern (already flagged in this thread). There's a subtler one: what's the task distribution? If user-submitted tasks skew toward OpenClaw-style workflows, you're measuring "good at what OpenClaw users care about," which may or may not match your use case. Domain-specific evals are actually the right design when your goal is a specific workflow. But then "beats Opus at 1/3 the cost" is a bigger claim than the current data fully supports.
Task distribution is actually part of the benchmark settings. In every benchmark, the ranking is tied to a certain task distribution; change the task distribution and you change the ranking. One of the biggest advantages of Chatbot Arena, in my opinion, is that the task distribution is determined by user submissions rather than by the arena team, so the ranking is closer to the actual performance a typical arena user will get. That's the ultimate goal I'm trying to achieve: make our benchmark's task distribution closer to what users actually do in their general-purpose agent (OpenClaw for now, which is why I call it OpenClaw Arena).
Currently I don't have enough user-submitted tasks, so I bootstrapped the benchmark this way: I crawled what users are doing with OpenClaw from public posts on Twitter, Reddit, etc., picked out the ones that are relatively objective, verifiable, and need tool calls, and generated similar tasks. So for now my task distribution is close to what OpenClaw users care about. I'm hoping to get more user-submitted tasks from now on to better match the actual user task distribution.
ThePixelHunter@reddit
It'll be more like 1/10th the cost once more providers are hosting it. Give it a couple weeks.
dalhaze@reddit
Actually though? Quantized a lot more though right?
Objective-Picture-72@reddit
This is exactly why the Apple M3 Ultra 512GB sold out instantly. Once everyone saw that there is a pathway to current SOTA model capability run locally, it was a no-brainer for people who could afford it. For many, spending $40K on a MacStudio cluster is worth it to have Opus 4.5 or Sonnet 4.6 level of intelligence that they control and can use 24/7 for just the cost of electricity. Imagine the brute forcing loops those things are being run on right now.
dalhaze@reddit
You’re talking about the unsloth version? Isn’t that a pretty extreme quantization to get it to run on 512GB?
And it would be less than 5 tokens a second wouldn’t it?
DistanceSolar1449@reddit
It’s 744b. Any regular Q4 or even Q5 would fit in 512GB
dalhaze@reddit
What type of T/s do you think you'd see with that?
socialjusticeinme@reddit
With the M3, the prompt processing is going to be such bullshit that it’s not remotely worth it. I also imagine it would drop to single-digit t/s at moderate context lengths.
Now the upcoming M5 ultra Mac Studio, that may be worth it
-dysangel-@reddit
It's great as a smart local chat model. Obviously not the best for agentic use.
Dave_from_the_navy@reddit
Typically the calculation for MoE models at 100% efficiency is memory bandwidth / active weights size in vram, with the Apple software stack sitting at around 60-65% efficiency. With the M3 Ultra sitting at 819 GB/s and the active weights sitting at 20-22GB on a quant like that, that should give us somewhere in the range of 18-25 t/s-ish. I could be wrong, but I'd imagine the real answer isn't too far off.
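Plugging those numbers into the rule of thumb (just the estimate, not a measurement):

```python
# Decode-speed rule of thumb for MoE: bandwidth / active-weight bytes per token,
# scaled by an empirical efficiency factor for Apple Silicon.
bandwidth_gb_s = 819            # M3 Ultra memory bandwidth
active_weights_gb = 21          # ~40B active params at ~Q4 (20-22 GB range)
efficiency = (0.60, 0.65)       # rough efficiency range quoted above

theoretical_tps = bandwidth_gb_s / active_weights_gb        # ~39 t/s ceiling
realistic_tps = [round(theoretical_tps * e, 1) for e in efficiency]
print(theoretical_tps, realistic_tps)                       # roughly low-to-mid 20s t/s
```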
DistanceSolar1449@reddit
819GB/sec memory bandwidth
40B active params, so ~20GB active
So 40tokens/sec theoretical max. Probably 20 tokens/sec IRL
-dysangel-@reddit
Existing-Wallaby-444@reddit
It's less than a fifth for output tokens!
zylskysniper@reddit (OP)
GLM tends to call more tools and use more tokens than Opus (around 2x tool calls and 2x tokens) given the same task. That's how we get ~1/3 instead of 1/5
Existing-Wallaby-444@reddit
Is this always the case or can this be improved by optimizing the environment for GLM?
zylskysniper@reddit (OP)
I think it's an intentional design, part of their effort to get better results at agentic tasks.
And interestingly, it seems to be a common choice for many recent models. Based on our bench, the top models ranked by tokens per task are:
qwen 3.6 plus (1.5M per task)
glm 5.1 (1.2M per task)
step 3.5 flash (1.2M per task)
deepseek 3.2 (0.86M per task)
minimax m2.7 (0.71M per task)
opus 4.6 (0.66M per task)
DeepOrangeSky@reddit
How are KimiK2.5 and Qwen3.5 397b for this (if you tested)?
Also, how does this work? Like, does it only count if it actually successfully completes the task, or is this just the tokens used per task regardless of whether it partially or fully succeeds or totally fails?
I assume it is the latter, or else it would mean Minimax is way stronger at coding than GLM 5.1, right? Or am I not understanding how the metric works?
zylskysniper@reddit (OP)
Kimi is the yellow dot on the graph, not performing well.
Cost-wise, it uses about 0.2M tokens per run. The issue is that Kimi doesn't put enough effort into tool calls. It only calls ~5.7 tools per task, while GLM 5.1 calls 28.9 tools per task. That's part of the reason why Kimi gets a low performance score.
Never tested qwen 3.5 397b though.
> Also, how does this work. Like does it only count it if it actually successfully completes the task, or is this just the tokens used per task regardless of whether if partially or fully succeeds or totally fails?
I only excluded runs that failed due to provider (openrouter) error or runtime (openclaw) error. Others are considered successful runs and are included in evaluation and cost/token calculation.
In general, tokens per task is positively correlated with performance, but it's no guarantee. You can check the full stats here https://app.uniclaw.ai/arena/model-stats?via=reddit . Score measures how strong a model is (quality of the output), and Avg Cost means how much you pay on average per run. The other stats are more like auxiliary metrics that help you understand why a model performs well or poorly, or costs more or less.
PromptInjection_@reddit
It is one of the best coding models out there. However, for creative writing I still prefer Sonnet or Opus.
victoryposition@reddit
I can confirm this is the smartest local coding model... and the **only** reason it's not perfect is that qwen 3.5 397b runs twice as fast, is multimodal, uses half the vram, and works great with fp8 kv cache.
SSOMGDSJD@reddit
Only 21 battles, and spread bars big enough to encompass the entire top 7. It also shits out 3x more tokens than Opus. Interesting results and a well-done site though, looking forward to more data being collected.
Rorqualx@reddit
I tried to get glm5.1 to execute a prompt that Claude has no issues getting set up and running successfully, and it made so many bad assumptions that it was frustrating to use, having to correct its behavior and not getting any worthwhile results.
Leafytreedev@reddit
Am I reading your graph right? Qwen 3.5 27b costs more to run than 230b minimax m2.7? Why is that?
zylskysniper@reddit (OP)
Yes, you read it right, and that's because qwen has no prompt caching on openrouter, so all reads cost $0.195 per Mtoken, while minimax costs $0.06 per Mtoken on a cache hit.
For Gemma, it's because it doesn't try as hard as other models and thus only uses a fraction of the tokens. On average Gemma only calls 5 tools per task, while GLM 5.1 calls 29 tools per task... so the per-task cost is very low.
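A toy illustration of why that matters for agent loops (prices as above; the cumulative input volume is made up, but agents re-send the growing context every turn, so input reads dominate):

```python
# Toy numbers: without prompt caching, every re-read input token is billed at
# the full rate, so long agent loops get expensive fast.
qwen_read_per_mtok = 0.195      # no caching on openrouter: full price per read
minimax_cached_per_mtok = 0.06  # cache-hit price for already-seen prefix tokens

cumulative_input_mtok = 10      # hypothetical: ~10M input tokens re-read over a run

print(cumulative_input_mtok * qwen_read_per_mtok)       # ~$1.95
print(cumulative_input_mtok * minimax_cached_per_mtok)  # ~$0.60
```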
atape_1@reddit
For any company that needs fully air-gapped development, this is absolutely incredible.
monjodav@reddit
how did you use it? coding plan or api key through openrouter?
zylskysniper@reddit (OP)
for the benchmark I used an api key through openrouter (easier implementation, so it works for any model), but for personal use, the coding plan for sure