GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost
Posted by zylskysniper@reddit | LocalLLaMA | View on Reddit | 93 comments

I wanted to know whether GLM is another benchmark-optimized model or actually useful in agents like OpenClaw, so I tested GLM 5.1 in our agentic benchmark.
Turns out it reaches Opus 4.6-level performance at just 1/3 of the cost (~$0.4 per run vs ~$1.2 per run) based on my tests. It outperforms all the other models tested and pushes the cost-effectiveness frontier quite a bit.
I don't quite trust static benchmarks; I've seen many models optimized for them, ranking high on those leaderboards but not working well in real agentic tasks. So we use OpenClaw to test the agentic performance of models in a real environment with real tasks (user submitted): Chatbot Arena/LMArena-style battles, LLM as judge.
Based on the result, I would say GLM 5.1 is one of the top models for OpenClaw type of agents now.
Qwen 3.6 also did a good job, but it does not support prompt caching yet (on openrouter) so the current price is inflated. With prompt caching I expect it to reach minimax m2.7-level cost per run and become another great choice for cost-effectiveness.
Full leaderboard, cost-effectiveness analysis, and methodology can be found at https://app.uniclaw.ai/arena?via=reddit . Strongly recommend submitting your own tasks and seeing how different models perform on them.
LittleYouth4954@reddit
Glm 5.1 is all I need for my use cases and works wonderfully via the z.ai coding plan. Qwen 3.6 is next in line and performs really well. It is currently free in qwen cli.
Kappalonia@reddit
Uuuh, the tokens/s are very slow for me. Are you using some config different from the setup on the claude code documentation page?
LittleYouth4954@reddit
Mine is relatively slow too, but nothing crazy. I use the opencode app, not the cli, with no mcp servers.
angelarose210@reddit
Which zai plan do you have?
LittleYouth4954@reddit
I have a legacy lite plan with the 5 h limit only and a pro plan with both 5 h and weekly limits. The lite plan I use for the autoclaw tool only. I have been coding hard this last week and reached 80% of the pro plan before my quota was renewed.
inthesearchof@reddit
GLM 5.1 seems like the current holy grail for those that are running the largest local llm setups.
zylskysniper@reddit (OP)
even with api/coding plan it's still pretty affordable compared to claude
ttkciar@reddit
Yes, but this is LocalLLaMA.
Due-Memory-6957@reddit
Interesting choice in what you made bold. Let's just accept that the subreddit has outgrown its "running Llama locally" origins.
ttkciar@reddit
That's not "outgrowing" anything; it's off-topic content polluting the river.
LocalLLaMA is for locally hosted LLM technology, whether that is for inference or training.
Recoil42@reddit
Local Llama is for Llama, the Large Language model by Meta, Inc.
If you don't like it, gtfo.
llama-impersonator@reddit
the local part is far more important than the llama part
Recoil42@reddit
Oh so you're saying we've outgrown the llama part
How interesting
llama-impersonator@reddit
if meta released llama 5, we would happily talk about it, but they got wanged.
Recoil42@reddit
So you're saying we've outgrown the llama part
How fascinating
llama-impersonator@reddit
it's not like we didn't talk about falcon when this place was new. as it is not particularly difficult to understand my point without further elaboration, i'll leave it at that
thrownawaymane@reddit
we've outgrown your attitude, get a new one
Recoil42@reddit
Ah, I see we've arrived at the ad hominems already.
knightgod1177@reddit
That’s not an ad hominem, he’s just pointing out your attitude
Recoil42@reddit
knightgod1177@reddit
Idk if you really wanna go down this path, considering you’re using logical fallacies yourself: “Again: So you’re saying we’ve outgrown the llama part”. Strawman argument, in its most classic form I might add. You really wanna throw stones when you’re in a glass house?
Recoil42@reddit
Go ahead, explain the strawman argument.
knightgod1177@reddit
You put that argument in his mouth without him ever saying that. He made a clear and distinct argument about the local aspect, and you unilaterally repositioned his argument as “we’ve outgrown the llama part”. Not once did he ever say that nor angle that argument. Factually, Meta dropped the bag big time, leaving everyone to resort to other models. It might be colloquially true that we’ve outgrown the llama part, but that dude wasn’t saying that. You are.
Recoil42@reddit
I'm representing the argument as it literally fucking is, champ. That's the argument! The Llama part is no longer important! We've outgrown it! That's the whole fucking argument!
knightgod1177@reddit
You’re resorting almost exclusively to strawman arguments. That typically happens when someone is so emotionally invested in an argument that they lose all reason, and try to win on emotion alone. Your outbursts are evidence of this. I asked you if you wanted to go down this path because you’re using logical fallacies to prop up your arguments, while simultaneously blaming him for using logical fallacies. It’s a case of the pot calling the kettle black. In response you used more strawman arguments to try and reshape and reframe my argument. It doesn’t work, primarily because you used literally no other legit argument tactics to support your position. All you seem to be able to do is lash out emotionally. It’s the weakest position to try and win an argument from, as evidenced by the fact you’re the only one getting angry.
ttkciar@reddit
"llama" has become a general term for a model architecture. Many models even have "llama" as their GGUF architecture value, like K2-V2-Instruct (by LLM360, who have nothing whatsoever to do with Meta):
Nowadays it's understood to mean transformer model architectures compatible with or derived from the original llama.
As a rule of thumb, if it's compatible with llama.cpp, or if support for it is expected to come to llama.cpp, it's okay on this sub.
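For illustration, a quick sketch of checking that metadata key with the `gguf` Python package from the llama.cpp repo (the file path is hypothetical and field-access details can vary between package versions):

```python
# Hypothetical sketch: read the "general.architecture" key from a GGUF file.
# Requires the `gguf` package from the llama.cpp repo (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("some-model.gguf")  # hypothetical path
field = reader.fields["general.architecture"]

# String fields store their bytes in `parts`, indexed via `data`; decode them
# to get the architecture name (often just "llama", even for non-Meta models).
arch = bytes(field.parts[field.data[0]]).decode("utf-8")
print(arch)
```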
Recoil42@reddit
Because we've outgrown the scope of LLaMA, the Large Language Model from Meta AI.
How am I being forced to explain this twice? Jfc.
Because we've outgrown the scope of LLaMA, the Large Language Model from Meta AI.
Three times now.
Firm-Fix-5946@reddit
and ages ago too. as much as some whiners won't accept it, the name of this sub was accurate for like three days at the start. thankfully.
Due-Memory-6957@reddit
From the start we discussed cloud models too, because they're also a point of comparison.
MerePotato@reddit
It does rank pretty poorly on privacy via the official API compared to Claude, but that's not so much of a concern with third-party providers and local deployments.
Automatic-Arm8153@reddit
What do you mean it ranks poorly on privacy compared to Claude? Did something happen?
MerePotato@reddit
They operate under direct obligations to the Chinese state, whereas Anthropic has famously had an enormous falling out with the American state
Top-Rub-4670@reddit
And whilst that was distracting you in the news, they've made billions of dollars' worth of contracts with other government departments...
Automatic-Arm8153@reddit
Wow.. that’s your reasoning?
Every business is under the obligations of the state in which it resides. Don’t be foolish, man.
Eyelbee@reddit
It's not really possible to run locally
-dysangel-@reddit
habachilles@reddit
Is this q4 or q2
-dysangel-@reddit
IQ2_XXS UD
habachilles@reddit
Do you know how big a q4 is?
-dysangel-@reddit
Q4s are between 360-465 GB
habachilles@reddit
Amazing. One 512 could do it.
Negative-Web8619@reddit
How useful is that at 2 bit?
-dysangel-@reddit
Very useful for chat. Always produces the best outputs when I compare against other things. For example I asked a bunch of models to make a Mario-like platformer, and the level of detail in GLM 5.1's attempt was a cut above the rest.
https://i.redd.it/h1hhrvbdpfug1.gif
Obviously too slow on my Mac for general agentic/pair programming type work (around 18tps and drops as context grows), but if all cloud services disappeared tomorrow, it's what I'd be running when I need the best quality I can get.
ticoneva@reddit
I manage my organization's GPU cluster and we run GLM-5.1. It's not your typical homelab, but it's local, and I am sure I am not alone.
Negative-Web8619@reddit
Why?
Eyelbee@reddit
My bad, for some reason I forgot these were MoE models. It would nonetheless be slow and would still require 10k+ USD, but it's runnable.
miniocz@reddit
You can run it from SSD and it would run. Well, "run": 1 t/s, but it would work.
thawizard@reddit
More like walk.
kaggleqrdl@reddit
crawl
miniocz@reddit
GLM 5.1 is real. For me it could be the only LLM I need and replace all the cloud ones. If only it could run at more than 1-1.5 t/s on my hardware. As Q3...
sleepy_roger@reddit
Give Q2 a try, it's surprisingly good! I'd never tried any Q2 models but figured I might as well since I could get 15 t/s on it. It definitely thinks A LOT, but it's pretty damn good. In a pinch I could use it.
Worried_Drama151@reddit
Fake as CodeC model, can’t pass the real benchmark
a_beautiful_rhind@reddit
To be fair, GLM got a lot of opus and gemini in her :P
zeke780@reddit
Praying in 1 year we see this sort of perf in something I can run.
ThePaSch@reddit
Easily, probably. Gemma 4 is a 31B model that smokes 4o in benchmarks. 4o was rumored to be around a trillion params.
mindwip@reddit
Or better hardware at better prices!
Come on 512gb lpddrx6 wide bandwidth systems!
zeke780@reddit
I agree but that seems like a pipedream.
If gemma4 can run on your average macbook and gets you model performance from last year, maybe we'll have an Opus-level model on my laptop one day.
mindwip@reddit
2027 strix halo leaks suggest lpddr6x and wider bandwidth. So not a pipedream. Just have a longer wait. I just hope amd does not cap memory at 128gb or even 256gb, hoping 3xx or 5xx something.
SnooPaintings8639@reddit
1/3 of the opus cost is still helluva lot of $$$
I'll stick with MiniMax m2.7, which I am surprised to see has lower score than Qwen3.5 27b on your graph.
dalhaze@reddit
It’s more than 80% cheaper than Opus. Not sure what this person is talking about. Output tokens are less than $5 where Opus is $25.
zylskysniper@reddit (OP)
GLM uses about 2x the tokens per task compared to Opus (on the same task) based on our benchmark. That's why the final cost per task is closer to 1/3 of Opus rather than 1/5.
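For a rough sanity check (using the ~$25/Mtok vs ~$5/Mtok output pricing mentioned in this thread and the per-task token counts further down; real per-run costs also depend on the input/output mix and cache hits):

```python
# Back-of-the-envelope only: GLM's per-token price is roughly 1/5 of Opus, but
# it burns ~2x the tokens per task, so the per-task cost ratio lands near 1/3.
opus_price_per_mtok = 25.0   # ~$25 / Mtok, Opus output pricing from this thread
glm_price_per_mtok = 5.0     # ~$5 / Mtok, GLM output pricing from this thread

opus_mtok_per_task = 0.66    # from the tokens-per-task stats below
glm_mtok_per_task = 1.2      # roughly 2x what Opus uses

ratio = (glm_price_per_mtok * glm_mtok_per_task) / (opus_price_per_mtok * opus_mtok_per_task)
print(ratio)  # ~0.36, i.e. closer to 1/3 than to 1/5
```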
DeepOrangeSky@reddit
Yea, I was going to say, given how extremely expensive Opus is, even 1/3rd the cost is still pretty expensive. Even in the fairly recent past, weren't the top Chinese open LLM models more like 1/20th or even 1/100th the cost of the top U.S. frontier models (with a somewhat similar strength disparity to this, or maybe ever so slightly wider)? 1/3rd is nowhere near as good of a gap as some of the price gaps I saw even just a few months ago, I think.
Presumably some of that is because it is so recent and the buzz around it is so high, so the price gap will probably widen in the coming weeks/months.
AppealSame4367@reddit
I'm not surprised. I'm on the $10 plan and it's really not that great anymore. According to benchmarks I should get Opus 4.5 or at least Sonnet 4.5 level power from it, and my tests in OpenRouter were good. But with their plan it misses a lot in planning _and_ in coding mode, even if a smarter model did the planning. It felt better when it was new, so I guess they had to quantize it a lot.
Beginning_Bed_9059@reddit
Yeah that is kind of shocking
zylskysniper@reddit (OP)
I'm surprised too tbh. I bet Qwen 3.6 will dominate that price range once the 27b version is open-sourced or the current version supports prompt caching
Shingikai@reddit
Arena-style evals catch things static benchmarks miss, but they have their own failure modes.
LLM judges tend to favor outputs that match their stylistic patterns, responses that are longer, and answers that hedge in ways that read as "thoughtful." In agentic contexts specifically, that last tendency is dangerous. "Sounds confident and complete" and "actually finished the task correctly" can come apart, and a judge model that conflates the two will systematically reward the wrong thing.
The 21-battle sample is the obvious concern (already flagged in this thread). There's a subtler one: what's the task distribution? If user-submitted tasks skew toward OpenClaw-style workflows, you're measuring "good at what OpenClaw users care about," which may or may not match your use case. Domain-specific evals are actually the right design when your goal is a specific workflow. But then "beats Opus at 1/3 the cost" is a bigger claim than the current data fully supports.
zylskysniper@reddit (OP)
Thanks for the insightful feedback!
> LLM judges tend to favor outputs that match their stylistic patterns, responses that are longer, and answers that hedge in ways that read as "thoughtful." In agentic contexts specifically, that last tendency is dangerous. "Sounds confident and complete" and "actually finished the task correctly" can come apart, and a judge model that conflates the two will systematically reward the wrong thing.
Agreed. I tried to mitigate that in a few ways:
- we have a set of judge models, and all self-judging is excluded from the ranking calculation
- the tasks I bootstrapped (which are most of the current tasks) are mostly about tool calls + producing artifacts that are mostly verifiable. For example, producing a pptx that includes certain content and formatting, or using the browser to get some data and put it in a spreadsheet. The judge evaluates whether the agent got things done, mainly completeness and quality, not what the model claims it did.
- in the task description I explicitly ask agents to produce output as artifacts
Those won't fix the problem completely, but it's much better than comparing two text outputs and deciding which is better.
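A minimal sketch of what the self-judge filtering looks like (names and data layout are illustrative, not the actual arena code):

```python
# Hypothetical sketch: drop any battle where the judge model is also a
# contestant, then compute win rates from whatever verdicts remain.
from collections import defaultdict

battles = [
    {"model_a": "glm-5.1", "model_b": "opus-4.6", "judge": "qwen-3.6", "winner": "a"},
    {"model_a": "glm-5.1", "model_b": "opus-4.6", "judge": "opus-4.6", "winner": "b"},  # self-judge, excluded
]

wins, games = defaultdict(int), defaultdict(int)
for b in battles:
    if b["judge"] in (b["model_a"], b["model_b"]):
        continue  # exclude self-judging verdicts from the ranking calculation
    games[b["model_a"]] += 1
    games[b["model_b"]] += 1
    if b["winner"] == "a":
        wins[b["model_a"]] += 1
    elif b["winner"] == "b":
        wins[b["model_b"]] += 1

win_rate = {m: wins[m] / games[m] for m in games}
print(win_rate)
```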
> The 21-battle sample is the obvious concern (already flagged in this thread). There's a subtler one: what's the task distribution? If user-submitted tasks skew toward OpenClaw-style workflows, you're measuring "good at what OpenClaw users care about," which may or may not match your use case. Domain-specific evals are actually the right design when your goal is a specific workflow. But then "beats Opus at 1/3 the cost" is a bigger claim than the current data fully supports.
Task distribution is actually part of the benchmark settings. In every benchmark, the ranking is tied to a certain task distribution; change the task distribution and you change the ranking. One of the biggest advantages of Chatbot Arena, in my opinion, is that the task distribution is determined by user submissions rather than by the arena team, so the ranking is closer to the actual performance a typical arena user will get. That's the ultimate goal I'm trying to achieve: make our benchmark's task distribution closer to what users actually do in their general-purpose agent (OpenClaw for now, which is why I call it OpenClaw Arena).
Currently I don't have enough user-submitted tasks, so I bootstrapped the benchmark this way: I crawled what users are doing with OpenClaw from public posts on Twitter, Reddit, etc., picked out the ones that are relatively objective, verifiable, and need tool calls, and generated similar tasks. So for now my task distribution is close to what OpenClaw users care about. I'm hoping to get more user-submitted tasks from now on to better match the actual user task distribution.
ThePixelHunter@reddit
It'll be more like 1/10th the cost once more providers are hosting it. Give it a couple weeks.
dalhaze@reddit
Actually though? Quantized a lot more though right?
Objective-Picture-72@reddit
This is exactly why the Apple M3 Ultra 512GB sold out instantly. Once everyone saw that there is a pathway to current SOTA model capability run locally, it was a no-brainer for people who could afford it. For many, spending $40K on a MacStudio cluster is worth it to have Opus 4.5 or Sonnet 4.6 level of intelligence that they control and can use 24/7 for just the cost of electricity. Imagine the brute forcing loops those things are being run on right now.
dalhaze@reddit
You’re talking about the unsloth version? Isn’t that a pretty extreme quantization to get it to run on 512GB?
And it would be less than 5 tokens a second wouldn’t it?
DistanceSolar1449@reddit
It’s 744b. Any regular Q4 or even Q5 would fit in 512GB
dalhaze@reddit
What type of T/s do you think you'd see with that?
socialjusticeinme@reddit
With the M3, the prompt processing is going to be such bullshit that it’s not remotely worth it. I also imagine it would drop to single-digit t/s at moderate context lengths.
Now the upcoming M5 ultra Mac Studio, that may be worth it
-dysangel-@reddit
It's great as a smart local chat model. Obviously not the best for agentic use.
Dave_from_the_navy@reddit
Typically the calculation for MoE models at 100% efficiency is memory bandwidth / active weights size in vram, with the Apple software stack sitting at around 60-65% efficiency. With the M3 Ultra sitting at 819 GB/s and the active weights sitting at 20-22GB on a quant like that, that should give us somewhere in the range of 18-25 t/s-ish. I could be wrong, but I'd imagine the real answer isn't too far off.
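Plugging those numbers into the rule of thumb (just the estimate, not a measurement):

```python
# Decode-speed rule of thumb for MoE: bandwidth / active-weight bytes per token,
# scaled by an empirical efficiency factor for Apple Silicon.
bandwidth_gb_s = 819            # M3 Ultra memory bandwidth
active_weights_gb = 21          # ~40B active params at ~Q4 (20-22 GB range)
efficiency = (0.60, 0.65)       # rough efficiency range quoted above

theoretical_tps = bandwidth_gb_s / active_weights_gb        # ~39 t/s ceiling
realistic_tps = [round(theoretical_tps * e, 1) for e in efficiency]
print(theoretical_tps, realistic_tps)                       # roughly low-to-mid 20s t/s
```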
DistanceSolar1449@reddit
819GB/sec memory bandwidth
40B active params, so ~20GB active
So 40tokens/sec theoretical max. Probably 20 tokens/sec IRL
-dysangel-@reddit
Existing-Wallaby-444@reddit
It's less than a fifth for output tokens!
zylskysniper@reddit (OP)
GLM tends to call more tools and use more tokens than Opus (around 2x tool calls and 2x tokens) given the same task. That's how we get ~1/3 instead of 1/5
Existing-Wallaby-444@reddit
Is this always the case or can this be improved by optimizing the environment for GLM?
zylskysniper@reddit (OP)
I think it's an intentional design, part of their effort to get better results at agentic tasks.
And interestingly, it seems to be a common choice for many recent models. Based on our bench, the top models ranked by tokens per task are:
qwen 3.6 plus (1.5M per task)
glm 5.1 (1.2M per task)
step 3.5 flash (1.2M per task)
deepseek 3.2 (0.86M per task)
minimax m2.7 (0.71M per task)
opus 4.6 (0.66M per task)
DeepOrangeSky@reddit
How are KimiK2.5 and Qwen3.5 397b for this (if you tested)?
Also, how does this work? Like, does it only count if it actually successfully completes the task, or is this just the tokens used per task regardless of whether it partially or fully succeeds or totally fails?
I assume it is the latter, or else it would mean Minimax is way stronger at coding than GLM 5.1, right? Or am I not understanding how the metric works?
zylskysniper@reddit (OP)
Kimi is the yellow dot on the graph, not performing well.
Cost-wise, it uses about 0.2M tokens per run. The issue is that Kimi doesn't put enough effort into tool calls. It only calls ~5.7 tools per task, while GLM 5.1 calls 28.9 tools per task. That's part of the reason why Kimi gets a low performance score.
Never tested qwen 3.5 397b though.
> Also, how does this work. Like does it only count it if it actually successfully completes the task, or is this just the tokens used per task regardless of whether if partially or fully succeeds or totally fails?
I only excluded runs that failed due to provider (openrouter) error or runtime (openclaw) error. Others are considered successful runs and are included in evaluation and cost/token calculation.
In general, tokens per task is positively correlated with performance, but it's no guarantee. You can check the full stats here https://app.uniclaw.ai/arena/model-stats?via=reddit . Score measures how strong a model is (quality of the output), and Avg Cost means how much you pay on average per run. The other stats are more like auxiliary metrics that help you understand why a model performs well or poorly, or costs more or less.
PromptInjection_@reddit
It is one of the best coding models out there. However, for creative writing I still prefer Sonnet or Opus.
victoryposition@reddit
I can confirm this is the smartest local coding model... and the **only** reason it's not perfect is that qwen 3.5 397b runs twice as fast, is multimodal, uses half the vram, and works great with fp8 kv cache.
SSOMGDSJD@reddit
Only 21 battles, and spread bars big enough to encompass the entire top 7. It also shits out 3x more tokens than Opus. Interesting results and a well-done site though, looking forward to more data being collected.
Rorqualx@reddit
I tried to get glm5.1 to execute a prompt that Claude has no issues getting set up and running successfully, and it made so many bad assumptions that it was frustrating to use, having to correct its behavior and not getting any worthwhile results.
Leafytreedev@reddit
Am I reading your graph right? Qwen 3.5 27b costs more to run than 230b minimax m2.7? Why is that?
zylskysniper@reddit (OP)
Yes, you read it right, and that's because qwen has no prompt caching on openrouter, so all reads cost $0.195 per Mtoken, while minimax costs $0.06 per Mtoken on a cache hit.
For Gemma, it's because it doesn't try as hard as other models and thus only uses a fraction of the tokens. On average Gemma only calls 5 tools per task, while GLM 5.1 calls 29 tools per task... so the per-task cost is very low.
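A toy illustration of why that matters for agent loops (prices as above; the cumulative input volume is made up, but agents re-send the growing context every turn, so input reads dominate):

```python
# Toy numbers: without prompt caching, every re-read input token is billed at
# the full rate, so long agent loops get expensive fast.
qwen_read_per_mtok = 0.195      # no caching on openrouter: full price per read
minimax_cached_per_mtok = 0.06  # cache-hit price for already-seen prefix tokens

cumulative_input_mtok = 10      # hypothetical: ~10M input tokens re-read over a run

print(cumulative_input_mtok * qwen_read_per_mtok)       # ~$1.95
print(cumulative_input_mtok * minimax_cached_per_mtok)  # ~$0.60
```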
atape_1@reddit
For any company that needs fully air-gapped development, this is absolutely incredible.
monjodav@reddit
how did you use it? coding plan or api key through openrouter?
zylskysniper@reddit (OP)
for the benchmark I used an api key through openrouter (easier implementation, so it works for any model), but for personal use, the coding plan for sure