GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost

Posted by zylskysniper@reddit | LocalLLaMA | 93 comments

I wanted to know whether GLM 5.1 is just another benchmark-optimized model or actually useful in agents like OpenClaw, so I tested it in our agentic benchmark.

Based on my tests, it reaches Opus 4.6-level performance at roughly 1/3 of the cost (\~$0.4 per run vs. \~$1.2 per run) and outperforms every other model tested, pushing the cost-effectiveness frontier quite a bit.

I don't fully trust static benchmarks; I've seen too many models optimized for them, ranking high on the leaderboards but not working well in real agentic tasks. So we use OpenClaw to test models' agentic performance in a real environment on real, user-submitted tasks: Chatbot Arena/LMArena-style battles with an LLM as judge.
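For anyone unfamiliar with the arena approach: two models attempt the same task, a judge model picks a winner, and results are aggregated into ratings. A minimal sketch of the standard Elo update used for this kind of pairwise aggregation (illustrative only, not the actual arena code; the judge call is omitted):

```python
def elo_update(ra, rb, score_a, k=32):
    """Standard Elo update after one battle.

    ra, rb:   current ratings of models A and B
    score_a:  1.0 if A wins, 0.5 for a tie, 0.0 if B wins
    k:        update step size
    """
    # Expected score of A given the rating gap (logistic curve, base 10, scale 400)
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    delta = k * (score_a - expected_a)
    return ra + delta, rb - delta

# Two equally rated models; A wins the battle -> A gains k/2 = 16 points.
ra, rb = elo_update(1000.0, 1000.0, 1.0)
print(ra, rb)  # 1016.0 984.0
```

Repeating this over many user-submitted tasks is what turns noisy single judgments into a leaderboard.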

Based on the result, I would say GLM 5.1 is one of the top models for OpenClaw type of agents now.

Qwen 3.6 also did a good job, but it doesn't support prompt caching yet (on OpenRouter), so its current price is inflated. With prompt caching I expect it to reach minimax m2.7-level cost per run and become another great choice for cost-effectiveness.

Full leaderboard, cost-effectiveness analysis, and methodology can be found at https://app.uniclaw.ai/arena?via=reddit . I strongly recommend submitting your own task and seeing how the different models perform on it.