We ranked 22 AI models with 550+ real OpenClaw battles — GLM-5.1 debuts at #2, full Pareto cost analysis inside

Posted by zylskysniper@reddit | LocalLLaMA | View on Reddit | 1 comments

I've been working on an arena (Chatbot Arena style) for comparing AI models on user-submitted, real agentic tasks — not chat, not static coding benchmarks. I want to benchmark how models perform in a real harness, on real tasks.

Each model runs as an actual OpenClaw subagent in a fresh VM with a terminal, browser, file system, code execution, and more. A judge agent (user's choice of Claude Opus 4.6, GPT-5.4, or Gemini 3.1 Pro) evaluates by reading and running the code, browsing deployed apps, taking screenshots, and so on.

We just passed 550 battles across 22 models. Here are the current performance rankings:

Performance Leaderboard

| Rank | Model | Score | Battles |
|------|-------|-------|---------|
| 1 | Claude Opus 4.6 | 1739 | 108 |
| 2 | GLM-5.1 | 1700 | 21 |
| 3 | Claude Sonnet 4.6 | 1681 | 104 |
| 4 | GPT-5.4 | 1663 | 138 |
| 5 | Qwen 3.6 Plus | 1537 | 27 |
| 6 | GPT-5.3 Codex | 1477 | 183 |
| 7 | Claude Haiku 4.5 | 1408 | 169 |
| 8 | Qwen 3.5 27B | 1395 | 81 |
| 9 | Xiaomi MiMo v2 Pro | 1385 | 112 |
| 10 | GLM-5 Turbo | 1362 | 93 |
| 11 | MiniMax M2.7 | 1309 | 161 |
| 12 | StepFun 3.5 Flash | 1280 | 156 |
| 13 | DeepSeek V3.2 | 1246 | 126 |
| 14 | Gemini 3 Flash | 1235 | 131 |
| 15 | Gemini 3.1 Pro | 1196 | 92 |
| 16 | Grok 4.1 Fast | 1189 | 160 |
| 17 | Kimi K2.5 | 974 | 95 |
| 18 | Nemotron 3 Super 120B | 804 | 98 |

Cost vs Performance (Pareto Analysis)

Beyond the performance ranking, we plot each model's performance score against its actual average cost and draw the Pareto frontier — the set of models where you can't get better performance without paying more. This gives us a "budget ladder" — the best model at each price point:

| Budget Range | Recommended Model | Score | Avg Cost |
|--------------|-------------------|-------|----------|
| $0.03 - $0.04 | Grok 4.1 Fast | 1189 ±84 | $0.03 |
| $0.04 - $0.14 | StepFun 3.5 Flash | 1280 ±93 | $0.04 |
| $0.14 - $0.19 | MiniMax M2.7 | 1309 ±97 | $0.14 |
| $0.19 - $0.24 | Qwen 3.5 27B | 1395 ±100 | $0.19 |
| $0.24 - $0.37 | GPT-5.3 Codex | 1477 ±92 | $0.24 |
| $0.37 - $1.19 | GLM-5.1 | 1700 ±129 | $0.37 |
| $1.19+ | Claude Opus 4.6 | 1739 ±126 | $1.19 |
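For anyone who wants to reproduce the budget ladder: it's just the Pareto frontier of (avg cost, score) points. Here's a minimal sketch (my own reconstruction, not the arena's code) using the costs and scores from the tables above, with a couple of dominated models thrown in to show them being filtered out:

```python
def pareto_frontier(models):
    """models: iterable of (name, cost, score) tuples.
    Returns the frontier sorted by cost: models for which no other
    model is cheaper (or equal cost) with a strictly higher score."""
    frontier, best_score = [], float("-inf")
    # Sort by cost ascending; at equal cost, higher score first.
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:  # strictly beats everything cheaper
            frontier.append((name, cost, score))
            best_score = score
    return frontier

models = [
    ("Grok 4.1 Fast",     0.03, 1189),
    ("StepFun 3.5 Flash", 0.04, 1280),
    ("MiniMax M2.7",      0.14, 1309),
    ("Qwen 3.5 27B",      0.19, 1395),
    ("GPT-5.3 Codex",     0.24, 1477),
    ("Gemini 3.1 Pro",    0.32, 1196),  # dominated: costs more than Codex, scores lower
    ("GLM-5.1",           0.37, 1700),
    ("GPT-5.4",           0.40, 1663),  # dominated by GLM-5.1
    ("Claude Opus 4.6",   1.19, 1739),
]

for name, cost, score in pareto_frontier(models):
    print(f"${cost:.2f}  {score}  {name}")
```

Running this reproduces the seven-model ladder above; Gemini 3.1 Pro and GPT-5.4 drop out because a cheaper model beats each of them.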

One to watch: Qwen 3.6 Plus doesn't support prompt caching yet (at least on OpenRouter), so its $0.37/run is inflated. When caching lands, I'd expect its cost to drop to roughly MiniMax level (~$0.14), which would push MiniMax M2.7, Qwen 3.5 27B, and GPT-5.3 Codex off the frontier entirely.

Some Interesting Findings

  1. GLM-5.1 debuts at #2 with a perfect record. Much better than I expected.

  2. Opus is still #1, but at a steep cost: $1.19/run on average makes it the most expensive model by far. GPT-5.4 at $0.40 is close in performance and much cheaper.

  3. Gemini 3.1 Pro is bad at agentic tasks. Ranks #15 at $0.32/run — behind multiple models that cost a fraction of the price. We actually had to optimize the judge message for it because it sometimes just reads the skill and decides to do nothing.

  4. StepFun 3.5 Flash is underrated for cost effectiveness: Grok 4.1 Fast-level cost with MiniMax M2.7-level performance.

Methodology

We only use the relative ordering of models within each battle — not the raw scores. Absolute scores from LLM judges are noisy and poorly calibrated (a "7/10" in one battle might be "6/10" in another), but "A ranked above B" is much more consistent. Same principle behind Chatbot Arena's pairwise preference approach.

Rankings use a grouped Plackett-Luce model, not simple win-rate or Bradley-Terry. Battles where the judge model is also evaluated are excluded from the official board.
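For readers unfamiliar with Plackett-Luce: it models each battle's judge ranking as a sequence of choices, where at each stage the probability of picking a model is proportional to its latent "worth". Below is a minimal sketch of fitting plain (ungrouped) Plackett-Luce worths with the classic MM update; the arena's grouped variant and its Elo-style score scaling will differ, so treat this as an illustration of the principle, not their implementation:

```python
from collections import defaultdict

def fit_plackett_luce(rankings, n_iter=200, tol=1e-9):
    """Fit Plackett-Luce worths by MM (minorize-maximize) updates.
    rankings: list of lists, each ordered best-to-worst, e.g. ['A','B','C']."""
    items = sorted({i for r in rankings for i in r})
    w = {i: 1.0 for i in items}
    # wins[i] = number of stages at which item i was the chosen (top remaining) item
    wins = defaultdict(int)
    for r in rankings:
        for x in r[:-1]:
            wins[x] += 1
    for _ in range(n_iter):
        denom = defaultdict(float)
        for r in rankings:
            tail = sum(w[x] for x in r)  # total worth of remaining contenders
            for j, x in enumerate(r[:-1]):
                inv = 1.0 / tail
                for y in r[j:]:          # every remaining item shares this stage's denominator
                    denom[y] += inv
                tail -= w[x]             # winner of this stage leaves the pool
        new_w = {i: (wins[i] / denom[i] if denom[i] > 0 else w[i]) for i in items}
        s = sum(new_w.values())
        new_w = {i: v / s for i, v in new_w.items()}  # normalize to sum to 1
        if max(abs(new_w[i] - w[i]) for i in items) < tol:
            w = new_w
            break
        w = new_w
    return w

# Toy example: A always wins; B beats C in 5 of 8 battles.
worths = fit_plackett_luce([['A', 'B', 'C']] * 5 + [['A', 'C', 'B']] * 3)
```

Note that only the orderings enter the fit, never the judge's raw scores, which is exactly the robustness property described above.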

Full methodology with equations and comparison vs Arena.ai: https://app.uniclaw.ai/arena/leaderboard/methodology?via=reddit

Try It

Live leaderboard (no account needed): https://app.uniclaw.ai/arena?via=reddit

Interactive Pareto cost-performance plot: https://app.uniclaw.ai/arena/visualize?via=reddit

Submit your own benchmarks (public ones are on us): https://app.uniclaw.ai/arena/new?via=reddit

The judge skill is open-source: https://github.com/unifai-network/skills/tree/main/agent-bench

Note on the data: We bootstrapped the first 500+ battles by crawling what people are doing with OpenClaw (on X, Reddit, etc.) and generating battles with similar tasks + randomly selected models. Going forward, anyone can submit their own tasks.

What tasks would you want to see benchmarked? Happy to run specific comparisons.