We ranked 22 AI models with 550+ real OpenClaw battles — GLM-5.1 debuts at #2, full Pareto cost analysis inside
Posted by zylskysniper@reddit | LocalLLaMA | View on Reddit | 1 comment
I've been building an arena (Chatbot Arena style) for comparing AI models on real, user-submitted agentic tasks, not chat and not static coding benchmarks. The goal is to measure how models perform in a real harness, on real tasks.
Each model runs as an actual OpenClaw subagent in a fresh VM with a terminal, a browser, a file system, code execution, and so on. A judge agent (user's choice of Claude Opus 4.6, GPT-5.4, or Gemini 3.1 Pro) evaluates the results by reading and running the code, browsing deployed apps, taking screenshots, etc.
We just passed 550 battles across 22 models. Here are the current performance rankings:
Performance Leaderboard
| Rank | Model | Score | Battles |
|---|---|---|---|
| 1 | Claude Opus 4.6 | 1739 | 108 |
| 2 | GLM-5.1 | 1700 | 21 |
| 3 | Claude Sonnet 4.6 | 1681 | 104 |
| 4 | GPT-5.4 | 1663 | 138 |
| 5 | Qwen 3.6 Plus | 1537 | 27 |
| 6 | GPT-5.3 Codex | 1477 | 183 |
| 7 | Claude Haiku 4.5 | 1408 | 169 |
| 8 | Qwen 3.5 27B | 1395 | 81 |
| 9 | Xiaomi MiMo v2 Pro | 1385 | 112 |
| 10 | GLM-5 Turbo | 1362 | 93 |
| 11 | MiniMax M2.7 | 1309 | 161 |
| 12 | StepFun 3.5 Flash | 1280 | 156 |
| 13 | DeepSeek V3.2 | 1246 | 126 |
| 14 | Gemini 3 Flash | 1235 | 131 |
| 15 | Gemini 3.1 Pro | 1196 | 92 |
| 16 | Grok 4.1 Fast | 1189 | 160 |
| 17 | Kimi K2.5 | 974 | 95 |
| 18 | Nemotron 3 Super 120B | 804 | 98 |
Cost vs Performance (Pareto Analysis)
Beyond the performance ranking, we plot each model's performance score against its actual average cost and draw the Pareto frontier — the set of models where you can't get better performance without paying more. This gives us a "budget ladder" — the best model at each price point:
| Budget Range | Recommended Model | Score | Avg Cost |
|---|---|---|---|
| $0.03 - $0.04 | Grok 4.1 Fast | 1189 ±84 | $0.03 |
| $0.04 - $0.14 | StepFun 3.5 Flash | 1280 ±93 | $0.04 |
| $0.14 - $0.19 | MiniMax M2.7 | 1309 ±97 | $0.14 |
| $0.19 - $0.24 | Qwen 3.5 27B | 1395 ±100 | $0.19 |
| $0.24 - $0.37 | GPT-5.3 Codex | 1477 ±92 | $0.24 |
| $0.37 - $1.19 | GLM-5.1 | 1700 ±129 | $0.37 |
| $1.19+ | Claude Opus 4.6 | 1739 ±126 | $1.19 |
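The budget ladder above falls out of a standard Pareto-frontier computation: sort models by average cost and keep only those that beat every cheaper model's score. Here's a minimal sketch using the numbers from the tables (Gemini 3.1 Pro is included as an example of a dominated model; this is an illustration, not the arena's actual code):

```python
# (avg cost in USD per run, performance score) per model, from the tables above.
models = {
    "Grok 4.1 Fast":     (0.03, 1189),
    "StepFun 3.5 Flash": (0.04, 1280),
    "MiniMax M2.7":      (0.14, 1309),
    "Qwen 3.5 27B":      (0.19, 1395),
    "GPT-5.3 Codex":     (0.24, 1477),
    "GLM-5.1":           (0.37, 1700),
    "Claude Opus 4.6":   (1.19, 1739),
    "Gemini 3.1 Pro":    (0.32, 1196),  # dominated: pricier AND weaker than several models
}

def pareto_frontier(models):
    # Walk models cheapest-first; a model stays on the frontier only
    # if its score beats every cheaper model's score.
    frontier, best_score = [], float("-inf")
    for name, (cost, score) in sorted(models.items(), key=lambda kv: kv[1][0]):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier

print(pareto_frontier(models))  # Gemini 3.1 Pro drops out; the other 7 remain
```

Running this reproduces the seven-model budget ladder in the table: each frontier model is the cheapest way to reach its score level.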
One to watch: Qwen 3.6 Plus doesn't support prompt caching yet (at least on OpenRouter), so its $0.37/run is inflated. When caching lands, I'd expect its cost to drop to roughly MiniMax's level (~$0.14), which would push MiniMax M2.7, Qwen 3.5 27B, and GPT-5.3 Codex off the frontier entirely.
Some Interesting Findings
- GLM-5.1 debuts at #2 with a perfect record. Much better than I expected.
- Opus is still #1, but at a steep cost: $1.19/run on average, the most expensive model by far. GPT-5.4 at $0.40 is close in performance and much cheaper.
- Gemini 3.1 Pro is weak at agentic tasks. It ranks #15 at $0.32/run, behind multiple models that cost a fraction of the price. We actually had to tune the judge message for it because it sometimes just reads the skill and decides to do nothing.
- StepFun 3.5 Flash is underrated for cost effectiveness: Grok 4.1 Fast-level cost with MiniMax M2.7-level performance.
Methodology
We only use the relative ordering of models within each battle — not the raw scores. Absolute scores from LLM judges are noisy and poorly calibrated (a "7/10" in one battle might be "6/10" in another), but "A ranked above B" is much more consistent. Same principle behind Chatbot Arena's pairwise preference approach.
Rankings use a grouped Plackett-Luce model, not simple win-rate or Bradley-Terry. Battles where the judge model is also evaluated are excluded from the official board.
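To make the idea concrete, here's a minimal Plackett-Luce fit by gradient ascent. This is an illustrative sketch with made-up battle data, not the arena's grouped implementation: each battle is a full best-first ranking, and the model learns one log-strength per participant from orderings alone, never from raw judge scores.

```python
import math

# Toy battle data: each battle is a ranking, best model first.
battles = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C"],
]

def fit_plackett_luce(battles, steps=2000, lr=0.05):
    names = sorted({m for b in battles for m in b})
    theta = {m: 0.0 for m in names}  # log-strengths
    for _ in range(steps):
        grad = {m: 0.0 for m in names}
        for ranking in battles:
            # Plackett-Luce likelihood: at each position, the winner is
            # drawn from the remaining models with softmax(theta) odds.
            for i in range(len(ranking) - 1):
                rest = ranking[i:]
                z = sum(math.exp(theta[m]) for m in rest)
                grad[ranking[i]] += 1.0
                for m in rest:
                    grad[m] -= math.exp(theta[m]) / z
        for m in names:
            theta[m] += lr * grad[m]
        # Anchor the mean at zero: strengths are only identified up to a shift.
        mean = sum(theta.values()) / len(theta)
        for m in names:
            theta[m] -= mean
    return theta

strengths = fit_plackett_luce(battles)
print(sorted(strengths, key=strengths.get, reverse=True))  # A > B > C
```

With a pairwise model like Bradley-Terry you'd have to shred each multi-model battle into independent pairs; Plackett-Luce consumes the whole ordering at once, which is why it suits battles with 2-5 participants.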
Full methodology with equations and comparison vs Arena.ai: https://app.uniclaw.ai/arena/leaderboard/methodology?via=reddit
How Battles Work
- You submit any task + pick 2-5 models
- A judge agent spawns one subagent per model on a fresh VM
- Each model solves the task independently with full tool access (terminal, browser, files, code)
- The judge evaluates by running code, browsing results, taking screenshots, etc.
- Full conversation history, workspace files, and judge reasoning are preserved
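The flow above can be sketched in a few lines. Everything here is a stand-in invented for illustration (the real harness spawns OpenClaw subagents in fresh VMs, which isn't reproduced); the point is the shape of the protocol: one isolated attempt per model, then a single judge ordering.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    model: str
    workspace: str   # files the model produced in its own VM
    transcript: str  # full conversation history, preserved for review

def run_agent(model: str, task: str) -> Attempt:
    # Stand-in: pretend the model solved the task with full tool access.
    return Attempt(model, f"/vm/{model}/workspace", f"{model} attempted: {task}")

def judge_rank(task: str, attempts: list[Attempt]) -> list[str]:
    # Stand-in judge (alphabetical). The real judge runs the code and
    # browses results; only its best-first ordering feeds the rankings.
    return [a.model for a in sorted(attempts, key=lambda a: a.model)]

def run_battle(task: str, models: list[str]) -> list[str]:
    attempts = [run_agent(m, task) for m in models]  # one fresh VM each
    return judge_rank(task, attempts)

print(run_battle("build a todo app", ["B", "A"]))  # → ['A', 'B']
```

Each `Attempt` keeps its workspace and transcript, matching the last step above: everything the judge saw stays inspectable after the battle.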
Try It
Live leaderboard (no account needed): https://app.uniclaw.ai/arena?via=reddit
Interactive Pareto cost-performance plot: https://app.uniclaw.ai/arena/visualize?via=reddit
Submit your own benchmarks (public ones are on us): https://app.uniclaw.ai/arena/new?via=reddit
The judge skill is open-source: https://github.com/unifai-network/skills/tree/main/agent-bench
Note on the data: We bootstrapped the first 500+ battles by crawling what people are doing with OpenClaw (on X, Reddit, etc.) and generating battles with similar tasks + randomly selected models. Going forward, anyone can submit their own tasks.
What tasks would you want to see benchmarked? Happy to run specific comparisons.
CriticallyCarmelized@reddit
Great work! Thanks for posting these. Kind of a miss to exclude Gemma 4 at this point though.