We ranked 22 AI models with 550+ real OpenClaw battles — GLM-5.1 debuts at #2, full Pareto cost analysis inside

Posted by zylskysniper@reddit | LocalLLaMA | View on Reddit | 1 comments

I've been working on an arena (Chatbot Arena style) for comparing AI models on user-submitted, real agentic tasks — not chat, not static coding benchmarks. I want to benchmark how models perform in a real harness, on real tasks.

Each model runs as an actual OpenClaw subagent in a fresh VM with a terminal, browser, file system, code execution, and more. A judge agent (user's choice of Claude Opus 4.6, GPT-5.4, or Gemini 3.1 Pro) evaluates by reading and running the code, browsing deployed apps, taking screenshots, and so on.

We just passed 550 battles across 22 models. Here are the current performance rankings:

Performance Leaderboard

| Rank | Model | Score | Battles |
|------|-------|-------|---------|
| 1 | Claude Opus 4.6 | 1739 | 108 |
| 2 | GLM-5.1 | 1700 | 21 |
| 3 | Claude Sonnet 4.6 | 1681 | 104 |
| 4 | GPT-5.4 | 1663 | 138 |
| 5 | Qwen 3.6 Plus | 1537 | 27 |
| 6 | GPT-5.3 Codex | 1477 | 183 |
| 7 | Claude Haiku 4.5 | 1408 | 169 |
| 8 | Qwen 3.5 27B | 1395 | 81 |
| 9 | Xiaomi MiMo v2 Pro | 1385 | 112 |
| 10 | GLM-5 Turbo | 1362 | 93 |
| 11 | MiniMax M2.7 | 1309 | 161 |
| 12 | StepFun 3.5 Flash | 1280 | 156 |
| 13 | DeepSeek V3.2 | 1246 | 126 |
| 14 | Gemini 3 Flash | 1235 | 131 |
| 15 | Gemini 3.1 Pro | 1196 | 92 |
| 16 | Grok 4.1 Fast | 1189 | 160 |
| 17 | Kimi K2.5 | 974 | 95 |
| 18 | Nemotron 3 Super 120B | 804 | 98 |

Cost vs Performance (Pareto Analysis)

Beyond the performance ranking, we plot each model's performance score against its actual average cost and draw the Pareto frontier — the set of models where you can't get better performance without paying more. This gives us a "budget ladder" — the best model at each price point:

| Budget Range | Recommended Model | Score | Avg Cost |
|--------------|-------------------|-------|----------|
| $0.03 - $0.04 | Grok 4.1 Fast | 1189 ±84 | $0.03 |
| $0.04 - $0.14 | StepFun 3.5 Flash | 1280 ±93 | $0.04 |
| $0.14 - $0.19 | MiniMax M2.7 | 1309 ±97 | $0.14 |
| $0.19 - $0.24 | Qwen 3.5 27B | 1395 ±100 | $0.19 |
| $0.24 - $0.37 | GPT-5.3 Codex | 1477 ±92 | $0.24 |
| $0.37 - $1.19 | GLM-5.1 | 1700 ±129 | $0.37 |
| $1.19+ | Claude Opus 4.6 | 1739 ±126 | $1.19 |
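For anyone who wants to reproduce the budget ladder: it's just the Pareto frontier of (avg cost, score) points. Here's a minimal sketch (my own reconstruction, not the arena's code) using the costs and scores from the tables above, with a couple of dominated models thrown in to show them being filtered out:

```python
def pareto_frontier(models):
    """models: iterable of (name, cost, score) tuples.
    Returns the frontier sorted by cost: models for which no other
    model is cheaper (or equal cost) with a strictly higher score."""
    frontier, best_score = [], float("-inf")
    # Sort by cost ascending; at equal cost, higher score first.
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:  # strictly beats everything cheaper
            frontier.append((name, cost, score))
            best_score = score
    return frontier

models = [
    ("Grok 4.1 Fast",     0.03, 1189),
    ("StepFun 3.5 Flash", 0.04, 1280),
    ("MiniMax M2.7",      0.14, 1309),
    ("Qwen 3.5 27B",      0.19, 1395),
    ("GPT-5.3 Codex",     0.24, 1477),
    ("Gemini 3.1 Pro",    0.32, 1196),  # dominated: costs more than Codex, scores lower
    ("GLM-5.1",           0.37, 1700),
    ("GPT-5.4",           0.40, 1663),  # dominated by GLM-5.1
    ("Claude Opus 4.6",   1.19, 1739),
]

for name, cost, score in pareto_frontier(models):
    print(f"${cost:.2f}  {score}  {name}")
```

Running this reproduces the seven-model ladder above; Gemini 3.1 Pro and GPT-5.4 drop out because a cheaper model beats each of them.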

One to watch: Qwen 3.6 Plus doesn't support prompt caching yet (at least on OpenRouter), so its $0.37/run is inflated. When caching lands, I'd expect its cost to drop to roughly MiniMax level (~$0.14), which would push MiniMax M2.7, Qwen 3.5 27B, and GPT-5.3 Codex off the frontier entirely.

Some Interesting Findings

  1. GLM-5.1 debuts at #2 with a perfect record. Much better than I expected.

  2. Opus is still #1, but at a steep cost: $1.19/run on average makes it the most expensive model by far. GPT-5.4 at $0.40 is close in performance and much cheaper.

  3. Gemini 3.1 Pro is bad at agentic tasks. Ranks #15 at $0.32/run — behind multiple models that cost a fraction of the price. We actually had to optimize the judge message for it because it sometimes just reads the skill and decides to do nothing.

  4. StepFun 3.5 Flash is underrated for cost effectiveness: Grok 4.1 Fast-level cost with MiniMax M2.7-level performance.

Methodology

We only use the relative ordering of models within each battle — not the raw scores. Absolute scores from LLM judges are noisy and poorly calibrated (a "7/10" in one battle might be "6/10" in another), but "A ranked above B" is much more consistent. Same principle behind Chatbot Arena's pairwise preference approach.

Rankings use a grouped Plackett-Luce model, not simple win-rate or Bradley-Terry. Battles where the judge model is also evaluated are excluded from the official board.
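For readers unfamiliar with Plackett-Luce: it models each battle's judge ranking as a sequence of choices, where at each stage the probability of picking a model is proportional to its latent "worth". Below is a minimal sketch of fitting plain (ungrouped) Plackett-Luce worths with the classic MM update; the arena's grouped variant and its Elo-style score scaling will differ, so treat this as an illustration of the principle, not their implementation:

```python
from collections import defaultdict

def fit_plackett_luce(rankings, n_iter=200, tol=1e-9):
    """Fit Plackett-Luce worths by MM (minorize-maximize) updates.
    rankings: list of lists, each ordered best-to-worst, e.g. ['A','B','C']."""
    items = sorted({i for r in rankings for i in r})
    w = {i: 1.0 for i in items}
    # wins[i] = number of stages at which item i was the chosen (top remaining) item
    wins = defaultdict(int)
    for r in rankings:
        for x in r[:-1]:
            wins[x] += 1
    for _ in range(n_iter):
        denom = defaultdict(float)
        for r in rankings:
            tail = sum(w[x] for x in r)  # total worth of remaining contenders
            for j, x in enumerate(r[:-1]):
                inv = 1.0 / tail
                for y in r[j:]:          # every remaining item shares this stage's denominator
                    denom[y] += inv
                tail -= w[x]             # winner of this stage leaves the pool
        new_w = {i: (wins[i] / denom[i] if denom[i] > 0 else w[i]) for i in items}
        s = sum(new_w.values())
        new_w = {i: v / s for i, v in new_w.items()}  # normalize to sum to 1
        if max(abs(new_w[i] - w[i]) for i in items) < tol:
            w = new_w
            break
        w = new_w
    return w

# Toy example: A always wins; B beats C in 5 of 8 battles.
worths = fit_plackett_luce([['A', 'B', 'C']] * 5 + [['A', 'C', 'B']] * 3)
```

Note that only the orderings enter the fit, never the judge's raw scores, which is exactly the robustness property described above.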

Full methodology with equations and comparison vs Arena.ai: https://app.uniclaw.ai/arena/leaderboard/methodology?via=reddit

Try It

Live leaderboard (no account needed): https://app.uniclaw.ai/arena?via=reddit

Interactive Pareto cost-performance plot: https://app.uniclaw.ai/arena/visualize?via=reddit

Submit your own benchmarks (public ones are on us): https://app.uniclaw.ai/arena/new?via=reddit

The judge skill is open-source: https://github.com/unifai-network/skills/tree/main/agent-bench

Note on the data: We bootstrapped the first 500+ battles by crawling what people are doing with OpenClaw (on X, Reddit, etc.) and generating battles with similar tasks + randomly selected models. Going forward, anyone can submit their own tasks.

What tasks would you want to see benchmarked? Happy to run specific comparisons.