AgentBench v0.2.9

Posted by Grand-Entertainer589@reddit | LocalLLaMA | 3 comments

AgentBench is built for the part of AI agents that actually matters once the demo ends.

Most benchmarks still reward one-shot success. AgentBench goes after the harder stuff: long-session reliability, state drift, MCP and tool workflows, cross-run regressions, and leaderboard trust. It doesn’t just ask “can an agent solve one task?” It asks “does it stay reliable over time, under pressure, across runs, and in public?”
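To make the cross-run regression idea concrete, here is a minimal, purely hypothetical sketch (none of these names come from AgentBench itself): compare an agent's per-task success rates across two benchmark runs and flag a regression when the mean drops past a threshold.

```python
import statistics

def flag_regression(prev_scores, curr_scores, threshold=0.05):
    """Flag a cross-run regression: True if mean task success
    dropped by more than `threshold` between two runs.

    prev_scores / curr_scores: per-task success rates in [0, 1].
    This is an illustrative sketch, not AgentBench's actual logic.
    """
    prev_mean = statistics.mean(prev_scores)
    curr_mean = statistics.mean(curr_scores)
    return (prev_mean - curr_mean) > threshold

# Example: success rates slipped on several tasks between runs.
print(flag_regression([1.0, 1.0, 0.9, 1.0], [0.8, 0.7, 0.9, 0.8]))  # True
print(flag_regression([0.9, 0.9, 0.9, 0.9], [0.9, 0.88, 0.9, 0.9]))  # False
```

A real harness would track per-task (not just aggregate) deltas and account for run-to-run variance, but the core check is this simple comparison.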

It also has a live leaderboard with separate Verified and Community lanes, so readers can tell independently verified results from self-reported ones instead of treating every score as equally trustworthy.

If you’re building or testing agents, you need benchmarks that move closer to production reality. That’s what this is aiming for.