AgentBench v0.2.9
Posted by Grand-Entertainer589@reddit | LocalLLaMA | View on Reddit | 3 comments
AgentBench is built for the part of AI agent work that actually matters once the demo ends.
Most benchmarks still reward one-shot success. AgentBench goes after the harder stuff: long-session reliability, state drift, MCP and tool workflows, cross-run regressions, and leaderboard trust. It doesn’t just ask “can an agent solve one task?” It asks “does it stay reliable over time, under pressure, across runs, and in public?”
It also has a live leaderboard with separate Verified and Community lanes, so people can actually tell what they’re looking at instead of treating every score like it carries the same weight.
If you’re building or testing agents, you need benchmarks that move closer to production reality. That’s what this is aiming for.
fragment_me@reddit
You know you can just edit your post?
ttkciar@reddit
These bots deliberately avoid putting the full link into their slop-posts to avoid detection by some moderation automation.
Grand-Entertainer589@reddit (OP)
AgentBench: https://github.com/OmnionixAI/AgentBench