Do you benchmark local models as agents, or only on single prompts?

Posted by sahanpk@reddit | LocalLLaMA | 13 comments

Curious how people test tool use locally. A model can look fine in chat and still fall apart once state, retries, and bad tool results show up.

ExtremeAdventurous63@reddit

If you find a standard way to do that, I would really appreciate you sharing it. So far, the best thing that has worked for me is a custom script that runs a coding agent on a couple of fixed prompts and then uses an AI judge to score the results, but the scores are still too variable, so nothing worth sharing yet.

sahanpk@reddit (OP)

same problem here. fixed prompts are easy; stable grading is the mess. i’d trust pass/fail task traces more than one judge score.
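
A minimal sketch of what comparing models by pass/fail traces could look like, assuming each run has already been reduced to a (model, task, passed) record. The model and task names are made up for illustration:

```python
from collections import defaultdict

def pass_rates(results):
    """Turn (model, task, passed) records into a {model: pass rate} dict."""
    totals = defaultdict(lambda: [0, 0])  # model -> [passes, runs]
    for model, _task, passed in results:
        totals[model][0] += int(passed)
        totals[model][1] += 1
    return {model: passes / runs for model, (passes, runs) in totals.items()}

# Hypothetical records; in practice these would come from real agent runs.
runs = [
    ("model-a", "edit-config", True),
    ("model-a", "fix-failing-test", False),
    ("model-b", "edit-config", True),
    ("model-b", "fix-failing-test", True),
]
print(pass_rates(runs))  # {'model-a': 0.5, 'model-b': 1.0}
```

Repeating each task several times per model and looking at the rate, rather than a single run, is one way to see how much of the variance comes from the model itself.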

ExtremeAdventurous63@reddit

Yeah, the hard part is finding a way to make complex agentic tasks algorithmically scorable without relying on another LLM as a judge.

sahanpk@reddit (OP)

same. closest i've found is deterministic receipts: file changed, test passed, page state changed, no judge needed.
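
A rough sketch of the "file changed, test passed" receipts in Python, using only the standard library. The function names and the shape of the returned dict are assumptions for illustration, not a standard harness:

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Optional

def file_digest(path: Path) -> Optional[str]:
    """SHA-256 of the file contents, or None if the file does not exist."""
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else None

def file_changed(path: Path, before: Optional[str]) -> bool:
    """Receipt 1: the file differs from the snapshot taken before the run."""
    return file_digest(path) != before

def command_passed(cmd: list, cwd) -> bool:
    """Receipt 2: the check command exits with status 0."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def receipts(workdir: Path, target: Path, before: Optional[str], test_cmd: list) -> dict:
    """Collect both receipts after the agent has finished its run."""
    return {
        "file_changed": file_changed(target, before),
        "tests_passed": command_passed(test_cmd, cwd=workdir),
    }
```

The idea would be to call `file_digest(target)` before the agent starts, let it work, then call `receipts(...)` with whatever test command the task uses (for example `[sys.executable, "-m", "pytest", "-q"]`, assuming pytest is installed). A run counts as a pass only if every receipt is true.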

Bulky-Priority6824@reddit

I don't know what I'm doing half the time. Sometimes, when I try to test quality with one model vs another, I get Sonnet to poop out a coding prompt template (like "design and implement a database using xyz, blah blah blah"), have the local model poop out the code, then use Opus to judge that code and compare it against the other models' poop.

sahanpk@reddit (OP)

honestly that's not far off. ugly prompts + a separate judge still beats pretending one clean benchmark says much.

DinoAmino@reddit

There is an "industry standard" benchmark for this: the Berkeley Function Calling Leaderboard (BFCL). It covers single-turn and multi-turn calls and has a hallucination measurement too.

Repo here https://github.com/ShishirPatil/gorilla

Or pip install bfcl-eval==2025.12.17

cleversmoke@reddit

I benchmark with llama.cpp:full-cuda13 for a quick check on PP and TG (prompt processing and text generation speed), and then I run it against one of my use cases as an agent, such as "read and execute this .md file that contains the instructions". I gauge how well it follows the instructions, its output quality, and the total time it took.

sahanpk@reddit (OP)

that md-file test is the kind of thing normal benches miss. following instructions under tool friction tells you way more than a clean prompt.

cleversmoke@reddit

Agreed! I'm more interested in how models perform for my actual use cases. For example, prior to MTP, I had my dual agents research and recommend on 5 stock tickers per batch, and a batch took 39 minutes to complete. With MTP, the same 5 tickers took 23 minutes, with near-identical output. I was blown away!
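
For reference, the arithmetic on those two timings works out as follows. The helper name is made up for illustration; the 39 and 23 minute figures are the ones quoted above:

```python
def speedup(before_min, after_min):
    """Return (speedup factor, fraction of wall time saved)."""
    return before_min / after_min, (before_min - after_min) / before_min

factor, saved = speedup(39, 23)
print(f"{factor:.2f}x faster, {saved:.0%} less wall time")
# prints: 1.70x faster, 41% less wall time
```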

Parzival_3110@reddit

I would test agents on traces, not just final text. A local model can look fine on a clean prompt and then fall over when the page changes, a click does nothing, auth appears, or a tool returns partial state.

For browser agents specifically, I like tasks with visible receipts: owned tab, DOM read, action, observed page change, retry if nothing changed, stop if captcha or login risk appears. That gives you pass/fail runs you can compare across models instead of asking a judge to grade vibes.
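
The read, act, observe, retry, stop loop could be sketched like this. It is a generic outline, not any particular browser tool's API: `read_state`, `act`, and `is_blocked` are hypothetical callables that a real harness would have to supply:

```python
def act_with_receipt(read_state, act, is_blocked, max_retries=2):
    """Run one action and return 'pass', 'fail', or 'stopped'.

    read_state() returns a comparable snapshot of the page (e.g. DOM text),
    act() performs the action, and is_blocked(state) flags captcha/login pages.
    """
    for _ in range(max_retries + 1):
        before = read_state()
        if is_blocked(before):
            return "stopped"   # never act on a captcha or login wall
        act()
        after = read_state()
        if is_blocked(after):
            return "stopped"   # the action led to a captcha or login wall
        if after != before:
            return "pass"      # receipt: the page visibly changed
    return "fail"              # nothing changed after all retries

# Toy page where the first click does nothing and the second one works.
page = {"text": "start", "clicks": 0}

def click():
    page["clicks"] += 1
    if page["clicks"] >= 2:
        page["text"] = "done"

result = act_with_receipt(lambda: page["text"], click, lambda s: "captcha" in s)
print(result, page["clicks"])  # pass 2
```

"Page changed" is a weak receipt on its own. A stricter version would compare the new state against what the task expects rather than against the old state.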

Bias disclosed: I am building FSB around that style of real Chrome control for Claude and Codex: https://full-selfbrowsing.com/agents

edsonmedina@reddit

I usually do both: a few AI riddles as a sanity check (and for an initial sense of generation speed), then I run it in an agent and try building some basic apps.

sahanpk@reddit (OP)

yeah that’s probably the right split. quick speed sanity check first, then one ugly agent task where tool calls can actually fail.
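
One way to make tool calls "actually fail" on demand is to wrap each tool so it errors at a fixed, seeded rate. A minimal sketch, assuming tools are plain Python callables; the `FlakyTool` name and the result dict shape are made up for illustration:

```python
import random

class FlakyTool:
    """Wrap a tool so a seeded fraction of calls return a simulated error."""

    def __init__(self, tool, failure_rate=0.3, seed=0):
        self.tool = tool
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded, so runs are repeatable
        self.calls = 0
        self.failures = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.rng.random() < self.failure_rate:
            self.failures += 1
            return {"ok": False, "error": "simulated tool failure"}
        return {"ok": True, "result": self.tool(*args, **kwargs)}

# Hypothetical tool: the agent would see the wrapped version instead.
read_file = FlakyTool(lambda path: f"contents of {path}", failure_rate=0.5, seed=42)
```

Because the failures are seeded, every model faces the same sequence of broken calls, which keeps the comparison fair across runs.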