How do you objectively tell if your custom agent tools are actually better?

Posted by Own_Suspect5343@reddit | LocalLLaMA | View on Reddit | 8 comments

I've been running Qwen3.6-35B-A3B locally in the pi agent and hit the cat-spam problem. The agent just ignores the read tool and gets stuck reading the same file 3-4 times with cat, or dumps entire 2k-line logs instead of grepping.

I wrote a custom tool as a replacement, and it feels like it helped. The agent makes fewer calls, doesn't blindly re-read the same file, and tasks seem to finish faster.

But I have zero objective way to know if it's actually better.

Maybe I'm just cherry-picking the tasks where it works.

So I'm curious — how do you test whether your tool set is genuinely improving things? Do you write benchmarks?
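For what it's worth, the rough idea I've been toying with is a tiny A/B harness: run a fixed task list under each toolset and aggregate the same metrics I've been eyeballing (solve rate, tool-call count, redundant re-reads). This is just a sketch; `run_task` is a hypothetical hook into your agent loop, stubbed here with canned numbers so the comparison logic runs on its own:

```python
import statistics

# Hypothetical hook: replace this with a real call into your agent harness.
# It should return per-task metrics; stubbed with canned numbers here so
# the comparison logic itself is runnable.
def run_task(task, toolset):
    canned = {
        ("fix-bug", "cat"):     {"tool_calls": 12, "rereads": 3, "solved": True},
        ("fix-bug", "custom"):  {"tool_calls": 7,  "rereads": 0, "solved": True},
        ("add-test", "cat"):    {"tool_calls": 9,  "rereads": 2, "solved": False},
        ("add-test", "custom"): {"tool_calls": 6,  "rereads": 1, "solved": True},
    }
    return canned[(task, toolset)]

def compare(tasks, toolsets, trials=1):
    """Run every task under every toolset and aggregate the metrics."""
    results = {}
    for ts in toolsets:
        runs = [run_task(t, ts) for t in tasks for _ in range(trials)]
        results[ts] = {
            "solve_rate": sum(r["solved"] for r in runs) / len(runs),
            "mean_tool_calls": statistics.mean(r["tool_calls"] for r in runs),
            "mean_rereads": statistics.mean(r["rereads"] for r in runs),
        }
    return results

if __name__ == "__main__":
    for ts, metrics in compare(["fix-bug", "add-test"], ["cat", "custom"]).items():
        print(ts, metrics)
```

Running several trials per task matters because local-model runs are noisy, but I still don't know if a handful of hand-picked tasks says anything objective, which is the real question.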