How do you objectively tell if your custom agent tools are actually better?
Posted by Own_Suspect5343@reddit | LocalLLaMA | 8 comments
I've been running Qwen3.6-35B-A3B locally in pi agent and hit the cat spam problem. The agent just ignores the read tool and gets stuck reading the same file 3-4 times with cat, or dumping entire 2k-line logs instead of grepping.
I wrote a custom tool to replace it. It feels like it helped. The agent makes fewer calls, doesn't re-read the same file blindly, and tasks seem to finish faster.
But I have zero objective way to know if it's actually better.
Maybe I'm just cherry-picking the tasks where it works.
So I'm curious — how do you test whether your tool set is genuinely improving things? Do you write benchmarks?
Queasy-Contract9753@reddit
IMHO benchmarks are overrated. They're like the standardized exams you took in school. Doing really badly is a red flag, but being very good at them doesn't mean you'll be smart or good at the job I need done today.
I'm not a big fan of mainstream agents tbh. They make too many calls, and I find them too far removed from the final prompt that actually gets sent to the model. With my own scripts I can at least tell what they're doing. If they break, it's visible and I can address it. That's really the only objective marker imho: does it do your job in practice? That's not to shit on guys using codex and Claw or whatever. I'm sure there are much smarter and harder-working guys than me who use them successfully.
Exact_Guarantee4695@reddit
yeah this is one of those things where vibes lie fast. i keep a tiny replay set of tasks that previously went sideways and score boring stuff: repeated file reads, raw log dumps, tool call count, and whether it finished with tests green. biggest signal for tool changes has been same prompt, fewer recovery loops, not total runtime. are you logging tool calls as json yet? that's usually enough to build the first eval harness.
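something like this is usually enough for a first pass — a rough sketch, assuming the tool calls land in a jsonl file with `tool`, `args`, and `output_chars` fields (field names are just illustrative, adapt to whatever your agent actually logs):

```python
import json
from collections import Counter

def score_run(log_path: str) -> dict:
    """Score one agent run from a JSONL tool-call log."""
    with open(log_path) as f:
        calls = [json.loads(line) for line in f if line.strip()]

    # repeated reads of the same file (read/cat treated the same here)
    reads = [c.get("args", {}).get("path") for c in calls if c.get("tool") in ("read", "cat")]
    repeat_reads = sum(n - 1 for n in Counter(reads).values() if n > 1)

    # "raw dump" = any single tool output over an arbitrary size threshold
    raw_dumps = sum(1 for c in calls if c.get("output_chars", 0) > 10_000)

    return {"tool_calls": len(calls), "repeat_reads": repeat_reads, "raw_dumps": raw_dumps}

# same replay task, old tools vs new tools
print(score_run("runs/old_tools/task01.jsonl"))
print(score_run("runs/new_tools/task01.jsonl"))
```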
Ok-Measurement-1575@reddit
I pulled a question out of MMLU-Pro (which all qwen 3.6 models seem to do worse on - despite claiming otherwise).
35b UDQ4 - burned several thousand tokens and took roughly 5 minutes to answer.
27b UDQ4 - as above, but took over 20 minutes, dithered constantly in the CoT like an ADHDer, and then gave the wrong answer.
I got opus to write 3 generic MCPs and then reformulated the question so it couldn't be benchmaxxed in the same way (no longer appears in any textbooks, at least).
35b UDQ4 solved it using the new tools in 31 seconds.
I've been using questions like this against LLMs for about 2 years and I've never seen such a compelling result.
havnar-@reddit
Pi has no guardrails, so you are responsible for telling it what to do. Qwen, however, loves to get stuck in loops or overthink things.
Start by properly defining what the LLM has to do, or it will guess and just do that.
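Something along these lines usually does it (wording is just an example, tune it to your setup):

```
- Read a file at most once; state why before reading it.
- Never dump a whole file or log. Use grep with a pattern and a line limit.
- If a search returns nothing, refine the pattern once, then stop and report.
```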
markussss@reddit
I have been using qwen3.6-35-a3b for parsing and transforming approximately 250 MB of quite densely packed HTML. In the beginning, the model consistently wanted to read every file "to get an overview", ending up using cat, head and tail to read either the entire file, or 100, 50, 30 lines at a time across most or all files. This quickly filled the available context window and memory, and didn't give any benefit.
I have had some success with explaining to it that it is running on limited hardware, and explicitly stating that we are *not* using LLMs to parse and transform the text, but that we are using LLMs to orchestrate parsing and transforming text; after that it has been a breeze chewing through the dataset. I got further improvements from instructing it to read one, two or three lines at a time, but only in order to understand the structure of the files, not to get any overviews. However, this last improvement seems to depend on how the data is structured and compressed. Agents tend to read by line count, assuming normal HTML with one, or only a few, tags per line; but when reading 10 lines means reading lines of up to 50 000 000 characters, tools like cat, head and tail don't help at all.
It seems to me that explaining the data as well as the hard limits of the environment works alright.
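For what it's worth, a minimal sketch of a read tool that sidesteps the line problem by budgeting in bytes instead of lines (the name and limits are made up, not an existing pi tool):

```python
def peek(path: str, offset: int = 0, max_bytes: int = 2_000) -> str:
    """Return at most max_bytes bytes starting at offset, decoded as UTF-8.

    Unlike head/tail, the output stays bounded even when the whole file
    is a single multi-million-character line of minified HTML.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(max_bytes)
    return chunk.decode("utf-8", errors="replace")

# sample the structure at a few offsets instead of "getting an overview"
print(peek("dump/page_0001.html", offset=0))
print(peek("dump/page_0001.html", offset=500_000))
```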
kaeptnphlop@reddit
Tell us about your setup. Which quant are you using? Inference settings? Is reasoning on / off?
I just got pi running in a container. It even figured out how to use alternatives to tools that are not available in Alpine Linux.
666666thats6sixes@reddit
A test/benchmark suite... every time I feel like a task is particularly interesting, I add it to the suite (copy the repo + prompt as they were before the interesting task).
The suite measures tokens used, tool calls, how many retries until tests passed.
It's nothing special; I had an agent (qwen3.6 27b) look at the sql benchmark and build a similar UX for general testing. Tasks are in columns, and each row is a new model/parameter set/harness.
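Roughly, it boils down to something like this (run_agent is a stand-in for whatever launches the harness and collects stats; everything here is illustrative):

```python
import csv

def run_agent(model: str, task_dir: str) -> dict:
    """Stand-in: run the agent on the frozen repo + prompt and collect stats."""
    ...
    return {"tokens": 0, "tool_calls": 0, "retries": 0}

MODELS = ["qwen3.6-35b-q4", "qwen3.6-27b-q4"]      # one row per model/config
TASKS = ["tasks/fix-parser", "tasks/add-endpoint"]  # one column per task

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model"] + TASKS)
    for model in MODELS:
        row = [model]
        for task in TASKS:
            s = run_agent(model, task)
            row.append(f"{s['tokens']} tok / {s['tool_calls']} calls / {s['retries']} retries")
        writer.writerow(row)
```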
redmctrashface@reddit
I'm also interested, but across various models. Are there any benchmarks or similar resources available somewhere? Or is it just manual testing until you have a good idea of how each model behaves?