Benchmarking methods

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | 5 comments

The philosophies behind benchmarking, or at least comparing, these models are driving me nuts.

A lot of people like to use one-shot prompts across different models, but that isn't going to be accurate: you can get different results from the same model on repeated runs, and the harness and system prompt are often doing most of the work themselves.
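To put a number on that run-to-run variance, one option is to run the same prompt many times per model and report a pass rate with a confidence interval instead of a single result. Here is a minimal sketch using a percentile bootstrap. The pass/fail lists are made-up illustration data, not real results, and `bootstrap_ci` is just a name I picked for the helper.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, low, high): sample mean plus a percentile bootstrap interval."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), low, high

# Hypothetical data: 1 = pass, 0 = fail, for 20 repeated runs of one prompt.
# Both models are assumed to use the same harness, system prompt, and settings.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0]

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    mean, low, high = bootstrap_ci(scores)
    print(f"{name}: pass rate {mean:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

If the two intervals overlap heavily, a single one-shot comparison between those models probably isn't telling you much.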

Also, if you want to test agentic capabilities, the quality of the tools comes into question.

Then you have to worry about the simple stuff. What quant are you using, and are your settings optimal? If one model can iterate its way to a better output, how do you compare that to a model that did almost as well in one shot but can't iterate or troubleshoot?
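One possible way to frame the one-shot versus iterating comparison is to fix an attempt budget and ask what fraction of tasks each model solves within it, while logging the quant and settings next to every score. This is only a sketch under my own assumptions: `RunConfig`, `success_within_budget`, the model names, and the attempt data are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """Settings recorded with every result so scores stay comparable."""
    model: str
    quant: str
    temperature: float

def success_within_budget(attempts_needed, budget):
    """Fraction of tasks solved using at most `budget` attempts.

    attempts_needed holds, per task, the attempt number on which the task
    first passed, or None if it never passed.
    """
    solved = sum(1 for a in attempts_needed if a is not None and a <= budget)
    return solved / len(attempts_needed)

# Hypothetical data for 10 tasks (not real measurements).
# The one-shot model either passes on attempt 1 or never recovers.
one_shot = [1, 1, 1, None, 1, None, 1, 1, None, 1]
# The iterating model often needs a few rounds of troubleshooting.
iterating = [1, 3, 2, None, 1, 2, 4, 1, 3, 1]

results = {
    RunConfig("model-a", "Q4_K_M", 0.7): one_shot,
    RunConfig("model-b", "Q4_K_M", 0.7): iterating,
}

for config, attempts in results.items():
    for budget in (1, 2, 4):
        rate = success_within_budget(attempts, budget)
        print(f"{config.model} ({config.quant}) budget={budget}: {rate:.1f}")
```

Reporting the whole curve rather than one number at least makes the trade-off visible: with this invented data the one-shot model wins at a budget of 1 and the iterating model wins at a budget of 4.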

There seem to be way too many variables to account for when comparing quality. I would like to hear how others are quantitatively measuring the output quality of these models.