Benchmarking methods

Posted by Forward_Jackfruit813@reddit | LocalLLaMA | 5 comments

The philosophies of benchmarking, or at least comparing, these things are driving me nuts.

A lot of people like to use one-shot prompts across different models, but that isn't going to be accurate: the same model can give different results from run to run, and the harness and system prompts themselves may be doing most of the work.
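
One way to make the run-to-run variance point concrete is to score the same prompt several times per model and compare the gap between models against the spread within each model. This is a minimal sketch, not an established method: `run_once` is a hypothetical hook that runs one generation with a given seed and returns a numeric score, and the "gap larger than the spread" rule is a crude assumption rather than a proper significance test.

```python
import statistics
from typing import Callable, Dict, List


def repeated_scores(run_once: Callable[[int], float], seeds: List[int]) -> List[float]:
    """Score the same prompt once per seed. run_once is a hypothetical hook."""
    return [run_once(seed) for seed in seeds]


def summarize(scores: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of repeated runs."""
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": statistics.mean(scores), "stdev": stdev}


def gap_exceeds_noise(a: List[float], b: List[float]) -> bool:
    """Crude check: is the gap between means larger than the larger per-model spread?"""
    sa, sb = summarize(a), summarize(b)
    return abs(sa["mean"] - sb["mean"]) > max(sa["stdev"], sb["stdev"])
```

If `gap_exceeds_noise` returns False for two models, a single one-shot comparison between them says very little.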

Also, if you want to test agentic capabilities, the quality of the tools comes into question.

Then you have to worry about the simple stuff. What quant are you using, and are your settings optimal? If one model can iterate and create a better output, how do you compare that to a model that did almost as well in one shot, but can't iterate or troubleshoot?

There seem to be way too many variables to account for when comparing quality. I would like to hear how others are quantitatively measuring the output quality of these models.

mr_Owner@reddit

I currently benchmark real-world usage with the seed flag set, so I can expect the same behavior across runs.

So far, I am working via an LLM harness: I have the model plan out a certain prompt on an existing codebase, and I ask it to review its own plan afterwards.

I then compare the output with other plans and ask a cloud frontier LLM to grade the plans.
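
A comparison like this is easier to read if every graded plan is stored together with the settings that produced it. The sketch below is an assumption about how that bookkeeping could look, not the commenter's actual setup: `RunConfig` and its fields are hypothetical names, and the grades are whatever numbers the judging LLM returned.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of the settings behind one graded plan."""
    model: str
    weight_quant: str
    kv_quant: str
    mtp: bool
    seed: int


def mean_grade_by_setup(results: List[Tuple[RunConfig, float]]) -> Dict[tuple, float]:
    """Average the judge's grades over seeds for each (model, quant, KV quant, MTP) setup."""
    buckets = defaultdict(list)
    for cfg, grade in results:
        buckets[(cfg.model, cfg.weight_quant, cfg.kv_quant, cfg.mtp)].append(grade)
    return {setup: mean(grades) for setup, grades in buckets.items()}


def rank(results: List[Tuple[RunConfig, float]]) -> List[Tuple[tuple, float]]:
    """Setups sorted from best to worst mean grade."""
    return sorted(mean_grade_by_setup(results).items(), key=lambda kv: kv[1], reverse=True)
```

Keeping the seed in the record but out of the grouping key means each setup is averaged across seeds instead of being judged on a single run.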

So far I have mixed results, which makes me almost certain that quant + KV cache has a high impact. Even with MTP on dense LLMs, the outcome seems (in my case) less deep compared to running without MTP.

TL;DR: in my own real-world usage, qwen3.6 35b a3b at q8 with KV at q8 has been better at planning, scanning, and understanding a codebase than qwen3.6 27b at q6 with KV at q8, with and without MTP.

Counterintuitive outcome. Maybe this gives you some insights, YMMV.

Fheredin@reddit

My personal benchmark is to ask an LLM to split a relatively high-point cribbage hand. It is a simple benchmark to run, and LLMs tend to perform abominably at it because they stop counting points far too early to conserve processing tokens. Most LLMs only see one or two scoring combinations, which leads to a very poor choice, on par with the very worst human players.
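
To grade answers on a test like this you need a ground truth to compare against. The sketch below counts a hand with the standard cribbage rules (fifteens, pairs, runs, flush, nobs) and ranks the 15 possible keeps by average hand score over the 46 unseen starter cards. It deliberately ignores the crib and pegging, so it is a simplification and not a full discard solver. The card notation (`"5H"`, `"TS"`) and the function names are assumptions made for the example.

```python
from itertools import combinations

RANKS = "A23456789TJQK"
SUITS = "CDHS"


def parse(card):
    """'5H' -> (rank 1..13, suit letter)."""
    return RANKS.index(card[0]) + 1, card[1]


def score_hand(hand, starter):
    """Count a 4-card hand plus the starter card."""
    cards = [parse(c) for c in hand] + [parse(starter)]
    ranks = [r for r, _ in cards]
    values = [min(r, 10) for r in ranks]  # face cards count as 10
    points = 0
    # Fifteens: 2 points for every subset of cards summing to 15.
    for n in range(2, 6):
        points += 2 * sum(1 for combo in combinations(values, n) if sum(combo) == 15)
    # Pairs: 2 points for every pair of equal ranks.
    points += 2 * sum(1 for a, b in combinations(ranks, 2) if a == b)
    # Runs: only the longest run length scores, once per distinct card combination.
    for n in (5, 4, 3):
        runs = 0
        for combo in combinations(ranks, n):
            s = sorted(combo)
            if all(s[i] + 1 == s[i + 1] for i in range(n - 1)):
                runs += 1
        if runs:
            points += runs * n
            break
    # Flush: 4 if the hand is one suit, 5 if the starter matches too.
    starter_suit = cards[4][1]
    hand_suits = {s for _, s in cards[:4]}
    if len(hand_suits) == 1:
        points += 5 if starter_suit in hand_suits else 4
    # Nobs: jack in hand matching the starter's suit.
    if any(r == 11 and s == starter_suit for r, s in cards[:4]):
        points += 1
    return points


def best_keeps(six_cards):
    """Rank all 15 four-card keeps by average score over unseen starters (crib ignored)."""
    deck = [r + s for r in RANKS for s in SUITS]
    unseen = [c for c in deck if c not in six_cards]
    ranked = []
    for keep in combinations(six_cards, 4):
        avg = sum(score_hand(keep, st) for st in unseen) / len(unseen)
        ranked.append((avg, keep))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

The model's chosen split can then be graded by where it lands in the list returned by `best_keeps`, which gives a numeric score instead of a pass/fail impression.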

Bulky-Priority6824@reddit

Seems like you're setting it up for failure with something it wasn't designed to do?

Bulky-Priority6824@reddit

There's a broad range of user types and uses for LLMs, from guys with 16GB of DDR4 trying to build a girlfriend, to wizards living in live ebooks, to gals with tens of thousands of dollars in GPUs doing science, coding, etc. Then there is everyone scattered in between, from the bottom to the top.

So yeah, different people are providing and looking for different metrics. Some want faster responses from Becky and others seek higher accuracy.

It would be nice to have an all-in-one tool that tests something like 5 different user levels/types.

Old-Tumbleweed1422@reddit

Running static benchmarks like MMLU or HumanEval has been a waste of time for a while now: models are hopelessly contaminated with those tests from pre-training. In prod we completely ditched them in favor of the LLM-as-a-judge pattern to validate specific business scenarios. We write a set of 50–100 custom prompts taken from real logs, run the model we’re testing, and ask something like Sonnet to grade the output on a 1-to-5 scale using strict criteria.

Sure, it burns a lot of tokens, but it's the only way to measure actual, real-world utility instead of some synthetic benchmark in a vacuum.
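
The loop described above can be sketched roughly as follows. Everything here is an assumption for illustration: `generate` and `judge` are hypothetical callables wrapping whatever client you use for the model under test and the grading model, and the template wording and the `SCORE: <n>` reply format are invented for the example, not taken from any real rubric.

```python
import re
from statistics import mean
from typing import Callable, Dict, List, Optional

# Hypothetical judge prompt; the wording and reply format are assumptions.
JUDGE_TEMPLATE = (
    "Grade the answer on a 1-to-5 scale using these criteria:\n{criteria}\n\n"
    "Prompt:\n{prompt}\n\nAnswer:\n{answer}\n\n"
    "Reply with a line of the form 'SCORE: <n>'."
)


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if it is missing or out of range."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def evaluate(
    prompts: List[str],
    generate: Callable[[str], str],  # model under test (hypothetical hook)
    judge: Callable[[str], str],     # grading model (hypothetical hook)
    criteria: str,
) -> Dict[str, object]:
    """Run every prompt through the model, have the judge grade it, and aggregate."""
    scores: List[int] = []
    unparsed = 0
    for prompt in prompts:
        answer = generate(prompt)
        reply = judge(JUDGE_TEMPLATE.format(criteria=criteria, prompt=prompt, answer=answer))
        score = parse_score(reply)
        if score is None:
            unparsed += 1  # track judge replies that did not follow the format
        else:
            scores.append(score)
    return {
        "mean": mean(scores) if scores else None,
        "scored": len(scores),
        "unparsed": unparsed,
    }
```

Counting unparsed replies separately keeps a judge that ignores the format from silently skewing the mean.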