Benchmarks and evals
Posted by selund1@reddit | LocalLLaMA | 10 comments
How are people running evals and benchmarks currently?
I've mostly been pulling datasets from papers (GitHub, really) and Hugging Face, and I've ended up with a bunch of spaghetti Python as a result. Looking for something better.
- How are you thinking about evals? Do you care about them at all?
- How much are you vibe checking your local setup vs evaluating?
- I've heard some people set up their own eval sets (like 20 Q/A-style questions), would love to hear how and why
Like everything in this space, there seem to be a million ways to do it, and I'd rather hear about real experiences from the community than some hype-fueled article or marketing materials
kryptkpr@reddit
I crawl around in these trenches in the name of fun.
It started with 6 prompts; after several years of sliding down the slope, my average evaluation is now about 20K prompts. To get reasonable 95% confidence intervals and decent-looking difficulty surfaces, the sample-size piper must be paid.
The docs linked above cover the methodology, how it has evolved, and why it's different from most others in some depth. I don't want to reproduce it all here, but I'm happy to answer questions.
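To make the confidence-interval point concrete, here's a rough sketch (not from the linked docs) using a Wilson score interval on a pass rate. It shows why a handful of prompts gives an interval too wide to rank models, while ~20K prompts narrows it to about a percentage point:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass-rate estimate (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

# Interval width shrinks roughly as 1/sqrt(n):
for n in (20, 500, 20_000):
    lo, hi = wilson_ci(int(0.7 * n), n)
    print(f"n={n:>6}  pass-rate 0.70  95% CI width {hi - lo:.3f}")
```

At n=20 the interval is nearly 0.4 wide, so a 70% score is indistinguishable from a 50% one; at n=20,000 it's around 0.013.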
Chromix_@reddit
Nice coincidence that you have Qwen3-Next right next to gpt-oss-120b on the leaderboard there, as that topic just came up a few hours ago. It would've been great to also see the thinking version, given that the instruct version is already surprisingly close to oss-120b.
kryptkpr@reddit
I have a bunch of "big model" evals locally that I'm sitting on. The 235B thinking scores much worse than instruct; it lands in something like 14th! Its truncations are off the hook and nothing I do fixes them; repeat penalty values from 0.2-1.0 didn't help.. it cannot shut up. I suspect it needs a force-think-stop to be usable, which I can do but haven't gotten around to yet.
Chromix_@reddit
I was thinking more of Qwen3-Next-Thinking. Regarding the presence penalty, I had more success going without it and using --repeat-penalty 1.1 --repeat-last-n 3 instead. It still looped in some cases, but way less than before. It hurt the reasoning in a few benchmarks, though: having more completed tasks at slightly lower quality still led to higher scores than having far fewer completed tasks at slightly higher quality.
kryptkpr@reddit
Sorry, I got confused between all the Qwens; it was the 80B I was talking about above.
next-80b-thinking is throwing 7% global truncation at me @8k ctx; I just started a run where after 4k a proxy will hard inject
I didn't actually test repeat penalty; I tested presence penalty, because that's what the Qwen3-Next model card said to do. But that's a worthwhile experiment as well, so I'll try your setting and compare it against thought-shaping and vanilla. It's kinda bothering me that this model ranks so low; I want to get to the bottom of it.
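For anyone wanting to run the same comparison, here's a minimal sketch of sending those sampling settings to a local llama.cpp server. The URL and the use of the native /completion endpoint are assumptions about the setup; llama.cpp's server does accept repeat_penalty and repeat_last_n as request fields:

```python
import json
import urllib.request

LLAMA_URL = "http://localhost:8080/completion"  # assumed local llama.cpp server

def build_request(prompt: str, repeat_penalty: float = 1.1, repeat_last_n: int = 3) -> dict:
    # Request body mirroring the CLI flags --repeat-penalty 1.1 --repeat-last-n 3.
    return {
        "prompt": prompt,
        "n_predict": 1024,
        "repeat_penalty": repeat_penalty,
        "repeat_last_n": repeat_last_n,
    }

def complete(prompt: str, **sampling) -> str:
    # POST the request and return the generated text.
    body = json.dumps(build_request(prompt, **sampling)).encode()
    req = urllib.request.Request(
        LLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Varying the two sampling arguments per run makes it easy to score the penalty settings against a vanilla baseline on the same prompt set.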
Daniokenon@reddit
Benchmarks are merely a curiosity. If models are trained on test questions or are designed to achieve high scores, they don't say much about the model's actual capabilities. It's best to test the model on what you need it for. I usually prepare a few or a dozen questions/tasks that I'm well-versed in and then observe how the model performs.
selund1@reddit (OP)
How many would you typically prepare? Do you have a certain methodology or is it purely vibes?
Daniokenon@reddit
Initially, 5 is enough. I evaluate the answers myself. If the model can handle those, I'll test further. I don't have any tools or scripts; yes, I know it's time-consuming, but I don't test models often.
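A personal eval set like this is easy to script once it outgrows hand-checking. Here's a minimal sketch, assuming an OpenAI-compatible endpoint at localhost:8080 and a crude keyword-based pass criterion (both are illustrative assumptions, not Daniokenon's setup):

```python
import json
import urllib.request

API_URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

# A handful of questions you know cold, each with a keyword the answer must contain.
EVAL_SET = [
    {"q": "What does HTTP status code 404 mean?", "expect": "not found"},
    {"q": "In Python, what does len(\"abc\") return?", "expect": "3"},
]

def ask(question: str) -> str:
    # One chat-completion call against the local server.
    body = json.dumps(
        {"model": "local", "messages": [{"role": "user", "content": question}]}
    ).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())["choices"][0]["message"]["content"]

def passes(answer: str, expect: str) -> bool:
    # Crude check: the expected keyword appears somewhere in the answer.
    return expect.lower() in answer.lower()

if __name__ == "__main__":
    for item in EVAL_SET:
        ans = ask(item["q"])
        print("PASS" if passes(ans, item["expect"]) else "FAIL", "-", item["q"])
```

Keyword matching is brittle, which is exactly why threads like this end up discussing LLM-as-judge for anything beyond short factual answers.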
YearZero@reddit
You can add VBA to Excel to create a function that passes values from any other cells plus your prompt to your OpenAI-compatible endpoint. So I have a function called CallLLM() in Excel, and I pass it the benchmark questions, which are just a normal Excel column. It records the answer into the cell where the function is. Then, if it's a benchmark that requires an LLM to verify correctness, the same exact process happens again: another call, passing the answer with a judging prompt.
Press ALT+F11 in Excel, right-click on "Modules", insert a new module, and here's my VBA that works (on Windows; have an LLM adjust it for Macs):
(Reddit wouldn't let me post a comment with my code in it)
I have an Excel workbook with a bunch of tabs holding different benchmarks, and this method works incredibly well.
It can also come in handy if you just want a universal Excel function that takes the values of multiple other cells and does something with them in general.
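The second pass described above, where an LLM verifies correctness, boils down to a judge prompt plus a strict parse of the verdict. Here's that piece sketched in Python rather than VBA; the prompt wording and the PASS/FAIL convention are assumptions for illustration:

```python
JUDGE_TEMPLATE = (
    "Question: {q}\n"
    "Reference answer: {ref}\n"
    "Model answer: {ans}\n"
    "Reply with exactly PASS or FAIL: does the model answer agree with the reference?"
)

def judge_prompt(q: str, ref: str, ans: str) -> str:
    # Build the text sent to the judge model as a second call.
    return JUDGE_TEMPLATE.format(q=q, ref=ref, ans=ans)

def judge_verdict(raw_reply: str) -> bool:
    # Strict parse: anything that doesn't clearly start with PASS counts as a fail,
    # so a rambling judge reply never inflates the score.
    return raw_reply.strip().upper().startswith("PASS")
```

The strict parse matters more than the prompt: a lenient parser that searches for "pass" anywhere in the reply will also match "this does not pass".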
selund1@reddit (OP)
Love Excel.
Sounds like you're using an LLM as a judge to measure how good the response is, or am I missing something?