Did you create a new benchmark? Good, keep it to yourself, don't release how it works until something beats it.
Posted by EmirTanis@reddit | LocalLLaMA | 31 comments
Only release leaderboards / charts. This is the only way to avoid pollution / interference from the AI companies.
No_Novel8228@reddit
Gotcha
offlinesir@reddit
It's tough to do this, though, because repeatability is much needed in order to trust a benchmark. Here's an example:
In my own private benchmark (and this is all made up), Qwen 3 scores #1, Meta Llama 4 scores #2, and GPT 5 scores #3.
You may be saying "uhhh, what?" and "Meta's Llama models are not above GPT 5!" but there's no possible way to repeat the test, so you kinda have to trust me (and you likely won't, especially if I work at Meta).
A better strategy is to release a subset of the dataset for a benchmark instead of the whole benchmark, increasing visibility and openness while not being as benchmaxxable as a fully open dataset (e.g., AIME 24).
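For concreteness, a minimal sketch of that subset-release idea: a deterministic, hash-based split so the public sample stays stable across releases while the rest never leaves your machine. The file names, JSONL schema, and 10% ratio here are assumptions, not anything from the thread.

```python
# Sketch only: split a private benchmark into a small public sample (for transparency)
# and a private held-out set (used for actual scoring). Schema and ratio are assumptions.
import hashlib
import json

def split_benchmark(items, public_fraction=0.1):
    """Assign each item to the public or private split by hashing its prompt."""
    public, private = [], []
    for item in items:
        digest = hashlib.sha256(item["prompt"].encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        (public if bucket < public_fraction * 100 else private).append(item)
    return public, private

if __name__ == "__main__":
    # assumed local file of {"prompt": ..., "answer": ...} rows, one JSON object per line
    with open("benchmark.jsonl") as f:
        items = [json.loads(line) for line in f]
    public, private = split_benchmark(items)
    with open("benchmark_public_sample.jsonl", "w") as f:
        f.writelines(json.dumps(x) + "\n" for x in public)
    # the private split is never published
```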
EmirTanis@reddit (OP)
You do not have to release anything; if you do, then it isn't faithful to what I proposed. This is a compromise that has to be made, and having to seek validation for your benchmark is basically saying "I don't trust my own testing and benchmark." It ruins the whole point, and you're much better off actually trusting what you built and letting the benchmark do its thing. Of course, if there's something anomalous, you should investigate to see whether the model acted as expected.
TechnoByte_@reddit
A benchmark that releases nothing and demands blind trust is not reliable; it replaces evidence with "just trust me bro".
Trust in a benchmark isn't derived from the creator's confidence, but from the community's ability to verify its methodology, scrutinize its data, and reproduce its results
Without transparency, we can't know if the benchmark is biased, if the evaluation metrics are flawed, or if scoring errors have occurred
Seeking external validation isn't a sign of weakness, it is the foundation of scientific credibility, making sure the benchmark is robust and fair
LiveBench protects the integrity of its most recent questions to prevent contamination while remaining transparent about its methodology, code, and past data.
Unusual_Guidance2095@reddit
This is true, but it's also kind of hard to do with certain benchmarks. For example, I have a "world knowledge" benchmark where my scope is extremely limited, but that's the point: to see whether the model knows about niche things. I chose my niche just because it's something I'm interested in, but it serves as a gauge for all niches. Something similar to my benchmark would be, for example, average temperatures for different months in different cities around the world. But if I just release my benchmark and name it "weather estimator around the world", the people training the models can cheat and just give them more weather data. That would defeat the entire point of having this benchmark: proxying how many corners of the internet the model went through. And there is a consistent trend: since I didn't release my benchmark publicly, I can tell that models increasingly know little about a geology-adjacent field. For example, Qwen and GLM do terribly compared to the original DeepSeek, and the new DeepSeek models are getting worse.
TechnoByte_@reddit
Purely closed benchmarks cannot be trusted.
See LiveBench, it's both contamination-free and transparent.
They regularly release new questions and completely refresh the benchmark every six months, while also delaying the public release of recent questions to prevent models from being trained on the test set.
They also provide public access to the leaderboard, code, and data from previous releases, keeping it transparent.
IrisColt@reddit
Yes, I’ve got one. It wasn’t designed on purpose... it came about because Claude 4.5, GPT-5, Gemini 2.5, and Grok 4 kept failing at it, heh
RRO-19@reddit
Benchmarks are getting gamed to death. Models optimize for tests instead of real-world performance. Secret benchmarks help, but the real solution is evaluating models on actual use cases, not synthetic tests.
kryptkpr@reddit
https://www.reddit.com/r/LocalLLaMA/s/fmeG0vw0tU
My benchmark is designed specifically to deal with this problem.
No-Refrigerator-1672@reddit
How exactly are you proposing to keep a benchmark secret? Claude, OpenAI, Google, Grok, etc. will never agree to send you model weights for evaluation; they'll only provide API keys. The moment you run a pass of your benchmark through their API, they get to log and keep all of your tasks, and use them for training if they want. Keeping benchmark tasks secret is basically impossible.
_yustaguy_@reddit
Well, they can't log the solutions too.
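That is the usual structure of a private eval over an API: only the prompts go over the wire, and grading happens locally against an answer key that never leaves your machine. A rough sketch, where the endpoint, model name, and exact-match scoring are all assumptions:

```python
# Sketch only: prompts are sent to the provider, but the answer key and grading
# logic stay local, so the provider's logs contain questions but no solutions.
# Endpoint URL, model name, and response shape are assumed (OpenAI-compatible).
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

def ask(prompt, model="some-model"):
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def score(benchmark):
    """benchmark: list of {"prompt": ..., "answer": ...} dicts kept only locally."""
    correct = sum(ask(item["prompt"]).strip() == item["answer"] for item in benchmark)
    return correct / len(benchmark)
```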
egomarker@reddit
They have an army of PhDs to write all the solutions.
_yustaguy_@reddit
What prevents those PhDs from writing the tests themselves in that case?
Why would they filter through 10,000 ERP and "how many r's are there in retard" messages to find a potentially good question, instead of just writing the questions themselves?
egomarker@reddit
They do that too.
No-Refrigerator-1672@reddit
Why does that matter? They can employ humans to solve the tasks after the test and use that data to train the next model.
LienniTa@reddit
They can't benchmax if they don't have metrics.
No-Refrigerator-1672@reddit
They have your tasks and they have your public description of what the benchmark is about; it's not too hard to guess the metrics and correct answers from that.
LienniTa@reddit
What public description are you talking about? There is nothing in the open. Over the API, all my questions would just look like normal questions anyone could ask ChatGPT directly. There is no connection between those and the leaderboard I may post.
No-Refrigerator-1672@reddit
A benchmark is, at the very minimum, required to state what exactly it measures in a short description (a few sentences); otherwise it's useless to its audience. That is enough to guess what the correct answers should look like, given that they have the tasks logged from your first run.
LienniTa@reddit
I don't understand how an API provider could link the benchmark results to the questions asked. It's just impossible. I don't put my API key next to the benchmark chart.
No-Refrigerator-1672@reddit
That's easy. If they give you preliminary access so the benchmarks can be run for the technical review they publish along with the model, then they know precisely who you are and which keys you have. If you're testing publicly available models, they can just parse their logs, identify keys that burst a large corpus of requests with complete silence before and after, and then run all such bursts through an LLM to find the ones that look like a benchmark, i.e. bursts whose topics all align with your benchmark's stated scope.
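To make the log-mining idea concrete, here is a rough sketch of the kind of heuristic a provider could run: group requests by API key, split each key's history into sessions separated by long quiet gaps, and flag dense, isolated sessions for topical review. The log schema and every threshold below are assumptions for illustration only.

```python
# Illustrative heuristic only: flag per-key request "sessions" that look like eval runs,
# i.e. many requests packed into a short window with long silence on either side.
from collections import defaultdict

def find_eval_like_bursts(log, min_requests=200, max_span_s=3600, quiet_s=86400):
    """log: iterable of (api_key, unix_timestamp) pairs (assumed schema)."""
    by_key = defaultdict(list)
    for key, ts in log:
        by_key[key].append(ts)

    suspects = []
    for key, times in by_key.items():
        times.sort()
        # split the key's history into sessions separated by long quiet gaps
        sessions, current = [], [times[0]]
        for prev, ts in zip(times, times[1:]):
            if ts - prev > quiet_s:
                sessions.append(current)
                current = []
            current.append(ts)
        sessions.append(current)
        # a benchmark run looks like one dense, isolated session
        for s in sessions:
            if len(s) >= min_requests and s[-1] - s[0] <= max_span_s:
                suspects.append((key, s[0], s[-1]))
                break
    # candidates would then be reviewed for topical coherence ("do these prompts
    # share one narrow scope that matches a known benchmark description?")
    return suspects
```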
LienniTa@reddit
You are probably right. It amuses me to think that they would apply such effort to my furry novel porn writing benchmark, but that might be a real issue for the private bench of the company I'm working at. I guess all API providers have "evals detection" going on.
Mart-McUH@reddit
That only works if you benchmark only local models (ones you run yourself). The moment you want to benchmark over an API (e.g., a closed model), they will have your dataset, because you need to do inference on their side (they will get your prompts and thus the benchmark tasks).
As with any serious competition or exam, unless it tests memory, it only works when you use new set of problems each time.
EmirTanis@reddit (OP)
Last I checked, most APIs respect settings like "do not train on my data" / "do not use it to improve the product". Is that not the case anymore?
egomarker@reddit
You think the people who stole all the data on the internet really respect your checkboxes?
Mart-McUH@reddit
Does not matter. They got your prompts. Assume they have your questions.
Besides, they do not need to train exactly on your data (all they need is to create training data that teaches the model to solve that problem), so they do not even need to break that clause.
EmirTanis@reddit (OP)
If someone made a private benchmark, ran it first against local models and then over the API, and the next model then turned out to be very good at the benchmark, that would point to pretty big fraud. Lots of people trust "do not train on my data." I don't think they'd do that; it would be pretty scandalous, and obvious to the workers involved.
ZoroWithEnma@reddit
I stopped caring about these benchmarks long ago. If Reddit says it's a good model, then it may be good; I'll try it, and if I like it, it's a good model for me.
ttkciar@reddit
Yep, I've been thinking about exactly this with regard to replacing my inference test query set with new queries. I've been sharing too many raw results here on Reddit, and I've noticed that various recent models seem to have been trained on them.
One idea I've been batting around is to modify my test framework so that, instead of prompting the models with the hard-coded queries, it runs the queries through a round or two of Evol-Instruct to synthesize new queries of equivalent utility (testing the same model skills) but worded completely differently, and prompts the models with those.
That would enable me to share a test's raw results without exposing the hard-coded query list; only the Evol-Instruct synthesized queries would appear in the raw results.
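A rough sketch of what that rewriting step could look like, assuming an OpenAI-compatible chat endpoint for the rewriting model; the endpoint, model name, prompt wording, and function names are assumptions, not the commenter's actual framework.

```python
# Sketch only: paraphrase each hard-coded query with a rewriting model before use,
# so published raw results expose only the synthesized wording, never the originals.
import os
import requests

REWRITE_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

EVOLVE_PROMPT = (
    "Rewrite the following test question so it exercises exactly the same skill "
    "and has the same difficulty, but shares no wording with the original. "
    "Return only the rewritten question.\n\nQuestion: {query}"
)

def evolve(query, rounds=2, model="rewriter-model"):
    """Run a query through a couple of rewrite rounds, Evol-Instruct style."""
    for _ in range(rounds):
        resp = requests.post(
            REWRITE_URL,
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": model,
                  "messages": [{"role": "user",
                                "content": EVOLVE_PROMPT.format(query=query)}]},
            timeout=120,
        )
        resp.raise_for_status()
        query = resp.json()["choices"][0]["message"]["content"].strip()
    # prompt the model under test with this, and publish only this wording
    return query
```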
Mabuse046@reddit
But why would it matter? Benchmarks are quaint, but ultimately to be taken with a grain of salt. I guarantee anyone who uses LLMs regularly has their own experience of which ones work better for them than others at any given task, and we all know that rarely aligns with benchmark scores. Honestly, we'd probably all be better off if anyone with a benchmark just kept it to themselves entirely; they mostly serve to mislead people who haven't been around long enough to know better.
fuck_cis_shit@reddit
this is the way. the only evals are private evals and garbage evals