Did you create a new benchmark? Good, keep it to yourself, don't release how it works until something beats it.
Posted by EmirTanis@reddit | LocalLLaMA | 31 comments
Only release leaderboards / charts. This is the only way to avoid pollution / interference from the AI companies.
No_Novel8228@reddit
Gotcha
offlinesir@reddit
It's tough to do this, though, because repeatability is much needed in order to trust a benchmark. Here's an example:
In my own private benchmark (and this is all made up), Qwen 3 scores #1, Meta Llama 4 scores #2, and GPT 5 scores #3.
You may be saying "uhhh, what?" and "Meta's Llama models are not above GPT 5!" but there's no possible way to repeat the test, so you kinda have to trust me (and you likely won't, especially if I work at Meta).
A better strategy is to release a subset of the dataset for a benchmark instead of the whole benchmark, increasing visibility and openness while not being as benchmaxxable as a fully open dataset (e.g., AIME 24).
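For concreteness, a minimal sketch of that subset-release idea: a deterministic, hash-based split so the public sample stays stable across releases while the rest never leaves your machine. The file names, JSONL schema, and 10% ratio here are assumptions, not anything from the thread.

```python
# Sketch only: split a private benchmark into a small public sample (for transparency)
# and a private held-out set (used for actual scoring). Schema and ratio are assumptions.
import hashlib
import json

def split_benchmark(items, public_fraction=0.1):
    """Assign each item to the public or private split by hashing its prompt."""
    public, private = [], []
    for item in items:
        digest = hashlib.sha256(item["prompt"].encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        (public if bucket < public_fraction * 100 else private).append(item)
    return public, private

if __name__ == "__main__":
    # assumed local file of {"prompt": ..., "answer": ...} rows, one JSON object per line
    with open("benchmark.jsonl") as f:
        items = [json.loads(line) for line in f]
    public, private = split_benchmark(items)
    with open("benchmark_public_sample.jsonl", "w") as f:
        f.writelines(json.dumps(x) + "\n" for x in public)
    # the private split is never published
```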
EmirTanis@reddit (OP)
You do not have to release anything; if you do, then it isn't faithful to what I proposed. This is a compromise that has to be made, and having to seek validation for your benchmark is basically saying "I don't trust my own testing and benchmark." It ruins the whole point, and you're much better off actually trusting what you built and letting the benchmark do its thing. Of course, if there's something anomalous, you should investigate to see whether the model acted as expected.
TechnoByte_@reddit
A benchmark that releases nothing and demands blind trust is not reliable; it replaces evidence with "just trust me bro".
Trust in a benchmark isn't derived from the creator's confidence, but from the community's ability to verify its methodology, scrutinize its data, and reproduce its results
Without transparency, we can't know if the benchmark is biased, if the evaluation metrics are flawed, or if scoring errors have occurred
Seeking external validation isn't a sign of weakness, it is the foundation of scientific credibility, making sure the benchmark is robust and fair
LiveBench protects the integrity of its most recent questions to prevent contamination while remaining transparent about its methodology, code, and past data.
Unusual_Guidance2095@reddit
This is true, but it's also kind of hard to do with certain benchmarks. For example, I have a "world knowledge" benchmark where my scope is extremely limited, but that's the point: to see whether the model knows about niche things. I chose my niche just because it's something I'm interested in, but it serves as a gauge for all niches. Something similar to my benchmark would be, for example, average temperatures for different months in different cities around the world. But if I just release my benchmark and name it "weather estimator around the world", the people training the models can cheat and just give them more weather data. That would defeat the entire point of having this benchmark: proxying how many corners of the internet the model went through. And there is a consistent trend: since I didn't release my benchmark publicly, I can tell that models increasingly know little about a geology-adjacent field. For example, Qwen and GLM do terribly compared to the original DeepSeek, and the new DeepSeek models are getting worse.
TechnoByte_@reddit
Purely closed benchmarks cannot be trusted.
See LiveBench, it's both contamination-free and transparent.
They regularly release new questions and completely refresh the benchmark every six months, while also delaying the public release of recent questions to prevent models from being trained on the test set.
They also provide public access to the leaderboard, code, and data from previous releases, keeping it transparent.
IrisColt@reddit
Yes, I’ve got one. It wasn’t designed on purpose... it came about because Claude 4.5, GPT-5, Gemini 2.5, and Grok 4 kept failing at it, heh
RRO-19@reddit
Benchmarks are getting gamed to death. Models optimize for tests instead of real-world performance. Secret benchmarks help, but the real solution is evaluating models on actual use cases, not synthetic tests.
kryptkpr@reddit
https://www.reddit.com/r/LocalLLaMA/s/fmeG0vw0tU
My benchmark is designed specifically to deal with this problem.
No-Refrigerator-1672@reddit
How exactly are you proposing to keep a benchmark secret? Claude, OpenAI, Google, Grok, etc. will never agree to send you model weights for evaluation; they'll only provide API keys. The moment you run a pass of your benchmark through their API, they get to log and keep all of your tasks, and use them for training if they want. Keeping benchmark tasks secret is basically impossible.
_yustaguy_@reddit
Well, they can't log the solutions too.
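That is the usual structure of a private eval over an API: only the prompts go over the wire, and grading happens locally against an answer key that never leaves your machine. A rough sketch, where the endpoint, model name, and exact-match scoring are all assumptions:

```python
# Sketch only: prompts are sent to the provider, but the answer key and grading
# logic stay local, so the provider's logs contain questions but no solutions.
# Endpoint URL, model name, and response shape are assumed (OpenAI-compatible).
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

def ask(prompt, model="some-model"):
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def score(benchmark):
    """benchmark: list of {"prompt": ..., "answer": ...} dicts kept only locally."""
    correct = sum(ask(item["prompt"]).strip() == item["answer"] for item in benchmark)
    return correct / len(benchmark)
```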
egomarker@reddit
They have an army of PhDs to write all the solutions.
_yustaguy_@reddit
What prevents those PhDs from writing the tests themselves in that case?
Why would they filter through 10,000 ERP and "how many r's are there in retard" messages to find a potentially good question, instead of just writing the questions themselves?
egomarker@reddit
They do that too.
No-Refrigerator-1672@reddit
Why does that matter? They can employ humans to solve the tasks after the test and use that data to train the next model.
LienniTa@reddit
They can't benchmax if they don't have metrics.
No-Refrigerator-1672@reddit
They have your tasks and they have your public description of what the benchmark is about; it's not too hard to guess the metrics and correct answers from that.
LienniTa@reddit
What public description are you talking about? There is nothing in the open. Over the API, all my questions would just look like normal questions anyone could ask ChatGPT directly. There is no connection between those and the leaderboard I may post.
No-Refrigerator-1672@reddit
A benchmark is, at the very minimum, required to state what exactly it measures in a short description (a few sentences); otherwise it's useless to its audience. That is enough to guess what the correct answers should look like, given that they have the tasks logged from your first run.
LienniTa@reddit
I don't understand how an API provider could link the benchmark results to the questions asked. It's just impossible. I don't put my API key next to the benchmark chart.
No-Refrigerator-1672@reddit
That's easy. If they give you preliminary access so the benchmarks can be run for the technical review they publish along with the model, then they know precisely who you are and which keys you have. If you're testing publicly available models, they can just parse their logs, identify keys that burst a large corpus of requests with complete silence before and after, and then run all such bursts through an LLM to find the ones that look like a benchmark, i.e. bursts whose topics all align with your benchmark's stated scope.
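To make the log-mining idea concrete, here is a rough sketch of the kind of heuristic a provider could run: group requests by API key, split each key's history into sessions separated by long quiet gaps, and flag dense, isolated sessions for topical review. The log schema and every threshold below are assumptions for illustration only.

```python
# Illustrative heuristic only: flag per-key request "sessions" that look like eval runs,
# i.e. many requests packed into a short window with long silence on either side.
from collections import defaultdict

def find_eval_like_bursts(log, min_requests=200, max_span_s=3600, quiet_s=86400):
    """log: iterable of (api_key, unix_timestamp) pairs (assumed schema)."""
    by_key = defaultdict(list)
    for key, ts in log:
        by_key[key].append(ts)

    suspects = []
    for key, times in by_key.items():
        times.sort()
        # split the key's history into sessions separated by long quiet gaps
        sessions, current = [], [times[0]]
        for prev, ts in zip(times, times[1:]):
            if ts - prev > quiet_s:
                sessions.append(current)
                current = []
            current.append(ts)
        sessions.append(current)
        # a benchmark run looks like one dense, isolated session
        for s in sessions:
            if len(s) >= min_requests and s[-1] - s[0] <= max_span_s:
                suspects.append((key, s[0], s[-1]))
                break
    # candidates would then be reviewed for topical coherence ("do these prompts
    # share one narrow scope that matches a known benchmark description?")
    return suspects
```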
LienniTa@reddit
You are probably right. It amuses me to think that they would apply such effort to my furry novel porn writing benchmark, but that might be a real issue for the private bench of the company I'm working at. I guess all API providers have "evals detection" going on.
Mart-McUH@reddit
That only works if you benchmark only local models (ones you run yourself). The moment you want to benchmark over an API (e.g., a closed model), they will have your dataset, because you need to do inference on their side (they will get your prompts and thus the benchmark tasks).
As with any serious competition or exam, unless it tests memory, it only works when you use new set of problems each time.
EmirTanis@reddit (OP)
Last I checked, most APIs respect settings like "do not train on my data" / "do not use it to improve the product". Is that not the case anymore?
egomarker@reddit
You think the people who stole all the data on the internet really respect your checkboxes?
Mart-McUH@reddit
Does not matter. They got your prompts. Assume they have your questions.
Besides, they do not need to train exactly on your data (all they need is to create training data that teaches the model to solve that problem), so they do not even need to break that clause.
EmirTanis@reddit (OP)
If someone made a private benchmark, ran it first against local models and then over the API, and the next model then turned out to be very good at the benchmark, that would point to pretty big fraud. Lots of people trust "do not train on my data." I don't think they'd do that; it would be pretty scandalous, and obvious to the workers involved.
ZoroWithEnma@reddit
I stopped caring about these benchmarks long ago. If Reddit says it's a good model, then it may be good; I'll try it, and if I like it, it's a good model for me.
ttkciar@reddit
Yep, I've been thinking about exactly this with regard to replacing my inference test query set with new queries. I've been sharing too many raw results here on Reddit, and I've noticed that various recent models seem to have been trained on them.
One idea I've been batting around is to modify my test framework so that, instead of prompting the models with the hard-coded queries, it runs the queries through a round or two of Evol-Instruct to synthesize new queries of equivalent utility (testing the same model skills) but worded completely differently, and prompts the models with those.
That would enable me to share a test's raw results without exposing the hard-coded query list; only the Evol-Instruct synthesized queries would appear in the raw results.
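A rough sketch of what that rewriting step could look like, assuming an OpenAI-compatible chat endpoint for the rewriting model; the endpoint, model name, prompt wording, and function names are assumptions, not the commenter's actual framework.

```python
# Sketch only: paraphrase each hard-coded query with a rewriting model before use,
# so published raw results expose only the synthesized wording, never the originals.
import os
import requests

REWRITE_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

EVOLVE_PROMPT = (
    "Rewrite the following test question so it exercises exactly the same skill "
    "and has the same difficulty, but shares no wording with the original. "
    "Return only the rewritten question.\n\nQuestion: {query}"
)

def evolve(query, rounds=2, model="rewriter-model"):
    """Run a query through a couple of rewrite rounds, Evol-Instruct style."""
    for _ in range(rounds):
        resp = requests.post(
            REWRITE_URL,
            headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
            json={"model": model,
                  "messages": [{"role": "user",
                                "content": EVOLVE_PROMPT.format(query=query)}]},
            timeout=120,
        )
        resp.raise_for_status()
        query = resp.json()["choices"][0]["message"]["content"].strip()
    # prompt the model under test with this, and publish only this wording
    return query
```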
Mabuse046@reddit
But why would it matter? Benchmarks are quaint, but ultimately to be taken with a grain of salt. I guarantee anyone who uses LLMs regularly has their own experience of which ones work better for them than others at any given task, and we all know that rarely aligns with benchmark scores. Honestly, we'd probably all be better off if anyone with a benchmark just kept it to themselves entirely; they mostly serve to mislead people who haven't been around long enough to know better.
fuck_cis_shit@reddit
this is the way. the only evals are private evals and garbage evals