Stop benchmarking inference providers: a guide to easy evaluation
Posted by HauntingMoment@reddit | LocalLLaMA | View on Reddit | 7 comments
Hey! Nathan from Hugging Face here. I maintained the Open LLM Leaderboard, and in that time I've evaluated around 10k models. I think there's a pretty big misconception in how people benchmark LLMs.
Most setups I see rely on inference endpoints like OpenRouter or Hugging Face's Inference Providers.
That's convenient, but there's a catch:
You’re often not actually benchmarking the model. You’re benchmarking the provider.
Between quantization, hidden system prompts, routing, and even silent model swaps, the results can be far from the model's actual performance.
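One quick sanity check for provider drift: run the same prompts with greedy decoding (temperature 0) against both a local reference and the provider, and count mismatches. This is a minimal sketch, not part of any specific tool; the function name and sample strings are illustrative.

```python
# Sketch: flag provider drift by comparing greedy (temperature=0) completions
# from a provider against a local reference run of the same checkpoint.

def divergence_rate(reference: list[str], provider: list[str]) -> float:
    """Fraction of prompts where the provider's greedy completion
    differs from the local reference completion."""
    if len(reference) != len(provider):
        raise ValueError("need one completion per prompt from each source")
    mismatches = sum(r != p for r, p in zip(reference, provider))
    return mismatches / len(reference)

# Greedy decoding should be near-deterministic, so a high divergence rate
# hints at quantization, hidden system prompts, or a silent model swap.
ref = ["Paris", "4", "blue"]
prov = ["Paris", "4", "teal"]
assert divergence_rate(ref, prov) == 1 / 3
```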
The actual “source of truth” for open source models is transformers.
So instead of evaluating through providers, I switched to:
- Running models via transformers serve (OpenAI-compatible server)
- Using inspect-ai as the eval harness
- Spinning everything up with HF Jobs (on-demand GPUs)
- Publishing results back to the hub
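The serve-then-query half of the loop above can be sketched as plain HTTP against the OpenAI-compatible chat route that transformers serve exposes. The base URL, port, and API key here are assumptions (the key matches the example command later in the post); the payload shape is the standard OpenAI chat format.

```python
# Sketch: query a locally running `transformers serve` instance through its
# OpenAI-compatible chat route. Port and API key are assumptions.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local serve address

def build_chat_request(model: str, prompt: str) -> dict:
    """Standard OpenAI-style chat payload; temperature=0 for reproducibility."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }

def ask(model: str, prompt: str, api_key: str = "1234") -> str:
    """POST one chat request and return the completion text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```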
This way:
- You control exactly what model is being run
- You get reproducible results
- You can scale to a lot of models without too much infra pain
Once everything is wired up, benchmarking becomes almost trivial.
You can run something like:
hf jobs uv run script.py \
--flavor l4x1 \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"
And just swap:
- the model
- the hardware
- the benchmark (GPQA, SWE-bench, AIME, etc.)

You can then push eval results back to model repos and have them show up in community leaderboards on Hugging Face.
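Once the harness emits per-sample scores, collapsing them into a summary you could attach to a model repo is only a few lines. A minimal sketch; the field names are illustrative, not a Hub schema.

```python
# Sketch: collapse per-sample eval scores (1.0 = correct, 0.0 = wrong)
# into a summary dict. Field names are illustrative, not a Hub schema.
from statistics import mean

def summarize(model: str, benchmark: str, scores: list[float]) -> dict:
    """Aggregate per-sample scores into one leaderboard-style record."""
    return {
        "model": model,
        "benchmark": benchmark,
        "n_samples": len(scores),
        "accuracy": mean(scores),
    }

result = summarize("my-org/my-model", "GPQA", [1.0, 0.0, 1.0, 1.0])
assert result["accuracy"] == 0.75
```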
Here is a more detailed article I wrote describing the process: https://huggingface.co/blog/SaylorTwift/benchmarking-on-the-hub
Curious to hear your thoughts!
- Are you benchmarking via providers or self-hosted?
- Have you run into inconsistencies between endpoints?
- Any better setups/tools I should look at?
Happy to share more details if people are interested.
ttkciar@reddit
Hello Nathan! Was this post LLM-generated?
HauntingMoment@reddit (OP)
hey! Not really, it's based on the blog posts I wrote about the subject. I used an LLM to summarize them, as they were a bit too in-depth, and reworked some parts. I didn't know about this rule, sorry! I can redo the post if needed.
Here are the original posts, written by myself:
1. https://x.com/nathanhabib1011/status/2043686339531399676?s=20
2. https://huggingface.co/blog/SaylorTwift/benchmarking-on-the-hub
ttkciar@reddit
Okie-doke, thanks for the clarification. I hated to take down a post from HF staff, but didn't want to leave it up too long.
If you write a new post not generated by an LLM, we would be glad to have it.
ttkciar@reddit
Violates Rule Three: Low-effort posts (LLM-generated content)
PaceZealousideal6091@reddit
Thanks! I have always thought about this, and it's become even more evident with how tough it has been to get Gemma 4 running stably, with so many issues to iron out. None of the benchmarks mean anything when every day a new fix in llama.cpp makes them outdated. Definitely, benchmarking a model needs to be done exclusively at full precision using transformers. Having said that, it's also important for users to have access to benchmarks run on quantized models in inference engines like llama.cpp, because most can't afford to run models at full precision and use GGUFs for practical purposes. So inference-provider-based benchmarks are definitely needed to understand how things work at the ground level. I guess both kinds of benchmark have their place.
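Reporting the two kinds of numbers side by side is cheap once both runs exist: the interesting figure is the gap between the full-precision transformers baseline and the quantized run. An illustrative sketch; the accuracy numbers are made up.

```python
# Sketch: report the gap between a full-precision transformers baseline and
# a quantized run of the same model. Numbers below are made up.
def quant_delta(baseline: float, quantized: float) -> float:
    """Absolute accuracy drop (positive = quantized scored lower)."""
    return baseline - quantized

baseline_acc = 0.62   # hypothetical full-precision accuracy
q4_acc = 0.58         # hypothetical Q4 GGUF accuracy on the same split
assert abs(quant_delta(baseline_acc, q4_acc) - 0.04) < 1e-9
```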
HauntingMoment@reddit (OP)
absolutely, we need more transparency on how models are run by benchmarkers and how they are served by inference providers
Klarts@reddit
I think there are some benefits of these benchmarks for those who rely on cloud gpu service (vast, runpod, hyperstack, etc) for their own deployment/use.
This will hopefully provide a rough estimate as to how the models will perform.
The person benchmarking would just need to provide more info about the setup and deployments for context.