Stop benchmarking inference providers: a guide to easy evaluation

Posted by HauntingMoment@reddit | LocalLLaMA | 7 comments

Hey! Nathan from Hugging Face here. I maintained the Open LLM Leaderboard, and in that time I evaluated around 10k models. I think there's a pretty big misconception in how people benchmark LLMs.

Most setups I see rely on inference providers like OpenRouter or Hugging Face's inference providers.

That's convenient, but there's a catch:

You’re often not actually benchmarking the model. You’re benchmarking the provider.

Between quantization, hidden system prompts, routing, or even silent model swaps, the results can be far from the actual model performance.
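As a toy illustration (nothing to do with any specific provider's actual pipeline), here's why quantization alone can move benchmark numbers: snapping weights to a small set of levels perturbs the logits, and when two answers are nearly tied, even a small perturbation changes the margin between them.

```python
# Toy sketch: round weights to a few evenly spaced levels (crude
# "quantization") and watch the logit gap between two answers drift.

def quantize(row, n_levels=16):
    """Snap each weight to the nearest of n_levels evenly spaced values."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / (n_levels - 1)
    return [lo + round((w - lo) / step) * step for w in row]

def logits(weight_rows, x):
    """One logit per row: a plain dot product with the input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

weights = [
    [0.80, -0.53, 0.21],   # answer A
    [0.79, -0.50, 0.20],   # answer B, nearly tied with A
]
x = [1.0, 1.0, 1.0]

full = logits(weights, x)
quant = logits([quantize(row) for row in weights], x)

print(f"logit gap (full precision): {full[0] - full[1]:+.4f}")
print(f"logit gap (quantized):      {quant[0] - quant[1]:+.4f}")
```

With real models the effect is usually small on average, but on near-tied multiple-choice items it is exactly these tiny margins that decide the score.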

The actual “source of truth” for open source models is transformers.
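Concretely, "evaluating with transformers" can be as small as the sketch below. The checkpoint name, the question, and the exact-match metric are placeholders of mine, not the author's actual setup; the model-loading part is behind the main guard because it downloads weights.

```python
# Minimal sketch: load the exact checkpoint with transformers, generate
# greedily, and score against references. Names below are placeholders.

def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    questions = ["What is the capital of France?"]
    references = ["Paris"]

    preds = []
    for q in questions:
        inputs = tok(q, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        preds.append(tok.decode(new_tokens, skip_special_tokens=True))

    print(f"exact-match accuracy: {accuracy(preds, references):.2f}")
```

Because you load the checkpoint yourself, there is no router, no hidden system prompt, and no silent swap between you and the weights.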

So instead of evaluating through providers, I switched to serving the checkpoint myself with transformers and pointing the evaluation at that local endpoint.

This way, you know exactly which weights, precision, and prompts you're measuring: no quantization surprises, hidden system prompts, routing, or silent swaps in between.

Once everything is wired up, benchmarking becomes almost trivial.

You can run something like:

hf jobs uv run script.py \
--flavor l4x1 \
--secrets HF_TOKEN \
-e TRANSFORMERS_SERVE_API_KEY="1234"

And to benchmark something else, just swap the script, the model, or the hardware flavor.
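Once the job is up, transformers serve speaks an OpenAI-compatible API, so querying it is just an HTTP POST. A sketch of that client side; the base URL, port, and model name are my assumptions (check your serve logs), and the bearer token mirrors the TRANSFORMERS_SERVE_API_KEY from the command above:

```python
# Sketch: query the OpenAI-compatible endpoint that `transformers serve`
# exposes. Address, port, and model name are assumptions, not fixed values.
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    from urllib.request import Request, urlopen

    body = json.dumps(chat_payload(
        "Qwen/Qwen2.5-0.5B-Instruct",        # placeholder model name
        "What is the capital of France?",
    )).encode()

    req = Request(
        "http://localhost:8000/v1/chat/completions",  # assumed default address
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer 1234",  # matches TRANSFORMERS_SERVE_API_KEY
        },
    )
    with urlopen(req) as resp:
        answer = json.loads(resp.read())
        print(answer["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, any harness that can talk to an OpenAI-style API can point at it unchanged.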

Here is a more detailed article I wrote describing the process: https://huggingface.co/blog/SaylorTwift/benchmarking-on-the-hub

Curious to hear your thoughts!

Happy to share more details if people are interested.