How do you benchmark the cognitive performance of local LLM models?
Posted by LastikPlastic@reddit | LocalLLaMA | 8 comments
Hey everyone,
I’ve been experimenting with running local LLMs (mainly open-weight models from Hugging Face) and I’m curious about how to systematically benchmark their cognitive performance — not just speed or token throughput, but things like reasoning, memory, comprehension, and factual accuracy.
I know about lm-evaluation-harness, but it's pretty cumbersome to run manually for each model. I'm wondering if:
- there’s any online tool or web interface that can run multiple benchmarks automatically (similar to Hugging Face’s Open LLM Leaderboard, but for local models), or
- there's a more user-friendly script or framework that can test reasoning / logic / QA performance locally without too much setup.
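For context, what I'm doing now per model is roughly this, and it gets tedious fast (a rough sketch; the model names and tasks are just examples, and I'm assuming lm-eval 0.4+'s simple_evaluate API):
```python
# Roughly what "running it manually for each model" looks like when scripted,
# assuming lm-evaluation-harness >= 0.4 and its simple_evaluate() API.
# Model names and tasks below are just placeholders.
import json
import lm_eval

MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-7B-Instruct",
]
TASKS = ["arc_easy", "hellaswag", "gsm8k"]  # reasoning / comprehension / math

for model_id in MODELS:
    results = lm_eval.simple_evaluate(
        model="hf",                                   # Hugging Face backend
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=TASKS,
        batch_size=8,
    )
    # Keep only the per-task metrics for comparison across models.
    with open(f"{model_id.replace('/', '_')}.json", "w") as f:
        json.dump(results["results"], f, indent=2)
    print(model_id, results["results"])
```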
Any suggestions, tools, or workflows you’d recommend?
Thanks in advance!
mw_morris@reddit
I use them for the tasks I want them to perform and see which one does the best job. Incredibly subjective, sure, but in the end that’s the point (at least for me), since “the purpose of a system is what it does” and all that.
LastikPlastic@reddit (OP)
There are a lot of models available, and I want to choose a good one on the first try, ha-ha
kryptkpr@reddit
I built my own evaluation suite to solve this very problem, but I'm afraid it's all CLI and it runs tens of thousands of tests, so it's not a quick-turnaround kind of thing; it's a detailed analysis of reasoning capability.
Accomplished_Mode170@reddit
If you enable configurable sampling against a target KV corpus, it could minimize the need for training SAEs and demonstrating n-regularizations
Read: I’ve got a CLI for SAE-Lens I’ve been meaning to push for regulatory compliance; love this as graduated control using class-specific entitlements
TL;DR ❤️ pre-processing the binaries 📊
LastikPlastic@reddit (OP)
WOW! That's awesome, I'll read about it
MongooseOriginal6450@reddit
We're also building an AI agent and have been looking into the same space recently.
From what I’ve seen:
maxim_karki@reddit
The cognitive benchmarking space is honestly pretty fragmented right now, but there are some decent options beyond lm-evaluation-harness. For local testing, I'd actually recommend checking out OpenAI's evals framework, even though it sounds counterintuitive: it works great with local models through OpenAI-compatible API servers like vLLM or text-generation-webui's API mode. You can customize the cognitive tests much more easily than with the harness, and it handles multi-turn reasoning tasks much better.
What I've been doing lately is building custom eval suites that actually matter for my use cases rather than chasing leaderboard scores. Generic benchmarks like MMLU don't really tell you whether a model will be good at your specific reasoning tasks. At Anthromind we've found that creating domain-specific cognitive tests (even just 50-100 examples) gives you much better signal than running massive benchmark suites. For factual accuracy specifically, I'd suggest looking into retrieval-augmented evaluation setups where you test how well models reason over provided context vs. just hallucinating from training data.
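Roughly, the shape of this (not my exact setup; the base_url, model name, and toy questions below are placeholders for whatever you serve locally, e.g. via `vllm serve`):
```python
# Minimal sketch: score a handful of custom QA items against a local model
# served behind an OpenAI-compatible endpoint. Endpoint, model name, and
# the tiny dataset are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # whatever you loaded locally

# A domain-specific "cognitive" suite can start this small.
EVAL_ITEMS = [
    {"prompt": "If Alice is taller than Bob and Bob is taller than Carol, who is shortest?",
     "expected": "carol"},
    {"prompt": "What is 17 * 6? Answer with just the number.",
     "expected": "102"},
]

def run_eval() -> None:
    correct = 0
    for item in EVAL_ITEMS:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": item["prompt"]}],
            temperature=0.0,   # keep scoring as deterministic as possible
            max_tokens=64,
        )
        answer = resp.choices[0].message.content.strip().lower()
        if item["expected"] in answer:   # crude containment check
            correct += 1
    print(f"accuracy: {correct}/{len(EVAL_ITEMS)}")

if __name__ == "__main__":
    run_eval()
```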
Ashleighna99@reddit
Stop chasing leaderboards; build a small, repeatable eval pipeline that reflects OP's tasks and rerun it across models.
- For a friendlier harness, try Hugging Face's lighteval with vLLM or Ollama; it's simpler than lm-evaluation-harness and handles GSM8K/BBH/SQuAD subsets out of the box.
- For multi-turn reasoning, run FastChat's MT-Bench locally and judge with a stronger local model or a single remote pass.
- For factual accuracy in RAG, use Ragas and track faithfulness, context precision, and answer relevancy.
- Add a memory test: a simple needle-in-a-haystack doc plus multi-turn recall.
- Lock seeds and temperature/top_p, and use self-consistency (k=5) to reduce variance; normalize answers (regex, number equivalence) before scoring.
- For tracking, LangSmith or Weights & Biases give you run dashboards; DreamFactory has been handy for exposing ground-truth tables and eval prompts as quick APIs that my eval scripts can pull from alongside those tools.
Bottom line: a tight, domain-specific suite you can click-and-rerun beats generic scores.
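To make the answer-normalization and self-consistency points concrete, a rough sketch (plain Python; the helper names are just illustrative, not from any of the tools above):
```python
import re
from collections import Counter

def normalize(answer: str) -> str:
    """Normalize a free-form answer before exact-match scoring."""
    ans = answer.strip().lower()
    ans = re.sub(r"[^\w\s.\-]", "", ans)        # drop punctuation
    m = re.search(r"-?\d+(?:\.\d+)?", ans)      # prefer a number if one appears
    if m:
        num = float(m.group())
        # number equivalence: "102.0" and "102" score the same
        return str(int(num)) if num.is_integer() else str(num)
    return " ".join(ans.split())                # collapse whitespace otherwise

def self_consistency_vote(samples: list[str]) -> str:
    """Majority vote over k sampled answers (self-consistency, e.g. k=5)."""
    votes = Counter(normalize(s) for s in samples)
    return votes.most_common(1)[0][0]

# Example: five sampled completions for "What is 17 * 6?"
samples = ["102", "The answer is 102.", "102.0", "112", " 102 "]
pred = self_consistency_vote(samples)
print(pred == normalize("102"))   # True
```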