How do you benchmark the cognitive performance of local LLMs?

Posted by LastikPlastic@reddit | LocalLLaMA | 8 comments

Hey everyone,

I’ve been experimenting with running local LLMs (mainly open-weight models from Hugging Face) and I’m curious about how to systematically benchmark their cognitive performance — not just speed or token throughput, but things like reasoning, memory, comprehension, and factual accuracy.
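
To make "cognitive performance" a bit more concrete, this is the rough mapping from those dimensions to benchmark tasks I've been working from (task names are examples pulled from lm-evaluation-harness's task registry as I understand it, not a canonical list):

```python
# My rough mapping of "cognitive" dimensions to lm-evaluation-harness task names.
# These are examples only; check what your installed version actually exposes
# (the docs say `lm-eval --tasks list` prints the registry).
DIMENSION_TO_TASKS = {
    "reasoning":        ["arc_challenge", "gsm8k"],   # multi-step and math word problems
    "comprehension":    ["hellaswag", "boolq"],       # commonsense + reading comprehension
    "factual_accuracy": ["mmlu", "truthfulqa_mc2"],   # knowledge recall + truthfulness
    "memory":           ["lambada_openai"],           # long-range context prediction (closest stand-in I know of)
}
```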

I know about lm-evaluation-harness, but running it by hand for every model gets cumbersome fast (a rough sketch of my current loop is below). I'm wondering if there's a less manual way to do this, or a higher-level workflow people use to compare several local models at once.
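
For context, this is roughly the loop I've been running by hand, using what I understand to be the harness's v0.4-style Python API (model IDs, tasks, and settings here are just placeholders):

```python
# Sketch of my current workflow: same harness, several models, one results file.
# Assumes lm-evaluation-harness v0.4.x (lm_eval.simple_evaluate) and HF models
# available locally or via the Hub; model IDs below are placeholders.
import json
import lm_eval

MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.3",   # placeholder HF repo IDs
    "Qwen/Qwen2.5-7B-Instruct",
]
TASKS = ["arc_challenge", "gsm8k", "hellaswag", "mmlu", "truthfulqa_mc2"]

all_results = {}
for model_id in MODELS:
    out = lm_eval.simple_evaluate(
        model="hf",                                         # transformers backend
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=TASKS,
        num_fewshot=0,
        batch_size=8,
        device="cuda:0",
    )
    all_results[model_id] = out["results"]                  # per-task metrics

with open("local_llm_eval_results.json", "w") as f:
    json.dump(all_results, f, indent=2, default=str)
```

The pain point is mostly babysitting this per model and collating the JSON afterwards, which is why I'm asking about better workflows.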

Any suggestions, tools, or workflows you’d recommend?
Thanks in advance!