Benchmarks and evals

Posted by selund1@reddit | LocalLLaMA

How are people running evals and benchmarks currently?

I've mostly been pulling datasets from papers (really their GitHub repos) and Hugging Face, and I've ended up with a bunch of spaghetti Python as a result. Looking for something better.

It seems like there are a million ways to do anything in this space, and I'd rather hear about real experiences from the community than read some hype-fueled article or marketing materials.