TheaterFire

examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 22 comments

now you can evaluate your models at home

Reply to Post

22 Comments

llama-impersonator@reddit

it would be nice if lcpp supported echo so lm-eval could work directly without some bs transformer integration.
View on Reddit #85870424

Organic_Scarcity_495@reddit

having a standardized eval script inside llama.cpp itself is great. saves everyone from setting up their own janky benchmark pipeline that measures different things
View on Reddit #85845433

Luigi_Boy_96@reddit

Lol, I was literally spending last week creating my own benchmarking repo with some nut tasks to see how fast and how accurate the models were. At least it was a fun experiment to see how some models reason.
View on Reddit #85864288

ttkciar@reddit

Can attest to the truth of this, having written my own janky benchmark pipeline that measures weird things.
View on Reddit #85847358

lumos675@reddit

I really don't care about the output time. Cause think about. Maximum how many line of code you need to write in one go? 3000 lines? Still it's not as time consuming as prefill of 150k context.
View on Reddit #85858007

wektor420@reddit

Good find, something similiar for vllm would be cool
View on Reddit #85855924

StorageHungry8380@reddit

-c 4194304 -np 256 That's not your grandpa's GPU... Not that it requires it, just... not the parameters I run at home. Very cool addition, been wanting to run benches easily at home while tinkering.
View on Reddit #85830832

spaceman_@reddit

Likely running on CPU, given the high `np` value, no?
View on Reddit #85834413

PANIC_EXCEPTION@reddit

why big gpu when many cpu do trick?
View on Reddit #85847539

fiery_prometheus@reddit

Someone like him likely has donated datacenter GPUs, can't imagine he wouldn't have those at this point
View on Reddit #85846191

perkia@reddit

The real home was the datacenters we slept in along the way
View on Reddit #85836966

coherentspoon@reddit

Thanks for making us aware.
View on Reddit #85844211

TheBlueMatt@reddit

Hopefully this leads to more formal (even if benchmaxxed) results for quantized models - just looking at divergence may or may not capture the quality of a quantization fully and this might help.
View on Reddit #85841430

Eyelbee@reddit

Doesn't seem very good.  Isn't aime datasets proprietary? Also why do we need llm as a judge for aime? Catn't see the loglikelihood scoring too
View on Reddit #85836255

computehungry@reddit

Oh this is nice. Although it might look trivial, when I tried to bench some models, I found that so many benchmarks just ask for "API_KEY" without any (local) server option. Sure it's not too hard to vibe-hook them, but still pretty great to have out of the box.
View on Reddit #85828002

Zc5Gwu@reddit

I hope it brings a little more rigor to people’s vibes about different quants. 
View on Reddit #85835795

ketosoy@reddit

Having fought with lm-eval for many days, I look forward to having an eval tool with some gg level elegance.
View on Reddit #85828819

Dany0@reddit

ggs my friend
View on Reddit #85834833

Chromix_@reddit

"now you can evaluate your models at home" -> now you can heat your home ;-) (Maybe slightly less when [restricting power usage](https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/) and undervolting a bit) It's also nice that there is now a single, fixed way of evaluation. No more oddness with everyone adapting an existing benchmark to local models in a different way, running it with different versions of dependencies, and so on. The scores of the same model differed quite a bit, depending on how it was evaluated, as I found with the SuperGPQA benchmark, and I'm not even talking about the regular variation between runs here.
View on Reddit #85834658

Far-Low-4705@reddit

this was very very much needed
View on Reddit #85834648

RIP26770@reddit

Dope 😎
View on Reddit #85834096

a_beautiful_rhind@reddit

The tests take a while but it's a good benchmark to see if your LLM is underperforming. I had to reduce the simultaneous requests from the ridiculous number it does by default.
View on Reddit #85833159