Top 10 Models on Humanity's Last Exam. Opus 4.6 is in the lead.
Posted by Ok_Presentation1577@reddit | LocalLLaMA | 9 comments
With the new release of Opus 4.6, here's the top 10 on HLE. I know they're just benchmarks and don't mean anything on their own, but it's still interesting to make comparisons when a new model comes out.
Post: I also really enjoyed reading the System Card Anthropic published on their blog; there you can find information on use cases like finance, cybersecurity, biology, etc.


Legitimate_Ideal228@reddit
ChatGPT, Gemini, ... is the shit for researchers!!! Claude Opus/Sonnet is the best!!!!
zball_@reddit
I feel HLE is actually BS and just a glorified BrowseComp
davikrehalt@reddit
I'm sorry, but the first chart mixes tools with no tools, according to your second image??
Ok_Presentation1577@reddit (OP)
Yes, the first graph is a mix; it only shows each model's highest score, regardless of tool use
Accomplished_Ad9530@reddit
Not for the GPT scores
Accomplished_Ad9530@reddit
Yeah, so Claude gets tools, GPT no tools, and who knows about each of the open weights models
ttkciar@reddit
It's gratifying to see so many open-weight models in the top ten.
Claude is on top, but not by a lot! Open-weight models are practically nipping at the commercial inference services' heels.
Dry_Yam_4597@reddit
Keep 'em coming baby.
davikrehalt@reddit
but I honestly think it is more of an indictment of the benchmark than anything else :(