Top 10 Models on Humanity's Last Exam. Opus 4.6 is in the lead.

Posted by Ok_Presentation1577@reddit | LocalLLaMA | 9 comments

With the new release of Opus 4.6, here's the top 10 in HLE. I know they're just benchmarks and don't mean anything on their own, but it's still interesting to make comparisons when a new model comes out.

Post: I also really enjoyed reading the System Card Anthropic published on their blog. There you can find information on use cases like finance, cybersecurity, biology, etc.

[-]

Legitimate_Ideal228@reddit

ChatGPT, Gemini, ... is the shit for researchers!!! Claude Opus/Sonnet is the best!!!!

[-]

zball_@reddit

I feel HLE is actually BS and just a glorified BrowseComp.

[-]

davikrehalt@reddit

I'm sorry, but the first chart mixes tool and no-tool results, according to your second image??

[-]

Ok_Presentation1577@reddit (OP)

Yes, the first graph is a mix; it only shows the highest scores, regardless of tool use.

[-]

Accomplished_Ad9530@reddit

Not for the GPT scores

[-]

Accomplished_Ad9530@reddit

Yeah, so Claude gets tools, GPT no tools, and who knows about each of the open weights models

[-]

ttkciar@reddit

It's gratifying to see so many open-weight models in the top ten.

Claude is on top, but not by a lot! Open-weight models are practically nipping at the commercial inference services' heels.

[-]

Dry_Yam_4597@reddit

Keep 'em coming baby.

[-]

davikrehalt@reddit

But I honestly think it is more of an indictment of the benchmark than anything else :(