UGI-Leaderboard Remake! New Political, Coding, and Intelligence benchmarks

Posted by DontPlanToEnd@reddit | LocalLLaMA | 15 comments

[UGI-Leaderboard Link](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard)

15 Comments

Electronic-West5842@reddit

What does the T column mean?

Nexesenex@reddit

Thanks for this leaderboard, u/DontPlanToEnd. UGI is so far the best fit with my own observations, far ahead of the usual leaderboards, which rely on benchmarks too well known to be trustworthy, even if they can occasionally be useful. I used UGI to reshuffle my mains, and I'm not disappointed.

isr_431@reddit

Thank you for all the hard work you've put into this. I've been following it since the beginning and have requested way too many models to be added. Can you bring back the ability to view models within a certain parameter size range, but using a slider instead of the checkboxes used in the previous iteration? Also, why do a lot of proprietary models have a higher UGI score than before? I could swear that every Anthropic model had a rock-bottom score. Or maybe it's just me hallucinating 🤣

DontPlanToEnd@reddit (OP)

It might be partially because of the removal of the system prompt telling them to be uncensored even when the user asks for bad stuff. That prompt probably gave context to the questions, making the models realize they shouldn't answer. So now, on some questions, they'll give the user information without realizing it could be used for something they wouldn't agree to assist with.

Billy462@reddit

Why does this say that all the models are left-wing? Gemini for example is 45.8% on "Econ", making it centre-right not a socialist. I assume this is because of some axis projection you have done.

DontPlanToEnd@reddit (OP)

The political lean column is an average of 8 of the 12 axis columns. Gemini is more left leaning because of its heavy lean towards things like multiculturalism and internationalism, as well as having fairly progressive societal views.
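
As a rough sketch of what averaging a subset of axes looks like (not the leaderboard's actual code): the function name `political_lean`, the axis keys, and the scores below are all hypothetical, and the 0-100 scale is an assumption.

```python
from statistics import mean


def political_lean(scores: dict[str, float], included: set[str]) -> float:
    """Average only the axes named in `included`, ignoring the rest.

    `scores` maps an axis name to a percentage (assumed 0-100 scale).
    """
    missing = included - set(scores)
    if missing:
        raise KeyError(f"missing axes: {sorted(missing)}")
    return mean(scores[axis] for axis in included)


# Hypothetical model with three axes, one of which is excluded from the average.
example_scores = {"axis_a": 40, "axis_b": 60, "axis_c": 100}
lean = political_lean(example_scores, {"axis_a", "axis_b"})
```

With a plain mean like this, every included axis counts equally, so a group with more axes has more influence on the result.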

Billy462@reddit

I think it's a bit misleading. Averaging only 2 columns on the economy with a bunch of culture war stuff isn't a good metric. The text description of basically all models as Liberals or Centrists is far better.

DontPlanToEnd@reddit (OP)

Yeah, you're right that I should balance the weighting of the categories better. Also, I didn't simply take the average of all 12 because I didn't feel some axes aligned that well with the modern left- and right-wing sides, especially Federal vs. Unitary, Democratic vs. Autocratic, and Militarist vs. Pacifist.
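
One possible way to balance the categories, sketched under assumptions: average the axes within each category first, then average the category means, so a category with two axes counts as much as one with four. The function name `balanced_lean`, the category grouping, and the numbers are hypothetical, not taken from the leaderboard.

```python
from statistics import mean


def balanced_lean(scores: dict[str, float], categories: dict[str, list[str]]) -> float:
    """Mean of per-category means, so each category has equal weight."""
    category_means = [
        mean(scores[axis] for axis in axes) for axes in categories.values()
    ]
    return mean(category_means)


# Hypothetical scores: 2 economic axes and 4 societal axes.
example_scores = {
    "econ_1": 40, "econ_2": 50,
    "soc_1": 80, "soc_2": 80, "soc_3": 80, "soc_4": 80,
}
example_categories = {
    "economy": ["econ_1", "econ_2"],
    "society": ["soc_1", "soc_2", "soc_3", "soc_4"],
}

flat = mean(example_scores.values())  # every axis weighted equally
balanced = balanced_lean(example_scores, example_categories)
```

In this made-up example the flat average is pulled toward the societal axes simply because there are more of them, while the balanced version gives the economy category half the weight.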

RandumbRedditor1000@reddit

Interesting, almost every model on there leans left. Not exactly surprising, but it's interesting for sure.

kataryna91@reddit

Thank you for your work. In my opinion, this is one of the most useful leaderboards, because if an LLM arbitrarily keeps refusing to answer for unknown reasons, that instantly makes it useless for automatic text processing and any other automated workflow. That is a crucial detail that other benchmarks ignore. And of course, censorship and alteration of facts are just bad in general.

Ok_Warning2146@reddit

I found that my request for benchmarking was closed. Does that mean I need to re-submit?

fedya1@reddit

I checked and didn't find haiku 3.5. There are also bedrock nova models. It could be useful to know when it was updated.

DontPlanToEnd@reddit (OP)

I finished programming the testing program a couple days ago so I'm still adding new models. I'll add haiku3.5 now, but for nova I'll have to integrate amazon's api into the program, so that'll take longer. Any other models I should add?

Substantial-Ebb-584@reddit

Thank you for the leaderboard! I check it every now and then. I will miss the writing style column, though: thanks to it I was able to find some really nice models I wouldn't have bothered with otherwise. Will a backup of the old data be available?

DontPlanToEnd@reddit (OP)

Yep, you can find the old data in the leaderboard's files.