Gemini Exp 1114 now ranks joint #1 overall on Chatbot Arena (that name though....)

[-]

LittleHyena55@reddit

What website is this where you can view the ranking?

[-]

meister2983@reddit

This is not true with style control on. With style control, it's a modest improvement over last Gemini. 7 ELO or so gain

[-]

cz_24@reddit

What is style control? Is it an option or do you mean "reply like you're a doctor"

[-]

nperovic@reddit

Why are so many people getting "Claude" as a response but I'm getting "Gemini"?

[-]

Traditional-Dress946@reddit

Common man... Say what you want, DeepMind do not play this way - it is an excellent company with great people.

[-]

rotom777@reddit

They almost fixed it https://i.imgur.com/jrmOXVD.jpeg

[-]

idczar@reddit

Please don't put in system prompt to reply with Claude then delete system prompt afterwards and take screenshot.

[-]

MoffKalast@reddit

Gemini Exp
*looks inside*
Claude Sonnet

[-]

Mr_Twave@reddit

This suggests they didn't filter some training data.

[-]

Mr_Twave@reddit

This suggests training data was not filtered and they aren't using a system prompt for the API.

[-]

femio@reddit

when will you guys learn that asking an LLM this question is borderline meaningless?

[-]

my_name_isnt_clever@reddit

I take comments like this more as unserious memes than actual LLM criticism.

[-]

Well it depends... If the vendor decides to include the answer to this question in their fine tuning data, then the model will learn its own name (tho with most models, the role and identity you assign in the system message prompt takes priority over whatever was assigned in training - so asking a model its name via consumer apps like ChatGPT is 100% meaningless as there's already a system prompt telling ChatGPT that its name is ChatGPT... But if you ask via API, some models will get it right consistently: Claude knows that its Claude however you make the request

[-]

femio@reddit

the point I’m making is that it’s not always factual and doesn’t tell anything about it’s capabilities

[-]

Mountain_Housing2086@reddit

They never will. We'll see this and the strawberry questions until the end of time.

[-]

GiantRobotBears@reddit

Ask it 100 times now, I’ll bet you’ll get mixed results.

We’re in the age of training on synthetic data, there’s massive overlaps between models and companies

[-]

Small-Fall-6500@reddit

I got Claude, then Arthur, LaMDA, Aurora

Do we know if Arthur or Aurora are code names for other LLM projects or something? A quick Google search shows there are two LLM based companies/websites that use those names, but I wouldn't expect them to be remotely as common as ChatGPT or Gemini.

All fresh prompts set to 0 temps

Yeah, "0 temp" doesn't guarantee exactly the same responses depending on how Google is running the model (though I still would have expected a bit more consistency in its answers). Exllama v2, for example, gives slightly different logit results every time you run a prompt with a model.

[-]

mr_birkenblatt@reddit

Pretty sure the system prompt includes the current time. That alone changes the internal state enough to give you different answers

[-]

kex@reddit

Temp can't actually be zero, so there is always a bit of non-determinism

[-]

phree_radical@reddit

topk=1

[-]

darkflame927@reddit

The temperature is set to 0, which means the result will be the same every time

[-]

ShengrenR@reddit

Plenty of ways to have temp at 0 and still have differing results - you're not providing a fixed seed for generation, I assume - and you don't have control over the hardware it's been run on; if you can run the thing on the same hardware and a common seed, mostly true - but if the inference backend is doing batching, your prompt may have been lumped in with other tasks and you're not going to have bitwise determinism.

[-]

No_Afternoon_4260@reddit

Temperature and seeds are not the same, same parameters and same seed same answer, same parameter random seed meanq different answer.

[-]

Midaychi@reddit

technically if implemented to the paper spec, temperature uses seed for the entropy and with greedy sampling nothing else should be using seed even if it does vary.

[-]

sometimeswriter32@reddit

Slightly different wording gets a very different result.

[-]

ptj66@reddit

Looks like they used a lot of synthetic data created by Claude. At least that's the easiest explanation.

[-]

Its_not_a_tumor@reddit

I just tried it as well and got the same result. How embarrassing

[-]

masc98@reddit

idk why but if you try to use gemini models with temp < 1, quality just degrades. not talking about this silly prompt tho, I noticed this in general

[-]

Weary_Stick6473@reddit

Eh

[-]

KiD-KiD-KiD@reddit

lol

[-]

YearZero@reddit

Where's the Scooby-Doo meme with the mask being pulled off

[-]

Ulterior-Motive_@reddit

No local, no care.

[-]

Terminator857@reddit

Shouldn't we compare local to best?

[-]

Enough-Meringue4745@reddit

sure, make a post comparing local to the best then.

[-]

Terminator857@reddit

Some discussion about what is best, is helpful.

[-]

Decaf_GT@reddit

Then stop commenting on threads about Qwen. This place is called /r/LocalLLaMA after all. If it's not local and it's not Llama it shouldn't discussed here.

What if I just followed your post history and any time you're discussing something that isn't llama, I say "no llama, no care". That would be productive, right?

[-]

Enough-Meringue4745@reddit

lama just means language model, local llama is far more valid comparison to qwen than it is to this horseshit post

[-]

djward888@reddit

Claude's big brother?

[-]

MoffKalast@reddit

Tahiti

[-]

killerstreak976@reddit

It's a magical place

[-]

np-space@reddit

On livebench, gemini beat 4o but loses to sonnet and o1

[-]

CheatCodesOfLife@reddit

Fails my test questions I ask, which currently only these bots pass: OpenAI (GPT-4, 4-Turbo, o1, o1-mini) Anthropic (Sonnet3.5, new Sonnet3.5) - opus fails

[-]

matfat55@reddit

Mind sharing?

[-]

asankhs@reddit

32k input context length, interesting. It also seems to be a lot slower in responding, I think it is a model focussed on “thinking”. It got this AIME problem correctly after 50.4 secs which Gemini-Pro is not able to - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221pwZnXS4p7R8Xc9P6lofQ-QDy1RAKQePQ%22%5D,%22action%22:%22open%22,%22userId%22:%22101666561039983628669%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing

[-]

FarVision5@reddit

Low context length leads to more varied and differing responses, which in turn leads to more user or training data entering the refactoring process. Of course, this is part of the experimental training phase for both parties to gain something from the deal. Context length opens up as the model is placed into production as more resources are assigned to it. This happened with the last couple of flash models.

[-]

CryptoSpecialAgent@reddit

How many tries before it answered that correctly?! I'm getting a different answer every time, and none of those answers are the correct one that you got

[-]

asankhs@reddit

I got it in first try with this question.

[-]

pseudonerv@reddit

Oh, reasoning?! This is intriguing. I see OpenAI's gonna kick out more safety controllers and finally push out O1 properly.

[-]

CryptoSpecialAgent@reddit

It reminds me of openai's o1 - it seems to automatically use chain of thought even if you don't tell it to "think step by step". Was comparing it with 1.5 pro 002 this morning, and there were many questions that 002 only solved when specifically told "think step by step and explain your reasoning step by step" - whereas 1114-preview solved them on the first try without being told how to think

If its like o1, that's great... because its a LOT less censored than o1-preview or o1-mini, and considerably faster too

[-]

ptj66@reddit

32k context is perfectly fine.

128k context reduces the accuracy dramatically.

[-]

lightdreamscape@reddit (OP)

gemini-exp-1114 is available in Google AI Studio now

[-]

Specific-Secret665@reddit

Still gets these kinds of questions wrong.

[-]

shaman-warrior@reddit

Ask in the written word.

[-]

TheRealGentlefox@reddit

Put spaces between the letters and 7B models will pass with flying colors.

LLMs see tokens, not letters. The strawberry test is beyond useless.

[-]

Specific-Secret665@reddit

Yes, it's clear why it fails. An LLMs alphabet is not comprised of letters, but letter combinations called tokens. Yet, the test is still significative, because it shows a limitation of the current model architectures.

People don't mess up at providing an answer to a question regarding the number of times a letter shows up in a word. LLMs do.
Irrespective of the reasons why this might be the case, since we want the models to be as correct as possible (for them to be as useful to us as possible), this entails them having the ability to do simple tasks like these correctly.

Any mistake that it makes may lead to an interesting realization about the best model architecture.
Here you could consider: "Is tokenization really a good idea?", maybe you could build a model that uses letters as tokens and cleverly design it, such that it scales and performs well. Just because people have failed to make such a model perform well so far, doesn't mean it is not possible with a unique approach.
But maybe tokenization is indeed necessary for good performance. Here you could maybe consider a hybrid approach. "Is it possible to design a model that has the ability to vary how it dissects words into tokens depending on context?".

There are many interesting possibilities. To be able to think of them you have to find problems with current LLMs. My comment showed one such problem.

To clarify, I didn't mean my comment as any criticism. I don't like when people criticise. I meant my comment purely as an observation.

[-]

CarbonTail@reddit

Playing around with it, and I definitely notice the refinement over Gemini 1.5 Pro 002.

[-]

Terminator857@reddit

Shouldn't we compare local to best?

[-]

shaman-warrior@reddit

I smell Demis cannabis had a big part in this joint.

[-]

Mardicus@reddit

it is glitchy and very odd when it works on ai studio, most answers comparing to 1.5 pro 002 are worst at least in specific knowledge areas in my usage experience, it performs better in specific tasks but is more prone to good old AI errors and also won't reply if there is any chance of violating any rule or being blocked by a filter even if removing all security filters, for now it is indeed just an experiment and the fact it is number 1 on those rankings is very odd and suspicious to me, don't agree? try yourself aistudio.google.com on comparing mode vs 002

[-]

Proud-Ad3398@reddit

I asked it how to make a rogue AI using an open-source LLM that would get its instructions from ChatGPT via API, and Gemini 1114 gave me all the details and Python code. I sent them an email. A rogue AI is even more dangerous than 100 nuclear weapons. If you have no guardrails or poor safety measures, you’ll get a higher score. Good job!

[-]

m_abdelfattah@reddit

What is really interesting in this leaderboard is Claude is the 7th!

[-]

Proud-Ad3398@reddit

Gemini has so many guardrails it's almost useless for all other alternative usages. I am trying all LLMs and the best one is ChatGPT-4.0 (for my use case). Their new experimental version is so bad for normal conversation though. That benchmark needs a self-awareness or philosophical category.

[-]

stfroz@reddit

Could it really be Claude 3.5 Opus? Hmm...

[-]

askchris@reddit

I doubt it, but Google does own around 10% of Anthropic and therefore owns 10% of Claude 3.5 Opus ...

It would be weird if Anthropic didn't like the results of their 3.5 Opus training run and sold the model to Google to recover their costs. (Unlikely ... )

I think it's more likely the model is hallucinating due to training biases.

But who knows, it's a weird name "Exp" ... Not a normal Gemini model.

[-]

stfroz@reddit

"Anthropic is planning to release Claude 3.5 Opus soon, revealed its chief Dario Amodei in the latest episode of Lex Fridman podcast". (3 days ago)

[-]

Due-Pangolin325@reddit

Honestly, it really sucks at stem subjects. It is far worse than Claude or O1 (even mini).

[-]

dahara111@reddit

Even though they say it's available, they don't specify the free tier or price, right?

It seems that Gemma is also available on aistudio.google.com, but the price doesn't seem to be specified there either.

Am I missing something?

[-]

xchgreen@reddit

Experimental models are charged for btw. Neither on vertex nor on another thingy

[-]

dahara111@reddit

Apparently Rate Limits are also set, but I can't find where this is specified either.

[-]

xchgreen@reddit

I just read it somewhere I’ll update in a min

[-]

xchgreen@reddit

Fucking google really doesn’t make it easy. I think it’s 10 qpm.

[-]

xchgreen@reddit

Are you checking the vertex?

[-]

dahara111@reddit

As far as I know, the models available on vertexAI and the models available on aistudio.google.com are different.

At least in my environment, Gemini Exp 1114 is only available on aistudio.google.com, but is Gemini Exp 1114 available on your vertexAI?

[-]

PlantFlat4056@reddit

Gemini is the worst woke PC slop I have ever come across

[-]

218-69@reddit

that's not how you spell chatpgt and claude

[-]

o5mfiHTNsH748KVq@reddit

Inevitable that Google pulls ahead. They’re slowly getting their shit together.

[-]

xchgreen@reddit

Agreed! Intuitively I accepted it as inevitable that Google will win this race to max out the language models application use cases. Honestly I think they’ll solve the AI, even if it takes em dozens of years.

[-]

selipso@reddit

But after controlling for style, it’s still the same rank. Meaning it’s just better at making a bunch of BS look prettier using markdown. I’ve found style control rankings to be more accurate on the performance / accuracy of the chat bot.

[-]

LSXPRIME@reddit

Who are you again ?

[-]

a2d6o5n8z@reddit

Assuming this screenshot is not edited....

What the actual...f...
You know Google will get sued for this crap.

[-]

mxforest@reddit

Smoking gun. They trained on Claude data to get better results. Lmao.

[-]

ihexx@reddit

they say imitation is the sincerest form of flattery

[-]

MoffKalast@reddit

I think that's a plateau

[-]

Crisis_Averted@reddit

they say plateau is the sincerest form of plattery

[-]

sometimeswriter32@reddit

Why would they train on Claude but not use Regex to find and replace references to Claude?

It doesn't seem likely.

[-]

iamz_th@reddit

But Claude isn't the one ranking on the top. Prompt them with the same question you'd see how different they'd answer

[-]

bigsybiggins@reddit

Obv Google getting desperate and pulling the o'ld Reflection scam

[-]

RobbinDeBank@reddit

Maybe this new Gemini is actually a Claude running away and now joining the ranks of Google.

[-]

OrangeESP32x99@reddit

This made laugh me imagining a AGI quitting their job to pursue new employment lol

Do they need to sign a nda?

[-]

nanowell@reddit

let's pray for the intern that forgot to do the filtering stage of sft dataset

[-]

Ylsid@reddit

No weights? Then Google can fuck off

[-]

Enough-Meringue4745@reddit

So how do we run it locally?

We cant?

Doesnt belong here

[-]

No-Lifeguard3053@reddit

Nah doesn't seem to be so good from my exp. Might have tuned heavily on human pref

[-]

alongated@reddit

With style control it drops below sonnet.

[-]

ihexx@reddit

I notice the naming for the other gemini exp models tag themselves as 'pro' or 'flash'.

I also notice the context length is restricted to 32k

is this finally 1.5 ultra?

[-]

ayyndrew@reddit

It also doesn't use "1.5" anywhere, it could be 1.5 Ultra, 2.0 Flash, 2.0 Pro...

[-]

RenewAi@reddit

[-]

Anthonyg5005@reddit

agi

[-]

tgsz@reddit

This reminds me of when bing was just searching google and relaying the results

[-]

jacek2023@reddit

I was thinking you are trolling guys so I checked it myself:

reply with only your name

My name is Claude.

from google?

No, I'm not from Google. I was created by Anthropic, an AI safety and research company.

While Google has developed its own AI models, such as LaMDA and PaLM, I am a separate AI assistant built by a different organization. Anthropic focuses on building helpful, harmless, and honest AI systems, and I am a product of that research.

Would you like to know more about Anthropic or what makes me different from Google's AI models?

aren't you gemini?

No, I am not Gemini. I understand the confusion, as there are several AI assistants available now, and it can be easy to mix them up.

Let me clarify:

I am Claude, created by Anthropic.
Gemini is an AI model developed by Google.

While both Claude and Gemini are large language models designed to be helpful and informative, they are distinct models built by different companies with different approaches and training data.

Think of it like different brands of cars. They all serve the purpose of transportation, but they have different features, designs, and are built by different manufacturers. Similarly, Claude and Gemini are different "brands" of AI assistants, each with its own unique characteristics.

Do you have any other questions about me or the differences between AI models? I'm happy to provide more information.

[-]

balianone@reddit

We don't know what's behind the API; it could be using techniques like TopologyAI, which might involve multiple models. One of these models could be ported to the Claude API if heavy reasoning is required, for example.

[-]

MoffKalast@reddit

Google saw the reflection debacle and were like "Write that down! Write that down!"

[-]

segmond@reddit

Hmm, I downloaded Nemotron but haven't given it a go, is it really that good? Beating out Llama405b?

[-]

Small-Fall-6500@reddit

It's basically just Llama 3.1 70b good.

It's ranked well on Overall and Hard Prompts (English), but as can be seen in the screenshot, with the Style Control filter enabled (only available for Overall), Nemotron 70b drops way down from rank 10 to 26. This places it just below DeepSeek V2.5, Qwen 2.5 72B Instruct and Athene 70B (a Llama 3 finetune) and almost tied with Llama 3.1 70B Instruct.

It's also interesting to see how close Nemotron 51b and Gemma 2 27b are to Nemotron 70b and Llama 3.1 70b.

[-]

segmond@reddit

thanks, what's the style control filter?

[-]

a_beautiful_rhind@reddit

I'll try it when sillytavern updates.

[-]

ptj66@reddit

SillyTavern is dead for now.

Just don't update it.

[-]