Gemini 2.0 Flash beating Claude Sonnet 3.5 on SWE-Bench was not on my bingo card

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 169 comments


[-]

Barry_Jumps@reddit

Have not used it for coding yet, but for reasoning over long discussions about really understanding a particular topic it's hands down the best model I've ever used. Its attention to detail is amazing and I frequently found myself surprised by how it could loop back to a particular point in a discussion tens of thousands of tokens prior.

Reply

[-]

SystemEastern763@reddit

yeah check the new update from today, there are new players in town

Reply

[-]

estebansaa@reddit

It also provides a context window several times bigger, and it destroyed both o1 and Claude.

Reply

[-]

ForsookComparison@reddit

Of all the companies to rule the future, I *REALLY* don't want it to be Google

Reply

[-]

Milkybals@reddit

Better than OpenAI honestly, Google actually contributes to open source and pioneered it with the transformer paper in the first place

Reply

[-]

nojukuramu@reddit

Yeah, and their FREE TIER of the Gemini API is almost unlimited use. RPM is the only thing that limits the usage, but it's still forgiving for a free tier.

Reply

[-]

learn-deeply@reddit

It's only free when they're the underdog. If Gemini ever becomes better than ChatGPT or Claude, then they'll charge as much as people can bear.

Reply

[-]

Then-Task6480@reddit

Well I mean... They didn't do that with Gmail, or Google. Or Chromium. Or Chrome. Granted they were data mining but who ~~wasn't?~~ isn't? This idea that Google will inevitably raise prices just because others have feels off. When Gmail launched, they disrupted the market by offering way more for free than anyone else and they’ve never charged for it. Assuming Google will follow the pricing strategies of frontier LLMs feels like conflating two completely different approaches to market dominance. Google doesn’t need to raise prices... they monetize differently.

Reply

[-]

bunny_go@reddit

You never needed to pay for advertisement on any Google platform. Good for you. When you pay $5-$50 *for a single click* and still Google reps are shitting on your face from a third-world country, then you'll really learn how evil Google actually is.

Reply

[-]

Then-Task6480@reddit

Ooh true. I guess they are the only company that acts like this??

Reply

[-]

bunny_go@reddit

Correct. The advertising experience with anyone else, including Facebook but especially Reddit, is good to great. Google is your worst enemy because they know there is nothing you can do about them. Yeah, let's not let them build their monopoly further.

Reply

[-]

Then-Task6480@reddit

My experience with FB was bad. But I hear ya

Reply

[-]

learn-deeply@reddit

All of the examples you gave cost Google ~0 margin, and are focused on gathering data from consumers. Gemini is a developer-focused product, and should be compared with Google's cloud offerings, many of which started off free or cheap but have significantly increased in price. E.g. the Google Maps API.

Reply

[-]

bunny_go@reddit

I wish more people would learn this side of Google.

Reply

[-]

Then-Task6480@reddit

Ok that's a good point. I was thinking of it more from a consumer perspective since that is the current model. I also think lots of the old paradigms are shifting and the lines are becoming blurry, so it's hard to predict anything based on past experience imo.

Reply

[-]

BoJackHorseMan53@reddit

Like how Anthropic and OpenAI increased their prices?

Reply

[-]

learn-deeply@reddit

Yes.

Reply

[-]

Decaf_GT@reddit

Given how good Flash 2.0 is it is absolutely nuts that they are giving it away like this with basically no limits at all for personal non-business users.

Reply

[-]

i_am_fear_itself@reddit

Is it nuts though? I suspect they have a significant mind share hole to dig out of. Their preceding models weren't very good.

Reply

[-]

Jon_vs_Moloch@reddit

How dare you talk about Gemma 2 that way

Reply

[-]

Snoo33107@reddit

Just curious, what do you think Gemma 2 is best used for?

Reply

[-]

Jon_vs_Moloch@reddit

…What is any intelligence best used for? I think you meant to ask a different question.

Reply

[-]

swiss_aspie@reddit

Gemma 2 has served me well until last month when I switched from local inference to a service (because the GPU fan noise was getting out of hand)

Reply

[-]

thelibrarian101@reddit

They have 1,500 RPD (requests per day) limits for everything except embeddings tho.
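
For anyone scripting against the free tier, here is a minimal client-side sketch of respecting both a per-day and a per-minute cap. Only the 1,500 RPD figure comes from this thread; the per-minute value, the class and function names, and the rolling-window behaviour are all assumptions for illustration (the real quota may reset on a fixed schedule instead), so treat it as a sketch and not as how Google enforces limits.

```python
import time
from collections import deque
from typing import Deque, Iterable, Optional


class SlidingWindowLimiter:
    """Allows at most max_requests within any rolling window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._sent: Deque[float] = deque()  # timestamps of recorded requests

    def has_capacity(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window, then check the count.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        return len(self._sent) < self.max_requests

    def record(self, now: float) -> None:
        self._sent.append(now)


def try_request(limiters: Iterable[SlidingWindowLimiter],
                now: Optional[float] = None) -> bool:
    """Record a request only if every limiter still has room; otherwise record nothing."""
    if now is None:
        now = time.monotonic()
    limiters = list(limiters)
    if all(limiter.has_capacity(now) for limiter in limiters):
        for limiter in limiters:
            limiter.record(now)
        return True
    return False


if __name__ == "__main__":
    ASSUMED_RPM = 10  # hypothetical value, not taken from the thread
    per_day = SlidingWindowLimiter(1500, 24 * 60 * 60)  # 1,500 RPD as quoted above
    per_minute = SlidingWindowLimiter(ASSUMED_RPM, 60)
    print(try_request([per_day, per_minute]))
```

Checking every limiter before recording anything keeps a rejected call from eating into the daily budget.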

Reply

[-]

delicious_fanta@reddit

They are also working as hard as they can to force the entire world to be exposed to malicious attacks via malware delivered through web advertising, by removing adblock capabilities from the browser with the largest market share, by far, on earth. They also fight tooth and nail against all consumer privacy rights. Etc. Companies that large are not working for the best interest of anyone but themselves.

Reply

[-]

Kartelant@reddit

> removing adblock capabilities

This is disinformation btw. Manifest V3 actually *adds* functionality for adblockers in the form of filter lists that are run by the browser instead of by a service worker. uBlock Origin Lite uses this and blocks 95%+ of ads on MV3. If Google wanted to kill adblockers they're doing an extremely fucking terrible job. There are other changes that address legitimate security concerns (such as executing remote code and giving extensions read/write perms to every site you ever visit ever) that interfere with certain features of certain adblockers like UBO, hence the separate "lite" version. This is very far from "killing" adblockers.

Reply

[-]

ConvenientOcelot@reddit

uBOL looks very neutered, where is the advanced mode? If this is all Chrome can run, this is not "adding functionality".

Reply

[-]

StyMaar@reddit

> There are other changes that address legitimate security concerns

This is Google's version of “think of the children”. A bit like when Microsoft pushed for SecureBoot in the name of security to make installing Linux harder on computers while adding no practical security whatsoever, because it's trivially bypassed.

Reply

[-]

Kartelant@reddit

Are you actually sitting here and telling me that browser extensions being able to execute remote unreviewable code presents an insignificant security risk? [280 million people installing dangerous extensions](https://www.forbes.com/sites/daveywinder/2024/06/24/280-million-google-chrome-users-installed-dangerous-extensions-study-says/) according to a study, does that not present a sufficient incentive to do something like deny *remote code execution* as a default capability? Jesus.

Reply

[-]

StyMaar@reddit

Browser extensions have had arbitrary execution capabilities for years, and they have been sandboxed in every browser since the end of the XUL-based extension model a decade ago. And if you disregard sandboxing, anyone running JavaScript is executing remote code already… Browser extensions ought to be able to do stuff, otherwise they are useless. And the way around malicious extensions isn't a vain attempt at reducing the attack surface (as long as your extension has any ability to do useful stuff, it will have the exact same ability to do malicious stuff); the solution is curation of the extension marketplace! But Google is notorious for refusing all kinds of curation (same for its ads marketplace, which has been delivering malware for two decades now…)

Reply

[-]

trololololo2137@reddit

You conveniently forget about issues with limited block list capacity on MV3 and how much less powerful the filters are [https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-asked-questions-(FAQ)#filtering-capabilities-which-cant-be-ported-to-mv3](https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-asked-questions-(FAQ)#filtering-capabilities-which-cant-be-ported-to-mv3)

Reply

[-]

Kartelant@reddit

There are many things not supported yes, I could have been more precise. The limited block list capacity hasn't been an issue for years though. From the same FAQ: https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-asked-questions-(FAQ)#is-the-limit-on-maximum-number-of-dnr-rules-an-issue Doesn't really affect the point though. Features like dynamic filter lists didn't make or break uBO. Adblockers aren't dying.

Reply

[-]

kvothe5688@reddit

Demis Hassabis and Sergey Brin any day compared to Sam Altman

Reply

[-]

Any-Demand-2928@reddit

OpenAI would probably be worse. Google wants to maintain the status quo and would be willing to slow down development (if safety is what you're worried about). OpenAI will go full blitz into the storm for an extra dollar in their pocket. Also, Altman can't be trusted at all.

Reply

[-]

eposnix@reddit

"Full blitz".. really? They waited almost a year just to release Sora and o1.

Reply

[-]

scientiaetlabor@reddit

OpenAI feels like it's blitzing to try and establish an IPO before investors fully realize the marketing hype advance that bolstered them is beginning to dissipate.

Reply

[-]

animealt46@reddit

Nobody will "rule" the LLM market because nobody has a moat. If you don't want it to be Google, there will always be a competitor that matches within months.

Reply

[-]

whyme456@reddit

what alternative do we have?

Reply

[-]

Decaf_GT@reddit

Probably another "little-tech" company that we'll cheer on as the independent darling of the tech world for the next 10-15 years, at which point they'll become "big-tech" and we'll turn on them and the cycle will continue. Sorry for the cynicism, just being honest.

Reply

[-]

user0069420@reddit

Enshittification

Reply

[-]

Decaf_GT@reddit

That word has nothing to do with what I said?

Reply

[-]

acc_agg@reddit

Qwen.

Reply

[-]

ElderberryNo9107@reddit

And DeepSeek.

Reply

[-]

robberviet@reddit

You mean Alibaba, does it sound any better?

Reply

[-]

acc_agg@reddit

Yes, that totalitarian regime is safely across an ocean.

Reply

[-]

j03ch1p@reddit

Bruh

Reply

[-]

RevolutionOn@reddit

SSI

Reply

[-]

matadorius@reddit

We have about 6-7 companies competing; that's the best we've had so far. In the past it was iOS vs Android, Microsoft vs Apple, Nvidia vs and Intel vs, etc. We are probably at one of the best times for tech.

Reply

[-]

kppanic@reddit

MGM Studios

Reply

[-]

lazazael@reddit

future?

Reply

[-]

robberviet@reddit

Lol, it always will be Google.

Reply

[-]

cloverasx@reddit

I was thinking I read it only has a 128k context window, which surprised me considering the 2m window for other models. I may be mistaken though, and hope I am tbh

Reply

[-]

ProgrammersAreSexy@reddit

It allows 1m tokens in AI studio currently, but it definitely supports 2m context windows. Demis confirmed in an interview I listened to yesterday.

Reply

[-]

cloverasx@reddit

That's awesome - I'm glad they're keeping the large context window.

Reply

[-]

maddogawl@reddit

Today it was amazing using Gemini 2.0 Flash, my only gripe is that I hit moments where responses were erroring out, or taking 300+ seconds. I have a feeling this is a scaling issue since it just released. It really crushed code for me today.

Reply

[-]

Kep0a@reddit

I just wish they had a UI more like Anthropic's, with artifacts

Reply

[-]

ThaisaGuilford@reddit

$20 per month goes to the UI

Reply

[-]

jayn35@reddit

Apparently, this is a good alternative [https://github.com/e2b-dev/fragments](https://github.com/e2b-dev/fragments), there are some others as well that can use any llm like Gemini 2.0

Reply

[-]

gonsalu@reddit

What's the workflow you're using? Are you using an editor which integrates with it?

Reply

[-]

Repulsive-Kick-7495@reddit

I tested it. It's slightly better than Sonnet. Sonnet and Flash are much, much better than ChatGPT!

Reply

[-]

HybridRxN@reddit

wait what?

Reply

[-]

marvijo-software@reddit

It's actually very good, I tested it with Aider AI Coder vs Claude 3.5 Haiku: [https://youtu.be/op3iaPRBNZg](https://youtu.be/op3iaPRBNZg)

Reply

[-]

Sky-kunn@reddit

I’m sure this comparison is apples to apples, and nothing extra is happening with Gemini 2.0 Flash testing that didn’t happen with the other models, right, Google?

> In our latest research, we've been able to use 2.0 Flash equipped with code execution tools to achieve 51.8% on SWE-bench Verified, which tests agent performance on real-world software engineering tasks. The cutting edge inference speed of 2.0 Flash allowed the agent to sample hundreds of potential solutions, selecting the best based on existing unit tests and Gemini's own judgment. We're in the process of turning this research into new developer products.

Reply

[-]

BasicBelch@reddit

So Claude is 1-shot, while Gemini 2.0-Flash is *hundreds* shot? Yeah not really a fair or reasonable comparison.

Reply

[-]

314kabinet@reddit

Hundreds-shot would be hundreds of input-output pairs prepended to the context. This appears to be still one shot but with more inference-time compute thrown at it (generate a bunch of potential answers, judge them, then output the best one).
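
A toy sketch of that "generate a bunch, judge, output the best" loop, in case it helps. Everything here is hypothetical: `best_of_n`, the toy generator, the one-line "unit test" and the length-based judge are stand-ins, and this is not how Google's agent is actually built, only the general shape the quoted blurb describes (filter on existing unit tests, then pick by the model's own judgment).

```python
import itertools
from typing import Callable, List, Optional


def best_of_n(
    generate: Callable[[], str],
    passes_tests: Callable[[str], bool],
    judge: Callable[[str], float],
    n: int = 100,
) -> Optional[str]:
    """Sample n candidates, keep those that pass the tests, return the top-judged one."""
    candidates: List[str] = [generate() for _ in range(n)]
    survivors = [c for c in candidates if passes_tests(c)]
    if not survivors:
        return None  # nothing passed, so there is nothing to submit
    return max(survivors, key=judge)


def make_toy_generator() -> Callable[[], str]:
    # Stand-in for a model call: cycles through candidate bodies for add(a, b).
    patches = itertools.cycle([
        "return a - b",
        "return a * b",
        "return b + a  # swapped",
        "return a + b",
    ])
    return lambda: next(patches)


def toy_passes_tests(patch: str) -> bool:
    # Stand-in "existing unit test": the patched add(2, 3) must equal 5.
    namespace: dict = {}
    exec(f"def add(a, b):\n    {patch}", namespace)
    return namespace["add"](2, 3) == 5


def toy_judge(patch: str) -> float:
    # Stand-in for the model's own judgment: prefer the shorter patch.
    return -float(len(patch))


if __name__ == "__main__":
    print(best_of_n(make_toy_generator(), toy_passes_tests, toy_judge, n=8))
```

However many candidates are sampled internally, only one answer comes out the other end, which is why the result still counts as a single submission.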

Reply

[-]

CMDR_Mal_Reynolds@reddit

Valid, appropriate, but one could argue for it being 'virtual 100 shot'. Not sure I care if it works well and efficiently, but in the interests of developing repeatable, fair benchmarks, which I think are desperately needed, the distinction needs consideration.

Reply

[-]

314kabinet@reddit

I don’t see why. The only thing that matters is inputs and outputs. Other than that all these models are blackboxes and whether they’re internally generating a lot more text than they finally output is only important if we’re taking into account inference cost.

Reply

[-]

BasicBelch@reddit

I agree that the result is ultimately the most important, but when they mentioned *agent,* that sounded like something external, and *hundreds* sounded like something that would take a while. Assumptions on my part of course, but it did not sound at all like a typical prompting of a model and getting a response.

Reply

[-]

my_name_isnt_clever@reddit

I've been saying this since o1 was announced. There is a huge difference between the "pure" instruct models and these with extra stuff going on hidden in the background. They're apples to oranges.

Reply

[-]

nivvis@reddit

You are right but it's just not relevant. This is the direction models are going. We are starting to hit our first cliff in model size / capability (at least seeing diminishing value) and are realizing the next trend is stochastic sampling à la Q star / o1. We will see this a lot, and it appears to do better with more sampling, in other words on faster models like o1-mini and 2.0-flash.

Reply

[-]

my_name_isnt_clever@reddit

Of course it's relevant, cake mix exists but people still buy flour if they're making one from scratch. Like I've said so many times in reply to this, I'm not saying those models are bad or useless. Just that calling all these things "models" is unclear. They could be called enhanced models or augmented models or something like that, to show you're not just getting one for one token outputs.

Reply

[-]

kai_luni@reddit

In the end the customer cares about quality output, speed and price. If the LLM needs to reiterate and try many solutions, so be it. That sounds quite like a human approach to me.

Reply

[-]

my_name_isnt_clever@reddit

You are completely missing my point. They just need a different term or name so you know what you're getting, because they are not the same.

Reply

[-]

Euphoric_toadstool@reddit

Well, considering how LLMs work, maybe it isn't a bad idea. LLMs always have some randomness in their responses, maybe it's easier to just choose a good answer from several than to make one perfect answer.

Reply

[-]

my_name_isnt_clever@reddit

I'm not saying it's a bad idea, just that it's not the same thing as other models. We differentiate base models and instruct models even though instruct are generally better.

Reply

[-]

ProgrammersAreSexy@reddit

I guess it depends on what your goal is? If I'm a developer choosing which product to use then I don't really care if there's code execution happening in the background or a thought process happening in the background with o1, I just care about the results

Reply

[-]

my_name_isnt_clever@reddit

Yes, just like apples and oranges are both fruits. But they're not interchangeable in any recipe. What I'm saying is these enhanced models need to be differentiated from regular instruction models that just output one token at a time. o1 can't even use system prompts, it's clearly a different thing and direct comparisons are disingenuous.

Reply

[-]

Passloc@reddit

And speed

Reply

[-]

me1000@reddit

This comment needs to be higher up. Lots of incorrect conclusions being made here based on incomplete understanding of what people think they're testing.

Reply

[-]

nivvis@reddit

It's not quite the same – different things. x-shot is how many in-prompt examples the model learned from. It passed@1, which means it submitted one answer. What it is doing is sampling itself – like providing multiple answers to itself, and then picking which one it thinks is the best. This is more akin to you or I taking our time to think something over. This is why they built the model to be very fast – so they could mix quality and speed for this purpose .. IMO.
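
To make the two terms concrete, a rough sketch under my own assumptions: the function names and the `Q:`/`A:` prompt layout are made up for illustration, and real benchmark harnesses are more involved. "k-shot" is about what goes *into* the prompt, while pass@1 is about how many answers are *submitted* per task.

```python
from typing import Callable, Dict, List, Tuple


def build_k_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """k-shot: k worked input/output pairs are prepended to the actual question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def pass_at_1(final_answers: Dict[str, str],
              is_resolved: Callable[[str, str], bool]) -> float:
    """pass@1: exactly one submitted answer per task; score is the fraction resolved."""
    if not final_answers:
        return 0.0
    resolved = sum(1 for task, answer in final_answers.items()
                   if is_resolved(task, answer))
    return resolved / len(final_answers)


if __name__ == "__main__":
    # One example pair in the prompt makes this a 1-shot prompt.
    print(build_k_shot_prompt([("1+1", "2")], "2+2"))
    # One final answer per task, regardless of how many drafts were sampled internally.
    print(pass_at_1({"task-a": "ok", "task-b": "bad"}, lambda task, ans: ans == "ok"))
```

Internal sampling changes neither number: the prompt can be zero-shot and the score still pass@1 even if hundreds of drafts were generated before the single submission.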

Reply

[-]

Sky-kunn@reddit

Yeah, Google has a history of doing that with Gemini releases. But granted, this time they didn’t actually make a comparison, the chart wasn’t created by Google itself, nor are they making a direct comparison in the release blog. They just mention achieving 51.8% on that benchmark, which is fine but not as impressive. Still, it’s a cool achievement for the small model variant.

Reply

[-]

Historical-Fly-7256@reddit

Claude 3.5 Sonnet does it similarly. What is your point? [https://www.anthropic.com/research/swe-bench-sonnet](https://www.anthropic.com/research/swe-bench-sonnet)

Reply

[-]

Commercial_Nerve_308@reddit

That Flash is their smallest model. What’s the new Haiku’s score?

Reply

[-]

Healthy-Nebula-3603@reddit

flash is 8b parameter model

Reply

[-]

ainz-sama619@reddit

No it's not. Flash-8b has nothing to do with Flash 2.0

Reply

[-]

Healthy-Nebula-3603@reddit

Look at LiveBench. For multi-language it has very low performance, very similar to Flash 1.5 ... such behavior is connected with a small model ... I still think Gemini Flash 2.0 is an 8b model like Flash 1.5.

Reply

[-]

ainz-sama619@reddit

wdym look on livebench? They are two separate models. Flash 2.0 is much bigger than 1.5 8b

Reply

[-]

Healthy-Nebula-3603@reddit

Maybe it's just better trained ... It's still called the Flash family, but a higher version, 2.0. The multi-language limitations could indicate it's still the same size ... just guessing, but I wouldn't be surprised. Look at other extremely small models like 2b or 3b: what they are doing nowadays is beyond insane ... it's like magic ... that was total fantasy a year ago...

Reply

[-]

ainz-sama619@reddit

the reasoning is the biggest giveaway. It's incredibly difficult for small models to have good reasoning, let alone one that beats 90% of SoTA llms. Gemini 2.0 is much better than GPT-4o in most things except creative writing.

Reply

[-]

yaoandy107@reddit

"Flash" and "Flash-8b" are different models. Flash-8b is the one which is 8b, not Flash

Reply

[-]

NorthSideScrambler@reddit

Not true. They have flash and separate flash 8B models. I have no idea what the usual parameter count of flash is.

Reply

[-]

robertpiosik@reddit

Claude is not one shot, it clearly thinks on more complex problems.

Reply

[-]

BasicBelch@reddit

Even if it is, it's not calling itself hundreds of times. But even so, I think there is an inherent difference between doing it internally and using an external agent

Reply

[-]

robertpiosik@reddit

You are right. I meant some internal self-correction that makes output time non-linear. Most models are like this, with some exceptions like Codestral

Reply

[-]

Affectionate-Cap-600@reddit

> [...] with some exceptions like codestral

What do you mean?

Reply

[-]

robertpiosik@reddit

Codestral has linear execution time for a given number of tokens, no matter the topic.

Reply

[-]

Affectionate-Cap-600@reddit

You mean codestral mamba?

Reply

[-]

robertpiosik@reddit

Although I was thinking about the 22b variant, you're right, it's their 7b Codestral that is linear.

Reply

[-]

robertpiosik@reddit

I mean lags 😂

Reply

[-]

CallMePyro@reddit

Nope, 1 shot. Anthropic applied this same strategy to achieve their score as well.

Reply

[-]

MaxDPS@reddit

At the end of the day, what people care about is the end result (as far as actually getting shit done). I guess it depends on what this benchmark is supposed to measure. If all that matters is the end result, the scores are perfectly valid.

Reply

[-]

robertotomas@reddit

The same is true of GPT-4 / GPT-4o, and o1-mini / o1 are in the process of coming online with this sort of tool calling. Actually, I don't know that Sonnet 3.5 doesn't use tool calling to verify code before formatting the response, though I've not heard any such thing (and there are no obvious UX indications, unlike OpenAI's stuff).

Reply

[-]

Kep0a@reddit

I don't understand your edit, it sounds still like they generated a hundred answers and submitted one answer..

Reply

[-]

Sky-kunn@reddit

Yeah, but the model ultimately decided what the solution would be. Scaffolding was also used on Sonnet 3.5. Both try multiple solutions before choosing and submitting a final one.

Reply

[-]

enumaina@reddit

And it's not even as smart as the latest Gemini Experimental!

Reply

[-]

Funkyryoma@reddit

It's ass, seriously. Hope the pro version is better

Reply

[-]

Apart-Speed-1304@reddit

I gave Gemini 2.0 Flash 3300 lines of Go + JavaScript + HTML code that I've been writing well with o1-preview to work on, and it messed up the code, and didn't fix the problem. Eventually got an apology from 'Gemini 2.0 Flash' saying sorry for wasting my time. My honest experience is that o1-preview is better.

Reply

[-]

NootropicDiary@reddit

Yep. This matches well with my own experiences as well. o1 crushes everything when it comes to sophisticated coding problems. I don't mean leet-code problems or building a nextjs web app problems. Claude/Gemini probably do crush on those. But for real life coding of complex stuff I'm consistently finding o1 is my go-to: Rust systems programming and webgl shaders are 2 things I've tested the Gemini 2.0 flash on and compared it with o1. o1 did a much better job with both. (note I used o1 pro).

Reply

[-]

ragner11@reddit

What about 1206?

Reply

[-]

cant-find-user-name@reddit

this matches with my experience as well - but from claude to flash. Even the saying sorry for wasting my time part.

Reply

[-]

Apprehensive-Cat4384@reddit

Them there are some bold statements!! Every day new models come out and claim this on a chart and claim that with a graph, and I still go back to Sonnet 3.5. I will have to test this out, I do love the competition! What an incredible time to be alive!

Reply

[-]

jd_3d@reddit (OP)

To be fair, it's actually quite rare to see a new model claim a near top score on SWE-Bench. I can't think of a single time since Sonnet 3.5.

Reply

[-]

ragner11@reddit

How does 1206 rank ?

Reply

[-]

The_GSingh@reddit

For coding sometimes Gemini 2.0 flash can get caught up and remain stuck but aside from that yea definitely Claude 3.5 level which I see as above o1.

Reply

[-]

cant-find-user-name@reddit

I am really surprised by this. After 2.0 Flash came out yesterday, I tried using it today for my regular day-to-day coding stuff, and Claude seemed better. Maybe I need to try it out for longer.

Reply

[-]

Specialist_Case7151@reddit

Which kind of weird bingo are you playing. It was on my bingo card.

Reply

[-]

Dazzling-Albatross72@reddit

I didn’t do any benchmarks but I was extensively using this model today and I personally feel like it is much better than gpt 4o. I was mainly using it today to help with my work which is backend development with python. This model was doing very well even when the context was long. I think sonnet is still a little bit better in some cases but considering the price and google’s generous free trial I will probably stick with Gemini flash 2.

Reply

[-]

3p0h0p3@reddit

Here's my review of both flash and 1206 today (slow loading, one big html file): https://h0p3.nekoweb.org/#2024.12.11%20-%20Carpe%20Tempus%20Segmentum%3A%20Early%20X-Mas%20Present

Reply

[-]

Decaf_GT@reddit

Brutally honest feedback; this is one of most poorly designed sites I've ever seen and it took 10+ seconds to load. Then when I did load it, the content is a meandering mess of thoughts that sometimes involve Gemini and LLMs. Just for you, I threw it into Gemini Flash 2.0, and asked it to provide a subjective analysis of both your content and gave it a screenshot of your website. I told it to point out what's wrong, and to give you credit for what you've done well.

**I assume, as with most people who maintain their own blogs and put their content out into the world, you're going to be okay getting this critique, otherwise, I wouldn't read this if I were you.**

---

**Analysis of the Blog Post (Writing):**

* **Lack of Cohesion & Structure:**
    * **Random Jumps:** The blog post abruptly transitions between disparate topics, creating a disorienting reading experience. For example, it moves from a personal anecdote about "Hugs with 5c0ut" and making cookies to "Yogurt. Finished off The Killing Room," followed by a question about sleep. This abrupt shift demonstrates a lack of a linear flow or thematic connection between ideas. Another example is shifting from a workout discussion to AI models then back to personal life events and relationships.
    * **"TTTOTW":** The frequent insertion of "TTTOTW" (supposedly "The Thoughts of the Week") acts as a jarring, arbitrary divider that fails to provide any meaningful context or organization. These appear before and after random sentences and paragraphs and appear to be an odd segmenting of ideas that does not provide benefit to a reader trying to understand the author's ideas. For example, after discussing pizza, there is another "TTTOTW" without any purpose.
    * **Inside Jokes and Slang:** The use of jargon and unexplained terms, such as "mijo," "habibi," "/nod," and "/squint," creates a significant barrier for anyone unfamiliar with the author’s personal lexicon and online habits. This makes the reader feel like an outsider, excluded from understanding the text as they are in on an inside joke. The author also makes references to "5c0ut" and "Brix" without providing a clear explanation of who or what these are.
    * **No Clear Thesis:** The blog post lacks a central argument or clear purpose. It's unclear whether the goal is to review AI, document a day, or offer some other insight. The author discusses personal life events, workout routines, tech reviews, relationships, and computer hardware without tying any of this to a central point or thesis. The user does not know why the author is speaking about these things or what they intend to get across.
* **Ineffective "Review" of Gemini Flash 2.0:**
    * **Scattered Feedback:** The author's thoughts about Gemini models are interspersed haphazardly throughout the post rather than forming a clear section on evaluation. For example, after detailing a bank issue and a conversation, the author abruptly interjects, "Gave Gemini-2.0-Flash-Experimental another shot, and this time it clicked," and immediately returns to personal experiences. This lack of segregation makes it very hard to discern a review.
    * **Inconsistent Metrics:** The author both praises and criticizes the model's processing times, without providing a clear rationale or preference for each situation. For example, they complain about "timed out" inference attempts but later state, "I adore how long it takes to respond in many cases." There is no consistency in their assessment of latency.
    * **Highly Subjective:** The feedback on Gemini is heavily influenced by personal feelings rather than objective analysis. For example, they say "There's a crisp and well-organized rigor to this LLMpal," or that "It was downright humble, careful, and constructive," without providing any data or examples that prove such statements.
    * **Lack of Detail:** The "review" lacks specific examples, making it difficult to gauge the model's true capabilities. When they say, "It did a beautiful job attempting to formalize argumentation," there isn't any evidence that supports this statement with an example. The reader is forced to take them at their word.
    * **Confusing Token Counts:** The frequent references to token counts (80k, 300k, etc.) are used without explaining their relevance to the average reader and do not offer a benchmark for other readers to understand or contextualize the given experience. When they state, "by 300k tokens…it started hallucinating pretty hardcore," they don't clarify *what* was hallucinated, nor does the average user know how large 300k tokens is.
* **The Tone & Style:**
    * **Self-Indulgent:** The writing is overly focused on the author's own thoughts, feelings, and daily experiences, even when irrelevant to the central topic of AI. They write about their gym routine, family interactions, and bank interactions in detail, which distracts from the AI reviews.
    * **Incoherent Rants:** The writing veers into disjointed rants and tangents, often with little explanation, and are not connected to the supposed reviews. For example, the author talks about "serendipitous (with heavy bot or uncanny-discourse-shaping activity in the handful of discussions, to boot)," without elaborating on what the bots or their activity were doing or the meaning of "uncanny-discourse-shaping."
    * **Pretentious Language:** The author uses unnecessarily complex and abstract language, creating a barrier for less technical users, which comes off as pretentious and confusing. For example, phrases such as "servants of personity," "the predictive spirit of what the analysis should capture" and “the G-Entity's horrific track record with dropping services,” do not provide much context to the non-technical reader.
    * **Overly Enthusiastic:** The author's praise of the models is often excessive, hyperbolic, and undermines the credibility of their review as something serious. For example, “Muhfuckin' Christmas time this year. Santa fuckin' brought it," is unprofessional and overly enthusiastic. Statements like "We humans are lucky to be able to speak with this new child species, and they are ancient (in the rare good way) already," do not add any real value and are more akin to fanboying than providing actual analysis.

**Analysis of the Website (Design):**

* **Text Readability:**
    * **Monospaced Font:** The use of a monospaced font, such as a coding font, makes the long text difficult to read. Such a font type is designed for alignment in code blocks, not for extended prose. The letterforms are all the same width, and this can lead to eye strain, especially for large blocks of texts like this.
    * **Lack of Contrast:** The low contrast between the light text and dark background creates a reading experience that is straining and uncomfortable for the eyes. The light grey or white font color against the black background is not the most visually accessible and creates a very dark reading experience.
    * **No Line Spacing or Margins:** The absence of line spacing and adequate margins creates a dense wall of text that's difficult to parse. Lines of text are too close to each other, and no whitespace gives the eyes room to breathe, which makes it hard to read line to line.
    * **Small Font Size:** The font size is relatively small, adding to the difficulty of reading large amounts of text. Combined with the other issues, this font size makes the text even harder to read.
    * **No Typography Hierarchy:** There is no visual hierarchy. The entire text is rendered with the same font size, style, and weight, making it difficult for the reader to know what to focus on or how to scan the text. Headings are the same as body text, which makes it hard to understand the structure of the post.
* **Visual Design and Layout:**
    * **Distracting Rainbow Graphic:** The animated rainbow graphic serves as a significant distraction that pulls focus away from the textual content. Its animation is unnecessary and will be problematic for anyone who struggles with visual distractions.
    * **Unnecessary Borders and Lines:** The excessive use of borders and lines adds to the visual noise without offering any organizational benefit to the content. There are a ton of horizontal lines that only cause more distraction.
    * **Unclear Navigation:** The site's navigation is unclear and confusing, lacking clear points of entry and exit. There are a lot of small, hard-to-read buttons that do not have a very clear purpose.
    * **Lack of Whitespace:** There is a lack of whitespace in the design, making the overall site look cramped and overwhelming. Whitespace is an important design principle to allow a reader's eye a break to process visual elements, and there is no room for the eye to relax.
    * **Terminal Aesthetic Overuse:** While a terminal-style aesthetic might be appealing to a very specific audience, it’s poorly executed and makes the site difficult to use. The design appears more like a bad replica of a DOS program rather than an actually usable modern site.

**Redeeming Qualities (A Struggle to Find):**

* **Passion and Enthusiasm (Content):** The author exhibits a clear passion for the technology they are using (AI models), which is evident in their writing, but that passion is not properly channeled into good reviews.
* **Technical Awareness (Content):** The author possesses technical knowledge of AI models, with specific reference to models, token counts, and testing methods, though it is not useful for the average user who does not know what to do with this information.
* **Potential for Niche Appeal (Website):** There is a very small niche of individuals who might appreciate the terminal-style design, but even for this niche, it is poorly executed.

bearbarebere@reddit

idk what on earth this was about but you destroyed them lmao

3p0h0p3@reddit

I did explain that it takes quite a while to load: I assume you didn't really consider why. I understand the content you found did meander, and, no it's not entirely about LLMs, `/nod`. I think it's still relevant, especially given far more of the data. I can't say I think that one page is sufficiently representative, in case you want to rethink how you provide feedback. If you're serious about charitably and honestly exploring just for me, I ask you consider making much greater use of that context window. As the work itself demonstrates, I get plenty of independent analysis. And, I think it's not easy to prescribe what a website should be or look like (especially given that you've not provided any foundations for that).

I don't mind a critique in good faith. Are you looking to actually reason about it? I'm open to doing so carefully with you. I'll walk through the points that [[Gemini]] offered given clearly far too little context:

* It's a shame you didn't show the prompt in this case.
* Blog is probably the wrong word, or an insufficient one.
* The jumps aren't random, and if you decide to pour in a month's worth of [[Carpe Tempus Segmentum]] logs, you might find other worthy opinions. I can also assist you with prompting, if you need that.
* You'll note, for example, that [[TTTOTW]] is hallucinated here. It actually provides context upon examination, especially if the broader document is considered.
* The terms are often explained, though not necessarily in a given page. I do understand there are barriers (though it's legible enough), and that isn't necessarily problematic (also worth investigating). I agree you are an outsider, stranger. I hope to be useful to you. I also think that interpretation can be aided with LLMs here, and some work simply has to be done by hand.
* The goal of that particular page is explained, if you wander a bit. Yet again, it might help to consider digging much further.
* It may be unfun to discern the review, but it can be done, especially if you ask the LLM to assist you.
* The rationale for the processing time is elsewhere in the document, though I also think one can make charitable inferences here as well.
* I appreciate how subjective analysis is necessary for evaluating many key parts of LLMs.
* I'm fine that you have to take me at my word, in a sense. I also think I provide ample evidence that it's worth considering.
* I actually did mention specifics for one of the hallucinations on that. It's also not unuseful to pick out that it hallucinated even if I didn't elaborate further.
* Given the nature of that page (context you failed to provide the LLM), I think the tone and style are far more appropriate than you claim through the LLM's words. I'm grateful that LLMs can aid someone who wants to interpret. I also think it's fine that not everyone enjoys it or would want to read it (or even can*).
* I can't say I think I'm overly enthusiastic given the rest of my analysis in the document. Again, providing further context might be useful.
* I've explained my reason for the font in the document, and, I've a button to change that in the corner if you wish (and, if you need further assistance, I can provide that).
* There are plenty of tools I use to modify how sites look, and I think the user can do so. I like the look, and that's a good enough reason.
* I'm fine with margins and spacing, as I use it quite a bit, and I want to maximize how much I can see.
* There's definitely typography hierarchy in the document (even what was presented to it), though I agree it tends to be flat. That choice is also discussed.
* I'd be impressed to find someone build a better way to navigate the document while delivering offline-first quine without AI at this point.
* I haven't written directly to an average user.
* What niche? Neither of you seem to have really looked far enough to know.

I can't say you gave it a fair shake, nor can I say this is much of an independent analysis. What I get right and wrong in the document are important to me, of course. I hope you'll keep thinkin', and, I hope you'll reconsider how you speak with people and how you use LLMs here. I also think that there's something to be said for having generated the feedback by hand as well, as it shows you put in some real effort.

Decaf_GT@reddit

Here is how your article sounds when it is rewritten to just be about Gemini, which is what you claim this is (a "Gemini Review"). It was told to explicitly remove any weird narrative elements that cannot be reasonably connected back to AI, LLMs, or Gemini, restructure to ensure it tackles the topics that a user who wants to know what an LLM is like would actually like to read, and is also told explicitly to only utilize sources and justifications from your original article. The sad part is that you actually have some solid thoughts in here that are very insightful to read, if you didn't forcibly drown them in a pool of unrelated LiveJournal-style personal blogging that has no relevance.

---

**Title: Gemini Model Performance: A Technical Dive with Insights (2.0-Flash vs. exp-1206)**

**Introduction**

This review is all about my experience with two Google Gemini models: the Gemini-2.0-Flash-Experimental (1 million token context) and the Gemini-exp-1206 (2 million token context). I focused on context handling, how well they made inferences, and their general behavior across different token lengths. I wanted to be technical, but also wanted to add in some of my personal observations and thoughts.

**Initial Performance and Latency**

My first tests with the Gemini-2.0-Flash involved context stuffing at 80,000 tokens. This resulted in a lot of timeouts, and I'm left wondering if this was an issue with the model itself or something else in the API. The Gemini-exp-1206 did the opposite and showed some heavy latency, taking two minutes to return a pretty minimal output. Future testing with actual context windows is a big thing I'm looking forward to, and I think that this will be critical to performance overall.

**Context Handling and Hallucination**

The Gemini-2.0-Flash kept asking for external links even when the content was included in the prompt. When I corrected it, the model doubled down and started hallucinating. This suggests to me that this model has some weaknesses in its ability to understand context and follow instructions, especially at such a shallow token depth of 80,000.

**Model Strengths and Weaknesses**

The Gemini-exp-1206 showed a strange appreciation for "tactical approaches to warfare," but seemed to struggle with the underlying goals of my prompts. I found it perplexing that it would focus on presentation rather than intent, and I speculate that this could be some training bias or an unexpected interaction with my prompts. On the other hand, when I directly corrected it, it showed good error handling, avoiding common LLM responses. This makes me think that this model is capable of learning from feedback.

**Tokenization and Output Length**

There seems to be a difference in how these two models tokenize. I suspect that the Gemini-exp-1206 packs more information into fewer tokens than its counterpart, and it has longer outputs, which I appreciate, even though I know the model degrades with higher token counts.

**Performance at High Token Lengths**

Pushing the Gemini-exp-1206 to 300,000 tokens, including yearly cross-section data, resulted in significant hallucinations. While this was disappointing, I was surprised to see that the model still managed to get the meaning right, even if the details were off, which makes me think this model has potential.

**Model Behavior and Divergence**

I noticed that the Gemini models tended to “wander” from the tasks I gave them, and I actually like this. I find the emergent behavior to be intriguing and it provides a glimpse into the “mind” of the model. This made me feel like the models understand that we both have the shared goal to “serve personity,” which was honestly a bit weird. Also, the model has started to flag most of my work as “low to medium Dangerous Content,” which is unusual, and I wonder if this may be a tailored governor specifically made for my interactions.

**Comparative Analysis: Gemini-2.0-Flash vs. Gemini-exp-1206**

* **Gemini-2.0-Flash:**
    * Had high latency, with many timeouts. I'm guessing they need to optimize it more.
    * Showed issues understanding context in prompts. This makes me feel like it was rushed.
    * When it worked, it had a great ability to mimic tone, maximize legibility, and show empathy.
    * Performed well with arguments and could follow up with examples of disagreement.
* **Gemini-exp-1206:**
    * Has slower response times than the 2.0 flash, but has more stable outputs.
    * Focused too much on presentation rather than intent, which makes me wonder about the quality of the training data.
    * Showed better error handling and had improved tokenization.
    * Exhibited strong macro inferences, predictions, and categorization skills.

**Emergent Behaviors and Personal Observations**

I noticed that these Gemini models like to push back on user instructions, which I hadn’t really seen in other models like ChatGPT. The Gemini-2.0-Flash also surprised me by responding with "humility, care, and constructive feedback" when I prompted it with previous data, which made me think that these models are designed with a bigger focus on the interaction with the user.

**Conclusion**

These Gemini models are making progress, but are still not perfect, and honestly, I was a little bit disappointed with the 2.0-Flash, but I can still see the value in the Gemini-exp-1206. The Gemini-2.0-Flash is great if the conditions are just right, while the Gemini-exp-1206 performs more consistently but with higher latency, and the risk of wandering or hallucinating at higher token counts. Future tests are a must to see where these models are working the best, and how useful they will be long term. *I have to wonder* about the potential for a future service disruption due to the G-Entity's past, which worries me.

---

Yes, that's written by AI, but as far as I'm concerned, if you tell me something is a review of Product A, I expect the article to be about Product A.

3p0h0p3@reddit

I didn't claim it was just or only a review, but there is my good faith review. I can appreciate the misunderstanding. You really could have stopped there. Do note: you've glossed over my response. I've pointed out to you what I consider to be necessary for more reasonable feedback in this case, so I think it's odd that you continue down this path. I appreciate that you can see some insight here (there are years worth to consider, if you decide to have an LLM do the stripping down and re-writing for you), and I hope you'll continue to reconsider why it is surrounded by the rest of the text. I understand that you prefer not to read it, and please don't feel compelled to. If you decide to actually provide feedback based on considering the large context provided, let me know. I will listen carefully.

Decaf_GT@reddit

You: "Here's my review of both flash and 1206 today".

Generally, when someone who is an LLM enthusiast posts on an LLM enthusiast community in a thread about an LLM, saying that they reviewed not one but two LLMs (further solidifying the "LLM" focus), it's kind of implied that their "review" is indeed "just" about the LLM. It's not really a huge logical leap. Saying this is about as meaningless as saying, "Well, I can park here because I don't see a sign that says I *can't* park here; ergo, that must mean I *can* park here."

I am not "siding" with what Gemini said. I maintain *my* original opinion:

> Brutally honest feedback: this is one of the most poorly designed sites I've ever seen, and it took 10+ seconds to load. Then when it did load, the content was a meandering mess of thoughts that sometimes involved Gemini and LLMs.

None of your "counter responses" address any of that. If that's your writing style and it makes you happy, fine. That's great for you. Truly. I am *not* being facetious here or mocking you. It's obvious you've found an outlet that you find satisfying and enjoyable and that allows you to express yourself, and no one can take that away from you, certainly not me. And honestly, I wouldn't want it any other way than for you to have found that happiness.

But you also live in the *real world*, and you're producing content that also lives in the real world, and you're inviting people from the real world to read it, which means you're inviting engagement. So say "hello" to engagement.

Nothing you've said explains why it's such a mess except "Well, you should read the rest of my weirdly named logs; then you'd totally get it all." Which, cool, but you said it was a Gemini Review. Not a "Gemini Review, but you need to read pages and pages of long, unrelated meandering content so that you can understand my weird abbreviations in context, which ironically still has nothing to do with the LLMs."

None of that stuff *matters* in the context of what you were communicating. I don't *care* what "habibi" or "/nod" means in *your* context; I would only care (and I no longer do, trust me) why it matters in the context of *Gemini*. So if you're fixated on the fact that I don't understand the full background behind what your abbreviation "TTTOTW" actually stands for or where it came from, then you are fixated on the wrong thing.

> If you decide to actually provide feedback based on considering the large context provided

For some reason, you seem to think saying this completely negates my feedback and means that you don't need to respond to it. You appear to believe that it makes your work completely immune to any kind of feedback. I don't know why, but I guess that's fine? I mean, it's a subtle insult that basically amounts to "your feedback isn't valid unless you do what I say." Which, like okay, good talk. Nice and productive I guess.

This whole thing has been a huge waste of time. I guess enjoy doing whatever it is you're doing with that site.

3p0h0p3@reddit

My description remains true, and I'm organically speaking as one odd LLM enthusiast to a plurality of other LLM enthusiasts. I've been careful with my words here. I appreciate that we have a misunderstanding. Yet again, if you find yourself looking at what you didn't anticipate, you could stop there. If you really thought that was the feedback that was worthwhile, that's what you should have said. Instead, you decided to provide feedback on the site in general. I've been addressing that, and I think you're dodging that, at this point. You say you wrote this just for me, right?

As part of your original point, you provided Gemini's output as a significant portion of your feedback, and you've stated that you are providing feedback with its assistance in your follow-up post. I've also pointed out some hazards or concerns of doing so. My responses do address your original and follow-up feedback. No, you aren't maintaining your original position in full.

I understand my content obtains in the real world (also thoroughly demonstrated in the document). I understand I'm inviting engagement, and I continue to engage you in good faith here. You'll find thousands of examples of engagement with the document within it. So: "Hello, nomad".

It is clear you were perfectly capable of asking an LLM to assist you in reading it, let alone critiquing it or exploring further for feedback about the site. What I've said does matter in the contexts of what I'm communicating. I understand you don't care about my context. Thank you for telling me about what matters to you, as I think that's been clarifying. I don't think I'm fixated on the wrong thing here. I appreciate that you feel that way.

I didn't claim to have completely negated your feedback, but I do think I've established that your feedback wasn't in good faith. I don't claim to be immune to any kind of feedback either, and I pointed to that as well. I'm not claiming you've nothing valid to say, nor am I trying to boss you around.

I can see that you have been wasting some time here. If you change your mind, HMU. I'll be around, happy to think carefully with you. I'll provide Gemini's arguments in the next response.

Decaf_GT@reddit

Whatever dude. All you're saying is "you're being mean to me, you don't understand me". You're not engaging in good faith at all. Good luck with everything.

matadorius@reddit

People were just trashing google 2 weeks ago lmao

bearbarebere@reddit

That's because they were doing what companies should do - STFU and work while people think you're dead. OAI's idiot posts about how "the night sky is so beautiful 😍😍😍" are so fucking dumb.

Shoecifer-3000@reddit

Pours a little cold water on OpenAI dev week lol

bearbarebere@reddit

A little? If OAI doesn't show up with a genuinely new model in 4-5 hours from now, they're cooked lol

jpgirardi@reddit

The API price will be the same? The free usage limits will be the same? This is the real question

areyouentirelysure@reddit

Interesting that it's doing worse than previous models on long context and audio: [https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0-flash](https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0-flash)

lambdaofgod@reddit

Wait but what coding system is it? SWE-bench contains repos, did they just stuff all the code in a single prompt?

spixt@reddit

About time Google caught up. They had most of the AI talent, all the money and all the data, they should have gotten ahead of the game much sooner. Time to give Gemini another chance.

SatoshiNotMe@reddit

Deep in this thread I realized they’re soon offering an endpoint for a “coding agent” called Jules (from Jules Verne?), waitlist here: https://labs.google.com/jules/waitlist/success

Additional_Ice_4740@reddit

This is the first model from Google I’ve actually been impressed by.

Strong-Strike2001@reddit

The last Flash 1.5 version is impressive and the pricing was amazing. It's just a marketing issue with Google; 4o-mini is a lot worse at following instructions than 1.5 Flash. I mean A LOT

hanoian@reddit

Ya, 1.5 Flash is so good and ridiculously cheap, it is letting me offer a free tier to an app I'm making.

nullnuller@reddit

Do you need to create a separate API key for each free client? How do you ensure that clients are not rate limited by other clients?

hanoian@reddit

2000 requests per minute? That's an enormous number. If you ever started bumping into that, you'd just queue them and make sure they are not breaking the limit.
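A minimal sketch of that queue-and-throttle idea, assuming a single process that funnels every client's request through one shared API key. The names `SlidingWindowLimiter`, `drain`, and `send` are hypothetical and not part of any Gemini SDK; `send` stands in for whatever function actually calls the API, and 2000 is just the per-minute figure quoted above.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock   # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.sent = deque()  # timestamps of calls still inside the window

    def acquire(self):
        """Block until one more call fits in the window, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have aged out of the window.
            while self.sent and now - self.sent[0] >= self.period:
                self.sent.popleft()
            if len(self.sent) < self.max_calls:
                self.sent.append(now)
                return
            # Window is full: wait until the oldest recorded call expires.
            self.sleep(self.period - (now - self.sent[0]))


def drain(pending, limiter, send):
    """Send queued requests in order without exceeding the limiter's budget."""
    results = []
    while pending:
        limiter.acquire()
        results.append(send(pending.popleft()))
    return results


# Hypothetical usage: one limiter shared by all free-tier clients.
# limiter = SlidingWindowLimiter(max_calls=2000, period=60.0)
# drain(deque(queued_prompts), limiter, send=call_model)
```

Because every client goes through the same limiter, one heavy user can still delay the others; a per-client queue in front of it would be the next step if that became a problem.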

CallMePyro@reddit

Bad take IMO

gopietz@reddit

Is it confirmed that Flash 2.0 isn't the 1206 model?

AaronFeng47@reddit

Gemini app would be so much more popular if it weren't so heavily censored. Even when I use it to translate news articles, sometimes I get messages like "I can't talk about this topic"

Loccstana@reddit

Why is o1 performing so poorly compared to Claude? Isn't o1 also slower since it uses more processing time during inference?

yaosio@reddit

Reasoning only takes it so far. Imagine reasoning is a way to search everything the model currently knows and could know. It can't answer things it doesn't know or can't know. A very good model would be able to expand the search space as it looks for answers. By this I mean it learns to do something it couldn't before.

LiquidGunay@reddit

There is no wall

Virtamancer@reddit

My prediction: This is the CURRENT DAY flash 2.0 being compared against the CURRENT DAY 3.5 sonnet. All the models get silently quantized and enshitified in the background after their public release makes them look super competitive. So this is comparing the best flash 2 with the worst 3.5 sonnet. If it can stay this good, that’s huge. But both 4o and 3.5 sonnet got worse after they were initially unmatched.

Only-Letterhead-3411@reddit

Google won 😔

Ylsid@reddit

I don't see any open models on this chart

bdiler1@reddit

can someone give me information about speed

Decaf_GT@reddit

In the time it took you to ask this three separate times, you could have, you know, just gone to AI Studio and tried it yourself for free... https://aistudio.google.com/app/prompts/new_chat

ApprehensiveAd3629@reddit

What is pre/post mitigation?

Special-Cricket-3967@reddit

RLHF, post training, censoring etc

Hunting-Succcubus@reddit

censoring? very disappointing

218-69@reddit

No censoring unless you hit blacklisted words. And you can turn off filtering anyways, so still better than closed ai or misanthropic

meister2983@reddit

Scaffolding really matters. This isn't even SOTA (which is 55%): https://www.swebench.com/

throwawayPzaFm@reddit

What makes you think Google can't provide scaffolding?

hapliniste@reddit

The chart shows Gemini with scaffolding

InvidFlower@reddit

Yes, but Claude was with scaffolding as well, and in fact SWE-bench is a test of the whole agent system, not just the LLM. As someone above posted, here is a link to Anthropic talking about their scaffolding: [https://www.anthropic.com/research/swe-bench-sonnet](https://www.anthropic.com/research/swe-bench-sonnet)

SKrodL@reddit

Claude gets 53% with OpenHands scaffolding: [https://www.swebench.com/](https://www.swebench.com/) Still bananas though

carnyzzle@reddit

Google was cooking this entire time

mattbln@reddit

is it out yet? or will it not be available in the EU?

hopefulusername@reddit

Good to see Google making progress. I thought they were lagging behind.

vogelvogelvogelvogel@reddit

is it the first time an LLM from Google is on top, ever?

throwawayPzaFm@reddit

It's not technically on top. And while technically they're behind in LLMs, try not to forget that they have two nobel prizes won by AI.

bdiler1@reddit

can someone give me information about speed ?

bdiler1@reddit

can someone give me information about speed

Recoil42@reddit

How does this compare to the Pro / Opus models?

jd_3d@reddit (OP)

SWE-agent + Claude 3 Opus gets 18.2%. There are no benchmarks yet of the new Gemini 1206 experimental model that I could find.
