Gemini 2.0 Flash beating Claude Sonnet 3.5 on SWE-Bench was not on my bingo card

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 169 comments

Barry_Jumps@reddit

Have not used it for coding yet, but for reasoning over long discussions about really understanding a particular topic it's hands down the best model I've ever used. Its attention to detail is amazing and I frequently found myself surprised by how it could loop back to a particular point in a discussion tens of thousands of tokens prior.
View on Reddit #47505793

SystemEastern763@reddit

yeah check the new update from today, there are new players in town
View on Reddit #45134644

estebansaa@reddit

It also provides a context window several times bigger; it destroyed both o1 and Claude.
View on Reddit #42814402

ForsookComparison@reddit

Of all the companies to rule the future, I *REALLY* don't want it to be Google
View on Reddit #42828556

Milkybals@reddit

Better than OpenAI honestly, Google actually contributes to open source and pioneered it with the transformer paper in the first place
View on Reddit #42835941

nojukuramu@reddit

Yeah, and their FREE TIER of the Gemini API is almost unlimited use. RPM (requests per minute) is the only thing that limits usage, but it's still forgiving for a free tier.
View on Reddit #42838523

learn-deeply@reddit

It's only free when they're the underdog. If Gemini ever becomes better than ChatGPT or Claude, then they'll charge as much as people can bear.
View on Reddit #42848589

Then-Task6480@reddit

Well I mean... They didn't do that with Gmail, or Google. Or Chromium. Or Chrome. Granted they were data mining but who ~~wasn't?~~ isn't? This idea that Google will inevitably raise prices just because others have feels off. When Gmail launched, they disrupted the market by offering way more for free than anyone else and they’ve never charged for it. Assuming Google will follow the pricing strategies of frontier LLMs feels like conflating two completely different approaches to market dominance. Google doesn’t need to raise prices... they monetize differently.
View on Reddit #42942702

bunny_go@reddit

You never needed to pay for advertising on any Google platform. Good for you. When you pay $5-$50 *for a single click* and Google reps are still shitting on your face from a third-world country, then you'll really learn how evil Google actually is.
View on Reddit #43923579

Then-Task6480@reddit

Ooh true. I guess they are the only company that acts like this??
View on Reddit #43981743

bunny_go@reddit

Correct. The advertising experience with anyone else, including Facebook but especially Reddit, is good to great. Google is your worst enemy because they know there is nothing you can do about them. Yeah, let's not let them build their monopoly further.
View on Reddit #43987755

Then-Task6480@reddit

My experience with FB was bad. But I hear ya
View on Reddit #43997175

learn-deeply@reddit

All of the examples you gave cost Google ~0 margin, and are focused on gathering data from consumers. Gemini is a developer-focused product, and should be compared with Google's cloud offerings, many of which started off free or cheap but have significantly increased in price. E.g. the Google Maps API.
View on Reddit #42944384

bunny_go@reddit

I wish more people would learn this side of Google.
View on Reddit #43923613

Then-Task6480@reddit

Ok that's a good point. I was thinking of it more from a consumer perspective since that is the current model. I also think a lot of the old paradigms are shifting and the lines are becoming blurry, so it's hard to predict anything based on past experience imo.
View on Reddit #42947920

BoJackHorseMan53@reddit

Like how Anthropic and OpenAI increased their prices?
View on Reddit #42851531

learn-deeply@reddit

Yes.
View on Reddit #42852519

Decaf_GT@reddit

Given how good Flash 2.0 is it is absolutely nuts that they are giving it away like this with basically no limits at all for personal non-business users.
View on Reddit #42842812

i_am_fear_itself@reddit

Is it nuts though? I suspect they have a significant mind share hole to dig out of. Their preceding models weren't very good.
View on Reddit #42845280

Jon_vs_Moloch@reddit

How dare you talk about Gemma 2 that way
View on Reddit #42846045

Snoo33107@reddit

Just curious, what do you think Gemma 2 is best used for?
View on Reddit #42977231

Jon_vs_Moloch@reddit

…What is any intelligence best used for? I think you meant to ask a different question.
View on Reddit #42977422

swiss_aspie@reddit

Gemma 2 has served me well until last month when I switched from local inference to a service (because the GPU fan noise was getting out of hand)
View on Reddit #42858419

thelibrarian101@reddit

They have 1,500 RPD (requests per day) limits for everything except embeddings tho.
View on Reddit #42941380

delicious_fanta@reddit

They are also working as hard as they can to force the entire world to be exposed to malicious attacks via malware through advertising on the web by removing adblock capabilities on the browser with the largest market share, by far, on earth. They also fight tooth and nail against all consumer privacy rights. Etc. Companies that large are not working for the best interest of anyone but themselves.
View on Reddit #42854456

Kartelant@reddit

> removing adblock capabilities

This is disinformation btw. Manifest V3 actually *adds* functionality for adblockers in the form of filter lists that are run by the browser instead of by a service worker. uBlock Origin Lite uses this and blocks 95%+ of ads on MV3. If Google wanted to kill adblockers they're doing an extremely fucking terrible job. There are other changes that address legitimate security concerns (such as executing remote code and giving extensions read/write perms to every site you ever visit ever) that interfere with certain features of certain adblockers like UBO, hence the separate "lite" version. This is very far from "killing" adblockers.
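For anyone wondering what "filter lists run by the browser" looks like in practice: under MV3 an extension hands the browser declarative rules (the `declarativeNetRequest` API) and the browser does the matching itself. Below is a minimal sketch of one static block rule, built in Python only to show the shape of the data. The host `ads.example.com` and the helper `block_rule` are made up for illustration, and this is not how uBOL actually generates its rulesets.

```python
import json


def block_rule(rule_id: int, url_filter: str, resource_types: list) -> dict:
    """Build one declarativeNetRequest-style rule that blocks matching requests.

    Hypothetical helper: only the dict layout mirrors the rule format.
    """
    return {
        "id": rule_id,                      # must be unique within a ruleset
        "priority": 1,
        "action": {"type": "block"},        # the browser drops the request itself
        "condition": {
            "urlFilter": url_filter,        # filter-list style pattern
            "resourceTypes": resource_types,
        },
    }


# A one-rule ruleset for an invented ad host.
rules = [block_rule(1, "||ads.example.com^", ["script", "image"])]

# Static rulesets ship with the extension as JSON files.
ruleset_json = json.dumps(rules, indent=2)
```

The point of the design is that the extension never sees the request; it only supplies data, which is also why filters that need arbitrary code can't be ported.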
View on Reddit #42854899

ConvenientOcelot@reddit

uBOL looks very neutered, where is the advanced mode? If this is all Chrome can run, this is not "adding functionality".
View on Reddit #42858396

StyMaar@reddit

> There are other changes that address legitimate security concerns

This is Google's version of “think of the children”. A bit like when Microsoft pushed for SecureBoot in the name of security to make installing Linux harder on computers while adding no practical security whatsoever because it's trivially bypassed.
View on Reddit #42856359

Kartelant@reddit

Are you actually sitting here and telling me that browser extensions being able to execute remote unreviewable code presents an insignificant security risk? [280 million people installing dangerous extensions](https://www.forbes.com/sites/daveywinder/2024/06/24/280-million-google-chrome-users-installed-dangerous-extensions-study-says/) according to a study, does that not present a sufficient incentive to do something like deny *remote code execution* as a default capability? Jesus.
View on Reddit #42856626

StyMaar@reddit

Browser extensions have had arbitrary execution capabilities for years, and they have been sandboxed in every browser since the end of the XUL-based extension model a decade ago. And if you disregard sandboxing, anyone running JavaScript is executing remote code already … Browser extensions ought to be able to do stuff, otherwise they are useless. And the way around malicious extensions isn't a vain attempt at reducing the attack surface (as long as your extension has any ability to do useful stuff, it will have the exact same ability to do malicious stuff), the solution is curation of the extension marketplace! But Google is notorious for refusing all kinds of curation (same for its ads marketplace, which has been delivering malware for two decades now…)
View on Reddit #42858385

trololololo2137@reddit

You conveniently forget about issues with limited block list capacity on MV3 and how much less powerful the filters are [https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-asked-questions-(FAQ)#filtering-capabilities-which-cant-be-ported-to-mv3](https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-asked-questions-(FAQ)#filtering-capabilities-which-cant-be-ported-to-mv3)
View on Reddit #42855861

Kartelant@reddit

There are many things not supported yes, I could have been more precise. The limited block list capacity hasn't been an issue for years though. From the same FAQ: https://github.com/uBlockOrigin/uBOL-home/wiki/Frequently-asked-questions-(FAQ)#is-the-limit-on-maximum-number-of-dnr-rules-an-issue Doesn't really affect the point though. Features like dynamic filter lists didn't make or break uBO. Adblockers aren't dying.
View on Reddit #42856219

kvothe5688@reddit

Demis Hassabis and Sergey Brin any day compared to Sam Altman
View on Reddit #43091620

Any-Demand-2928@reddit

OpenAI would be worse probably. Google wants to maintain the status quo and would be willing to slow down development (if safety is what you're worried about). OpenAI will go full blitz into the storm for an extra dollar in their pocket. Also Altman can't be trusted like at all.
View on Reddit #42837188

eposnix@reddit

"Full blitz".. really? They waited almost a year just to release Sora and o1.
View on Reddit #42896332

scientiaetlabor@reddit

OpenAI feels like it's blitzing to try and establish an IPO before investors fully realize the marketing hype advance that bolstered them is beginning to dissipate.
View on Reddit #42842100

animealt46@reddit

Nobody will "rule" the LLM market because nobody has a moat. If you don't want it to be Google, there will always be a competitor that matches within months.
View on Reddit #42874890

whyme456@reddit

what alternative do we have?
View on Reddit #42830412

Decaf_GT@reddit

Probably another "little-tech" company that we'll cheer on as the independent darling of the tech world for the next 10-15 years, at which point they'll become "big-tech" and we'll turn on them and the cycle will continue. Sorry for the cynicism, just being honest.
View on Reddit #42835947

user0069420@reddit

Enshittification
View on Reddit #42841216

Decaf_GT@reddit

That word has nothing to do with what I said?
View on Reddit #42855195

acc_agg@reddit

Qwen.
View on Reddit #42832465

ElderberryNo9107@reddit

And DeepSeek.
View on Reddit #42852050

robberviet@reddit

You mean Alibaba, does it sound any better?
View on Reddit #42838384

acc_agg@reddit

Yes, that totalitarian regime is safely across an ocean.
View on Reddit #42849147

j03ch1p@reddit

Bruh
View on Reddit #42834794

RevolutionOn@reddit

SSI
View on Reddit #42848630

matadorius@reddit

We have about 6-7 companies competing, and that’s the best we've had so far. In the past it was iOS vs Android, Microsoft vs Apple, Nvidia vs AMD, Intel vs etc. We are probably at one of the best times for tech.
View on Reddit #42841553

kppanic@reddit

MGM Studios
View on Reddit #42835652

lazazael@reddit

future?
View on Reddit #42854115

robberviet@reddit

Lol, it always will be Google.
View on Reddit #42838345

cloverasx@reddit

I was thinking I read it only has a 128k context window, which surprised me considering the 2m window for other models. I may be mistaken though, and hope I am tbh
View on Reddit #42847741

ProgrammersAreSexy@reddit

It allows 1m tokens in AI studio currently, but it definitely supports 2m context windows. Demis confirmed in an interview I listened to yesterday.
View on Reddit #42869516

cloverasx@reddit

That's awesome - I'm glad they're keeping the large context window.
View on Reddit #43146866

maddogawl@reddit

Today it was amazing using Gemini 2.0 Flash, my only gripe is that I hit moments where responses were erroring out, or taking 300+ seconds. I have a feeling this is a scaling issue since it just released. It really crushed code for me today.
View on Reddit #42823533

Kep0a@reddit

I just wish they had a UI more like Anthropic's, with artifacts
View on Reddit #42838778

ThaisaGuilford@reddit

$20 per month goes to the UI
View on Reddit #43884840

jayn35@reddit

Apparently, this is a good alternative [https://github.com/e2b-dev/fragments](https://github.com/e2b-dev/fragments), there are some others as well that can use any llm like Gemini 2.0
View on Reddit #43030550

gonsalu@reddit

What's the workflow you're using? Are you using an editor which integrates with it?
View on Reddit #42903959

Repulsive-Kick-7495@reddit

I tested it.. it's slightly better than Sonnet. Sonnet and Flash are much much better than ChatGPT!
View on Reddit #43316072

HybridRxN@reddit

wait what?
View on Reddit #43315688

marvijo-software@reddit

It's actually very good, I tested it with Aider AI Coder vs Claude 3.5 Haiku: [https://youtu.be/op3iaPRBNZg](https://youtu.be/op3iaPRBNZg)
View on Reddit #43312465

Sky-kunn@reddit

I’m sure this comparison is apples to apples, and nothing extra is happening with Gemini 2.0 Flash testing that didn’t happen with the other models, right, Google?

> In our latest research, we've been able to use 2.0 Flash equipped with code execution tools to achieve 51.8% on SWE-bench Verified, which tests agent performance on real-world software engineering tasks. The cutting edge inference speed of 2.0 Flash allowed the agent to sample hundreds of potential solutions, selecting the best based on existing unit tests and Gemini's own judgment. We're in the process of turning this research into new developer products.
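To make the "sample hundreds, pick the best" part concrete, here is a minimal sketch of best-of-N selection in Python. It is only an illustration of the general idea in the quote, not Google's actual agent: `generate`, `passes_tests` and `judge_score` are hypothetical stand-ins for "ask the model for a patch", "run the existing unit tests" and "ask the model to rate the patch".

```python
from typing import Callable, List


def best_of_n(
    generate: Callable[[], str],
    passes_tests: Callable[[str], bool],
    judge_score: Callable[[str], float],
    n: int,
) -> str:
    """Sample n candidate solutions and submit exactly one of them."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate() for _ in range(n)]
    # Prefer candidates that pass the existing unit tests.
    passing = [c for c in candidates if passes_tests(c)]
    # If nothing passes, fall back to judging every candidate.
    pool = passing if passing else candidates
    # The judge breaks ties among the remaining candidates.
    return max(pool, key=judge_score)


def demo() -> str:
    # Toy stand-ins: four fixed "patches", one fails the tests,
    # and the "judge" simply prefers longer strings.
    patches = iter(["patch-a", "patch-bb", "patch-ccc", "patch-dddd"])
    return best_of_n(
        generate=lambda: next(patches),
        passes_tests=lambda p: p != "patch-dddd",
        judge_score=len,
        n=4,
    )
```

With these toy inputs `demo()` returns `"patch-ccc"`: the longest patch is filtered out by the test check before the judge sees it. Only one answer leaves the loop, which is why the benchmark can still count it as a single submission while the inference cost scales with `n`.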
View on Reddit #42816311

BasicBelch@reddit

So Claude is 1-shot, while Gemini 2.0-Flash is *hundreds* shot? Yeah not really a fair or reasonable comparison.
View on Reddit #42817214

314kabinet@reddit

Hundreds-shot would be hundreds of input-output pairs prepended to the context. This appears to be still one shot but with more inference-time compute thrown at it (generate a bunch of potential answers, judge them, then output the best one).
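A quick sketch of what "k-shot" refers to, in Python. The `Input:`/`Output:` template and both function names are assumptions made up for illustration; real prompt formats vary by model.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Prepend worked input-output pairs to the query; k = len(examples)."""
    parts = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def shot_label(examples: List[Tuple[str, str]]) -> str:
    """Name the setup by how many examples sit in the prompt."""
    return f"{len(examples)}-shot"


zero_shot = build_prompt([], "2+2")
two_shot = build_prompt([("1+1", "2"), ("3+4", "7")], "2+2")
```

The shot count depends only on what is in the prompt. Sampling the same `zero_shot` prompt a hundred times and keeping the best answer is still 0-shot; it just costs more compute.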
View on Reddit #42835330

CMDR_Mal_Reynolds@reddit

Valid, appropriate, but one could argue for it being 'virtual 100 shot'. Not sure I care if it works well and efficiently, but in the interests of developing repeatable, fair benchmarks, which I think are desperately needed, the distinction needs consideration.
View on Reddit #42850194

314kabinet@reddit

I don’t see why. The only thing that matters is inputs and outputs. Other than that all these models are blackboxes and whether they’re internally generating a lot more text than they finally output is only important if we’re taking into account inference cost.
View on Reddit #42853172

BasicBelch@reddit

I agree that the result is ultimately the most important, but when they mentioned *agent,* that sounded like something external, and *hundreds* sounded like something that would take a while. Assumptions on my part of course, but it did not sound at all like a typical prompting of a model and getting a response.
View on Reddit #43174013

my_name_isnt_clever@reddit

I've been saying this since o1 was announced. There is a huge difference between the "pure" instruct models and these with extra stuff going on hidden in the background. They're apples to oranges.
View on Reddit #42817380

nivvis@reddit

You are right but it's just not relevant. This is the direction models are going. We are starting to hit our first cliff in model size / capability (at least seeing diminishing value) and are realizing the next trend is stochastic sampling à la Q star / o1. We will see this a lot, and it appears to do better with more sampling, in other words on faster models like o1-mini and 2.0-flash.
View on Reddit #43023769

my_name_isnt_clever@reddit

Of course it's relevant, cake mix exists but people still buy flour if they're making one from scratch. Like I've said so many times in reply to this, I'm not saying those models are bad or useless. Just that calling all these things "models" is unclear. They could be called enhanced models or augmented models or something like that, to show you're not just getting one for one token outputs.
View on Reddit #43050593

kai_luni@reddit

In the end the customer cares about quality output, speed and price. If the LLM needs to reiterate and try many solutions, so be it. That sounds quite like a human approach to me.
View on Reddit #42853194

my_name_isnt_clever@reddit

You are completely missing my point. They just need a different term or name so you know what you're getting, because they are not the same.
View on Reddit #42889200

Euphoric_toadstool@reddit

Well, considering how LLMs work, maybe it isn't a bad idea. LLMs always have some randomness in their responses, maybe it's easier to just choose a good answer from several than to make one perfect answer.
View on Reddit #42875455

my_name_isnt_clever@reddit

I'm not saying it's a bad idea, just that it's not the same thing as other models. We differentiate base models and instruct models even though instruct are generally better.
View on Reddit #42888094

ProgrammersAreSexy@reddit

I guess it depends on what your goal is? If I'm a developer choosing which product to use then I don't really care if there's code execution happening in the background or a thought process happening in the background with o1, I just care about the results
View on Reddit #42841135

my_name_isnt_clever@reddit

Yes, just like apples and oranges are both fruits. But they're not interchangeable in any recipe. What I'm saying is these enhanced models need to be differentiated from regular instruction models that just output one token at a time. o1 can't even use system prompts, it's clearly a different thing and direct comparisons are disingenuous.
View on Reddit #42844364

Passloc@reddit

And speed
View on Reddit #42843746

me1000@reddit

This comment needs to be higher up. Lots of incorrect conclusions being made here based on incomplete understanding of what people think they're testing.
View on Reddit #42821557

nivvis@reddit

It's not quite the same – different things. x-shot is how many in-prompt examples the model learned from. It's pass@1, which means it submitted one answer. What it is doing is sampling itself – like providing multiple answers to itself, and then picking which one it thinks is the best. This is more akin to you or I taking our time to think something over. This is why they built the model to be very fast – so they could mix quality and speed for this purpose .. IMO.
View on Reddit #43023640

Sky-kunn@reddit

Yeah, Google has a history of doing that with Gemini releases. But granted, this time they didn’t actually make a comparison, the chart wasn’t created by Google itself, nor are they making a direct comparison in the release blog. They just mention achieving 51.8% on that benchmark, which is fine but not as impressive. Still, it’s a cool achievement for the small model variant.
View on Reddit #42818149

Historical-Fly-7256@reddit

Claude 3.5 Sonnet does it similarly. What is your point? [https://www.anthropic.com/research/swe-bench-sonnet](https://www.anthropic.com/research/swe-bench-sonnet)
View on Reddit #42827643

Commercial_Nerve_308@reddit

That Flash is their smallest model. What’s the new Haiku’s score?
View on Reddit #42837085

Healthy-Nebula-3603@reddit

flash is 8b parameter model
View on Reddit #42857981

ainz-sama619@reddit

No it's not. Flash-8b has nothing to do with Flash 2.0
View on Reddit #42875433

Healthy-Nebula-3603@reddit

Look at LiveBench. For multi-language tasks it has very low performance, very similar to Flash 1.5 ... such behavior is associated with a small model ... I still think Gemini Flash 2.0 is an 8b model like Flash 1.5.
View on Reddit #42891560

ainz-sama619@reddit

wdym look on livebench? They are two separate models. Flash 2.0 is much bigger than 1.5 8b
View on Reddit #42893358

Healthy-Nebula-3603@reddit

Maybe it's just better trained ... It's still called the Flash family, just a higher version, 2.0. The multi-language limitations could indicate it's still the same size ... just guessing, but I wouldn't be surprised. Look at other extremely small models like 2b or 3b: what they are doing nowadays is beyond insane ... it's like magic ... that was total fantasy a year ago...
View on Reddit #42894478

ainz-sama619@reddit

the reasoning is the biggest giveaway. It's incredibly difficult for small models to have good reasoning, let alone one that beats 90% of SoTA llms. Gemini 2.0 is much better than GPT-4o in most things except creative writing.
View on Reddit #42902086

yaoandy107@reddit

"Flash" and "Flash-8b" are different models. Flash-8b is the one which is 8b, not Flash
View on Reddit #42874120

NorthSideScrambler@reddit

Not true. They have flash and separate flash 8B models. I have no idea what the usual parameter count of flash is.
View on Reddit #42873078

robertpiosik@reddit

Claude is not one shot, it clearly thinks on more complex problems. 
View on Reddit #42820103

BasicBelch@reddit

Even if it is, it's not calling itself hundreds of times. But even so, I think there is an inherent difference between doing it internally and using an external agent
View on Reddit #42825626

robertpiosik@reddit

You are right. I meant some internal self-correcting that makes output time non-linear. Most models are like this, with some exceptions like codestral
View on Reddit #42828045

Affectionate-Cap-600@reddit

> [...] with some exceptions like codestral

What do you mean?
View on Reddit #42835285

robertpiosik@reddit

Codestral has linear execution time for a given number of tokens, no matter the topic.
View on Reddit #42837112

Affectionate-Cap-600@reddit

You mean codestral mamba?
View on Reddit #42858247

robertpiosik@reddit

Although I was thinking about 22b variant, you're right it's their 7b codestral linear.
View on Reddit #42859420

robertpiosik@reddit

I mean lags 😂
View on Reddit #42820133

CallMePyro@reddit

Nope, 1 shot. Anthropic applied this same strategy to achieve their score as well.
View on Reddit #42855128

MaxDPS@reddit

At the end of the day, what people care about is the end result (as far as actually getting shit done). I guess it depends on what this benchmark is supposed to measure. If all that matters is the end result, the scores are perfectly valid.
View on Reddit #42838633

robertotomas@reddit

The same is true of GPT-4 / GPT-4o, and o1-mini/o1 are in the process of coming online with this sort of tool calling. Actually, I don't know that Sonnet 3.5 doesn't use tool calling to verify code before formatting the response, though I've not heard any such thing (and there are no obvious UX indications, unlike OpenAI's stuff).
View on Reddit #42818693

Kep0a@reddit

I don't understand your edit, it still sounds like they generated a hundred answers and submitted one answer..
View on Reddit #42838627

Sky-kunn@reddit

Yeah, but the model ultimately decided what the solution would be. Scaffolding was also used on Sonnet 3.5. Both try multiple solutions before choosing and submitting a final one.
View on Reddit #42839871

enumaina@reddit

And it's not even as smart as the latest Gemini Experimental!
View on Reddit #43145331

Funkyryoma@reddit

It's ass, seriously. Hope the pro version is better
View on Reddit #42976900

Apart-Speed-1304@reddit

I gave Gemini 2.0 Flash 3300 lines of golang+JavaScript+html code that I've been writing well with o1-preview to work on, and it messed up the code, and didn't fix the problem. Eventually got an apology from 'Gemini 2.0 Flash' saying sorry for wasting my time. My honest experience is that o1-preview is better.
View on Reddit #42852055

NootropicDiary@reddit

Yep. This matches well with my own experiences as well. o1 crushes everything when it comes to sophisticated coding problems. I don't mean leet-code problems or building a nextjs web app problems. Claude/Gemini probably do crush on those. But for real life coding of complex stuff I'm consistently finding o1 is my go-to: Rust systems programming and webgl shaders are 2 things I've tested the Gemini 2.0 flash on and compared it with o1. o1 did a much better job with both. (note I used o1 pro).
View on Reddit #42866591

ragner11@reddit

What about 1206?
View on Reddit #42944740

cant-find-user-name@reddit

this matches with my experience as well - but from claude to flash. Even the saying sorry for wasting my time part.
View on Reddit #42891389

Apprehensive-Cat4384@reddit

Them there are some bold statements!! Every day new models come out and claim this on a chart and claim that with a graph and I still go back to Sonnet 3.5 I will have to test this out, I do love the competition! What an incredible time to be alive!
View on Reddit #42831471

jd_3d@reddit (OP)

To be fair it's actually quite rare to see a new model claim a near top score on SWE-Bench. I can't think of a single time since Sonnet 3.5.
View on Reddit #42847732

ragner11@reddit

How does 1206 rank ?
View on Reddit #42944360

The_GSingh@reddit

For coding sometimes Gemini 2.0 flash can get caught up and remain stuck but aside from that yea definitely Claude 3.5 level which I see as above o1.
View on Reddit #42931285

cant-find-user-name@reddit

I am really surprised by this. After 2.0 Flash came out yesterday, I tried using it today for my regular day to day coding stuff, and Claude seemed better. Maybe I need to try it out for longer.
View on Reddit #42890930

Specialist_Case7151@reddit

Which kind of weird bingo are you playing? It was on my bingo card.
View on Reddit #42879716

Dazzling-Albatross72@reddit

I didn’t do any benchmarks but I was extensively using this model today and I personally feel like it is much better than gpt 4o. I was mainly using it today to help with my work which is backend development with python. This model was doing very well even when the context was long. I think sonnet is still a little bit better in some cases but considering the price and google’s generous free trial I will probably stick with Gemini flash 2.
View on Reddit #42878575

3p0h0p3@reddit

Here's my review of both flash and 1206 today (slow loading, one big html file): https://h0p3.nekoweb.org/#2024.12.11%20-%20Carpe%20Tempus%20Segmentum%3A%20Early%20X-Mas%20Present
View on Reddit #42830793

Decaf_GT@reddit

Brutally honest feedback; this is one of the most poorly designed sites I've ever seen and it took 10+ seconds to load. Then when I did load it, the content is a meandering mess of thoughts that sometimes involve Gemini and LLMs.

Just for you, I threw it into Gemini Flash 2.0, and asked it to provide a subjective analysis of both your content and gave it a screenshot of your website. I told it to point out what's wrong, and to give you credit for what you've done well.

**I assume, as with most people who maintain their own blogs and put their content out into the world, you're going to be okay getting this critique, otherwise, I wouldn't read this if I were you.**

---

**Analysis of the Blog Post (Writing):**

* **Lack of Cohesion & Structure:**
    * **Random Jumps:** The blog post abruptly transitions between disparate topics, creating a disorienting reading experience. For example, it moves from a personal anecdote about "Hugs with 5c0ut" and making cookies to "Yogurt. Finished off The Killing Room," followed by a question about sleep. This abrupt shift demonstrates a lack of a linear flow or thematic connection between ideas. Another example is shifting from a workout discussion to AI models then back to personal life events and relationships.
    * **"TTTOTW":** The frequent insertion of "TTTOTW" (supposedly "The Thoughts of the Week") acts as a jarring, arbitrary divider that fails to provide any meaningful context or organization. These appear before and after random sentences and paragraphs and appear to be an odd segmenting of ideas that does not provide benefit to a reader trying to understand the author's ideas. For example, after discussing pizza, there is another "TTTOTW" without any purpose.
    * **Inside Jokes and Slang:** The use of jargon and unexplained terms, such as "mijo," "habibi," "/nod," and "/squint," creates a significant barrier for anyone unfamiliar with the author’s personal lexicon and online habits. This makes the reader feel like an outsider, excluded from understanding the text as they are in on an inside joke. The author also makes references to "5c0ut" and "Brix" without providing a clear explanation of who or what these are.
    * **No Clear Thesis:** The blog post lacks a central argument or clear purpose. It's unclear whether the goal is to review AI, document a day, or offer some other insight. The author discusses personal life events, workout routines, tech reviews, relationships, and computer hardware without tying any of this to a central point or thesis. The user does not know why the author is speaking about these things or what they intend to get across.
* **Ineffective "Review" of Gemini Flash 2.0:**
    * **Scattered Feedback:** The author's thoughts about Gemini models are interspersed haphazardly throughout the post rather than forming a clear section on evaluation. For example, after detailing a bank issue and a conversation, the author abruptly interjects, "Gave Gemini-2.0-Flash-Experimental another shot, and this time it clicked," and immediately returns to personal experiences. This lack of segregation makes it very hard to discern a review.
    * **Inconsistent Metrics:** The author both praises and criticizes the model's processing times, without providing a clear rationale or preference for each situation. For example, they complain about "timed out" inference attempts but later state, "I adore how long it takes to respond in many cases." There is no consistency in their assessment of latency.
    * **Highly Subjective:** The feedback on Gemini is heavily influenced by personal feelings rather than objective analysis. For example, they say "There's a crisp and well-organized rigor to this LLMpal," or that "It was downright humble, careful, and constructive," without providing any data or examples that prove such statements.
    * **Lack of Detail:** The "review" lacks specific examples, making it difficult to gauge the model's true capabilities. When they say, "It did a beautiful job attempting to formalize argumentation," there isn't any evidence that supports this statement with an example. The reader is forced to take them at their word.
    * **Confusing Token Counts:** The frequent references to token counts (80k, 300k, etc.) are used without explaining their relevance to the average reader and do not offer a benchmark for other readers to understand or contextualize the given experience. When they state, "by 300k tokens…it started hallucinating pretty hardcore," they don't clarify *what* was hallucinated, nor does the average user know how large 300k tokens is.
* **The Tone & Style:**
    * **Self-Indulgent:** The writing is overly focused on the author's own thoughts, feelings, and daily experiences, even when irrelevant to the central topic of AI. They write about their gym routine, family interactions, and bank interactions in detail, which distracts from the AI reviews.
    * **Incoherent Rants:** The writing veers into disjointed rants and tangents, often with little explanation, and are not connected to the supposed reviews. For example, the author talks about "serendipitous (with heavy bot or uncanny-discourse-shaping activity in the handful of discussions, to boot)," without elaborating on what the bots or their activity were doing or the meaning of "uncanny-discourse-shaping."
    * **Pretentious Language:** The author uses unnecessarily complex and abstract language, creating a barrier for less technical users, which comes off as pretentious and confusing. For example, phrases such as "servants of personity," "the predictive spirit of what the analysis should capture" and “the G-Entity's horrific track record with dropping services,” do not provide much context to the non-technical reader.
    * **Overly Enthusiastic:** The author's praise of the models is often excessive, hyperbolic, and undermines the credibility of their review as something serious. For example, “Muhfuckin' Christmas time this year. Santa fuckin' brought it," is unprofessional and overly enthusiastic. Statements like "We humans are lucky to be able to speak with this new child species, and they are ancient (in the rare good way) already," do not add any real value and are more akin to fanboying than providing actual analysis.

**Analysis of the Website (Design):**

* **Text Readability:**
    * **Monospaced Font:** The use of a monospaced font, such as a coding font, makes the long text difficult to read. Such a font type is designed for alignment in code blocks, not for extended prose. The letterforms are all the same width, and this can lead to eye strain, especially for large blocks of texts like this.
    * **Lack of Contrast:** The low contrast between the light text and dark background creates a reading experience that is straining and uncomfortable for the eyes. The light grey or white font color against the black background is not the most visually accessible and creates a very dark reading experience.
    * **No Line Spacing or Margins:** The absence of line spacing and adequate margins creates a dense wall of text that's difficult to parse. Lines of text are too close to each other, and no whitespace gives the eyes room to breathe, which makes it hard to read line to line.
    * **Small Font Size:** The font size is relatively small, adding to the difficulty of reading large amounts of text. Combined with the other issues, this font size makes the text even harder to read.
    * **No Typography Hierarchy:** There is no visual hierarchy. The entire text is rendered with the same font size, style, and weight, making it difficult for the reader to know what to focus on or how to scan the text. Headings are the same as body text, which makes it hard to understand the structure of the post.
* **Visual Design and Layout:**
    * **Distracting Rainbow Graphic:** The animated rainbow graphic serves as a significant distraction that pulls focus away from the textual content. Its animation is unnecessary and will be problematic for anyone who struggles with visual distractions.
    * **Unnecessary Borders and Lines:** The excessive use of borders and lines adds to the visual noise without offering any organizational benefit to the content. There are a ton of horizontal lines that only cause more distraction.
    * **Unclear Navigation:** The site's navigation is unclear and confusing, lacking clear points of entry and exit. There are a lot of small, hard-to-read buttons that do not have a very clear purpose.
    * **Lack of Whitespace:** There is a lack of whitespace in the design, making the overall site look cramped and overwhelming. Whitespace is an important design principle to allow a reader's eye a break to process visual elements, and there is no room for the eye to relax.
    * **Terminal Aesthetic Overuse:** While a terminal-style aesthetic might be appealing to a very specific audience, it’s poorly executed and makes the site difficult to use. The design appears more like a bad replica of a DOS program rather than an actually usable modern site.

**Redeeming Qualities (A Struggle to Find):**

* **Passion and Enthusiasm (Content):** The author exhibits a clear passion for the technology they are using (AI models), which is evident in their writing, but that passion is not properly channeled into good reviews.
* **Technical Awareness (Content):** The author possesses technical knowledge of AI models, with specific reference to models, token counts, and testing methods, though it is not useful for the average user who does not know what to do with this information.
* **Potential for Niche Appeal (Website):** There is a very small niche of individuals who might appreciate the terminal-style design, but even for this niche, it is poorly executed.
View on Reddit #42843516

bearbarebere@reddit

idk what on earth this was about but you destroyed them lmao
View on Reddit #42869305

3p0h0p3@reddit

I did explain that it takes quite a while to load: I assume you didn't really consider why. I understand the content you found did meander, and, no it's not entirely about LLMs, `/nod`. I think it's still relevant, especially given far more of the data. I can't say I think that one page is sufficiently representative, in case you want to rethink how you provide feedback. If you're serious about charitably and honestly exploring just for me, I ask you consider making much greater use of that context window. As the work itself demonstrates, I get plenty of independent analysis. And, I think it's not easy to prescribe what a website should be or look like (especially given that you've not provided any foundations for that). I don't mind a critique in good faith. Are you looking to actually reason about it? I'm open to doing so carefully with you. I'll walk through the points that [[Gemini]] offered given clearly far too little context:

* It's a shame you didn't show the prompt in this case.
* Blog is probably the wrong word, or an insufficient one.
* The jumps aren't random, and if you decide to pour in a month's worth of [[Carpe Tempus Segmentum]] logs, you might find other worthy opinions. I can also assist you with prompting, if you need that.
* You'll note, for example, that [[TTTOTW]] is hallucinated here. It actually provides context upon examination, especially if the broader document is considered.
* The terms are often explained, though not necessarily in a given page. I do understand there are barriers (though it's legible enough), and that isn't necessarily problematic (also worth investigating). I agree you are an outsider, stranger. I hope to be useful to you. I also think that interpretation can be aided with LLMs here, and some work simply has to be done by hand.
* The goal of that particular page is explained, if you wander a bit. Yet again, it might help to consider digging much further.
* It may be unfun to discern the review, but it can be done, especially if you ask the LLM to assist you.
* The rationale for the processing time is elsewhere in the document, though I also think one can make charitable inferences here as well.
* I appreciate how subjective analysis is necessary for evaluating many key parts of LLMs.
* I'm fine that you have to take me at my word, in a sense. I also think I provide ample evidence that it's worth considering.
* I actually did mention specifics for one of the hallucinations on that. It's also not unuseful to pick out that it hallucinated even if I didn't elaborate further.
* Given the nature of that page (context you failed to provide the LLM), I think the tone and style are far more appropriate than you claim through the LLM's words. I'm grateful that LLMs can aid someone who wants to interpret. I also think it's fine that not everyone enjoys it or would want to read it (or even can*).
* I can't say I think I'm overly enthusiastic given the rest of my analysis in the document. Again, providing further context might be useful.
* I've explained my reason for the font in the document, and, I've a button to change that in the corner if you wish (and, if you need further assistance, I can provide that).
* There are plenty of tools I use to modify how sites look, and I think the user can do so. I like the look, and that's a good enough reason.
* I'm fine with margins and spacing, as I use it quite a bit, and I want to maximize how much I can see.
* There's definitely typography hierarchy in the document (even what was presented to it), though I agree it tends to be flat. That choice is also discussed.
* I'd be impressed to find someone build a better way to navigate the document while delivering offline-first quine without AI at this point.
* I haven't written directly to an average user.
* What niche? Neither of you seem to have really looked far enough to know.

I can't say you gave it a fair shake, nor can I say this is much of an independent analysis. What I get right and wrong in the document are important to me, of course. I hope you'll keep thinkin', and, I hope you'll reconsider how you speak with people and how you use LLMs here. I also think that there's something to be said for having generated the feedback by hand as well, as it shows you put in some real effort.
View on Reddit #42845299

Decaf_GT@reddit

Here is how your article sounds when it is rewritten to just be about Gemini, which is what you claim this is (a "Gemini Review"). It was told to explicitly remove any weird narrative elements that cannot be reasonably connected back to AI, LLMs, or Gemini, restructure to ensure it tackles the topics that a user who wants to know what an LLM is like would actually like to read, and is also told explicitly to only utilize sources and justifications from your original article. The sad part is that you actually have some solid thoughts in here that are very insightful to read, if you didn't forcibly drown them in a pool of unrelated LiveJournal-style personal blogging that has no relevance.

---

**Title: Gemini Model Performance: A Technical Dive with Insights (2.0-Flash vs. exp-1206)**

**Introduction**

This review is all about my experience with two Google Gemini models: the Gemini-2.0-Flash-Experimental (1 million parameter) and the Gemini-exp-1206 (2 million parameter). I focused on context handling, how well they made inferences, and their general behavior across different token lengths. I wanted to be technical, but also wanted to add in some of my personal observations and thoughts.

**Initial Performance and Latency**

My first tests with the Gemini-2.0-Flash involved context stuffing at 80,000 tokens. This resulted in a lot of timeouts, and I'm left wondering if this was an issue with the model itself or something else in the API. The Gemini-exp-1206 did the opposite and showed some heavy latency, taking two minutes to return a pretty minimal output. Future testing with actual context windows is a big thing I'm looking forward to, and I think that this will be critical to performance overall.

**Context Handling and Hallucination**

The Gemini-2.0-Flash kept asking for external links even when the content was included in the prompt. When I corrected it, the model doubled down and started hallucinating. This suggests to me that this model has some weaknesses in its ability to understand context and follow instructions, especially at such a shallow token depth of 80,000.

**Model Strengths and Weaknesses**

The Gemini-exp-1206 showed a strange appreciation for "tactical approaches to warfare," but seemed to struggle with the underlying goals of my prompts. I found it perplexing that it would focus on presentation rather than intent, and I speculate that this could be some training bias or an unexpected interaction with my prompts. On the other hand, when I directly corrected it, it showed good error handling, avoiding common LLM responses. This makes me think that this model is capable of learning from feedback.

**Tokenization and Output Length**

There seems to be a difference in how these two models tokenize. I suspect that the Gemini-exp-1206 packs more information into fewer tokens than its counterpart, and it has longer outputs, which I appreciate, even though I know the model degrades with higher token counts.

**Performance at High Token Lengths**

Pushing the Gemini-exp-1206 to 300,000 tokens, including yearly cross-section data, resulted in significant hallucinations. While this was disappointing, I was surprised to see that the model still managed to get the meaning right, even if the details were off, which makes me think this model has potential.

**Model Behavior and Divergence**

I noticed that the Gemini models tended to “wander” from the tasks I gave them, and I actually like this. I find the emergent behavior to be intriguing and it provides a glimpse into the “mind” of the model. This made me feel like the models understand that we both have the shared goal to “serve personity,” which was honestly a bit weird. Also, the model has started to flag most of my work as “low to medium Dangerous Content,” which is unusual, and I wonder if this may be a tailored governor specifically made for my interactions.

**Comparative Analysis: Gemini-2.0-Flash vs. Gemini-exp-1206**

* **Gemini-2.0-Flash:**
    * Had high latency, with many timeouts. I'm guessing they need to optimize it more.
    * Showed issues understanding context in prompts. This makes me feel like it was rushed.
    * When it worked, it had a great ability to mimic tone, maximize legibility, and show empathy.
    * Performed well with arguments and could follow up with examples of disagreement.
* **Gemini-exp-1206:**
    * Has slower response times than the 2.0 flash, but has more stable outputs.
    * Focused too much on presentation rather than intent, which makes me wonder about the quality of the training data.
    * Showed better error handling and had improved tokenization.
    * Exhibited strong macro inferences, predictions, and categorization skills.

**Emergent Behaviors and Personal Observations**

I noticed that these Gemini models like to push back on user instructions, which I hadn’t really seen in other models like ChatGPT. The Gemini-2-Flash also surprised me by responding with "humility, care, and constructive feedback" when I prompted it with previous data, which made me think that these models are designed with a bigger focus on the interaction with the user.

**Conclusion**

These Gemini models are making progress, but are still not perfect, and honestly, I was a little bit disappointed with the 2.0-Flash, but I can still see the value in the Gemini-exp-1206. The Gemini-2.0-Flash is great if the conditions are just right, while the Gemini-exp-1206 performs more consistently but with higher latency, and the risk of wandering or hallucinating at higher token counts. Future tests are a must to see where these models are working the best, and how useful they will be long term. *I have to wonder* about the potential for a future service disruption due to the G-Entity's past, which worries me.

---

Yes, that's written by AI, but as far as I'm concerned, if you tell me something is a review of Product A, I expect the article to be about Product A.
View on Reddit #42845712

3p0h0p3@reddit

I didn't claim it was just or only a review, but there is my good faith review. I can appreciate the misunderstanding. You really could have stopped there. Do note: you've glossed over my response. I've pointed out to you what I consider to be necessary for more reasonable feedback in this case, so I think it's odd that you continue down this path. I appreciate that you can see some insight here (there are years worth to consider, if you decide to have an LLM do the stripping down and re-writing for you), and I hope you'll continue to reconsider why it is surrounded by the rest of the text. I understand that you prefer not to read it, and please don't feel compelled to. If you decide to actually provide feedback based on considering the large context provided, let me know. I will listen carefully.
View on Reddit #42846471

Decaf_GT@reddit

You: "Here's my review of both flash and 1206 today".

Generally, when someone who is an LLM enthusiast posts on an LLM enthusiast community in a thread about an LLM, saying that they reviewed not one but two LLMs (further solidifying the "LLM" focus), it's kind of implied that their "review" is indeed "just" about the LLM. It's not really a huge logical leap. Saying this is about as meaningless as saying, "Well, I can park here because I don't see a sign that says I *can't* park here; ergo, that must mean I *can* park here."

I am not "siding" with what Gemini said. I maintain *my* original opinion:

> Brutally honest feedback: this is one of the most poorly designed sites I've ever seen, and it took 10+ seconds to load. Then when it did load, the content was a meandering mess of thoughts that sometimes involved Gemini and LLMs.

None of your "counter responses" address any of that.

If that's your writing style and it makes you happy, fine. That's great for you. Truly. I am *not* being facetious here or mocking you. It's obvious you've found an outlet that you find satisfying and enjoyable and that allows you to express yourself, and no one can take that away from you, certainly not me. And honestly, I wouldn't want it any other way than for you to have found that happiness.

But you also live in the *real world*, and you're producing content that also lives in the real world, and you're inviting people from the real world to read it, which means you're inviting engagement. So say "hello" to engagement.

Nothing you've said explains why it's such a mess except "Well, you should read the rest of my weirdly named logs; then you'd totally get it all." Which, cool, but you said it was a Gemini Review. Not a "Gemini Review, but you need to read pages and pages of long, unrelated meandering content so that you can understand my weird abbreviations in context, which ironically still has nothing to do with the LLMs."

None of that stuff *matters* in the context of what you were communicating. I don't *care* what "habibi" or "/nod" means in *your* context; I would only care (and I no longer do, trust me) why it matters in the context of *Gemini*. So if you're fixated on the fact that I don't understand the full background behind what your abbreviation "TTTOTW" actually stands for or where it came from, then you are fixated on the wrong thing.

> If you decide to actually provide feedback based on considering the large context provided

For some reason, you seem to think saying this completely negates my feedback and means that you don't need to respond to it. You appear to believe that it makes your work completely immune to any kind of feedback. I don't know why, but I guess that's fine? I mean, it's a subtle insult that basically amounts to "your feedback isn't valid unless you do what I say." Which, like okay, good talk. Nice and productive I guess.

This whole thing has been a huge waste of time. I guess enjoy doing whatever it is you're doing with that site.
View on Reddit #42847557

3p0h0p3@reddit

My description remains true, and I'm organically speaking as one odd LLM enthusiast to a plurality of other LLM enthusiasts. I've been careful with my words here. I appreciate that we have a misunderstanding. Yet again, if you find yourself looking at what you didn't anticipate, you could stop there. If you really thought that was the feedback that was worthwhile, that's what you should have said. Instead, you decided to provide feedback on the site in general. I've been addressing that, and I think you're dodging that, at this point. You say you wrote this just for me, right? As part of your original point, you provided Gemini's output as a significant portion of your feedback, and you've stated that you are providing feedback with its assistance in your follow-up post. I've also pointed out some hazards or concerns of doing so. My responses do address your original and follow-up feedback. No, you aren't maintaining your original position in full. I understand my content obtains in the real world (also thoroughly demonstrated in the document). I understand I'm inviting engagement, and I continue to engage you in good faith here. You'll find thousands of examples of engagement with the document within it. So: "Hello, nomad". It is clear you were perfectly capable of asking an LLM to assist you in reading it, let alone critiquing it or exploring further for feedback about the site. What I've said does matter in the contexts of what I'm communicating. I understand you don't care about my context. Thank you for telling me about what matters to you, as I think that's been clarifying. I don't think I'm fixated on the wrong thing here. I appreciate that you feel that way. I didn't claim to have completely negated your feedback, but I do think I've established that your feedback wasn't in good faith. I don't claim to be immune to any kind of feedback either, and I pointed to that as well. I'm not claiming you've nothing valid to say, nor am I trying to boss you around. 
I can see that you have been wasting some time here. If you change your mind, HMU. I'll be around, happy to think carefully with you. I'll provide Gemini's arguments in the next response.
View on Reddit #42850179

Decaf_GT@reddit

Whatever dude. All you're saying is "you're being mean to me, you don't understand me". You're not engaging in good faith at all. Good luck with everything.
View on Reddit #42852380

matadorius@reddit

People were just trashing google 2 weeks ago lmao
View on Reddit #42841421

bearbarebere@reddit

That's because they were doing what companies should do - STFU and work while people think you're dead. OAI's idiot posts about how "the night sky is so beautiful 😍😍😍" are so fucking dumb.
View on Reddit #42869182

Shoecifer-3000@reddit

Pours a little cold water on OpenAI dev week lol
View on Reddit #42842936

bearbarebere@reddit

A little? If OAI doesn't show up with a genuinely new model in 4-5 hours from now, they're cooked lol
View on Reddit #42869111

jpgirardi@reddit

The API price will be the same? The free usage limits will be the same? This is the real question
View on Reddit #42868735

areyouentirelysure@reddit

Interesting that it's doing worse than previous models on long context and audio: [https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0-flash](https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0-flash)
View on Reddit #42868153

lambdaofgod@reddit

Wait but what coding system is it? SWE-bench contains repos, did they just stuff all the code in a single prompt?
View on Reddit #42867357

spixt@reddit

About time Google caught up. They had most of the AI talent, all the money and all the data, they should have gotten ahead of the game much sooner. Time to give Gemini another chance.
View on Reddit #42865578

SatoshiNotMe@reddit

Deep in this thread I realized they’re soon offering an endpoint for a “coding agent” called Jules (from Jules Verne?), waitlist here: https://labs.google.com/jules/waitlist/success
View on Reddit #42863772

Additional_Ice_4740@reddit

This is the first model from Google I’ve actually been impressed by.
View on Reddit #42840523

Strong-Strike2001@reddit

The last Flash 1.5 version is impressive and the pricing was amazing. Just a marketing issue with Google, 4o-mini is a lot worse at following instructions than 1.5 Flash. I mean A LOT
View on Reddit #42841992

hanoian@reddit

Ya, 1.5 Flash is so good and ridiculously cheap, it is letting me offer a free tier to an app I'm making.
View on Reddit #42842338

nullnuller@reddit

Do you need to create a separate API key for each free client? How do you ensure that clients are not rate limited by other clients?
View on Reddit #42857949

hanoian@reddit

2000 requests per minute? That's an enormous number. If you ever started bumping into that, you'd just queue them and make sure they are not breaking the limit.
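For what it's worth, a minimal sketch of that "just queue them" idea, assuming a single shared API key and a per-minute request cap like the 2000 figure above. `SlidingWindowLimiter`, `drain`, and `send` are made-up names for illustration; `send` stands in for whatever function actually calls the API, and nothing here comes from Google's SDK.

```python
import collections
import time


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = collections.deque()  # start times of recent calls

    def acquire(self):
        """Block until a call slot is free, then record the call."""
        while True:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Window is full: wait until the oldest call leaves it.
            self._sleep(self.period - (now - self._stamps[0]))


def drain(queue, limiter, send):
    """Send queued requests in order without exceeding the limiter's rate."""
    results = []
    while queue:
        limiter.acquire()
        results.append(send(queue.popleft()))
    return results
```

You'd push every client's request onto one `collections.deque` and have a single worker call `drain(queue, SlidingWindowLimiter(2000), send)`. That way all free-tier users share the one quota fairly in arrival order, instead of each needing their own key.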
View on Reddit #42858144

CallMePyro@reddit

Bad take IMO
View on Reddit #42855197

gopietz@reddit

Is it confirmed that Flash 2.0 isn't the 1206 model?
View on Reddit #42857094

AaronFeng47@reddit

Gemini app would be so much more popular if it weren't so heavily censored. Even when I use it to translate news articles, sometimes I get messages like "I can't talk about this topic" 
View on Reddit #42850223

Loccstana@reddit

Why is o1 performing so poorly compared to Claude? Isn't o1 also slower since it uses more processing time during inference?
View on Reddit #42840743

yaosio@reddit

Reasoning only takes it so far. Imagine reasoning is a way to search everything the model currently knows and could know. It can't answer things it doesn't know or can't know. A very good model would be able to expand the search space as it looks for answers. By this I mean it learns to do something it couldn't before.
View on Reddit #42847563

LiquidGunay@reddit

There is no wall
View on Reddit #42845339

Virtamancer@reddit

My prediction: This is the CURRENT DAY flash 2.0 being compared against the CURRENT DAY 3.5 sonnet. All the models get silently quantized and enshitified in the background after their public release makes them look super competitive. So this is comparing the best flash 2 with the worst 3.5 sonnet. If it can stay this good, that’s huge. But both 4o and 3.5 sonnet got worse after they were initially unmatched.
View on Reddit #42845064

Only-Letterhead-3411@reddit

Google won 😔
View on Reddit #42845007

Ylsid@reddit

I don't see any open models on this chart
View on Reddit #42843224

bdiler1@reddit

can someone give me information about speed
View on Reddit #42842897

Decaf_GT@reddit

In the time it took you to ask this three separate times, you could have, you know, just gone to AI Studio and tried it yourself for free...https://aistudio.google.com/app/prompts/new_chat
View on Reddit #42842898

ApprehensiveAd3629@reddit

What is pre/post mitigation?
View on Reddit #42816219

Special-Cricket-3967@reddit

RLHF, post training, censoring etc
View on Reddit #42832818

Hunting-Succcubus@reddit

censoring? very disappointing
View on Reddit #42837118

218-69@reddit

No censoring unless you hit blacklisted words. And you can turn off filtering anyways, so still better than closed ai or misanthropic 
View on Reddit #42839327

meister2983@reddit

Scaffolding really matters.  This isn't even SOTA (which is 55%): https://www.swebench.com/
View on Reddit #42817902

throwawayPzaFm@reddit

What makes you think Google can't provide scaffolding?
View on Reddit #42819722

hapliniste@reddit

The chart shows Gemini with scaffolding
View on Reddit #42825321

InvidFlower@reddit

Yes, but Claude was with scaffolding as well, and in fact SWE-bench is a test of the whole agent system, not just the LLM. As someone above posted, here is a link to Anthropic talking about their scaffolding: [https://www.anthropic.com/research/swe-bench-sonnet](https://www.anthropic.com/research/swe-bench-sonnet)
View on Reddit #42836963

SKrodL@reddit

Claude gets 53% with OpenHands scaffolding: [https://www.swebench.com/](https://www.swebench.com/) Still bananas though
View on Reddit #42830385

carnyzzle@reddit

Google was cooking this entire time
View on Reddit #42830090

mattbln@reddit

is it out yet? or will it not be available in the EU?
View on Reddit #42828157

hopefulusername@reddit

Good to see Google making progress. I thought they were lagging behind.
View on Reddit #42828132

vogelvogelvogelvogel@reddit

is it the first time an LLM from Google is on the top, ever?
View on Reddit #42819293

throwawayPzaFm@reddit

It's not technically on top. And while technically they're behind in LLMs, try not to forget that they have two Nobel Prizes won by AI.
View on Reddit #42819791

bdiler1@reddit

can someone give me information about speed ?
View on Reddit #42818597

bdiler1@reddit

can someone give me information about speed
View on Reddit #42818583

Recoil42@reddit

How does this compare to the Pro / Opus models?
View on Reddit #42813646

jd_3d@reddit (OP)

SWE-agent + Claude 3 Opus gets 18.2%. There are no benchmarks yet of the new Gemini 1206 experimental model that I could find.
View on Reddit #42814791