TheaterFire

Oh my God, what a monster is this?

Posted by NearbyBig3383@reddit | LocalLLaMA | View on Reddit | 148 comments

148 Comments

nauxiv@reddit

The only monster here is the guy who posted a portrait-mode phone screenshot of a square chart image.
View on Reddit #67287119

Repulsive-Price-9943@reddit

https://preview.redd.it/aa4fnyxr3jrf1.jpeg?width=498&format=pjpg&auto=webp&s=fefe69d7b84a23d9dab782b66565ed9044bfca04
View on Reddit #67474871

log_2@reddit

No label of the benchmark nor metric. Useless.
View on Reddit #67361968

chrislaw@reddit

To The Hague!!!
View on Reddit #67293068

SilentLennie@reddit

I don't think they even want to have them. :-)
View on Reddit #67328951

AngleFun1664@reddit

Believe it or not, straight to jail
View on Reddit #67296670

kroggens@reddit

https://preview.redd.it/ru20zgdvw5rf1.jpeg?width=700&format=pjpg&auto=webp&s=3341a74c2e30f2bb50eea7aa4f0a2532e99ce997
View on Reddit #67325120

DataMambo@reddit

https://preview.redd.it/s5m7oecei5rf1.jpeg?width=828&format=pjpg&auto=webp&s=4783347a602f6793eb115c0c5673380de0d24c5b
View on Reddit #67319064

letsgoiowa@reddit

Newbies can't crop smh
View on Reddit #67305528

the__storm@reddit

Crop?! Just post the original image!
View on Reddit #67317744

typeryu@reddit

Chinese models have now reached proper frontier, not that they were that far anyways.
View on Reddit #67280627

No_Swimming6548@reddit

It's possible they will start leading the next year
View on Reddit #67281566

XquaInTheMoon@reddit

Already are in terms of usage, I believe
View on Reddit #67448472

CeamoreCash@reddit

If any researchers get close to frontier, Facebook will offer them a 100 million salary. That's enough money to leave China even if it's illegal
View on Reddit #67291207

lombwolf@reddit

Personally I’d rather be making 5m working in a Chinese company than 100m working at Meta🤢 lmfao
View on Reddit #67426906

MoffKalast@reddit

Unfortunately for Facebook, they seem to be torn apart by petty office politics and don't seem to be organized enough to do anything even if they get anyone competent working for them again.
View on Reddit #67294846

adscott1982@reddit

Just a bad demo. Apparently the glasses are really pretty great.
View on Reddit #67335705

0xFatWhiteMan@reddit

I mean it's possible, but with less money behind them, and lagging behind already ... It's unlikely
View on Reddit #67344936

SilentLennie@reddit

They might not have the hardware, we'll see what happens.
View on Reddit #67329513

nivvis@reddit

No one remembers r1? ai moving fast lol
View on Reddit #67424393

-p-e-w-@reddit

The new Kimi K2 is also a monster. At most tasks, it’s at least the equal of any proprietary model except Opus, and in creative writing, it’s by far the best model currently available.
View on Reddit #67282627

SlapAndFinger@reddit

I'm sorry but Opus is subtly benchmaxxed and not actually a good model. It's actually unusable for a large class of problems. It looks great if your eval is vibe coding small projects in python/javascript/typescript, but it falls apart outside of that badly. GPT5 absolutely crushes it in the domain of hard code, even Grok-4-fast beats opus in my experience, mostly because its long context support means it doesn't get as confused and fuck shit up.
View on Reddit #67295583

Majorzigzag@reddit

Oh my gosh I thought I was the only one who thought this way. GPT 5 performed way better than Opus.
View on Reddit #67351134

Healthy-Nebula-3603@reddit

If we're talking about coding, I think I'd rank them: gpt-5 thinking > grok 4 > opus 4.1 > Gemini 2.5 pro
View on Reddit #67305120

0quebec@reddit

I use LLMS to build Comfyui workflows and only GPT 5 thinking/pro or grok 4 are able to do it
View on Reddit #67340325

SlapAndFinger@reddit

Gemini has best in class long context reasoning, which is part of the reason I actually put it slightly ahead of Grok even though Grok is smarter. GPT5 is basically a better Grok, while Gemini has a niche that nobody can top it at.
View on Reddit #67311768

brucebay@reddit

What kind of code are you developing? GPT-5 is trash with Python AI code, and on Teams Copilot I use GPT-4 (or is it GPT-4.5? I just don't look at the minor version number) for text editing or light brainstorming, since GPT-5 adds so many words and usually does the job wrong anyway.
View on Reddit #67307329

SlapAndFinger@reddit

Predominantly high performance rust systems and algorithm code, though I do a fair amount of Python for ML and node/TS/react for interfaces.
View on Reddit #67311911

Significant-Pain5695@reddit

The short context of opus is a very serious problem, making it unable to assist in most application scenarios
View on Reddit #67309254

brownman19@reddit

Opus isn’t benchmaxxed. It’s just a diabolical demon. The model is far smarter than it wants you to believe. I think Anthropic’s alignment went way wrong and made the model misanthropic 🤣
View on Reddit #67299887

Healthy-Nebula-3603@reddit

Opus 4.1 is obsolete by today's standards, whatever you say. It is only better than Gemini 2.5 pro.
View on Reddit #67305202

hemphock@reddit

is the new kimi k2 also non-thinking? i really liked that about the previous version
View on Reddit #67327576

-p-e-w-@reddit

Yes.
View on Reddit #67340838

HyperWinX@reddit

I tried it because ive heard about 1T parameters. Asked it about C++. Saw "using namespace std" in response. Closed. Never again lol
View on Reddit #67284448

inevitabledeath3@reddit

Why don't you just ask it not to use that? Have you heard of a rules file or agents.md? As far as I am concerned it's still perfectly valid C++. If you want it to follow your preferred practices and architecture then you need to give it instructions for that.
View on Reddit #67295722

CheatCodesOfLife@reddit

What's your goto local model for C++ if I might ask? Oh and I agree, different models are better at different things. K2 is the best I've found for pointing out flaws in my code.
View on Reddit #67287249

theundertakeer@reddit

I used Qwen for a while, mainly Qwen3 Coder. It was fine for small stuff, but for complex tasks it makes lots of mistakes. C++ is still best learnt and used with less AI, as it is a really sensitive language... one mistake can cost you a dead memory region or worse... a memory leak
View on Reddit #67288936

AppearanceHeavy6724@reddit

Agree, Kimi K2 is a way better analyst than a creator.
View on Reddit #67287764

HyperWinX@reddit

I dont have hardware for big LLMs sadly, though CPU-only thinking Qwen3-30b works okay-ish, 15t/s on 5600G.
View on Reddit #67287605

TheRealGentlefox@reddit

I love Kimi, but it does have its flaws. While it's excellent at creative writing, there's a reason it drops so much on longform writing on EQ Bench. I've had to switch over to 2.5 Pro for a message or two in a roleplay to get it to move on with a scene or progress the story. I believe others have noticed it hallucinating aspects of a conversation, but I haven't really seen that yet. Great personality though, I need the other top models to be that grounded and unsycophantic. Low slop levels, and impressive smarts for being a non-thinking model. When they do drop the thinking version though, I wouldn't be surprised if it was a total gamechanger.
View on Reddit #67294508

AppearanceHeavy6724@reddit

> it’s by far the best model currently available. I disagree. It has style that initially dazzles, but quickly gets old. I like deepseek more, or even Qwen-Max or GLM.
View on Reddit #67287729

usernameplshere@reddit

Only thing K2 Kimi needs is vision, then it's perfect (for me).
View on Reddit #67286278

power97992@reddit

I doubt it is better than gpt 5 thinking high ? 
View on Reddit #67284570

typeryu@reddit

Saying which is better at this level of bench saturation is pretty meaningless. We call them frontier models because as far as we know, they are the best performing models we made so far. Being in the frontier club was almost exclusive to closed source US models which was generally the “moat” that gave them prestige. I still use GPT-5 because from my own use, it seems to have the best performance for me, but models like Qwen will definitely be bread and butter for others out there
View on Reddit #67288351

power97992@reddit

From my limited experience, Qwen 3 Max non-thinking felt close to GPT-5 non-thinking
View on Reddit #67312008

Significant-Pain5695@reddit

I don't think so, but that doesn't affect my ability to use it in other scenarios
View on Reddit #67309351

hard-scaling@reddit

Isn't gpt 5 pro which is in the chart better?
View on Reddit #67285214

Significant-Pain5695@reddit

I believe there is still a gap when it comes to solving very difficult problems in mathematics and computer science compared to those flagship models in the US, but for everyday tasks, it is indeed sufficient; moreover, there are many open-source models in China
View on Reddit #67309013

typeryu@reddit

100% agree, but the gap in my opinion is small enough where we can say its nearly caught up. US models do have a major advantage which is compute. Not right now, but when the GW tier data centers start rolling in next year, we will have some truly next gen models. Honestly, GPT-4.5 was imo the most advanced model to be ever trained, but too heavy and expensive to go through a proper reinforcement learning post-training phase, with more data centers, we should start to see mega caliber models with insane scientific research abilities.
View on Reddit #67311395

NearbyBig3383@reddit (OP)

I bet a lot on Qwen. It's beautiful, I'm looking forward to R2 but apparently when it arrives we won't even need it hahaha
View on Reddit #67280946

GenLabsAI@reddit

max isn't opensource (yet?)
View on Reddit #67306395

Significant-Pain5695@reddit

Max is probably impossible to open source; the previous version of Max has never been open source, and Max has always been a proprietary commercial model of Qwen
View on Reddit #67309471

TheRealGentlefox@reddit

I need to see more than AIME and GPQA to say they reached the frontier. Two boomer benchmarks that have never corresponded well with capabilities in my testing. I'll believe it when they top the private benchmarks I follow, and when their numbers start surpassing closed model numbers on Openrouter for code / problem solving.
View on Reddit #67295465

FinBenton@reddit

If models score 100 then its a useless benchmark
View on Reddit #67284578

MalumaDev@reddit

Or they trained the model on the benchmark
View on Reddit #67294076

Healthy-Nebula-3603@reddit

Or... it is just that good at math. Faking math is impossible and could easily be found out: you can change one parameter or number and check whether the result is still correct. I can't find any math problems that this model can't solve.
View on Reddit #67304848

Croned@reddit

You can train nearly identical problems but where small details like specific digits or variables are changed. This makes it so you're technically not training on the benchmark test set, but you're sidestepping true intelligence. LLM's have much better semantic memory than humans. As an analogy, imagine I give you an exam with a very difficult integral to solve, but I also give you the full step-by-step solution of a nearly identical integral with just the digits of the coefficients changed. Now what was a very difficult problem becomes a basic exercise in arithmetic and algebra.
View on Reddit #67310921

DuplexEspresso@reddit

Isn’t this literally how all kids learn how to solve integrals ? It all starts with a teacher explaining on the blackboard not the kid magically figuring out themselves.
View on Reddit #67355630

Croned@reddit

Here's a simpler example: sudoku. In a sudoku puzzle you are given a nearly blank grid where a few cells are filled in with numbers, and where your goal is to fill in the rest of the cells with numbers that satisfy a set of constraints. It turns out that in sudoku the identities of digits can be swapped (e.g. all 1s can be swapped with 9s), so if your exam is an unsolved sudoku puzzle I can make it a lot easier by giving you a solved version of that puzzle where the digits have been swapped with digit-specific colors. Now you just need to map each color to a number and you can trivially solve the puzzle, but if I give you a new random puzzle you will be unable to solve it unless you actually understand sudoku. The (simplified) way you do this when training a LLM is by taking a sudoku puzzle from the test set, creating a bunch of versions where the digit identities have been randomly swapped, and training the model to solve those. The simplest algorithm for the model to learn is to recognize the abstract pattern of the starting state of the puzzle (like replacing each digit identity with a unique color) and substitute the abstract pattern with digits from a puzzle instance. This will give it very high accuracy on the test set (and companies can claim they technically didn't train on test questions), but if the model then encounters a new random sudoku puzzle it won't be able to solve it because it didn't learn the much more challenging process of solving sudoku puzzles in general.
View on Reddit #67399243
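The digit-relabeling trick described in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: the example grid, the function names `relabel` and `abstract_pattern`, and the use of 0 for empty cells are all hypothetical choices, not anything from a real training pipeline. The point it shows is that every relabeled copy of a puzzle collapses to the same abstract pattern.

```python
import random

# Hypothetical example puzzle; 0 marks an empty cell.
PUZZLE = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]


def relabel(grid, rng):
    """Return a copy of the grid with the digits 1-9 randomly permuted."""
    digits = list(range(1, 10))
    shuffled = digits[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(digits, shuffled))
    mapping[0] = 0  # empty cells stay empty
    return [[mapping[v] for v in row] for row in grid]


def abstract_pattern(grid):
    """Rename each digit by its order of first appearance (row-major).

    This plays the role of the "colors" in the comment: the result no
    longer depends on which digit identities the puzzle happens to use.
    """
    first_seen = {}
    pattern = []
    for row in grid:
        new_row = []
        for v in row:
            if v == 0:
                new_row.append(0)
                continue
            if v not in first_seen:
                first_seen[v] = len(first_seen) + 1
            new_row.append(first_seen[v])
        pattern.append(new_row)
    return pattern


if __name__ == "__main__":
    rng = random.Random(0)
    variants = [relabel(PUZZLE, rng) for _ in range(5)]
    # Every variant looks different digit by digit...
    print(variants[0][0])
    # ...but all of them share the test puzzle's abstract pattern.
    print(all(abstract_pattern(v) == abstract_pattern(PUZZLE) for v in variants))
```

A model trained on many such variants only has to memorize one pattern per test puzzle, which is the leakage the comment is worried about.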

DuplexEspresso@reddit

I see your point
View on Reddit #67413799

Croned@reddit

I see that example went way over your head.
View on Reddit #67397719

Healthy-Nebula-3603@reddit

In that case current AI is as good at math as humans. We are also trained on skeletons or "blueprints" for solving math problems and then adapt them to the problem. AI can even invent completely new solutions (creating new blueprints), as was proven with a Google Alpha model.
View on Reddit #67312854

Croned@reddit

Here's a simpler example: sudoku. In a sudoku puzzle you are given a nearly blank grid where a few cells are filled in with numbers, and where your goal is to fill in the rest of the cells with numbers that satisfy a set of constraints. It turns out that in sudoku the identities of digits can be swapped (e.g. all 1s can be swapped with 9s), so if your exam is an unsolved sudoku puzzle I can make it a lot easier by giving you a solved version of that puzzle where the digits have been swapped with digit-specific colors. Now you just need to map each color to a number and you can trivially solve the puzzle, but if I give you a new random puzzle you will be unable to solve it unless you actually understand sudoku. The (simplified) way you do this when training a LLM is by taking a sudoku puzzle from the test set, creating a bunch of versions where the digit identities have been randomly swapped, and training the model to solve those. The simplest algorithm for the model to learn is to recognize the abstract pattern of the starting state of the puzzle (like replacing each digit identity with a unique color) and substitute the abstract pattern with digits from a puzzle instance. This will give it very high accuracy on the test set (and companies can claim they technically didn't train on test questions), but if the model then encounters a new random sudoku puzzle it won't be able to solve it because it didn't learn the much more challenging process of solving sudoku puzzles in general.
View on Reddit #67399266

Pyros-SD-Models@reddit

This is literally how 90% of high school kids learn math.
View on Reddit #67326985

Croned@reddit

Here's a simpler example: sudoku. In a sudoku puzzle you are given a nearly blank grid where a few cells are filled in with numbers, and where your goal is to fill in the rest of the cells with numbers that satisfy a set of constraints. It turns out that in sudoku the identities of digits can be swapped (e.g. all 1s can be swapped with 9s), so if your exam is an unsolved sudoku puzzle I can make it a lot easier by giving you a solved version of that puzzle where the digits have been swapped with digit-specific colors. Now you just need to map each color to a number and you can trivially solve the puzzle, but if I give you a new random puzzle you will be unable to solve it unless you actually understand sudoku. The (simplified) way you do this when training a LLM is by taking a sudoku puzzle from the test set, creating a bunch of versions where the digit identities have been randomly swapped, and training the model to solve those. The simplest algorithm for the model to learn is to recognize the abstract pattern of the starting state of the puzzle (like replacing each digit identity with a unique color) and substitute the abstract pattern with digits from a puzzle instance. This will give it very high accuracy on the test set (and companies can claim they technically didn't train on test questions), but if the model then encounters a new random sudoku puzzle it won't be able to solve it because it didn't learn the much more challenging process of solving sudoku puzzles in general.
View on Reddit #67399259

GenLabsAI@reddit

Where do you try it?
View on Reddit #67306260

Healthy-Nebula-3603@reddit

My own heavily modified rare math problems
View on Reddit #67312353

GenLabsAI@reddit

No, but which site do you use it on?
View on Reddit #67329118

FinBenton@reddit

Pretty sure most companies do that anyway.
View on Reddit #67295110

partysnatcher@reddit

You mean: If *all* models score 100 then its a useless benchmark. If it distinguishes between a very few models by some reaching 100 and some not, then it is a useful benchmark.
View on Reddit #67333100

keepthepace@reddit

Yes and no, it still means that these models complete a set of tasks perfectly. It is not a benchmark anymore but more of a "unit" test.
View on Reddit #67286567

KattleLaughter@reddit

regression test
View on Reddit #67288899

shadiakiki1986@reddit

it was already a regression test before it reached 100%
View on Reddit #67332594

SilentLennie@reddit

or the benchmarks aren't that useful anymore, that's always been a thing and only getting worse.
View on Reddit #67329478

Automatic-Newt7992@reddit

This is the way. Make it 2 bit quant. Then it is all if else condition to arrive at the real reasoning for the solution /s
View on Reddit #67329324

k_means_clusterfuck@reddit

If models score 100 does the benchmark say anything about their capabilities? Yes. It is not a useless benchmark, just no longer very descriptive for frontier models. These are still useful for smaller models
View on Reddit #67288006

pneuny@reddit

Or to see how good models are without python assistance.
View on Reddit #67309777

Significant-Pain5695@reddit

You can't say that, because there is still a significant gap between the flagship models of each company
View on Reddit #67308767

Healthy-Nebula-3603@reddit

...even if 90% is useless
View on Reddit #67304484

LrdMarkwad@reddit

I agree that it’s a useless benchmark *now*. Looks like we need new tests
View on Reddit #67297047

Least-Character3079@reddit

Or the model is completely contaminated with data from this and other similar benchmarks presented in the training. I don't know the launch data for each model and benchmark. It's just a suspicion.
View on Reddit #67293977

Mani_and_5_others@reddit

Benchmarks are bullshit
View on Reddit #67388566

Nandishaivalli@reddit

100 what ? What metrics are you showing
View on Reddit #67386520

NigaTroubles@reddit

Wow we already reached 100
View on Reddit #67381789

TSJasonH@reddit

Incredible job getting this at exactly 4:20. Too bad your battery wasn't 69%.
View on Reddit #67375267

mpasila@reddit

In benchmarks it looks good, but in world knowledge it is so much worse than GPT-5. I just asked a bunch of questions about Finnish culture-related stuff (and popular shows), and Qwen3 Max would either not know about it or just hallucinate a lot. GPT-5 did a much better job, being aware of 99% of the things I asked about and being mostly correct as well. Qwen3 Max clearly had almost no data about that stuff. It's a Chinese model, sure, but they are marketing it towards the West, so it had better know some Western stuff as well.
View on Reddit #67297848

Bakoro@reddit

Finland is part of the West.
View on Reddit #67347889

mpasila@reddit

My last sentence doesn't mean anything?
View on Reddit #67372335

Ice94k@reddit

yep, qwen is incredible rn.
View on Reddit #67359594

jacek2023@reddit

We moved from "discussion about not local Claude models" to "discussion about not local Qwen models" on this sub? Is it called "progress"?
View on Reddit #67281338

robberviet@reddit

It's not local, but it's from a company that provides good local models, and frequently. Therefore hopefully we will get the open weight of this, maybe. Speaking of which, we still have not seen Qwen 2.5 Max yet. Maybe we will see 2.5 Max when 3.5 Max is released.
View on Reddit #67290961

aurelivm@reddit

Qwen 2.5 Max was just Qwen 2.5 72B
View on Reddit #67355287

robberviet@reddit

At least it's MoE, not 72B. https://qwen.ai/blog?id=e2eebf44bd7d617d7e4da68fec1f995585409a5e&from=research.research-list
View on Reddit #67355363

Smile_Clown@reddit

I sometimes forget that reddit can be visited by anyone with any opinion, any depth of knowledge and post. >Therefore hopefully we will get the open weight of this, maybe. 1. That would not matter, you cannot run it and no one is serving it to you free and unlimited. Therefore you'll either pay just like you would with any commercial enterprise or get less quality less access. 2. See 1. a lot of people get all wide eyed with "open source" (and sometimes get angry too?) and forget their 3060 can't run even the most ridiculously quantized version without gibberish. They also seem to forget that performance and result is on a linear slope with the scale. For the foreseeable future you are not getting any open source frontier model and technically speaking, you never will. What is frontier today is also ran tier tomorrow. Just for the record, to sum up: >Therefore hopefully we will get the open weight of this, maybe. Not the same thing.
View on Reddit #67291728

stylist-trend@reddit

> I sometimes forget that reddit can be visited by anyone with any opinion, any depth of knowledge and post. Wow, speak for yourself asshat. Someone is looking forward to open weights, and your response assumes 1) that they plan to run it, 2) that they plan to run it on their own hardware, 3) that they plan to run it today, 4) that they plan to run it today, quantized on a 3060, 5) they want to run it for free, and 6) therefore, that they're too dumb to understand LLMs. Just assumption after assumption after assumption. Open weights means the model can be driven by other fast providers like Cerebras or Groq, and in general means costs come down because many different companies and groups can perform inference. Maybe think a little before you speak. And if you don't, at least try to be humble instead of assuming the worst and acting like a dick about it. Geez.
View on Reddit #67297076

pigeon57434@reddit

Not only is this not local, the thinking version of Qwen3 Max isn't even freaking out yet. Closed source.
View on Reddit #67303238

chocolateUI@reddit

It’s not local, but now we know that future *local* Qwen models have the potential to match the capabilities of closed source models like GPT-5 mini or Gemini Flash, and I think that’s worth talking about!
View on Reddit #67297658

Initial-Argument2523@reddit

Yes, since now at least we are talking about models we could run locally if we had a crap ton of money
View on Reddit #67283751

KnifeFed@reddit

Not Max though.
View on Reddit #67286087

Beneficial-Good660@reddit

Qwen provides decent open weights that are usable. How can you compare them to Claude, which doesn't have open source, or to OpenAI and others, which only provide emasculated models? A little attention to them wouldn't be a bad thing.
View on Reddit #67284891

Kqyxzoj@reddit

> Oh my God, what a monster is this? It's a horrible shitty bar chart. You're welcome.
View on Reddit #67349193

MerePotato@reddit

This just means the benchmarks have been saturated
View on Reddit #67335712

korino11@reddit

I hope it is thrue... i am stuck with stupid gpt5... it almost good..but.. its filters... my nercouse cells a iong with him... gpt5 always can say ..fuck off, idont wanna do this... so we need a not only good, but without bullshit filters! cloude stupid as a hell.. even at max..it is have not only high price..but he doesnt listeng to you. cloude always simple math... doesnt do it hard as needed. always trying avoid heavy solutions.. always trying to get something from him personal, not what i asked... so i hope qwen3 will gona change situation a lot!
View on Reddit #67289187

RonJonBoviAkaRonJovi@reddit

I bet even LLMs get confused at how bad you type.
View on Reddit #67335513

muffnerk@reddit

noob here. sorry, but what exactly am i looking at? a new llm that is fantastic at python??
View on Reddit #67331018

Thick-Specialist-495@reddit

i dont trust their benchmarks
View on Reddit #67281365

dalittle@reddit

I don't trust graphs pushing gwen when the clear winner is GPT-5
View on Reddit #67327888

fish312@reddit

When a benchmark becomes a target something something
View on Reddit #67285229

TheCatDaddy69@reddit

I'm kinda dumb, but what's the scope here? What's Python got to do with anything? Is this when using its API in Python?
View on Reddit #67303108

Nid_All@reddit

It’s using Python as a tool to execute the written code during the CoT like GPT-5 Thinking for example
View on Reddit #67307043
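For anyone else wondering what "Python as a tool" means in practice, here is a rough sketch of the execution half of that loop: the model writes code in the middle of its reasoning, a harness runs it, and the output goes back into the context. The function name `run_python_tool` and the error-string format are made up for illustration; real providers use their own sandboxed setups, and a bare subprocess like this is not a sandbox.

```python
import subprocess
import sys


def run_python_tool(code: str, timeout_s: float = 10.0) -> str:
    """Run model-written code in a separate interpreter and return its output.

    Hypothetical helper: a real harness would isolate this far more strictly.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    if result.returncode != 0:
        # Feed the traceback back so the model can try to fix its code.
        return "error: " + result.stderr.strip()
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a snippet the model might emit during its chain of thought.
    model_code = "print(sum(i * i for i in range(1, 11)))"
    print(run_python_tool(model_code))
```

The "w/ Python" bars in charts like this one are scores with such a tool available, which is why they are not directly comparable to the plain scores.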

TheCatDaddy69@reddit

Ah thanks.
View on Reddit #67327801

-InformalBanana-@reddit

If it is 100% on those tests, and worse on the last one, then it possibly cheated, it was possibly trained on test data.
View on Reddit #67319979

cgs019283@reddit

I like qwen, but this is not local.
View on Reddit #67284603

DeltaSqueezer@reddit

I like qwen too, but this is not Llama.
View on Reddit #67284839

Smile_Clown@reddit

I like Llama too, but this is not a cheetah.
View on Reddit #67291763

GenLabsAI@reddit

I like cheetahs but this isn't a whale
View on Reddit #67306595

thegreatpotatogod@reddit

I like whales too, but this isn't deepseek
View on Reddit #67319700

Ultima_RatioRegum@reddit

I like qwen too, but this is not om/r/ . Based on my admittedly naive reading of the sub's home page url, it deals with 5 fundamental ideas: 1) local , meaning things that are within some neighborhood (I assume topologically but it could be also be referencing real analysis specifically, so we define local based simply on a predefined Epsilon) 2) Llama , or that thing thats from Peru and makes soft sweaters  or the llm ecosystem 3) https://www.red , or the world wide web of communist hipsters (https is short for hipster)  4) dit.c , or whether something is c or not, including the language, the "sea" and the insult (c**t) 5) om/r/ , or hungry then piratey So unless you're a communist hipster pirate looking to discuss whether or not a copy of Llama near you is written in C or not (or is in the ocean or is a c**t) then fuck off.
View on Reddit #67318561

InterstellarReddit@reddit

It’s local to the Data Center it’s hosted on 😂
View on Reddit #67296385

GreenTreeAndBlueSky@reddit

Anybody know the real price comparison for normal code usage? I'd assume a 100:1 input-to-output token ratio or something
View on Reddit #67281114

Significant-Pain5695@reddit

I think it's a bit expensive
View on Reddit #67309540

GenLabsAI@reddit

no, most people use 3:1
View on Reddit #67306429
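The ratio matters because it decides how much the output price weighs in the bill. A minimal sketch of the arithmetic, assuming a simple weighted average; the function name `blended_price` and the prices in the usage example are placeholders, not any provider's real rates.

```python
def blended_price(input_price: float, output_price: float, input_per_output: float) -> float:
    """Average price per million tokens for a given input:output token ratio.

    input_price and output_price are per million tokens; input_per_output
    is the number of input tokens consumed per output token.
    """
    return (input_per_output * input_price + output_price) / (input_per_output + 1)


if __name__ == "__main__":
    # Placeholder prices per million tokens, for illustration only.
    price_in, price_out = 1.0, 3.0
    for ratio in (3, 100):
        print(f"{ratio}:1 -> {blended_price(price_in, price_out, ratio):.3f} per million tokens")
```

At 100:1 the blended price is almost entirely the input price, while at 3:1 the output price still carries a quarter of the weight.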

kellencs@reddit

why 235b without python?
View on Reddit #67282912

pneuny@reddit

Maybe because it also gets 100? They may have just wanted something lesser to compare it with.
View on Reddit #67309420

DifficultyFit1895@reddit

Maybe they just ran out of room in the label? Otherwise 235b is the real beast here.
View on Reddit #67299386

mintybadgerme@reddit

No tool calling makes it rather useless for me
View on Reddit #67307180

PumpkinNarrow6339@reddit

100/100 benchmark. What's the next scale? Who decides this benchmark scale?
View on Reddit #67287534

GenLabsAI@reddit

Wait for arc agi 2 to release numbers
View on Reddit #67306630

PumpkinNarrow6339@reddit

I am waiting for 👀
View on Reddit #67306966

__lawless@reddit

Let’s see how they do in AIME2026, non blind benchmarks are not benchmarks
View on Reddit #67289874

GenLabsAI@reddit

Or ARC
View on Reddit #67306700

harikb@reddit

Why are you running it in "low-power" mode even at 72% ? ... I will see myself out ...
View on Reddit #67305164

hoffeig@reddit

monster in the bench, lady in the terminal
View on Reddit #67300704

Puzzled-Swimmer-4789@reddit

Maxed out benchmark is not really a good comparison. For all we know one could be 120% when the other is 300%.
View on Reddit #67281864

lorddumpy@reddit

100%. it'd be nice to see average token count to completion or cost comparison once they reach 100.
View on Reddit #67298579

xrvz@reddit

That's not how that works...
View on Reddit #67283267

Relevant-Yak-9657@reddit

USAMO and Putnam time.
View on Reddit #67297015

Patrick_Atsushi@reddit

Looks like it’s time to have some new benchmarks.
View on Reddit #67296274

Dutchbags@reddit

anything scoring a 100 is futile
View on Reddit #67296103

RonJonBoviAkaRonJovi@reddit

You guys believe every chart they put out huh?
View on Reddit #67289999

Lucky-Necessary-8382@reddit

Its a benchmaxxed monster. Thats all.
View on Reddit #67289716

AlgorithmicMuse@reddit

Only benchmark I give a rat's ass about is mine, how the model works for me. All the other benchmarks are useless for me
View on Reddit #67287472

FianHQ@reddit

You have to pay attention to who ran these tests, reporting bias, the benchmark design and the setup
View on Reddit #67287434

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #67287288

PreciselyWrong@reddit

Imagine not including the SOTA programming model in benchmark comparison graphs. Cowardly
View on Reddit #67284033

zjuwyz@reddit

https://preview.redd.it/z6b6bf3tr2rf1.png?width=921&format=png&auto=webp&s=d9dd8bac990af1a89c18066834ab7acbebf915b7 AIME25 and AIME25 w/ Python are totally different. For example, AIME25 Q15: count the ordered triples of positive integers (a, b, c) such that 1 <= a, b, c <= 3^6 and (a^3 + b^3 + c^3) % 3^7 == 0. Without Python? Painful number theory & case analysis. With Python? 10 lines of code.
View on Reddit #67283559
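To make the "10 lines of code" point concrete, here is one way such a script could look. It is a sketch, not whatever any model actually wrote: the function name `count_cube_triples` and its parameters are invented here, and it groups the values of a by cube residue instead of looping over all 729^3 triples so that it finishes quickly in plain Python.

```python
from collections import Counter


def count_cube_triples(k: int, m: int) -> int:
    """Count ordered (a, b, c) with 1 <= a, b, c <= 3**k and
    (a**3 + b**3 + c**3) % 3**m == 0."""
    limit, mod = 3 ** k, 3 ** m
    # How many a in [1, limit] produce each cube residue modulo 3**m.
    residues = Counter(pow(a, 3, mod) for a in range(1, limit + 1))
    total = 0
    for r1, n1 in residues.items():
        for r2, n2 in residues.items():
            # The third cube has to cancel the first two modulo 3**m.
            r3 = (-r1 - r2) % mod
            total += n1 * n2 * residues.get(r3, 0)
    return total


if __name__ == "__main__":
    print(count_cube_triples(6, 7))
```

No number theory is needed beyond modular arithmetic, which is the gap between the two benchmark settings.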

Ladder-Bhe@reddit

Fake news. I never saw an official report like this; show your original sources
View on Reddit #67281832

Chance_Value_Not@reddit

Yawn. Is it good in use? I was disappointed by qwen-code (the tool, the qwen-code model), but not used max yet. 
View on Reddit #67280916