Unfortunately for Facebook, they seem to be torn apart by petty office politics and don't seem to be organized enough to do anything even if they get anyone competent working for them again.
The new Kimi K2 is also a monster. At most tasks, it’s at least the equal of any proprietary model except Opus, and in creative writing, it’s by far the best model currently available.
I'm sorry but Opus is subtly benchmaxxed and not actually a good model. It's actually unusable for a large class of problems. It looks great if your eval is vibe coding small projects in python/javascript/typescript, but it falls apart outside of that badly. GPT5 absolutely crushes it in the domain of hard code, even Grok-4-fast beats opus in my experience, mostly because its long context support means it doesn't get as confused and fuck shit up.
Gemini has best in class long context reasoning, which is part of the reason I actually put it slightly ahead of Grok even though Grok is smarter. GPT5 is basically a better Grok, while Gemini has a niche that nobody can top it at.
What kind of code are you developing? GPT-5 is trash with a Python AI, and on Teams Copilot I use GPT-4 (or is it GPT-4.5? I just don't look at the minor version number) for text editing or light brainstorming, since GPT-5 adds so many words and usually does the job the wrong way anyway.
Opus isn’t benchmaxxed. It’s just a diabolical demon.
The model is far smarter than it wants you to believe. I think Anthropic’s alignment went way wrong and made the model misanthropic 🤣
Why don't you just ask it not to use that? Have you heard of a rules file or agents.md? As far as I am concerned, it's still perfectly valid C++. If you want it to follow your preferred practices and architecture, then you need to give it instructions for that.
What's your goto local model for C++ if I might ask?
Oh and I agree, different models are better at different things.
K2 is the best I've found for pointing out flaws in my code.
I used Qwen for a while, mainly Qwen3 Coder. It was fine for small stuff, but on complex tasks it makes lots of mistakes.
C++ is still best learnt and used with less AI, as it is a really sensitive language... one mistake can cost you a dead memory region or worse... a memory leak.
I love Kimi, but it does have its flaws.
While it's excellent at creative writing, there's a reason it drops so much on longform writing on EQ Bench. I've had to switch over to 2.5 Pro for a message or two in a roleplay to get it to move on with a scene or progress the story. I believe others have noticed it hallucinating aspects of a conversation, but I haven't really seen that yet.
Great personality though, I need the other top models to be that grounded and unsycophantic. Low slop levels, and impressive smarts for being a non-thinking model. When they do drop the thinking version though, I wouldn't be surprised if it was a total gamechanger.
> it’s by far the best model currently available.
I disagree. It has style that initially dazzles, but quickly gets old. I like deepseek more, or even Qwen-Max or GLM.
Saying which is better at this level of bench saturation is pretty meaningless. We call them frontier models because as far as we know, they are the best performing models we made so far. Being in the frontier club was almost exclusive to closed source US models which was generally the “moat” that gave them prestige. I still use GPT-5 because from my own use, it seems to have the best performance for me, but models like Qwen will definitely be bread and butter for others out there
I believe there is still a gap when it comes to solving very difficult problems in mathematics and computer science compared to those flagship models in the US, but for everyday tasks, it is indeed sufficient; moreover, there are many open-source models in China
100% agree, but the gap in my opinion is small enough that we can say it's nearly caught up. US models do have a major advantage, which is compute. Not right now, but when the GW-tier data centers start rolling in next year, we will have some truly next-gen models. Honestly, GPT-4.5 was IMO the most advanced model ever trained, but too heavy and expensive to go through a proper reinforcement learning post-training phase. With more data centers, we should start to see mega-caliber models with insane scientific research abilities.
Max is probably impossible to open source; the previous version of Max has never been open source, and Max has always been a proprietary commercial model of Qwen
I need to see more than AIME and GPQA to say they reached the frontier. Two boomer benchmarks that have never corresponded well with capabilities in my testing.
I'll believe it when they top the private benchmarks I follow, and when their numbers start surpassing closed model numbers on Openrouter for code / problem solving.
Or... it is just that good at math.
Faking it on math is impossible and could easily be found out. You can change one parameter or number and check whether the result is still correct.
I can't find any math problems that this model can't solve.
You can train on nearly identical problems in which small details like specific digits or variables are changed. That way you're technically not training on the benchmark test set, but you're sidestepping true intelligence. LLMs have much better semantic memory than humans.
As an analogy, imagine I give you an exam with a very difficult integral to solve, but I also give you the full step-by-step solution of a nearly identical integral with just the digits of the coefficients changed. Now what was a very difficult problem becomes a basic exercise in arithmetic and algebra.
Isn’t this literally how all kids learn to solve integrals? It all starts with a teacher explaining on the blackboard, not the kid magically figuring it out themselves.
Here's a simpler example: sudoku. In a sudoku puzzle you are given a nearly blank grid where a few cells are filled in with numbers, and where your goal is to fill in the rest of the cells with numbers that satisfy a set of constraints. It turns out that in sudoku the identities of digits can be swapped (e.g. all 1s can be swapped with 9s), so if your exam is an unsolved sudoku puzzle I can make it a lot easier by giving you a solved version of that puzzle where the digits have been swapped with digit-specific colors. Now you just need to map each color to a number and you can trivially solve the puzzle, but if I give you a new random puzzle you will be unable to solve it unless you actually understand sudoku.
The (simplified) way you do this when training an LLM is by taking a sudoku puzzle from the test set, creating a bunch of versions where the digit identities have been randomly swapped, and training the model to solve those. The simplest algorithm for the model to learn is to recognize the abstract pattern of the starting state of the puzzle (like replacing each digit identity with a unique color) and substitute the abstract pattern with digits from a puzzle instance. This will give it very high accuracy on the test set (and companies can claim they technically didn't train on test questions), but if the model then encounters a new random sudoku puzzle it won't be able to solve it because it didn't learn the much more challenging process of solving sudoku puzzles in general.
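The relabeling trick described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `relabel_digits` and `canonical_pattern` are hypothetical, the puzzle string is an arbitrary example grid (with `0` marking a blank cell), and nothing here reflects how any lab actually builds training data. It shows that every digit-swapped copy of a puzzle collapses to the same abstract pattern once digits are renamed in order of first appearance.

```python
import random

DIGITS = "123456789"

def relabel_digits(grid: str, rng: random.Random) -> str:
    """Return the same puzzle with the digit identities 1-9 randomly permuted.

    Blank cells ('0') are left untouched.
    """
    shuffled = list(DIGITS)
    rng.shuffle(shuffled)
    table = str.maketrans(DIGITS, "".join(shuffled))
    return grid.translate(table)

def canonical_pattern(grid: str) -> str:
    """Rename digits in order of first appearance.

    All relabelings of one puzzle map to the same string, which is the
    "abstract pattern" a model could memorize instead of learning sudoku.
    """
    mapping = {}
    out = []
    for ch in grid:
        if ch == "0":
            out.append("0")
            continue
        if ch not in mapping:
            mapping[ch] = str(len(mapping) + 1)
        out.append(mapping[ch])
    return "".join(out)

# Example 9x9 grid, read row by row (illustrative only).
puzzle = (
    "530070000" "600195000" "098000060"
    "800060003" "400803001" "700020006"
    "060000280" "000419005" "000080079"
)

rng = random.Random(0)
variants = [relabel_digits(puzzle, rng) for _ in range(5)]

# Every variant shares one canonical pattern with the original puzzle.
same_pattern = all(canonical_pattern(v) == canonical_pattern(puzzle) for v in variants)
print(same_pattern)
```

A model trained on such variants only needs to recognize the pattern and substitute digits, which says nothing about how it would handle a puzzle with a different starting layout.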
In that case current AI is as good at math as humans.
We are also trained on skeletons or "blueprints" for solving math problems and on adapting them to the problem at hand.
Also, AI can even invent completely new solutions (creating new blueprints), as was proven with a Google Alpha model.
You mean: if *all* models score 100, then it's a useless benchmark.
If it distinguishes between a very few models by some reaching 100 and some not, then it is a useful benchmark.
If models score 100 does the benchmark say anything about their capabilities? Yes.
It is not a useless benchmark, just no longer very descriptive for frontier models. These are still useful for smaller models
Or the model is completely contaminated with data from this and other similar benchmarks present in the training data. I don't know the launch date of each model and benchmark. It's just a suspicion.
In benchmarks it looks good, but its world knowledge is so much worse than GPT-5's. I just asked a bunch of questions about Finnish culture (and popular shows), and Qwen3 Max would either not know about it or just hallucinate a lot. GPT-5 did a much better job, being aware of 99% of the things I asked about and being mostly correct as well. Qwen3 Max clearly had almost no data about that stuff.
It's a Chinese model, sure, but they are marketing it towards the West, so it had better know some Western stuff as well.
It's not local, but it's from a company that provides good local models, and frequently.
Therefore hopefully we will get the open weight of this, maybe. Talking about that, we still have not seen Qwen 2.5 Max yet. Maybe we will see 2.5 Max when 3.5 Max is released.
I sometimes forget that reddit can be visited by anyone with any opinion, any depth of knowledge and post.
>Therefore hopefully we will get the open weight of this, maybe.
1. That would not matter, you cannot run it and no one is serving it to you free and unlimited. Therefore you'll either pay just like you would with any commercial enterprise or get less quality less access.
2. See 1.
A lot of people get all wide-eyed about "open source" (and sometimes get angry too?) and forget their 3060 can't run even the most ridiculously quantized version without gibberish. They also seem to forget that performance and results are on a linear slope with scale.
For the foreseeable future you are not getting any open source frontier model and, technically speaking, you never will. What is frontier today is also-ran tier tomorrow.
Just for the record, to sum up:
>Therefore hopefully we will get the open weight of this, maybe.
Not the same thing.
> I sometimes forget that reddit can be visited by anyone with any opinion, any depth of knowledge and post.
Wow, speak for yourself asshat. Someone is looking forward to open weights, and your response assumes 1) that they plan to run it, 2) that they plan to run it on their own hardware, 3) that they plan to run it today, 4) that they plan to run it today, quantized on a 3060, 5) they want to run it for free, and 6) therefore, that they're too dumb to understand LLMs. Just assumption after assumption after assumption.
Open weights means the model can be driven by other fast providers like Cerebras or Groq, and in general means costs come down because many different companies and groups can perform inference.
Maybe think a little before you speak. And if you don't, at least try to be humble instead of assuming the worst and acting like a dick about it. Geez.
It’s not local, but now we know that future *local* Qwen models have the potential to match the capabilities of closed source models like GPT-5 mini or Gemini Flash, and I think that’s worth talking about!
Qwen provides decent open weights that are usable. How can you compare them to Claude, which has no open-source models, or to OpenAI and others, which only provide emasculated models? A little attention to them wouldn't be a bad thing.
I hope it's true... I'm stuck with stupid GPT-5... it's almost good, but... its filters... my nerve cells are gone along with it... GPT-5 can always say "fuck off, I don't wanna do this"... so we need a model that's not only good but also free of bullshit filters! Claude is stupid as hell... even at Max... it not only has a high price, it also doesn't listen to you. Claude always simplifies the math... it doesn't do it as rigorously as needed, always trying to avoid heavy solutions... always trying to add something of its own instead of what I asked for... so I hope Qwen3 is going to change the situation a lot!
I like qwen too, but this is not om/r/ . Based on my admittedly naive reading of the sub's home page url, it deals with 5 fundamental ideas:
1) local , meaning things that are within some neighborhood (I assume topologically but it could be also be referencing real analysis specifically, so we define local based simply on a predefined Epsilon)
2) Llama , or that thing thats from Peru and makes soft sweaters or the llm ecosystem
3) https://www.red , or the world wide web of communist hipsters (https is short for hipster)
4) dit.c , or whether something is c or not, including the language, the "sea" and the insult (c**t)
5) om/r/ , or hungry then piratey
So unless you're a communist hipster pirate looking to discuss whether or not a copy of Llama near you is written in C or not (or is in the ocean or is a c**t) then fuck off.
https://preview.redd.it/z6b6bf3tr2rf1.png?width=921&format=png&auto=webp&s=d9dd8bac990af1a89c18066834ab7acbebf915b7
AIME25 and AIME25 w/ Python are totally different. For example, AIME25 Q15: count the ordered triples of positive integers (a, b, c) with 1 <= a, b, c <= 3^6 such that (a^3 + b^3 + c^3) % 3^7 == 0.
Without Python? Painful number theory & case analysis. With Python? 10 lines of code.
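To make the "10 lines of code" point concrete, here is one possible sketch, not necessarily what a model with a Python tool would write. A naive triple loop would need 729^3 (about 3.9 * 10^8) iterations, so this version instead buckets the values by cube residue and combines the buckets. The function name `count_cube_triples` and its parameters are hypothetical, chosen for illustration.

```python
from collections import Counter

def count_cube_triples(limit: int, modulus: int) -> int:
    """Count ordered triples (a, b, c) with 1 <= a, b, c <= limit
    such that a^3 + b^3 + c^3 is divisible by modulus."""
    # How many values in 1..limit land on each cube residue.
    cubes = Counter(pow(a, 3, modulus) for a in range(1, limit + 1))
    # Distribution of (a^3 + b^3) mod modulus over all ordered pairs.
    pairs = Counter()
    for r1, n1 in cubes.items():
        for r2, n2 in cubes.items():
            pairs[(r1 + r2) % modulus] += n1 * n2
    # c^3 must cancel the pair sum; a missing residue counts as 0.
    return sum(n * cubes[(-s) % modulus] for s, n in pairs.items())

print(count_cube_triples(3**6, 3**7))
```

The same function can be checked against a plain triple loop on a small case such as `limit=9, modulus=27` before trusting it on the full range.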
148 Comments
nauxiv@reddit
Repulsive-Price-9943@reddit
log_2@reddit
chrislaw@reddit
SilentLennie@reddit
AngleFun1664@reddit
kroggens@reddit
DataMambo@reddit
letsgoiowa@reddit
the__storm@reddit
typeryu@reddit
No_Swimming6548@reddit
XquaInTheMoon@reddit
CeamoreCash@reddit
lombwolf@reddit
MoffKalast@reddit
adscott1982@reddit
0xFatWhiteMan@reddit
SilentLennie@reddit
nivvis@reddit
-p-e-w-@reddit
SlapAndFinger@reddit
Majorzigzag@reddit
Healthy-Nebula-3603@reddit
0quebec@reddit
SlapAndFinger@reddit
brucebay@reddit
SlapAndFinger@reddit
Significant-Pain5695@reddit
brownman19@reddit
Healthy-Nebula-3603@reddit
hemphock@reddit
-p-e-w-@reddit
HyperWinX@reddit
inevitabledeath3@reddit
CheatCodesOfLife@reddit
theundertakeer@reddit
AppearanceHeavy6724@reddit
HyperWinX@reddit
TheRealGentlefox@reddit
AppearanceHeavy6724@reddit
usernameplshere@reddit
power97992@reddit
typeryu@reddit
power97992@reddit
Significant-Pain5695@reddit
hard-scaling@reddit
Significant-Pain5695@reddit
typeryu@reddit
NearbyBig3383@reddit (OP)
GenLabsAI@reddit
Significant-Pain5695@reddit
TheRealGentlefox@reddit
FinBenton@reddit
MalumaDev@reddit
Healthy-Nebula-3603@reddit
Croned@reddit
DuplexEspresso@reddit
Croned@reddit
DuplexEspresso@reddit
Croned@reddit
Healthy-Nebula-3603@reddit
Croned@reddit
Pyros-SD-Models@reddit
Croned@reddit
GenLabsAI@reddit
Healthy-Nebula-3603@reddit
GenLabsAI@reddit
FinBenton@reddit
partysnatcher@reddit
keepthepace@reddit
KattleLaughter@reddit
shadiakiki1986@reddit
SilentLennie@reddit
Automatic-Newt7992@reddit
k_means_clusterfuck@reddit
pneuny@reddit
Significant-Pain5695@reddit
Healthy-Nebula-3603@reddit
LrdMarkwad@reddit
Least-Character3079@reddit
Mani_and_5_others@reddit
Nandishaivalli@reddit
NigaTroubles@reddit
TSJasonH@reddit
mpasila@reddit
Bakoro@reddit
mpasila@reddit
Ice94k@reddit
jacek2023@reddit
robberviet@reddit
aurelivm@reddit
robberviet@reddit
Smile_Clown@reddit
stylist-trend@reddit
pigeon57434@reddit
chocolateUI@reddit
Initial-Argument2523@reddit
KnifeFed@reddit
Beneficial-Good660@reddit
Kqyxzoj@reddit
MerePotato@reddit
korino11@reddit
RonJonBoviAkaRonJovi@reddit
muffnerk@reddit
Thick-Specialist-495@reddit
dalittle@reddit
fish312@reddit
TheCatDaddy69@reddit
Nid_All@reddit
TheCatDaddy69@reddit
-InformalBanana-@reddit
cgs019283@reddit
DeltaSqueezer@reddit
Smile_Clown@reddit
GenLabsAI@reddit
thegreatpotatogod@reddit
Ultima_RatioRegum@reddit
InterstellarReddit@reddit
GreenTreeAndBlueSky@reddit
Significant-Pain5695@reddit
GenLabsAI@reddit
kellencs@reddit
pneuny@reddit
DifficultyFit1895@reddit
mintybadgerme@reddit
PumpkinNarrow6339@reddit
GenLabsAI@reddit
PumpkinNarrow6339@reddit
__lawless@reddit
GenLabsAI@reddit
harikb@reddit
hoffeig@reddit
Puzzled-Swimmer-4789@reddit
lorddumpy@reddit
xrvz@reddit
Relevant-Yak-9657@reddit
Patrick_Atsushi@reddit
Dutchbags@reddit
RonJonBoviAkaRonJovi@reddit
Lucky-Necessary-8382@reddit
AlgorithmicMuse@reddit
FianHQ@reddit
WithoutReason1729@reddit
PreciselyWrong@reddit
zjuwyz@reddit
Ladder-Bhe@reddit
Chance_Value_Not@reddit