Qwen 3 Max Official Benchmarks (possibly open sourcing later..?)
Posted by Trevor050@reddit | LocalLLaMA | View on Reddit | 36 comments

GreenTreeAndBlueSky@reddit
They never open-sourced their Max versions. Their open-source models are essentially advertising, and probably some distillations of the Max models.
Illustrious_Row_9971@reddit
It's available by default here for free: https://huggingface.co/spaces/akhaliq/anycoder
Finanzamt_Endgegner@reddit
Tbf, there were better smaller models available soon after, and there was never a 2.5 Max release; it was preview-only as far as I know.
entsnack@reddit
Comparison with gpt-oss-120b for reference, seems like this is better suited for coding in particular:
Neither-Phone-7264@reddit
Isn't this a 1T param model?
entsnack@reddit
It is indeed.
BackyardAnarchist@reddit
source?
shark8866@reddit
this Qwen is also non-thinking
entsnack@reddit
It's the thinking Qwen; the Qwen numbers are from the Alibaba report, not independent benchmarks.
shark8866@reddit
I would advise you to recheck that. If you look at the benchmark provided in this very post, they are comparing with other non-thinking models, including Claude 4 Opus non-thinking, DeepSeek V3.1 non-thinking (only 49.8 on AIME), and their own Qwen 3 235B A22B non-thinking. I know this because I distinctly remember Qwen 3 235B non-thinking gets about 70% on AIME 2025 while the thinking one gets around 92%.
Massive-Shift6641@reddit
I see zero improvement from this model on my tasks. Sorry, but it's likely just benchmaxxed slop.
shark8866@reddit
I see you in the LMArena server.
Independent-Wind4462@reddit
Seems good, but considering it's a 1-trillion-parameter model 🤔 the difference between it and the 235B isn't much.
But still, from early testing it looks like a really good model.
Professional-Bear857@reddit
I think that's diminishing returns at work
SlapAndFinger@reddit
At this stage RL is more about dialing in edge cases, getting tool use consistent, stabilizing alignment, etc. The edge cases and tool use improvements can still lead to sizeable improvements in model usability but they won't show up in benchmarks really.
Finanzamt_Endgegner@reddit
It's a preview, so a lot of the training is not yet done.
x54675788@reddit
Don't get your hopes up for an open-source model.
There is no incentive to spend millions of dollars on training if they can't sell you access to the best model.
JMowery@reddit
Are you donating money to the cause or paying for API access to their open-source models? If not, why do you expect everything to be free?
Sounds like you're very unappreciative. Businesses exist to make money. And while enshittification does happen, why are you making such a fuss and assuming that terrible things are going to happen, when this very same company is the only one to give us even a remotely good video model, a pretty great image model, and the best open-source coding model? Like... I don't like what's happening with big companies, but Alibaba has been pretty nice so far.
Why not wait before spewing hatred?
ohHesRightAgain@reddit
Huh, a graph that starts at 0..
lordmostafak@reddit
That's the real breakthrough here.
o5mfiHTNsH748KVq@reddit
And it’s linear 🫢
Finanzamt_Endgegner@reddit
Incredible!
HomeBrewUser@reddit
It's nothing too special. If it's actually 1T it's not really worth running versus DeepSeek or Kimi tbh.
Impressive_Half_2819@reddit
How many GPUs were used?
shark8866@reddit
This is what Meta intended for Llama 4 Behemoth.
Independent-Wind4462@reddit
Yeah, idk, there's going to be a new Meta event this month too, so maybe we'll see a model there. Let's see.
o5mfiHTNsH748KVq@reddit
I'm hoping that event is Segment Anything 3.
Salty-Garage7777@reddit
Yet its command of the Slavic languages is poor, judging by how it handled a rather simple gap-filling exercise in Polish 🤦
_yustaguy_@reddit
Not looking much better in Serbian, but still noticeably better than its smaller brothers.
No_Swimming6548@reddit
Literally unusable
Salty-Garage7777@reddit
Maybe it's better at coding at least...😩
bb22k@reddit
It's interesting that they compared it with Opus non-thinking, because Qwen 3 Max seems to be some kind of hybrid model (or they are doing routing on the backend).
You can force thinking by hitting the button, or if you ask something computationally intensive (like solving a math equation) it will just start rambling to itself (without the thinking tag) and eventually give the right answer.
Seems quick for a large model
nullmove@reddit
Qwen Chat on the web always falls back to a different model that supports the requested modality/reasoning, so you can't conclude much from this. But in the API this is non-thinking.
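If you want to poke at the API behavior yourself, here's a minimal sketch against the OpenAI-compatible endpoint that Alibaba Cloud Model Studio exposes. The base URL and the model name `qwen3-max-preview` are assumptions based on their current naming and may differ:

```python
# Minimal sketch: query the preview model through an OpenAI-compatible API.
# Assumptions: the base_url and model name follow Alibaba Cloud Model Studio's
# current conventions and may differ; set DASHSCOPE_API_KEY in your environment.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

resp = client.chat.completions.create(
    model="qwen3-max-preview",  # assumed model name
    messages=[{"role": "user", "content": "Solve 3x + 7 = 19 and explain each step."}],
)

# A non-thinking model answers directly: the response carries only the assistant
# message content, with no separate reasoning/thinking block to inspect.
print(resp.choices[0].message.content)
```

If the answer comes back as plain content with no reasoning trace, that matches the non-thinking behavior described above; the web chat's fallback routing is what makes it look hybrid.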
Yes_but_I_think@reddit
AIME 2025 is definitely memorised somehow.
infinity1009@reddit
What about thinking?
Trevor050@reddit (OP)
not out yet