Waiting for Qwen 3.7 open weight... The new King has arrived...
Posted by LegacyRemaster@reddit | LocalLLaMA | View on Reddit | 233 comments
The hype is real! https://qwen.ai/blog?id=qwen3.7
alphapussycat@reddit
Don't wait for it, only be glad if something is released.
Remember that releasing highly capable local models hurts their own monetization. As they announced in April, they're no longer aiming for disruption/sabotaging they're aiming at monetization and competing for frontier.
ttkciar@reddit
Yes, but they have to balance that against their government mandating efforts toward contributing to an open ecosystem in the current five-year economic plan.
Central economic planning sucks, but in this case at least it serves the open LLM community's interests.
budihartono78@reddit
Lol are you actually feeling bad for multibillion companies like Alibaba
ttkciar@reddit
No, it's more of a statement of principle.
People should be free to do what they want with what they own, otherwise they don't actually own it.
That holds whether the possession in question is a computer, a pen and paper, their own body, or a business.
6ghz@reddit
But what about the billionaires 😢
whitefritillary@reddit
“China is promoting an open ecosystem for AI development, but at what cost??”
zippydazoop@reddit
All economies are centrally planned, most of them just aren’t for you. They’re planned to rig the game in favor of the established players and to eliminate any possible competition.
Jorlen@reddit
It has to be a small, very very small niche % of people who are using local LLM models, no? Am I underestimating it?
The technical requirements as well as hardware requirements must cut off most people. Sure you can fire up a LLM in LM studio but I can't imagine many are using models >25gb in size like Qwen 3.6 27b/35b or Gemma4 31b, etc.
alphapussycat@reddit
The technically requirements are almost none... lm studio and ollama makes it extremely easy to download and run. And sure, 24gb is like the upper end of the consumer grade, and kind of the minimum to run.
But sure, to make them useful it requires more work, like internet search (but maybe ollama comes with it). Though, that's kind of circumvented by the fact that most people who want those features are probably using the qwen models for coding, or otherwise handy people who can just vibe code whatever they need.
Jorlen@reddit
Fair point, yeah. I can't imagine people using Qwen 3.6 line for anything other than coding or deep diving into project documentation, etc.
Mindless_Pain1860@reddit
Qwen has never open-weighted the Max series…
LegacyRemaster@reddit (OP)
true but 27b 397b are soo goood
NNN_Throwaway2@reddit
397b isn't being open-weighted anymore.
Happythen@reddit
source?
NNN_Throwaway2@reddit
I don't understand the question; its a fact, not something that needs citing. 3.6 Plus was not open-weighted. Source, there are no open weights you can download.
Happythen@reddit
You wrote the "397B", not 3.6 Plus. That reads like you are predicting that there will not be an open weight version of another 397B in the future. I asked for clarification because you'd be the first with that sort of confirmation if you had a source.
whitefritillary@reddit
qwen3.6-397B-A17B wasn’t open weighted. source: google it ffs.
Happythen@reddit
Didn't say it was, mr. aggressive. Calling out that one missed open weights release at 122B and 397B at a minor version change doesn't necessarily mean they won't release larger open weight models again, unless... there is a source link/Qwen blog I haven't seen, which is what I was asking.
whitefritillary@reddit
it’s ms. actually. and if you genuinely think it looks like Alibaba will probably release a 397B-A17B then you haven’t been keeping up with how they’ve behaved lately.
Happythen@reddit
I guess not! Doing my best following huggingface, qwen blogs and their communications, I must have missed something. Because predicting what models they will release, (and which ones they don't), seems like a silly endeavor with hard evidence besides just a gut feeling bsed on their last release. Apologies for the misgendering, honestly.
NNN_Throwaway2@reddit
They have made it pretty clear what their future plans are if you do a bit of reading between the lines.
I was basically the only person in this sub to point out there would be no more 3.6 releases after the 27b based on the 27b blog post. No one believed me, but lo and behold here we are.
So you can keep trying to cope or you can accept that I might actually know what I’m talking about.
Happythen@reddit
> reading between the lines
Got it!
NNN_Throwaway2@reddit
Somehow I don’t think you do.
NNN_Throwaway2@reddit
Plus is the 397b.
Happythen@reddit
Right-ish, 3.5 plus was based off of the 397B model, but the hosted plus version had more production features, 1 mil context, etc... There were no open weights in the 3.6 release for 122B or 397B. But your assumption is that future versions will not have open weights at these parameters based off of the roll out of 3.6? They don't have a lot of patterned rollouts besides 26B and 35B: https://huggingface.co/Qwen/collections#collections
NNN_Throwaway2@reddit
They’re literally the same weights. 1m context is through rope scaling, not a unique feature. You are simply misinformed on a lot of details.
There will be no more open weights of plus/397b. I’m not sure why that’s so hard to believe. At this point it looks like we might only get a 27b of 3.7, if that.
Happythen@reddit
Same weights for sure! My information is coming straight from Qwen's model card on huggingface, unless you think that's wrong:
https://huggingface.co/Qwen/Qwen3.5-397B-A17B
As far as no more open weight models besides 27B and 35B, it's a decent guess! But it's a guess. Qwen's team has not confirmed they will not be shipping larger open weight models from here on out.
NNN_Throwaway2@reddit
What on god’s green earth do you think your point is, Einstein? I said they are not open-WEIGHTING Plus anymore. Plus and 397b are the same WEIGHTS regardless of what features are part of their production harness. My statement stands. The 397b supports 1m context as well, just not be default. You are being ridiculous.
Happythen@reddit
Apologies! Not trying to attack here, you made the statement that 397B will no longer be open and I wanted to know more. Honestly, apologoies, I'll leave it alone.
NNN_Throwaway2@reddit
Know more about what? I said it wasn’t being open weighted and you tried to be smug and throw misinterpreted information in my face. Now you are pretending to act innocent. Unbelievable.
VectorD@reddit
Neither is the 122B.. Qwen releases are really dull these days after their internal disaster
Affectionate-Cap-600@reddit
what disaster?
VectorD@reddit
The people who cared about open source all got fired and some ex google employee took over the lead to focus on profit gen
Suzoku@reddit
what? can we stop treating baseless rumors as a fact now? Max was never open source, and Qwen never open sourced 122B nor the 397B models ever. Even back in Qwen2.5 the biggest open source model from them was 72b.
whitefritillary@reddit
zxyzyxz@reddit
Great visual novel
Su1tz@reddit
Fuck off bot
NNN_Throwaway2@reddit
This is just embarrassingly inaccurate lmao. Are you drunk?
I'm literally using the 397b locally right now.
VectorD@reddit
Qwen 3.5 397B and 122B are open weights, not so for 3.6
Competitive_Ideal866@reddit
What?!
whitefritillary@reddit
qwen3.6-122B-A10B hasn’t been open weighted.
relmny@reddit
"dull"??? have you heard about 3.6 27b and 35b ?
They were released after the new team took over
VectorD@reddit
These releases were pathetic in the way that they went about the release. They made some twitter poll for which one of their models to release (even though all of them are ready), as some weird kind of cock-tease / thirst trapping. Then refused to release the rest of the models we had in 3.5.. The 3.5 release series was just a mic drop, and 3.6 was some weird marketing play.
swingbear@reddit
You mfer best vote for a bigger model this time the 20-30 was great don’t get me wrong but give me a 100-300
Competitive_Ideal866@reddit
What?! I've been waiting to buy a 512GB Mac Studio just to run that.
NNN_Throwaway2@reddit
3.5 397b is still quite good.
Competitive_Ideal866@reddit
That's what I've heard but I still want more!
whitefritillary@reddit
with 512gb of unified RAM you could also run deepseek-v4-flash and even glm-5.1
ECrispy@reddit
how do 3.6 37B 35B-A3B compare on these benchmarks? and will be get open versions of those with 3.7?
whitefritillary@reddit
so first off there is no 37B. regarding the 35B-A3B, the 27B will be much substantially smarter, and the only reason you’d want to run the 35B-A3B is for either speed or lack of VRAM.
vick2djax@reddit
Isn’t it worse the 3.6 27b?
thaddeusk@reddit
Yes, but 27b is also much slower.
tracagnotto@reddit
ORLY???? Did you expect them to not have a commercial model to sustain expenses?
They literally have you the best parameter for parameter free small model available.
And even if. What you would do with max you dummy? Buy a datacenter to run it?
whitefritillary@reddit
delete your account.
tracagnotto@reddit
No u
whitefritillary@reddit
…
vick2djax@reddit
This isn’t how you grow a community.
StupidScaredSquirrel@reddit
Nobody is shotting on qwen it's just that this sub is interested in open models so if a model is unlikely to be open then it's kinda irrelevant to us.
tracagnotto@reddit
I know what the sub is for but the guy was complaining about them not releasing max which is kind of idiotic. Everyone knows they will release some minor models open and free, they always did
StupidScaredSquirrel@reddit
No he's just saying that this series isn't open so the post is not relevant to the sub. If and when they release an open model then the benchmarks will be relevant
tracagnotto@reddit
And the post is saying that is waiting for them to release open weight
StupidScaredSquirrel@reddit
So they are letting them know that max was never subsequently open.
It was a normal interaction until you decided to be all hostile about it
jonathanx37@reddit
"Leave my billion dollar company alone"
tracagnotto@reddit
Ye sorry
Cereal_Grapeist@reddit
Please delete your account if this is how you're going to behave
tracagnotto@reddit
No
No_Swimming6548@reddit
We just don't expect it to be shared in local llm sub.
adeadbeathorse@reddit
Chill, lol
tired514@reddit
3.7-122B-A17B MTP MXFP4 with 512k context would be the absolute shizzle.
VoiceActorForHire@reddit
Fuckk i dont understand these words. can anyone explain
tired514@reddit
Aye :)
VoiceActorForHire@reddit
Thank!
whitefritillary@reddit
>3.7-122B-A17B MTP MXFP4 with 512k context
just reading this made me so wet 😩
darkbasic4@reddit
Even better qwen3.7-coder-122B, we can dream 😄
cafedude@reddit
Even a 3.7-coder 80B would be amazing.
tired514@reddit
It would, but even with MTP that'd be pretty lethargic without an RTX6000. On Strix Halo at Q6 you'd probably see \~5-6t/s.
No_Lingonberry1201@reddit
I think it'd be a MoE with 3-10 active params (Qwen3-Coder-Next is 80B with 3B active, IIRC).
Fi3nd7@reddit
Omg please stop I can only get so erect. Jfc I'd shit my pants if they did this.
No_Lingonberry1201@reddit
As a fellow Strix user I feel very tumescent as well.
PhDwithaPHD@reddit
Is this like additive? That's a lot of stuff going on in your pants and, as a strix halo owner myself, I might just go pantless
tired514@reddit
SWE above 80 let's gooooooo! >.<
srigi@reddit
On a 12GB RTX 3060
guesdo@reddit
I feel like at this point we just need TurboQuant working or some other form of KV compression like the one Deepseek uses and we are gooooood to go
R_Duncan@reddit
All Qwen 3.5+ are using gated delta-net which redices kv cache, often turboquant 4bit is worse than just q8_0 in terms of speed and space.
tired514@reddit
TurboQuant would be nice, but FWIW even Q8_0 for kv cache makes a big difference and has a minuscule effect on perplexity and accuracy in my testing.
annodomini@reddit
NVFP4 probably better than MXFP4. Yes, you can run NVFP4 on AMD, it might not be quite as optimized as on NVIDIA which has hardware support but it's no worse than any other FP4 quant scheme and NVFP4 is generally a better format than MXFP4.
tired514@reddit
Hmmm.. I thought NVFP4 was a no-go with Vulkan on AMD; does it actually work?
annodomini@reddit
Yep, it works on both Vulkan and ROCm.
So far I've only tested it with Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4.gguf just to see if it worked, and it works just fine.
I haven't benchmarked it at all, so I don't know if it's slower than other quants; I should probably do some benchmarking.
I had to convert it myself with
./convert_hf_to_gguf.pybecause there aren't as many GGUFs of the NVFP4 quants available.tired514@reddit
Interesting. :)
I'll do some testing then!
Fit-Produce420@reddit
It's not better on the strix halo.
annodomini@reddit
Depends on whether you're measuring speed or quality.
NVFP4 should give you better quality; it has smaller blocks that it applies scaling factors to, and better precision by two level scaling factors. So it can get a much more accurate quantization, at the expense of a little bit more data.
Now, it may be slower, since it's going to have to do more work on the scaling. I haven't benchmarked it. I tried it out once with a Nemotron 3 Omni 30B-A3B and it worked just fine, but I didn't dig into it very deeply (turns out that audio isn't implemented for that model yet in llama.cpp) and it worked just fine, but maybe other quants will be faster on Strix Halo, I haven't tested that out at all.
Fit-Produce420@reddit
I'll clarify, native MXFP4 models like gpt-oss are faster and highly accurate on systems that support that format compared to taking a model, quantizing it to NVFP4 and running it on the same system.
Strix isn't nvidia and it doesn't natively run NVFP4 which is why I said what I said. NVPF4 is proprietary, if you buy a chipped that supports it and use a natively trained model it is better than mxfp4 but they aren't the same thing, mxfp4 is a community standard and not proprietary.
thaddeusk@reddit
You try Minimax m2.7 on it yet?
tired514@reddit
Not yet; to be honest I'd have to quantize it so heavily I imagine even Qwen3.6-27B @ Q8 would outperform it on many tasks. :/
cafedude@reddit
I got a Q3 running on mine. Wasn't impressed. It kept trying to run the tests but wasn't changing directory to where the tests are so just kept failing. I even told it to 'cd tests && run_tests.sh' but it still didn't get it. I think the Q3 just makes it kind of dumb.
redblood252@reddit
It’s got what the plants crave
high_on_meh@reddit
You better post your results! Are you on the Strix-Halo Discord? If not, the invite link at at the bottom Donato's website at https://strix-halo-toolboxes.com/
jikilan_@reddit
Ya, post trained with mxfp4 just like gpt-oss 120b that would the new king!!!
Vancecookcobain@reddit
File that under things that will never happen
OMG_IM_A_GIRL@reddit
I want this at Q6 for my MacBook Pro.
tecneeq@reddit
It would be what i need.
UniqueAttourney@reddit
At this point, I think we will need more than the final result number, something like token intelligence : how much token is it using to finish the benchmark. of course assuming that the token per watt is the same for all the other models.
johnnyApplePRNG@reddit
I can't stand these disingenuous benchmark graphs..
Claude 4.7 has been out for months, and so has ChatGPT 5.5 ... no comparison ... claims they "won" ... GTFO
Mychma@reddit
Wtf a 3.7 ? And I am still waiting for Qwen3.6 9B / 4B / 2B / 0.8B . Gemma is really good but Chinese models have better slavic support(which is interesting) and especially qwen more better c++ performance even though gemma is really strong in the coding up part but debugging and modifing not so much. Also I still use Qwen3 4B 2507 model( I just can't explain why it is so well rounded model that it can give such a great answers even compared to gemmas 4 ass backward logic sometimes) maybe 35T and more training tokens to small models is the way???
LegacyRemaster@reddit (OP)
we will see... maybe 3.7 will be 9b 4b
__JockY__@reddit
For the love of all that’s holy I hope they drop the 397B A17B this time! The NVFP4 of 3.5 fits on 4x rtx 6000 pros with room for 10x concurrent sessions of 200k tokens, it’s an absolute unit of a model.
If they dropped that and it performs like those benches suggest then it’s pretty much Opus at home, where home = GPU baller paradise.
kleinishere@reddit
What about Dwarf Star 4? Are you trying that with the 4x RTX 6000 Pros? (Not sure if it’s feasible but imagine it would be amazing).
FullOf_Bad_Ideas@reddit
no multi-gpu support last time I checked
__JockY__@reddit
I’ve never heard the words Dwarf Star 4 and I’m too lazy to google it.
kleinishere@reddit
https://github.com/antirez/ds4
Custom pipeline for running Deep Seek v4 flash at great speed and memory efficiency
ionizing@reddit
I don't have the hardware for this at work but thanks for the info, it is amazing and awesome to see projects like this.
t3rmina1@reddit
We already have separate faster methods built on vLLM and SGLang getting developed for our hardware so we don't use DwarfStar
AppealSame4367@reddit
But why would they? q3.6 27B is already quite close to Opus 4.5, next one would be close to 4.6.
Why would they diminish their own earnings? The price for qwen3.7 max itself tells you that the ultra cheap times are over. Alibaba is a giant company, they did this to crash the competition with cheap / open weights models, not because they wanna provide you with free models forever.
I say there will be one open weights model for qwen3.7 and it will be the last for a long time.
TokenRingAI@reddit
27B is a very good model for it's size, but it is not close to Opus on anything but benchmarks. It beats Haiku, it is a bit worse than Sonnet, it doesnt beat Gemini Flash or GPT mini.
whitefritillary@reddit
i would even argue in some areas it doesn’t beat haiku.
Healthy-Nebula-3603@reddit
Looking on YouTube tests easily beats sonnet 4.6 in coding.
thaddeusk@reddit
I use Opus for work every day, Qwen 3.6 27b is nowhere close to it. It's good for small changes and simple projects, but it was still failing to do things that Sonnet was easily completing.
Healthy-Nebula-3603@reddit
I didn't tell about opus .... Did you read properly what I said ?
thaddeusk@reddit
Did you read properly what I said?
Healthy-Nebula-3603@reddit
Yes and has no sense
thaddeusk@reddit
Read it again. I also mentioned Sonnet.
Healthy-Nebula-3603@reddit
As I said. Your response has no sense.
That you mentioned the sonnet in the last sentence nothing changes.
thaddeusk@reddit
Or you just have poor reading comprehension.
I've used several versions Sonnet over the last year. I have thousands of dollars in local hardware and have tried many of the open weight coding models at various quant levels. Smaller open weight models are always far behind.
Healthy-Nebula-3603@reddit
Sure Karen
thaddeusk@reddit
Ah, resorting to insults, mature. Everybody who actually codes says the same thing about the smaller open weight models. You can deny it all you want, though.
I'm not saying they're bad, but you can't really compare them to the big frontier models.
ECrispy@reddit
vibecoders with videos 'make an html page with random game' isn't coding
Healthy-Nebula-3603@reddit
I'm not looking on vibe coders but real long tasks. You can find such in YouTube as well .
ECrispy@reddit
can you give me some links, would like to see, most of the ones I find use one shot prompt tests
ttkciar@reddit
Those of us who have actually used it know otherwise.
Healthy-Nebula-3603@reddit
I'm usitandnis suberrb but you need fp16 not compressed as even q4m is damaging model badly on the agent mode. ( long context )
NoFaithlessness951@reddit
Sure, buddy...
ECrispy@reddit
curious what you need that kind of capacity for? running for a big team, renting out?
__JockY__@reddit
Parallel agentic workflows, multi-user Claude cli, some batched query work… a bit of all sorts!
Consistent-Height-75@reddit
Hope its not just benchmaxxing
ttkciar@reddit
Qwen is pretty great, but they do benchmax rather a lot.
carrotsquawk@reddit
who does mot? as soon as a benchmark gets popular ut becomes useless as everyone trains for it
ttkciar@reddit
As far as I can tell, ZAI, Google, and AllenAI do not benchmax, or at least not much.
Wouldn't surprise me if every other R&D lab in the world benchmaxed, though.
randombsname1@reddit
Google does hardcore benchmaxxing.
The vast majority of users on r/bard even know that.
Anthropic is probably the only SOTA company that legit scores UNDER what general consumer/consensus "feels", are.
Its also shown by the fact they are on top or very near the top on sanitized benchmarks like swe-REbench. Specifically rebench, as the normal swe bench is benchmaxxed to hell.
whitefritillary@reddit
Anthropic benchmaxxes too, they’re just worshipped a lot so everyone ignores it.
randombsname1@reddit
Which model?
Again, every Anthropic model seems to score LOWER than what most people seem to indicate, and it's also SOTA on sanitized benchmarks as mentioned previously.
whitefritillary@reddit
every.
do you have any actual evidence for this or is it just vibes?
randombsname1@reddit
Which model? And when? Name a specific one, and a time and specific example.
I mean i literally cited swe rebench which is a sanitized benchmark in my first comment you responded to.
So again, which model scored super high on benchmarks, but was far worse than what benchmarks suggested and/or at least not as capable as other models in the same, "performance bracket".
whitefritillary@reddit
every one they have released so far. Anthropic isn’t this magically morally good corporation that would never hurt a fly. they benchmaxx because every studio benchmaxxes.
i was talking about your assertion that “every Anthropic model seems to score LOWER than what most people seem to indicate, not your point about sanitised benchmarks.
randombsname1@reddit
The swe rebench is an example of an objective metric. That's the point of stating that. It's not JUST vibes. Vibes is part of it, but there are also objective measurements. Thats my point.
"every one they have released so far. Anthropic isn’t this magically morally good corporation that would never hurt a fly. they benchmaxx because every studio benchmaxxes."
Well thats weird. Because their benchmarks show them as in the group of SOTA models, and they have routinely proven to be at that range by both objective/subjective criteria.
If you saw Google's benchmarks by comparison you would think they were constantly on top of the AI race wrecking both OAI and Anthropic. Yet it is heavily out classed by both in most workloads per even blind or semi-blind testing like in LM Arena.
whitefritillary@reddit
i will only be replying to your last paragraph because the previous ones are just a restatement of the same nonsense. you personally feel like Google’s underperform so much relative to benchmarks because you’re looking at this from a coding-based framework. google’s models may be quite bad at coding but they make up for that by being very good at maths, physics, and other STEM areas. which is why they score so high in benchmark leaderboards and then feel disappointed, most people don’t want them for that but rather for coding.
randombsname1@reddit
But they also score high in coding and aren't anywhere close.
Also, google scores high in math, but even in blind testing there doesn't seem to be a preference for Google over OAI in STEM fields.
So.....?
Also, you still haven't pointed to a specific model/instance of Anthropic benchmaxxing.
If it happens on every model. Then it would take 2 seconds to pull out an example of it allegedly happening.
Meanwhile I on the other hand can pull out both objective and subjective sources on why, for example, specific Chinese models are benchmaxxed to hell and back.
whitefritillary@reddit
blind testing is just vibes based and isn’t actual proof of anything.
because that’s literally impossible to do. it’s impossible to definitively say with absolute certainty that something is evidence of benchmaxxing. the only thing there is is rumours.
randombsname1@reddit
Well no,
If Anthropic came out today with a new model that scored 90% on Swe Bench or terminal bench or whatever -- and then scores 50% on the sanitized swe rebench. Then that is an objective measurement that there is a wide overstating of capabilities.
1 or 2 people blind testing doesnt mean much.
Tens of thousands or hundreds of thousands of blind testing results in stuff like Lm arena gives you a very good sense of the "vibes" of the model regardless.
Anthropic has yet to score poorly or out of the expected range on both objective measurements or "vibes" measurements.
You just keep harping on about them benchmaxxing because, "all companies benchmaxx" which is funny considering you are talking about not being able to take subjective or vibe-based feelings into the equation. Lmao.
whitefritillary@reddit
yeah i’m just going to stop engaging considering you seem to understand essentially nothing of what you’re talking about.
randombsname1@reddit
I understand what your argument is. Its just a terrible argument. Thats my point.
Good day.
whitefritillary@reddit
you claiming my argument is terrible doesn’t make that true. peace.
randombsname1@reddit
You didnt cite a single thing or respond to even the simple request to point to a single model or even 1 instance.
Bye.
whitefritillary@reddit
i already explained how that’s not possible. bye.
randombsname1@reddit
It is. I can do it for every chinese model and multiple chatgpt and gemini models.
With both objective measurements on benchmarks and vibe benchmarks. Both.
You're wrong. Feel bad.
whitefritillary@reddit
whatever you say i guess. what you said is actually still false, but i’m not going to continue arguing with someone who genuinely doesn’t understand the situation because it will lead nowhere.
out of curiosity, do you also believe that Chinese models distil from Claude and other western LLMs whereas said western LLMs don’t?
randombsname1@reddit
You still cant and didnt cite anything. Nor did you ever point out how I was wrong. You said I was wrong because Its not something you can prove. I cited explicit benchmarks that can provide objective and subjective benchmarks which you casually ignored, and you still pretend like somehow we can't cite both objective/subjective measurements.
So, you're still wrong -- that I'm wrong.
They all distill, but why does that matter for this topic?
Although I do think western companies distill less as that is just the natural course of things when your competitors are 6+ months behind.
The companies who are in, "the lead" (money, technologically, whatever lead you want to insert here) are usually the ones that get stolen from more. That's just how it works out.
whitefritillary@reddit
that’s your subjective opinion.
regarding distillation, i was just curious what you’d say. because i’m not sure if you’re aware that claude distilled deepseek
randombsname1@reddit
For the subjective part. Sure. For the objective part its objective. Not an subjective opinion. Getting consistently higher scores on METR or swe rebench are hard numbers. Provided through stated methodology on the associated white papers for those benchmarks.
Yes I saw the distill story. Not surprised that they would still a chinese model on chinese content.
whitefritillary@reddit
no, it is not, you just don’t understand how it works.
randombsname1@reddit
How does it work?
whitefritillary@reddit
scoring lower in sanitised benchmarks can, some cases be indicative of benchmaxxing, but it’s not even nearly as definitive as you’re trying to make it appear.
randombsname1@reddit
Sure a one off isnt any big deal.
All chinese models scoring lower, in every model release -- than what their stated swe bench scores would indicate, is.
That's the difference.
Thats pretty definitive.
Also scoring terribly on long horizon benchmarks time and time again, on multiple models. Is pretty indicative as well. Like on METR.
whitefritillary@reddit
correlation≠causation.
randombsname1@reddit
Benchmarks didnt cause chinese models to be bad.
They just are.
They also didnt cause Gemini to have inflated STEM scores.
They just are.
Hard to make this argument when benchmarks simply benchmarked what they are poor at, lol.
whitefritillary@reddit
jesus christ please work on your reading comprehension and general situational awareness i’m begging you.
randombsname1@reddit
Hilarious coming from you.
Considering you have literally tried every attempt to side step actually providing an answer to any direct questions. Like citing a single instance of anthropic benchmaxxing (from the "eVerY mOdeL!") person.
Meanwhile I can cite sources for every claim I have made.
How's that distillation argument coming along? How did that help you in this line of reasoning/thread?
whitefritillary@reddit
“no u”
randombsname1@reddit
An "L" for you, madam.
whitefritillary@reddit
if you say so.
tengo_harambe@reddit
There's billions$$$ on the line. Everyone is benchmaxxing.
randombsname1@reddit
Anthropic. Or they seem not to anyway.
Yeelyy@reddit
Everyone does.
Norwood_Reaper_@reddit
It's actually pretty good
2Norn@reddit
ofc it is
sword-in-stone@reddit
they SOTA on critpt, it can't be benchmaxxed, weird
ratocx@reddit
While this really promising, I find it a bit suspicious that Opus 4.7 and GPT 5.5 isn’t included. Like Opus 4.7 is usually scoring better on benchmarks than 4.6. And apparently GPT models have become really good coding models since 5.4.
But I suppose they really want to only promote Chinese models, and just needed a comparison point to one of the most popular American models for coding.
whitefritillary@reddit
Opus 4.7 does not score better than 4.6 in the vast majority of benchmarks, that’s only how it looks because Anthropic nerfed 4.6. 4.7 was a downgrade aimed at improving compute management.
jamaalwakamaal@reddit
Same as what the American closed models do
ratocx@reddit
True, but the exception is that the latest American closed source models have always been on top on average across benchmarks. But I see the point. OpenAI has sometimes not shown comparisons to the latest model of Anthropic, or hidden it far down on their blog post in the benchmarks they don’t win.
Chinese models have come close to the top, but they have never actually been on the top, unless you think about price vs. performance.
Still feels a bit misleading regardless of who does it, not including the top models when they are comparing to the previous gen top model. Makes me trust the provider less. And Qwen seems to more consistently leave out the latest versions of American competitors.
Every_Bathroom_119@reddit
max never open weight before
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Sicarius_The_First@reddit
The state of locallama is very sad, I made the exact same post 6 hours earlier, and it was BANNED without explanation.
artisticMink@reddit
Did they publish the weights for a single max model?
Septerium@reddit
Now that they have the SOTA behemoth, why would they cook and release open-weight smaller models? Perhaps to keep people talking about Qwen... but what if just hitting the top charts is enough?
zephyr_33@reddit
3.6 plus didn't feel as good as the benchmarks claimed. actively made a tonne of mistakes in code exploration, I'm not too convinced here.
Gailenstorm@reddit
The analysis looks good!
https://artificialanalysis.ai/models/qwen3-7-max
mistressrvn@reddit
Wow, this is genuinely next level. If its not been benchmaxxed then this is a new revolution. Can't wait for A27B
Specter_Origin@reddit
the token efficiency of even the Max is not that great, if it sticks to how 3.5 and 3.6 have been the local one gonna be a looper and over thinker.
Also per qwen team they will only open-weight their small models so don't expect anything larger than 50b
AltruisticList6000@reddit
Qwen 3.6 35b looping was fixed for me with increasing the presence penalty to the recommended value. It still occasionally gets into thought loops but instead of doing it forever, it will exit eventually. And in my experience if it starts the thought loop process and finally gives the end result/reply it usually happens at a point when it previously failed to do something and after the looping it always spits out a perfect solution to the problem so it's not a useless thing, even if not too good and takes more time.
TerminalNoop@reddit
what is the recommended value, 1.5?
OMG_IM_A_GIRL@reddit
Link?
Main-Lifeguard-6739@reddit
Check out the new Qwen! Even better at benchmaxing than Gemini!
mitchins-au@reddit
But can it finish a sentence without running out of tokens. Qwen 3.5 used 4x more reasoning tokens than 3.0
AI-Agent-Payments@reddit
The closed/open split has been consistent across Qwen releases, so betting on the MoE A22B or a mid-tier dense being the open-weight drop is more realistic than hoping for the flagship. Worth noting that the 30B A3B from Qwen3 actually punches hard for its active parameter count, and if the 3.7 equivalent holds similar efficiency gains, consumer hardware users will probably get more mileage out of that tier anyway. The benchmark numbers on the max series are real, but the open weights are where the fine-tuning community actually proves whether they hold up on domain-specific tasks.
Historical-Crazy1831@reddit
The only reason they're delaying the release of the small models is that their large models don't significantly outperform them. Qwen is well known for its strong small models, but its large models haven't been as impressive. Part of the reason Lin was forced to step down is that their large model failed to outperform competing other Chinese large models like doubao, kimi, glm.
Dr_Me_123@reddit
If only they had enough resources to train a 1T model
TheRealMasonMac@reddit
Their previous max models have been 1T.
Historical-Crazy1831@reddit
If they can train a mythos-level model and open source a 70b Opus4.6-level dense model, that would be great!
Gohab2001@reddit
It's 7.5$/2.5$ and performs worse than Gemini 3.5 flash from my limited testing. DoA model.
cosimoiaia@reddit
gguf when? (Sorry, I know it's not even released yet but I had to)
DepressedDrift@reddit
Please have 9B, please have 9B, please have 9B......
riceinmybelly@reddit
9B and 122B?
DepressedDrift@reddit
Anything a normal person can afford
riceinmybelly@reddit
Strix halo or second hand Mac Studio will get you bigger models or more loaded at the same time and costs less so it depends on how important speed is because they are not fast.
_-_David@reddit
Don't hold your breath
Tough_Frame4022@reddit
Trying it now. You can chat with it for free on Alibaba cloud services.
ballshuffington@reddit
Why didn't they compare against Google pro series?
IISomeOneII@reddit
Holy
DeepOrangeSky@reddit
Is it theorized that this closed-weights Qwen3.7 Max is still just a 397b a17b model? Or is it thought to be some bigger, different private model, like maybe 1T parameters or something more along those lines?
nuclearbananana@reddit
Mind you this is the MAX model. Don't except the 27b model to be as good
srigi@reddit
Also these benchmarks are done on (B)F16 models. 27B at Q4 is not what you see in marketing material.
LegacyRemaster@reddit (OP)
SWE bench... All I need
Far-Low-4705@reddit
wtf happened to 3.6 on that one math benchmark???
crone66@reddit
Why no gpt 5.5 and opus 4.7?
somerussianbear@reddit
Would love these numbers to represent reality but we know that they don’t.
hwpoison@reddit
Someone knows if there is small model series of this version?
temperature_5@reddit
"It's better than Opus!"
But I admit I will be running it, at least until the next GLM Air comes out and surpasses it (please?)
ea_man@reddit
So how does the cost compare to DeepSeek 4?
ttkciar@reddit
Sir, this is LocalLLaMA.
laul_pogan@reddit
Worth flagging for when weights drop: Qwen3.5 text-only weights ship with multimodal lineage, so vLLM fails to load them unless you strip the
model.language_model.*key prefix from the state dict and removemrope_sectionfrom config.json. Not obvious from the error. Expect 3.7 to need the same treatment if they follow the same save path.suprjami@reddit
Or set
--language-model-only?laul_pogan@reddit
Are you following me around ?
suprjami@reddit
Oh I didn't even see it was you. I fell for a bot comment again.
pineapplekiwipen@reddit
if that math reasoning score translates into real world performance i'm gonna be one happy guy, been building some equity research agents and qwen3.6 and under have been a letdown
Budget-Toe-5743@reddit
Hello, does anybody know how much memory we would need to run these new Qwen models?
hainesk@reddit
I'd like to see the 397b model make a come back. If they could make a 3.7 397b model it would be close to SOTA.
Trollfurion@reddit
Was just wondering - how to do what they claimed in the post? I mean continuously run agent which is optimizing the code
Intelligent_Ice_113@reddit
I need it. can I have it now? 😿
Qwen_os_has_died@reddit
Everyone acts like you were the inferencing service provider. Doesn't the new models break the existing workflows ?
doesnt_matter_9128@reddit
Idts in real usage its gonna cross, any of the others mentioned except qwem3.6 plus
shokuninstudio@reddit
Give me a $60,000 workstation and I'll verify it.
LegacyRemaster@reddit (OP)
less then 60k
iloveplexkr@reddit
king is dead?