Waiting for Qwen 3.7 open weight... The new King has arrived...

[-]

alphapussycat@reddit

Don't wait for it, only be glad if something is released.

Remember that releasing highly capable local models hurts their own monetization. As they announced in April, they're no longer aiming for disruption/sabotaging they're aiming at monetization and competing for frontier.

[-]

ttkciar@reddit

Yes, but they have to balance that against their government mandating efforts toward contributing to an open ecosystem in the current five-year economic plan.

Central economic planning sucks, but in this case at least it serves the open LLM community's interests.

[-]

budihartono78@reddit

Central economic planning sucks

Lol are you actually feeling bad for multibillion companies like Alibaba

[-]

ttkciar@reddit

No, it's more of a statement of principle.

People should be free to do what they want with what they own, otherwise they don't actually own it.

That holds whether the possession in question is a computer, a pen and paper, their own body, or a business.

[-]

6ghz@reddit

But what about the billionaires 😢

[-]

whitefritillary@reddit

“China is promoting an open ecosystem for AI development, but at what cost??”

[-]

zippydazoop@reddit

All economies are centrally planned, most of them just aren’t for you. They’re planned to rig the game in favor of the established players and to eliminate any possible competition.

[-]

Jorlen@reddit

It has to be a small, very very small niche % of people who are using local LLM models, no? Am I underestimating it?

The technical requirements as well as hardware requirements must cut off most people. Sure you can fire up a LLM in LM studio but I can't imagine many are using models >25gb in size like Qwen 3.6 27b/35b or Gemma4 31b, etc.

[-]

alphapussycat@reddit

The technically requirements are almost none... lm studio and ollama makes it extremely easy to download and run. And sure, 24gb is like the upper end of the consumer grade, and kind of the minimum to run.

But sure, to make them useful it requires more work, like internet search (but maybe ollama comes with it). Though, that's kind of circumvented by the fact that most people who want those features are probably using the qwen models for coding, or otherwise handy people who can just vibe code whatever they need.

[-]

Jorlen@reddit

Fair point, yeah. I can't imagine people using Qwen 3.6 line for anything other than coding or deep diving into project documentation, etc.

[-]

Mindless_Pain1860@reddit

Qwen has never open-weighted the Max series…

[-]

LegacyRemaster@reddit (OP)

true but 27b 397b are soo goood

[-]

NNN_Throwaway2@reddit

397b isn't being open-weighted anymore.

[-]

Happythen@reddit

source?

[-]

NNN_Throwaway2@reddit

I don't understand the question; its a fact, not something that needs citing. 3.6 Plus was not open-weighted. Source, there are no open weights you can download.

[-]

Happythen@reddit

You wrote the "397B", not 3.6 Plus. That reads like you are predicting that there will not be an open weight version of another 397B in the future. I asked for clarification because you'd be the first with that sort of confirmation if you had a source.

[-]

whitefritillary@reddit

qwen3.6-397B-A17B wasn’t open weighted. source: google it ffs.

[-]

Happythen@reddit

Didn't say it was, mr. aggressive. Calling out that one missed open weights release at 122B and 397B at a minor version change doesn't necessarily mean they won't release larger open weight models again, unless... there is a source link/Qwen blog I haven't seen, which is what I was asking.

[-]

whitefritillary@reddit

it’s ms. actually. and if you genuinely think it looks like Alibaba will probably release a 397B-A17B then you haven’t been keeping up with how they’ve behaved lately.

[-]

Happythen@reddit

I guess not! Doing my best following huggingface, qwen blogs and their communications, I must have missed something. Because predicting what models they will release, (and which ones they don't), seems like a silly endeavor with hard evidence besides just a gut feeling bsed on their last release. Apologies for the misgendering, honestly.

[-]

NNN_Throwaway2@reddit

They have made it pretty clear what their future plans are if you do a bit of reading between the lines.

I was basically the only person in this sub to point out there would be no more 3.6 releases after the 27b based on the 27b blog post. No one believed me, but lo and behold here we are.

So you can keep trying to cope or you can accept that I might actually know what I’m talking about.

[-]

Happythen@reddit

> reading between the lines

Got it!

[-]

NNN_Throwaway2@reddit

Somehow I don’t think you do.

[-]

NNN_Throwaway2@reddit

Plus is the 397b.

[-]

Happythen@reddit

Right-ish, 3.5 plus was based off of the 397B model, but the hosted plus version had more production features, 1 mil context, etc... There were no open weights in the 3.6 release for 122B or 397B. But your assumption is that future versions will not have open weights at these parameters based off of the roll out of 3.6? They don't have a lot of patterned rollouts besides 26B and 35B: https://huggingface.co/Qwen/collections#collections

[-]

NNN_Throwaway2@reddit

They’re literally the same weights. 1m context is through rope scaling, not a unique feature. You are simply misinformed on a lot of details.

There will be no more open weights of plus/397b. I’m not sure why that’s so hard to believe. At this point it looks like we might only get a 27b of 3.7, if that.

[-]

Happythen@reddit

Same weights for sure! My information is coming straight from Qwen's model card on huggingface, unless you think that's wrong:
https://huggingface.co/Qwen/Qwen3.5-397B-A17B

In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use.

As far as no more open weight models besides 27B and 35B, it's a decent guess! But it's a guess. Qwen's team has not confirmed they will not be shipping larger open weight models from here on out.

[-]

NNN_Throwaway2@reddit

What on god’s green earth do you think your point is, Einstein? I said they are not open-WEIGHTING Plus anymore. Plus and 397b are the same WEIGHTS regardless of what features are part of their production harness. My statement stands. The 397b supports 1m context as well, just not be default. You are being ridiculous.

[-]

Happythen@reddit

Apologies! Not trying to attack here, you made the statement that 397B will no longer be open and I wanted to know more. Honestly, apologoies, I'll leave it alone.

[-]

NNN_Throwaway2@reddit

Know more about what? I said it wasn’t being open weighted and you tried to be smug and throw misinterpreted information in my face. Now you are pretending to act innocent. Unbelievable.

[-]

VectorD@reddit

Neither is the 122B.. Qwen releases are really dull these days after their internal disaster

[-]

Affectionate-Cap-600@reddit

what disaster?

[-]

VectorD@reddit

The people who cared about open source all got fired and some ex google employee took over the lead to focus on profit gen

[-]

Suzoku@reddit

what? can we stop treating baseless rumors as a fact now? Max was never open source, and Qwen never open sourced 122B nor the 397B models ever. Even back in Qwen2.5 the biggest open source model from them was 72b.

[-]

whitefritillary@reddit

[-]

zxyzyxz@reddit

Great visual novel

[-]

Su1tz@reddit

Fuck off bot

[-]

NNN_Throwaway2@reddit

This is just embarrassingly inaccurate lmao. Are you drunk?

I'm literally using the 397b locally right now.

[-]

VectorD@reddit

Qwen 3.5 397B and 122B are open weights, not so for 3.6

[-]

Competitive_Ideal866@reddit

Neither is the 122B

What?!

[-]

whitefritillary@reddit

qwen3.6-122B-A10B hasn’t been open weighted.

[-]

relmny@reddit

"dull"??? have you heard about 3.6 27b and 35b ?

They were released after the new team took over

[-]

VectorD@reddit

These releases were pathetic in the way that they went about the release. They made some twitter poll for which one of their models to release (even though all of them are ready), as some weird kind of cock-tease / thirst trapping. Then refused to release the rest of the models we had in 3.5.. The 3.5 release series was just a mic drop, and 3.6 was some weird marketing play.

[-]

swingbear@reddit

You mfer best vote for a bigger model this time the 20-30 was great don’t get me wrong but give me a 100-300

[-]

Competitive_Ideal866@reddit

397b isn't being open-weighted anymore.

What?! I've been waiting to buy a 512GB Mac Studio just to run that.

[-]

NNN_Throwaway2@reddit

3.5 397b is still quite good.

[-]

Competitive_Ideal866@reddit

That's what I've heard but I still want more!

[-]

whitefritillary@reddit

with 512gb of unified RAM you could also run deepseek-v4-flash and even glm-5.1

[-]

ECrispy@reddit

how do 3.6 37B 35B-A3B compare on these benchmarks? and will be get open versions of those with 3.7?

[-]

whitefritillary@reddit

so first off there is no 37B. regarding the 35B-A3B, the 27B will be much substantially smarter, and the only reason you’d want to run the 35B-A3B is for either speed or lack of VRAM.

[-]

vick2djax@reddit

Isn’t it worse the 3.6 27b?

[-]

thaddeusk@reddit

Yes, but 27b is also much slower.

[-]

tracagnotto@reddit

ORLY???? Did you expect them to not have a commercial model to sustain expenses?

They literally have you the best parameter for parameter free small model available.

And even if. What you would do with max you dummy? Buy a datacenter to run it?

[-]

whitefritillary@reddit

delete your account.

[-]

tracagnotto@reddit

No u

[-]

whitefritillary@reddit

…

[-]

vick2djax@reddit

This isn’t how you grow a community.

[-]

StupidScaredSquirrel@reddit

Nobody is shotting on qwen it's just that this sub is interested in open models so if a model is unlikely to be open then it's kinda irrelevant to us.

[-]

tracagnotto@reddit

I know what the sub is for but the guy was complaining about them not releasing max which is kind of idiotic. Everyone knows they will release some minor models open and free, they always did

[-]

StupidScaredSquirrel@reddit

No he's just saying that this series isn't open so the post is not relevant to the sub. If and when they release an open model then the benchmarks will be relevant

[-]

tracagnotto@reddit

And the post is saying that is waiting for them to release open weight

[-]

StupidScaredSquirrel@reddit

So they are letting them know that max was never subsequently open.

It was a normal interaction until you decided to be all hostile about it

[-]

jonathanx37@reddit

"Leave my billion dollar company alone"

[-]

tracagnotto@reddit

Ye sorry

[-]

Cereal_Grapeist@reddit

Please delete your account if this is how you're going to behave

[-]

tracagnotto@reddit

No

[-]

No_Swimming6548@reddit

We just don't expect it to be shared in local llm sub.

[-]

adeadbeathorse@reddit

Chill, lol

[-]

tired514@reddit

3.7-122B-A17B MTP MXFP4 with 512k context would be the absolute shizzle.

[-]

VoiceActorForHire@reddit

Fuckk i dont understand these words. can anyone explain

[-]

tired514@reddit

Aye :)

New version of Qwen - 3.7
122 billion total parameters (ie. at Q8 the model would be \~122GiB)
17 billion "active" per inference run (qwen 3.5 122B only has 10B active)
MTP is a new speculative prediction layer that's all the rage right now; in many cases it doubles token generation performance with zero loss of quality
MXFP4 is a \~4.5bit floating point-based quantization method that's standardized and very efficient; the model would be \~70GiB or so with great performance
512k context limit would double how much "work" it can do each session

[-]

VoiceActorForHire@reddit

Thank!

[-]

whitefritillary@reddit

>3.7-122B-A17B MTP MXFP4 with 512k context

just reading this made me so wet 😩

[-]

darkbasic4@reddit

Even better qwen3.7-coder-122B, we can dream 😄

[-]

cafedude@reddit

Even a 3.7-coder 80B would be amazing.

[-]

tired514@reddit

It would, but even with MTP that'd be pretty lethargic without an RTX6000. On Strix Halo at Q6 you'd probably see \~5-6t/s.

[-]

No_Lingonberry1201@reddit

I think it'd be a MoE with 3-10 active params (Qwen3-Coder-Next is 80B with 3B active, IIRC).

[-]

Fi3nd7@reddit

Omg please stop I can only get so erect. Jfc I'd shit my pants if they did this.

[-]

No_Lingonberry1201@reddit

As a fellow Strix user I feel very tumescent as well.

[-]

PhDwithaPHD@reddit

Is this like additive? That's a lot of stuff going on in your pants and, as a strix halo owner myself, I might just go pantless

[-]

tired514@reddit

SWE above 80 let's gooooooo! >.<

[-]

srigi@reddit

On a 12GB RTX 3060

[-]

guesdo@reddit

I feel like at this point we just need TurboQuant working or some other form of KV compression like the one Deepseek uses and we are gooooood to go

[-]

R_Duncan@reddit

All Qwen 3.5+ are using gated delta-net which redices kv cache, often turboquant 4bit is worse than just q8_0 in terms of speed and space.

[-]

tired514@reddit

TurboQuant would be nice, but FWIW even Q8_0 for kv cache makes a big difference and has a minuscule effect on perplexity and accuracy in my testing.

[-]

annodomini@reddit

NVFP4 probably better than MXFP4. Yes, you can run NVFP4 on AMD, it might not be quite as optimized as on NVIDIA which has hardware support but it's no worse than any other FP4 quant scheme and NVFP4 is generally a better format than MXFP4.

[-]

tired514@reddit

Hmmm.. I thought NVFP4 was a no-go with Vulkan on AMD; does it actually work?

[-]

annodomini@reddit

Yep, it works on both Vulkan and ROCm.

So far I've only tested it with Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4.gguf just to see if it worked, and it works just fine.

I haven't benchmarked it at all, so I don't know if it's slower than other quants; I should probably do some benchmarking.

I had to convert it myself with ./convert_hf_to_gguf.py because there aren't as many GGUFs of the NVFP4 quants available.

[-]

tired514@reddit

Interesting. :)

I'll do some testing then!

[-]

Fit-Produce420@reddit

It's not better on the strix halo.

[-]

annodomini@reddit

Depends on whether you're measuring speed or quality.

NVFP4 should give you better quality; it has smaller blocks that it applies scaling factors to, and better precision by two level scaling factors. So it can get a much more accurate quantization, at the expense of a little bit more data.

Now, it may be slower, since it's going to have to do more work on the scaling. I haven't benchmarked it. I tried it out once with a Nemotron 3 Omni 30B-A3B and it worked just fine, but I didn't dig into it very deeply (turns out that audio isn't implemented for that model yet in llama.cpp) and it worked just fine, but maybe other quants will be faster on Strix Halo, I haven't tested that out at all.

[-]

Fit-Produce420@reddit

I'll clarify, native MXFP4 models like gpt-oss are faster and highly accurate on systems that support that format compared to taking a model, quantizing it to NVFP4 and running it on the same system.

Strix isn't nvidia and it doesn't natively run NVFP4 which is why I said what I said. NVPF4 is proprietary, if you buy a chipped that supports it and use a natively trained model it is better than mxfp4 but they aren't the same thing, mxfp4 is a community standard and not proprietary.

[-]

thaddeusk@reddit

You try Minimax m2.7 on it yet?

[-]

tired514@reddit

Not yet; to be honest I'd have to quantize it so heavily I imagine even Qwen3.6-27B @ Q8 would outperform it on many tasks. :/

[-]

cafedude@reddit

I got a Q3 running on mine. Wasn't impressed. It kept trying to run the tests but wasn't changing directory to where the tests are so just kept failing. I even told it to 'cd tests && run_tests.sh' but it still didn't get it. I think the Q3 just makes it kind of dumb.

[-]

redblood252@reddit

It’s got what the plants crave

[-]

high_on_meh@reddit

You better post your results! Are you on the Strix-Halo Discord? If not, the invite link at at the bottom Donato's website at https://strix-halo-toolboxes.com/

[-]

jikilan_@reddit

Ya, post trained with mxfp4 just like gpt-oss 120b that would the new king!!!

[-]

Vancecookcobain@reddit

File that under things that will never happen

[-]

OMG_IM_A_GIRL@reddit

I want this at Q6 for my MacBook Pro.

[-]

tecneeq@reddit

It would be what i need.

[-]

UniqueAttourney@reddit

At this point, I think we will need more than the final result number, something like token intelligence : how much token is it using to finish the benchmark. of course assuming that the token per watt is the same for all the other models.

[-]

johnnyApplePRNG@reddit

I can't stand these disingenuous benchmark graphs..

Claude 4.7 has been out for months, and so has ChatGPT 5.5 ... no comparison ... claims they "won" ... GTFO

[-]

Mychma@reddit

Wtf a 3.7 ? And I am still waiting for Qwen3.6 9B / 4B / 2B / 0.8B . Gemma is really good but Chinese models have better slavic support(which is interesting) and especially qwen more better c++ performance even though gemma is really strong in the coding up part but debugging and modifing not so much. Also I still use Qwen3 4B 2507 model( I just can't explain why it is so well rounded model that it can give such a great answers even compared to gemmas 4 ass backward logic sometimes) maybe 35T and more training tokens to small models is the way???

[-]

LegacyRemaster@reddit (OP)

we will see... maybe 3.7 will be 9b 4b

[-]

JockY@reddit

For the love of all that’s holy I hope they drop the 397B A17B this time! The NVFP4 of 3.5 fits on 4x rtx 6000 pros with room for 10x concurrent sessions of 200k tokens, it’s an absolute unit of a model.

If they dropped that and it performs like those benches suggest then it’s pretty much Opus at home, where home = GPU baller paradise.

[-]

kleinishere@reddit

What about Dwarf Star 4? Are you trying that with the 4x RTX 6000 Pros? (Not sure if it’s feasible but imagine it would be amazing).

[-]

FullOf_Bad_Ideas@reddit

no multi-gpu support last time I checked

[-]

JockY@reddit

I’ve never heard the words Dwarf Star 4 and I’m too lazy to google it.

[-]

kleinishere@reddit

https://github.com/antirez/ds4

Custom pipeline for running Deep Seek v4 flash at great speed and memory efficiency

[-]

ionizing@reddit

I don't have the hardware for this at work but thanks for the info, it is amazing and awesome to see projects like this.

[-]

t3rmina1@reddit

We already have separate faster methods built on vLLM and SGLang getting developed for our hardware so we don't use DwarfStar

[-]

AppealSame4367@reddit

But why would they? q3.6 27B is already quite close to Opus 4.5, next one would be close to 4.6.

Why would they diminish their own earnings? The price for qwen3.7 max itself tells you that the ultra cheap times are over. Alibaba is a giant company, they did this to crash the competition with cheap / open weights models, not because they wanna provide you with free models forever.

I say there will be one open weights model for qwen3.7 and it will be the last for a long time.

[-]

TokenRingAI@reddit

27B is a very good model for it's size, but it is not close to Opus on anything but benchmarks. It beats Haiku, it is a bit worse than Sonnet, it doesnt beat Gemini Flash or GPT mini.

[-]

whitefritillary@reddit

i would even argue in some areas it doesn’t beat haiku.

[-]

Healthy-Nebula-3603@reddit

Looking on YouTube tests easily beats sonnet 4.6 in coding.

[-]

thaddeusk@reddit

I use Opus for work every day, Qwen 3.6 27b is nowhere close to it. It's good for small changes and simple projects, but it was still failing to do things that Sonnet was easily completing.

[-]

Healthy-Nebula-3603@reddit

I didn't tell about opus .... Did you read properly what I said ?

[-]

thaddeusk@reddit

Did you read properly what I said?

[-]

Healthy-Nebula-3603@reddit

Yes and has no sense

[-]

thaddeusk@reddit

Read it again. I also mentioned Sonnet.

[-]

Healthy-Nebula-3603@reddit

As I said. Your response has no sense.

That you mentioned the sonnet in the last sentence nothing changes.

[-]

thaddeusk@reddit

Or you just have poor reading comprehension.

I've used several versions Sonnet over the last year. I have thousands of dollars in local hardware and have tried many of the open weight coding models at various quant levels. Smaller open weight models are always far behind.

[-]

Healthy-Nebula-3603@reddit

Sure Karen

[-]

thaddeusk@reddit

Ah, resorting to insults, mature. Everybody who actually codes says the same thing about the smaller open weight models. You can deny it all you want, though.

I'm not saying they're bad, but you can't really compare them to the big frontier models.

[-]

ECrispy@reddit

vibecoders with videos 'make an html page with random game' isn't coding

[-]

Healthy-Nebula-3603@reddit

I'm not looking on vibe coders but real long tasks. You can find such in YouTube as well .

[-]

ECrispy@reddit

can you give me some links, would like to see, most of the ones I find use one shot prompt tests

[-]

ttkciar@reddit

Those of us who have actually used it know otherwise.

[-]

Healthy-Nebula-3603@reddit

I'm usitandnis suberrb but you need fp16 not compressed as even q4m is damaging model badly on the agent mode. ( long context )

[-]

NoFaithlessness951@reddit

Sure, buddy...

[-]

ECrispy@reddit

10x concurrent sessions of 200k tokens

curious what you need that kind of capacity for? running for a big team, renting out?

[-]

JockY@reddit

Parallel agentic workflows, multi-user Claude cli, some batched query work… a bit of all sorts!

[-]

Consistent-Height-75@reddit

Hope its not just benchmaxxing

[-]

ttkciar@reddit

Qwen is pretty great, but they do benchmax rather a lot.

[-]

carrotsquawk@reddit

who does mot? as soon as a benchmark gets popular ut becomes useless as everyone trains for it

[-]

ttkciar@reddit

As far as I can tell, ZAI, Google, and AllenAI do not benchmax, or at least not much.

Wouldn't surprise me if every other R&D lab in the world benchmaxed, though.

[-]

randombsname1@reddit

Google does hardcore benchmaxxing.

The vast majority of users on r/bard even know that.

Anthropic is probably the only SOTA company that legit scores UNDER what general consumer/consensus "feels", are.

Its also shown by the fact they are on top or very near the top on sanitized benchmarks like swe-REbench. Specifically rebench, as the normal swe bench is benchmaxxed to hell.

[-]

whitefritillary@reddit

Anthropic benchmaxxes too, they’re just worshipped a lot so everyone ignores it.

[-]

randombsname1@reddit

Which model?

Again, every Anthropic model seems to score LOWER than what most people seem to indicate, and it's also SOTA on sanitized benchmarks as mentioned previously.

[-]

whitefritillary@reddit

every.

every Anthropic model seems to score LOWER than what most people seem to indicate

do you have any actual evidence for this or is it just vibes?

[-]

randombsname1@reddit

Which model? And when? Name a specific one, and a time and specific example.

do you have any actual evidence for this or is it just vibes?

I mean i literally cited swe rebench which is a sanitized benchmark in my first comment you responded to.

So again, which model scored super high on benchmarks, but was far worse than what benchmarks suggested and/or at least not as capable as other models in the same, "performance bracket".

[-]

whitefritillary@reddit

Which model? And when? Name a specific one, and a time and specific example.

every one they have released so far. Anthropic isn’t this magically morally good corporation that would never hurt a fly. they benchmaxx because every studio benchmaxxes.

I mean i literally cited swe rebench which is a sanitized benchmark in my first comment you responded to.

i was talking about your assertion that “every Anthropic model seems to score LOWER than what most people seem to indicate, not your point about sanitised benchmarks.

[-]

randombsname1@reddit

The swe rebench is an example of an objective metric. That's the point of stating that. It's not JUST vibes. Vibes is part of it, but there are also objective measurements. Thats my point.

"every one they have released so far. Anthropic isn’t this magically morally good corporation that would never hurt a fly. they benchmaxx because every studio benchmaxxes."

Well thats weird. Because their benchmarks show them as in the group of SOTA models, and they have routinely proven to be at that range by both objective/subjective criteria.

If you saw Google's benchmarks by comparison you would think they were constantly on top of the AI race wrecking both OAI and Anthropic. Yet it is heavily out classed by both in most workloads per even blind or semi-blind testing like in LM Arena.

[-]

whitefritillary@reddit

i will only be replying to your last paragraph because the previous ones are just a restatement of the same nonsense. you personally feel like Google’s underperform so much relative to benchmarks because you’re looking at this from a coding-based framework. google’s models may be quite bad at coding but they make up for that by being very good at maths, physics, and other STEM areas. which is why they score so high in benchmark leaderboards and then feel disappointed, most people don’t want them for that but rather for coding.

[-]

randombsname1@reddit

But they also score high in coding and aren't anywhere close.

Also, google scores high in math, but even in blind testing there doesn't seem to be a preference for Google over OAI in STEM fields.

So.....?

Also, you still haven't pointed to a specific model/instance of Anthropic benchmaxxing.

If it happens on every model. Then it would take 2 seconds to pull out an example of it allegedly happening.

Meanwhile I on the other hand can pull out both objective and subjective sources on why, for example, specific Chinese models are benchmaxxed to hell and back.

[-]

whitefritillary@reddit

blind testing is just vibes based and isn’t actual proof of anything.

Also, you still haven't pointed to a specific model/instance of Anthropic benchmaxxing.

because that’s literally impossible to do. it’s impossible to definitively say with absolute certainty that something is evidence of benchmaxxing. the only thing there is is rumours.

[-]

randombsname1@reddit

Well no,

because that’s literally impossible to do. it’s impossible to definitively say with absolute certainty that something is evidence of benchmaxxing. the only thing there is is rumours.

If Anthropic came out today with a new model that scored 90% on Swe Bench or terminal bench or whatever -- and then scores 50% on the sanitized swe rebench. Then that is an objective measurement that there is a wide overstating of capabilities.

1 or 2 people blind testing doesnt mean much.

Tens of thousands or hundreds of thousands of blind testing results in stuff like Lm arena gives you a very good sense of the "vibes" of the model regardless.

Anthropic has yet to score poorly or out of the expected range on both objective measurements or "vibes" measurements.

You just keep harping on about them benchmaxxing because, "all companies benchmaxx" which is funny considering you are talking about not being able to take subjective or vibe-based feelings into the equation. Lmao.

[-]

whitefritillary@reddit

yeah i’m just going to stop engaging considering you seem to understand essentially nothing of what you’re talking about.

[-]

randombsname1@reddit

I understand what your argument is. Its just a terrible argument. Thats my point.

Good day.

[-]

whitefritillary@reddit

you claiming my argument is terrible doesn’t make that true. peace.

[-]

randombsname1@reddit

You didnt cite a single thing or respond to even the simple request to point to a single model or even 1 instance.

Bye.

[-]

whitefritillary@reddit

i already explained how that’s not possible. bye.

[-]

randombsname1@reddit

It is. I can do it for every chinese model and multiple chatgpt and gemini models.

With both objective measurements on benchmarks and vibe benchmarks. Both.

You're wrong. Feel bad.

[-]

whitefritillary@reddit

whatever you say i guess. what you said is actually still false, but i’m not going to continue arguing with someone who genuinely doesn’t understand the situation because it will lead nowhere.

out of curiosity, do you also believe that Chinese models distil from Claude and other western LLMs whereas said western LLMs don’t?

[-]

randombsname1@reddit

You still cant and didnt cite anything. Nor did you ever point out how I was wrong. You said I was wrong because Its not something you can prove. I cited explicit benchmarks that can provide objective and subjective benchmarks which you casually ignored, and you still pretend like somehow we can't cite both objective/subjective measurements.

So, you're still wrong -- that I'm wrong.

out of curiosity, do you also believe that Chinese models distil from Claude and other western LLMs whereas said western LLMs don’t?

They all distill, but why does that matter for this topic?

Although I do think western companies distill less as that is just the natural course of things when your competitors are 6+ months behind.

The companies who are in, "the lead" (money, technologically, whatever lead you want to insert here) are usually the ones that get stolen from more. That's just how it works out.

[-]

whitefritillary@reddit

that can provide objective and subjective benchmarks

that’s your subjective opinion.

regarding distillation, i was just curious what you’d say. because i’m not sure if you’re aware that claude distilled deepseek

[-]

randombsname1@reddit

that’s your subjective opinion.

For the subjective part. Sure. For the objective part its objective. Not an subjective opinion. Getting consistently higher scores on METR or swe rebench are hard numbers. Provided through stated methodology on the associated white papers for those benchmarks.

Yes I saw the distill story. Not surprised that they would still a chinese model on chinese content.

[-]

whitefritillary@reddit

no, it is not, you just don’t understand how it works.

[-]

randombsname1@reddit

How does it work?

[-]

whitefritillary@reddit

scoring lower in sanitised benchmarks can, some cases be indicative of benchmaxxing, but it’s not even nearly as definitive as you’re trying to make it appear.

[-]

randombsname1@reddit

Sure a one off isnt any big deal.

All chinese models scoring lower, in every model release -- than what their stated swe bench scores would indicate, is.

That's the difference.

Thats pretty definitive.

Also scoring terribly on long horizon benchmarks time and time again, on multiple models. Is pretty indicative as well. Like on METR.

[-]

whitefritillary@reddit

correlation≠causation.

[-]

randombsname1@reddit

Benchmarks didnt cause chinese models to be bad.

They just are.

They also didnt cause Gemini to have inflated STEM scores.

They just are.

Hard to make this argument when benchmarks simply benchmarked what they are poor at, lol.

[-]

whitefritillary@reddit

jesus christ please work on your reading comprehension and general situational awareness i’m begging you.

[-]

randombsname1@reddit

Hilarious coming from you.

Considering you have literally tried every attempt to side step actually providing an answer to any direct questions. Like citing a single instance of anthropic benchmaxxing (from the "eVerY mOdeL!") person.

Meanwhile I can cite sources for every claim I have made.

How's that distillation argument coming along? How did that help you in this line of reasoning/thread?

[-]

whitefritillary@reddit

“no u”

[-]

randombsname1@reddit

An "L" for you, madam.

[-]

whitefritillary@reddit

if you say so.

[-]

tengo_harambe@reddit

There's billions$$$ on the line. Everyone is benchmaxxing.

[-]

randombsname1@reddit

Anthropic. Or they seem not to anyway.

[-]

Yeelyy@reddit

Everyone does.

[-]

Norwood_Reaper_@reddit

It's actually pretty good

[-]

2Norn@reddit

ofc it is

[-]

sword-in-stone@reddit

they SOTA on critpt, it can't be benchmaxxed, weird

[-]

ratocx@reddit

While this really promising, I find it a bit suspicious that Opus 4.7 and GPT 5.5 isn’t included. Like Opus 4.7 is usually scoring better on benchmarks than 4.6. And apparently GPT models have become really good coding models since 5.4.

But I suppose they really want to only promote Chinese models, and just needed a comparison point to one of the most popular American models for coding.

[-]

whitefritillary@reddit

Opus 4.7 does not score better than 4.6 in the vast majority of benchmarks, that’s only how it looks because Anthropic nerfed 4.6. 4.7 was a downgrade aimed at improving compute management.

[-]

jamaalwakamaal@reddit

Same as what the American closed models do

[-]

ratocx@reddit

True, but the exception is that the latest American closed source models have always been on top on average across benchmarks. But I see the point. OpenAI has sometimes not shown comparisons to the latest model of Anthropic, or hidden it far down on their blog post in the benchmarks they don’t win.

Chinese models have come close to the top, but they have never actually been on the top, unless you think about price vs. performance.

Still feels a bit misleading regardless of who does it, not including the top models when they are comparing to the previous gen top model. Makes me trust the provider less. And Qwen seems to more consistently leave out the latest versions of American competitors.

[-]

Every_Bathroom_119@reddit

max never open weight before

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Sicarius_The_First@reddit

The state of locallama is very sad, I made the exact same post 6 hours earlier, and it was BANNED without explanation.

[-]

artisticMink@reddit

Did they publish the weights for a single max model?

[-]

Septerium@reddit

Now that they have the SOTA behemoth, why would they cook and release open-weight smaller models? Perhaps to keep people talking about Qwen... but what if just hitting the top charts is enough?

[-]

zephyr_33@reddit

3.6 plus didn't feel as good as the benchmarks claimed. actively made a tonne of mistakes in code exploration, I'm not too convinced here.

[-]

Gailenstorm@reddit

The analysis looks good!

https://artificialanalysis.ai/models/qwen3-7-max

[-]

mistressrvn@reddit

Wow, this is genuinely next level. If its not been benchmaxxed then this is a new revolution. Can't wait for A27B

[-]

Specter_Origin@reddit

the token efficiency of even the Max is not that great, if it sticks to how 3.5 and 3.6 have been the local one gonna be a looper and over thinker.

Also per qwen team they will only open-weight their small models so don't expect anything larger than 50b

[-]

AltruisticList6000@reddit

Qwen 3.6 35b looping was fixed for me with increasing the presence penalty to the recommended value. It still occasionally gets into thought loops but instead of doing it forever, it will exit eventually. And in my experience if it starts the thought loop process and finally gives the end result/reply it usually happens at a point when it previously failed to do something and after the looping it always spits out a perfect solution to the problem so it's not a useless thing, even if not too good and takes more time.

[-]

TerminalNoop@reddit

what is the recommended value, 1.5?

[-]

OMG_IM_A_GIRL@reddit

Link?

[-]

Main-Lifeguard-6739@reddit

Check out the new Qwen! Even better at benchmaxing than Gemini!

[-]

mitchins-au@reddit

But can it finish a sentence without running out of tokens. Qwen 3.5 used 4x more reasoning tokens than 3.0

[-]

AI-Agent-Payments@reddit

The closed/open split has been consistent across Qwen releases, so betting on the MoE A22B or a mid-tier dense being the open-weight drop is more realistic than hoping for the flagship. Worth noting that the 30B A3B from Qwen3 actually punches hard for its active parameter count, and if the 3.7 equivalent holds similar efficiency gains, consumer hardware users will probably get more mileage out of that tier anyway. The benchmark numbers on the max series are real, but the open weights are where the fine-tuning community actually proves whether they hold up on domain-specific tasks.

[-]

Historical-Crazy1831@reddit

The only reason they're delaying the release of the small models is that their large models don't significantly outperform them. Qwen is well known for its strong small models, but its large models haven't been as impressive. Part of the reason Lin was forced to step down is that their large model failed to outperform competing other Chinese large models like doubao, kimi, glm.

[-]

Dr_Me_123@reddit

If only they had enough resources to train a 1T model

[-]

TheRealMasonMac@reddit

Their previous max models have been 1T.

[-]

Historical-Crazy1831@reddit

If they can train a mythos-level model and open source a 70b Opus4.6-level dense model, that would be great!

[-]

Gohab2001@reddit

It's 7.5$/2.5$ and performs worse than Gemini 3.5 flash from my limited testing. DoA model.

[-]

cosimoiaia@reddit

gguf when? (Sorry, I know it's not even released yet but I had to)

[-]

DepressedDrift@reddit

Please have 9B, please have 9B, please have 9B......

[-]

riceinmybelly@reddit

9B and 122B?

[-]

DepressedDrift@reddit

Anything a normal person can afford

[-]

riceinmybelly@reddit

Strix halo or second hand Mac Studio will get you bigger models or more loaded at the same time and costs less so it depends on how important speed is because they are not fast.

[-]

_-_David@reddit

Don't hold your breath

[-]

Tough_Frame4022@reddit

Trying it now. You can chat with it for free on Alibaba cloud services.

[-]

ballshuffington@reddit

Why didn't they compare against Google pro series?

[-]

IISomeOneII@reddit

Holy

[-]

DeepOrangeSky@reddit

Is it theorized that this closed-weights Qwen3.7 Max is still just a 397b a17b model? Or is it thought to be some bigger, different private model, like maybe 1T parameters or something more along those lines?

[-]

nuclearbananana@reddit

Mind you this is the MAX model. Don't except the 27b model to be as good

[-]

srigi@reddit

Also these benchmarks are done on (B)F16 models. 27B at Q4 is not what you see in marketing material.

[-]

LegacyRemaster@reddit (OP)

SWE bench... All I need

[-]

Far-Low-4705@reddit

wtf happened to 3.6 on that one math benchmark???

[-]

crone66@reddit

Why no gpt 5.5 and opus 4.7?

[-]

somerussianbear@reddit

Would love these numbers to represent reality but we know that they don’t.

[-]

hwpoison@reddit

Someone knows if there is small model series of this version?

[-]

temperature_5@reddit

"It's better than Opus!"

But I admit I will be running it, at least until the next GLM Air comes out and surpasses it (please?)

[-]

ea_man@reddit

So how does the cost compare to DeepSeek 4?

[-]

ttkciar@reddit

Sir, this is LocalLLaMA.

[-]

laul_pogan@reddit

Worth flagging for when weights drop: Qwen3.5 text-only weights ship with multimodal lineage, so vLLM fails to load them unless you strip the model.language_model.* key prefix from the state dict and remove mrope_section from config.json. Not obvious from the error. Expect 3.7 to need the same treatment if they follow the same save path.

[-]

LegacyRemaster@reddit (OP)

less then 60k

[-]

iloveplexkr@reddit

king is dead?