6c5d1129@reddit
so its x10 the size and only slightly better
Expensive-Paint-9490@reddit
It's three times the size, not ten.
Sea-Speaker1700@reddit
27*10=???
Expensive-Paint-9490@reddit
DeepSeek-V4 is not FP16. It's 160 GB vs 55.6 GB.
It's both better and much faster.
Sea-Speaker1700@reddit
You get model size != model space on disk right...
Expensive-Paint-9490@reddit
Model size is literally its space on disk. Or you think that FP16 Qwen3.6-27B has the same size of its 4-bit quant?
Capaj@reddit
but it has 1 million context window
sammoga123@reddit
Qwen supports 1M context. Closed-source versions have this context window enabled. If you want it in open-source versions, you'll need to make changes, but good luck doing so.
Mochila-Mochila@reddit
Not nearly as practical to self-host that 1 million context with Qwen, AFAIK. Deepseek just made this plausible for DIYers.
-dysangel-@reddit
If it has faster prompt processing, I'd be switching even if it performed on par or even slightly worse on benchmarks. Qwen 3.6 already has great PP to performance ratio.
MomentJolly3535@reddit
you could take any other big model and compare it to the current 3.6 models and you would get very close benchmarks. don't trust them fully; always take benchmarks with a grain of salt.
rm-rf-rm@reddit
mountain of salt. FTFY
6c5d1129@reddit
yes i know. the real benchmark is collective experience after it's been out for a couple weeks
flavio_geo@reddit (OP)
I agree. It's not that benchmarks are useless; they are a starting point. What we can see is that DS4 seems to have decent extra knowledge compared to Qwen3.6.
In real usage, how the model behaves under quantization, KV cache quantization, etc., these things matter, and the benchmarks of the full-weight models may not be the best representation, so we have to test them out.
_VirtualCosmos_@reddit
The way the training of these models works is: first you train a big-ass model, which has a much bigger chance to learn fast; then you distill that big model into smaller versions of itself, usually gaining a lot of efficiency, but the distilling process is slow.
That is why modern SOTA models are always gigantic models (Claude Opus probably has trillions of parameters, which would explain the cost), meanwhile modern smaller models are always behind, but still end up surpassing the big-ass models from some time ago. The "old" DeepSeek R1 was 671B parameters, and the modern Qwen3.6 35B is so much better at everything benchmarks measure.
stddealer@reddit
With only 13B active, it doesn't surprise me to see it struggle at benchmarks that require reasoning when compared to the best dense 27B model. The knowledge is there though.
EstarriolOfTheEast@reddit
Look at the hardest reasoning problems: olympiad math problems, MIT math competition problems and Humanity's Last Exam. DS4-Flash's performance is far better than the 27B at those. So it's actually down to benchmark saturation, not reasoning.
The 27B keeps up best in saturated benchmarks in common languages like Python, Java, TypeScript and terminal environments like bash. If we included Haskell, Prolog or Clojure in SWE-Bench, I'd expect the 27B's performance to drop much more than DS4-Flash's.
DistanceSolar1449@reddit
Yeah, and this graph is TERRIBLE. This is horrendous graphic design. The damn graph tops out at 120%.
Deepseek gets ~95% at HMMT vs Qwen at 84%, which is triple the number of incorrect answers... but that large gap on top makes Deepseek look only half as good.
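To make that concrete, here's the error-rate arithmetic (scores approximated from the chart):

```python
# Same HMMT numbers viewed as error rates instead of accuracy
deepseek_acc, qwen_acc = 0.95, 0.84
ratio = (1 - qwen_acc) / (1 - deepseek_acc)
print(f"Qwen gives ~{ratio:.1f}x as many incorrect answers")  # ~3.2x
```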
Orolol@reddit
Yeah, people praise graphs that start at 0, but you really lose a lot of nuance when reading them.
alex20_202020@reddit
I have recently suggested adding an error-% chart (starting at 0).
anotherJohn12@reddit
Smaller models handle long context much worse than bigger ones. Everyone knows Claude Sonnet benchmarks comparably to Opus, but everyone uses Opus despite the insanely tight quota. Agentic workflows now eat context for breakfast.
Kolapsicle@reddit
It's likely significantly better. Qwen3.6-27B is a relatively small model; it has zero chance of generalizing to the extent of a model with 284B parameters. Those benchmarks test a thin slice of a model's capability, one that models like Qwen3.6-27B are trained to tackle, such as Python or JavaScript. (Well, all models really, but a 27B just can't retain much else.)
Septerium@reddit
It is always like this
MDSExpro@reddit
Not really. We need newer, better benchmarks, because current ones are basically flat for all recent models, despite widely different real user experience.
6c5d1129@reddit
i agree. everyone is >70% on swe bench verified now
cantgetthistowork@reddit
Qwen is benchmaxxed garbage. They only exist to beat benchmarks
Long_comment_san@reddit
not for gooning purposes sadly
dark-light92@reddit
"We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens."
This is the first line of the technical report. This is a preview version. The models are still in the oven.
I'm sure within a month or two, what happened in Qwen 3.5->3.6 will happen with these models.
Technical-Earth-3254@reddit
In benchmarks. We should try it in real-world usage before deciding what's better. The 35B is also imo (and in my own testing) not even close to the 27B, yet they are close to each other in most benchmarks.
SmartCustard9944@reddit
That seems to be the wrong conclusion. Slightly better on this small benchmark set. Intelligence is how well it can generalize. I seriously doubt that a 200+B model is barely better than a 27B.
Why not compare knowledge correctness, for example? There is no way that a small model correctly knows as much as a model 10x its size.
AlbeHxT9@reddit
No criticism intended toward the models or your argument, just a few considerations that might be helpful to some of us:
- The v4 Flash equivalent would be roughly a 60B dense model, based on the formula dense ≈ sqrt(total × active); see the worked example after this list.
- If the Qwen models included benchmark data during post-training, that could have influenced the results.
- A jump from 80 to 85 on a benchmark is much more significant than going from 30 to 40.
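For anyone who wants to verify that first bullet, a quick sketch of the arithmetic (the sqrt rule is a community heuristic, not an official formula):

```python
import math

# Rule-of-thumb dense equivalent of an MoE: dense ≈ sqrt(total × active)
total_b, active_b = 284, 13  # DeepSeek-V4-Flash, in billions
print(math.sqrt(total_b * active_b))  # ≈ 60.8 → roughly a 60B dense model
```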
logTom@reddit
+ it has 1 million context length
LinkSea8324@reddit
284B vs 27B btw
ToInfinityAndAbove@reddit
deepseek v4 way cheaper tho!
26YrVirgin@reddit
And the 27B supports image input
dampflokfreund@reddit
And the 27B can be actually run locally.
overand@reddit
284B w/13B active is probably usable-ish locally; it's probably 120 gigs or so at a 2 bit quant - which might or might not perform decently. Hard to know until it's in our hands as GGUFs with supported tools for inference.
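For reference, a quick size estimate; the bits-per-weight figures are approximate, and real "2-bit" GGUF files land closer to ~3.35 effective bpw because some tensors stay at higher precision:

```python
params_b = 284  # total parameters, in billions
for label, bpw in [("true 2-bit", 2.0), ("typical Q2_K (~3.35 bpw)", 3.35), ("4-bit", 4.0)]:
    print(f"{label}: ~{params_b * bpw / 8:.0f} GB")
# true 2-bit: ~71 GB, typical Q2_K: ~119 GB, 4-bit: ~142 GB
```

That puts the "120 gigs or so" estimate right in line with a typical 2-bit GGUF.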
overand@reddit
But it's 13B Active vs 27B active.
twack3r@reddit
Sparse vs dense but still.
Rascazzione@reddit
1M token context my friend... 1M token context!! Let's see other benchmarks like omniscience
Porespellar@reddit
But no vision tho 😔
Middle_Bullfrog_6173@reddit
Pretty bad non-hallucination rates on AA omniscience. Qwen 3.6 27B 52% vs Deepseek 4 Flash 4%.
Reverse situation in accuracy due to size difference. 37% vs 19%.
TomLucidor@reddit
Vectera benchmark pls
alex20_202020@reddit
? DS only 4%, whereas on https://artificialanalysis.ai/evaluations/omniscience Grok 4.20 0309 v2 (Reasoning) scores the lowest on AA-Omniscience Hallucination Rate with a score of 17%, followed by Grok 4.20 0309 (Reasoning) with a score of 22%, and MiMo-V2.5-Pro with a score of 25%
sammoga123@reddit
Qwen 4 will most likely have 1M of base context.
Even Qwen 3 supports 1M, but you have to do something to activate it; it's not enabled by default, although it does have 1M in the "closed source" versions on Alibaba.
flavio_geo@reddit (OP)
Given the 'lost in the middle' effect and overall quality degradation with long context, even when using Gemini Pro or Opus with 1M context I don't dare go beyond 250k; I try to phase the tasks in small enough steps for that.
I also personally use Qwen3.5 (now 3.6) 27B as a local LLM in my everyday workflow (non-coding), and I keep 2 instances in parallel with 100k context, because I don't feel comfortable trusting it with long context. As far as I understand, quantized models, especially with quantized KV cache, degrade faster as context grows.
So it is nice that it has a '1M context window', probably useful for some specific tasks, but be careful.
rm-rf-rm@reddit
250k? I don't even go past 100k with Opus
BestGirlAhagonUmiko@reddit
Their stealthily updated web version (as early as in February 2026) seemed to do surprisingly well at ~300K context. I gave it a book to summarize - it performed well, no errors. Whether it was Pro, Flash or something entirely different - no idea, but processing / generation speed was fast and I was happy to see them finally moving on from 64/128K.
7734128@reddit
I think that I value multimodality and 250k over 1M. Being able to input images is a nice feature, even for coding and such.
Sticking_to_Decaf@reddit
You can do 1M token context on Qwen3.6-27B with RoPE scaling. I think it's even in their official recipes.
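Qwen's published long-context recipes for earlier releases use YaRN rope scaling; a minimal sketch of that pattern (the hub id, factor, and native context here are illustrative, so check the model card for the actual recommended values):

```python
from transformers import AutoModelForCausalLM

# Illustrative YaRN rope-scaling override, following the pattern from
# Qwen's earlier long-context recipes; factor ≈ target_ctx / native_ctx.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-27B",  # hypothetical hub id
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,                              # e.g. 32k -> 131k
        "original_max_position_embeddings": 32768,  # illustrative native ctx
    },
)
```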
pseudonerv@reddit
So now we should see 122b qwen 3.6. Right? Right?
FullOf_Bad_Ideas@reddit
maybe 397B too
craterIII@reddit
PLEASE
__JockY__@reddit
I hope so.
Storge2@reddit
I hope so. That one would be perfect for the DGX Spark, as DeepSeek V4 Flash doesn't fit on a single Spark...
Ishkabibble87@reddit
I've been thinking about this: with expert offloading and quantization, I think it's definitely possible.
Antropog@reddit
As there is a REAP version of M2.7 on Hugging Face, the same could be done for DeepSeek Flash. REAP is a method to prune less valuable experts. With it, M2.7 fits on one DGX Spark.
inevitabledeath3@reddit
Expert offloading to where? The DGX Spark uses unified memory; the GPU and CPU share the same pool. So CPU offloading wouldn't help at all, it would just slow things down.
Ishkabibble87@reddit
By expert offloading I mean still leaving it on disk. A "hot" expert is loaded into unified memory; a "cold" expert stays on disk. I'm on Strix Halo, which uses a similar unified-memory setup. You can do this in llama.cpp now with memory mapping (mmap). The idea is that if you're doing coding, you're likely only using a subset of the MoE experts. If you suddenly start talking about 18th-century poetry, it'll have to load them from cold storage, so you'll see a drop in tps.
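A minimal sketch of the same setup via the llama-cpp-python bindings (the GGUF filename is hypothetical):

```python
from llama_cpp import Llama  # llama-cpp-python bindings

# With use_mmap=True (the default) the GGUF is memory-mapped, so expert
# tensors are paged in from disk on first use and cold experts never
# occupy RAM; use_mlock=False leaves the OS free to evict them again.
llm = Llama(
    model_path="deepseek-v4-flash-Q2_K.gguf",  # hypothetical filename
    n_gpu_layers=-1,   # offload everything that fits to GPU/unified memory
    use_mmap=True,
    use_mlock=False,
)
```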
NNN_Throwaway2@reddit
Zero confirmation there will be a 122b release of Qwen3.6.
sammoga123@reddit
They conducted a survey a while ago and that model was listed, but it didn't win.
xrvz@reddit
https://x.com/ChujieZheng/status/2039909917323383036
PinkySwearNotABot@reddit
I may need to see a doctor for this 124 hour erection i've had all week.
Long_comment_san@reddit
We're going back to dense models as soon as we get affordable 48 gigs of VRAM. There's absolutely no reason to use tremendous amounts of RAM in the 1-terabyte range when a dense model in the 70B class will have absolutely amazing knowledge based on modern tech. People seem to forget that Llama 3.3 70B was announced in December of 2024, and it's been almost 1.5 years since then.
Mochila-Mochila@reddit
"There's absolutely no reason to use tremendous amounts of RAM in the 1 terabyte range"
I totally want 1TB. It's nice having virtually no restrictions, being able to play with the biggest models and not having to care about available RAM for context. And that's just for text generation.
For video generation, once the algorithms improve, it's the amount of RAM that will essentially limit the duration of one-shot clips.
Hopefully by 2030, 1TB/1TB APUs will be relatively affordable.
Long_comment_san@reddit
I would also like to point out that 3GB GDDR7 chips are produced very comfortably in terms of yields, and the next generation of density (which should be 4GB per chip) absolutely must be just around the corner. The reason is that everything is centered on the HBM market, and flipping to 4GB GDDR7 chips makes a lot of sense: supply the same amount of VRAM using fewer chips and increase margins.
48GB is just 12 chips, the same count used on cards like the 3090 ages ago. It's also much easier for the GPU manufacturer because it's 12 slots and not 16 or 24, so they are also thrilled to get these chips to improve their own margins.
Long_comment_san@reddit
We're going to get these 48GB cards, so I think we're looking at the zenith of MoE models. Those super-large MoE models have a tremendous downside of being "a final product": it's impossible to fine-tune them on your data, so it's just "take it or leave it", while even the uncrowned king of dense, the 405B dense Llama, has finetunes.
flavio_geo@reddit (OP)
Let's hope they also scale the memory bandwidth, because if you offload 40GB of dense weights with 1 TB/s of memory bandwidth, you will get some very low TG (~25 t/s probably).
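The rough math behind that estimate, assuming decode is fully memory-bandwidth-bound:

```python
# Ballpark decode speed when the weights must be read from memory each token:
# t/s ≈ memory_bandwidth / bytes_read_per_token
bandwidth_gb_s = 1000   # 1 TB/s
weights_gb = 40         # dense weights touched per token
print(bandwidth_gb_s / weights_gb)  # ≈ 25 t/s upper bound
```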
Caffdy@reddit
the 5090 already has 1.79TB/s; next gen will surely go above 2TB/s
Long_comment_san@reddit
Tokens are a wobbly way to judge performance. Would you really say it is slow if 1) you can use quants and 2) these tokens are brilliant and 30% more efficient than competitor models? It's all relative.
But you are spot on about the bandwidth limitation. That's going to be the prime issue, because we're gonna get VRAM capacity by decreasing the number of physical chips, which decreases bandwidth. I pray Micron can cook whatever they're cooking in the GDDR7+HBM hybrid tech.
EstarriolOfTheEast@reddit
As a model increases in size, the fixed capacity of the residual stream can no longer properly incorporate all the information added via later (compositions of attentional) layers. You can make the stream/carrier vector wider, but this has a large cost in compute (quadratic).
The issue is that dense networks don't scale as well to larger sizes as MoEs. For MoEs, the routing helps reduce noise in the residual stream, and the conditional computation allows more complex operations within their limited compute budget during inference. The larger the total params, the more complexity packable per layer. On top of that, MoEs are more efficient to train within a given compute budget. This is why large dense models are rarer.
Affectionate-Cap-600@reddit
maybe the new "hyper residual" introduced in deepseek v4 is working in that direction?
Long_comment_san@reddit
Well, who said we need 405B dense models? As you can see, the Qwen 3.6 dense that just came out packs a big punch. If you double that up to the 50B class, I'd say that's your daily driver... with modern tech. And as you know, there's the question of datasets: we absolutely balloon those MoE models on synthetic data, but is there a purpose or an end to this? It's like a black hole of synthetic intelligence; it's going to collapse eventually into a dense singularity. At this point we are just flexing in size: making an MoE go to 3T parameters won't make it 50% smarter, but internally you can make it smarter by that much while keeping the "density" at the same level.
EstarriolOfTheEast@reddit
Yes, the 3.6 27B is incredible compared to what came before in that size range, but it's worth noting that the benchmarks make it look a lot closer to DS4-Flash than it actually is, because fewer-param models generalize worse outside what they were post-trained for.
The reason to scale to 3T is not to flex but because scaling total param count is still the best known way to increase model raw intelligence. With MoEs, we can scale effective capacity without paying such a hefty cost on active compute. And recall that due to how effectively attention compounds information, the residual stream is overwhelmed and stops being able to properly add new information at deeper layers.
At larger sizes, MoEs also have some advantages: their residual stream is much less noisy, the conditional computation effectively enables higher per layer pattern compression thanks to specialization (think of a kind of a more domain specific efficient encoding because lots of unrelated features do not need to share the same representational subspace).
Final and most important is not to think of them as just say, 30B of intelligence but as 30B worth of specialized compute dynamically composed per token, combinatorially selected from a (likely redundant but still, combinatorial) library of functions. In fact, over a large sequence, an MoE traces out a trajectory that covers far more than just 30B worth of activated parameters and implicitly composes over the many specialized functions across steps (think of dynamically assembling a program chopped up into 30B sized modules and compute per step).
cchuter@reddit
Can anyone confirm these qwen terminal bench numbers? I don’t see anything official from terminal bench and in my testing I barely get it past 30% (which is excellent for a tiny model). Is Qwen fudging the benchmarks? Benchmaxxing to the max?!
flavio_geo@reddit (OP)
https://huggingface.co/Qwen/Qwen3.6-27B
cchuter@reddit
Thanks, I checked their data and talked to terminal bench (the hugging face readme has now been updated).
Those are indeed unofficial numbers, and it appears they fudged the timeout to get that completion percentage (as I bet a lot of other models are doing as well).
So officially, Qwen cannot achieve that Terminal Bench number, or they haven't submitted a run that satisfies the official rules yet.
flavio_geo@reddit (OP)
Good finding
spencer_kw@reddit
running both on a 3090. ds4-flash at q4_K_M is noticeably faster for code gen, like 40-50% more tok/s in my setup. quality is close enough that speed becomes the tiebreaker for interactive stuff
qwen 3.6 handles structured output better though. if you need reliable json or function calling qwen wins that pretty clearly. basically flash for speed, qwen for precision
Opening-Broccoli9190@reddit
My reading: DS4-Flash requires 10x the RAM (302GB) of the latest Qwen 3.6 27B and 35B-MoE (32GB) to run at FP8, while improving coding benchmark scores by 5% and general-knowledge scores by 10%.
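Back-of-envelope check on those numbers (FP8 ≈ 1 byte per parameter; KV cache and runtime overhead excluded):

```python
flash_gb = 284  # 284B params × ~1 byte each at FP8
qwen_gb = 32    # 27B/35B class at FP8, per the reported figure
print(flash_gb / qwen_gb)  # ≈ 8.9x; overhead pushes Flash toward the reported 302 GB
```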
Eyelbee@reddit
Intelligence is roughly equal but deepseek has more knowledge.
EstarriolOfTheEast@reddit
And remember knowledge is not just a list of facts but how to solve problems. Like knowing the details of information theory better. Knowing the details of ECS, how to solve differential equations, problem solving tricks, biology, bioinformatics algorithms. Those are all knowledge.
HongPong@reddit
it's impressive that these various local LLMs somehow know about FLECS ECS, which I do tend to ask about. no idea how they pack so much into the models
EstarriolOfTheEast@reddit
It is amazing how good accessible models are these days, although about 16GB worth is quite a lot of information capacity. But the issue arises when you need to dive deeper. So, the model can have surface-level (or even slightly deeper) knowledge of how to write Unity jobs, but it'll be nowhere near as proficient at knowing all the collections and which to use when writing a highly optimized pathfinding solution or spatial hash grid.
mvaranka@reddit
DeepSeek should also be faster than 3.6-27B due to the smaller number of active parameters and the FP4 MoE layers.
Eagerly waiting to see how DeepSeek's speed is affected as context grows.
Models with traditional attention calculation slow down tremendously as context grows. For example, MiniMax 2.5 running fully on dual RTX Pro started with >100 tok/s generation, but at 100k context the speed was 30-50 tok/s. Qwen3.5-397B, partially offloaded to CPU, stayed at 40 tok/s thanks to a more advanced attention implementation.
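That slowdown tracks the KV cache, which standard attention has to read in full for every generated token; a rough sizing sketch with illustrative model dimensions (not the actual configs of the models above):

```python
def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):  # fp16 cache
    # K and V each store layers × kv_heads × head_dim values per token
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

print(kv_cache_gb(64, 8, 128, 100_000))  # ≈ 26 GB read per token at 100k ctx
```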
ecompanda@reddit
been running qwen 3.6 35b with llama.cpp mmap and the expert loading from disk pattern is real but manageable. coding tasks barely hit cold storage, but switch topics mid conversation and you feel the pause immediately. asking it something unrelated to what it was just doing triggers a noticeable stall while new experts load. if a 122b version drops i'm genuinely curious whether the expert distribution is different enough that it stays warm on more topic switches.
RegularRecipe6175@reddit
Now GGUF wen
moonrust-app@reddit
That MoE Qwen punches way above its weight considering how cheap it is to run.
chillinewman@reddit
So much RAM that i don't have.
sammoga123@reddit
Also, Qwen has been multimodal since version 3.5. DeepSeek V4 (any version) remains text-only.
AtheistSage@reddit
Obviously if you're running this locally, Qwen is way more efficient with the lower parameters, but the Deepseek API prices are substantially lower
sine120@reddit
The delta in LiveCodeBench vs SWE Bench makes me think that 3.6 is likely a bit benchmaxxed. It's still excellent and by far the best in its size class, but I'm curious how the two would feel. I can't run any DS models locally, so I might have to play with it on openrouter and compare.
breadfruitcore@reddit
this sub hates benchmaxxing except when it's for models (brands) they like
sabotage3d@reddit
On coding agent benchmarks, they are neck and neck, which is funny considering their size difference.
Single_Ring4886@reddit
The classical benchmarks are saturated... a new kind of benchmark is needed...
2Norn@reddit
v4 flash is 284b-a13b btw
Comfortable-Rock-498@reddit
Terminal Bench 2.0 is likely not an apples-to-apples comparison if DeepSeek ran it according to the tbench guidelines. I know the Qwen models run with an increased timeout (3h) and a modified hardware config that the benchmark disallows. This is why you see those numbers reported in the model card but not on the official leaderboard.
cmitsakis@reddit
I just did some quick testing using the API on my own benchmark that tests LLMs as customer support chatbots, and found out that deepseek-v4-flash (scored 90.2%) was better than qwen3.5-27b (89%) and qwen3.5-35b-a3b (89.1%) and roughly equal to gemini-3-flash-preview (90.5%), but deepseek-v4-flash had the lowest cost of all of them by far.
Have you noticed the deepseek-v4-pro performing worse than flash? I found it surprising and I'm wondering if there is a bug on my software. It performed even worse than qwen3.5-27b.
Iory1998@reddit
This is why I believe that if Alibaba trained a 50-70B dense model, it would create a true beast. The 27B beats Gemma (31B) in what I do.
madsheepPL@reddit
To the 'it's only a bit better than qwen 27b' crowd: in practice those benchmarks are not linear even if they look like it. Going from a 30 to a 50 score is not the same as going from 50 to 70.
Let's wait for actual IRL users' opinions, and enjoy this glorious month.
VEHICOULE@reddit
Why does no one mention that this is still a preview? Wait for 4.1 or whatever and we will see again.
jacek2023@reddit
You should also compare the price of a local setup for both models.
Leflakk@reddit
Thanks, was waiting for this kind of post (too lazy to do it myself^^)
flavio_geo@reddit (OP)
This is a quick graph generated by ChatGPT comparing them across the same reported benchmarks.