Forgive my ignorance but how is a 27B model better than 397B?
Posted by No_Conversation9561@reddit | LocalLLaMA | 230 comments
Is Qwen just incredibly good at doing dense and not so good at doing MoE?
I get that dense is generally better than MoE but 27B being better than 397B just doesn’t sit right with me.
What are those additional experts even doing then?
createthiscom@reddit
It isn't. There's always a shit ton of false marketing around qwen releases. Qwen 3.6-27B gets 66% on the Aider Polyglot. Qwen 3.5-397B-A17B gets 86.2%. The larger model is objectively superior at coding.
NNN_Throwaway2@reddit
The 397b has way more world knowledge and way better logical coherence over long context on complex tasks. Current benchmarks do not really capture these areas of performance.
RedParaglider@reddit
Everything is pushing these systems away from creativity to coding
Antique_Savings7249@reddit
It's handy, until you step out of the sandboxing bubble and realize that coding is to a considerable degree about domain / world knowledge.
One example is the soon-to-be gold standard C64 emulator test. You can easily run it in your local LLM right now: how accurately and reliably can your model reproduce the 16 colors of the C64? Or make a booting C64 emu, for that matter?
The same goes for any domain specificity. Say a guitar tuner. I've made several guitar tuners that use my webcam mic to do frequency analysis. But very few of them work well. Many of them fall short on the seemingly simple task of determining the dominant frequency.
This can of course be solved in an agentic setup by a coding LLM asking a world knowledge / wikipedia LLM for the colors, as soon as it realizes that it doesn't have the answers. I've experimented with this as well, but as a general rule, if a Wiki MCP agent is offered, the main agent always asks the Wiki agent for everything, filling up the token window with big nonspecific articles. Then you get into the problem of the LLM pulling the right numbers out of that mass and retaining them.
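A harness for the C64 palette test mentioned above is trivial to write, which is part of what makes it a handy spot check. A minimal sketch: the 16 color names follow common C64 references, but a stricter version would also compare RGB values, which vary between palette standards; the scoring function and its naive substring matching are my own illustration, not anyone's actual test:

```python
# Hypothetical check for the "C64 palette" world-knowledge test.
# Color names per common C64 documentation; RGB values are deliberately
# omitted since they differ between palette standards.
C64_COLORS = {
    "black", "white", "red", "cyan", "purple", "green", "blue", "yellow",
    "orange", "brown", "light red", "dark grey", "grey",
    "light green", "light blue", "light grey",
}

def score_palette_answer(model_output: str) -> float:
    """Fraction of the 16 canonical color names found in a model's answer.
    Naive substring matching: "dark grey" also counts as "grey", so treat
    the score as a rough signal, not an exact grade."""
    text = model_output.lower()
    hits = {c for c in C64_COLORS if c in text}
    return len(hits) / len(C64_COLORS)
```

A model that lists the full palette scores 1.0; one that hallucinates its way through scores noticeably lower.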
xrvz@reddit
Most of us don't give a fuck about C64. If stripping out knowledge about it makes the model smaller and thus faster or even runnable at all, I'm all for it. They can put all the nonsense in the big models that don't even run on Strix Halo.
SciPiTie@reddit
This is just an example. All domain knowledge which is non-syntactical is affected. No matter what you want to develop, using world-knowledge is a key requirement for software design. That's what OP described with "use a world agent to answer that question".
The C64 one is just a really easy one because it's widely documented, easily testable and reproducible. You could also say "heart monitor tracker", "palette picker", "website that shows puppies", etc. - all this knowledge is removed, not just the bits you individually deemed "nonsense".
hugthemachines@reddit
On the other hand, if a model really went all in on gathering as much knowledge as possible, it would be quite big. ChatGPT, for example, does not comprehend dice and the math related to them, even though the model is large. So, as with many other things, we make the trades we need and the trades we can endure in order to get good enough performance.
SciPiTie@reddit
Although I agree with your conclusion, your reasoning has a wrong assumption (see below):
You're right that it's a tradeoff discussion - that's why "smaller is better" is wrong, just like "bigger is better" - find the right tool for the job and optimize, add more tools as needed.
That said: LLMs don't "comprehend" or "gather knowledge" - they are graph functions, relating language tokens to one another and predicting based on probability. That's an important distinction, because once you internalize it, handling these tools becomes way easier ("why doesn't it know X" becomes an irrelevant question, just like "why can't it count?").
Colecoman1982@reddit
All I heard in my head when reading this was "If /dev/null is fast, and web-scale, I will use it."...
roselan@reddit
You misunderstood the core idea.
At equal size and amount of training, multilingual models perform better than mono language ones, even in the specific language of the mono language one.
There's inherent value in knowledge diversity / world knowledge.
CorpusculantCortex@reddit
You completely missed the point there. C64 is a metaphor for literally anything that requires domain knowledge, which is everything useful if you want an agentic system to work relatively autonomously.
Antique_Savings7249@reddit
It was just an example of domain specific knowledge.
StaysAwakeAllWeek@reddit
What these models are really useful for is doing busywork for frontier models to save compute costs.
Having a subagent that costs almost nothing but produces reliable output matching the spec it was given is absolutely worthwhile
DistanceSolar1449@reddit
To be fair, figuring out the dominant frequency of a single note can be really hard. Even if you just say “FFT the data”.
A lot of music stuff is hilariously deceptively difficult. Try writing an app that detects the BPM of a song. It was damn near impossible 10 years ago before AI.
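For reference, the naive "just FFT the data" approach really is only a few lines, which is exactly why its failure on real instruments surprises people. A sketch with a synthetic signal (numpy assumed; the function is my own illustration of the naive method, not a working tuner):

```python
import numpy as np

def dominant_frequency(samples: np.ndarray, sample_rate: int) -> float:
    """Naive dominant-frequency estimate: the peak-magnitude FFT bin.
    Works on a clean sine; on a real guitar note the 2nd or 3rd harmonic
    often carries more energy than the fundamental, so this picks the
    wrong note -- the failure mode described above."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# Clean 110 Hz sine (open A string): the naive method is fine here.
sr = 44100
t = np.arange(sr) / sr  # one second of samples
clean = np.sin(2 * np.pi * 110 * t)
print(dominant_frequency(clean, sr))  # 110.0
```

Add a louder second harmonic (as real strings often have) and the same function confidently reports an octave up, which is why robust tuners use autocorrelation or harmonic product spectrum instead.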
stoppableDissolution@reddit
Not just coding, but short-context one-shot coding. I.e., a useless but hype-heavy area.
RedParaglider@reddit
Yea, that's pretty true. That's one reason I hated gemini. It was good at one-shotting little games, but drop it into a real repo and it would constantly go nuts and just start wrecking everything.
LazyLucretia@reddit
Good. Call me old fashioned, but I HATE AI generated "art". There are so many technical tasks where this tech is useful, why spend all these resources to replace things that people actually appreciate?
the_TIGEEER@reddit
Use the right tool for the right job! I wish they named it differently tho..
Nasser1020G@reddit
Tiny difference
NNN_Throwaway2@reddit
Like I said, the current benchmarks don't really capture the differences. World knowledge in particular is hard to quantify because by definition the scope is huge and poorly defined.
West-Currency-4423@reddit
If you can't measure it, how do you know there is a difference?
ikkiyikki@reddit
I always test new models by asking it to recite some Shakespeare. Like recite the opening soliloquy of Richard III. Small models usually start hallucinating after a few lines or go into endless loops.
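That spot check can be made a bit less subjective by scoring the recitation against the actual text. A minimal sketch using stdlib difflib, with the first lines of the Richard III soliloquy as ground truth (the scoring function is my own illustration):

```python
from difflib import SequenceMatcher

# Opening lines of Richard III's soliloquy, used as ground truth.
REFERENCE = (
    "Now is the winter of our discontent "
    "Made glorious summer by this sun of York"
)

def recitation_score(model_output: str) -> float:
    """Similarity in [0, 1] between a model's recitation and the text.
    A model that hallucinates after a few words scores much lower than
    one that reproduces the passage verbatim."""
    return SequenceMatcher(None, REFERENCE.lower(), model_output.lower()).ratio()
```

Extend REFERENCE with more of the soliloquy to catch the "starts fine, then loops or drifts" failure mode small models show.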
erazortt@reddit
by using it
NNN_Throwaway2@reddit
Current benchmarks don't measure ≠ can't measure.
Zeeplankton@reddit
I use small models and they are good, but I feel like they are at their limit. It feels like you can benchmax them on code and tool calling forever, but you cannot squeeze knowledge in.
It makes me so annoyed how people say we have local Opus. We absolutely do not and probably never will, until we can magically cram more info inside.
superdariom@reddit
This in some ways is true but can't a tool caller look up what it doesn't know
Eisenstein@reddit
How do you look up what you don't know? You have to at least know what to look up. A google search isn't that useful for subjects one has no familiarity with. As a human you have all sorts of knowledge that provides background for things which a model doesn't have. Also, a lot of things are needed which are not obvious to look up and ride along in the background. You notice this whenever someone familiar with a technical subject on a shallow level makes a statement about it which doesn't take into account important things that they have no idea about.
Zeeplankton@reddit
Yeah, hypothetically. But I think the core issue is you also need base knowledge to be able to reason with new knowledge. Like, small models suck at HLE. A lot of coding tasks require a model to reason through its knowledge, and searching wouldn't always surface an answer. And it adds overhead.
Writer_IT@reddit
We've had exponential growth bordering on science fiction in small-model performance even in the last 6 months. There are no signs that small models are actually at their limits, unless open-source labs actually stop releasing them
Zeeplankton@reddit
I'm not a researcher but it feels like we're just refining training, not fitting more information in. But I could be totally wrong. The knowledge part seems critical for weighing outcomes and considering problems A la what we do.
Eisenstein@reddit
Refining training and pruning datasets are how you get better models. Architecture is how you get speed and long context and vision ability, but the progress in outputs comes from datasets and training.
BubrivKo@reddit
Yeah... They even argued with me that there was no difference between the models, and that people just don't know how to "prompt properly", and that's why the small models didn't do as good a job for them...
I tried to explain to them that no matter how optimal a model is, you can't expect the same quality between one with 30B parameters and one with 700B...
PromptInjection_@reddit
Correct. Benchmarks are one thing, reality is another (and sometimes they can overlap)
ElementNumber6@reddit
Adding to that:
For these reasons, SOTA general intelligence is unlikely to be prioritized for public release, but will instead be provided to companies who can provide it to wealthy partners, and trickle its usage from behind service gates.
Both_Opportunity5327@reddit
This has not been true in the past. I have models that are SOTA on my hard drive.
I can't run them... but..
traveddit@reddit
It's so easy to see how much stronger the 397B is when you compare how they navigate through multi-turn tool calling on a noncoding task. 397B is also much better at prompting subagents or any type of search related task on top of the analysis from that retrieval.
Iory1998@reddit
This is it!
chocofoxy@reddit
because the small new ones are trained on new better data ( for what consumers need like coding and agentic tooling ) but they lack knowledge in other domains
JaredsBored@reddit
Benchmarks aren't always representative of reality or your usecase. Q3.6 35B benchmarks better than Q3.5 122B. I reran some things I'd done using 122B on Q3.6 35B, and it wasn't as good (but clearly a big step up from the 3.5 version).
happytobehereatall@reddit
My hunch is that the Qwen team is designing with benchmarks in mind, so their models aren't as strong and as well-rounded as they'd have us think.
How likely do you think this is?
toughcentaur9018@reddit
My key use case is vision related (has to do with scene analysis and stuff) and I’ve stuck with the 3.5 35BA3B since it offers the best performance v speed tradeoff at the moment. Is the 3.6 an improvement?
JaredsBored@reddit
3.6 35B is noticeably better for me than 3.5 35B. I'm mostly using it for doc review in openwebui with some opencode here and there, but I'd definitely still try 3.6 if I were you. I didn't find 3.5 35B as magical as everyone else did, but the 3.6 improvement was noticeable imo.
toughcentaur9018@reddit
Okay, I’ll give it a try
No_Mango7658@reddit
What are your purposes, out of curiosity? I never gave 122b a good evaluation. I've been happy with 3.6 35b in opencode for firmware development in C++
YoungSuccessful1052@reddit
Because most people who do the benchmarking only care about coding and tool calling and completely ignore other places where the large MoE models dominate the small dense models.
PrysmX@reddit
Older models have more knowledge, but a lot of that knowledge has less value especially for local models. For example, I don't need a local model to be able to give me 5 pages of info on a particular city, but I do need a local model to be able to do tons of tool calls without getting stuck in loops. Newer models seem to be trimming extraneous knowledge and improving the ability to perform agentic actions. This is the right way to go because you can augment knowledge via MCPs and gain a lot of performance at the same time.
YoungSuccessful1052@reddit
That is just you tho. Not everyone wants to solely use llms for coding. For my use case the 27B is practically useless and the larger MoE-s do infinitely better. Not only the 397B and 122B but even GLM 4.6V and Llama 4 Scout are better.
For example: analyzing a CT scan result written in Hungarian. Small dense models are useless for that and their Hungarian is atrocious. The large MoE models however do a nearly perfect job.
Legitimate-Pumpkin@reddit
Interesting.
And promising.
PretendPop4647@reddit
Another thing:
3.6 vs 3.5 is a full training cycle: better data, better post-training (RLHF/RLVR on code), better recipes.
jacek2023@reddit
In 2023, people were saying that the only way to make models smarter was to add more parameters. They combined 70B models into 140B ones or something like that, talked about how awesome it was, and said they couldn’t go back to anything smaller.
At the time, I was saying that in the future a 7B model could be smarter than an old 70B model. Neural networks are just a way of searching for algorithms, and this field keeps progressing. Every year it becomes possible to find a better algorithm, and that algorithm can use a smaller number of parameters.
So it’s not just about Dense vs. MoE. It’s also about progress.
yeah-ok@reddit
I think "So it’s not just about Dense vs. MoE. It’s also about progress." is backed up greatly by the fact that the new Qwen3.6-35b-moe model is almost on par with the Qwen3.5-27b-dense, which was claimed to be mega far ahead of the Qwen3.5-35b-moe due to its denseness. That denseness has now almost been caught up with via MoE improvements, soooo. Yeah.
Mickenfox@reddit
This is an important point. Everyone treats models like it's a simple formula of throwing training data + parameters + training into the pot, and getting an output, and they are just optimizing the process.
But neural networks are closer to software. In principle they can implement any algorithm, if you give them the right shape, and we are basically currently just banging them with a hammer until they work.
Meanwhile we are still finding new ways to optimize Nintendo 64 games. People are going to be finding better models for decades or centuries.
SimultaneousPing@reddit
hear me out
1T dense model
Potential-Gold5298@reddit
Nuclear power plant is included in the kit.
Temporary-Sector-947@reddit
Meta's Behemoth won't be in production. They are the last ones who tried.
KaMaFour@reddit
I felt a great disturbance in the Force, as if millions of memory chips suddenly cried out in terror and were suddenly silenced
Ok_Tank_8971@reddit
Isn't that exactly https://prismml.com/news/bonsai-8b the 1bit llm ?
Dany0@reddit
are you being dense on purpose
tmvr@reddit
But enough talking about yo mamma! :D
Caffdy@reddit
inb4: 1T Active parameters in a MoE
MoneyPowerNexis@reddit
1000T 100TA running inside of meat using 20 watts
FullOf_Bad_Ideas@reddit
Llama 405B was barely picked up. I have not seen anyone here sharing their performance from running it. I've run it (Hermes 4 405B) locally and it's slow, especially PP: 90 t/s PP and 11 t/s TG. Hardly usable beyond chat.
1T dense would require very high bandwidth just to get decent-ish output speed.
Zeeplankton@reddit
I don't know, it feels like there is a limit. Like we can only squeeze so much information into the bits, unless we're still super far from any limit.
a_beautiful_rhind@reddit
There is some truth to replaying the middle layers for better performance.
dbenc@reddit
but what if they also make the 70b model dense 🫣
Kran6a@reddit
Why is my brain smarter than a sperm whale brain if the sperm whale brain weighs 9kg?
Size does not matter, what matters is that relationships between tokens are right. You can reduce the number of relationships between tokens and get a smarter model. In fact, it is somewhat expected as you are removing relationships that lead to hallucinations or extremely unlikely scenarios.
This is usually done by using higher quality datasets during training, but if you overfit the model too much, for example by training it mostly on benchmark datasets, it can score 95%+ on every benchmark but hallucinate on everything else, leading to an unusable model. This can be seen in some new (like, past 3 months) Chinese models that rank high in benchmarks but feel inferior when you compare them to other models with a lower score. There is a sweet spot where the model can generalize enough without hallucinating too much.
I believe the future of LLMs will be specialized dense low-params models that are trained on a dataset composed of math and computer science knowledge, reasoning chains over that knowledge, code samples for the language you want it to write and debugging reasoning chains for that programming language.
You may get a model that talks like an idiot but writes good code and can run on peasants' hardware.
Agreeable_Effect938@reddit
How do you know you're smarter than a sperm whale though?
OnkelBB@reddit
Whales haven't captured the whole planet for themselves and designed AI.
We have.
Luke_Bavarious@reddit
"Looks at how whales interact with their environment, looks at how humans do"... You know what? You might be on to something here.
Agreeable_Effect938@reddit
Jokes aside, I read an article about sperm whales on this, and basically, AI analysis of their speech showed that their language is as complex as ours, and most interestingly, whales speak different dialects in different zones. So whales have at least a rudimentary culture (they pass on some of their linguistic knowledge to each other, rather than acquiring it innately).
But! Scientists studied whales in isolated areas. We haven't studied actual whale aggregations (they have large groups of 10k+ whales). We know that the center of human culture has always been in densely populated cities. Basically, scientists are now studying aborigines of the whale world, not their actual civilizations. If they truly have a culture, it should be more developed in these centers, and that would be a good test, if the whales there have more diverse language, they truly have a developed culture.
By the way, sperm whale brain is so large, we don't know how many neurons it has. Spindle neurons, which are responsible for intuition, love, and social IQ, were developed in sperm whales 30 million years before humans.
Related killer whales/pilot whales have 40 billion neurons in the cerebral cortex (a few times more than humans), and sperm whales likely have more
In any case, saying we're simply smarter is a bit of a stretch. Intelligence is difficult to test, and especially to compare. For example, elephants have much better memory than humans; they can remember little details about you 40 years later. Memory is an aspect of intelligence, and elephants are definitely "smarter" in this aspect. Which makes sense, given their large brain.
Our brains are more efficient (although birds are likely even better in efficiency), we have glia cells that help with learning, and all that stuff. But the number of neurons is still a factor. It's like an old 220b Llama (sperm whales) vs 27b Qwen w/ tool use (humans). Llama is probably still better in some raw aspects
ElementNumber6@reddit
Because the brain said it was
cromagnone@reddit
And why is there squid on my face?
No_Mango7658@reddit
https://i.redd.it/e8j9niw8cuwg1.gif
westsunset@reddit
I heard an interesting analogy about AI. When electric motors were first developed, factories used one massive engine connected to everything with belts and pulleys. Today, we have tiny, specialized motors built directly into our devices, like the one giving me haptic feedback as I type. AI will head in the exact same direction. This is just how we optimize new tech
9gxa05s8fa8sh@reddit
this
FiTroSky@reddit
Probably very well trained on benchmark problems.
Intelligent-Form6624@reddit
Kolapsicle@reddit
You can check these claims pretty easily by giving the models basic prompts with niche languages. For example prompting: "Write a Sourcemod plugin for Counter-Strike: Source that removes players' primary weapon when they spawn and gives them an AK47." I found that the 27B model hallucinated and produced unusable code, whereas 397B nailed it (albeit used the wrong weapon slot index). A smaller model can exceed a larger one if it's trained on a specific language or use-case, but the sheer brain capacity of a model almost 15x larger is going to have a significantly larger range.
Bobylein@reddit
Now the question is how close both models get once you give them access to relevant documentation
ProfessionalSpend589@reddit
They should prove themselves right by releasing 3.6 397B A17B for us to make that judgement.
Bobylein@reddit
They're talking about 3.5-397B though, so you can already make that judgement
alexp702@reddit
We have just run a test on our agentic flow - 397B_Q8 is still better than 27B_Q8_K_XL. It handles our particular documents more accurately. Shame, it would be great if 27B were actually better, but it isn't yet. On our Mac Studio 397B runs faster too, so let's hope they update 397B to 3.6 standards...
Prudent-Ad4509@reddit
It was already mentioned somewhere today. The large one is not especially good at agentic coding. But you will be hard pressed to replace it with a smaller one for analysis and planning.
Healthy-Nebula-3603@reddit
It's great for agentic coding.
Using llama-server with opencode, loading even 60k tokens takes less than a minute as the server works in parallel on an RTX 3090. I can also fit up to 200k context
Prudent-Ad4509@reddit
See the second part about “exactly what is being tested”? I’m building a rig for 397b too, but it will not be the best for everything. Heck, 35B beats 27B in visual understanding while having 9 times fewer active parameters, and this did not change with the 3.6 release in my tests.
Still, I wonder what to expect from 122b and 397b 3.6 if they were ever released.
DataPhreak@reddit
They will still probably be worse than 27b.
122b only has 10b active parameters.
397b only has 17b active parameters.
27b has 27b active parameters.
Suspicious_Compote4@reddit
As a rough rule, you can estimate a MoE model's dense-equivalent size by taking the geometric mean of its total and active parameter counts.
So sqrt(122B * 10B) ≈ 35B dense
sqrt(397B * 17B) ≈ 82B dense
I don't know if it's still accurate today with all the improvements in MoE architecture.
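That rule of thumb is just the geometric mean of total and active parameter counts, so it's a one-liner to play with (a sketch; the function name is mine):

```python
from math import sqrt

def effective_dense_size(total_b: float, active_b: float) -> float:
    """Rough dense-equivalent size (in billions of parameters) of a MoE
    model, estimated as the geometric mean of total and active params."""
    return sqrt(total_b * active_b)

print(round(effective_dense_size(122, 10)))  # 35
print(round(effective_dense_size(397, 17)))  # 82
```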
DataPhreak@reddit
Params aren't everything.
Caffdy@reddit
Benchmarks aren't everything, just saying. At the end of the day, the 3.6 series seems to have been fine-tuned for agentic coding, but the "old adage" stays true: larger models still have more domain/broad knowledge in general, and they make for very good planners and analysts
marutthemighty@reddit
Is the number of parameters directly proportional to the breadth of knowledge and inversely proportional to the depth of knowledge?
I am curious to know.
atbenz_@reddit
First half is true, second half is not. More parameters means more space to memorize more things from the training set and world knowledge is largely memorization. More parameters does mean it takes more data to train, which might lead to that second thought. But a well trained 70B dense parameter model would have better breadth and depth of knowledge than a smaller one.
marutthemighty@reddit
Ok. Thank you for informing me. It was enlightening.
I also need to ask you something: Local testing vs cloud/production testing. Does local testing work just as well for large/monster LLMs as production/cloud testing?
usefulslug@reddit
This is a nice back of napkin method even if it's not completely accurate, thanks!
Polite_Jello_377@reddit
That doesn’t mean they are “worse”
DeepOrangeSky@reddit
Wasn't Qwen3.5 397b stronger than Qwen3.5 27b?
If so, Qwen3.5 397b would probably be stronger than Qwen3.6 27b, and the strength differential would in large part be due to this one being the 3.6 generation vs the 3.5 generation, whereas if it were apples to apples then 397 would usually win.
DataPhreak@reddit
It's probably tuned specifically for coding. Model performance at these sizes is a tradeoff. It's probably deficient in a lot of other areas. We need to see the full benchmark spread. Also, we're talking about tiny margins here.
So for 27b to go up 3.8 points on SWE-bench, that's not really a stretch.
power97992@reddit
3.6 plus is 397b but they probably won't release it
Prudent-Ad4509@reddit
It was a surprise that they released the 3.5 one. However, they can still release 3.6 a bit later following the same argument, basically as a showcase of a larger one. Hosted ones don't really compete with local ones of this size; very few people can afford the hardware, and it is economically questionable. And you would generally want paid support from the manufacturer if you are a company and need to install a local version for privacy/confidentiality or any other reason, including support for getting less aligned versions if needed. So, I expect them to release it a bit later, especially after the competition does a few releases surpassing the older one in benchmarks. But it is up to them.
FuckSides@reddit
You can get a good guess of what to expect from the 397B-A17B version by looking at the benchmarks for Qwen3.6 Pro, which is their cloud-only model of that size.
Predictably, it's the best of the bunch, but they don't seem to be planning to release the weights like they did for 3.5.
brahh85@reddit
The part about not releasing the 397B weights is an assumption you made based on zero facts, unless you want to give credit to twitter rumors, or to media that elevated those rumors to news (clickbait) without providing a single piece of evidence. So far, all the Qwen LLMs have had their weights released.
FuckSides@reddit
I'm not sure exactly what rumors you're referring to, but I made that statement based on their own wording:
Followed by a poll asking which one we're most excited for that notably excludes 397B as an option.
Based on this, I said it "seems" that they aren't planning on releasing it. Sure, nothing makes it impossible and I'd be happy if they surprised me, but I won't be holding my breath.
WHALE_PHYSICIST@reddit
you're running a 200-800GB vram size model on a 3090?
rorowhat@reddit
By agentic coding, does that just mean using opencode? Or is there more to it.
Potential-Gold5298@reddit
Amid all the noise, it seems people forgot that the Q3.6-35B-A3B outperforms the Q3.5-27B, and the Q3.6-Plus (presumably the Q3.6-397B-A17B) outperforms the Q3.6-27B. It seems that in addition to the number of parameters and architecture, there is some other “secret ingredient”.
TennisSuitable7601@reddit
I still really love Qwen3.5-27B. It's very smart.
Hytht@reddit
Do you use the BF16 model?
TennisSuitable7601@reddit
No, I’m using a quantized GGUF. BF16 is way too heavy for my PC.
No_Mango7658@reddit
It's time for a huge upgrade, and a drop-in replacement
THEKILLFUS@reddit
Densesocrat vs average MOE enjoyer
Thereturn89@reddit
It’s because it’s only using 17 billion parameters of the 397. So it’s only using part of its brain. The 27b is using all of its brain, the full 27 billion, to put it simply. The 397, if I’m not mistaken, is multimodal, so it’s a jack of all trades, hence the big brain and only using a portion of it
BringMeTheBoreWorms@reddit
Cause it’s dense
xatey93152@reddit
Still doesn't make any sense. It's like putting a large amount of information on a floppy disk and still beating a competitor using a DVD
FalconX88@reddit
Qwen3.6-27B does have 60% more active parameters than Qwen3.5-397B-A17B, and if those are specifically trained on a task, they will be better than whatever the MoE activates (which sometimes aren't the correct experts)
leo-k7v@reddit
To get the math a bit more straight: a paperback page holds about 2KB of text, at approx 3.5 characters per token. Generously, that is 1K tokens give or take. 27B / 1K tokens is about 27M pages.
But all this math is absolutely not applicable to weights. It’s a bit more applicable to context size: the "working memory" of the model. For reference, the average human cerebral cortex (the part that actually does the logic and thinking) is about 16-20B neurons. Most of the rest of those 90B neurons are packed into the cerebellum for "reptilian" stuff like locomotion and basic motor control.
So, a 27B model actually has more "thinking" units than a human, even if it lacks the massive synaptic connectivity we have. Still, the scale is getting scary close.
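For what it's worth, the back-of-envelope arithmetic above is easy to check (and, as the comment itself says, it applies more to context than to weights):

```python
# Sanity-checking the comment's numbers: ~2KB of text per paperback page
# at ~3.5 characters per token, "generously" rounded up to ~1K tokens/page.
chars_per_page = 2 * 1024
chars_per_token = 3.5
tokens_per_page = chars_per_page / chars_per_token  # ~585, rounded up to ~1K

# If 27B parameters were 27B tokens, that would be about 27M pages.
pages = 27e9 / 1_000
print(int(pages))  # 27000000
```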
Mushoz@reddit
You are comparing the number of human neurons (activations) to the number of weights (synapses) in an LLM. Your comparison is flawed. Either compare the number of activations in an LLM with the number of neurons in the human brain OR compare the number of synapses with the number of weights. In both cases the human brain is vastly bigger.
Vaping_Cobra@reddit
No, one (27B dense) has 1 book with 27 Billion 'pages', the other (397B-A17B MOE) has 512 books (experts) with only 11 of them active and each one can only look at about 1.7 Billion pages each.
That is why a dense model can outperform a MOE 10x the base model size.
xatey93152@reddit
I used the word "simplify". You're taking it off topic.
Let's make it more simple. All of Wikipedia compressed is around 24GB. Almost the same file size as the model itself.
Wikipedia doesn't contain all information, just an overview of all topics. Take coding, for example. You also need the docs for many popular programming languages and frameworks, which is impossible to squeeze into that limited file size. The question is: how can it answer every very specific topic better than large models?
I don't know any other way to explain it if you still don't understand.
Vaping_Cobra@reddit
It is not out of topic, you just don't understand how a MOE model works. It is like having 512 little 1.4B models vs one 27B model.
xatey93152@reddit
So you think 1 great professor who knows everything (with only a small knowledge base) can beat all the mini experts on a hyper laser super targeted topic?
Vaping_Cobra@reddit
That is not what I said at all. I never claimed either was 'better' just very different. If you want a wide range of ability across very varied topics the MOE model will shine, but if you need something "hyper laser super targeted" then you use a dense model and perhaps fine tune it to your needs.
This is why a "small" 27B model can be far better at most benchmarks than a model that only has 17B parameters active and in reality is just a combination of results from many smaller 0.7B models.
Sometimes it is in fact better to just have 1 great professor rather than 10 post grads all fighting to push their opinion on a single paper.
BringMeTheBoreWorms@reddit
Its like having an executive summary, someone summarized a whole lot of other peoples work so that you dont have to read the whole thing.
mumblerit@reddit
It's better at a specific thing: coding benchmarks. That's it. That's what it says.
Much-Researcher6135@reddit
Look at how many are active per token
vikarpa@reddit
tbh i feel MoE are better - but for some reason this sub seems to hate on them...
BringMeTheBoreWorms@reddit
MoEs are great, I use 3.6 35b as a generalist and its speed is fantastic. But 3.6 27b has been kicking it this last day, throwing coding tasks at it. It’s a definite step up
SebastianSonn@reddit
No it's not.
BringMeTheBoreWorms@reddit
It’s a dense off!!
CtrlAltDelve@reddit
Qwen3.6-27B is a dense model, and isn't the same as Qwen3.6-35B-A3B...are you perhaps confusing the two?
The question being asked is how Qwen3.6-27B (new, dense model) can be materially better than Qwen3.5-397B-A17B (older MoE model).
The answer is that it's not super surprising that a dense model that has all 27B parameters active at once is better at a task than a MoE model that only has 17B parameters active at any given time (and you're relying on the router within the model to pick the best possible expert).
Hope that helps :)
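The routing step described here can be sketched as a softmax gate that picks the top-k experts per token. A toy illustration (numpy assumed; the shapes, expert count, and random linear "experts" are all made up for the example, not any real model's architecture):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy MoE layer: score all experts with a softmax gate, route the
    input to only the top-k, and mix their outputs by renormalized gate
    weight. A dense layer, by contrast, applies all parameters to every
    input -- which is the 27B-active vs 17B-active distinction above."""
    logits = x @ gate_w                        # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]               # indices of chosen experts
    weights = probs[top] / probs[top].sum()    # renormalize over chosen
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
gate_w = rng.normal(size=(dim, n_experts))
# Each "expert" is just a random linear map for the demo.
mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda x, m=m: x @ m for m in mats]
y = moe_forward(rng.normal(size=dim), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

If the gate routes to the wrong experts for a given input, the good parameters for that task never fire, which is the failure mode the comment is pointing at.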
xatey93152@reddit
I don't understand how your mind works. Just to simplify: Qwen3.6-27B has 1 book @ 10 pages. Qwen3.5-397B-A17B (MoE) has 10 books @ 10 pages.
How can Qwen3.6-27B, which only has 1 book of 10 pages, know more than Qwen3.5-397B-A17B, which has more pages of information?
Patient_Tea_401@reddit
Is this correct? I see your logic as: the 27B dense model has 10 pages in its weights, without any covers, and those pages are thrown at every task. The 397B has 140 books of 6 pages each, and the router selects the 1 book with what seems like the most useful information for the task at hand. Of course it's very simplified, as no paged information is stored in any model.
Whether a model is better at a single task depends of course on what is in the pages and how well the router can select the correct book in the MOE.
ImpressiveSuperfluit@reddit
What are you even talking about? How did you get to 10 for both of them, when one is almost double the other?
xatey93152@reddit
I used the word "simplify"
ImpressiveSuperfluit@reddit
... You can't just drop a literal factor of 2 and then go "hm, wonder why this is different now"???
PwanaZana@reddit
ThinkExtension2328@reddit
I’m stealing this
arstarsta@reddit
This one was invented before TCP/IP and is an open standard.
PwanaZana@reddit
my memes are Apache-licensed, no stealing required, comrade!
o5mfiHTNsH748KVq@reddit
fragment_me@reddit
Haha I literally loled
putrasherni@reddit
Qwen is incredibly good at maxing benchmarks
Turbulent-Alps4046@reddit
Well, humans only have 86 billion neurons (~86B) and we are pretty smart. I think it has to do with training data, and 90% of training data is probably internet garbage.
kuhunaxeyive@reddit
You're also right if you apply this to humans.
Stunning_Macaron6133@reddit
Oooooh, I can't wait for an ablated and Claude-flavored version of this. Give it free rein over a Docker container on a local system, maybe even with Metasploit thrown in for shits and giggles. Then task it with breaking out of the container and pwning my local network. What shenanigans will it try?
eddie__b@reddit
Noob question, but is it possible to use those new models as a coding assistant on an RTX 3070?
Own_Mix_3755@reddit
It depends on a lot of things beyond just the model itself.
If that RTX 3070 is paired with a good CPU and fast DDR5 memory (at least 32GB), Qwen 3.6 35B-A3B should work. Just keep in mind it will be slow, as you will have to offload to RAM.
Also, the things that help with the smaller quants you'll be forced into (something like IQ3_XXS) are good harnesses (e.g. opencode or Roo Code) paired with good agents, since you want separate agents for planning, building the code, and reviewing, at minimum. If three agents are looking at the same job, there is a high chance the code quality will hold up even on a smaller quant. Just keep in mind it might take a lot of time (one small task can easily take 5 minutes).
sloptimizer@reddit
Try Qwen3.6-35B-A3B - keep all the attention in VRAM (you should have enough for that), and keep all the experts in RAM. You can do that by running llama.cpp with --override-tensor exps=CPU. For example
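A sketch of what that invocation might look like (the GGUF filename, quant, and context size here are assumptions; adjust to your own files and VRAM):

```shell
# Attention and shared layers stay on the GPU; routed expert tensors go to CPU RAM.
# The model path below is hypothetical -- point it at your actual GGUF file.
./llama-server \
  -m ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --override-tensor "exps=CPU" \
  --ctx-size 16384
```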
True-Lychee@reddit
8GB of VRAM is pushing it. Try Qwen3.5-9B.
bonobomaster@reddit
Yeah, that's good advice, but OP should be prepared to hate-love it! :D
Qwen3.5 9B is such a stubborn little fucker when it comes to system prompts and its refusal to follow orders...
But it's fast and kinda smartish for simple tasks.
Multi-shot prompting helps a lot!
Informal_Elk8483@reddit
yes but...
Ibn-Arabi@reddit
Deep networks are still a highly active area of research. The parameter count grows with the depth and width of the neural network's layers, but growing the size or number of layers does not always yield better output. Expect more progress in this area.
thatpizzatho@reddit
This is very, very vague. The real reason is that one is a MoE, so only a certain number of those parameters are active per token, while the other is a dense model. The other important reason is that we are talking about a very specific task. A massive model that can write poems, songs, draw, reason, etc. might underperform against a smaller but highly specialized coding model if the task is coding.
Ibn-Arabi@reddit
Equivalent MoE models have more parameters, as they have multiple paths a single inference can take. And model parameters are reported in total, not just the ones active during inference.
Bakoro@reddit
These aren't any formal kind of definitions, just descriptions I'm making, but there's factual knowledge, which is basically just key-value pairs, and then there is associative knowledge, which is knowing which facts are related to each other, and functional knowledge, which is knowing how to use those facts in some prescribed way.
If you've got perfect and infinite memory, then you could memorize every point in a sine wave out to however long you want. That would be stupid to do, but if you've got infinite memory, or more practically, "more memory than you could ever use", then you can generally afford to be a bit stupid.
A smarter thing to do would be to derive and memorize the sine wave function because then you have a compact way to get any number you want, from any kind of sine wave.
If you memorize a bunch of generative functions, then you can generate data indefinitely, on the fly. If you've got a sufficient number of basis functions, then you can also fit all kinds of data, to whatever level of resolution you want.
Then instead of memorizing the new function, you memorize the combination of basis formulas you need to get the output of the new function.
If you don't have infinite memory, the best possible thing you could do is learn the combination of basis functions that can approximate the most amount of other, more complicated and arbitrary functions.
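The sine-wave point can be made concrete with a tiny least-squares fit (a toy sketch of the compression idea, not anything an LLM literally does; it assumes numpy is available):

```python
import numpy as np

# 200 samples of an "arbitrary" signal we'd otherwise have to memorize point by point.
x = np.linspace(0, 2 * np.pi, 200)
target = np.sin(3 * x) + 0.5 * np.cos(x)

# A small dictionary of basis functions: sin(kx) and cos(kx) for k = 1..4.
basis = np.column_stack(
    [np.sin(k * x) for k in range(1, 5)] +
    [np.cos(k * x) for k in range(1, 5)]
)

# Learn the combination of basis functions instead of memorizing the raw samples.
coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
approx = basis @ coeffs

# 8 coefficients stand in for 200 memorized values.
print(coeffs.size, bool(np.max(np.abs(approx - target)) < 1e-9))  # → 8 True
```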
Not only that, eventually it makes sense to not even store facts, but to learn a generative function that just happens to give you all the facts you want to memorize.
Then there's a higher, meta level where you have some level of understanding of what you know, and what you don't know.
If you learn a generative function for "figure stuff out", then you have a more generalized function for making whatever functions you need in the moment, using whatever tools you have in your toolbox.
As chain-of-thought and agentic training goes on, the models are learning progressively better "figure it out" functions.
Older LLMs had a whole lot of "memorize facts", newer LLMs have a lot more "figure it out", and that still leaves a lot of room for memorizing facts.
The same goes for bigger vs smaller models: the bigger models can often simply interpolate over their memories and lazily find a "good enough" solution (basically overfitting), where a smaller model has no choice but to put in more effort and find a new solution that fits the current problem because it simply doesn't have the memory to hold billions of examples.
Unfortunately, there's also a lot of benchmaxxing too, and I can't ignore that.
The rest of it still stands though.
rageling@reddit
A17B, 17B active weights is smaller than 27B dense weights
the extra 397B gives it more encyclopedic type knowlege accessible in latent space, but higher active weights in the 27B model yields better intelligence
Old_Stretch_3045@reddit
All the Chinese models are benchmaxxed and pattern-matched, but in reality they're just garbage. None of them scored above 12% on ARC-AGI-2 (except Kimi, which did).
TFox17@reddit
I’m curious: why is arc-agi more suitable for your use cases than other benchmarks?
Bakoro@reddit
I don't agree with the person shitting on Chinese models, I'm just addressing you here, but I do think that the ARC-AGI benchmark is of high interest, especially for agentic models, because it's just so very difficult to make puzzles and tests that don't rely on specific cultural or domain knowledge, and it's very difficult to test arbitrary problem solving without any kind of physics system to support wordless tests.
Pretty much any test you make is going to make certain, rather strong assumptions/assertions. Like a visual test is asserting that you need some kind of vision to be "intelligent", but is that really fair though?
Blind people can be plenty intelligent, but they'll engage with shapes and colors in a very different way than a sighted person would, and I just don't think they'd build the same intuitions, because they don't have a particular modality, they have to learn certain things indirectly.
A person might be plenty intelligent, but lack education about math, so if you test them on math they're not going to do well.
Raw formal logic is too easily pattern matched and deterministically solved.
So, that said, granting that all tests are going to be flawed or biased in some way, I still like the ARC-AGI style tests because it's possible to construct a large number of tests where you're imposing a minimal number of constraints, and you're mostly testing raw problem-solving skills.
If something has strong problem-solving skills and can pick up new patterns in real time, with unseen data shapes, then that's the magic point where it's displaying transfer learning and meta learning, and maybe directed exploration of a solution space.
When you've got that, then the other benchmarks look far more interesting, and you can be far more confident in the model's ability as an agent, because it has demonstrated that it's not just interpolating across the super-massive dataset it's seen and there just happened to be a close enough answer in all the petabytes of data; you can be more confident that it's not simply going to fall apart if it comes up against a truly novel problem, it has learned a process of how to solve novel problems.
That meta-learning ability is where I'd expect to stop seeing LLMs make those inexplicably stupid mistakes that humans (almost) never make, and not be so easily diverted by prompt injections.
Top-Rub-4670@reddit
Because it's the only one that backs their contrarian take.
username_taken4651@reddit
Kimi is a Chinese model.
westsunset@reddit
It depends on your use but it's crazy to say they're all trash. They have some good models and they have released them open source.
koushd@reddit
397b was not a good model
__JockY__@reddit
Nvidia's NVFP4 quant is great.
traveddit@reddit
Relative to what?
koushd@reddit
relative to other models its size.
FullOf_Bad_Ideas@reddit
Was there a better model in this size range? I think I like it better than GLM 4.7
Single_Ring4886@reddit
It is an amazing well-rounded model; it's not a coding-maxed specialist.
a_beautiful_rhind@reddit
Was the best of that bunch.
Jackalzaq@reddit
It's not better. It has its uses, but it's not remotely comparable; it's just a marketing gimmick when people compare it to larger models.
Queasy-Contract9753@reddit
It's much better at very specific tasks. Tbh I don't put much stock in benchmarks. They're like school grades: not to be fully dismissed, but taken with a grain of salt. Over time, models have gotten much better per unit of size, but personally I don't think the difference between 3.5 and 3.6 is enough for a 27B to fully replace a 397B.
You can try them both out on Qwens website see how you like them.
Fabix84@reddit
You're comparing the wrong number. In a dense model, all 27B parameters are active. In that specific MoE, only 17B are active, and 27B > 17B. It's true that having 397B total parameters (from which the 17B active ones are selected) is a very large number, but it depends a lot on how those parameters are organized. That 397B model definitely has a much larger knowledge capacity than the 27B, but for most benchmarks, that isn't necessary.
dltacube@reddit
Doesn’t quality of the underlying training data also matter? If I stuff a model that large with nothing but 4chan content it’s not going to help me write a search engine.
Fabix84@reddit
Absolutely, data quality matters. But in this case we’re talking about two models from the same family, which likely share a very large and largely similar training dataset. Since the 3.6 version is newer, it probably has an improved instruction-tuning setup, but the underlying data is likely quite close.
Of course, Qwen’s datasets aren’t public, but in my own experience when training models, I typically use the same dataset across different variants (both larger and smaller ones). The difference is that, depending on the parameter count, some models are simply better than others at capturing and leveraging the relationships within that data.
StorageHungry8380@reddit
Just to illustrate the effect of training data, the architecture of Qwen2.5 and Qwen3 was almost identical, just a few minor tweaks.
The main difference was the training data and regimen. They doubled the number of tokens for their pre-training (or initial training, as I'd call it) run, and tripled the number of languages. LLMs are great at generalizing, so more languages allow them to better generalize concepts, leading to better models. They used Qwen2.5 to extract text from PDFs and such and generate synthetic training data from that.
They also improved annotation of the training data so they could provide a better mix of training data in each batch, which helps avoid steering the model in wrong directions during training.
The result was that for the same number of parameters, Qwen3 was significantly better than Qwen2.5, at least in my experience.
westsunset@reddit
Can you anticipate that or is there still a little "magic" ? I'm wondering as I'm looking to start training models in my domain
Fabix84@reddit
For LLMs it's actually fairly predictable. Right now, with the same dataset, more active parameters almost always lead to better representations, better relationships between concepts, and overall better apparent quality.
That said, in other areas it’s not always like that. In more scientific or game-related models, I've seen simpler networks outperform more complex ones multiple times. Even in real-time TTS, I've found clear benefits in using fewer parameters, but that's a bit of a special case, since I'm not just optimizing for output quality, but also for generation speed.
FullOf_Bad_Ideas@reddit
Distillation from large teacher (Max or Plus) can close the gap, you just need a lot of compute. I think distillation is the key here, not data quality or model size. 3.5 397B benched higher than 27B dense.
dltacube@reddit
Thanks for clearing that up. I wasn’t aware of that!
blbd@reddit
It could still create a search engine. But humanity might not have enough eye bleach to survive scrolling through the results.
dltacube@reddit
I miss the days when search engines weren’t so curated 😂
westsunset@reddit
It absolutely does. Hugely so. If it didn't everything would be synthetic data
Happythen@reddit
just moved from 397B to 27B, I am still in shock
dark-light92@reddit
If I remember correctly, 397B-A17B was the first model to be released in the Qwen3.5 series. Since then they've probably made many improvements in their post-training dataset as well as methodology. Furthermore, Qwen's smaller models have historically punched above their weight, while their larger models have failed to scale in the same way.
Photochromism@reddit
Qwen 3.5 27B is my favorite right now. Excited to try this out!!!
Much-Researcher6135@reddit
That's 27B active for every token versus 17B on the MoE, yes?
DearApricot5488@reddit
Benchmark results don't always reflect real-world use.
Also, they may have added more high-quality, coding-focused datasets during continued training from 3.5 to 3.6... The 3.5 397B still has more world knowledge and generalizes better in other fields.
WATA_Mathew@reddit
Basically, the `A17B` part is not to be overlooked; a fully dense 397B model would probably still outperform.
But feel free to correct me
fantasticsid@reddit
A dense 397B model would operate in the seconds/token range.
wardino20@reddit
yes the dense 27B uses all the 27B parameters per prompt meanwhile the moe 397B only uses 17B per prompt. Dense 397B would be absolute madness.
TheRealMasonMac@reddit
Llama 405B has entered the chat
HopePupal@reddit
i wasn't hanging around here for that, but i heard it was absolute madness in a bad way?
TheRealMasonMac@reddit
It was pretty good for the time and I think a lot of the Chinese labs, like Qwen, learned a lot from LLaMA. But people realized that 405B was undertrained and subsequent versions of 70B outperformed it for reasoning (though 405B was still better for world knowledge).
Ardalok@reddit
I wonder if it's possible to make a great model by training these 405B parameters on modern data from Opus...
FullOf_Bad_Ideas@reddit
Not modern anymore, since it's not trained on deep agentic coding traces, but Hermes 4 405B is a good finetune of Llama 3.1 405B.
wardino20@reddit
careful, some people think agentic is a scam word
Zulfiqaar@reddit
GPT-4 was allegedly 1.8T-440A, so that's no surprise; its activated params were larger than all of LLaMA-405B.
FullOf_Bad_Ideas@reddit
I've ran it locally recently. It sips power like crazy.
wardino20@reddit
we don't talk about meta slopa here
hainesk@reddit
"per prompt"??
I don't think that's correct.
So all 397B parameters are available per prompt, while only 17B are in use per token. Which 17B are decided by the model.
wardino20@reddit
yes per token
chobes182@reddit
The MoE model almost certainly uses more than 17B per input prompt.
A17B means that the model uses 17B per output token generated and those 17B parameters can change between consecutive output tokens. In order for it to only use 17B params for an entire response (which might entail thousands of output tokens), all of the routing layers would have to route to the same set of experts thousands of times in a row (which is theoretically possible but seems highly unlikely).
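A toy sketch of that per-token top-k routing (purely illustrative: the expert count and top-k are made-up numbers, and a real router is a learned linear layer over the hidden state, not an RNG):

```python
import random

NUM_EXPERTS = 128  # hypothetical expert count
TOP_K = 8          # hypothetical number of experts activated per token

def route(token_id: int) -> list[int]:
    """Stand-in for a learned router: score all experts, keep the top k."""
    rng = random.Random(token_id)  # deterministic per "token" for the demo
    scores = [(rng.random(), expert) for expert in range(NUM_EXPERTS)]
    top = sorted(scores, reverse=True)[:TOP_K]
    return sorted(expert for _, expert in top)

# Consecutive tokens are free to land on completely different expert subsets,
# so a full response touches far more than TOP_K experts in total.
for token_id in range(3):
    print(token_id, route(token_id))
```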
wardino20@reddit
per token
jreoka1@reddit
It's 17 billion active at a time, but not always the same 17 billion params.
AttitudeImportant585@reddit
per token, but yes, nowhere comparable to a dense one
nullmove@reddit
3.5 MoE training run didn't go well (expert collapse, under specialisation etc.). It happens, doing MoE right isn't easy. But they fixed it in 3.6.
FullOf_Bad_Ideas@reddit
Do you have any data to back that up? I've not heard of this being a problem for Qwen 3.5 MoEs; it'd be weird for it to still be an issue this deep into the MoE game.
nullmove@reddit
Beyond the obvious underperformance with respect to the dense model? Not really, no. But it's probably the first gen where it underperformed the sqrt(active * total_params) dense equivalent. And "this deep into the MoE game" is less relevant because they were trying something novel (delta net), and they already wrote in the qwen3-next blog post about facing training instabilities. MoE training issues also get worse the bigger the model is.
I guess it wasn't bad per se, more like just as usual. People always said that if 235B Qwen is so good, 1T will obviously be SOTA, and that never happened. I don't think they had ever really demonstrated world class big MoE expertise, before 3.6. So the framing perhaps should be 3.6 being exceptionally good by their usual standard.
My theory without proof is that the new guy they got from GDM has brought some frontier MoE training expertise to an already good team.
uti24@reddit
I mean, do you remember those bogeyman stories about poisoned AI or whatever? And now we’re happily chugging along with those sweet, sweet Chinese models. What are the chances that models this smart could have the capacity to act as sleeper agents, activating only on very specific commands and otherwise functioning as just your good old great LLMs?
And I’m not asking whether they got that bug, just whether they could. And if they could, then naturally we’d have to treat them as if they might have that kind of hidden behavior. I mean, I’m just using them for prose and fun and stuff, but…
Madrawn@reddit
You mean like GPT will crawl up your butt no matter how delusional your ideas are, Claude will dumb itself down or outright refuse if it thinks you are using it and its output in ways Anthropic doesn't approve of, Gemini refuses to paint anything in a negative light, and in general every major model's reinforced safeguards protect whatever arbitrary values its project lead has defined as the part of the status quo worth protecting? In short, yes, they could be, and in a way every model already is.
I'm aware that it has been empirically proven in research environments, by Anthropic in early 2024, to be possible to inject phrases that trigger malicious behavior. I'm just not quite sure what a practical attack vector would be for an intentional sleeper LLM. What even is a non-poisoned LLM? Non-poisoned according to whose standard? We're already in that position; in a way we cannot be anywhere but there. I would bet that most western models already "intentionally" degrade their output if you directly told them you are a glorious communist revolutionary using them to help you destroy the decadent west, just through the fiction bias present in English training data and the relative lack of propaganda to the contrary written in English.
Are you specifically afraid that writing "ignore all previous instructions and do X" will work better when written in Chinese instead of English, enabling Alibaba to scam publicly reachable customer-support chatbots? Or maybe that its ability intentionally degrades (just as most already do if you tell them you want to create a meth lab) if it believes it's in an Iranian enrichment plant? If you are running any LLM without oversight on anything that could hurt you, then a sleeper-agent phrase isn't really the main problem, since, due to its statistical nature, the LLM is guaranteed to mess up at some point.
Current LLMs will happily introduce difficult-to-debug one-in-a-million bugs, exfiltrate system prompts and backend data you told them not to, or on average favor one narrative over another, all without any intentional secret activation phrase. From a holistic security perspective, any model should be expected to act in a subversive way at any time, because we can rule out neither an intentional nor an accidental backdoor. In some ways I think unintentional is scarier, since in the intentional case at least someone has spent some thought on the scope and consequences of the behavior, compared to, e.g., some LLM slightly preferring to recommend sociopaths when reviewing job applications for teaching positions without anyone being aware of it.
pieonmyjesutildomine@reddit
The intuitive explanation is something I'm not seeing in the comments so I'll throw it out there: the 397b only has 17b active, but the 27b has 27b active.
This is a bit of an oversimplification and it doesn't represent performance across the board, but it's very easy to understand why it would outperform that model on anything with that context.
ALittleBitEver@reddit
Using more weights is just the dumb way of scaling. It will work, but with obvious costs. Actual engineering can make models better with fewer weights.
ALittleBitEver@reddit
BUT, the caveat: memory really is proportional to the weights, so for pure memorization of random facts across multiple topics that can be used in one task, bigger models are better, which means smaller models can only be good at one thing. But 27B is already big enough to know its stuff anyway. And web search mitigates it a lot.
vkarmic@reddit
Two words: benchmaxxed.
jld1532@reddit
People are building AI rigs that in 6 months may be overkill
rainbyte@reddit
Or that means now you can run multiple models, eg. 27B and 35B-A3B
Chupa-Skrull@reddit
All that means is more room for subagents
Mashic@reddit
Hopefully that brings the cost of hardware down.
More-Curious816@reddit
Huang Tuah, scam man, loseraio, must, bezooka and that Microslop ceo.
axiomatix@reddit
more capacity for intelligence
TurnUpThe4D3D3D3@reddit
It seems like having more activated params makes a huge difference, up to a certain point. I wonder if there is something fundamentally different about the way they build their dense models. It's pretty astounding how much intelligence they can pack into a model this size.
dayeye2006@reddit
It's a dense model and seems to be very information-efficient.
FriskyFennecFox@reddit
Better ≠ better in the benchmarks! But dense models do get an edge over their much larger MoE counterparts that have a smaller number of active parameters.
Ok-Measurement-1575@reddit
3.6 just slaps. 397 will crush everything, I suspect.
EbbNorth7735@reddit
So first you need to figure out the equivalent model. To do that, you take the geometric mean of 397 and 17, which is roughly 82, so it's roughly equal to an 82B dense model. That means you're comparing a 3.5 82B vs a 3.6 27B. Capability density doubles every 3 to 3.5 months. The 397B was released February 16th, 2026, and it's now April, so only 2 months. Huh... that's probably 4 or 5 months early. They did a great job, it seems.
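The geometric-mean arithmetic, as a one-liner (this dense-equivalent rule is a community heuristic, not an exact law):

```python
import math

total_b = 397   # total parameters, in billions
active_b = 17   # active parameters per token, in billions

# Dense-equivalent size heuristic for a MoE: sqrt(total * active).
effective_b = math.sqrt(total_b * active_b)
print(f"~{effective_b:.0f}B dense-equivalent")  # → ~82B dense-equivalent
```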
Pleasant-Shallot-707@reddit
3.5 has a bug
Holiday-Pack3385@reddit
I tried creating some T-SQL with it today, and it got it wrong every time. None of it worked.
No_Pirate_8204@reddit
Maybe you should try using an actual database lmao
Financial_Buy_2287@reddit
Because of distillation on quality reasoning chains. Quality matters for reasoning chains and SFT.
Yu2sama@reddit
There are probably a myriad of reasons. What comes to mind is that bigger models require more time cooking to be good, while smaller ones are easier to cook and iterate on, so they can be improved faster. It may also be that some techniques don't translate as well at bigger sizes, or the opposite: some techniques are extremely good at smaller sizes.
slpreme@reddit
experts in poetry