Qwen released Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here!

[-]

79215185-1feb-44c6@reddit

Will love to try it out once Unsloth releases a GGUF. This might determine my next hardware purchase. Anyone know if 80B models fit in 64GB of VRAM?

Reply

[-]

GreaseMonkey888@reddit

The MLX version runs very fast on my Mac Studio M4 Max with 64GB RAM, 80+ tok/sec, 0.5s to first token.

Reply

[-]

prof2k@reddit

Just by a used M1 max 64 for less than $1500. I did last month.

Reply

[-]

ravage382@reddit

Looks like they are already at it. [https://huggingface.co/unsloth/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/unsloth/Qwen3-Next-80B-A3B-Instruct)

Reply

[-]

Majestic_Complex_713@reddit

my F5 button is crying from how much I have attacked it today

Reply

[-]

rerri@reddit

Llama.cpp does not support Qwen3-Next so rererefreshing is kinda pointless until it does.

Reply

[-]

Majestic_Complex_713@reddit

almost like that was the whole point of my comment: to emphasize the pointlessness by assigning an anthropomorphic consideration to a button on my keyboard.

Reply

[-]

crantob@reddit

you didn't have one. hitting refresh on an output when you can just read the input (llama.cpp git) and know that hitting reresh is pointless.

Reply

[-]

At some point, the llama.cpp git will update saying that it can now be run. How exactly to do anticipate I would know when that is if I didn't....refresh the "input", as you call it? You can miss my point. You can not understand my point. You can not agree with my point. But you can't say I didn't have one. I spent time arranging words in a public forum for a reason.

Reply

[-]

steezy13312@reddit

Was wondering about that - am I missing something, or is there no PR open for it yet?

Reply

[-]

_raydeStar@reddit

Heyyyy F5 club!! In the meantime, I've been generating images in QWEN. https://preview.redd.it/nnjml6jkgnof1.png?width=1024&format=png&auto=webp&s=6feae35166341cd9d662e75121e7c95b9137ed1a Here's my latest. I stole it from another image and prompted it back.

Reply

[-]

InsideYork@reddit

Dr QWEN!

Reply

[-]

alex_bit_@reddit

No GGUFs.

Reply

[-]

ravage382@reddit

Those usually follow soon, but I haven't seen a PR make it though llama.cpp yet.

Reply

[-]

Ok_Top9254@reddit

70B models fit in 48 so 80B definitely should in 64.

Reply

[-]

Spiderboyz1@reddit

Do you think 96GB of RAM would be okay for 70-80b models? Or would 128gb be better? And would a 24GB GPU be enough?

Reply

[-]

Kolapsicle@reddit

For reference, on Windows I'm able to load GPT-OSS-120B Q4\_K\_XL with 128k context on 16GB of VRAM + 64GB of system RAM at about 18-20 tk/s (with empty context). Having said that my system RAM is at \~99% usage.

Reply

[-]

-lq_pl-@reddit

Assuming you are using llama.cpp, what are your commandline parameters? I run GLM 4.5 Air with a similar setup but I get 8 tk/s at best.

Reply

[-]

Kolapsicle@reddit

I only realized I could run it in LM Studio yesterday, haven't tried it anywhere else. It's Unsloth's UD Q4\_K\_XL. https://preview.redd.it/1dfade9olpof1.png?width=697&format=png&auto=webp&s=3ffddaf2eb0a70c84044548b107d1f7e906d121a

Reply

[-]

-lq_pl-@reddit

Thanks, that's great. Time to give LM Studio a try.

Reply

[-]

Neither-Phone-7264@reddit

More ram the better. And 24 is definitely enough for MoEs. Though, either one of those ram configs will easily run an 80b model even at Q8.

Reply

[-]

OsakaSeafoodConcrn@reddit

What about 12? Or would that be like a Q4 quant?

Reply

[-]

Neither-Phone-7264@reddit

6 could probably run it (not particularly well, but still.) at any given moment, only a few experts are active. each expert is only 3b params.

Reply

[-]

Steus_au@reddit

llama3.3 70b q4 give about 3tps on 32gb vRam offloading about 50% to Ram

Reply

[-]

jacek2023@reddit

please watch [https://github.com/ggml-org/llama.cpp/issues/15940](https://github.com/ggml-org/llama.cpp/issues/15940)

Reply

[-]

Aomix@reddit

Well here’s to hoping Qwen contributes the needed code because it sounds like it’s not going to happen otherwise.

Reply

[-]

ArtfulGenie69@reddit

Buying two 5090's is a bad idea. Buy a Blackwell rtx 6000 pro (96gb vram).

Reply

[-]

waiting_for_zban@reddit

You still want wiggle room for context. But honestly, this is perfect for the Ryzen Max 395.

Reply

[-]

SkyFeistyLlama8@reddit

For any recent mobile architecture with unified memory, in fact. Ryzen, Apple Silicon, Snapdragon X.

Reply

[-]

_rundown_@reddit

The community knows quality u/danielhanchen

Reply

[-]

SillyLilBear@reddit

Yes it should. I can fit the GPT 120B Q8 in 71G

Reply

[-]

Opteron67@reddit

get a xeon

Reply

[-]

MoffKalast@reddit

With a new MoE every day the strix halo is looking awfully juicy.

Reply

[-]

mxmumtuna@reddit

At a 4bit quant, yes.

Reply

[-]

Lorian0x7@reddit

it should fit yes

Reply

[-]

PhaseExtra1132@reddit

So it seems like 70-80b models are becoming the standard for usable for complex task model sizes. It’s large enough to be useful but small enough that a normal person doesn’t need to spend 10k on a rig.

Reply

[-]

jonydevidson@reddit

> a normal person doesn’t need to spend 10k on a rig. How much would they have to spend? A 64GB MacBook is around $4k, and while it can certainly start a conversation with a huge model, any serious increase in input context will slow it down to a crawl where it becomes unusable. NVIDIA 6000 Blackwell costs about $9k, and would have enough VRAM to load an 80b model with some headroom, and actually run it a decent speed compared to a MacBook. What rig would you use?

Reply

[-]

koflerdavid@reddit

Since it is going to be a MoE model, it could be amazing to run locally even for the GPU-poor. It has 512 experts, but there are only 10+1 simultaneously active, so it should be very inference-friendly. https://old.reddit.com/r/LocalLLaMA/comments/1mke7ef/120b_runs_awesome_on_just_8gb_vram/

Reply

[-]

AmIDumbOrSmart@reddit

If you don't mind getting your hands dirty, all you need is 64-96gb of system ram and any decent gpu. A used 3060 and 96gb would run about 500 or so and would run this at several tokens per second with proper moe layer offloading. Maybe spring for a 5060 to get it a bit faster. Framework will go faster for most llm's, but 5060 can do image and vid gen and wont have to deal with rocm

Reply

[-]

Fearless-Researcher7@reddit

The dedicated GPU for MoEs only makes a difference to process long inputs. To generate at 20 tok/s, system RAM is all you need, llama.cpp is working on support. For $2k, the Mac mini and Framework desktop should run the Q4 at 40 tok/s. And at the same price, you can run the Q8 on the Framework desktop or a used Mac Studio. Little parenthesis: all computing units with >200GB/s bw used for AI inference have non-upgradable memory: Nvidia/AMD GPUs, mac mini, framework desktop... It's due to routing constraints for signal integrity.

Reply

[-]

Famous-Recognition62@reddit

A 64GB Mac Mini is $2200…

Reply

[-]

Fearless-Researcher7@reddit

$1,999

Reply

[-]

busylivin_322@reddit

Works fine on my 128gb m3 MacBook. Even at larger context windows.

Reply

[-]

PhaseExtra1132@reddit

What’s the usable context window are you getting out of the 128gb ? I’m going for the AMD Ai chips with the same vram amount

Reply

[-]

busylivin_322@reddit

For local stuff, I’m really happy with my Mac. Ollama, OpenwebUI and openrouter means everything is at my fingertips. Both for chatting and development. Just waiting for the M5 and would love to max it out. Only done 60k context since the model released but <5seconds

Reply

[-]

Solarka45@reddit

Yes but something like a Chinese mini-PC with 64GB memory would be fairly affordable

Reply

[-]

MengerianMango@reddit

Even a basic gaming Ryzen AM5 can run this at ~10tps. I can't estimate the PP speed. A DDR5 CPU + 3090 would be enough imo if you're trying to run on a budget. I.e. what I'm saying is that what you already have will probably run it well enough. I am not a fan of the macbook/soldered ram platforms because I dont like that they're not upgradable. If you don't like the perf you can achieve on what you have, then my next cheap recommendation would be looking at used threadripper combos or old epyc hardware. You can build monstrous workstations using Epyc Rome that can get hundreds of GB/s (ie roughly 100tps on an a3b model). And you'll have tons of PCIe slots for cheap CPUs.

Reply

[-]

Majestic_Complex_713@reddit

If I'm understanding the MoE architecture right, I don't think I'm gonna have any problems running this on my 64GB DDR5-5800 i5-12600K + Nvidia 1650 4GB at a personally acceptable speed. smooth stream, no kidney stones. (hehe....i am a toddler. pp speed.)

Reply

[-]

OmarBessa@reddit

and the binary mode of failure, once SoC is gone it's really gone

Reply

[-]

SporksInjected@reddit

A Mac Studio is almost half that btw. You can get much cheaper if you offload MoE with llamacpp

Reply

[-]

PhaseExtra1132@reddit

You can get the framework desktop for 2k ish. And that has a 128gb vram setup. These Ai max 395 chips are seemingly a good way to get in. Im attempting to save up for this. And tbh this still isn’t that expensive. My friends car hobby is 10x the cost

Reply

[-]

redoubt515@reddit

Why does it seem that way to use? Afaik Qwen3-80B is the *only* popular recent model in that size range. The other recent popular medium sized models I am aware of have been: 120B, 235B, 106B. The *only* other popular model in the 70-80B range I can think of is Llama 3, but that is a couple years old now. Are there other good models in this range i'm unaware of or forgetting about?

Reply

[-]

meshreplacer@reddit

Works nice. uses 80GB of ram with the context slider all the way to 262K Fast I get 54 tokens/sec on M4 Max 128gb

Reply

[-]

Feisty_Signature_679@reddit

This is gonna be to go model for strix halo holders AKA framework desktop. MOEs work best there. Can't wait to see benchmarks for it.

Reply

[-]

ResearchCrafty1804@reddit (OP)

They released the Thinking version as well! https://preview.redd.it/aml5furdukof1.jpeg?width=1920&format=pjpg&auto=webp&s=7ac615436163ca517616948739a990c575597164

Reply

[-]

ItGaveMeLimonLime@reddit

I don't get it. 80B model that barely beats older 30b model ? How is this supposed to be win ?

Reply

[-]

SirStagMcprotein@reddit

Looks like they pulled an OpenAI on that last bar graph for livebench lol

Reply

[-]

tazztone@reddit

indeed

Reply

[-]

-InformalBanana-@reddit

They probably used something to emphasize (bold/increase size of the bar) which is the reason why it is noticable only here on 0.2 difference, instead of increasing just the width for example, they also increased the height. I hope the mistake wasn't intentional...

Reply

[-]

PhasePhantom69@reddit

I think the 0.2 percent doesn't matter by a lot and it is very good for cheap budget.

Reply

[-]

UnlegitApple@reddit

What did OpenAI do?

Reply

[-]

Silver_Jaguar_24@reddit

[https://x.com/kareem\_carr/status/1953510697456836694](https://x.com/kareem_carr/status/1953510697456836694)

Reply

[-]

UnlegitApple@reddit

This can‘t be real😭🤦‍♂️

Reply

[-]

danielv123@reddit

You wish xD

Reply

[-]

zdy132@reddit

Showing smaller numbers with larger bars than larger numbers, in their GPT5 reveal video.

Reply

[-]

Broad_Tumbleweed6220@reddit

I will test it more thoroughly, but i think it's gonna be a big surprise to most. Qwen3-30b-coder was already very good at agentic tasks and following instructions, general reasoning too. It is however no match for Qwen3-next-80b... I just posted a quick test comparing both : [https://medium.com/p/b011f63c5236#3940-739c39c5a9cc](https://medium.com/p/b011f63c5236#3940-739c39c5a9cc) Qwen3-next-80b one shot the the code challenge of the bouncing ball inside a triangle.. with gravity. In less than 30s...

Reply

[-]

A7mdxDD@reddit

Anyone tested on M4 Pro 64GB?

Reply

[-]

Slow_Independent5321@reddit

Estimated 20-30 tokens/s

Reply

[-]

A7mdxDD@reddit

This will fill up 64GB of VRAM afaik😭

Reply

[-]

TheActualStudy@reddit

That's fantastic. I'm looking forward to being able to use it at \~4.25BPW.

Reply

[-]

Crinkez@reddit

4.25 Bokens per wecond?

Reply

[-]

A7mdxDD@reddit

I'm dying 😂😂😂😂😂😂😂😂

Reply

[-]

Caffdy@reddit

A six piece bicken nugget

Reply

[-]

Narrow-Impress-2238@reddit

He said bpw so it's definitely Bokens per week

Reply

[-]

Bandoray13@reddit

This cracked me up

Reply

[-]

TheActualStudy@reddit

Bits per weight

Reply

[-]

Ardalok@reddit

You know it's bullshit when a 32b loses to a 30ba3b.

Reply

[-]

Emergency_Wall2442@reddit

Thanks for pointing that out. Are the results consistent with their original technical reports when Qwen3 was released?

Reply

[-]

randomqhacker@reddit

Older revision 32B. 2507 training was magic, apparently!

Reply

[-]

juanlndd@reddit

Is it faster than the 30b a3b? Because there are only 3b assets, but the architecture has changed, correct?

Reply

[-]

Traditional_Tear_363@reddit

According to their blog, the longer the context, the faster it is compared to 30B-A3B (by using linear attention in 75% of the layers) https://preview.redd.it/nkm8ba6p7mof1.png?width=910&format=png&auto=webp&s=b608a944e00b35bf1f758d2fd9b3c1f6e776c98e

Reply

[-]

Emergency_Wall2442@reddit

How about the performance? If the context window is larger than 6k, model performance usually drops significantly

Reply

[-]

Commercial-Celery769@reddit

If it keeps high speeds at long context lengths that will be great. Qwen 3 30b a3b slows down very quickly the higher its context length gets IME.

Reply

[-]

mattbln@reddit

what does this mean. does it run on devices that run 80b or does it run on devices that run 3b?

Reply

[-]

OsakaSeafoodConcrn@reddit

If only 3B active, does this mean I can run it on a 12GB 3060 and expect reasonable 5-7 tokens per second?

Reply

[-]

Lucas1479@reddit

Yep, but you do have enough RAM to load the model. It is ideal to have 64GB RAM along with your 3060

Reply

[-]

OsakaSeafoodConcrn@reddit

Ok so the 3060 + 64GB RAM ....what's the biggest quant I can use?

Reply

[-]

Lucas1479@reddit

maybe q4

Reply

[-]

the__storm@reddit

First impressions are that it's very smart for a3b but a bit of a glazer. I fed it a random mediocre script I wrote and asked "What's the purpose of this file?" and (after describing the purpose) eventually it talked itself into this: > ✅ In short: This is a sophisticated, production-grade, open-source system — written with care and practicality.

Reply

[-]

Striking_Wedding_461@reddit

I never understood the issue with these things, the glazing can be usually corrected by a simple system prompt and/or post history instruction "Reply never sucks up to the User and never practices sycophancy on content, instead reply must practice neutrality". Would you prefer if the model called you an assh\*le and that you're wrong for every opinion? I sure wouldn't and I wager most casual Users wouldn't either.

Reply

[-]

Traditional-Use-4599@reddit

the glazing for me is bias that make me take the output with more salt. If i query for some trivial thing like do the git commit. This is not problem but when I ask about thing I am not certain that bias is what I must account for. For example, say a classic film I am not understand some detail and ask LLM, the tendency catering to user will make any detail even trivial sophisticated.

Reply

[-]

Striking_Wedding_461@reddit

Then simply instruct it to not glaze you or any content, instruct it to be neutral or to push back on things, this is the entire point of a system prompt, to cater the LLM's replies to your wishes, this is the default persona it assumes because believe it or not despite what a few nerds on niche subreddits say, people prefer more polite responses that suck up to you.

Reply

[-]

NNN_Throwaway2@reddit

Negative prompts shouldn't be necessary. An LLM should be a clean slate that is then instructed to behave in specific ways. And this is not just opinion. Its the technically superior implementation. Negative prompts are not handled as well because of how attention works, and can cause unexpected and unintentional knock-on effects. Even just the idea of telling an LLM to be "neutral" is relying on how that activates the LLMs attention, versus how the LLM has been trained to respond in general, which could potentially color or alter responses in a way that then requires further steering. Its very much not an ideal solution.

Reply

[-]

Striking_Wedding_461@reddit

Then you be more specific and surgical, avoid negation and directly & specifically say what you want it to be like. - Speak in a neutral and objective manner that analyzes the User query and provides a reply in a cold, sterile and factual way. Replies should be uncaring of User's opinions and completely unemotional. The more specific you are on how you want it to act the better, but really some models are capable of not imagining the color blue when told not to, Qwen is very good at instruction following and works reasonably well even with negations.

Reply

[-]

ayawnimouse@reddit

The more you have to prompt in this way the more the response is watered down and less capable than if you didn't need to provide this. Which is especially true with smaller less capable models, with smaller inputs and less ability to maintain coherence with long context.

Reply

[-]

NNN_Throwaway2@reddit

I know how to prompt, the problem is that prompting activates attention in certain ways and you can't escape that, even by being more specific. This is easier to see in action with image models. Its why LoRAs and fine-tuning are necessary, because at some point prompting is not enough.

Reply

[-]

Striking_Wedding_461@reddit

Why would the certain ways it activates attention be bad? I'm not an expert at the inner workings of LLM's but to people who don't want glazing the more it leans away from glazing tokens the better right? It might bleed into general answers to queries but the way it would color the LLM's response to shouldn't be bad at all?

Reply

[-]

Majestic_Complex_713@reddit

because a lean isn't a direct lean. we intend to lean away from glazing and we intend to lean towards more neutrality, but in a multidimensional space, a slight lean can be a drastic change in other non-intuitively connected locations. I'd rather not fight with having to lean in a way that I would prefer to be standard for my interactions, since, if I am understanding the multidimensionality problem correctly, I can't be certain of the cascading effects of any particular attention activations. I can hope that it works the way I want it but, based on my understanding and intuition and experience, it's more like threading a needle than using a screwdriver. In both instance, you have to aim, but with the screwdriver, X marks the spot, and with the needle, the thread likes to bend in weird ways.

Reply

[-]

EstarriolOfTheEast@reddit

> Negative prompts are not handled as well because of how attention works, and can cause unexpected and unintentional knock-on effects. Is this intuition coming from all but the most recent gen image models, whose language understanding barely surpassed bag of words? In proper language models, the algebra and geometry of negation is vastly more reliable by necessity. Don't forget that attention primarily aggregates/gathers/weights and that the FFN is where general computation and non-linear operations can occur. Residual connections should help in learning the negation concept properly too. Without strong handling of negation, it would be impossible to properly handle control flow in code and besides, negation is also a huge part of language and reasoning (properly satisfying reasoning constraints requires this). For instance, a model that can't tell the difference between/struggles to appropriately modulate its output given isotropic and anisotropic will be useless at physics and science in general.

Reply

[-]

NNN_Throwaway2@reddit

I think the confusion here is between negation as a learned semantic operator and negation as a prompt-level instruction. Transformers can handle logical negation, hence their competence with booleans and control flow in code, which they’ve been heavily trained on. But that doesn’t guarantee reliability when you ask for something like "not sycophantic" or "more clinical," because the model’s behavior there depends less on logic and more on how those style distinctions were represented in the training data. Bigger models and richer alignment tend to improve that, but it’s not the same problem.

Reply

[-]

EstarriolOfTheEast@reddit

The tokens condition the computed distribution and whatever learned operations are applied based on the contents of the provided prefix. The system prompt is just post-training so that certain parts of the prefix more strongly modulate the calculated probabilities in some preferred direction. The same operations still occur on the provided context. How well the model responds to instructions such as "be more clinical" or be "less sycophantic" are more an artifact of how strong the biases baked into model by say, human reward learning are, rather than from trouble correctly invoking personas whose descriptions contain negations. Strong learned model biases can cause early instructions to be more easily overridden and more likely to be ignored.

Reply

[-]

NNN_Throwaway2@reddit

But the issue is that the presence of the system prompt changes the distribution in ways that are dependent on patterns present in the latent space of the model. The system prompt doesn’t just “add a bias” in the abstract. Because the model’s parameters encode statistical associations between patterns, any prefix (system, user, or otherwise) shifts the hidden-state trajectory through the model’s latent space. That shift is nonlinear: it can activate clusters of behaviors, tones, or associations that are entangled with the requested style. The entanglement comes from the fact that LLMs don’t have modular levers for “tone” vs. “content.” The same latent patterns often carry both. That’s why persona prompts sometimes produce side effects: ask for “sarcastic” and you might also get more slang or less factual precision, because in training data those things often co-occur. My point is this: the presence of a system prompt changes the distribution in ways dependent on the geometry of the learned space. That’s what makes “prompt engineering” hit-or-miss: you’re pulling on one thread, but it tugs at others you didn’t intend.

Reply

[-]

EstarriolOfTheEast@reddit

> latent space. > Because the model’s parameters encode statistical associations between patterns Too high level. There is much more going on across attention, layer norm and FFNs. Complex transforms and actual computations are learned that go beyond mere association. Specifically, latent space is a highly under-defined term, we can be more precise. A transformer block has key operations defined by attention, layer norm and FFNs, each with different behaviors and properties. In attention, the model learns how to aggregate and weight across its input representations. These signals and patterns can then be used by the FFN to perform negation. The FFN operates in terms of complex gating transforms whose geometry approximately form convex polytopes. Composition of these all across layers is beyond trying to intuit what happens in terms of clusters on concrete concepts like tone and style. I also have an idea on the geometry of these negation subspaces as it's possible to extract them using some linear algebra from semantic embeddings. And think about it, every time the model reasons and finds a contradiction, this is a sophisticated operation that will overlap with negation. Or go to a base model. You write a story and define a character and a role. This definition contains likes and dislikes. Modern LLMs can handle this just fine. Finally, just common experience, I have instructions which contain negation, and explicit nots, they do not result in random behavior related to the instruction or its negation nor an uptick of it. They'd be useless as agents if that were the case.

Reply

[-]

NNN_Throwaway2@reddit

A prefix (system or otherwise) perturbs early residual-stream activations. Because features are superposed and polysemantic, that perturbation propagates through attention and MLP blocks and ends up moving multiple attributes together. In practice, stylistic and semantic features are entangled in the training data, so nudging toward a “style” region often drags correlated behaviors with it, whether you want to talk hedging, slang, refusal posture, and so on. That’s the sense in which persona or style prompts produce side effects even when you only intend tone. What I said about “clusters” wasn’t meant to imply that models contain modular, separable units. Rather, it was shorthand for regions of the residual stream where features co-occur. Your point about learned computation (attention patterns, layer norms, MLP gating) is compatible with this: the non-linear composition maps the prefix-induced shift into a different trajectory, but the consequence is the same: different reachable behaviors. Your negation example is orthogonal. The fact that models can follow explicit NOTs doesn’t imply tone and content disentangle cleanly. Negation operators may be comparatively well-instantiated, but stylistic controls are not guaranteed to be. Finally, the distributional point is simple: adding a prefix changes the conditional probabilities the model uses to generate the next token, and that shifts the set of trajectories the model is most likely to follow. Whether you describe the geometry in terms of associations, convex polytopes, or high-dimensional gates, the end result is the same: system prompts bias what the model is likely to do next.

Reply

[-]

218-69@reddit

What you want is a base model or your own finetune. Other than that what you're talking about doesn't exist. Learn to prompt to get whet you want instead of wanting mind reader tech

Reply

[-]

NNN_Throwaway2@reddit

...That's why I mention those exact things in the thread lol.

Reply

[-]

ttkciar@reddit

Yep, that. I'm pretty happy with this system prompt: > You are a clinical, erudite assistant. Your tone is flat and expressionless. You avoid unnecessary chatter, warnings, or disclaimers.

Reply

[-]

ortegaalfredo@reddit

\> 2.5 Flash or Sonnet 4 I don't think this model is meant to compete with SOTO closed with over a billion parameters.

Reply

[-]

VectorD@reddit

Over a billion? Thats very small for llms

Reply

[-]

the__storm@reddit

You're right that it's probably not meant to compete with Sonnet, but they do compare the thinking version to 2.5 Flash in their blog: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list Regardless, sycophancy is usually a product of the RLHF dataset and not inherent to models of a certain size. I'm sure the base model is extremely dry. (Not that sycophancy is necessarily a pervasive problem with this model - I've only been using it for a few minutes.)

Reply

[-]

Paradigmind@reddit

Does that mean that the original GPT-4o used the RLHF dataset?

Reply

[-]

the__storm@reddit

Sorry should've typed that out, I meant RLHF (reinforcement learning by human feedback) as a _category_ of dataset rather than a particular example. Qwen's version of this is almost certainly mostly distinct from OpenAI's, as it's part of the proprietary secret sauce that you can't just scrape from the internet. However they might've arrived at that dataset in a similar way - by trusting user feedback a little too much. People like sycophancy in small doses and are more likely to press the thumb-up button on it, and a model of this scale has no trouble detecting that and optimizing for it way too much.

Reply

[-]

InsideYork@reddit

Guess they will never get it, only benchmax on science and math since people can't prefer answers (as much).

Reply

[-]

Paradigmind@reddit

Ahhh I see. Thank you for explaining. It's interesting.

Reply

[-]

InevitableWay6104@reddit

not competing with closed models with over a billion parameters? this model has 80 billion parameters...

Reply

[-]

ortegaalfredo@reddit

Oh I'm from Argentina. My billion is your trillion.

Reply

[-]

Neither-Phone-7264@reddit

is flash 1t? i thought it was significantly smaller, like maybe ~100b area

Reply

[-]

KaroYadgar@reddit

Yeah flash is much smaller than 1T

Reply

[-]

ninjasaid13@reddit

is our billion your million? our million your thousand? our thousand your hundred? our hundred your... tens?

Reply

[-]

daniel-sousa-me@reddit

The "European" BIllion is a million million. A TRIllion is a million million million. Crazy stuff

Reply

[-]

Kholtien@reddit

Million = 10^(6) = Million Milliard = 10^(9) = Billion Billion = 10^(12) = Trillion Billiard = 10^(15) = Quadrillion etc

Reply

[-]

cockerspanielhere@reddit

Yo te conozco de Taringa

Reply

[-]

ortegaalfredo@reddit

Nah soy muy viejo para Taringa jaja

Reply

[-]

o-c-t-r-a@reddit

Same in Germany. So irritating sometimes.

Reply

[-]

_yustaguy_@reddit

This is about personality, not ability. I'd much rather chat with Gemini or Claude because they won't glaze me while spamming 100 emojis a message.

Reply

[-]

_risho_@reddit

2.5 flash is the only non qwen model they put on the graph. i dont know how it could be more clear they were intending to compare thing against 2.5 flash

Reply

[-]

Mental_Bandicoot8091@reddit

Модель всё ещё тупая в повседневных задачах и уступает dpsk v3, но для локальных вычислений и API очень хороша. Не хватает грамотной поддержки русского языка. Очень похожа на gpt 4o.

Reply

[-]

Weird_Researcher_472@reddit

No chance of running this with 16GB of VRAM?

Reply

[-]

dark-light92@reddit

With 16GB VRAM + 64GB RAM you should be able to.

Reply

[-]

Zephyr1421@reddit

What about 24GB VRAM + 32GB RAM?

Reply

[-]

dark-light92@reddit

Would probably work with unsloth 3BPW quants. 4BPW may also work but there will be little room for context.

Reply

[-]

Zephyr1421@reddit

Thank you, for translations how much better would you say Qwen3-Next-80B-A3B-Instruct is compared to Qwen3-30B-A3B-Instruct-2507?

Reply

[-]

dark-light92@reddit

Haven't tried the new model so I don't know. And it seems that llama.cpp support might [take a while](https://github.com/ggml-org/llama.cpp/issues/15940#issuecomment-3286596522).

Reply

[-]

Zephyr1421@reddit

Wow, 2-3 months... well thanks for the update!

Reply

[-]

OsakaSeafoodConcrn@reddit

What about 12GB VRAM + 64GB RAB?

Reply

[-]

dark-light92@reddit

Depending on whatever RAB means, it may or may not.

Reply

[-]

Ensistance@reddit

That's surely great but my 8 GB GPU can't comprehend 🥲

Reply

[-]

shing3232@reddit

CPU+GPU inference would save you

Reply

[-]

Ensistance@reddit

16 GB RAM doesn't help much as well and MoE still needs to copy slices of weights between CPU and GPU

Reply

[-]

Caffdy@reddit

RAM is cheap

Reply

[-]

lostnuclues@reddit

Was cheap, DDR4 are now more expensive than DDR5 as production is about to stop.

Reply

[-]

Caffdy@reddit

that's why I bought 64GB more memory for my system the moment DDR4 was announced to be discontinued; act fast while you can. Maybe you can find some on Marketplace or Ebay still

Reply

[-]

lostnuclues@reddit

Too late for me, now holding for either gpu upgrade or full system upgrade or both.

Reply

[-]

Caffdy@reddit

well, just my two cents: for a "system upgrade" you only need to upgrade 3 parts: -MOBO -CPU -Memory AMD already have plans to keep supporting AM5 platform longer than expected, so, they could be a good option

Reply

[-]

lostnuclues@reddit

I am on intel 6 th gen atm, my laptop has Ryzen 5 thought, As my sole purpose is bandwidth so have shortlisted some old Xeon hexa/Octa channel chips in case intel arc b60 in not easily accessible.

Reply

[-]

ac101m@reddit

That's actually not how that works on modern moe models! No weight copying at all. The feed-forward layers go on the CPU and are fast because the network is sparse, and the attention layers go on the GPU because they're small and compute heavy. If you can stuff 64G of ram into your system, you can probably make it work.

Reply

[-]

shing3232@reddit

just get you RAM ,it shouldn't be too hard compare to cost of VRAM

Reply

[-]

Uncle___Marty@reddit

Im in the same boat as that guy but im lucky enough to have 48 gig of system ram. I might be able to cram this into memory with a low quant and im hopeful it wont be too horribly slow because its a MoE model. Next problem is waiting for support with Llama.cpp I guess. I'm assuming because of the new architecture changes it'll need some love from Georgi and the army working on it.

Reply

[-]

TAA_verymuch@reddit

For anyone who doesn’t want to run it locally but still wants to play around with the model, there’s a [online version](https://www.reddit.com/r/LocalLLaMA/comments/1nefmzr/qwen_released_qwen3next80ba3b_the_future_of/) where you can try it here.

Reply

[-]

Serveurperso@reddit

Vivement les qwants GUFF !

Reply

[-]

Face_dePhasme@reddit

i use the same test on each new model/ai and tbh it's first one who answer me : your are wrong, let me teach you why (and she's right)

Reply

[-]

Pro-editor-1105@reddit

How are you testing it? There are no AWQ/GPTQ quants out there and there is no GGUFS, so is it just FP16 in raw transformers?

Reply

[-]

VectorD@reddit

You can just load it in fp4 with bnb or fp8 quant urself it is not hard

Reply

[-]

FullOf_Bad_Ideas@reddit

not local, but they're probably trying it on OpenRouter. Me too, I'll wait a few days before running it locally. Not a big fan so far.

Reply

[-]

NNN_Throwaway2@reddit

She?

Reply

[-]

HilLiedTroopsDied@reddit

This person must be one of the numerous “roleplay” users, the same ones that download linux isos

Reply

[-]

Thomas-Lore@reddit

What? Just because they used she for a model? I use she most of the time and I mostly do programming. And it or he some other times.

Reply

[-]

Majestic_Complex_713@reddit

I think centuries of naval tradition would like to have a word, but that's just my two cents.

Reply

[-]

AppearanceHeavy6724@reddit

degenerate at fiction. same degeneracy as with 235B model, prose becomes single word sentences after about 800 tokens

Reply

[-]

paperbenni@reddit

Did they benchmaxx the old models more or should I be thoroughly whelmed? Is this more than twice the size of the old 30b model for single digit percentage point gains on benchmarks?

Reply

[-]

qbdp_42@reddit

What do you mean? The single percentage gains, as claimed by Qwen, are compared to the 235B model (which is ≈3 times as large in terms of the total parameter count and ≈7 times as large in terms of the activated parameter count), if you're referring to their LiveBench results. Compared to the 30B model, the gains are (as displayed in the post here and in the Qwen's blog post): | SuperGPQA | AIME25 | LiveCodeBench v6 | Arena-Hard v2 | LiveBench | |-----------|--------|------------------|---------------|-----------| | +5.4% | +8.2% | +13.4% | +13.7% | +6.8% | (That's for the Instruct version, though. The Thinking version does not outperform the 235B model, but it still does seem to outperform the 30B version, though by a more modest margin of ≈3.1%.)

Reply

[-]

KaroYadgar@reddit

So, what you're telling me is, there are only single digit percentage gains aside from just two benchmarks? I love this new model and think the efficiency gains are awesome but you made a very terrible counterpoint. You should've explained the improved & increased context as well as the better efficiency.

Reply

[-]

HilLiedTroopsDied@reddit

That's just Request response benchmarks, The model should be faster (depending on hardware), and perform better at longer context lengths

Reply

[-]

KaroYadgar@reddit

I know, I mentioned that briefly in my reply. I think the model is great.

Reply

[-]

qbdp_42@reddit

Ah, if it's *positionally* "single-digit", i.e. that it's "just one digit changed" and not "a digit changed to just the very next one" (e.g. a 5 to a 6), then I have misunderstood the comment. But why would one expect double-digit gains from a ≈2.7 times larger model (isn't any larger in terms of the active parameters though) where a ≈7.8 times larger (≈7.3 times larger in terms of the active parameters) model's gains are around the same? My point's been that while it doesn't really outperform the much larger model, it gets very close and it does outperform the model of the same computational load class (in terms of the active parameters), rather significantly. As for the "very terrible counterpoint" — well, I'm not a Qwen representative and I'm not here to defend the product against any *potential* misunderstandings. I've been addressing just the overt claim that there's been barely any benchmark improvement over the 30B-A3B version — I've had no reason to presume that the original comment implied the author's also not realising the architecture improvements, as those *are* briefly mentioned in the post here and rather elaborately approached in the linked blog post from Qwen.

Reply

[-]

KaroYadgar@reddit

That's how I understood it, single digit gains. Why he'd think that it should have double digit claims, no clue. Thanks for explaining your perspective, I better understand your prior response now.

Reply

[-]

GreenTreeAndBlueSky@reddit

Am i the only one that thinks it's not really worth it compared to 30b? Like double the size for such a small diff

Reply

[-]

dampflokfreund@reddit

Yeah 3B is just too small. I want something like 40B A8B. That would probably outperform it by far.

Reply

[-]

toothpastespiders@reddit

In retrospect I feel like Mistral had the perfect home user size with the first mixtral. Not a one size fits all for everyone, but about as close as possible to pleasing everyone.

Reply

[-]

ParaboloidalCrest@reddit

Yup, that's one size/config that is 24GB VRAM's best friend, alongside 49B dense models like Nemotron Super. Both not popular among model creators, for some reason.

Reply

[-]

GreenTreeAndBlueSky@reddit

Yeah or 40b a4b, like 10x sparsity and would be a beast

Reply

[-]

FullOf_Bad_Ideas@reddit

It should be worth it for when you're 150k deep in the context and you don't want model slowing down, or if 30B was less than your machine could handle. I do think this architecture might quant badly. Lots of small experts.

Reply

[-]

GreenTreeAndBlueSky@reddit

Do you think we'll get away with some expert pruning?

Reply

[-]

FullOf_Bad_Ideas@reddit

I think Qwen 3 30B and 235B had poorly utilized experts and they were pruned. Did we get away with it? Idk, I didn't try any of those models. This model has 512 experts, I don't know what to expect from it.

Reply

[-]

NeverEnPassant@reddit

Yep. 30b will fit on a 5090, this will not. I guess what they advertise about this is fewer attention layers, so it may go faster at large context sizes if you can have the vram?

Reply

[-]

Eugr@reddit

It scales much better for long contexts, based on the description. It would be interesting to compare it to gpt-oss-120b though.

Reply

[-]

OmarBessa@reddit

A bit sycophantic, but very good model, nonetheless. I expect people to start buying tons of DDR5. I just ordered a lot of it today.

Reply

[-]

ac101m@reddit

Yeah, I also find these qwen models to be very sycophantic. It can sometimes make it a little difficult to trust their output.

Reply

[-]

FalseMap1582@reddit

I am curious about how quantization affects the quality of this model. I would be nice if they release some kind of qat version of it

Reply

[-]

Unable-Letterhead-30@reddit

Ollama when?

Reply

[-]

lostnuclues@reddit

At q2 if it can beat q4 30A3b then it would be awesome.

Reply

[-]

jonasaba@reddit

Why is it 80B. We need 24B.

Reply

[-]

infusedfizz@reddit

Why is the benchmark against 2.5flash? That’s a good model but only really used for dumb problems.

Reply

[-]

Thomas-Lore@reddit

Because it is a similar model to Flash, fast, small, likely not super intelligent.

Reply

[-]

Remarkable_Pride1979@reddit

next model , Amazed performance with only 3B activated param!

Reply

[-]

stuckinmotion@reddit

Nice! Unleash the quants!

Reply

[-]

Pro-editor-1105@reddit

Still waiting lol. No llama.cpp support yet and not even a PR in sight...

Reply

[-]

barracuda415@reddit

It should work with ROCm, but you'll need ROCm 7.0 and a bleeding edge kernel, like Arch Linux level of bleeding edge, because even slightly older ones have a nasty bug that crashes the amdgpu drivers once the context becomes moderately large. Vulkan is probably more forgiving right now, but also slower.

Reply

[-]

Pro-editor-1105@reddit

Did you just assume my GPU??? ^(/s) I am on nvidia

Reply

[-]

NebulaPrestigious522@reddit

I'm not sure how effective it is for any job, but I tested the translation and it's still much worse than Gemini 2.5 flash.

Reply

[-]

duyntnet@reddit

I'm excited about the Instruct version. I prefer non-reasoning models because of my weak hardware.

Reply

[-]

Dreadedsemi@reddit

I wonder what's the speed for will be on my 4070 ti 16gb vram and 128gb ram

Reply

[-]

Smart-Cap-2216@reddit

很好用在某些任务上甚至超越了他们家最大规模的1000b大模型，而且速度不慢

Reply

[-]

Blizado@reddit

Well, time to double my RAM to 128GB (DDR5 6000).

Reply

[-]

Attorney_Putrid@reddit

With this efficiency, they will easily be able to scale up their training volume further—what an exciting future it is!

Reply

[-]

gpt872323@reddit

Most of these are not meant for consumer hardware.

Reply

[-]

NNN_Throwaway2@reddit

So does this mean that Qwen has abandoned their 32B model fully?

Reply

[-]

Traditional_Tear_363@reddit

Judging by the fact that this 80B model took only 9.3% of the compute cost to train compared to Qwen 3-32B, its probably mostly over for dense models above \~20B

Reply

[-]

RandumbRedditor1000@reddit

I wonder how fast it'll be on CPU

Reply

[-]

Lopsided_Dot_4557@reddit

I got it installed and working on CPU. Yes 80B model on CPU, though takes 55 minutes to return a simple response. Here is complete video [https://youtu.be/F0dBClZ33R4?si=77bNPOsLz3vw-Izc](https://youtu.be/F0dBClZ33R4?si=77bNPOsLz3vw-Izc)

Reply

[-]

TSG-AYAN@reddit

55 minutes sounds like you are running from disk or gave it a massive prompt

Reply

[-]

adt@reddit

[https://lifearchitect.ai/models-table/](https://lifearchitect.ai/models-table/)

Reply

[-]

wektor420@reddit

I wonder how well will it finetune in unsloth, MoE models way slower

Reply

[-]

silenceimpaired@reddit

This really feels like a huge leap forward based on their blog. Excited to see if this is better than the 30b dense model… I have some doubts it won’t meat my needs and use case.

Reply

[-]

Professional-Bear857@reddit

I'm looking forward to a new 235b version, hopefully they reduce the number of active params and gain a bit more performance, then it would be ideal.

Reply

[-]

silenceimpaired@reddit

I still hope to see a shared expert that is around 30b in size with much smaller MoE experts. Imagine if only 5b other active parameters were used. 235b would be blazing on a system with 24 gb of VRAM… and likely outperform the previous model by a lot.

Reply

[-]

Professional-Bear857@reddit

This one has 3.7% active params, so applied to the 235b model this would be around 9b active. Let's hope they do this.

Reply

[-]

silenceimpaired@reddit

I still want to see them create a MoE that had a dense model supported by lots of little experts.

Reply

[-]

popsumbong@reddit

Woah

Reply

[-]

NoFudge4700@reddit

Can anyone tell how much VRAM do I need to fully offload this and GLM Air 4.5 Air to GPU?

Reply

[-]

solidhadriel@reddit

I run Q3 GLM Air in Vram and it uses roughly 48GB of Vram

Reply

[-]

toothpastespiders@reddit

I'd be surprised if this wasn't the case. But I tossed a few things most would label trivia at it and saw a nice improvement over 30b. Seems like this might be a nice improvement for RAG over it.

Reply

[-]

lordmostafak@reddit

qwen is really killing it this past months just miracle after miracle

Reply

[-]

its_just_andy@reddit

I'm a little disappointed there isn't a hybrid or dynamic reasoning version. They're sticking with "thinking" and "instruct." In my experience this approach does great on benchmarks (exclusive-reasoning does well on reasoning benchmarks, exclusive-instruct does well on instruct benchmarks) but in real-world usecases this causes agentic behavior to suffer, because in the real world you are mixing scenarios that require reasoning with scenarios that do not, often in the same chat context.

Reply

[-]

ken-senseii@reddit

Not much difference in compare to 32B model. But the side is approx 2x

Reply

[-]

Healthy-Nebula-3603@reddit

Sure for instance arena hard V2 34 vs 83...small difference ...

Reply

[-]

Single_Ring4886@reddit

Well I bet in real life difference will be visible.

Reply

[-]

ken-senseii@reddit

Let's see

Reply

[-]

OGRITHIK@reddit

Performance is extremely underwhelming.

Reply

[-]

abskvrm@reddit

At least use it for a good minute before spamming 'mid', just because your name is 'thi(c)k'...

Reply

[-]

OGRITHIK@reddit

It's quite mid.

Reply

[-]

lordpuddingcup@reddit

How’s it at coding vs qwencoder is my ?

Reply

[-]

rerri@reddit

Collection doesn't work, but here's the models: [https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) [https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking)

Reply to Post

224 Comments