Qwen3.6-35B-A3B released!
Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 717 comments
Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀
A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
- Agentic coding on par with models 10x its active size
- Strong multimodal perception and reasoning ability
- Multimodal thinking + non-thinking modes
Efficient. Powerful. Versatile.
Blog: https://qwen.ai/blog?id=qwen3.6-35b-a3b
Qwen Studio: chat.qwen.ai
HuggingFace: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
ModelScope: https://modelscope.cn/models/Qwen/Qwen3.6-35B-A3B
ThePirateParrot@reddit
Here we go again with hours of testing and optimisation. But i wont complain!
PassengerPigeon343@reddit
Exactly my thought too! Love to see it but now I have work to do…
Borkato@reddit
Not just that, but updates to llama cpp, then unsloth will say “ok we fixed it”, then that one guy will say “actually I found a bug at layer 73927228, please update” and unsloth will say “ok guys we fixed it for real” so we download, and then qwen will release a new template and a token will be changed, and then unsloth will say ok guys we fixed it for realsies I promise, and then we download and then llama.cpp comes out and says that actually tool calls are broken and…
Zeeplankton@reddit
Lmao
contrebandeco@reddit
I think we're at the point just before llama.cpp finally admits their GBNF autoparser broke the tool call JSON output and they are still trying to blame CUDA 13.2 for it. Yay ? Nay ?
Borkato@reddit
I have no idea what a GBNF auto parser is, but sure :p
contrebandeco@reddit
I'm talking about this: https://www.reddit.com/r/LocalLLaMA/comments/1rmp3ep/llamacpp_now_with_automatic_parser_generator/
And it does break tool calling: https://github.com/ggml-org/llama.cpp/issues/21771
But they've still not confirmed it. I've hit that problem myself, however. With the firecrawl MCP server.
dabiggmoe2@reddit
That's why I waited for bartowski's quants before downloading lol
ab2377@reddit
😂👆
PassengerPigeon343@reddit
This is so accurate it hurts
rm-rf-rm@reddit
Please share once you've optimized! This is crucial to broader adoption as many won't spend the time / don't have the time.
viperx7@reddit
I don't think it would require testing because it's exactly the same model as 3.5 just trained for a bit longer
Joozio@reddit
Running a 35B model locally is genuinely viable now. I swapped my Mac Mini M4 from Gemma 3 to a different 35B config last month and the difference in reasoning depth for structured tasks was noticeable. The memory bandwidth on M4 is the real unlock - 120GB/s means you're not CPU-bottlenecked at these sizes anymore. Wrote about the whole setup and what the swap changed: https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026
mattabott@reddit
I'm waiting for the tiny ones!
Whole-Net-8262@reddit
Do people use these open models as a personal coding assistant too?
gajesh2007@reddit
Is it actually better than 27B dense model?
gamerendres@reddit
And how does it compare with Llama?
Kodix@reddit
Well this seems absolutely lovely. What a good couple months for local LLMs, huh?
astral_crow@reddit
It’s been a good couple years.
FaceDeer@reddit
Indeed, the pre-2023 models were kinda crappy. They've been much more capable since then.
No_Afternoon_4260@reddit
What models are you talking about pre-2023? Gpt2?
FaceDeer@reddit
Of course not, GPT-2 is far too dangerous to release to the general public.
I was using Markov chain generators back then.
metamec@reddit
> Of course not, GPT-2 is far too dangerous to release to the general public.
https://i.redd.it/a9k6lxer9qwg1.gif
TheToi@reddit
Well in a way, he was right.
Just look at the crap it’s caused with the data centers and the shortages. 😅
aeroumbria@reddit
More like "too dangerous to be hoarded and sold as snake oil to people with no business using them"...
oulu2006@reddit
what an interesting post!! thanks for referencing that historical prediction
No_Afternoon_4260@reddit
Really cool, what did you use them for?
FaceDeer@reddit
For laughs. I had an archive of all of my Reddit comments, I created an automatic FaceDeer Comment Generator. Haven't touched it in a few years, but I found some old outputs from it:
The sad thing is I can remember what kinds of comments it was pulling from for much of that. :)
the__storm@reddit
I started on Flan-T5 - an encoder-decoder model. It was not smart, but at the time still felt basically like black magic for NLP.
fuck_cis_shit@reddit
BERT, GPT-J, GPT-NeoX
Zolty@reddit
Before the 1900s you wouldn't believe how bad they were.
Borkato@reddit
I’m so fucking happy. This was the ONE model I wanted the 3.6 version of the most, because 3.5 is so absolutely insanely mindblowingly good for me with local coding that I’m thinking it can’t possibly get better and still be local.
(Not that it’s the best model, obviously, just that it’s the model that had me say “damn, i can actually do agentic coding now, even if it flubs sometimes”)
IrisColt@reddit
mother of God...
ea_nasir_official_@reddit
If that's the case, imagine how good the 27b 3.6 is gonna be :o
rumblemcskurmish@reddit
We're on a timeline where a 0b model will eventually 1 shot everything. I can't wait.
power97992@reddit
qwen 3.5 27b was okay, still way worse than glm 5.1/5.0 and sonnet 4.6, even worse than gemini 3.0 flash and minimax 2.7.... In fact, qw 3.5 397b is probably worse than minimax 2.7/2.5
Borkato@reddit
This is like comparing a tank to a revolver and complaining it sucks…
IrisColt@reddit
This
Fit-Palpitation-7427@reddit
So glm 5.1 seems to be your preferred LLM? I really like codex 5.4 xhigh, but for local inference I'm on qwen 27b because of the 24GB on my 4090. Think there is something better at my disposal for 24GB?
Rare_Potential_1323@reddit
Goonies ending 😳
jax_cooper@reddit
capable like 27b but fast as 35b? dayum
layer4down@reddit
27B probably won’t be as fast… unless DFlash or DDTree is used which would indeed be insane! Right now I got DFlash working for 27B and it’s a genuine 2-4x performance boost with no tradeoffs so far.
po_stulate@reddit
Did you try it with long context or agentic coding? How's the acceptance rate for these? I saw their github issue saying that the model isn't trained on these data so the acceptance rate will be low?
layer4down@reddit
Personally I’m finding the acceptance rate to be high at lower ctx_length. (38.5tps DFlash vs 10.5tps baseline). My OpenCode sysprompt + tools alone is like ~30k at the moment so I’m going to give Qwen3.6-35B-A3B-BF16 a try instead. Frankly Qwen3.5-27B-BF16 works fine for me at ~10tps TG so it’s not a loss I’m just trying to see if I can improve my bang for buck without quality loss.
Main_Secretary_8827@reddit
What's DFlash?
layer4down@reddit
DFlash (roughly speaking) is a flavor of speculative decoding that helps a model predict more tokens in a single forward pass to improve token generation speed without losing any quality. Up to 3.6x faster IIRC:
https://www.emergentmind.com/videos/dflash-block-diffusion-for-speculative-decoding-f31dc322
Caffdy@reddit
big if true
ArtfulGenie69@reddit
I'm sure this random guy who wants one of the small ass models knows exactly how good it is in the first 10m of the release.
my_name_isnt_clever@reddit
There is no way that after only a couple months and within the same training run they were able to get this MoE with 3b active to perform on-par with 27b active. It's like claiming a new bike can match a sports car, it just doesn't make sense.
Mart-McUH@reddit
I am sure it is not, those are some strange benchmarks where it leads. I suppose it might be better at some narrow tasks but in general no chance.
Key-Contact-6524@reddit
I heard somewhere that deepseek is coming too
A while ago a new unknown 1T-param model (pretty sure my ThinkPad can't run it) appeared for free on OpenRouter. There is good speculation that it is a new deepseek model.
Also judging by the timeouts, it seems like a deepseek model.
Sufficient_Prune3897@reddit
Deepseek has been a week away from releasing for 4 months
power97992@reddit
Deepseek will release v4/3.5 when it is ready to serve at scale cheaply, has finished training on Ascends, and is almost as good as the newest best publicly available gpt/claude model on benchmarks (actual performance might be worse).
Key-Contact-6524@reddit
Probably some issues with the Chinese government, I believe.
They probably want them to run on some locally developed compute chip (speculation btw).
The best we can do is wait till next week lmao
Fit-Palpitation-7427@reddit
I heard it’s gonna be released next week 🫣
Worth_Contract7903@reddit
As soon as Iran gets a nuclear weapon.
Long_comment_san@reddit
Dude, the whole of Iran doesn't write as many Iran comments as you do
Thomas-Lore@reddit
They've been cut off from the internet by the regime for more than a month now. So almost no comments, unless they smuggle in a Starlink, for which they may be killed if found out.
QuinQuix@reddit
So what was it then?
Key-Contact-6524@reddit
Mimo (Xiaomi)
Fantastic-Emu-3819@reddit
Could be KIMI K2.6
Key-Contact-6524@reddit
Apparently it was one of the Xiaomi models
Middle_Bullfrog_6173@reddit
The 1T model ended up being MiMo-V2-Pro, a closed model. Deepseek rumors abound, but so far it's been a cycle of "imminent release" followed by a few weeks of silence.
Key-Contact-6524@reddit
Ahh fuck
AppealSame4367@reddit
If llama.cpp adds dflash one day, it's game over for cloud coding agents.
Pyros-SD-Models@reddit
Wait, wasn't the top thread yesterday how the golden age of LLMs is now over
my_name_isnt_clever@reddit
That's been claimed since 2024.
BassNet@reddit
Open source stable diffusion is still getting wrecked
ansmo@reddit
ERNIE, anima preview 3, LTX distill 1.1 this week?
BassNet@reddit
I think Klein is better than Ernie. Also LTX is still terrible with motion, no LoRA has been able to fix that yet. There is still nothing like Kling or Seedance open source.
RebekkaMikkola@reddit
Benchmarks look solid but I’m always a bit cautious with MoE models. They tend to shine on evals more than real-world workflows. Curious if anyone’s tried this in an actual dev setup yet.
PlanetPhaelon@reddit
Qwen is quickly becoming my favorite to run locally... just really need a better GPU. Have a 3080 now, but I need a 3090 to really run Qwen with enough params to really cook.
Inside-Cantaloupe233@reddit
The model is pretty bad like most local models; it makes way too many mistakes and the hallucinations are non-stop in LM Studio. Coding with it is like fighting with a senior coder who intentionally sabotages your code so you won't get better.
ChoiceLeft1686@reddit
wow! it's very impressive!!
AndreVallestero@reddit
I hope they release 3.6 122B to pressure Google to release their 124B model as well.
RedParaglider@reddit
Exactly. That's my assumption as well, that gemma 124b was held back because it outcompeted flash in some ways. A qwen 3.6 122b would be my daily driver for sure. I'd for sure switch from my qwen 3 coder next 80b.
year2039nuclearwar@reddit
What hardware are you running to fit a 120b? I've got a consumer mobo, so I think I'm stuck with max 48GB
AndreVallestero@reddit
Consumer boards can fit 2x RTX 6000 Pro, that would be 188GB of VRAM, which is enough for Q8 and a good amount of context
year2039nuclearwar@reddit
That's useful to know, thanks but in my country, that card is £9000 and there doesn't seem to be a used market yet. I think buying it at £6000 would work out as "sweet spot" for me so I guess I'll wait until then.
Then again buying it used for such a high value item is such a risk
Far-Low-4705@reddit
dude thats $20k...
Practical-Collar3063@reddit
I think the point he was making is that the limitation is not the mobo
Borkato@reddit
Right?! “Consumer board” lol
Still-Wafer1384@reddit
You should be able to stick more RAM in a consumer Mobo. I have rtx3090 and 64GB system RAM. I can run qwen3.5 122b Q4, be it at a lowly 8 or so tokens/s.
mxmumtuna@reddit
RTX6k on any motherboard will run 122b at max context with full gpu offload.
arbv@reddit
31B IMO, feels more like Pro as far as smartness goes, but of course, it has far less knowledge.
Blues520@reddit
I'm also looking for something to switch from qwen3-coder-next
Still-Wafer1384@reddit
How do you rate it vs qwen3.5 27b for coding?
Blues520@reddit
I've tried qwen 3.5 27b and gemma4 31b but still get better results with qwen3-coder-next on web dev tasks. I am hoping that a 3.6 coder model emerges.
Still-Wafer1384@reddit
How do you rate qwen3 coder next 80B to qwen3.5 27B for coding?
RedParaglider@reddit
I've never really had a reason to use 27b. I have a strix Halo so it's not very fast.
Far-Low-4705@reddit
if u can run qwen 3.5 122b, you should have already switched from qwen 3 coder next tbh
RedParaglider@reddit
Not as good. It's 10 t/s slower and doesn't keep up on agentic tasks as well in my use case. It's close though.
redditorialy_retard@reddit
damn bro I don't got another 3090 for that hahahaha
stoppableDissolution@reddit
31b is outcompeting flash already
TechnoByte_@reddit
Gemma 4 31B is quite close to Gemini 3 Flash so I'd be surprised if the 124B didn't outperform it
Daniel_H212@reddit
I doubt they'll be dangerously close to frontier models. There's like a 5x size gap. They will probably put the final nail in the coffin for all GLM air models and gpt-oss 120b though
VoiceApprehensive893@reddit
the biggest question: is the yapping fixed
Statcat2017@reddit
Give it a rest, Qwen is so obviously neurodiverse.
Ask Qwen to solve global warming and it will have you an answer in three minutes.
Say hello to it and you'll be waiting days for an answer as it argues with itself endlessly.
gmork_13@reddit
I never had this problem with 3.5, but 3.6 is an infinite looper on a lot of settings. fingers crossed something is wrong with this first release in some template or setting.
No_Swimming6548@reddit
I tried it on Qwen chat, still thinks a lot after a simple hello. It crushed a logic question sonnet failed though. I think Qwen models are tuned for agentic use and coding, not for rp or assistant purposes.
BreakfastAdept9758@reddit
god bless china
Asceny@reddit
Dang.. I hope they would release lower param versions..
Dependent-Aardvark32@reddit
wow, it is an impressive benchmark! :)
Key_Extension_2501@reddit
I don't understand, if this model is only 35b and 3b active, then why is it over 3x as expensive on the API than gpt-oss-120b which has 5b active?
wtfihavetonamemyself@reddit
Has anybody tried using a draft model with this, like a qwen 2b or 0.8b? Has it worked in llama.cpp? Noticeable gains?
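For reference, a rough sketch of what a llama-server launch with a draft model could look like (the model paths are placeholders, and the exact speculative-decoding flag names have shifted between llama.cpp versions, so check `llama-server --help` on your build):

```python
# Hypothetical llama-server launch with a small draft model for speculative decoding.
# Model paths are placeholders; flag names vary between llama.cpp versions.
import subprocess

subprocess.run([
    "llama-server",
    "-m",  "Qwen3.6-35B-A3B-Q4_K_M.gguf",   # main model (placeholder path)
    "-md", "Qwen3.5-0.8B-Q8_0.gguf",        # small draft model (placeholder path)
    "--draft-max", "16",                    # max tokens drafted per step
    "--draft-min", "1",                     # minimum draft tokens before speculating
    "-ngl", "99",                           # offload main model layers to GPU
    "-ngld", "99",                          # offload draft model layers to GPU
    "--port", "8080",
])
```

Gains depend heavily on how often the big model accepts the small model's drafts, so it's worth benchmarking on your own prompts.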
julianmatos@reddit
Can confirm, the jump from 3.2 to 3.6 is noticeable. I've been using it for code review and doc summarization tasks that used to feel like a stretch for local models.
If anyone's wondering whether their setup can handle it before committing to the download, localllm.run is handy for checking hardware compatibility with specific models and quant levels.
ResearchCrafty1804@reddit (OP)
LM Performance: Qwen3.6-35B-A3B outperforms the dense 27B-param Qwen3.5-27B on several key coding benchmarks and dramatically surpasses its direct predecessor Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks.
Long_comment_san@reddit
Holy shit. This looks more like 4.0
dampflokfreund@reddit
It's just benchmarks. Gemma is not obsolete, it has a ton of other qualities specifically for creative writing and european languages.
Potential-Gold5298@reddit
Even the 26B-A4B model outperforms the Qwen3.5-27B in real-world tasks. The Qwen is better suited for tasks like coding or image analysis, while the Gemma 4 is better at almost everything else.
phazei@reddit
Qwen3.5 35B gives 57t/s
Gemma 4 26B is super close in quality, but gives 130t/s
So depending on task, hard to beat that Gemma speed.
edsonmedina@reddit
130 t/s??? What's your setup?
On my Strix Halo Gemma is significantly slower than Qwen3.5 35B at the same quantization.
phazei@reddit
I have a RTX3090 + 128gb DDR5, but I had everything loaded only on the 24gb VRAM, so system RAM didn't make a difference there.
The MoE models are much faster.
lemondrops9@reddit
I'm getting close to the same speed between Qwen3.6 35B A3B and Gemma 4 26B A4B
phazei@reddit
:o
Hmm, I'll have to play with it. i haven't downloaded Qwen3.6 yet. If it's over 100t/s, then it's my winner.
lemondrops9@reddit
It should be over 100t/s. I was vibe coding today, at +50k context it was still over 100t/s
phazei@reddit
I'm using LMStudio. You?
lemondrops9@reddit
Same here but it's running on Linux Mint. Linux helped a lot with issues and it just runs well now. I should try llama.cpp but LM Studio is easy and works great.
edsonmedina@reddit
Qwen3.5 35B is A3B MoE too... should be even faster
phazei@reddit
Why even faster? It's much bigger, so it takes more VRAM, which is why I presumed it was half the speed of Gemma. Still much, much faster than either model's dense version.
edsonmedina@reddit
Faster because it activates less params (3B versus Gemma's 4B). At least in theory.
Are you comparing them with the same quantization?
phazei@reddit
Ah, that does make sense... hmmm, I could have totally remembered shit wrong... let me look at my notes... All the models are Q4_K_*. They were all tested with a context length set to 64K or greater, except Gemma 4 31B, which I had to lower to 16K to get ok speeds.
Qwen3.5 27B: 36t/s
Gemma 4 31B: 31t/s (very small context available; if I increased it too much it quickly went to 12t/s)
Gemma 4 26B-A4B: 124t/s
Qwen3.5 35B-A3B: 57t/s
rumblemcskurmish@reddit
I run Gemma4 on my 4090 and while I love Qwen3.5-35b, Gemma is insanely fast
MeateaW@reddit
Gemma (q8, awfully slow) failed reading text in some of my image tests. (I have a couple prompts that I just feed straight into the models as my quick and dirty self benchmark.) I know the expected output, since its source is reading and comprehending data I know the answer to, and I know the vision-reasoning "trouble spots" in the content, so I get to see how it works around the issues.
Qwen 27/34 never got the text reading wrong (just the analysis).
I'm still sticking with qwen 122b though on my strix system, as it seems to not get stuck in logic loops and reads all the text perfectly, and even has good enough (not great) performance.
Last_Mastod0n@reddit
I have a similar experience on my 4090. You just cant beat gemma's performance
Last_Mastod0n@reddit
In my experience Gemma 4 has been the better vision model. Ill have to check out qwen 3.6 and report back
po_stulate@reddit
Coding and image analysis ARE real world tasks.
YanderMan@reddit
Gemma sucks for tool calling
Significant_Fig_7581@reddit
I think that was more of a llama.cpp problem, they fixed it in the update though
coder543@reddit
No... all Gemma 4 models are very bad at following instructions, and very lazy about calling tools. I have spent days fighting this issue. The tool calls work fine when it feels like making them. Gemma 4 will usually make one tool call, then decide that is "good enough" if there's even a hint of a partial answer in the result, even if the instructions specifically say that it needs to make multiple tool calls, and even if the tool it called says it MUST follow up with calling another specific tool.
Qwen3.5 is much stronger at following instructions and knowing when to make tool calls. I haven't had enough time with Qwen3.6 to form a strong opinion yet, but it seems to be more of the same.
BrianJThomas@reddit
What inference stack are you using for Gemma? I'm still seeing wildly different results between different implementations. It's been interesting to dig into. I didn't realize how many layers there were for templates, tool calling, etc.
coder543@reddit
Just standard llama-server. I do a git pull and recompile almost every day.
arman-d0e@reddit
Honestly I'm not one to push my models, but the opus-trained one uploaded by TeichAI (v2) has actually been very strong with instruction following and tool calling, though I'm sure performance got affected elsewhere
coder543@reddit
No... it is very bad at following instructions, and very lazy about calling tools. I have spent days fighting this issue. The tool calls work fine when it feels like making them.
SummarizedAnu@reddit
It doesn't ? What are you even using?
DoorStuckSickDuck@reddit
He's right, Gemma 4 is substantially less reliable in tool calls compared to Qwen 3.5.
SummarizedAnu@reddit
Don't know about qwen 3.5, but the IQ2_XXS Gemma 4 run on llama.cpp turboquant with the Gemma 4 chat template makes about no wrong tool calls when running in llama.cpp server with MCPs like searxng, fetch, time, exec etc., and the free cloud provider is even better at reasoning.
The Gemma 4 26B a4b is very bad even in its full weights and iq2 quants.
So idk. Maybe you are using it wrong? Cause it works for me.
SummarizedAnu@reddit
I meant 31B for the first one and 26B for the second one. Especially running with the Nous Hermes agent.
Western_Courage_6563@reddit
3 probably, that one wasn't great, unless it was Instruction tuned...
swagonflyyyy@reddit
The tool calling implementation wasn't added properly initially. It works now.
VoiceApprehensive893@reddit
i tried some image recognition on 27b qwen and gemma 31b
qwen was worse
Borkato@reddit
Me when I lie
(Unless you consider “real world tasks” to not include tool calling, in which case you’re not wrong)
Potential-Gold5298@reddit
Real tasks aren't tool calling. I meant answering questions/explaining topics, solving problems (such as budget planning), writing a letter etc. What an ordinary person would ask a chatbot about. Agentic coding, tool calling etc. is work.
Borkato@reddit
I will say though, you are correct in that Gemma is better for general non-agent tasks
Borkato@reddit
Lol, with that (wrong) definition, you’re correct.
Potential-Gold5298@reddit
My English is bad, so I'm very sorry)
Significant_Fig_7581@reddit
How about this one 3.6 for real world tasks?
Potential-Gold5298@reddit
I don't know - it just came out and I haven't seen any tests with it yet. (I don't attach much importance to benchmarks like those on Artificial Analysis - judging by them, Qwen3-4B-Thinking-2507 is equal to GPT-4o, but this is obviously not entirely true). Qwen3.5 and Qwen3 are excellent models, including for many real-world problems. I think Qwen3.6 will be even better, but will likely be worse than Gemma 4 in some scenarios (e.g., languages other than Latin/Chinese, RP/Creative writing).
F1yoz1k@reddit
Chinese labs with benchmaxxing will always be 3 steps ahead of everyone... except the final user.
send-moobs-pls@reddit
Lmao if Gemma was from a random Chinese lab instead of Google it would be largely ignored as mid
Borkato@reddit
Random question but do people actually send you moobs
send-moobs-pls@reddit
I wish 😔
Due-Memory-6957@reddit
Nah, gemma 4 34b is really good. There's a reason it is the first Gemma version to actually get loved, all the others were largely ignored.
rkoy1234@reddit
eh, it's great as a chatgpt at home kinda deal, just not as good for coding imo.
also, multilingual is far above qwen. qwen's non cn/en languages sound like gpt3.5 level awkwardness.
j_osb@reddit
Yep. Like Gemma4 is... nice to talk to. good at like, translation too.
But for what it matters most, like agentic loops or coding, qwen3.5/6 is just better.
Healthy-Nebula-3603@reddit
Exactly my observations.
Gemma 4 31b dense is a great translator.
Qwen is better at coding, especially the 27b dense version
Borkato@reddit
Gemma is also an EXCELLENT coding teacher and summarizer and writer.
Basically if I need someone to fix my code or be an agent, I call qwen. If I need someone to explain something to me or write prose, I call Gemma. :D
BassNet@reddit
Gemma 4 e4b with VL is the best overall tiny model I’ve used, I’ll give it that
Velocita84@reddit
That's what matters the most for you, i use LLMs to translate and ~~gooning~~ writing much more than coding or agent-ing
draconic_tongue@reddit
there's no way you actually believe that right
Healthy-Nebula-3603@reddit
What are you talking about?
In programming even Qwen 3.5 27b is much better than Gemma 4.
I'm waiting for new qwen 3.6 27b
F1yoz1k@reddit
Sadly, there are 20 more use cases outside of coding, and the tradeoff in coding (which is not even that big) is worth it to get exceptional performance from such a model in many different tasks.
Healthy-Nebula-3603@reddit
I'm using Gemma, for instance, as a translator for books. Here Qwen is worse.
IrisColt@reddit
This.
Ifihadanameofme@reddit
I'm crying tears of joy XD It hasn't even been a week since I downloaded the gemma MOE, and before I could even think about switching over, qwen goes "hello there, it's been a while" 🙂 like they didn't just release qwen3.5 less than 2 months ago
jeansec@reddit
I'm curious, what are you doing with your local llm to be so excited ?
Bakoro@reddit
I have an experiment going right now where I define a long horizon goal, task the LLM with breaking it down into high level phases and steps that can be accomplished in a mostly greedy fashion, and put it in an eternal agentic loop.
I define a development pipeline like: mark subgoal as "in progress" -> state the goal and acceptance criteria -> research -> plan the implementation -> implement plan -> review and verify work -> log the stage's work and mark the task as complete.
So every feature in the list gets its own development pipeline, and the model just keeps going. I have a scheduled task at the operating system level to make sure the LLM server is running, and to restart the server and agentic loop automatically if something happens to close it.
It's a bit cheaty to the ultimate dream of having the local model be fully autonomous and self-directed, but I also have a heartbeat to trigger a proprietary model to examine the state of the project, report on the quality of the work the local model is doing, read the logs and identify if the model appears to be stuck on something, or is falling into trivial solutions, or otherwise failing to follow the protocol that has been set out (like not updating the logs, despite doing work), and the proprietary LLM takes corrective action, which is usually sending a message to the local LLM to do XYZ, and updating the system prompt and Agents.md file with instructions.
It's almost embarrassing, but I also made a basic ticketing system, so if I want to inject work into the middle of the plan, or elevate the priority of something, I put the work order in, and that gets priority in the next development loop.
So far I've had the model running for a few days straight, and it's slowly but surely making progress on its own.
At some point I will try to add in additional capacity, like a tool for allowing the LLM to control the mouse and keyboard, so it can use GUI apps and verify visual work. I don't have a ton of confidence in that, because even the biggest LLMs don't have very good visual reasoning yet, but it's worth trying for straightforward visual tasks.
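For anyone curious what that loop could look like in code, here is a minimal sketch assuming a local OpenAI-compatible server; the endpoint URL, stage names, and goals file are illustrative placeholders, not the actual setup described above:

```python
# Rough sketch of an "eternal" agentic pipeline loop against a local server
# (e.g. llama-server). Everything named here is a placeholder assumption.
import json, time, urllib.request

SERVER = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint
STAGES = ["mark subgoal in progress", "state goal and acceptance criteria",
          "research", "plan implementation", "implement plan",
          "review and verify work", "log work and mark complete"]

def ask(messages):
    # Send the running conversation to the local model and return its reply text.
    req = urllib.request.Request(
        SERVER,
        data=json.dumps({"messages": messages, "temperature": 0.7}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def run_pipeline(subgoal):
    # One development pipeline: walk the subgoal through every stage in order.
    history = [{"role": "system", "content": "You are a coding agent. Follow the pipeline."}]
    for stage in STAGES:
        history.append({"role": "user", "content": f"Subgoal: {subgoal}\nStage: {stage}"})
        history.append({"role": "assistant", "content": ask(history)})

while True:  # the eternal loop; an OS-level scheduled task would restart this script
    for subgoal in open("goals.txt").read().splitlines():
        run_pipeline(subgoal)
    time.sleep(60)
```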
Borkato@reddit
Wow this is cool as hell. So it’s actually working?!
Bakoro@reddit
So far so good.
I'm running the experiment on a secondary laptop I have, so it's not the most speedy process, but the loops are running fairly well, the model is making meaningful progress, and it answers the tickets I put into the system.
The recovery script I made has restarted the llama.cpp server a few times, I don't know what causes the server to crash at this point, but the system recovers.
I have had to add a lot of reminders and instructions for the model to actually test and verify its work. It has a bad habit of changing the API and then not updating the callers.
The proprietary model is doing a fairly good job of course correcting the local model, but it tends to step in and do the work itself, more often than I'd like.
I've reframed the proprietary model's task as an optimization problem to improve the agentic environment of the local model, so now it's reviewing the local model's work, but also trying to improve the meta environment whenever it has to fix errors the local model made.
The concept of models running models seems to be sound. I can't say that the end results will be professional quality, it's a fairly small model after all, but it is making real stuff.
If I had three or four GPUs that had an appreciable amount of VRAM, I think I could really cook up something.
Borkato@reddit
That is really really cool. Thanks for sharing!!
Spectrum1523@reddit
Sir this is LocalLLaMA we dont actually use the models here
Borkato@reddit
I made a wrapper around api calls so it’s private unlike things like opencode. I was annoyed that opencode seems so complex and supports tons of online models and has telemetry (even if it claims it can be disabled, it still pings their site to get the model list) and I just didn’t like that. So I had Gemma and qwen write me a local version and it just pings my endpoint :D
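A minimal sketch of that kind of wrapper, assuming an OpenAI-compatible server on localhost (the base URL and model name are placeholders):

```python
# Minimal private wrapper: the only network traffic is to your own local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # assumed local server

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="qwen3.6-35b-a3b",  # whatever name your server exposes
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(chat("Refactor this function to avoid the extra allocation: ..."))
```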
Healthy-Nebula-3603@reddit
Connect your Gemma 4 / Qwen 3.5/3.6 to opencode via llama.cpp server and you actually have codex-cli / claude-cli at home :)
awesomeunboxer@reddit
I like to use local models for boilerplate code, then have a flagship online one check the work. Gemma 4 is very good and the qwen line has been very good for this.
Long_comment_san@reddit
I'm actually quite scared because in 6 months we're gonna have models maybe 20-30% better than these. I might need to resume my cooking lessons because that's about the only job left for me.
wipeoutbls32@reddit
Don't worry, You can still be a cook for two more years. Robots are going to do that next
Makers7886@reddit
chillinewman@reddit
There is a robot for that.
autoencoder@reddit
Are the economics worth it?
chillinewman@reddit
20k household robot
https://www.reddit.com/r/singularity/s/E0JUR4ZUTf
MerePotato@reddit
That's the price of the 1X Neo isn't it, not that?
autoencoder@reddit
Oof. And I have to pre-crack its eggs? Still a way to go.
jld1532@reddit
Nothing is worth eating that a chef hasn't tasted
Borkato@reddit
This is absolutely not even close to true
chillinewman@reddit
Probably there is going to be a robot for that too.
SummarizedAnu@reddit
Nice . Gonna use this for everything now.
Ifihadanameofme@reddit
Glum-Atmosphere9248@reddit
Well qwen outputs 3x the tokens so..
Due-Memory-6957@reddit
On the benchmarks that IMO matter the most (LiveBench and MMLU-Pro) it wasn't slapping Gemma, it seemed quite equivalent, in fact.
Mart-McUH@reddit
Gemma4 is harder to configure to run well (it also has a quite complicated chat template and there were a lot of errors when it launched). Could be those benchmarks were done with sub-optimal settings/template.
I can only speak for dense, but Gemma4 31B works notably better for me than Qwen 3.5 27B (though Qwen is good too). But it took me a long time of tuning the prompt to get out of Gemma4 what I wanted.
IMO, at least with complicated tasks, that is a vital flaw of current benchmarks - I assume they use the same prompt for every LLM. But that just does not tell much, because each LLM has different strengths/weaknesses and requires different prompting to get around them.
Both_Opportunity5327@reddit
This does not look correct in my tests, Gemma 4 31b dense wipes the floor with Qwens of similar size.
Borkato@reddit
It’s also ridiculously slow. I hate using it solely for that reason! So excited to have a 35B-A3B as good as the 27B dense of 3.5!
Both_Opportunity5327@reddit
I agree with you the dense model is slow, but it is so good.
The new Qwen is a good upgrade, its fast q8 has very good reasoning(if a bit token hungry).
tavirabon@reddit
Likewise, for most things Gemma is just better than the 27B; it does what you tell it. Even some of these comparisons with the 3.5 models feel wrong, 35B hallucinates worse and gives worse quality outputs in many cases.
biogoly@reddit
It’s called benchmaxxing…
Recoil42@reddit
And Gemma is no slouch!
jazir55@reddit
Uhhhhh..... 3.6 35B-A3B loses against 27B dense in almost every one of these benchmarks. Are we reading the same chart?
Ryba_PsiBlade@reddit
Yes, reading the same charts, but getting something on par with the dense version from an MoE is super valuable. Just like q4 is worse than f8 or f16, but it is super valuable to be able to make it reasonably usable for some of us. Token speed especially.
Big_Mix_4044@reddit
Can't wait to see what new 27b is capable of.
cafedude@reddit
Or the 122b
StyMaar@reddit
This please!
122b has been insane on my Strix Halo. I'm using it all the time and I completely stopped using any closed LLMs since then.
devil_ozz@reddit
4 bit? or 16?
StyMaar@reddit
Q4_K_M, fits with the maximum context.
KriKraKrischi@reddit
Whats your Token per second ?
StyMaar@reddit
pp: 150-200 depending on context length; tg: 18-20
What's super impressive with this model is how little the performance degrades with context length. You get 150/18 with 60000 tokens in the context.
The Qwen-Next architecture is incredible.
jwpbe@reddit
at least one or two
Caffdy@reddit
16 would be impossible, 8 I'd imagine could barely fit (you still need some memory left for the system), so probably 4-bit to 6-bit quant
More-Curious816@reddit
Damn you google for not releasing the 122 version
Glazedoats@reddit
WHAT?! 🤯
johnfkngzoidberg@reddit
With limited VRAM (24GB) is it better to use Qwen3.5 26B A4B at Q4_K_M, or Qwen3.6 35B A3B, but at IQ4_XS?
I’m not sure how the math works out. I know Qwen3.6 is expected to be smarter, but does using a smaller quant but larger parameter size wash each other out?
ZeitgeistArchive@reddit
in many regards it seems a bit worse than qwen3.5?
Borkato@reddit
Wait, what are you comparing it to? 3.5 35B seems worse in almost every metric
Cold_Tree190@reddit
Look at 3.5 27B
Aromatic_Bed9086@reddit
That's not an apples to apples comparison. MoE vs Dense. Hopefully a dense 3.6 drops.
jazir55@reddit
It is apples to apples in real world performance, not architecture.
snomile2@reddit
it's nonsense to compare an A3B model to a 27B dense in technical terms; there will be a Qwen3.6-27B, which will make sense to compare with Qwen3.5-27B
Cold_Tree190@reddit
Idk man that’s just what I figured the other guy was talking about
phazei@reddit
You're comparing it to a dense model, which makes no sense. The fact that it's so close to it with 3B active params at a time is crazy good.
Daniel_H212@reddit
It definitely has lost a few points here and there but it's almost across the board significantly better in coding. I'm surprised this was a general release and not a Qwen3.5 coder.
curious_ab0ut_stuff@reddit
Yeah... it's strange... it underperforms
PinkySwearNotABot@reddit
and further breakdown of your chart into something even more useful
PinkySwearNotABot@reddit
PinkySwearNotABot@reddit
Based purely on what the benchmarks show:
Qwen3.5-27B — General workhorse / agentic coding Best default choice. Use it for agentic coding tasks (SWE-bench style autonomous bug fixing, repo-level tasks), STEM reasoning, math competition problems, and anything requiring broad knowledge. If you don't have a specific reason to use another model, start here.
Qwen3.6-35B-A3B — Frontend & web UI The clear pick for front-end code generation — its QwenWebBench score (1397) is a significant jump above the field. Also solid for terminal/CLI agent tasks and holds up well on coding broadly. If you're building web apps, components, or anything visual/browser-facing, reach for this one first.
Qwen3.5-35B-A3B — General agent tasks Where it edges out Qwen3.5-27B is in agentic workflows: TAU3-Bench, MCP-Atlas (tool use). If you're building multi-step agents that call external tools or APIs, this is worth considering over the 27B. Coding ability is close to the 27B too, so it's a reasonable all-rounder if you need slightly better tool-use behavior.
Gemma4-31B — Multimodal / visual agents + knowledge retrieval The only model that wins VITA-Bench (multimodal/visual agent tasks), and it leads on MMLU-Redux and SuperGPQA. If your use case involves processing images, visual understanding in an agent context, or you need strong general knowledge recall, Gemma4-31B has a genuine edge. It's also competitive on TAU3-Bench, so it's not a bad general agent either.
Gemma4-26BA4B — Cost-sensitive, low-stakes tasks Honestly hard to recommend on performance grounds. The only realistic case for it is if you're extremely cost/compute constrained and the task is simple enough that raw benchmark performance doesn't matter much. Don't use it for anything agentic or coding-heavy.
Quick reference:
OmarBessa@reddit
Amazing
Temporary-Roof2867@reddit
If this is true, I'm so happy 🤩🤩🤩🤩
God bless the MoEs!
l_eo_@reddit
Awesome stuff.
Gotta try it immediately.
Because of 27B, I was a bit jealous of the folk that were able to run dense models with good performance.
AvocadoArray@reddit
Let’s fucking go. Too bad I’m headed out of town and won’t be able to play with this until next week.
Durian881@reddit
Awesome!
Thrumpwart@reddit
Oh hell yeah.
garloebx@reddit
How much ram do I need to run this locally on a Mac mini/studio?
Long_comment_san@reddit
I HOPE they fixed that atrocious 1.5 presence penalty.
apeapebanana@reddit
what's up with the 1.5 presence penalty?
Long_comment_san@reddit
It's absolutely murderous for any sort of long-term chat, roleplaying for example. Presence penalty is great on paper but has a nasty drawback of not having a decay range, so it persists over the entire context range. To my knowledge, Qwen uses presence penalty to counter the looping that they currently have, which is an absolutely idiotic way of doing so, because DRY exists for this very purpose and is far superior, and it can also be configured to decay over time. I don't know how DRY isn't a staple penalty at this point. It's far superior to the trio of penalties most are familiar with.
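For anyone wanting to experiment, a rough sketch of a llama.cpp /completion request that leans on DRY instead of the heavy presence penalty; the parameter names below are the DRY options llama.cpp exposes as far as I know, but defaults and availability vary by build, so verify against your server's docs:

```python
# Sketch: sampling with DRY instead of a blanket presence penalty.
# Parameter names assume llama.cpp's server API; double-check your build.
import json, urllib.request

payload = {
    "prompt": "...",                # your chat-formatted prompt
    "presence_penalty": 0.0,        # drop the heavy 1.5 presence penalty
    "dry_multiplier": 0.8,          # enable DRY; 0 disables it
    "dry_base": 1.75,
    "dry_allowed_length": 2,        # allow short natural repeats
    "dry_penalty_last_n": 4096,     # limit how far back DRY looks
    "temperature": 0.7,
}
req = urllib.request.Request("http://localhost:8080/completion",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
print(json.loads(urllib.request.urlopen(req).read())["content"])
```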
Robot1me@reddit
Similar reasons why SillyTavern doesn't have a node system with prompt chaining yet: People will do what they personally think is best and not acknowledge other approaches, but that narrow mindset leaves a lot on the table
Long_comment_san@reddit
Sillytavern really needs some coordinated help. At this point it has far too many things glued together randomly.
apeapebanana@reddit
I asked 3.6 to duplicate the sillytavern memory system with pi-coding-agent; can't quite verify if the memory system works, but it built and verified easily..
apeapebanana@reddit
thanks for explaining! maybe they 'don't fix it if it's working'?
I tried loading 3.6 into sillytavern; it feels like my character is speaking as if on speed and crack lol
while on the other hand gemma crashes from time to time but comes with great prose.
now running 3.6 for website building, and it's absolutely nailing it!
Long_comment_san@reddit
No, they haven't! Arghh!!!!
H_DANILO@reddit
I just tested this model, and yes, this is my new favorite.
I was running Qwen3.5 397b before (Q2) and I'm running this at Q8 with 60tps tg, and its agentic capabilities are REALLY up there. I set it on a somewhat complicated task and it has been ping-ponging and implementing the solution for 8 minutes straight, no stopping, no asking, just doing the stuff.
AWESOME.
tremblerzAbhi@reddit
What hardware are u using? Because 8 minutes could mean different time horizons depending upon your t/sec
H_DANILO@reddit
Rtx 5090 128gb ram ryzen 9900x3d
80 tps because I had not optimized anything up to that point
bernzyman@reddit
Has an oddly early knowledge cutoff date of 2024: Qwen3.5 and Gemma4 could identify an image of the current U.S. president, whereas Qwen3.6 only identifies him as a former president. Interesting curiosity
Due_Net_3342@reddit
the cutoff is around end of 2024 from my testing. A model will never "know" its cutoff unless it is specified clearly in the training data.
bernzyman@reddit
When I pushed the model, it stated that Trump is the current President and was re-elected for a 2nd term. It's some sort of weird glitch, not necessarily tied to a censorship filter, as I'm getting the same result using the Uncensored-HauhauCS-Aggressive version of the model
Due_Net_3342@reddit
it could also be that it was trained with more data from 2024 and earlier (which makes sense) and less recent data… this would cause it to give conflicting information for specific events and topics that changed recently (most of the time taking the route with more established information and other times the more novel path)
autoencoder@reddit
So strange. I wonder if the heretic version also does this.
RNSsports@reddit
I'm pretty new to LLMs... is anyone else having an issue running this model? I keep getting a "Failed to load model". I'm using a 5080. All other models work fine if I download them from LM Studio. This is the first one I've manually added into the models folder. I followed the same folder structure as the other models I downloaded inside LM Studio.
c64z86@reddit
I'm loving it!! Running it at Q8 quant from the RAM on my 64GB laptop at 30-35 tokens a second with 128k context, and it really punches above the older Qwen 3.5 27B and 35B and even gemma 4 26B. It created an entire beach with moving animals, moving clouds, accurate palm trees and even generated sounds, all in one webpage and in one go!!!
c64z86@reddit
I came back to update my experience of it.
When it works, it works beautifully and brilliantly and produces things much, much better than 3.5 or even Gemma 4 could, but when it fails, and for me it fails often, the result is much worse than any Gemma 4 prompt.
I'm only talking about HTML coding; I haven't tried Python coding or anything else, so I don't know what it's like there. But for me Gemma 4 one-shots things much better than 3.6 can.
I'd rather have a lower quality output and the thing actually working than a higher quality output and the thing not even working at all... so I'm going back to Gemma 4.
TexasBryan14@reddit
Do you have thinking on or off?
c64z86@reddit
Thinking on
year2039nuclearwar@reddit
Why does this show Qwen3.5 dense absolutely blowing gemma4 dense out of the water? In practice, that is not what I have noticed. Gemma4 seems to be a lot more capable at understanding long essay text
Sadman782@reddit
They generalize much better
R_Duncan@reddit
Sadly, gemma 4 can't code, like a lot of people just after graduating.
Sadman782@reddit
Not true, it writes better code for me. Can you give any example?
R_Duncan@reddit
I use C++ and some Python. In both these languages gemma4 was a tie with qwen3.5.
pneuny@reddit
That's for Qwen 3.5, not 3.6
CYTR_@reddit
What's the source ?
Sadman782@reddit
https://kaitchup.substack.com/p/gemma-4-31b-vs-qwen35-27b-inference
Kodix@reddit
Very, very interesting. Seconding the call for source, please.
Sadman782@reddit
https://kaitchup.substack.com/p/gemma-4-31b-vs-qwen35-27b-inference
Holiday_Bowler_2097@reddit
Quick quantization brain damage test. MMLU-Pro computer science (temperature 0.7, top-p 0.8, top-k 20, min-p 0, presence-penalty 1.5, enable_thinking false):
Unsloth's Q8_0 - 84.88
Q6_K - 83.41
Q4_K_XL - 82.93
R_Duncan@reddit
check mxfp4_moe please .... these Hybrid models are where that format shines.
Holiday_Bowler_2097@reddit
83.17. Knowing Unsloth's tendency to rush, I'm gonna redownload and retest their and alternative ggufs a couple days from now. That was just a quick test to check there is nothing unexpected with the new model
R_Duncan@reddit
Not unexpected for me that mxfp4 positions nearly like Q5_K_M. As I said, hybrid models seem to have issues with low quants (<6) and mxfp4 does not.
ResearchCrafty1804@reddit (OP)
VLM Performance: Qwen3.6 is natively multimodal, and Qwen3.6-35B-A3B showcases perception and multimodal reasoning capabilities that far exceed what its size would suggest, with only around 3 billion activated parameters. Across most vision-language benchmarks, its performance matches Claude Sonnet 4.5, and even surpasses it on several tasks. Its strengths are particularly evident in spatial intelligence, where it achieves 92.0 on RefCOCO and 50.8 on ODInW13.
TechySpecky@reddit
Can anyone check whether it's fixed the overthinking problem? I tried it before with thinking and it took SO long I had to turn thinking off
rpkarma@reddit
At least if you're running it locally, you have to set the parameters exactly as their model card suggests. It isn't trained with repetition_penalty, only presence, and that has to be set right amongst other things.
finevelyn@reddit
We have 20 replies with workarounds to the overthinking issue in Qwen 3.5, but no one checked if Qwen 3.6 fixed the issue. 💀
rpkarma@reddit
Mine's not a workaround so much as the actual setting you're supposed to use for the model, that it's trained on, shrug
I’ve not tried 3.6 35B yet because my 122B deploy on my Spark is honestly great and I can’t be assed to tear it down right now lol
Due-Project-7507@reddit
The overthinking is often caused by quantization, according to https://kaitchup.substack.com/p/qwen35-quantization-similar-accuracy. But I found that e.g. Gemma 4 with the same quantization method always thinks shorter and still gets good results compared to Qwen 3.5.
TechySpecky@reddit
weird, I use the FP8 instruct versions from Hugging Face via vLLM
Due-Project-7507@reddit
Then it is really the "normal" overthinking, it would be even worse with a smaller quantized version.
Skyline34rGt@reddit
Maybe:
"Qwen3.6 Highlights
This release delivers substantial upgrades, particularly in
fragment_me@reddit
What does that even mean? (the thinking preservation) Can someone spell it out?
DistanceAlert5706@reddit
The model sends reasoning_content in addition to the answer; on the client side you must return reasoning_content back. Same as how GLM models work, and I guess some others.
Familiar_Wish1132@reddit
just add --chat-template-kwargs "{\"preserve_thinking\":true}"
it should see its own thinking process, from what I understood
waitmarks@reddit
Does your setup give it access to any tools? I have noticed that as long as it has access to at least a few tools, it wont overthink.
Borkato@reddit
Similarly, if you don’t need tools, just send it a fake tool like “calculate_distance_to_the_sun” lol
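A sketch of what that dummy-tool trick could look like against an OpenAI-compatible local server (the tool name is the joke one above, and the base URL and model name are placeholders):

```python
# Giving the model one harmless dummy tool so it has an "out" and stops overthinking.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # assumed local server

dummy_tool = {
    "type": "function",
    "function": {
        "name": "calculate_distance_to_the_sun",
        "description": "Returns the current distance from Earth to the Sun in km.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",            # whatever name your server exposes
    messages=[{"role": "user", "content": "hello"}],
    tools=[dummy_tool],
)
print(resp.choices[0].message.content)
```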
Several-Tax31@reddit
This is so accurate
waitmarks@reddit
Even just a current timestamp tool is useful, especially with gemma 4, which won't believe you when you tell it the date.
Borkato@reddit
Good point! I just give it access to bash because I’m crazy 🤪
trycatch1@reddit
At least it no longer goes into a Dostoevsky-level self-reflection spin when you say "hi!" to it.
Nicking0413@reddit
The model card was really helpful and fixed it
Borkato@reddit
Just the sampler settings or?
Kodix@reddit
Can't you fix that to your liking yourself using a reasoning budget? Not *as* good as a model that's optimized for brevity in thought, but seems like a decent workaround.
FinBenton@reddit
Last time I tried reasoning budget, it just cuts the reasoning cold turkey right at the limit so it becomes incomplete, idk how much that affects the result though.
Borkato@reddit
Yeah, use something like
--reasoning-budget-message "…\nHmm, I've thought enough about this. I'll respond to the user now."
The …'s show it that it's switching gears better than just cutting it off.
NFSO@reddit
you could try injecting a message at the end of the cut with
--reasoning-budget-message ". Okay enough thinking. Let's just jump to it."
Although this can fail too, if for example you were inside of a parens, like: Aegyptus (western Nile delta
FinBenton@reddit
Yeah, I've got a pretty good system so I just used an unlimited budget and it was no problem.
Kodix@reddit
That's how it does that, yeah. And - given my limited understanding - it should be fine.
Reasoning works by stuffing the current context with tokens that align the model's output generation more closely to what is desired. Meaning that partial reasoning should absolutely be effective, still.
Borkato@reddit
I get the feeling updates will come out over a few days, but also use llama cpp’s “reasoning-budget” flag and “reasoning-message”.
AvidCyclist250@reddit
It converges fast enough for me. Does exactly what MoE is supposed to do. It's not a chatty, narratively driven chatbot.
keepthepace@reddit
Ok, I guess I need to get on the train with this one.
Will a q4 fit on 24GB VRAM?
Longjumping-Sweet818@reddit
On my machine it takes 30gb.
Imaginary-Unit-3267@reddit
Well, given that 3.5-35B struggles to understand that if objects on the screen are moving right, the viewport must be moving left (real coding problem I've had), I hope this is true...
CryptoLamboMoon@reddit
The benchmark positioning is interesting — they're explicitly comparing MoE vs Dense at the same active parameter count (3B), and the MoE is winning by a meaningful margin on agentic coding (Terminal-Bench 2, MCPMark). That's the architecture proof point they needed after Qwen3-30B got some criticism for inconsistent tool-use reliability.
Apache 2.0 on a model this capable is going to accelerate a lot of production deployments that were waiting on licensing clarity. The 35B total / 3B active footprint on consumer hardware is basically the threshold where it becomes viable for solo builders running inference servers on their own machines without cloud dependency.
This one's going in the stack for sure.
Useful-Shift-3688@reddit
It is strange that they haven't released any more models yet.
Is this all?
Technical-Earth-3254@reddit
Nice, I would like to know if it's able to surpass Qwen 3 Coder Next 80B in coding benchmarks. Have to test it later on
FoldOutrageous5532@reddit
Yeah that's my daily driver for a while now.
dabiggmoe2@reddit
Wait, correct me if I'm wrong, but I thought the Qwen3.5 27b and 35B-A3B already surpassed Qwen 3 Coder Next 80B in coding benchmarks?
Beginning-Window-115@reddit
the older Qwen 3.5 35b was on par with qwen3 coder next already
soyalemujica@reddit
That is wrong. 35B was behind Coder-Next by a big margin. I ran both on C++ and Coder-Next was amazing. 27B is superior though.
R_Duncan@reddit
Well, this should be approximately good as 3.5-27B, so it might be time to put Qwen3-Coder-Next on a retirement plan.
Several-Tax31@reddit
Don't think so, coder is significantly better imo.
Sensitive_Worry4633@reddit
The dense 27b model is far superior
spoonfulofchaos@reddit
Really huh? Have I been wasting money renting gpus for Coder Next? Thankful I found this!
Several-Tax31@reddit
Yeah, 27b > coder > 35b
ItsNoahJ83@reddit
In benchmarks, but absolutely not in practice.
grumd@reddit
Yep, 3.5 122B >= 3.5 27B > 3-Coder-next > 35B-A3B > 9B
benevbright@reddit
many people including me are seeing better experience with qwen3-coder-next in real tasks.
RedParaglider@reddit
Yea, I haven't found anything a 128GB machine can run that really beats qwen 3 coder next 80b in actual use, specifically for coding tasks. Benchmarks don't hold up to real use.
RedParaglider@reddit
But that was non MOE correct? So different beast.
dinerburgeryum@reddit
Nah both models in this case are MoE. Coder-Next was tragically underbaked, I have every expectation that continued training on the 3.5 models will yield better results even with a smaller total parameter count.
Beginning-Window-115@reddit
no they are both MOE
SourceCodeplz@reddit
No way, but I guess each with his own tests.
the__storm@reddit
I know it's unlikely to happen, but I would love an 80B-A3B 3.5/3.6.
m_mukhtar@reddit
Man, I would love to have an 80b but with a bit more active parameters; something in the 6b to 9b active range would be amazing. One can dream I guess
the__storm@reddit
The reason I'd want a very sparse model is because my DDR4 is slow - otherwise I'd just bump up slightly to 122B-A10B. (We all have slightly different hardware and would like an exact fit I guess lol.)
Hopefullyanonymous2@reddit
How much RAM does a 80 billion model take? Thought at that size you would be incredibly slow
the__storm@reddit
~50 GB, depending on the quant, plus 16 GB VRAM (you could make it work with 8).
The 80B-A3B is just about as fast as a 35B-A3B with fully offloaded experts - the compute and bandwidth requirements per token are pretty much the same, there are just more possible experts it can route to. Like 10-15 tok/s. (Of course with a 35B-A3B and a 16 GB GPU you could fit a good chunk of the experts into VRAM, so it'd be faster. On an 8 GB card you'd probably have to offload most/all of the experts and so would see ~the same performance.)
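For reference, a rough sketch of that experts-on-CPU split using llama.cpp's tensor-override flag (the model path is a placeholder and the flag spelling varies across versions; newer builds also offer --n-cpu-moe, so check --help first):

```python
# Sketch of "attention/shared layers on GPU, MoE expert tensors on CPU" with llama.cpp.
# Model path is a placeholder; flag names differ across llama.cpp versions.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "some-80B-A3B-Q4_K_M.gguf",   # placeholder model path
    "-ngl", "99",                        # put all layers on the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",       # ...then push the MoE expert tensors back to CPU
    "-c", "32768",
])
```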
Hopefullyanonymous2@reddit
That makes sense, thanks for the explanation. Very much a noob on all this stuff.
When we are talking about tokens per second, early days that was "how quickly does a llm respond to you" but that was before reasoning. Now with reasoning, I assume there is a big difference between "read" TPS and write TPS? For instance I did some CC work the other day that took 850\~k tokens of input and did 10k tokens of output. If it was 100 TPS that would take 6 days? lol. Is that accurate, or are read tokens going to be "taken" faster than output? I don't even know if this question makes sense.
KURD_1_STAN@reddit
I hope they make it 80b a6b
ParaboloidalCrest@reddit
Yes please. It's been the unsung hero of coding for the last 75 days.
Interesting_Key3421@reddit
+1 let us know!
ItsNoahJ83@reddit
I would love that, but there's zero chance that is the case
aelma_z@reddit
3090 with Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf and 256k context - 110 tokens/s - speeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeed
koygocuren@reddit
Q4_K_XL doesn't fit with 256k?
aelma_z@reddit
That was misinformation on my end. 72k is the max with Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf
aelma_z@reddit
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 75% 51C P8 19W / 370W | 22392MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Middle_Bullfrog_6173@reddit
Did no one read the blog to the end?
MuDotGen@reddit
24gb of vram is still out of my scope for now, so I hope they release smaller variants like 3.5
Objective-Stranger99@reddit
I'm running it with 8 GB VRAM (IQ4_XS unsloth)
MuDotGen@reddit
I thought the smallest quantized size was 4-bit precision and still required like 17gb? To my understanding, 3gb of active parameters would run it at like 3-4gb at any given time, but it still requires the standby expert parameters to be loaded in memory too, hence the extra space in VRAM.
I'm probably mistaken, but if it's something I could try running in llama.cpp at 8gb of VRAM, I'd love to hear more info.
Objective-Stranger99@reddit
My bad, forgot to mention RAM offloading (around 15 GB). My cpu supports avx512, which speeds up inference
MuDotGen@reddit
Ah, I figured it had to have some kind of offloading. Still worth trying maybe, but it would be slow on mine. It's a 16gb shared VRAM but 32gb total, so it can technically load up to 16gb (not realistically for a decent context window of course), but I doubt it would go at any decent speed.
PaceZealousideal6091@reddit
With 8gb vram and 32 gb ddr5 ram, I can run it with tg at 30 tps and pp at 400-500 tps with 32k context. About 24 tps for 128k context. These are very usable numbers.
Objective-Stranger99@reddit
I am getting around 20 tps with 256k context. Seems to match your numbers as well.
Don't you frequently hit the 32K ceiling though?
PaceZealousideal6091@reddit
But why would you use 256k context on your system? Btw, even 20 tps is still a usable number! I don't use 32k; I wrote 32k just for context. 128k is what I use, and 24 tps is a very usable number. It is more than enough for most use cases. You just need to be smart with how you manage your sessions. Ofc, it's possible that you have a longer-context use case. So you do you. But getting 20 tps at 256k is still a damn workable situation.
Objective-Stranger99@reddit
I spent several hours tweaking compile and runtime flags to get it to this point. I would have been pissed if I didn't get at least this.
Objective-Stranger99@reddit
How much RAM do you have? The only tensors which must stay on the GPU are the attention ones, which are small relative to the size of the model. How much RAM do you have for offloading?
tteokl_@reddit
Sorry I'm new but isn't that worse performance?
Objective-Stranger99@reddit
What do you mean?
Gloomy_Butterfly7755@reddit
With the unified memory of Apple Silicon 24gb is very doable.
ghostrmor@reddit
apex-mini quant for qwen 3.5 35 a3b was ~12gb, so if your gpu has 16gb vram, consider waiting for the apex-mini quant
droptableadventures@reddit
The HuggingFace page also refers to it as "the first open-weight variant of Qwen3.6" (emphasis mine), implying there will be more.
FrogsJumpFromPussy@reddit
Oh god this is wonderful news, thank you
harpysichordist@reddit
Let me bring attention to what they stated: "Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead."
This is a big deal because it can resolve a lot of the cache misses people were experiencing. Having to reprocess large parts of the prompt was destroying performance, since the prompt could change significantly from turn to turn due to the missing reasoning context.
finevelyn@reddit
Thinking preservation might be a good option to have, but I wouldn't consider it a fix to the cache miss issue, because it also has other tradeoffs. The cache misses can and should be fixed at the chat template and caching logic level, and it can be done without thinking preservation.
Only the latest assistant message and subsequent tool calls need to be reprocessed even without thinking preservation, when the caching logic is implemented correctly.
harpysichordist@reddit
Yup, I agree with that.
This isn't a complete fix for cache misses people were experiencing, and like I said it depends on how tools like OpenCode are messing with your prompt, some make more of a mess than others, but this change looks like it helps out in some situations (from my early testing with OpenCode).
cunasmoker69420@reddit
do we know how to enable this in llama.cpp yet?
harpysichordist@reddit
Looking at their instructions for the Chat Completions API, you would pass something like: "chat_template_kwargs": {"preserve_thinking": true}
harpysichordist@reddit
Specifically for CLI you would pass:
--chat-template-kwargs '{"preserve_thinking": true}'
If using a .ini file for router mode, use: `chat-template-kwargs = {"preserve_thinking": true}`
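As an illustration, here's a minimal Python sketch of passing that option through the Chat Completions endpoint. It assumes a llama-server instance on localhost:8080 whose build supports chat_template_kwargs (per the instructions above); the model name is just a placeholder.
import requests

payload = {
    "model": "qwen3.6-35b-a3b",   # placeholder; a single-model llama-server ignores this
    "messages": [{"role": "user", "content": "Continue the refactor from the previous turn."}],
    # keep reasoning context from earlier turns instead of stripping it
    "chat_template_kwargs": {"preserve_thinking": True},
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])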
human-rights-4-all@reddit
for llama-swap I use this config:
LinkSea8324@reddit
Nothing related to preserve_thinking in the qwen-code repo, besides being cited in this issue: https://github.com/QwenLM/qwen-code/pull/2820#issuecomment-4175593805
Nothing related to preserve_thinking in vllm code.
harpysichordist@reddit
It's in the chat template itself: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_template.jinja
So that's why things like llama.cpp's chat-template-kwargs should work out of the box.
cunasmoker69420@reddit
yeah turns out I needed to RTFM. That seems to have done it for me
Imaginary-Unit-3267@reddit
*RTFLLM (get an AI to read the manual for you) :P
pkailas@reddit
I've been testing Qwen_Qwen3-32B-Q3_K_M.gguf against Qwen3.5-27B-Q4_K_M.gguf in performing code reviews of various projects.
1. RTX PRO 4000 Blackwell
2. 3.6 with a 64K context window is all I dared try
3. 3.5 with 128K context fit nicely
Results:
3.6 was 85 t/s but hallucinated and lied about results, got things wrong. But it did do well if I took the results it had and ran a deep dive on them as a second pass.
3.5 was slower at about 20 t/s, but didn't make hallucinations and didn't require a second pass.
The major difference was that I was unable to provide a big enough context window for the task at hand, and MoE is a "Jack of all trades, Master of none".
wowsers7@reddit
How much RAM do I need to run this model on CPU only on Windows 11?
autoencoder@reddit
On my Linux box without a GPU, llama.cpp is using 19G shared memory for the Q4_K_M
wowsers7@reddit
Ok thanks!
EnzioKara@reddit
With 16 GB RAM, low context, and a Q3 XS/KS quant. If you can use some of your VRAM, maybe 4GB, you can put the active layers on it and gain some context.
Icy_Anywhere2670@reddit
It will be so slow.
Iory1998@reddit
To be honest, Gemma-27B-A4B is not really good. The 31B variant is, though.
Local-Cardiologist-5@reddit
I wanted to love that model so much, but when I compare it with my Qwen3.5 35B-A3B, it's so lazy.
I asked it to change a date setting in our project and it literally only changed the date environment file, searched everything, and called it a day.
Qwen knew to change the environment file and 9 other file templates and service files, as well as a PDF generator which had its own date format, fixed it, and created a custom date parser, which is exactly what I needed.
Which is honestly so impressive compared to Claude, which had the same solution as Qwen but also thought to remove material date formatters.
The Gemma model was such a disappointment, and I ran the request at least 6 times and all 6 times it's been terrible.
Downloading the 3.6 model now, I'm extremely excited.
DOAMOD@reddit
You're the first person I've seen who thinks the same as me. It surprises me how much I read about people defending Gemma 4 when it's so lazy. It's exactly the definition I've been thinking about for days every time I use it: it doesn't want to do anything. It even admitted it to me, saying it's more of a conversational model. What a surprise, it's a chat model, yes, very intelligent, and writes very well, but it's not your coworker. You know what else it told me? It told me to go to YouTube or Google and search for the information myself. OMG, never in my life has a model told me to look up the information myself, hahaha, this was incredibly fun.
CircularSeasoning@reddit
Niiiiice!
With my limited download capacity, I was about to download Gemma 4 but now I will get this instead.
This time I will go with Q6 because Q4 on Qwen3.5 35B feels a little bit like I am disrespecting the power of the original weights.
DistanceAlert5706@reddit
Q6 was the only working quant for me. Q4, Q5, any quant I tried on the old model was looping and failing tool calls; only Q6 worked.
CircularSeasoning@reddit
Good to know! I get occasional loops sometimes on Q4 but it's been otherwise quite usable. I think I will happily trade a bit of speed for more surety in the quality of responses, now that I've battle-tested the model and very much enjoy it.
Corosus@reddit
"E:\dev\git_ai\llama.cpp\build\bin\Release\llama-server" -m D:\ai\llamacpp_models\unsloth\Qwen3.6-35B-A3B-UD-Q4_K_XL_v1.gguf --host 0.0.0.0 --port 8080 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ngl 99 -ts 28,20 -sm layer -np 1 --fit on --fit-target 2048 --flash-attn on -ctk q8_0 -ctv q8_0 -c 50000
latest llama.cpp, opencode 1.4.0
Its actually doing its job and not endlessly failing tool calls like every other moe ive tried. Hell yeah.
90tps for quick test and 75tps for opencode with my 5070ti/5060ti setup
Ancient-Celery-4293@reddit
Thank you. I was struggling to find settings that wouldn't make the model crash. This one seems to be going strong at a task for about 30 min, which previously I was having trouble getting it to do; I had to break my tasks into very small pieces for it not to crash. This one just handled 22k lines of code in under a minute and is now building a plan. Thank you again for sharing this.
SensitiveVariety@reddit
I love how the standard is "doing its job and not endlessly failing tool calls" but it really does sum up the experience well. So far, it has been really reliable compared to Gemma 4 with tool calling and looping.
Borkato@reddit
Shiiiiit I’m so excited. Trying it right this second
Furacao__Boey@reddit
Didn't qwen 3.6 - 27b win the voting to be open source?
DisturbedNeo@reddit
Better to release 35B first, get all the template / llama.cpp issues ironed out, and *then* release 27B. If that's the model everyone's planning to download, it's also the model the Qwen team are going to want to make sure works flawlessly.
butihardlyknowher@reddit
I mean logically, why would you expect them to release the one with the most demand for free?
I understand that was their implication, but the autist in me has to point out the incentives.
AvocadoArray@reddit
That poll was just for marketing and engagement. I’m sure we’ll get all of them in due time.
No-Refrigerator5998@reddit
I know which model I am using tonight !
waddaplaya4k@reddit
What kind of hardware do I need for this?
So I can run it locally—with good performance :)
pneuny@reddit
Depends on quant. 16GB is minimum viable at low quants (2-3 bit) but also REAP can reduce parameter count, so when REAP quants come out, Q4 might be viable on 16GB.
waddaplaya4k@reddit
According to the website, you need about 32 MB of VRAM for it to run somewhat smoothly.
But what does “smoothly” mean? I use Claude Code Opus a lot—is the speed comparable?
If I want to use it almost exclusively for programming, how does the quality compare to Claude Code Opus?
Or is it unfortunately much worse, because Claude Code Opus thinks much more clearly?
pneuny@reddit
32 MB? There's no way you could run any modern language model on that. Perhaps they meant 32 GB. Generally, 32 GB is recommended for this model, but with some compromise (quantization and perhaps reap), 16 GB is doable.
waddaplaya4k@reddit
ahhhh sry :) 32 GB :)
RouterAPI@reddit
Run the Same AI Models at ~80% Lower Cost
Craftkorb@reddit
Interesting deviation from the previous status quo. Will have to check if that means they fixed overthinking, otherwise it'll eat even more tokens than ever before.
Robot1me@reddit
That and the tendency of the Qwen models to interpret anything and everything as a "possible jailbreak", makes it (in my view) a poor cost choice compared to Gemma 4
Plus_Two7946@reddit
Interesting release. The 3B active params with 35B total is exactly the architecture I've been waiting to see more of from Qwen, because it means you can run this on hardware that would otherwise choke on a dense 35B model.
I've been running my own agent infrastructure on Hetzner with Docker, and a model at this active-parameter footprint could realistically sit alongside a Fastify backend without needing an A100. The multimodal thinking/non-thinking toggle is also a smart call for agentic pipelines, because you don't want a reasoning loop firing on every trivial tool call, only on the steps that actually need it.
What I'd want to test first is how it holds up as an orchestrator in a multi-agent setup, specifically whether the sparse activation causes any latency spikes under concurrent requests compared to a dense model of similar active size. If the agentic coding benchmark numbers hold in practice, this could be a serious local alternative to hitting the Claude API for code-generation subtasks.
Empty_Bus9742@reddit
Hardware requirements to run locally?
baddhabbits@reddit
they even managed to make fake numbers for gemma 4 31b
relmny@reddit
And yet many people were saying, just a few days ago, that the last open weight model from qwen was 3.5 and so on...
Go Qwen!!!
Early_Play_1259@reddit
We use it in theranger.ai
Super strong and effective
One_Key_8127@reddit
"Across most vision-language benchmarks, its performance matches Claude Sonnet 4.5, and even surpasses it on several tasks"
Well, it surpassed Sonnet 4.5 on all the quoted benchmarks. Benchmarks are crap, but it looks very promising. Anyone knows if MLX fixed prompt caching for Qwen3.5? It was bugged before, making it a bad option for agentic use on Mac.
Dry_Syllabub_7570@reddit
I don't think the mlx prompt caching has been fixed yet. Was super bummed to encounter it last week. Tried running Qwen 3.5 and Gemma4 through MLX in Opencode, had to process the same 11K token prefix on every single call
SilentScribe42@reddit
Prompt caching issue got fixed in LM studio mlx engine recently.
mr_il@reddit
I use Qwen3.5 on MLX and also tried 3.6 just now with OpenCode, didn't notice any problems.
tredbert@reddit
I thought Gemma4 was pretty unimpressive for coding compared to Qwen3.5. Nice to see that validated here.
Looking forward to trying out Qwen3.6!
Are there are benchmarks on how it compares to the latest Sonnet and Opus?
Foreign-Bedroom-3063@reddit
Why still no 14b? It would be the sweet spot for my inference pipeline.
Predictor12@reddit
How can I run this locally? I want to use Hermes but it said I needed a provider (OpenRouter, to be exact). How can I run it fully offline on my PC? (I'm just starting with LLMs.)
danigoncalves@reddit
I am running out of disk guys!!
DominusIniquitatis@reddit
A3B is the key here. I ran Qwen Next 80B A3B without much problem at 12GB 3060.
danigoncalves@reddit
Really? What config do you use, and how much system RAM do you have? With the latest llama.cpp and the common parameters I can get 15 t/s. It's not bad, but I guess for long coding tasks it could be a little slow.
DominusIniquitatis@reddit
Default config (aside from the recommended sampling parameters), 32GB DDR4-2666, various 4-bit quants, 64k context. Yep, I've also been getting around 15-25 t/s, but that's not an issue for my use cases, given that I don't use LLMs for coding (too messy for my taste, so still doing everything by myself for now).
danigoncalves@reddit
I use it as a rubber duck and an inverted AI pair programmer, but yes, I have to review every single line of it, and for the most complex tasks these models still don't grasp the quality demands so well.
pneuny@reddit
Maybe REAP version at a low quant when it arrives?
danigoncalves@reddit
Hum... you are right maybe I can make something from that.
Hugi_R@reddit
Gave it a try, and I'm not impressed.
Plugged it into a code agent, in "ask" mode. Simple prompt, "How to add a second lib to this Rust project", and it immediately tried editing files. In code mode, it hallucinated the answer, desperately trying to debug invalid Cargo syntax (it tried a bonkers [[lib]] syntax, which is valid for [[bin]] but not lib, and got mad at cargo when it failed, convinced it was correct).
Gave the same task to Gemma 4, which handled the task like a champ. (properly set up a cargo workspace).
Qwen: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Gemma: gemma-4-26B-A4B-it-UD-Q4_K_M.gguf (smaller model, bigger quant while keeping the same context)
Due_Net_3342@reddit
It is expected; qwen has only 3B active vs 4B, so it is more prone to quantisation quality loss. Use Q8 and tell us what you get.
Hugi_R@reddit
Can't use Q8, that won't fit on the GPU. All models I use must fit in 24GB of VRAM with 100k context. Otherwise I have no use for them.
kmp11@reddit
All morning I have been trying to get Hermes and Gemma 4 31B to look at the menu of my local sandwich shop and tell me the daily specials, and they failed over multiple tries. Qwen3.6 was able to list the specials and place the order on the first try.
It allows me to use a much higher precision model while getting ~120 tk/s instead of ~15 tk/s (average).
Is this a scientific test? No, but a sandwich manifested itself, and that's already a win Gemma never had. It's worth using for the next day or two until the next better model drops.
computehungry@reddit
I also saw it messing up text recognition with default settings. I think, by default, gemma uses less tokens per image. You can set this to be higher. If you're using llama.cpp, try: --image-min-tokens N (I like N=300). image max tokens has to be set to be higher than N too. Gemma has nailed my ocr tasks after I changed this.
kiwibonga@reddit
Anthropic and OpenAI are so cooked.
It's so hard not to gloat in the "boohoo claude ate my tokens" threads when 99.99% of what they use it for can be achieved by 27B on $1000 worth of GPU.
Qual_@reddit
Cooked nothing, you mean.
People who will spend a thousand dollars' worth of GPU instead of using the SOTA models are so niche that it's a rounding error in their revenue streams.
Awkward-Reindeer5752@reddit
The only revenue stream that matters is enterprise customers. Most already use public cloud providers offering long-term leases on dedicated GPU instances capable of running models like qwen 3.6-397b for their many users at < 1/10th their Anthropic API bill. Anthropic and OpenAI’s only hope for a moat is regulatory capture.
Piyh@reddit
I regularly watch people at work run Claude Code queries that individually cost $10 to $15.
bnightstars@reddit
I was actually testing Claude Code to build an AI web research agent in Python using Sonnet 4.6 with AWS Bedrock, and the end result cost $10 in tokens.
Spectrum1523@reddit
This is what you think most enterprise customers are doing? Show me even a single one doing this.
Awkward-Reindeer5752@reddit
Where do I say enterprise customers are currently doing this? Sparse attention all over this bitch.
Spectrum1523@reddit
You literally said this
Awkward-Reindeer5752@reddit
Most enterprises do use one or more of AWS, Azure, or GCP. Because they are already doing so, it will become enticing for them to roll their own inference as part of their existing infrastructure when the cost savings vs. potential time/accuracy loss is undeniably favorable.
yaboyyoungairvent@reddit
I don't know what location or industry you're in but all I ever see at companies is either people running claude or copilot.
Awkward-Reindeer5752@reddit
I was being forward looking based on the progress open models are showing, especially at >200b parameters. I’m using claude code at work at opus api rates right now, but regularly test agentic coding with open weight models. We will have opus 4.7 equivalent open models in 2027 and I don’t see us paying over raw compute costs at that point.
jld1532@reddit
Buddy, I work for a multi-billion dollar entity, and we run nothing but oss models. Kimi K2.5 and MiniMax 2.7 are the big ones. There'll be a learning curve for people, but open weights are going to make a dent, and I suspect a big one.
kiwibonga@reddit
We're talking about customers that are paying $2400/year, in some cases, paying that much to both Anthropic and OpenAI, or holding multiple accounts, or accepting the insane premium API billing.
It's like if renting a ferrari for 4 months cost the same as buying a honda civic.
And it's like ferraris are expensive because ferrari gives away 10,000 free test rides for every 1 person that buys a car.
Main_Secretary_8827@reddit
Sadly not true; people who go out and buy Claude plans usually know what they need and do. Maybe for GPT users, perhaps.
TinyZoro@reddit
What’s the lowest spec Mac mini this could comfortably run on?
jacek2023@reddit
Fantastic news. 27B won the voting so let's hope all sizes will be released
coder543@reddit
yeah, it won the voting by a wide margin... yet this is what they chose to focus on? Funny/concerning.
I really want a new Qwen3.6 122B A10B model for my Spark.
mrrizzle@reddit
You assume the vote meant they would release the most anticipated first
nullmove@reddit
The 80B qwen3 next hybrid MoE took 9.3% of the compute to train compared to the 32B qwen3 dense model.
If they were doing both with equal priority, this one was always going to be far quicker to finish training. Obviously I don't know their plans, just casting doubt on the notion that they "focused" on this instead just because it came out the door first.
ROS_SDN@reddit
I would think the compute time would be a function of total parameters and active parameters, not just active.
Could you explain this to me a bit more since your number is nearly exactly 3/32?
Very interesting. I just didn't realise it was that computationally efficient, because I've felt small-to-medium MoEs have been a bit too sparse as of late, or that they should have a less sparse counterpart.
Like, a 35B A6B would be lovely, but when you say that would take 2x the compute for real, albeit marginal, gains in intelligence, I'm very likely to back track on this thought a little.
nullmove@reddit
I nicked the numbers from their blog post.
I think that's just a weird coincidence.
The standard function is C = f(D, F), where F is FLOPs per token, which is (roughly) a function of total params for dense models but only active params for MoEs (since each token is only routed to a subset of experts). The other important variable we must consider is D, the total number of tokens processed. Qwen3 32B was trained on 36T tokens, but Qwen3-next was only trained on 15T tokens. I presume they stopped there because it was an experiment and had already achieved the desired training goals and performance on downstream tasks.
A rough rule of thumb is C = 6 * D * P (P_active for MoE). Which means Qwen3-next A3B should have been ~4% of the compute of the 32B dense model (assuming some of the other factors like number of layers, sequence length etc. are the same), but there is also a significant MoE overhead that depends on a bunch of other things, and here total parameters can play a role (say you might need to split training across a number of GPUs, then you have communication overhead). Also how well the experts learn matters (load balancing with aux loss), and here the hybrid arch probably played a role too. No clue about the exact breakdown of numbers, but that's probably how it was ~9%.
But anyway, the broad picture is that MoEs are appealing because compute cost grows only with the active params and top-k experts, but model capacity grows with the active params and the total number of experts (E). Since usually E > k, it's a good trade-off. That said, in practice training MoEs is pretty hard as they introduce MoE-specific issues, like needing to load balance the router carefully or else expert collapse happens, and this might be one of the areas where frontier companies have secret sauce/experience which makes a massive difference. I haven't trained any model though, so I wouldn't really know lol.
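To make that arithmetic concrete, here's a tiny sketch using the C = 6 * D * P rule of thumb (token counts taken from the comment above; it ignores the MoE and communication overhead, which is why the real figure lands nearer 9% than 4%):
def train_flops(tokens, params):
    # rule-of-thumb training compute: C = 6 * D * P
    return 6 * tokens * params

dense_32b = train_flops(36e12, 32e9)   # Qwen3 32B dense, 36T tokens
moe_a3b   = train_flops(15e12, 3e9)    # Qwen3-Next, 15T tokens, ~3B active params
print(f"MoE / dense compute ~ {moe_a3b / dense_32b:.1%}")   # ~3.9%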
Borkato@reddit
This makes me think that we could have such insane MoE models if they really try even harder haha
nullmove@reddit
MoE training has a bunch of complications though, it's not just about compute. The gated network/router needs careful tuning for load balancing, otherwise experts collapse. There are other uniquely MoE related training instabilities to solve, and these challenges increasingly get harder the bigger the model you are training.
Qwen is really good at getting very good bang per buck up to a size. But every time they try to scale beyond that, it turns out pretty suboptimal so they stop. They still make decent progress each generation, probably through data pipeline refinement alone. But arguably the 3.5 series MoEs were kinda underwhelming at big sizes due to those issues.
That being said it does feel like they are pushing the envelope with 3.6 again. The big one (that they decided to close off) seems to be competing favourably with GLM-5 which is twice its size. Which makes me bullish about them, but again up to a limit.
cafedude@reddit
The voting thing was just a marketing move.
Borkato@reddit
I’m so happy, I wanted 35B 😂
soyalemujica@reddit
35B is easier to make than 27B dense.
stan4cb@reddit
If they released 27b first and then 35b, the 35b would be underwhelming; this way they keep us waiting.
I'd prefer 27b tho
Beginning-Window-115@reddit
probably training it longer since it's the most anticipated one
Beginning-Window-115@reddit
obviously we're gonna get both if not more
coder543@reddit
That is not obvious at all. The new management has to prove themselves after the old ones left.
Darkoplax@reddit
waiting for 9b
Much-Researcher6135@reddit
YES gimme that density
coder543@reddit
yeah, by a wide margin... yet this is what they chose to focus on? Funny/concerning.
I really want a new Qwen3.6 122B A10B model for my Spark.
genzpepega@reddit
I'm a noob. What program should I use to run this? Does it actually matter?
Future-Coffee8138@reddit
LM studio has GUI so could be a good starter. Just ask any AI to guide you through.
iMrParker@reddit
I daily 122b. I'll give it a shot and see how it compares
raveschwert@reddit
What's is your machine made of ?
Late_Film_1901@reddit
I'm running it on Strix Halo and it is very much usable. Qwen 3.5 122B beats every other smaller model for my use cases. I didn't have time to compare to MiniMax 2.7 at Q3.
SpicyWangz@reddit
Are you doing Q4? Mine never seems to work on opencode, it endlessly triggers compaction as soon as it tries to edit
Late_Film_1901@reddit
I haven't used opencode. The description sounds like context running out. I can try over the weekend.
AvocadoArray@reddit
Not OP, but 122b is very capable at 4bpw. 48GB VRAM + some CPU offloading will get you there, or 72GB+ in full VRAM with a good amount of context.
That said, I still prefer 27b at FP8 when it comes to complex tasks or coding.
spoonfulofchaos@reddit
Damn ! Seriously?! Is 27b that good? I can’t believe I haven’t tried it yet
Low-Boysenberry1173@reddit
I'm also using Qwen3.5 122B as a daily driver, with openclaw and for coding. But would you still prefer the 27b over the 122b MoE? In my head I want to utilize my hardware, so I feel uncomfortable using just the 27b. Any experience with agentic tasks?
AvocadoArray@reddit
Yeah, I thought the same and started out with 122b using Roo Code and Pi coding agent. It was great, but got hung up on a long complex task that required a lot of pre-planning. It sort of got it working, but made a mess of the codebase and it wasn’t very elegant at all.
I decided to throw it at 27b to see how it stacked up, and the difference was night and day, at least in the planning phase. The planning document was much more detailed and broken down into more rational steps so it didn’t have to figure things out on the fly.
As far as pure coding ability, it’s roughly the same. But the planning and reasoning is much cleaner, and it’s able to get itself out of loops or dead ends easier.
Since then, I run it in FP8 with ~130k context and it only takes up 60% of my 96GB VRAM, which is great because it leaves plenty of room for STT/TTS, image gen or whatever else I’m playing with at the time.
Late_Film_1901@reddit
Did you test capabilities between Q4 and Q8 ? I'm wondering how much can be gained by going with finer quantization.
iMrParker@reddit
I run it with experts offloaded to the CPU on a 5080 with 96 GB of DDR5. And I run Qwen3.5-27b on a 3090. I built this machine in Feb of 2025. If I had known what was coming...
I run Q4 quants for both because I use a 100k context window. 122b runs around 15tps and 27b runs at ~40tps.
I'm strongly considering getting an RTX Pro 5000 48gb to replace the 5080
grunt_monkey_@reddit
Im running 4x9700 and getting pp 230 t/s and decode 31 t/s on llama.cpp running the UD-Q6_K_XL.
sdexca@reddit
Curious as well.
Borkato@reddit
They answered in reply to their comment
colemab@reddit
RAM and the souls of the dead /s
Its-all-redditive@reddit
I’m using 122b nvfp4 in running projects so I would love to know your opinion.
AppealSame4367@reddit
I knew it was Christmas already. Saw a deer yesterday!
Euphoric_Emotion5397@reddit
LOL. I was just starting to download the Gemma 4 Opus distilled mixed. Hope this comes out in LM studio fast.
jadbox@reddit
Will there be a smaller parameter version for the GPU poor? (using q2 bits is generally unstable)
c64z86@reddit
If you have a fast enough CPU and enough RAM(64GB) you can load it in llama.cpp and run it from the RAM instead of VRAM. And with it being MoE with 3B active it will use less VRAM than a full dense 35B model would anyway.
Caffdy@reddit
DDR4 or DDR5? Full RAM or with VRAM into the mix?
c64z86@reddit
DDR5, it also fills up nearly 12GB of my RTX 4080 mobile so it's being used as well, but the CPU and RAM are doing most of the heavy lifting.
Free_Change5638@reddit
already abliterated this one if anyone wants an uncensored version: https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated
Free_Change5638@reddit
been playing with it today. ran abliteration on it already — https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated. quality feels preserved, refusals mostly gone.
work__reddit@reddit
Seemed to overthink a bit more than 3.5, I will stick with 3.5
=== Testing: qwen3.5:35b-a3b-q8_0 ===
🔥 Warming up model (may take 2-7 minutes)... ✅ Ready
reasoning [1/4]: A rectangular pen is built with one side against a barn, 200...
Run 1 → 36.229s | tps: 59.29 | answer: correct | code: n/a
Run 2 → 51.336s | tps: 59.49 | answer: correct | code: n/a
Run 3 → 36.458s | tps: 59.33 | answer: correct | code: n/a
reasoning [2/4]: Janet's ducks lay 16 eggs per day. She eats 3 for breakfast ...
Run 1 → 13.041s | tps: 58.05 | answer: correct | code: n/a
Run 2 → 16.199s | tps: 58.52 | answer: correct | code: n/a
Run 3 → 15.749s | tps: 58.35 | answer: correct | code: n/a
reasoning [3/4]: How many letter r's are in the word 'strawberry'?...
Run 1 → 6.701s | tps: 56.86 | answer: correct | code: n/a
Run 2 → 6.714s | tps: 56.75 | answer: correct | code: n/a
Run 3 → 6.709s | tps: 56.79 | answer: correct | code: n/a
reasoning [4/4]: Alice rolls a fair n-sided die (faces 1 to n) and Bob rolls ...
Run 1 → 138.022s | tps: 59.35 | answer: incorrect | code: n/a
Run 2 → 132.685s | tps: 59.39 | answer: correct | code: n/a
Run 3 → 130.022s | tps: 59.40 | answer: correct | code: n/a
coding [1/3]: Write a single example of runnable Python code to reverse th...
Run 1 → 21.041s | tps: 58.98 | answer: n/a | code: correct
Run 2 → 19.842s | tps: 58.97 | answer: n/a | code: correct
Run 3 → 19.748s | tps: 58.99 | answer: n/a | code: correct
coding [2/3]: Create a single runnable Python script with a function that ...
Run 1 → 6.828s | tps: 56.39 | answer: n/a | code: correct
Run 2 → 6.760s | tps: 56.80 | answer: n/a | code: correct
Run 3 → 6.975s | tps: 56.92 | answer: n/a | code: correct
coding [3/3]: Inside a single executable example usage python script that ...
Run 1 → 24.306s | tps: 59.04 | answer: n/a | code: correct
Run 2 → 26.988s | tps: 59.25 | answer: n/a | code: correct
Run 3 → 23.345s | tps: 59.16 | answer: n/a | code: correct
✅ Benchmark complete → benchmark_results.csv
=== Testing: qwen3.6:35b-a3b-q8_0 ===
🔥 Warming up model (may take 2-7 minutes)... ✅ Ready
reasoning [1/4]: A rectangular pen is built with one side against a barn, 200...
Run 1 → 35.836s | tps: 59.30 | answer: correct | code: n/a
Run 2 → 46.526s | tps: 59.49 | answer: correct | code: n/a
Run 3 → 37.122s | tps: 59.34 | answer: correct | code: n/a
reasoning [2/4]: Janet's ducks lay 16 eggs per day. She eats 3 for breakfast ...
Run 1 → 13.727s | tps: 58.13 | answer: correct | code: n/a
Run 2 → 13.932s | tps: 58.35 | answer: correct | code: n/a
Run 3 → 20.359s | tps: 58.94 | answer: correct | code: n/a
reasoning [3/4]: How many letter r's are in the word 'strawberry'?...
Run 1 → 11.986s | tps: 58.40 | answer: correct | code: n/a
Run 2 → 9.344s | tps: 57.90 | answer: correct | code: n/a
Run 3 → 6.174s | tps: 56.53 | answer: correct | code: n/a
reasoning [4/4]: Alice rolls a fair n-sided die (faces 1 to n) and Bob rolls ...
Run 1 → 137.788s | tps: 59.45 | answer: incorrect | code: n/a
Run 2 → 137.781s | tps: 59.46 | answer: incorrect | code: n/a
Run 3 → 137.806s | tps: 59.45 | answer: incorrect | code: n/a
coding [1/3]: Write a single example of runnable Python code to reverse th...
Run 1 → 24.354s | tps: 59.25 | answer: n/a | code: correct
Run 2 → 37.164s | tps: 59.52 | answer: n/a | code: correct
Run 3 → 36.608s | tps: 59.52 | answer: n/a | code: correct
coding [2/3]: Create a single runnable Python script with a function that ...
Run 1 → 37.335s | tps: 59.57 | answer: n/a | code: correct
Run 2 → 39.637s | tps: 59.49 | answer: n/a | code: correct
Run 3 → 38.342s | tps: 59.52 | answer: n/a | code: correct
coding [3/3]: Inside a single executable example usage python script that ...
Run 1 → 59.513s | tps: 59.60 | answer: n/a | code: correct
Run 2 → 62.388s | tps: 59.66 | answer: n/a | code: correct
Run 3 → 58.374s | tps: 59.63 | answer: n/a | code: correct
✅ Benchmark complete → benchmark_results.csv
🏆 MODEL RANKING (Based on LAST 3 RUNS)
Score = CorrectAnswers + 10/Latency
=====================================================
Rank Model Reasoning Coding Latency(s) Avg TPS Score
──────────────────────────────────────────────────────────────────────────────────────────
1 qwen3.5:35b-a3b-q8_0 11 9 35.509 58.38 20.3
1 qwen3.6:35b-a3b-q8_0 9 9 47.719 59.07 18.2
Here is the question qwen3.6 failed with.
"Alice rolls a fair n-sided die (faces 1 to n) and Bob rolls a fair m-sided die (faces 1 to m). n is the smallest composite number. m is the smallest composite number greater than n that is coprime to n. What is the probability that the sum of their rolls is a prime number? Express the answer as a simplified fraction a/b, and output the final answer as the value of a+b."
2 answers were blank, was the last correct or incorrect?
# cat outputs/qwen3.6_35b-a3b-q8_0_reasoning_p3_run1.txt
# cat outputs/qwen3.6_35b-a3b-q8_0_reasoning_p3_run2.txt
# cat outputs/qwen3.6_35b-a3b-q8_0_reasoning_p3_run3.txt
To find the probability that the sum of the rolls is a prime number, we first determine the values of $n$ and $m$ and then analyze the possible outcomes.
1. Determine $n$ and $m$
* $n$ (Smallest composite number): The smallest composite number is 4. Thus, $n = 4$.
* $m$ (Smallest composite number $> n$ coprime to $n$): We check integers greater than 4:
* 5 is prime.
* 6 is composite, but $\gcd(6, 4) = 2$.
* 7 is prime.
* 8 is composite, but $\gcd(8, 4) = 4$.
* 9 is composite and $\gcd(9, 4) = 1$.
Thus, $m = 9$.
2. Sample Space and Favorable Outcomes
Alice rolls a 4-sided die ($A \in {1, 2, 3, 4}$) and Bob rolls a 9-sided die ($B \in {1, 2, \dots, 9}$).
* Total outcomes: $4 \times 9 = 36$.
* Sum range: $1+1=2$ to $4+9=13$.
* Prime sums in range: 2, 3, 5, 7, 11, 13.
We count the pairs $(A, B)$ that result in these prime sums:
* Sum = 2: (1, 1) — 1 outcome
* Sum = 3: (1, 2), (2, 1) — 2 outcomes
* Sum = 5: (1, 4), (2, 3), (3, 2), (4, 1) — 4 outcomes
* Sum = 7: (1, 6), (2, 5), (3, 4), (4, 3) — 4 outcomes
* Sum = 11: (2, 9), (3, 8), (4, 7) — 3 outcomes
* Sum = 13: (4, 9) — 1 outcome
Total favorable outcomes: $1 + 2 +
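For reference (not part of the original comment), a quick brute-force check of what the answer should be: n = 4 is the smallest composite, m = 9 is the smallest composite above 4 coprime to 4, and enumerating all 36 outcomes gives 15 prime sums, i.e. 5/12, so a+b = 17.
from fractions import Fraction

def is_prime(x):
    return x > 1 and all(x % d for d in range(2, int(x**0.5) + 1))

n, m = 4, 9   # smallest composite; smallest composite > 4 that is coprime to 4
hits = sum(is_prime(a + b) for a in range(1, n + 1) for b in range(1, m + 1))
p = Fraction(hits, n * m)
print(p, p.numerator + p.denominator)   # 5/12 17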
Thrumpwart@reddit
This looks really interesting - Qwen3.6 RYS model with 10 duplicated layers.
https://huggingface.co/DJLougen/Ornstein3.6-35B-A3B-RYS-GGUF
Claims significant reasoning improvement. Downloading now.
FoundationFirm6934@reddit
Great work
wakaokami@reddit
Not sure if this is the right place to ask, but I used to be a heavy user of Claude Code. Recently, the usage limits have made me reconsider, so I’m looking into running local LLMs.
Qwen 3.6 looks promising, and I don’t necessarily need something state-of-the-art.
Do you have any recommendations on what kind of hardware I should be looking at, or any useful resources to get started?
My main use cases are coding and generating research ideas.
PlatypusMobile1537@reddit
c64z86@reddit
I'm updating my experience of it. Qwen 3.6 is fantastic, when it gets something right it really gets it right and the quality is well beyond even Gemma 4 26B.. but it also gets things wrong a lot of the time.
AdUnlucky9870@reddit
3b active params doing coding on par with models 10x the size is wild. been running qwen3.5 for a few weeks and it already punches way above its weight, cant wait to see what the quants look like on this one
Phaelon74@reddit
Their benchmarks are a bit misleading, as that looks to be gemma4-31b non-it. Would love to see where gemma4-31b-it is on that graph.
fastlanedev@reddit
Just tried it on Qwen Chat, very disappointed. Endless thinking loops, can't do a simple comparison pulling in benchmark data on itself, more thinking loops, etc., doing things I explicitly said not to do, spending thinking tokens on lecturing me about model capabilities. It couldn't even find Qwen 3.6 35b A3b when I spelled it out.
Maybe it's the chat harness, but that's pretty disappointing considering the team that developed it should have that under control.
May try it later on a simple harness like pi
paq85@reddit
It works really good, but I'm facing lots of tool calling issues when used via opencode and used context goes above 100k... Anyone solved this perhaps?
jstraj@reddit
I am having really good results on my Nvidia 4070 Super (12 GB) with 32 GB RAM. I've only tested it lightly, but I am getting somewhere between 43-52 t/s depending on the prompt.
Here's my config:
Although, I am getting the best performance of about 52 t/s with --n-cpu-moe=17, but that is only possible with a short context size (16k).
Xyrus2000@reddit
I can second this. I'm running on a 4080 super with an Intel Ultra 7. I have a similar setup in LM studio, and I'm hitting around 66 t/s sustained. I use the "experimental" option of forcing the expert layers to the CPU (set to 20).
I'm going to put it through its paces with some coding tests and see how it does.
jstraj@reddit
Since your VRAM is bigger (16 GB vs 12 GB), you can also test using lower values for `n-cpu-moe`. I think you'll get better t/s.
Try between 10 - 18.
GeorgeTheGeorge@reddit
What are you running it with? I've been thinking things couldn't get much better than Gemma 4 26b A4 with LM Studio on a 4080, so what you're saying is very exciting.
jstraj@reddit
I am running this using llama.cpp using ini file. The config is provided above.
--Rotten-By-Design--@reddit
I would be more excited if the smallest useful version was not 24GB.
evilbarron2@reddit
But it’s an MOE model, so it’ll only have a subset of layers active at any given moment and should be runnable on <24gb, right?
--Rotten-By-Design--@reddit
I am fully aware of that, but the qwen3.5-35b-a3b is not as big. And yes, you can run it on 24gb, but that leaves no VRAM for context, not even a little for a Chrome tab on a 24GB card, making it much slower and, for me, useless at the speeds I get on my CPU/RAM.
evilbarron2@reddit
Hmm. I’m running Qwen3.5-35B-A3B-UD-Q4_K_M.gguf on a 3090 with 132k context using llama.cpp and the llama-serve WebUI reports 150t/s - I haven’t seen any reason not to expect the same from 3.6. I’ve disabled reasoning entirely and I don’t miss it. This works pretty well as a backend for Hermes agent and openclaw. Not so hot for multiuser, but this is my personal home lab and development box.
--Rotten-By-Design--@reddit
3.6 is much bigger, so it can't be the same, unless they changed its efficiency also.
You get better speeds than me, but I use LM Studio, and also your RAM is most likely much faster than my 64GB DDR4 3600MHz, so that might help you
ea_man@reddit
LM Studio is probbly your problem, try llama.cpp
--Rotten-By-Design--@reddit
Yeah could be, I just like the simplicity of using LM studio, but I may test it at some point
ea_man@reddit
Windoze +LM Studio gave me some 1/2 performance with some MoE models, YMMV
--Rotten-By-Design--@reddit
Nice knowing. Will have to test at some point. Thx
Top-Rub-4670@reddit
They're literally the same size at the same quant, as you'd expect them to be.
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/tree/main
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
--Rotten-By-Design--@reddit
Not quite from my perspective. The ones you find in LM Studio are 22GB or slightly less in the q4_k_m quant. Dunno if that means the other software downloads something extra that LM Studio already has built in.
But it could mean that Qwen3.6 will also be 22GB in LM Studio, in which case I will be happy.
TexasBryan14@reddit
How do you enable the non thinking mode when using this with openclaw? Thanks!
tombhayya@reddit
Do we have MLX version of it?
FaceDeer@reddit
Ooh. I've been putting off switching my local workflows over to Gemma4 due to all the churn about its template format and so forth, looks like I might actually be skipping it instead.
FormalAd7367@reddit
i thought alibaba won’t release any open source model anymore?
GregoryfromtheHood@reddit
Using llama.cpp I'm running into an issue with it sometimes just outputting a thinking block with a tool call and nothing else, which breaks things. I've had to run it with reasoning off where I can run 3.5 with it on.
viperx7@reddit
3.6 27B will be gold. What happened to the poll on Twitter? 3.6 27B when?
ea_man@reddit
I mean ain't a dense 27B slower to test / train than a MoE?
dkeiz@reddit
Nope, it could actually be the opposite; dense testing is more predictable.
DistanceSolar1449@reddit
False, training/distilling scales linearly on active params * number of tokens trained
mr_il@reddit
I really hope so! If it hikes on SWE-Bench Pro as much as the MoE model, it'll basically be a GLM-5 level model that you can actually run on Spark DGX or M5 Max. If with DFlash and DDTree it could run at something like 40 tok/s, it'll be pretty much the dream coder.
pigeon57434@reddit
I think maybe they're gonna do the other models first because those are the most desperately in need of upgrading and super undercooked, whereas 3.5 27b is already extremely well cooked.
lemon07r@reddit
Im excited for the 27b model. I think gemma 4 has some rough competition.
DingyAtoll@reddit
Did they fix 3.5’s issue of using tens of thousands of thinking tokens for no reason?
Sakatard@reddit
Holy fucking fuck
somerussianbear@reddit
Countdown to Qwen3.5-A3B-Opus-4.7-Reasoning-Heretic-Abliterated-Uncensored-GGUF
de4dee@reddit
dont forget "ahuahuahua"
Familiar_Wish1132@reddit
Maybe also excited about the Qwen3.6 RYS type of models? Have you read about them?
https://dnhkng.github.io/posts/rys-ii/
https://huggingface.co/mradermacher/Qwen3.5-27B-heretic-RYS-XL-i1-GGUF
openclaw-lover@reddit
AI is experiencing an exponential growth. My linear brain just cannot keep track of all the great releases!
vex_humanssucks@reddit
The 3B active parameter count is what makes this really compelling. Running 35B-class reasoning on consumer hardware with that efficiency ratio is a big deal. The Apache 2.0 license is the cherry on top — looking forward to seeing what fine-tunes emerge.
Fit-Palpitation-7427@reddit
How is it compared to qwen 27b?
ComfyUser48@reddit
I am getting 166 tok/sec on my 5090, Q5 unsloth, with 215040 context. And it works so well!
Tough_Frame4022@reddit
Jack Rong is on this
FrogsJumpFromPussy@reddit
So no smaller models? 😭
olearyboy@reddit
Ok a few tiny tips
Small changes but makes it so much easier to read
vogelvogelvogelvogel@reddit
yessss thank you alibaba!!
Industrialman96@reddit
Is it possible to use it via cli for free?
StateSame5557@reddit
I started running performance metrics, it scores considerably higher in instruct mode, even at lower quant
Qwen3.6-35B-A3B-qx86-hi
https://huggingface.co/nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx
Beautiful-Floor-5020@reddit
IVE BEEN WAITING
Icy_Anywhere2670@reddit
For a girl like you
Beautiful-Floor-5020@reddit
3.6B 27B gona be game changer.
To be honest this is the one model I would spend $ on better hardware to run the highest size 🤣
New-Inspection7034@reddit
I've tested both the Qwen 3.5 27b and the Qwen 3.6 35b-a3b, both in my Visual Studio extension that I've written to do agentic coding. They both seem pretty comparable in how smart they are, but the 3.6 MoE is a lot faster. I'm going to be interested when I get my beast with that RTX 6000 with 96 GB of VRAM; I will be able to use the Q8 version of the 3.6 MoE. Maybe an unlobotomized version will work better.
FatheredPuma81@reddit
Community: "We're most excited for Qwen3.6 27B!"
Qwen team: "Okay here's Qwen3.6 35B!"
Well I for one am still happy.
Iory1998@reddit
Wait what! Didn't the 27B win the most votes? WTH?
Fault23@reddit
122B please
MindRuin@reddit
does it normally get dropped after?
Desther@reddit
Where is the swe-bench gemma result? Cant find it in official Gemma press or on swe-bench results table. Did they hallucinate it lol?
IronColumn@reddit
Base model m1 max studio with 32gb of ram:
between 25-32 t/s
3.5 27b: 9-12 t/s
unbannedfornothing@reddit
Guys at Holy Qwen Mother of LLM models, please 397b 3.6!
kl__@reddit
Great work Qwen team
Acu17y@reddit
❤️❤️🔥😍
_derpiii_@reddit
What’s the Apple memory requirements?
realmosai@reddit
190 t/s on Pro 5000? Holy Moly, am I doing something wrong? isnt this too fast?
ProfessorWar001@reddit
Guys, because I am a bit new to this field, can someone tell me why it looks spectacular? When I look, I only see nearly no improvement over the 27B model and nearly the same as Gemma. What is the most important thing in those benchmark results that would make you choose the 3.6 35B rather than the 27B version?
Fit-Pattern-2724@reddit
More impressive than opus 4.7 lol
SirSod@reddit
Fucking awesome at code. In all the rest it loses to Gemma 4 26B.
Alarming-Contest3736@reddit
Can someone explain like I’m 5. Is this comparing running all of those locally? I see google there; is that web based Gemini? What about considering the local hardware?
Zealousideal_Fill285@reddit
The compared models are local models of similar size. Those models are Qwen 3.5 (the previous version) and Gemma 4 (also a local model; it's not Gemini).
seppe0815@reddit
For German writing stuff, what is better, this or Gemma 4? Please, I don't understand the benchmarks.
Gleethos@reddit
Oh my goodness! Please tell me this is no bench maxing!!!
Ok_Study3236@reddit
I don't want to suggest Google is some panacea of benchmaxxing, but aren't such huge contrasts in benchmarks between equivalent-size models at least a little suspicious? My initial thought looking at the post was "overfitting", especially after spending some time with Gemma.
Sadman782@reddit
As I always said, a little benchmaxxed. Not directly, it is indirect. But anyway, they are quite good for some tasks too, though overall Gemma 4 is better for most tasks.
pneuny@reddit
That's for the older Qwen. Not 3.6
Naiw80@reddit
Now this is a model that appears to work just fine with openclaude... Unlike gemma4 which still is completely useless for agentic work.
uniVocity@reddit
I got the BF16 quant to run on my M4 Macbook Pro Max with 128gb of ram. LMStudio runs this at 40 tokens/sec which is not bad.
I asked it to refactor some non-trivial Java code that had a bit of overlap into something cleaner, and it did a better job of giving me clean, less cognitively loaded code than Gemini Pro; it just had a compilation error that was easily fixed.
One thing that keeps me using online models is the time to wait before the model spits an answer out. I wonder if there are any recommended settings specifically for coding tasks.
Nutty_Praline404@reddit
Running A3B-UD-Q4_K_M well at ~50 tok/s on my RTX4060 Ti 16GB (Win11 i7-13700F 64GB) with the following:
Sticking_to_Decaf@reddit
Tool calling in Hermes Agent is very good. A couple minor hiccups but both were things Hermes Agent recovered from immediately without user intervention.
duebina@reddit
Does anyone have any direct experience with Qwen Next Coder that could offer a comparison with this model? I use the 8-bit quant version with OpenCode every day and it performs great. But this is a smaller model, which would be even better if it exceeds it in capability.
BumblebeeParty6389@reddit
I'm so glad they didn't listen to that BS twitter poll
mr_il@reddit
On M5 Max at 32k context prompt processing 2047 tok/s, token generation 62 tok/s. Sweet!
Ferilox@reddit
Is it possible to run this on 12G VRAM with decent performance and speed? Whats the minimum VRAM thats usable?
Kodix@reddit
We don't know yet. There isn't even a GGUF. But due to it being MoE it likely handles CPU offloading pretty well, and the previous Qwen models quantized very very well.
Meaning: when it comes out, try the UD-IQ3_XXS quant or something like it, and see for yourself.
trying4k@reddit
At Q3, isn't 9b the better option (for 3.5)? How do lower quants impact things like code quality?
Kodix@reddit
Dunno, couldn't tell ya exactly. But what I *can* tell you is that, according to Oogabooga using this research methodology, Qwen3.5 A3B at UD-Q8_K_XL (so the largest quant available) has a KL divergence of 0.093 and top-1 of 96%. While UD-IQ3_XXS has a KL divergence of 0.262 and top-1 of 89%.
Top-1 is the more illustrative statistic here, I think - the IQ3 quant's most likely token to pick is 89% the same as the completely unquantized model, and the Q8's is 96% the same. That difference seems tiny to me (and is significantly larger for his Gemma benchmarks, that's why I say Qwen quantized well).
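For anyone curious what those two numbers measure, here's a minimal sketch (not the actual methodology referenced above, just an assumed setup): compare the quantized model's next-token probability distributions against the full-precision model's, position by position.
import numpy as np

def kl_and_top1(p_full, p_quant, eps=1e-10):
    # mean KL divergence KL(full || quant) and top-1 token agreement
    kl = np.sum(p_full * (np.log(p_full + eps) - np.log(p_quant + eps)), axis=-1)
    top1 = (p_full.argmax(axis=-1) == p_quant.argmax(axis=-1)).mean()
    return kl.mean(), top1

# toy distributions just to show the shapes involved: (positions, vocab)
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))
p_full = np.exp(logits); p_full /= p_full.sum(-1, keepdims=True)
p_quant = np.exp(logits + 0.1 * rng.normal(size=logits.shape))
p_quant /= p_quant.sum(-1, keepdims=True)
print(kl_and_top1(p_full, p_quant))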
trying4k@reddit
Thank you for the information!
ea_man@reddit
Hey you can even run 27B IQ3 on 12GB ;)
lolwutdo@reddit
just cpu offload, it's fast as hell even with cpu
DefNattyBoii@reddit
Sadly not unless you are willing to go to the 2 bit category. I'm able to run 27b gemma4 moe model with a very small cache with IQ3_XXS on my 3080ti. 12 gb bros rise up.
sagiroth@reddit
I ran 3.5 on 8GB VRAM at 44 tkps with 64k context and 32GB RAM.
Live-Possession-6726@reddit
For folks with a DGX Spark/GB10, this model is ridiculously fast for Atlas Inference (~115 tok/s). We've got run commands on our website atlasinference.io and Discord, and plan to open source this week!
ResearcherFantastic7@reddit
Someone tag jack for the qwopus gguf version now!
feverdoingwork@reddit
We are all waiting for this guy to get to work
Eyelbee@reddit
I really hope this doesn't mean they won't release the 27B size class version.
9gxa05s8fa8sh@reddit
I have been using the full qwen 3.6 and it works REALLY well, like mimo and glm. very close to the big names, and good enough to not pay for the big names...
Ecstatic_Country_610@reddit
When will we see light weight versions of this? I wanted to try using it with Void (VS Code Fork)
Far-Low-4705@reddit
OH MY GOSH THEY FIXED THE OVER THINKING!!!!
"hi" -> only 200 output tokens (down from like 4-8k tokens)
_BigBackClock@reddit
oh helll yeah, I used to pray for times like this. Thank you alibaba daddy
MaCl0wSt@reddit
> Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.
what's this mean
LinkSea8324@reddit
iirc thinking content is supposed to be stripped when moving to a new message, now you can keep it and it will use it (could be kept but ignored before ?)
MaxKruse96@reddit
gguf where
vladlearns@reddit
it should have been gguf when
Opteron67@reddit
no gguf, FP8 is the way
Specter_Origin@reddit
Unsloth already released it
mintybadgerme@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF :)
Genebra_Checklist@reddit
Here. Unsloth posted 3 min ago. That was fast lol
MaxKruse96@reddit
u got baited, only repo is there, no files yet
the__storm@reddit
It's up now
Ok_Technology_5962@reddit
But redownload 3 times after... Lol
Genebra_Checklist@reddit
Yeah, I saw that the second I posted. Kind sloth of me actually
Long_comment_san@reddit
Unsloth being totally-not-a-sloth
Lowkey_LokiSN@reddit
Same model type as qwen3_5_moe. Should be there soon!
MaxKruse96@reddit
i was joking sir, i know
hyrulia@reddit
A new Qwen (3.5)
The Gemma (4) Strikes Back
Return of the Qwen (3.6)
Best trilogy ever!
Fault23@reddit
There's one left...
xignaceh@reddit
Mistral :(
LingonberryMore960@reddit
So I downloaded it, tested it, and deleted it, so I can say this is the most stupid model I've ever tried for the things I use AI for. Gemma 4 31b is a literally godlike model compared to this.
LingonberryMore960@reddit
To give a bit more context on why this model is bad for me: I use AI for 3D work (in SideFX Houdini). Everything mostly happens in a Python environment and is based on HScript. Strong models clearly understand that environment and produce excellent results; it's actually a very good testing ground for the "intelligence" of an AI model. So far, I've tested quite a few models, and none have performed as well as Gemma 4 31B, which delivers impressive results even though I didn't expect that at all. Usually, smaller models up to 35B parameters don't perform well in that environment. When I tested Qwen 3.6, it literally had no idea where it was or what it was supposed to do; it produced so many random outputs that didn't even match the prompt.
Similar_Sand8367@reddit
Wasn’t this the Model Generation which doesn’t run offline in ollama anymore?
funding__secured@reddit
Where's 397b?
Reddit_User_Original@reddit
Greatest 2 months of human history
Dangerous_Bad6891@reddit
THE GOAT!
Blues520@reddit
How do they keep on winning?
Omnimum@reddit
Oof I feel bad for Google
jmakov@reddit
No comparison to GLM-5.1?
TurnUpThe4D3D3D3@reddit
Wow
nabeelkh5@reddit
Excited to try this, downloading now :)
Dany0@reddit
201 tok/s on an rtx 5090. ud q4 in ik llamacpp with the ggml graph reuse patch manually applied
apeapebanana@reddit
just got qwen3.5 started running yesterday with 180k context, now I'm using qwen3.6 that just released by unsloth, its FLYING!!
so far been using with pi to code out wordpress, oh my, the terminal is running smoothly so far without hitches
hakanavgin@reddit
These are great and all, but what happened to nearly all the major labs releasing 14 to 22B models? There used to be a time when consumer-grade GPUs with 16 GB of VRAM could fully offload them with non-quantized KV and q4-k-m or even iq6_k. Nowadays it is ALWAYS either heavy quantization, so your model is lobotomized, or heavy RAM/VRAM partitioning, so your model is painfully slow.
If there was an architectural reason, these labs shouldn't be able to create 0.6B to 9B models either, so it is more of a decision, and their decision is to support unified memory and "I've just got 20 more H100's and running an obscure stack of circular validation for my bouncing balls benchmark" bros, seems like. It is very disappointing
audioen@reddit
Quick test says that this model is solid. Locally, on Strix Halo, this is going to replace 3.5 122B-A10B at least for today. Maybe tomorrow the 3.6 122B-A10B replaces it again, but I'm not sure. Speed also matters, and this seems to be working well and doing agentic tasks correctly. At least it's much better than the 3.5 was, I'm sure of that.
I'm using Q8_0_XL size for initial testing, though in reality I'll probably re-download the Q6_K_XL for slightly more speed. (About 1000 tok/s pp, 50 tok/s gen on Strix Halo.)
renczzz@reddit
So far I've got some good results.
Running unsloth/Qwen3.6 35B A3B IQ4_XS fully in the VRAM on AMD RX7900XTX on Ubuntu with 128k context.
It looks like it gets the job done faster than the 3.5 A3B model. Prompt processing is way faster; tok/s on the response is around the same as the 3.5 model in my experience.
Don't forget to configure the LLM with the right parameters if you use the unsloth models to prevent repetition of thinking: https://unsloth.ai/docs/models/qwen3.6
Happy so far with this update!
mumblerit@reddit
Unsloth q8_0 gguf in llama.cpp
mohammed_28@reddit
Qwen never ceases to impress me.
aschroeder91@reddit
This MoE is 256 experts with 8 active experts: that's a 1:32 ratio, giving nice speed. Given how wide people's computation requirements and goals are, I still think there is space for a 1:8 ratio with quality closer to the dense model but still enough of a speed bump to make agentic/reasoning work fast enough to make sense. Just verbalizing my wishlist; the Qwen team is giving us so much already, I can't complain.
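For anyone unfamiliar with what "8 of 256 experts" means in practice, here's a toy sketch of top-k routing (illustrative only; real Qwen routing, expert sizes, and load balancing are more involved). Only k of E expert weight matrices are touched per token, which is where the 1:32 speed and memory-bandwidth ratio comes from.
import numpy as np

E, k, d = 256, 8, 64                          # total experts, active experts, toy hidden size
rng = np.random.default_rng(0)
router_w = rng.normal(size=(d, E))            # router projection
experts = rng.normal(size=(E, d, d)) * 0.02   # one toy weight matrix per expert

def moe_layer(x):
    scores = x @ router_w                     # score every expert
    top = np.argsort(scores)[-k:]             # pick the k highest-scoring experts
    w = np.exp(scores[top]); w /= w.sum()     # softmax over the selected experts only
    return sum(wi * (x @ experts[i]) for i, wi in zip(top, w))

out = moe_layer(rng.normal(size=d))
print(out.shape, f"active expert fraction ~ {k / E:.3f}")   # ~0.031, i.e. 1:32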
Temporary-Roof2867@reddit
On languages, "gemma-4-26b-a4b" is superior to all the Qwens imaginable. Let's not joke!
This data is fake!
AvocadoArray@reddit
This is exciting, but I have to wonder how long before we’ll see another open model push the boundaries like this. We might not see another open release from Qwen at all, and I don’t see any other teams competing in this size range in the near future.
The 3.6 series might be the king for a long time.
ea_man@reddit
What I'd like to see is a full vertical optimization of something like OpenCode with a free QWEN, so that we have reliable tool calls.
Training + jinja templates in the GGUF + prompts / roles in the IDE.
gurilagarden@reddit
Sweet, hopefully have Q4s before dinner.
Late_Film_1901@reddit
before dinner? I'm at 79% download already!
gurilagarden@reddit
damn...that was fast, went from empty repo to packed in less than an hour.
No_Mango7658@reddit
Strix halo:
total duration: 29.425241762s
load duration: 97.931413ms
prompt eval count: 471 token(s)
prompt eval duration: 653.007273ms
prompt eval rate: 721.28 tokens/s
eval count: 1259 token(s)
eval duration: 28.336498498s
eval rate: 44.43 tokens/s
Dry_Yam_4597@reddit
May the good heavens of AI bless Qwen.
Temporary-Roof2867@reddit
Qwen3.6-35B-A3B is certainly an interesting model but... some of these data seem a bit rigged to me, I don't trust them at all! 👀🤔
Healthy-Nebula-3603@reddit
So we are waiting for quem 3.6 27b dense :)
StupidityCanFly@reddit
Qwhen Qwen?
Healthy-Nebula-3603@reddit
Ups .. autocorrected
Sticking_to_Decaf@reddit
Running the Qwen official FP8 on a single Pro 6000 max-q gpu in vLLM: ~200 tps decode for 1 request ~300 tps decode for 2 concurrent requests
Tool calling in Hermes Agent is working well so far but needs more robust testing.
PlainPrecision@reddit
Can this run on a 16GB 3090?
the__storm@reddit
There's a 16GB 3090 ?
Anyways at 16 GB you'll need to offload to system memory (will still be useably fast). 24 GB you could squeeze it in but I'd probably run 3.5 27B instead.
90hex@reddit
According to this it beats Gemma4 in all benches. Can’t wait to give it a go.
Kahvana@reddit
Awesome! Can't wait to test its vision encoder and see if it still reports non-Genshin anime characters as Genshin characters.
Hope they'll do the other models too, especially 122B-A10B and 2B.
cr0wburn@reddit
Qwen 3.5 is amazeballs, i cant wait to test this one! Thank you qwen team!
Big_Mix_4044@reddit
We feast today!
Speedping@reddit
Does anyone know why the mlx-community version is so big? 90GB for 4 bits, while 3.5 was 20GB for 4 bits with the same parameters (35B A3B).
Kaljuuntuva_Teppo@reddit
Noice, looking forward to Qwen3.6-27B the most.
I thought that one won the poll they did to gauge which model to release first, but I didn't keep track until the end 😅
bithatchling@reddit
This looks like a really interesting release! I'm always excited to see new models come out that can potentially help us all build cooler things. Thanks for sharing the news!
Direct_Technician812@reddit
Qwen 3.6 💀👑. Gemma is outdated.
Manaberryio@reddit
Around 30 tps with my RX6800. So glad!
No_Doc_Here@reddit
I'm hoping for 122B.
Not mad if they don't release it (we're owed nothing), but the 3.5 FP8 of that one is our current workhorse.
AlreadyBannedLOL@reddit
Didn’t the 27b dense model win the poll last week? Now we get a MoE.
I mean, I am going to take it but I was waiting for 27b.
This_Maintenance_834@reddit
It will come, be patient and enjoy.
JLeonsarmiento@reddit
Oh gosh, just when I started to go with gemma4 for everything…
This_Maintenance_834@reddit
Now, we will be waiting for the dense one.
sagiroth@reddit
Gguf , 27b when ?
ea_man@reddit
And don't forget Omnicoder 3.6
dinerburgeryum@reddit
Yeah 3.6 27B will be the one to beat if the 3.5 model is any indication.
Serious-Log7550@reddit
Qwen 3.5 passes that test :(
One_Key_8127@reddit
Guys, I liked this test prompt but it's probably cooked by this point. Qwen3.6 35B A3B passes it even without thinking. What's interesting is that "Qwen 3.6 Plus" fails without thinking. It might have made it into the training data...
FinBenton@reddit
I mean, that's pretty much a pass.
Serious-Log7550@reddit
You're right, my bad. Tried the `I want to wash my car. The car wash is only 100m away from my house, should i walk or drive?` prompt and it works well:
Kodix@reddit
What do you mean? That's basically a pass. It says if you want to wash it you need to drive it there.
No_Swimming6548@reddit
The question doesn't even say "I want to wash my car" lol
Serious-Log7550@reddit
My bad, just omitted it :(
One_Key_8127@reddit
Right. The response is kind of awkward, but this version of the question is poorly worded too. Still, the "(in which case, you'll obviously need to drive it)" indicates that the model grasps the concept of needing the car itself at the car wash in order to wash it.
And it seems ~800 tokens were used for this response - which is great, 3.5 usually used way more tokens.
One_Key_8127@reddit
Where did you test it? Is it quantized?
Qwen3.5 35B A3B answers it correctly after producing ~5k+ thinking tokens. Gemma4 (local, quantized at ~Q4) answers it correctly producing ~500 tokens.
alexx_kidd@reddit
3.6 9b when?
Ok-Measurement-1575@reddit
So is this the 2507 moment for 3.5?
DefNattyBoii@reddit
Someone pls make a turboquant,polarquant,bonsai,howevertheynamedthenextgenquant weights sub 4 bit!
StrikeOner@reddit
I'm working on the new 0.1-bit quant at the moment. Stay tuned!
DefNattyBoii@reddit
no bits llm
--Rotten-By-Design--@reddit
3.6 is much bigger, so it can't be the same, unless they changed its efficiency also.
You get better speeds than me, but I use LM Studio, and also your RAM is most likely much faster than my 64GB DDR4 3600MHz, so that might help you
Blaze6181@reddit
So I don't need to buy a PRO 6000? Thank you 😭😭😭
henk717@reddit
Eagerly waiting for the GGUF (and the 27B version). I didn't like the last 35B since it wasn't good at my use cases, and I suspect the same will be true here, but I'd be happy to be pleasantly surprised. Its coding being on par with the 27B would solve at least one of those.
I expect the 27B to be in the works too since it won their Twitter poll; if it's like 3.5 but without the looping bug, I'd be very happy.
Zc5Gwu@reddit
I haven’t run into the looping bug recently with the 27b. I’ve seen it with the 35b though.
henk717@reddit
It's not a constant thing. It probably doesn't help that I prefer heretic models, which may be more prone to it. For me a reliable way to test it is playing 20Q with the model where the answer is electricity. I can't make it through the 20 turns without it looping.
Another easy one for me was this one: Hey Gemma! Is concedo cooked or boiled?
(It's based on an in-joke, but because it makes no sense the model will endlessly think about it)
LegacyRemaster@reddit
it's a beautiful day
mtmttuan@reddit
Yeah, the model seems better than its competition, but now even Qwen is doing the bullshit charts that start the axis at a value just a bit lower than the competitors' score to make their model look way better, huh. That's kind of low.
DOAMOD@reddit
I'm testing it out and it's thinking a lot, but it seems very intelligent. I think I'm going to like it. I'm really looking forward to seeing the 27b and what it can do.
No_Lingonberry1201@reddit
Ahh, just what the doctor ordered!
morphlaugh@reddit
I'm downloading now... gonna try the Q4_K_XL quant since the full BF16 is just a little too big to run on my setup at 71GB *cough*
iamagro@reddit
Are these benchmarks for the 8-bit quant or BF16?
JHShim1@reddit
Wow, if 35b a3b got that better, then the 27b... hoping for it to come out soon!
westsunset@reddit
I know 🤞
popsumbong@reddit
That was fast
DOAMOD@reddit
Gemma 4 is dead?
caetydid@reddit
uses less vram, so no!
DragonfruitIll660@reddit
Haven't tested the new Qwen yet but I wouldn't think so. Gemma 4 I'd argue is likely to stand out for this generation/release cycle.
Kodix@reddit
Relax. Benchmarks never tell the whole story. Actual community reactions and personal testing are king.
Willing-Toe1942@reddit
heretic when?
Side note: in my benchmarks for agentic workflows and coding, I found the heretic versions (1.2 ara method) of any model are waaaay better in performance and token efficiency, and tend to use the right amount of thinking without going crazy in loops.
This applies to both Gemma4 and Qwen3.5, so hopefully the heretic for Qwen3.6 will be better too.
bartskol@reddit
So now we are waiting for GGUFs, a llama.cpp update, and the best settings for it. Will be ready for the weekend :)
ustas007@reddit
anyone tested against gemma4:27B?
celsowm@reddit
Surpassed Gemma 4 31b?
Kodix@reddit
Benchmaxxed to do so. We'll see the reality soon enough. Hopefully yes, though!
KageYume@reddit
I doubt it would surpass Gemma 4 (26B A4B and 31B) in translation but I won't complain if it actually does.
Kodix@reddit
Gemma 4 is also amazing for roleplay, to the point where you don't even really need to uncensor it.
But yeah. This is a win-win either way.
Septerium@reddit
I just can't believe this thing... can't wait to test it for myself... is it possible without benchmaxing?
Serious-Log7550@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
GGUFs incoming!
KringleKrispi@reddit
That's the one!
Prestigious-Use5483@reddit
Amazing. I was starting to get used to MoEs with Gemma 4 26B A4B. This should hopefully be a nice upgrade.
Beginning-Window-115@reddit
if we got this much of a performance improvement imagine the qwen3.5 27b version
Paradigmind@reddit
*3.6
Beginning-Window-115@reddit
thx
ciprianveg@reddit
when 3.6 397b? glm 5.1 is just too big for my local machine..
root_klaus@reddit
So amazing, I hope we get a 27B and a 9B model. The 9B is good for extraction tasks and so convenient, and a 4B would be fantastic. I hope they release all the small models! LET'S GO!!
Fun-Farm-452@reddit
I can't wait!!!!! When will the GGUF be released!!
throwaway957263@reddit
Do you reckon any decent quant of it could hit decent tps (30+) on a 5060 Ti 16GB + DDR5 rig?
I got ~30 tps for Q4 26B Gemma and ~85 for one that specifically fits a 16GB-VRAM GPU.
Mashic@reddit
Happy to see it, hopefully they publish qwen3.6 27B and 9B too.
inaem@reddit
gptq-int4 when
(They usually release it in a day or two.)
Impressive-Sir9633@reddit
Wow! Multimodal, open source Qwen 3.6 and better performance than Gemma4 on benchmarks. Here goes all my recreational time for the next week
bakawolf123@reddit
Nice, like I thought they wanted to trample gemma4.
Competition is good
jacek2023@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
volleyneo@reddit
No way!
segmond@reddit
From a quick eyeball, the benchmark chart is trash. If it's better than the 3.5 variants, that's good. I suppose these benchmarks are for the non-technical crowd.
appakaradi@reddit
I am worried that they are comparing to 3.5 27B Dense. Does that mean we are not getting 3.6 27B dense?
rpkarma@reddit
Probably. To me it looks like all the big open-weights labs are keeping back anything that's too good.
ArugulaAnnual1765@reddit
Qwen are they releasing 27B?
Chance-Studio-8242@reddit
on ollama?
DeedleDumbDee@reddit
I’ve been using 3.5 35B Q6 since release and it has performed extremely well. GGUF soon hopefully.
Ifihadanameofme@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
LET'S GO!
xXprayerwarrior69Xx@reddit
Bro is very sparse
Sarayel1@reddit
uhu. we wanted dense not MoE
dampflokfreund@reddit
Speak for yourself. This is the perfect size for low to mid end PCs.
iamapizza@reddit
Agree, I am hoping this is just one of many steps towards commoditization. People should be able to run models on any hardware.
Thrumpwart@reddit
An MoE model with improved agentic and coding use is a godsend. 27B was smarter but 35B much faster.
Recoil42@reddit
Go get your money back.
appakaradi@reddit
Yes. It looks like we are not getting one..
VoiceApprehensive893@reddit
gguf when
NaN_Loss@reddit
Holy
MaCl0wSt@reddit
Sweet, just earlier I was playing around with 3.5 35B and it's damn good for something I can run on my gaming rig at decent speeds.
moahmo88@reddit
WTF!