Just to drive the speed question home, I have 3090s at home and a Pro 6000 Blackwell Max Q at work. On identical inference workloads that completely fit in the VRAM of both setups the Blackwell is like 10-15% faster.
It doesn’t matter how many 3090s you have, as long as the workload fits in VRAM. For example, I ran Gemma4:26b on a single 3090 and also forced it to split across both 3090s. Same prompt, and there was a 0.0003% difference in speed.
I mentioned the difference with the Blackwell card because a lot of folks expect a crazy improvement in speeds; unfortunately the performance doesn’t scale like that.
I am putting it to the test on my 2x4090. I can fit the Q6_XL with Q8 cache and the full context window. It's coding at 20-25 tk/sec with the context window 50% full. It takes a while to ingest large context, but otherwise it's chugging along quite nicely.
I honestly don’t understand why Gemma4 scores this low. I’ve been using the latest 31B and its coding results have been cleaner than 3.6 35B’s in almost every case, and it was able to do tool calling more accurately for the Xcode MCP, while Qwen just gave up or got stuck in a loop. Gemma4, in my experience, needs more detail in the prompt, but the results are better. Qwen often adds things I didn’t ask for and has less chance of one-shotting the problem.
Gemma is a great model for its size, but Qwen 3.6 seems to be incredible. I would go Gemma at this size, but the 122b Qwen 3.5 has been my favourite local-capable model so far (Strix Halo, 128gb). A 3.6 in the ~100 billion parameter class is going to be amazing if it follows these smaller models' capability.
It doesn't even make sense for them on a technical level - they are designed to service literally as many requests as possible from all kinds of domains. Why in the world would you want any part of their knowledge base to be unloaded at any time?
Kimi, GLM, MiniMax, Xiaomi, Gemini (stated in docs), and GPT (leaked) are all MoEs. The only unknown is Claude - there's no public information on whether it's dense or MoE, but it's very normal to assume it's MoE just like all the others.
Claude not being MoE would explain their huge compute issues though. :) Although they did speed up Opus recently, so maybe they moved to MoE recently? Or just made it more sparse.
To be fair, the model it is beating is effectively a 17B expert, but with much higher memory and a bit of help as needed. You don't get to keep all of that intelligence in MoE models, unfortunately.
The funny thing is that it's not even a new generation, just a minor update within the same generation. I've seen Anthropic and OpenAI slap a big round version number on models with a much smaller performance gap.
It's got looping and other obvious issues, I have free access to it but mostly use Sonnet 4.6 or GPT 5.4.
Sonnet is really reliable and stable
Something is very strange about Opus 4.6 & 4.7, they act like a large model that is excessively quantized. Opus 4.5 was not like this. I wonder if this is a side effect of them using TPUs. Gemini acts the same way.
Fr, this happened after Jan. I could one-shot a whole project with 10 words on Opus 4.5, and now 4.6 acts so dumb - it basically feels like another Gemini 3 Pro. I mean, it's still better, but it really disappoints remembering what Opus used to do 🥀
Yeah, I did notice that adaptive thinking button - what does it really do??? It's not the same as the previous one, which was enhanced thinking or something? I thought it was just cosmetic and they renamed it "adaptive thinking".
I use gemini 3.1 pro in antigravity but haven't really used it in a while now. Maybe I should give it another shot. I was comparing opus with 3 pro and not 3.1 pro btw :)
This might be a side effect of adaptive thinking, the responses come almost immediately and the chat is muddled with looping content that should have reasonably been expected to be in the thinking block
I feel like the best way to describe it is that the intern is just as smart as the wizard but not as wise. Having fewer parameters means it's going to know less, but it handles the common tasks we ask of it really well.
Great! That's important to know for a couple of reasons: they're official, so they're based on something, but since they come from Qwen, they're also designed to make 3.6 look good.
5060ti user here. I run Qwen 3.5 27b HauHau aggressive uncensored in IQ4_XS with medium context, which is absolutely fine quality. I expect to run 3.6 the same.
The HauHauCS version is 15.1GB in size. Qwen context does not eat much memory.
Here is some proof in a picture so you don't need to listen to all these people talking out of their asses who say IQ3 works at best. Sorry for the bad quality, I'm on my phone atm. But you can see I can load all layers plus 30k context at Q8 into the 5060ti with IQ4_XS. If you are willing to offload some layers to RAM and sacrifice t/s, then context size goes brrrrrr.
Not just for your 5060ti, but for anyone with only 16gb of VRAM: you will need it heavily quantized, or any spillover to system RAM will dramatically slow it down.
There is also the problem of GPU bandwidth limitations with dense models. You are not going to get anywhere close to the same t/s with these. You will probably want some speculative decoding going on as well.
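A quick back-of-envelope check makes the 16GB point concrete. This is a minimal sketch only: the bits-per-weight figures are approximate averages for each quant type, and real GGUF files also carry embeddings and metadata, so treat the result as a rough lower bound.

```python
# Approximate GGUF file size from parameter count and bits-per-weight,
# then compare against available VRAM. bpw values are rough averages.
APPROX_BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.85, "IQ4_XS": 4.25, "Q3_K_S": 3.5}

def approx_size_gb(params_b: float, quant: str) -> float:
    """Approximate model file size in GB (params given in billions)."""
    return params_b * APPROX_BPW[quant] / 8

print(f"27B @ IQ4_XS ~ {approx_size_gb(27, 'IQ4_XS'):.1f} GB")
print(f"27B @ Q8_0   ~ {approx_size_gb(27, 'Q8_0'):.1f} GB")
```

By this estimate a 27B model at IQ4_XS lands around 14 GB, roughly consistent with the ~15 GB file size mentioned above, which is why 16 GB cards need an aggressive quant to leave headroom for the KV cache.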
I have no idea. I don't typically use the dense models due to the 5060ti's bandwidth, despite having 4 of them. For the Qwen 3.5 models, if I wanted more "intelligence" I would instead use the 122b model, as it ran faster on my system (64gb system RAM + 64gb VRAM) than the 27b dense model.
You probably want to stick with the 35B-A3B model (MoE) - the 27B might require a bit too much quantization. (There is only an FP8 available so far anyway, so you'll need to wait regardless; that one is going to be about 27 gigs.)
Try a few, but if you quantize kv cache and trim context, you should be able to run Q3 variants pretty well. I'm going to try Q4 to see if it's tolerable because I have the same GPU, but Q3_K_S is my fallback because I know it works for me for Qwen3.5 27b and Gemma 4 31b.
I just ran LM Studio with Qwen3.6-26b (Q4_K_M), and with a 16000 context size it runs at about 3.08 tokens/sec at the beginning of a new chat. I have 16 GB VRAM and 32 GB of RAM.
Have not yet tested coding with Pi; I just downloaded it and ran first tests.
Oh snaps, I am setting up a dual agent/sub-agent approach on 2x RTX 3090s with Qwen3.6-35B-A3B and Gemma-4-26B-A4B-it. I wonder how good this 27B is at chain of thought, to replace Gemma-4-26B-A4B-it.
I'm thinking Qwen3.6-35B-A3B as primary builder and architect (agent) while Gemma-4-31B-it or Qwen3.6-27B as debater (sub-agent)! If the CoT on the Qwen3.6-27B is near as good as the Gemma 4, then I may be able to squeeze in more layers on a RTX 3090 or use a higher quant. This is quite exciting!
Nice, you're running Q4 quants of each? What context? I have a dual-GPU setup too (5090+4090) and I'm totally maxing out one model at Q8_XL and max context. It's very tempting to have a dense and a MoE model running simultaneously at Q4. I know it could be vibes/placebo, but I'm concerned about the quality drop if I go to Q4. However, I like your idea of an adversarial sub-agent (this could be what I need to drop quants). May I know how exactly you are running this sub-agent scene? Pi? Or which harness? When is the sub-agent invoked?
I'm testing the duo agent all this week so will let you know! I'm using OpenCode, it's primarily via agent config prompt, so will have to see how that fares.
Builder: Qwen3.6-35B-A3B Q4_K_M, 262k context, q8_0 KV cache. Invokes the sub-agent with (paraphrasing): "If it's easy, go with it. If it's complex, ping the Debater up to 3 times. Be mindful of security, the coding practices in AGENTS.md, and the architecture and DTOs defined in master_plan.md. Ask clarifying questions."
Debater: Gemma-4-31B-it Q4_K_M, 32-64k context, q4_0 KV cache. Max steps: 6. Init prompt (paraphrasing): "Red team, be strict with coding, security, memory leaks, edge cases, laws, only read and respond, no praising, follow format for response (Issue, Priority, Recommendation, Status)."
phone-a-friend seems to be a really good way of keeping agent sessions on track and getting past difficult problems, if we can have fast MoE models crunching through stuff and asking for help from 122b's or whatever for tricky stuff that's probably optimal, with Opus as a last resort.
Wow! you are the man! 🤩 Share performance after running bro. I want one for myself, but unsure if the speed will be usable and whats the max I can accommodate on a macbook. Otherwise will wait for ultra.
Ya I’m def looking to get the studio ultra when it drops based on what I can see this thing do. It’s arriving tomorrow so I’ll post up a few thoughts on performance. I’ve got a narrow focus and have built an interface that’ll wire up to Qwen so I’m not using it for open ended things, which will hopefully keep it slim and fast. We’ll see. If it’s too slow then I’ll just enjoy my $5k coffee table ornament 🤣
I'll say that for the GGUF, the KV cache seems to behave more like Gemma4 31b than Qwen3.5 27b. I was able to squeeze a lot in with Qwen3.5 27b, but with 3.6 27b my cache is doubling the model's footprint.
This is a great model, but right now my LM Studio Q8 has fewer code issues than my Unsloth Q8_K_XL. That's unusual. I wonder if anyone else is experiencing this. I tried a couple of different quants, and Q6_K_XL kept looping around 30-40k tokens without presence penalty.
Initial thoughts:
The Unsloth ones feel jagged: much more ambitious, but also more glitchy. When instructed to iterate to fix things, it makes changes, but with less meaningful differences than I expected. I'm still playing with it between other tasks, but I'm curious about anyone else's experiences.
LM Studio is the lazy way. (Turn off "keep model in memory" and use mmap to load models - that setting keeps the model in system memory, which just doubles the memory required to load it on unified systems.)
Yeah, thanks - then how do I get the model? I have tried to pull it and get "An attempt was made to access a socket in a way forbidden by its access permissions".
The capability density doubles every 3 to 3.5 months. So an 800B now will be matched by a 400B in 3-ish months, that by a 200B at 6 or 7 months, and finally a 100B at 10 to 12 months. We're talking MoE; that's roughly equal to a 30-35B dense. So yes.
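Taking that doubling claim at face value, the arithmetic can be sketched in a few lines (a sketch only - the 3.25-month doubling period is the commenter's estimate, not an established constant):

```python
# "Capability density doubles every ~3.25 months": the parameter count
# needed to match a fixed level of capability halves each period.
def equivalent_size_b(size_b: float, months: float, doubling_months: float = 3.25) -> float:
    """Model size (billions) that matches today's `size_b` after `months` months."""
    return size_b / 2 ** (months / doubling_months)

for months in (0, 3.25, 6.5, 9.75):
    print(f"after {months:5.2f} months: ~{equivalent_size_b(800, months):.0f}B matches today's 800B")
```

Which reproduces the 800B -> 400B -> 200B -> 100B progression in the comment.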
I just set up an M5 Pro 64gb and it works pretty well: a 128k context window in LM Studio, running on average at 55-70 tps depending on the type of prompt. It uses about 32gb of RAM for this.
Do you use it for coding - similar to how a workflow looks for someone using Claude Code in the CLI, you know, parallel agents and ~100-200k tokens?
How does it compare to that? Is it getting close, or is context still a big issue?
Yes, I can use it at around 100k context, although I don't use agents in parallel or vibe code; most of the time I just use it for bug fixing (I have 48gb of memory, so context isn't really an issue).
Qwen is definitely not known for benchmaxxing. Even back in the Model 2 series, it was clear that Qwen was doing something different with their training. For example, it’s well known that their pre-training data sets were significantly more math-heavy than others.
In my private tests, Qwen-3.6-35B actually indeed delivered better results than Opus-4.7
This means that it’s either actually Anthropic who’s doing benchmaxxing, or Qwen has managed to benchmaxx real-life tasks
I did not have the same results as you. Granted, I was running a quant, but they were worlds apart for me. I was having Opus 4.7 review the code Qwen 3.6 was generating, and there were a lot of kickbacks.
And I think that points out how flawed the argument is. If models that fit in 24 GB of VRAM were actually beating Claude in real world use, then the hyperbolic claims would be about new models beating 'that' model rather than Claude - we'd all be looking at Nemotron, Phi, or whatever as the top tier model family to beat instead.
It's ridiculous that this is even a controversial opinion. I'm incredibly grateful for the Qwen family of models. I think 3.5 27b was one of the best releases we've had in ages. But Qwen absolutely has a reputation for benchmaxxing. At least outside this sub. Doesn't make the models bad. Just means qwen has specific strategies for training models that align with the major benchmarks.
Every lab is benchmaxxing one way or another; it's their target goal, after all - the first thing people look at when new models are released. Of course they're going to try to be the best.
Screw the haters, they don't get it. Some of us DO make our own applications... and they tend to work because we put time into crafting them for the model. I've been working on this for ~10 months and Qwen3.5 made it shine. Qwen3.6 is actually paying attention to the system prompt and figured out parallel tool execution. The MoEs tend to send single tool calls for some reason; the 27B likes to group them - that is one difference I am seeing already. Anyhow, I am also loving the recent trend of people realizing that local models work best with curated tool sets. Heck, all they really need is good prompts and guarded bash access nowadays, in all honesty. OK, I am rambling, but I laughed at the haters and came back to this comment after work to brag about my own 'harness', because yes, some of us do that. I mean, this IS LocalLLaMA...
I need to discuss the ramifications of sharing it with my employer first, because I built a lot of it on the clock and they know about it, so it might be an issue to just open source it now. It gets used in an offline environment for work purposes. It would need a bit of cleanup and some other features before releasing, too.
Also, the available tools on the market are finally catching up to the capabilities, so this is less unique in its abilities now (though I am convinced I was the first to have native docx export, including LaTeX to OMML, tables, lists, etc., months ago lol). So I have thought about putting it out there and might be able to someday.
Explain "harness" for grandpa. I love AI. I'm just getting into running local models on my Apple M5 Pro and Framework 395+ AI AMD APU w/128GB RAM. Using LM Studio & Ollama. Thanks!
Mention Ollama, and people will get riled up on this sub.
I think Ollama is an okay starting point for a lot of people since it is rather plug and play.
But if you want to get a bit more serious with local models, you will want to look into llama.cpp (https://github.com/ggml-org/llama.cpp) (on which ollama is heavily based without attribution), and llama-swap (https://github.com/mostlygeek/llama-swap) for managing multiple models, switching them out, etc.
llama.cpp is much more performant than Ollama, allows for greater customization, and is faster with updates.
No need for llama-swap, now llama.cpp server has model loading/swapping/unloading built in: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#model-presets
A harness is basically the new buzzword - a pretty large umbrella term for a way to give LLMs tools. Think of it like a mech suit for a person: a harness is the mech suit for an LLM. Also, I recommend you drop Ollama. Just imo, tho, hahahah.
Would OpenClaw and Hermes be considered harnesses? What about Qwen Coder CLI? Oh, and what's wrong with Ollama? I kinda like it better than LM Studio, I guess because I don't tweak any settings. Just `ollama serve`, `ollama run qwen` - it seems so simple and intuitive?
OpenClaw, yes. Hermes, unsure. I don't know too much about Hermes, but I believe it's a model with Claw-like features rather than a program with tools like OpenClaw.
The simplicity makes you lose some capabilities and speed. By going with llama-server you can fine-tune (or copy settings from others) and get better results. Sadly, it's a bit more work, but after some time the thing just works. Also, llama.cpp updates frequently and drops good optimizations regularly, faster than Ollama (which reuses llama.cpp anyway).
Think of the LLM like a horse. It's beautiful and it can run and jump etc... but it's kind of hard to get any work out of it as-is. Put a harness on that horse, now you can get it to pull a plow, carry people, etc.
llama-server, and I use the settings on the Qwen 3.6 Hugging Face page. If you go there, it will tell you the temperature and sampling parameters it's optimal with. The general-assistant parameters are what I use, and I switch on the fly to the coding parameters. I used Gemini to create my harness, and I'm writing my own MCP servers now. I call it KokoCode but haven't released it yet; I keep adding features and functions.
I think the relative difference might not be as big now that the MoE is fixed. But still, equivalent dense models are better in ways that aren't always captured in benchmarks (world knowledge) but are still evident in daily work. ~60 on terminal-bench here is incredible already, though.
Hmm. In my experiments (with unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL), tool calling works like 50/50.
For 3.6 35B A3B it worked very well. Maybe I'm doing something wrong...
According to Qwen's official WeChat account, it appears this is the final open-source model in the Qwen 3.6 series. Of course, we can also hope that this is simply a typo.
Just thinking about releases around 120B params: we have GPT-OSS 120B from Aug 2025, and Qwen 3.5 122B and Nemotron 3 Super 120B from this year.
My point was that it's just not a very common size (no equivalent Gemma or Mistral models). Also, there is a trend towards releasing small models for domestic use and retaining the larger models for cloud-only inference. I can absolutely see a strategy where the ~120B models are considered good enough to eat into cloud inference profitability.
But yes, this is a guess, and I may be hallucinating things in the tea leaves that don't exist!
Mistral Small 4 is the same size. The only one missing is Gemma 4, which is rumored to have one of the same size too (Gemma 4 124B), just not released yet.
They included it in the poll so I'm assuming they'll release all 4 - 9b, 35b, 27b, 122b. They didn't include 397b in their twitter poll though, so they might not open source the big one
Yeah, but it’s so slow. I have 64gb of DDR4 RAM so I can make it work, but 10 tk/s is so slow compared to the 40 tk/s you could get with 3.5 9B. I can sacrifice intelligence if it’s just agentic coding; a smaller model can correct its mistakes faster than a MoE can finish outputting its first message.
3060 12GB, 32GB DDR4. Getting around ~28 t/s with Q4_K_XL at 32768 context length (can go even higher, tested at around 80k with 131072 allocated, got ~23 t/s). All llama.cpp settings are basically default.
It's not really knowledge imo, it's more about nuance.
It understands the nuance of your prompt better, and sees things that aren't explicitly said, even in code - especially when doing generic SWE coding, and when doing much harder, more complex, or low-level coding tasks.
For me the biggest difference I see is nuance. (Obviously knowledge is better too, but it's not that big of a factor imo.)
Yep my opinion based on my current understanding is that these new smaller models are getting really good at agentic coding workflows, tool calling, etc...
Definitely not as good for world knowledge and writing, like you say. BUT when they're fast and cheap, search and iteration is easy
Having used both 3.5 397B at Q4 and 3.6 35B at Q8 side by side for agentic coding: within this scope, I can say they're practically matched. But keep in mind this is a pretty narrow scope, and one that is very much a beaten path.
I'm sure if you go to more obscure programming languages, or tasks unrelated to programming, the 397B will win.
Also depends on what you are programming. If you do some low complexity UI+backend+database coding, you don't really benefit from the more clever models. If you do some complex refactoring, algorithm design, heavy math and solve difficult problems, the more powerful models are able to figure things out better.
Haven't tried really complex stuff with 3.6, but I can say I did try fairly complex tasks on large projects, and 3.6 35B held up well. 3.5 couldn't handle much simpler tasks.
I do have some low-level C++ tasks I want to test 35B and 27B with. We'll see how it holds up.
Yeah, realistically there will be a bigger difference in real world use compared to benchmarks. However, I do think the gap has closed in a meaningful way, and what they have been able to achieve with the 30-billion class of models is truly impressive. Anybody with a strong gaming computer can run a 30-billion-class model; the 397B takes $10,000 worth of hardware.
You asked why the smaller one is just as good as the big one. It's because the small one is newer and updated and the bigger one hasn't been updated yet, however long ago the last release was
If we're seeing performance this great out of a dense medium sized model, why doesn't someone do a dense large model again? It seems like the last great one was llama 3.3 70b. Is it that expensive to train big dense models but the 400B sparse models are cheap? It seems like if we had a qwen 4 70b it could sweep the board.
I think it's because of the lower inference speeds. Given that agentic usage is trending, dense models fall behind MoEs speed-wise. I can get 200 t/s with the FP8 35B-A3B, but with the 27B I get around 60-ish t/s generation speeds.
Do my eyes deceive me? Does it beat full-size Qwen 3.5? Wtf. It trades punches with Opus 4.5 (I know, not the newest Opus), but fuk, it's 27B - you can run it locally. Opus 4.5 is probably hundreds of billions of parameters.
I am always a bit sad when I see the hype surrounding new models and the benchmarks not transferring to my actual use cases at all. Of course, we need some kind of eval given the rapid-fire release of models, but benchmarks have become worthless for my own use cases.
Hope someone finds a solution for this eventually, because eval time is limited (I only have so many hours every day 😅)
Recently I created explain.toml, which has been a tremendous help before I let it execute:
description = "Explain Following Prompt ARGS: "
prompt = """
## Expected Format
The command follows this format: `/explain `
## Behavior
Check if a file named "Explained-Prompts.md" exists; if it does not exist, create the file.
Make a copy of the existing "Explained-Prompts.md" in case a mistake while appending replaces the file content, so it can be restored easily.
Avoid executing the prompt.
Analyze the prompt.
Explain in detail what is understood from the prompt.
Explain the goals from what is understood from the prompt.
Explain the non-goals from what is understood from the prompt.
Explain the plan of action from the understood prompt.
Explicitly and in detail explain how the prompt could be improved; list what is ambiguous and implicit, then how it could be made unambiguous and explicit.
Give a detailed improved prompt that is explicit and without any ambiguity.
Update the "Explained-Prompts.md" file by appending the following.
You don't need to try a tiny model to know it's nowhere close to one of the best behemoth models to date. If I claim a newly released Toyota SUV is not as fast as last year's Ferrari, you won't need proof, will you?
The car analogy doesn’t really hold here.
Cars scale pretty linearly with horsepower. More power, more performance. LLMs don’t work like that.
A smaller model can absolutely match or even beat larger ones on specific tasks.
That comes down to training quality, data, and optimization and not just raw parameter count.
A 27B model won’t beat frontier models overall. But saying it’s “nowhere close” is just not right. The gap has narrowed a lot and for many tasks, smaller models are already somewhat competitive.
SOTA for regular coding tasks / modest apps or websites is increasingly not a distant goalpost. Local models can probably already compete. SOTA that will remain SOTA is probably one-shotting massive applications and very niche specialized knowledge domains.
Which this isn't claiming... This one claims to be comparable to, not beat, a model from 2 releases and 6 months ago. I get the skepticism, but the guy isn't saying "it will probably fall short" - he straight up states it as fact.
See, this is why I love languages other than English. If he said it in Spanish, for example, in the subjunctive mood, the speculative aspect would be embedded in the writing automatically. Either way, que tengas buen día ("have a good day")! (I amuse myself)
Seriously. I'm a little shocked by how many posts I scrolled through that are seriously stating that this is comparable to claude. Not just beating it in benchmarks, but extrapolating that to mean it will deliver real world performance at that level.
It's just bizarre to me that anyone getting serious use out of local models can still take the big benchmarks at face value. I think they can typically be suggestive of a model's strengths and weaknesses. But that's about it.
397B smokes it in real codebases. Tried it this morning. Anyone who thinks a 27b dense can match the context understanding of a model 10x its size is delusional.
The general rule of thumb is that a MoE like 35B A3B is roughly equal to a dense model of sqrt(a*b) parameters: sqrt(35B*3B)=10.25B.
This rule doesn't seem to be holding up perfectly anymore as recent MoEs have done better than the rule would suggest, but it's still a useful ballpark estimator and explains why the 27B dense model is significantly better than the 35B A3B model. The dense model is, however, much slower. The MoE only uses 3B parameters per token, which is a massive reduction in compute.
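The rule of thumb above is just a geometric mean, and can be sketched in a couple of lines:

```python
import math

# Rule of thumb from the comment above: a MoE with T total and A active
# parameters behaves roughly like a dense model of sqrt(T * A) parameters.
def dense_equivalent_b(total_b: float, active_b: float) -> float:
    return math.sqrt(total_b * active_b)

print(f"35B-A3B ~ dense {dense_equivalent_b(35, 3):.2f}B")  # sqrt(105) ≈ 10.25
```

As noted, recent MoEs tend to land above this estimate, so treat it as a ballpark floor rather than a prediction.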
35B is a Mixture of Experts; you only activate around 3B parameters per token. This is a 27B dense model: you hit every weight every token, and unsurprisingly it will almost always outperform the MoE despite the MoE's larger total size. The MoE model will be significantly faster, however.
Everyone is saying "faster", so I'll add: cheaper. My GPU sweats much, much less in such a setup, making the USD per token significantly better for MoE models. The same can be observed with API providers.
So, cheaper and much faster, and only slightly worse.
Yes, it is better, but it is also a different architecture. "35b" usually includes the "A3B" when people talk about it: it has 35b trained parameters, but to answer a prompt it routes the data through a number of "experts" (Mixture of Experts), meaning it only uses 3 billion parameters to answer the question. This means the MoE model is (1) larger, because you still need the 35b parameters stored somewhere, (2) faster, because prompts only pass through 3B parameters, and (3) lower in performance/intelligence, because prompts only pass through 3B parameters.
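Rough numbers for the tradeoff described above (a sketch only; real memory use and speed also depend on quantization, KV cache, and memory bandwidth):

```python
# The MoE costs more memory (all 35B weights must stay resident) but far
# less compute per token (only ~3B active), versus the 27B dense model.
moe_total_b, moe_active_b = 35, 3
dense_b = 27

print(f"storage ratio (MoE/dense): {moe_total_b / dense_b:.2f}x")   # ~1.30x more to store
print(f"compute ratio (dense/MoE): {dense_b / moe_active_b:.0f}x")  # ~9x more FLOPs per token
```

That 9x per-token compute gap is why the dense model is so much slower at generation even though it is smaller on disk.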
Truth be told, I can't get Qwen3.6-35B-A3B to outperform Qwen3-Coder-Next. Running both at bf16 on an M3U256 in Claude Code (although I think I'm about to swap that out for a customized Pi rather than deal with the closed-source bullshit from Anthropic anymore).
Will try Qwen3.6-27B at 8bit and see how that goes. I'm not really concerned with speed; just intelligence/accuracy. Would love to have a new go-to coding model.
what about forge code as a harness? it seems to beat claude code with opus too.
I really like Qwen3-Coder-Next as it is running fast and provides good results if you steer it well. I'd like to see it in comparison to this new Qwen3.6 27B model and the MoE model 35B-A3B, but I can't find some good sources.
You can even pick one of their free models for the first few minutes and have that model set up the opencode config file for you to run your local model.
Just spent 3 weeks benchmarking and fine-tuning Qwen3.5-27B and iterating with variants. Now this drops. 🤦‍♂️ But it makes me happy, of course.
That said, I think there’s some suspect stuff with the benchmarks shown here. Gemma4-31B is absolutely better than the Qwen3.5-27B in my testing in multiple areas.
This is a disruptive level of intelligence gain on every iteration, ladies and gents. This is how the AI bubble will pop: big AI companies will go to shit, and local 4090s are going to get even more expensive.
I don't believe these kinds of advancements pop the bubble as long as we are so constrained by power and compute limits in datacenters, especially as capacity demands keep going up.
Extremely verbose, keeps on thinking and thinking without making much progress. Even if the results are good (yet to be seen), it's extremely slow due to being verbose and being dense.
Meh.
Please, if you can, leave a comment on the Hugging Face page. I'm just a regular guy who hopes Qwen never stops helping us out. So comment to your heart's content, please, all.
Dense like Patrick Star? No, not quite. But we flew past worms, and we're getting close to the number of neurons in a fruit fly. About 1 or 2 orders of magnitude to go.
Nah, it's just that locomotion is a surprisingly cognitively hard task. Each fruit fly has a (current) supercomputer's worth of compute in it just to bumblefuck around.
If I have 64GB of memory (MacOS) and want a large context window (128k-256k) should I be using 27B Q4_K_M, 27B Q6_K, the 35B A3B Q4_K_M model, or a different configuration?
I know running bigger models often gives better results, but sometimes the differences are negligible, and smaller models have much more usable speeds.
Obviously, after thousands of model launches, we know that real world use is different than benchmarks... but holy shit, this is unbelievable! Just in time for all the companies clamping down on usage!
I’m thinking there are quite a few folks who only ever look at the benchmarks and never run the models. If you have used the recent Qwen models you know that actual use matches the benchmarks pretty well but you don’t get as much “out of the box” freedom as they require some tweaking. I think many just use the models a couple of times expecting them to work like a cloud model with harnesses built for them and then are surprised when they do not function that way. My two cents as to where the “collective we” are at.
While they really are impressive, they have had a tendency to overclaim; the best example is QwQ-32B, which at the time was touted as competing with DS V3, which turned out to be false.
OK. I hooked it up to Claude Code and let it rip on a text processing problem I had. I've run this same problem on Qwen 3.5 122B, Qwen 3.6 35B-A3B, and now finally Qwen 3.6 27B. It took over 40 minutes to process nine relatively smallish text files, 10-30KB each. Qwen 3.5 122B (MoE) took 30 minutes. I tried Qwen 3.5 397B, but... I didn't have the 5-6 hours it would have taken to crunch this project.
Qwen 3.6 27B was the only model to give me a separate file documenting its discrepancy findings for each of the nine source files, and it did exceptionally well to boot. Qwen 3.6 35B-A3B is awesome for super fast code, but Qwen 3.6 27B seemed to have a deeper intellectual grasp of the actual problem. This is honestly a lot of fun.
Am I a bitch for feeling a bit exhausted by how fast this stuff's moving? I just barely finish benchmarking and tuning my setup, and then there's a new thing that makes the previous thing look like shit.
And that's on my PC. I have a Spark cluster that takes even longer to get going. I can run 122b models on my PC, so the Sparks... I either need to buy two more so they can run truly huge shit, or possibly just sell them, because my $8K GPU is often equally useful but much faster than my $8K worth of Sparks.
Thanks Qwen! This is the best open model I have seen so far in this "weight-class". The only model so far which actually works and does tool calling perfectly fine!
I just started feeding it my benchmarks. Its grasp of literary stylistic commentary is insane. It picks up on everything Gemma 4 does... and then a whole lot more.
Okay, am I the only one who no longer believes these benchmarks!?
Or is it just that local models don't work as well for me (maybe I don't know how to use them properly), or that these benchmarks are heavily exaggerated?
The Qwen 3.6 MoE model is also theoretically very close to Opus. However, in practice, the responses I get from Qwen are significantly worse than those from Opus. Opus manages to understand me and get the job done with a single prompt, whereas with Qwen, I often have to further clarify what I want to happen, or it simply fails to provide an accurate answer, especially if it's something it hasn't been well-trained on.
I always suggest people make their own benchmarks, based on their own real world needs, and test models against that. I'm willing to bet that anyone who does so will get disillusioned about the worth of the big well known benchmarks pretty quickly. Real world problems are messy with tons of uncontrolled variables that won't have one to one matches in a LLM's training data. Meaning they have more need of, for lack of a better term, intelligence.
Yes, size matters here. Opus has more world knowledge and needs much less guidance. With detailed and precise spec qwen might get close, but such spec is 80% of the job. So benchmarks will be deceiving - they are based on a known set of topics which can be emphasized in the training data. But try to code in a specialised domain like statistics or bioinformatics with a short prompt - qwen will fail and opus will nail.
Benchmarks can be indicative, that's for sure, but there comes a point where you're being gaslit so hard and everyone is falling for it (not you, just generally speaking).
Guys, it's a 27b dense model that's scoring the same or better on a repeatable benchmark than a model more than 10x its size from the SAME generation? C'mon guys, use your heads: put the 27b against the 397b in serious production tasks, in dynamic environments that require contextual reasoning. The model with 10x the parameters will be innately more intelligent in real-world applications, especially within the same generation.
That's true. Single-GPU models are fun to play with until real work needs to be done. Or at least it ends up with Opus doing the research and development and writing a detailed how-to for the stupid local model on how to run those things.
It's a combination of multiple things. These benches are run at fp16 with an fp16 KV cache, which no one runs on a 3090. They're benchmaxxed to hell. And they're using harnesses specifically designed for successfully completing these bench runs, which no one has access to.
Anyone tested the 'preserve thinking' concept or know how it works technically? I'm trying to understand whether it's KV caching or actually holding the intermediate thinking in context between requests.
Before whenever I ran 3.5 in a coding agent, it would do a task and when I sent a follow up message I'd see a lot of the prompt being re-processed due to the deleted thinking
Now the experience is much better with llama.cpp since the caching makes follow up responses start quickly.
I haven' tested with 27b yet, but for 35b it makes a lot of difference when the model is repeating things in the context (such as when editing files and outputting the fully modified version)
If you don't activate it, every time you send a request the thinking from the LLM's last response is discarded before answering your new query. Since the last message changed, the KV checkpoint is invalidated and llama.cpp will re-parse all the messages so far (all stripped of their thinking), so you get a delay before it starts processing the actual new tokens of your new request.
With preserve, the thinking is not discarded, so it stays in context, checkpoints work, and there's no delay, but the thinking will eat up some context. (On the other hand, by discarding it the model sometimes has to re-think almost the same thing each time, so keeping it isn't necessarily a bad trade.)
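The effect described above can be sketched with a toy prefix-reuse check (message strings standing in for token sequences; this is just an illustration of prefix caching, not llama.cpp's actual cache code):

```python
# A prefix cache can only reuse KV entries up to the first position that
# differs between the cached context and the new request.

def reusable_prefix(cached, new):
    """Count how many leading items match between two sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["system", "user msg 1", "<think>...</think>", "answer 1"]

# preserve_thinking on: the follow-up request keeps the thinking block,
# so the entire cached history is still a prefix of the new request.
with_thinking = cached + ["user msg 2"]

# preserve_thinking off: the thinking block is stripped, so everything
# after "user msg 1" shifts and the cache is invalidated from there on.
without_thinking = ["system", "user msg 1", "answer 1", "user msg 2"]

print(reusable_prefix(cached, with_thinking))     # full history reused
print(reusable_prefix(cached, without_thinking))  # reuse stops early
```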
Before that, as a user of pi, I can tell you the symptom I saw: when I prompted it, it would process, start writing, and all the intermediate tool calls would be lightning fast (no matter if much bigger than any small message I sent it) until the task was done. But as soon as I sent a new message there would be a long delay, and I could see in the llama.cpp log that it was re-parsing the entire context.
So I highly recommend you use that configuration, especially if your pp speed is not great (mine is 100 t/s with 3.6 q8). For agentic use you will see an enormous difference in total wall-clock time.
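For reference, the flags involved look roughly like this in llama.cpp (a sketch with placeholder model path and port; the preserve_thinking kwarg is passed through the chat template, and flag availability depends on your build):

```shell
# Sketch of a llama-server launch that keeps reasoning blocks in context,
# so the prefix KV cache stays valid across follow-up requests.
# Model filename and port are placeholders.
llama-server \
  -m Qwen3.6-27B-Q6_K.gguf \
  --port 8081 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --cache-reuse 1024
```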
This one is really close to an Opus 4.5-level model. On agentic coding specifically, the 27B beats Qwen 397B across every major benchmark they shared: SWE-Bench Verified goes from 76.2 to 77.2, Terminal-Bench 2.0 jumps from 52.5 to 59.3, and SkillsBench nearly doubles from 30.0 to 48.2.
Here's how Qwen 3.6 27B compared to Qwen 3.6 35B-A3B & which one to choose.
Poor thing. And mine has had 11 new SOTA-in-some-way tts voices in the same time haha
Oh god, that reminds me that I just heard of KokoClone this morning.. Now to have codex 5.4 set up a qwen3.6-27b agent that will go research and install it... Unless Spud drops before I finish browsing.
I am currently using faster-qwen3-tts because some absolute legend made it faster and support streaming. But, it does absolutely spike my GPU utilization.
Pocket-TTS has been great. No real complaints there. It just works.
LuxTTS came out like 3 days later and was basically the same thing with more language support if memory serves..
Omnivoice is the new hotness, but I've only tried it for like 10 minutes and thought, "Oh, this is nice" because I don't have any problems that I need to fix when it comes to TTS. With those other options I'm already at faster than real-time generation, streaming support, voice cloning, etcetera.
And KokoClone is the latest thing to catch my attention because Kokoro is UNBELIEVABLY FAST, and clean and pleasant. It isn't emotive, and you can't clone voices. So if KokoClone can do voice cloning well, it instantly becomes the best (non-emotive) tts available just by virtue of the fact that you could run it at like 50x real time on a crap CPU.
If you have any questions about specific scenarios, like maybe you want something expressive but don't need voice cloning, feel free to ask. I have tried at least a dozen tts models recently, and each shines in its own way.
like maybe you want something expressive but don't need voice cloning
that would be a good start, yeah
Besides those, I know that not many models offer voice cloning, but, of those who do, which one do you really think is the best currently (before going the paid rute, aka 11Labs)?
Actually, you'd be blown away by how many offer good quality voice cloning these days. I've got to say that omnivoice was very clean and a good quality clone, as well as being quite fast. What are you aiming to use the tts for? I have different standards for a voice assistant versus the program I have narrate books for me.
Is anyone using it on a MacBook? Can someone tell me how much RAM you would need to run this with 100k context length at 4-bit precision without any offloading?
I had a binned M4 Pro/48GB MBP and I ran 3.5 27b @ 8-bit MLX with 100k+ context just fine. Not fast, but fine. My current M5 Pro/64GB is obviously a step up, especially with prefill.
Have you tried using some of the new speculative decoding methods like DFlash or even DTree? Depending on what you do, coding presumably, they could really speed things up a lot according to preliminary benchmarks.
It's a mixed bag with the M5 and new models right now. MLX can yield a prefill speed improvement of more than 3X, but e.g. Qwen 3.6 and some other new hotness doesn't work right (or at all) yet.
My dream is Qwen 3.6 27b 8-bit MLX with working MTP and prefill boost, which should give something over 20t/s generation and several hundred t/s prefill--i.e. 2-3X+ what my M4 Pro was doing with 3.5 27b.
8-bit models in the 30-35b range are context constrained with 48GB--I basically ran out of RAM before I ran out of CPU. Now I just run whatever I want and don't bother to conserve RAM by closing other programs, etc.
I suggest you upgrade your RAM to 32GB. Then, as long as you ensure the 35B Q4 model can fit in RAM, it should be usable. Although RAM is more expensive than before, it's still much cheaper than a graphics card.
You don't need to put all the weights in VRAM, because that would require too much VRAM capacity. As long as you ensure the 3B activated parameters fit completely into VRAM, you can get a relatively decent speed.
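A quick back-of-envelope helps with sizing decisions like this. The layer/head counts below are made-up illustrative values, not the actual Qwen architecture:

```python
# Back-of-envelope sizing for quantized weights plus KV cache.
# All architecture numbers here are illustrative assumptions.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB at a given quantization."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_elem: int) -> float:
    """K plus V cache footprint in GiB."""
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

# 27B dense at ~4.5 bits/weight (Q4_K_M-ish)
weights = weight_gib(27, 4.5)
# hypothetical config: 48 layers, 8 KV heads, head_dim 128, 100k ctx, fp16 cache
kv = kv_cache_gib(48, 8, 128, 100_000, 2)
print(f"weights ~ {weights:.1f} GiB, kv cache ~ {kv:.1f} GiB")
```

Quantizing the KV cache to q8 roughly halves the second number, which is why it shows up so often in the configs people post here.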
1000 t/s PP and 35 t/s TG in ik_llama.cpp with 2x RTX5060Ti 16GB (32 GB total), graph split. Using 140k context at q8_0. It's a pretty good number if the compaction triggers at 128k
It's about exactly 3x slower than 35B A3B (~3000 t/s PP and 100 t/s TG)
But quality is better so far. The difference is easy to feel, it's more or less the same as the difference between the two 3.5 models of the same size.
Fr can we all just take a moment and say thank you China? While America is trying to fuck the world over left and right, China seems to be the only thing that’s giving us any sort of leverage at all.
Now imagine if they let us have their EVs
In America, too.
Qwen keeps shipping at a pace that's genuinely hard to track. The 27B size is particularly interesting because it sits in the sweet spot for consumer GPU deployments - fits comfortably in 24GB VRAM at Q4 with decent throughput. Curious how the instruction following and context handling compares to Qwen2.5-32B which was already a strong performer at the same tier. Does anyone know if this uses the same MoE architecture as the larger variants or is it a dense model?
I like to do story telling with the models so situational awareness and context are very important (like keeping track of who is in the room) so I have been using Q6 models which I *think* handle that better.
That's a good idea to try. Being stateless, it only knows what it can get from the system prompt or the past visible chat log. Outputting notes would help it keep track at the cost of feeling less immersive and taking up space in an already small context window. I read that Silly Tavern has a RAG database to maintain small details invisibly but haven't tried it yet.
How does this compare to the new Qwen3.6-35b-a3b? I also don't really understand the difference between qwen3.6-35b and qwen3.6-35b-a3b. Would be nice if someone can explain the difference between all 3
qwen3.6-35b and qwen3.6-35b-a3b mean the same model: Qwen 3.6 35b total, with 3b activated parameters (mixture of experts). Every input goes through only 3b parameters.
Qwen 3.6 27b is a dense model: every input goes through all 27b parameters. The result is better if all else is equal, but generation is slower. You can also fit more context with the 27b because the weights are a bit smaller.
I get 50 t/s with 27b, 170 t/s with 35b-a3b on a RTX 5090.
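That gap is roughly what you'd expect from decode being memory-bandwidth-bound: each generated token reads the active weights about once. A toy ceiling estimate, with assumed bandwidth and bytes-per-weight figures (real throughput lands well below these ceilings):

```python
# Decode is roughly memory-bandwidth-bound: each new token has to stream
# the active weights from VRAM. These are assumed illustrative figures,
# not measured specs.

BANDWIDTH_BYTES_S = 1.8e12   # assumed ~RTX 5090-class memory bandwidth
BYTES_PER_WEIGHT = 0.56      # ~4.5 bits/weight at a Q4-ish quantization

def decode_ceiling_tps(active_params_billion: float) -> float:
    """Upper bound on tokens/sec from weight reads alone."""
    return BANDWIDTH_BYTES_S / (active_params_billion * 1e9 * BYTES_PER_WEIGHT)

dense_27b = decode_ceiling_tps(27)  # all 27B weights read per token
moe_a3b = decode_ceiling_tps(3)     # only ~3B active weights read per token
print(f"27B dense ceiling ~{dense_27b:.0f} t/s, 35B-A3B ceiling ~{moe_a3b:.0f} t/s")
```

The measured 170 vs 50 t/s ratio is smaller than the raw weight-read ratio because KV-cache reads, attention compute, and MoE routing overhead aren't free.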
Would a 9b (if they release weights in this range) be useful for speculative decode to speed up 27b? Or is there some architectural reason in 27b design where speculative decode isn't a thing?
So the real question here is why Qwen seems to have a hard time scaling up. Qwen 3.6 27b and 35b are amazing, punching way above their weight (hehe), but Qwen 3.6 Max, supposedly the beefiest model of the family, the top-class frontier model, barely gets into SOTA territory.
Okay, now this will really start to undercut Western SOTA models. Even enterprises will deploy this model in their harness where it fits, not to mention that smaller models such as GPT mini, Haiku, and Flash will go kaput (they'll only be used by their own SOTA models in the harness).
The 27b will require significantly more VRAM at the same 4-bit quantization. I suspect that even at 3-bit this model would still beat OSS 20b in many ways, but it would be slower. So your answer depends on the answer to the question "what place?"
Because dense models are for dGPUs, whereas MoE models are for mini-PCs, Macbooks and laptops without a dGPU. Qwen3.6 35B A3B runs at 10 tokens/sec on DDR4-3200 on my laptop, whereas Qwen3.6 27B would be unusable.
Hardly anyone has a dGPU with 20GB of VRAM or more, which is what you'd need to run Qwen3.6 27B at a decent level.
I was debating this for a while: considering quantization KLD, is it better to run the 27b at q4 with q8 KV, or the 35b MoE at q8 with f16 cache? Same speed for both on my setup with 20GB VRAM and 64GB RAM. What do you think, does the quality gain of q8 over q4 outweigh the 27b outperforming the MoE?
Terminal bench 2.0 rules explicitly disallow modifying timeouts or resources available. Each terminal bench task has timeout (usually under 1h, mostly under 30 mins) and resources configured in the docker container by the task creator and they are chosen that way to test specific model aspects.
Inflection point #2. The first one was Opus 4.5 back in November. If the gap to the MoE model is the same in the 3.6 series as it was in 3.5, given how good the 3.6 MoE is in a harness, this is a blessing.
Namra_7@reddit
Benchmarks
davl3232@reddit
I can't believe we're getting so close to opus 4.5 levels with 2 3090s
cmplx17@reddit
how is it to run with 2 x 3090? i just have one 3090 but wondering if it’s worth getting another one. does the speed scale 2x?
Outpost_Underground@reddit
Just to drive the speed question home, I have 3090s at home and a Pro 6000 Blackwell Max Q at work. On identical inference workloads that completely fit in the VRAM of both setups the Blackwell is like 10-15% faster.
BillDStrong@reddit
How many 3090's? Without that, we can't really judge as well the difference.
Outpost_Underground@reddit
It doesn’t matter how many 3090s as long as the workload fits in the VRAM. For example, I ran Gemma4:26b on a single 3090 and I also forced it to split across both 3090s. Same prompt, and there was a 0.0003% difference in speed.
I mentioned the difference with the Blackwell card because a lot of folks expect a crazy improvement in speeds; unfortunately the performance doesn’t scale like that.
davl3232@reddit
> does the speed scale 2x?
Not really, you only get a speed up if models didn't fully fit in vram before.
Ardalok@reddit
Isn't there a speed boost from multiple cards with NVLink in vLLM?
kmp11@reddit
I am putting it to the test on my 2x4090. I can fit the Q6_XL with Q8 KV and the full context window. It's coding at 20-25 tk/s with the context window 50% full. It takes a while to ingest large context, but otherwise it's chugging along quite nicely.
Subject-Tea-5253@reddit
You will find this video interesting: https://www.youtube.com/watch?v=xS5wao4H4u4&t
It basically shows how prompt processing and token generation are affected by the number and types of GPUs you use.
Mashic@reddit
Why 2? You only need 1, and maybe even an RTX 5060 Ti or RTX 3060 12GB with quants.
florinandrei@reddit
I run all my models on a Raspberry Pi Zero.
BillDStrong@reddit
No, see you are confusing your terminal with your cloud provider. They aren't the same thing. /s
AreYouSERlOUS@reddit
Underrated comment
davl3232@reddit
you're right, a single 3090 could do it with a different quant.
Potential-Leg-639@reddit
1 is not enough for serious stuff and context.
coder543@reddit
95000 context at full KV (with multimodal loaded) is not horrible.
Mashic@reddit
But at least you can test it and use it for small stuff.
Cuddlyaxe@reddit
This is the future we want
Local_Phenomenon@reddit
My Man! Ai bro
mixbits@reddit
💯
_raydeStar@reddit
Meanwhile Opus tightens their limits and makes it more expensive to get in -- the perfect storm for a good local push
Super_Sierra@reddit
Sorry, but most local models have barely caught up to Claude 2, this is benchmaxxed and overfit to shit.
_raydeStar@reddit
What I really see lacking in local models is context limits and a good harness to give proper direction.
Context can't be solved easily, but at least a memory bank can be created to hold onto important information, and you can scrape by.
A harness can be built -- it can perform mathematical functions, solve logic problems, do web lookups, and perform basic tasks.
maybe qwen 27B cant perform as well as opus 4.5 on all tasks, but it doesnt need to
kommentiertnicht@reddit
do you have some resources / links you could share on the 2x 3090 setup?
Zc5Gwu@reddit
WTF why is this so good.
JLeonsarmiento@reddit
the bitchslapping and floor mopping of Gemma4 is brutal...
shansoft@reddit
I honestly don’t understand why Gemma4 scores this low. I’ve been using the latest 31B and its coding results have been cleaner than 3.6 35B in almost every case, and it was able to do tool calling more accurately for the Xcode MCP while Qwen just gave up or got stuck in a loop. Gemma4, in my experience, needs more detail in the prompt, but the results are better. Qwen often adds things I didn’t ask for and has less chance of one-shotting a problem.
MeateaW@reddit
Gemma is a great model for its size, but Qwen 3.6 seems incredible. I would go Gemma at this size, but the 122b Qwen 3.5 has been my favourite local-capable model so far (Strix Halo 128GB). A 3.6 in the ~100 billion parameter range is going to be amazing if it follows these smaller models' capability.
funkyman228@reddit
I think the issue is MOE vs dense, in your case and the benchmark.
BillDStrong@reddit
Looking at the benchmarks, they are claiming this 27B beats their own Qwen3.5 397B-A17B model. If that is real, that might just explain it.
corpo_monkey@reddit
The opposite, don't stop now!
I cannot wait the gemma 5 vs qwen 4 battle.
YearnMar10@reddit
Qwen fired their best tech lead - not sure if it will come to that
awittygamertag@reddit
Different use cases tho. Gemma4 is a little Gemini and inherits all of the conversational skills of Google models. Qwen workhorse.
Septerium@reddit
In my testing, gemma 4 is still a better generalist model, even though it gets demolished by Qwen in coding tasks
JLeonsarmiento@reddit
I like the rhythm and prose too. But crunching benchmark numbers is a sport here.
AbeIndoria@reddit
Gemma-4 is still a very good model at anything that's not coding/agentic work. Qwen struggles there.
JLeonsarmiento@reddit
I like that it thinks less to be honest. That make it better for chat for example.
DragonfruitIll660@reddit
Initial use for similar sized quants seems to give the edge to Gemma in my opinion quality wise.
mikewilkinsjr@reddit
Dammit, I JUST rebuilt around Gemma4 at the house. :D
Looks like tonight it's going to caffeine and downloads.
9gxa05s8fa8sh@reddit
bro this is a screenshot of the market crash lol
Legal_Dimension_@reddit
This is likely f16 rather than q4
Healthy-Nebula-3603@reddit
Nowadays Q4_K_M/L with imatrix is only slightly worse than F16.
Accomplished_Mode170@reddit
Sorry you got downvoted for a parameter when the OP is the one who dropped the /s
Old-Independent-6904@reddit
I feel like this shows how inefficient agents have been compared to what they could be. Exciting!
mister2d@reddit
TIL: Claude Opus is MoE
2Norn@reddit
i mean ofc all sotas are
AttitudeImportant585@reddit
having worked on one of them, this is false
CrispyToken52@reddit
Let me guess. Meta?
AttitudeImportant585@reddit
llama was never sota
kitanokikori@reddit
It doesn't even make sense for them to be on a technical level - they are designed to service literally as many requests as possible from all kinds of domains, why in the world would you want any part of their knowledge base to be unloaded at any time
AttitudeImportant585@reddit
you're underestimating the compute available and optimizations made for that specific architecture for a particular chip
2Norn@reddit
which one and when is a good question.
Kimi, GLM, MiniMax, Xiaomi, Gemini (stated in docs), and GPT (leaked) are all MoEs. The only unknown is Claude; there's no public knowledge of whether it's dense or MoE, but it's very normal to assume it's MoE just like all the others.
Thomas-Lore@reddit
Claude not being MoE would explain their huge compute issues though. :) Although they did speed up Opus recently, so maybe they moved to MoE recently? Or just made it more sparse.
mister2d@reddit
Obviously.
Comfortable-Rock-498@reddit
Look carefully, there is a divider between MoE and Opus
rc_ym@reddit
You do not get enough upvotes.
Cuz it looked like it was also saying Opus 4.5 was open sourced, which also isn't true.
Successful-Brick-783@reddit
It’s not necessarily; there is a line dividing them, it’s just faint.
_-_David@reddit
I just realized one of the bars isn't qwen3.5-35b... it's qwen3.5-**397b**
I have no idea why that shocked me more than the Claude comparison, but it did
social_tech_10@reddit
Incredible that it can beat a model 14X larger in 10 of the 12 benchmarks!!
BillDStrong@reddit
To be fair, the model it is beating is effectively a 17B expert, but with much more memory and a bit of help as needed. You don't get to keep all of that intelligence, unfortunately, in MoE models.
Still damn impressive.
Plasmx@reddit
Me too. Maybe because it’s the same company releasing a new generation and just slashing the prior gen big models.
Lorian0x7@reddit
The funny thing is that it's not even a new generation, just a minor update within the same generation. I've seen Anthropic and OpenAI slap a big round number on models with a much smaller performance gap.
PassengerPigeon343@reddit
I’m sorry, we’re comparing these to CLAUDE now?! Hell yeah.
RelationshipLong9092@reddit
lol
lmao
sk1kn1ght@reddit
Do they imply that it beats opus? Are we for real? Like not negatively. Like are we for real? Repeating it to myself got me goosebumps
florinandrei@reddit
If goosebumps is what you're after, there's a guy at the street corner who can sell you good stuff.
Next_Pomegranate_591@reddit
Technically Claude Opus 4.7 beats 4.6, but I heard it's shit. We never know without testing. Well, Qwen Plus users would know, I guess?
TheMegosh@reddit
Apparently the 4.7 model was condensing prompts to 200k tokens instead of 1 mil, per their changelog. I'd bet that's what made it bad.
TokenRingAI@reddit
It's got looping and other obvious issues, I have free access to it but mostly use Sonnet 4.6 or GPT 5.4.
Sonnet is really reliable and stable
Something is very strange about Opus 4.6 & 4.7, they act like a large model that is excessively quantized. Opus 4.5 was not like this. I wonder if this is a side effect of them using TPUs. Gemini acts the same way.
Next_Pomegranate_591@reddit
Frr, this happened after Jan. I would one-shot a whole project with 10 words on Opus 4.5, and now 4.6 acts so dumb; it feels basically like another Gemini 3 Pro. I mean, it's still better, but it really disappoints remembering what Opus used to do 🥀
Thomas-Lore@reddit
Keep in mind they also changed reasoning effort around that time and now it is almost zero due to adaptive thinking.
Next_Pomegranate_591@reddit
Yeah, I did notice that adaptive thinking button. What does it really do? It's not the same as the previous one, which was enhanced thinking or something? I thought it was just cosmetic and they renamed it "adaptive thinking".
Next_Pomegranate_591@reddit
I use gemini 3.1 pro in antigravity but haven't really used it in a while now. Maybe I should give it another shot. I was comparing opus with 3 pro and not 3.1 pro btw :)
TokenRingAI@reddit
This might be a side effect of adaptive thinking, the responses come almost immediately and the chat is muddled with looping content that should have reasonably been expected to be in the thinking block
MadSprite@reddit
I feel like the best way to describe it is that the intern is just as smart as the wizard but not as wise. Having fewer parameters means it's going to know less, but it handles the common tasks we ask of it really well.
vogelvogelvogelvogel@reddit
i can relate
Non-Technical@reddit
These metrics have no visible source.
some_random_guy111@reddit
Straight from qwen on an X post
Non-Technical@reddit
Great! That's important to know for a couple of reasons: they're official, so they're based on something, but since they come from Qwen they're also designed to make 3.6 look good.
JustFinishedBSG@reddit
Man it has to be benchmaxxed to the tits otherwise it’s supremely embarrassing for the « frontier » labs
kaeptnphlop@reddit
They are showing us that Qwen3.6-35B-A3B is on par with 3.5-397B-A17B too. 🤔
vinigrae@reddit
That’s extremely suspicious
Zeeplankton@reddit
god damn wtf
Automatic-Arm8153@reddit
F it we ball
Automatic-Arm8153@reddit
And I was just about to go to sleep. Maybe next time sleep.. maybe next time
Perfect-Flounder7856@reddit
😂😂😂or😭😭😭 I'm right there with you my sleep is shit but everything is so exciting. Any one have an ambien?.
ApprehensiveAd3629@reddit
which gguf quant is possible to run in a 5060 ti 16gb?
Careful_Swordfish_68@reddit
5060 Ti user here. I run Qwen 3.5 27b HauHau aggressive uncensored in IQ4_XS with medium context, which is absolutely fine quality. Expect to run 3.6 the same.
mintybadgerme@reddit
But that's 15.4GB in size. How do you get a decent context out of that?
Pablo_the_brave@reddit
The best i1 IQ4_XS quants are 14.7GB. With KV cache K at q8 and KV cache V at turbo2 you will get 75k ctx... Works great.
mintybadgerme@reddit
Thanks for your help. Is that an unsloth quant?
Pablo_the_brave@reddit
This one model https://huggingface.co/mradermacher/Qwen3.5-27B-i1-GGUF/resolve/main/Qwen3.5-27B.i1-IQ4_XS.gguf?download=true
Compile turboquant from TheTom: https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache
My llama.cpp config:
--models-preset "$CONFIG_PATH" \
--models-max 1 \
--host 0.0.0.0 \
--port 8081 \
-t 8 \
--parallel 1 \
--cont-batching \
--keep -1 \
--chat-template-file "$DIR/chat_template.jinja" \
--chat-template-kwargs '{"preserve_thinking": true}' \
--defrag-thold 0.3 \
--cache-reuse 1024 \
--jinja \
--temp 0.15 \
--top-k 1 \
--min-p 0.1 \
--spec-type ngram-mod \
--spec-ngram-size-n 24 \
--draft-min 4 \
--draft-max 64 \
--repeat-last-n 512 \
--repeat-penalty 1.05 \
and the model.ini with the rest of the settings (I'm using the router)
[Qwen3.5-27B]
model = models/Qwen3.5-27B.i1-IQ4_XS.gguf
ctx-size = 75000
n-gpu-layers = 99
cache-type-k = q8_0
cache-type-v = turbo2
batch-size = 512
ubatch-size = 128
flash-attn = true
no-mmap = true
Careful_Swordfish_68@reddit
HauHauCS Version is 15.1GB in size. Qwen context does not eat much memory.
Here is some proof in a picture so you don't need to listen to all these people talking out of their asses saying IQ3 works at best. Sorry for the bad quality, I'm on my phone atm. But you can see I can load all layers plus 30k context at Q8 KV into the 5060 Ti with IQ4_XS. If you are willing to offload some layers to RAM and sacrifice some t/s, then context size goes brrrrrr.
mintybadgerme@reddit
Thanks very much. Please send the image again, it didn't come through properly.
Careful_Swordfish_68@reddit
Huh, for me it shows up fine. Weird. You See it now?
mintybadgerme@reddit
yep. :) thanks
see_spot_ruminate@reddit
Not just for your 5060 Ti, but for anyone with only 16GB of VRAM: you will need it heavily quantized, or any spillover to system RAM will dramatically slow it down.
There is also the problem of GPU bandwidth limitations with dense models. You are not going to get anywhere close to the same t/s as with the MoEs. You will probably want some speculative decoding going on as well.
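Classic draft-model speculative decoding in llama.cpp looks roughly like this; the model filenames are placeholders (a small same-family model is typically used as the draft), flag names vary between llama.cpp versions, and the gains depend on how often the draft's tokens get accepted:

```shell
# Sketch: main model generates, the small draft model proposes token runs
# that the main model verifies in a single batched pass.
llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  -md Qwen3.6-small-draft-Q4_K_M.gguf \
  --draft-min 4 \
  --draft-max 16
```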
Careful_Swordfish_68@reddit
IQ4_XS works perfectly fine at medium context. Source: I got a 5060ti.
mintybadgerme@reddit
Better performance - UD-Q3_K-XL (14.5GB) or Q3_K_M (13.6GB)??
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
see_spot_ruminate@reddit
I have no idea. I do not typically use the dense models due to the 5060ti bandwidth despite having 4 of them. For the qwen 3.5 models, if I wanted to have more "intelligence" I would instead use the 122b model as it ran faster on my system (64gb system ram + 64gb vram) than the 27b dense model.
overand@reddit
You probably want to stick with the 35B-A3B model (MoE); the 27B might require a bit too much quantization. (Regardless, only an FP8 is available so far, so you'll need to wait either way, as that one is going to be about 27 gigs.)
Careful_Swordfish_68@reddit
IQ4_XS works perfectly fine at medium context and is ok quality. No need for Q3.
logic_prevails@reddit
Yeah and moe offload can be very usable if you have a good CPU/RAM
luncheroo@reddit
Try a few, but if you quantize kv cache and trim context, you should be able to run Q3 variants pretty well. I'm going to try Q4 to see if it's tolerable because I have the same GPU, but Q3_K_S is my fallback because I know it works for me for Qwen3.5 27b and Gemma 4 31b.
Professional-Bear857@reddit
Made a gguf: https://huggingface.co/sm54/Qwen3.6-27B-Q6_K-GGUF
TheOriginalOnee@reddit
Anything that can be used with 16GB?
film_man_84@reddit
I just ran LM Studio with Qwen3.6-27b (Q4_K_M); with a 16000 context size it runs at about 3.08 tokens/sec at the beginning of a new chat. I have 16 GB VRAM and 32 GB of RAM.
Have not yet tested with coding using Pi, just downloaded and made first tests.
grumd@reddit
IQ3_XXS from unsloth
Professional-Bear857@reddit
Try the unsloth quants on huggingface
Professional-Bear857@reddit
And a smaller version https://huggingface.co/sm54/Qwen3.6-27B-Q4_K_M-GGUF
Awwtifishal@reddit
You shouldn't quantize ssm_* so much. They should be at least q8. I usually add something like this to llama-quantize:
--tensor-type ssm_alpha=F16 --tensor-type ssm_beta=F16 --tensor-type ssm_out=Q8_0
snorkelvretervreter@reddit
Thanks! Perfect for a 24GB consumer GPU
Fringolicious@reddit
You're a real one, thanks man
twisted_nematic57@reddit
Wtf how is it so small? I remember Qwen3.5 27b being like 30gb
soyalemujica@reddit
Thank you so very much!
Dany0@reddit
Thank you, real G
Small_Ninja2344@reddit
Might work on a m4 pro 24gb ?
mmkzero0@reddit
It is insane what a 27B model can do, and this is only a “small update”
cleversmoke@reddit
Oh snaps, I am setting up a dual agent/sub-agent approach on 2x RTX 3090s with Qwen3.6-35B-A3B and Gemma-4-26B-A4B-it. I wonder how well this 27B is on chain of thought, to replace Gemma-4-26B-A4B-it.
cosmicnag@reddit
So which one is sub agent
cleversmoke@reddit
I'm thinking Qwen3.6-35B-A3B as primary builder and architect (agent) while Gemma-4-31B-it or Qwen3.6-27B as debater (sub-agent)! If the CoT on the Qwen3.6-27B is near as good as the Gemma 4, then I may be able to squeeze in more layers on a RTX 3090 or use a higher quant. This is quite exciting!
cosmicnag@reddit
nice, you're running Q4 quants of each? What context? Even I have a dual GPU setup (5090+4090), totally maxing out one model in Q8XL at max context. It's very tempting to have a dense and a MoE model running simultaneously at Q4. I know it could be vibes/placebo, but I'm concerned about quality drop if I go to Q4. However, I like your idea of an adversarial subagent (this could be what I need to drop quants). May I know how exactly you are running this subagent scene? Pi? Or which harness? When is the subagent invoked?
cleversmoke@reddit
I'm testing the duo agent all this week so will let you know! I'm using OpenCode, it's primarily via agent config prompt, so will have to see how that fares.
Builder: Qwen3.6-35B-A3B Q4_K_M, 262k context, q8_0 KV cache. Invokes sub-agent with (paraphrasing): "If it's easy, go with it. If it's complex, ping the Debater up to 3 times. Be mindful of security, coding practices in AGENTS.md, and architecture and DTOs defined master_plan.md. Ask clarifying questions."
Debater: Gemma-4-31B-it Q4_K_M, 32-64k context, q4_0 KV cache. Max steps: 6. Init prompt (paraphrasing): "Red team, be strict with coding, security, memory leaks, edge cases, laws, only read and respond, no praising, follow format for response (Issue, Priority, Recommendation, Status)."
cosmicnag@reddit
Nice!
ozspook@reddit
phone-a-friend seems to be a really good way of keeping agent sessions on track and getting past difficult problems, if we can have fast MoE models crunching through stuff and asking for help from 122b's or whatever for tricky stuff that's probably optimal, with Opus as a last resort.
cleversmoke@reddit
Agreed! 122B may be too much for my RTX 3090 though unless heavily quanted? I'm testing the phone a friend approach, do you have experience with it?
challis88ocarina@reddit
Kindly quantized: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
lucidparadigm@reddit
What's the hardware requirements
Maleficent-Pea-3494@reddit
I don’t know what it would take, but I know what I’m using. M5 Max 128gb
iBornToWin@reddit
Wow! You are the man! 🤩 Share performance after running it, bro. I want one for myself, but I'm unsure if the speed will be usable and what's the max I can accommodate on a MacBook. Otherwise I'll wait for the Ultra.
Maleficent-Pea-3494@reddit
Ya I’m def looking to get the studio ultra when it drops based on what I can see this thing do. It’s arriving tomorrow so I’ll post up a few thoughts on performance. I’ve got a narrow focus and have built an interface that’ll wire up to Qwen so I’m not using it for open ended things, which will hopefully keep it slim and fast. We’ll see. If it’s too slow then I’ll just enjoy my $5k coffee table ornament 🤣
rjames24000@reddit
also would like to know hardware requirements.. would be great if I can run this smoothly on a single 5090
wen_mars@reddit
I'm running the unsloth q5 with q8 k/v cache at max context length and it works great
GCoderDCoder@reddit
I'll say that for the GGUF, the KV cache seems to behave more like Gemma 4 31B than Qwen 3.5 27B. I was able to squeeze a lot in with Qwen 3.5 27B, but 3.6 27B is doubling the model size for my cache.
This is a great model, but right now my LM Studio Q8 has fewer code issues than my unsloth Q8_K_XL. That's unusual. I wonder if anyone else is experiencing this. I tried a couple of different quants, and Q6_K_XL kept looping around 30-40k tokens without presence penalty.
Initial thoughts: the unsloth ones feel jagged. Much more ambitious, but also more glitches. When instructed to iterate to fix things, it makes changes, but with less meaningful differences than I expected. I'm still playing with it between other tasks, but curious about anyone else's experiences.
RelationshipLong9092@reddit
Maybe below 8k context... but it looks just barely too big to me.
bonobomaster@reddit
Everything you got plus your soul!
AuroraFireflash@reddit
Is there an MLX FP8 quant yet?
woahitsraj@reddit
Yup https://huggingface.co/unsloth/Qwen3.6-27B-MLX-8bit
Dubious-Decisions@reddit
What are you using to run this? Would like to run the MLX flavor but ollama doesn't seem to support it.
MeateaW@reddit
LM Studio is the lazy way. (Turn off "keep model in memory" and use mmap to load models; otherwise the model is kept in system memory too, which just doubles the memory required to load it on unified-memory systems.)
AuroraFireflash@reddit
ty ty, can't look while at the office due to the firewall blocks
cafedude@reddit
Does that one not run on llama.cpp?
bytwokaapi@reddit
Please do the needful
CORKYCHOPS@reddit
I'm new to this, can you download this for using offline? if so how do you do that?
BustyMeow@reddit
This is the purpose of local models
CORKYCHOPS@reddit
yeah thanks, then how do I get the model? I have tried to pull and get "An attempt was made to access a socket in a way forbidden by its access permissions"
Dooquann@reddit
guys I'm completely new to this, can I run this on my 8GB AMD GPU?
egrueda@reddit
Resident_Bell_4457@reddit
Do you guys think, at this pace, in a year I could run a workflow similar to what Claude Code currently offers with local LLMs on a 64GB MacBook Pro?
EbbNorth7735@reddit
The capability density doubles every 3 to 3.5 months. So an 800B now will be matched by a 400B in roughly 3 months, that by a 200B at 6 or 7 months, and finally a 100B at 10 to 12 months. We're talking MoE; that's roughly equal to a 30-35B dense. So yes.
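Taking the "density doubles every 3 to 3.5 months" premise at face value (it's a rough extrapolation, not a law), the arithmetic above is just repeated halving:

```python
# Mechanizes the comment's arithmetic: if capability density doubles every
# ~3.5 months, today's N-parameter capability fits in N / 2^(months/3.5)
# parameters later. The doubling period itself is a rough, contested premise.

def equivalent_size(params_b: float, months: float, doubling_months: float = 3.5) -> float:
    """Parameter count needed after `months` to match a `params_b` model today."""
    return params_b / (2 ** (months / doubling_months))

# 800B -> ~400B after one doubling, ~100B after three doublings (about 10.5 months)
```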
skyyyy007@reddit
I just set up an M5 Pro 64GB and it works pretty well: 128k context window in LM Studio, running on average at 55-70 tps depending on the type of prompt. Uses about 32GB of RAM for this.
Beginning-Window-115@reddit
you should say that you are running a different model or people are gonna think you're running qwen3.6 27b
skyyyy007@reddit
Edited to reflect the model, thanks for highlighting that👍🏻
Resident_Bell_4457@reddit
That is impressive. Would you mind giving a few updates after you've used it?
Like how it performed in real-life tasks, what is your experience? I'd really appreciate your insight.
the3dwin@reddit
https://www.canirun.ai/
the3dwin@reddit
Also LM Studio Tells you whether a model runs on your hardware
kmp11@reddit
Let's see what happens when models start getting released as 1-bit using some offshoot of TurboQuant. This is really the next step for local models.
Beginning-Window-115@reddit
m5 pro gives me 70tok/s on 3.6 35b moe and around 14token/s on the 27b dense
Resident_Bell_4457@reddit
Do you use it for coding, similar to how a workflow looks for a person using Claude Code in the CLI, you know, parallel agents, ~100-200k tokens? How does it compare to that? Is it getting closer, or is context still a big issue?
Beginning-Window-115@reddit
Yes, I can use it at ~100k context, although I don't use agents in parallel or vibe code; most of the time I just use it for bug fixing (I have 48GB of memory, so context isn't really an issue).
Borkato@reddit
You can literally do that now with 35BA3B. It’s god tier, which means this is going to be titan tier
Fringolicious@reddit
I can't be reading this right - a 27B model that's as strong as Opus 4.5 pretty much across the board? Fucking hell
amunozo1@reddit
Qwen are famous for benchmaxxing. I really doubt this is comparable to Opus in any way.
Evening_Ad6637@reddit
Qwen is definitely not known for benchmaxxing. Even back in the Qwen 2 series, it was clear that Qwen was doing something different with their training. For example, it's well known that their pre-training datasets were significantly more math-heavy than others.
In my private tests, Qwen-3.6-35B actually delivered better results than Opus-4.7.
This means that either it's actually Anthropic who's doing the benchmaxxing, or Qwen has managed to benchmaxx real-life tasks.
EbbNorth7735@reddit
Been running qwen3.5 122B and finding it amazing combined with Cline for agentic coding
bladezor@reddit
I did not have the same results as you. Granted, I was running a quant, but they were worlds apart for me. I had Opus 4.7 review the code Qwen 3.6 was generating, and there were a lot of kickbacks.
Beginning-Window-115@reddit
lol no they arent
Top-Rub-4670@reddit
Whether qwen is or isn't we can both agree that r/LocalLLaMa is famous for claiming that <> DESTROYS Claude in all their tests.
toothpastespiders@reddit
And I think that points out how flawed the argument is. If models that can fit in 24 GB of VRAM were actually beating Claude in real-world use, then the hyperbolic claims would be about new models beating *that* model rather than Claude, because we'd all be looking at Nemotron, Phi, or whatever as the top-tier model family to beat rather than Claude.
some_user_2021@reddit
Did you check?
Beginning-Window-115@reddit
have I checked qwen models? yes
toothpastespiders@reddit
It's ridiculous that this is even a controversial opinion. I'm incredibly grateful for the Qwen family of models. I think 3.5 27b was one of the best releases we've had in ages. But Qwen absolutely has a reputation for benchmaxxing. At least outside this sub. Doesn't make the models bad. Just means qwen has specific strategies for training models that align with the major benchmarks.
Caffdy@reddit
Every lab is benchmaxxing one way or another; it's their target goal after all, the first thing people refer to when new models are released. Of course they're going to try to be the best.
Ueberlord@reddit
Damn, I was just wrapping up my tests of Qwen3.6 35B vs Qwen3.5 27B.
High hopes for 3.6 27B though, the 35B variant of 3.6 was way better than the previous version!
Far_Cat9782@reddit
3.6 35b is godly. Especially with the right harness
WeUsedToBeACountry@reddit
what harness are you using
FeiX7@reddit
pi
ab2377@reddit
which one is that pi?
ozspook@reddit
pi.dev
cuberhino@reddit
Any advice on using it? Never tried it installing now
florinandrei@reddit
Yeah, they chose the most googleable name in the world for it, lol.
ab2377@reddit
😂
Far_Cat9782@reddit
My own. I think that's the best for any model.
ionizing@reddit
Screw the haters, they don't get it. Some of us DO make our own applications, and they tend to work because we put time into crafting them for the model. I've been working on mine for ~10 months and Qwen 3.5 made it shine. Qwen 3.6 is actually paying attention to the system prompt and figured out parallel tool execution; the MoE tends to send single tool calls for some reason, while 27B likes to group them, which is one difference I am seeing already. Anyhow, I am also loving the recent trend of people realizing that local models work best with curated tool sets. Heck, all they really need is good prompts and guarded bash access nowadays, in all honesty. OK, I am rambling, but I laughed at the haters and came back to this comment after work to brag about my own 'harness', because yes, some of us do that. I mean, this IS LocalLLaMA...
Tamitami@reddit
Can you share this? I'm also on cachyos and I mainly use forge-code
ionizing@reddit
I need to discuss the ramifications of sharing it with my employer first, because I built a lot of it on the clock and they know about it, so it might be an issue to just open source it now. It gets used in an offline environment for work purposes. It would need a bit of cleanup and some other features before releasing too.
Also, the available tools on the market are finally catching up to the capabilities, so this is less unique in its abilities now (though I am convinced I was the first to have native docx export, including LaTeX to OOXML, tables, lists, etc., months ago lol). So I have thought about putting it out there and might be able to someday.
LewisTheScot@reddit
My girlfriend goes to another school ah comment
JohnnyLovesData@reddit
No, he's peggers
2Norn@reddit
probably pi
redboy33@reddit
Explain "harness" for grandpa. I love AI. I'm just getting into running local models on my Apple M5 Pro and a Framework with the AMD Ryzen AI Max+ 395 APU w/128GB RAM. Using LM Studio & Ollama. Thanks!
GreenHell@reddit
Mention Ollama, and people will get riled up on this sub.
I think Ollama is an okay starting point for a lot of people since it is rather plug and play.
But if you want to get a bit more serious with local models, you will want to look into llama.cpp (https://github.com/ggml-org/llama.cpp) (on which ollama is heavily based without attribution), and llama-swap (https://github.com/mostlygeek/llama-swap) for managing multiple models, switching them out, etc.
llama.cpp is much more performant than Ollama, allows for greater customization, and gets updates faster.
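For anyone eyeing the jump from Ollama, a bare-bones llama-server invocation might look like this. The model path is a placeholder, and flag spellings occasionally change between builds, so double-check with llama-server --help:

```shell
# Serve a local GGUF on an OpenAI-compatible endpoint (http://localhost:8080/v1).
# -ngl 99 offloads all layers to the GPU, -c sets the context window,
# and -ctk/-ctv quantize the KV cache to q8_0.
llama-server \
  -m ./models/Qwen3.6-27B-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -ctk q8_0 -ctv q8_0 \
  --port 8080
```

Note that quantizing the V cache may require flash attention to be enabled, depending on the build.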
skirmis@reddit
No need for llama-swap, now llama.cpp server has model loading/swapping/unloading built in: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#model-presets
DefNattyBoii@reddit
FYI, Ollama usage patterns are massive cancer; it just teaches you the wrong skills for local LLM tinkering.
GreenHell@reddit
I think that is a bit harsh.
And I too started with Ollama before moving to llama.cpp and llama-swap.
ComplexType568@reddit
A harness is basically the new buzzword: a pretty large umbrella term for a way to give LLMs tools. Think of it like a mech suit for a person; a harness is the mech suit for an LLM. Also, I recommend you drop Ollama, just imo tho hahahah
arcanemachined@reddit
It's not a buzzword. It's a new class of tool, which is why it needs its own word to describe it.
ThisWillPass@reddit
We used to say "how you hold it" for lack of better words. "Harness" is the cleaned-up version.
redboy33@reddit
Would openclaw and Hermes be considered harnesses? What about Qwen Coder CLI? Oh, and what's wrong with Ollama? I kinda like it better than LM Studio, I guess because I don't tweak any settings. Just "ollama serve", "ollama run qwen"; it seems so simple and intuitive.
redonculous@reddit
Openclaw yes. Hermes unsure. I don’t know too much about Hermes but believe it’s a model with claw like features, rather than a program with tools like openclaw.
Happy to be corrected though
adam_suncrest@reddit
hermes would qualify as a harness yes, at its core it's a coding agent with more bells and whistles
New_Comfortable7240@reddit
The simplicity makes you lose some capabilities and speed. By going with llama-server you can fine-tune (or copy the settings from others) and get better results. Sadly, it's a bit more work, but after some time the thing just works. Also, llama.cpp updates frequently and drops good optimizations regularly, faster than Ollama (which reuses llama.cpp anyway).
BasicBelch@reddit
Ollama is just a shitty wrapper around llama.cpp. Just use llama.cpp directly, it will be faster too.
Far_Cat9782@reddit
Yes
markole@reddit
Harness to an LLM is what a Zord to a Power Ranger is.
ASYMT0TIC@reddit
Think of the LLM like a horse. It's beautiful and it can run and jump etc... but it's kind of hard to get any work out of it as-is. Put a harness on that horse, now you can get it to pull a plow, carry people, etc.
txgsync@reddit
LLMs just output text. A harness is anything that allows that LLM’s output text to do anything other than output text.
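A toy version of that definition, with the model faked by a canned function so the loop is self-contained (a real harness would call a chat endpoint and use a real tool-calling schema):

```python
# A minimal "harness" in the sense described above: the LLM only emits text,
# and the harness is the loop that turns special text (tool calls) into
# actions and feeds results back. The model is a stand-in function here.
import json

TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def fake_model(messages):
    """Stand-in for an LLM: asks for a tool once, then gives a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return "The answer is 5."

def run_harness(user_prompt, model=fake_model, max_steps=4):
    messages = [{"role": "user", "content": user_prompt}]
    out = ""
    for _ in range(max_steps):
        out = model(messages)
        try:
            call = json.loads(out)            # model asked for a tool
        except json.JSONDecodeError:
            return out                        # plain text: final answer
        result = TOOLS[call["tool"]](call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return out
```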
dxplq876@reddit
3.6 35B? where are you seeing this? I'm only seeing 27B on hugging face
StardockEngineer@reddit
I use a minimal harness and it’s godly. Harness has nothing to do with it.
choicechoi@reddit
i wonder your setting too
Far_Cat9782@reddit
Llama server, and I use the settings on the Qwen 3.6 Hugging Face page. If you go there, it will tell you the optimal temperature and sampling parameters. The general-assistant parameters are what I use, and I switch on the fly to the coding parameters. I used Gemini to create my harness, and it's writing its own MCP servers now. I call it kokocode but haven't released it yet; I keep adding features and functions.
KedMcJenna@reddit
And what quant (the other guys have the harness side covered, I'm a quant guy)
Far_Cat9782@reddit
I'm using unsloth 3.6 35B Q6_K_XL. The Q4_K_M works just as well.
JuniorDeveloper73@reddit
the new buzzword
motorsportlife@reddit
Which harness are you running with it
KURD_1_STAN@reddit
Def, and I had high hopes for Qwen 3.6 27B because of the 35B, but the benchmarks seem disappointing.
nullmove@reddit
I think the relative difference might not be as big now that the MoE is fixed. But still, equivalent dense models are better in ways that aren't always captured in benchmarks (world knowledge) but are still evident in daily work. ~60 on Terminal-Bench here is already incredible, though.
namakoo1@reddit
It’s so frustrating not being able to run it on my local setup. 😭 I really need more VRAM. Maybe it’s finally time for an upgrade!
CurrentNew1039@reddit
Now release qwen 3.6 9b which beats qwen 3.5 27b or 35b, that would be awesome
BringMeTheBoreWorms@reddit
Let's hope it gets rid of looping
abu_shawarib@reddit
Might be worth it to try to preserve thinking, or a slightly higher quant like Q5.
BringMeTheBoreWorms@reddit
Havent had it loop yet!
bilinenuzayli@reddit
Doesn't loop for me
BringMeTheBoreWorms@reddit
That’s awesome!
iportnov@reddit
Hmm. In my experiments (with unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL), tool calling works like 50/50.
For 3.6 35B A3B it worked very well. Maybe I'm doing something wrong...
Expert_Development64@reddit
The initial Unsloth quantization is not recommended.
Teetota@reddit
Unsloth is famous for this. They usually fix tool calling a few days later.
adam_suncrest@reddit
densocrats it's time to eat 🍽️
some_user_2021@reddit
Dense meat is back on the menu!
Local_Phenomenon@reddit
My Menus!
IrisColt@reddit
*starts eating like prime Harold Saxon*
Ok-Internal9317@reddit
Today's another sleepless night!
My_Unbiased_Opinion@reddit
Moeblicans can sit on the sidelines today!
lolpezzz@reddit
Haven't followed the trend for months, what's the hype for this one?
FluxFlicker@reddit
Anyone knows if they will be releasing a new version of 122B?
emaiksiaime@reddit
And a very sparse coder like the 80b qwen3 next! That would be awesome!
grumd@reddit
Release Qwen3.6-Coder-80B-A10B and I will stop wasting huggingface bandwidth every day
KURD_1_STAN@reddit
Hopefully they make coder specific one like qwen3 but with 6b active instead
PengLaiDoll@reddit
According to Qwen's official WeChat account, it appears this is the final open-source model in the Qwen 3.6 series. Of course, we can also hope that this is simply a typo.
nickludlam@reddit
Who knows. They might even skip it entirely because that size is just so much less common now.
More-Curious816@reddit
Actually, that size is more common now that we have 128GB unified-memory PCs any customer [who has money, or likes to eat rocks] can purchase:
DGX Spark, AMD Strix Halo, and Apple's newest laptops plus the Studio, their desktop equivalent.
coder543@reddit
by what metric?
nickludlam@reddit
Just thinking about the releases that are around 120B params, we have GPT OSS 120 from Aug 2025, Qwen 3.5 122B and Nemotron 3 Super 120B from this year.
My point was that it's just not a very common size (no equivalent Gemma or Mistral models). Also there is a trend towards releasing small models for domestic use, and retaining the larger models for cloud-only inference. I can absolutely see a strategy where it's thought that the ~120B models are good enough to eat into cloud inference profitability.
But yes, this is a reckon, and I may be hallucinating things in the tealeaves which don't exist!
coder543@reddit
Mistral Small 4 is the same size. The only one missing is Gemma 4, which is rumored to have one of the same size too (Gemma 4 124B), just not released yet.
nickludlam@reddit
I had no idea Mistral Small 4 was that size! Thanks for the correction
grumd@reddit
They included it in the poll so I'm assuming they'll release all 4 - 9b, 35b, 27b, 122b. They didn't include 397b in their twitter poll though, so they might not open source the big one
TassioNoronha_@reddit
where did you see this poll?
grumd@reddit
https://x.com/ChujieZheng/status/2039909917323383036
TassioNoronha_@reddit
thanks mate
pmttyji@reddit
Of course, we're getting it. But let them cook that one into the best with more stuff.
Thrumpwart@reddit
Soon I hope!
electricarchbishop@reddit
Alibaba please please please give us 9B!! My poor 3060 can’t handle these things!!
DominusIniquitatis@reddit
Try the 35B one (unless you're specifically interested in the dense model, of course).
electricarchbishop@reddit
Yeah, but it’s so slow. I have 64gb DDR4 ram so I can make it work but 10tk/s is so slow compared to the 40tk/s you could get with 3.5 9B. I can sacrifice intelligence if it’s just agentic coding, a smaller model can correct its mistakes faster than a MoE can finish outputting its first message.
DominusIniquitatis@reddit
3060 12GB, 32GB DDR4. Getting around ~28 t/s with Q4_K_XL at 32768 context length (can go even higher, tested at around 80k with 131072 allocated, got ~23 t/s). All llama.cpp settings are basically default.
electricarchbishop@reddit
Oh yeah, that’s probably why. I’m using ollama, lol.
OS-Software@reddit
Yeah, Ollama can't properly CPU offload MoE models. Just use llama.cpp or LM Studio instead.
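This is the llama.cpp feature being referenced: MoE expert tensors can be pinned to CPU RAM while the attention layers stay on the GPU. A sketch, assuming a recent llama.cpp build; the exact flags have changed over time, so verify against llama-server --help:

```shell
# Keep layers on the GPU (-ngl 99) but pin the MoE expert weights to CPU RAM,
# so only attention/shared tensors occupy VRAM. Model path is a placeholder.
llama-server \
  -m ./models/Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 32768

# Older builds use a tensor-override regex instead:
#   llama-server -m model.gguf -ngl 99 -ot "ffn_.*_exps.*=CPU"
```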
DominusIniquitatis@reddit
Just in case, I could run Qwen Next 80B one (also A3B) on the same hardware just fine.
electricarchbishop@reddit
At what quant???
DominusIniquitatis@reddit
IQ4_XS. But probably can go to Q4_K_XL also, given that it's the active parameter count (that 3B) that matters the most, not the total one.
electricarchbishop@reddit
Interesting, I’ll have to look into that. Thanks!
Dion-AI@reddit
Time to load the tests again...
Eyelbee@reddit
boston101@reddit
Hahahaha perfect img for us nerds
NeuroPalooza@reddit
I see a bunch of benchmarks posted, but how does this compare to GLM 4.7 flash for creative writing? Is it uncensored out of the box?
WhyLifeIs4@reddit
Benchmarks
pmttyji@reddit
It beats 397B on 10/12 items
ZBoblq@reddit
According to whatever that is supposed to measure 35b-a3b is nearly equal with 397b-a17b, which doesn't make much sense either.
pmttyji@reddit
But for broader knowledge, big models are best.
The current Qwen 3.6 27B & 35B models are great and awesome for consumer GPUs.
AvidCyclist250@reddit
web mcp is best. Good thing we can do tool calls eh
Thomas-Lore@reddit
Some things are not available on the internet and yet large llms know them from books and other offline data.
Imaginary-Unit-3267@reddit
This is what scihub and libgen are for. (And physical libraries plus OCR, if need be.)
commenterzero@reddit
Like how the streets work
Far-Low-4705@reddit
It's not really knowledge imo, it's more about nuance.
It understands the nuance of your prompt better, and sees things that aren't implicitly said, even in code, especially when going beyond generic SWE coding to much harder, more complex, or low-level coding tasks.
For me the biggest difference I see is nuance. (Obviously knowledge is better too, but it's not that big of a factor imo.)
huffalump1@reddit
Yep my opinion based on my current understanding is that these new smaller models are getting really good at agentic coding workflows, tool calling, etc...
Definitely not as good for world knowledge and writing, like you say. BUT when they're fast and cheap, search and iteration is easy
FullstackSensei@reddit
Having used both 3.5 397B at Q4 and 3.6 35B at Q8 side by side for agentic coding, I can say that within this scope they're practically matched. But keep in mind this is a pretty narrow scope, and one that is very much a beaten path.
I'm sure if you go to more obscure programming languages, or tasks unrelated to programming, 397B will win.
RevolutionaryGold325@reddit
Also depends on what you are programming. If you do some low complexity UI+backend+database coding, you don't really benefit from the more clever models. If you do some complex refactoring, algorithm design, heavy math and solve difficult problems, the more powerful models are able to figure things out better.
FullstackSensei@reddit
Haven't tried really complex stuff with 3.6, but I can say I did try fairly complex tasks on large projects and 3.6 35B held well. 3.5 couldn't handle much simpler tasks.
I do have some low-level C++ tasks I want to test 35B and 27B with. We'll see how they hold up.
snmnky9490@reddit
The 35b moe is 3.6 whereas the 397b is 3.5
ZBoblq@reddit
So what? There was at most maybe a month between their releases.
etaoin314@reddit
Yeah, realistically there will be a bigger difference in real-world use compared to benchmarks. However, I do think the gap has closed in a meaningful way, and what they have been able to achieve with the 30-billion class of models is truly impressive. Anybody with a strong gaming computer can run a 30-billion-class model, while the 397B takes $10,000 worth of hardware.
snmnky9490@reddit
What do you mean "so what?"
You asked why the smaller one is just as good as the big one. It's because the small one is newer and updated and the bigger one hasn't been updated yet, however long ago the last release was
Puzzleheaded_Base302@reddit
maybe there will be a qwen3.6-397b
clyspe@reddit
If we're seeing performance this great out of a dense medium sized model, why doesn't someone do a dense large model again? It seems like the last great one was llama 3.3 70b. Is it that expensive to train big dense models but the 400B sparse models are cheap? It seems like if we had a qwen 4 70b it could sweep the board.
Medium_Question8837@reddit
I think it's because of the lower inference speeds. Given that agentic usage is trending, dense models fall behind MoE speed-wise. I can get 200 t/s with the FP8 35B-A3B, but with the 27B I get around 60ish t/s generation speed.
AvocadoArray@reddit
Don’t forget Seed OSS 36b. It was my daily driver until 3.5 and Gemma.
Mochila-Mochila@reddit
Mirrin' dat Skills Bench score
Perfect-Flounder7856@reddit
🤯
anarchist1312161@reddit
Excellent, going to see how it performs on my 7900 XTX.
hackiv@reddit
Do my eyes deceive me? Does it beat full-size Qwen 3.5? Wtf. It trades punches with Opus 4.5 (I know, not the newest Opus), but fuck, it's 27B, you can run it locally. Opus 4.5 is probably hundreds of billions of parameters.
Free-Combination-773@reddit
In benchmarks. In real life scenarios it's nowhere close to Opus. But for the size it must be really good
tommitytom_@reddit
I just scroll past benchmarks, I decided they're meaningless a long time ago. The only real benchmark is trying it for yourself
Long_War8748@reddit
I am always a bit sad when I see the hype surrounding new models and the benchmarks not transferring to my actual use cases at all. Of course, we need some kind of eval with the rapid-fire release of models, but benchmarks have become worthless for my own use cases.
Hope someone finds a solution for this eventually, because eval time is limited (I only have so many hours every day 😅)
the3dwin@reddit
Use specs and markdown files like OpenSpec.
Also custom commands like "/explain"
Recently I created explain.toml, which has been a tremendous help before I get it to execute:
description = "Explain Following Prompt ARGS:"
prompt = """
## Expected Format
The command follows this format: `/explain`
## Behavior
Check if a file named "Explained-Prompts.md" exists; if it does not exist, create the file.
Make a copy of the existing "Explained-Prompts.md" in case there is a mistake in appending to the file and the file content gets replaced upon update; the copy can be used to restore easily.
Avoid executing the prompt.
Analyze the prompt.
Explain in detail what is understood from the prompt.
Explain the goals from what is understood from the prompt.
Explain the non goals from what is understood from the prompt.
Explain the plan of action from the understood prompt.
Explicitly and in detail explain how the prompt could be improved; list out what is ambiguous and implicit, then how it could be made explicit and without ambiguity.
Give a detailed improved prompt that is explicit without any ambiguity.
Update "Explained-Prompts.md" file by adding to the end of the file with following.
Add to end of Explained-Prompts.md file:
----------------------------------------
###### Prompt:
[PROMPT]
###### Understood Explanation:
[UNDERSTOOD EXPLANATION]
###### Goals:
[GOALS]
###### Non Goals:
[NON GOALS]
###### Plan:
[PLAN]
###### Improvement:
[IMPROVEMENT]
[LIST OF AMBIGUOUS IMPLICIT TEXTS]
###### Improved Prompt:
[IMPROVED PROMPT]
----------------------------------------
"""
metigue@reddit
Do you have assumed knowledge in your real world use cases?
It's best to treat smaller models like this as pure tools. Give them the detail and knowledge to execute on and they'll blow you away.
Or YOLO it with web searching and hope they find the details you want.
_-_David@reddit
Source: Trust me bro. It's been out for a full hour.
Caladan23@reddit
You must be new to LLMs...
_-_David@reddit
Nope.
Free-Combination-773@reddit
You don't need to try a tiny model to know it's nowhere close to one of the best behemoth models to date. If I claim a newly released Toyota SUV is not as fast as last year's Ferrari, you won't need proof for it, will you?
chaitanyasoni158@reddit
The car analogy doesn’t really hold here. Cars scale pretty linearly with horsepower. More power, more performance. LLMs don’t work like that.
A smaller model can absolutely match or even beat larger ones on specific tasks.
That comes down to training quality, data, and optimization and not just raw parameter count.
A 27B model won’t beat frontier models overall. But saying it’s “nowhere close” is just not right. The gap has narrowed a lot and for many tasks, smaller models are already somewhat competitive.
AdventurousFly4909@reddit
The source is based on thousands of model releases which claim to beat sota.
trusty20@reddit
SOTA for regular coding tasks / modest apps or websites is increasingly not a distant goalpost. Local models can probably already compete. SOTA that will remain SOTA is probably one-shotting massive applications and very niche specialized knowledge domains.
_-_David@reddit
Which this isn't claiming. This one claims to be comparable to, not beat, a model two releases and six months old. I get the skepticism, but the guy isn't saying "It will probably fall short." He straight up states it as fact.
See, this is why I love languages other than English. If he said it in Spanish for example, in the subjunctive mood, the speculative aspect would be embedded in the writing automatically. Either way, que tengas buen dia! (I amuse myself)
Both_Opportunity5327@reddit
And I bet you he is right.
toothpastespiders@reddit
Seriously. I'm a little shocked by how many posts I scrolled through that are seriously stating that this is comparable to claude. Not just beating it in benchmarks, but extrapolating that to mean it will deliver real world performance at that level.
It's just bizarre to me that anyone getting serious use out of local models can still take the big benchmarks at face value. I think they can typically be suggestive of a model's strengths and weaknesses. But that's about it.
blutosings@reddit
The bench comparisons are to Opus 4.5, not the latest 4.7. Also, Opus is a much larger model.
Healthy-Nebula-3603@reddit
Qwen models are really good at coding... I can believe that 3.6 27B dense is this good.
Look up Bijan on YouTube.
vinigrae@reddit
You’re right, the results in bijans tests were surprising
laterbreh@reddit
397B smokes it in real codebases. Tried it this morning. Anyone thinking a 27B dense can match the context understanding of a model 10x its size is delusional.
hackiv@reddit
Wish there was a real-world benchmark that could represent such a workload.
Atom_101@reddit
You are looking at the real-world benchmarks: Reddit comments. That's what I am trying to gauge here lol.
LeonidasTMT@reddit
Isn't it expected to beat 3.5 qwen 27b?
hackiv@reddit
That's not full size qwen 3.5
jon23d@reddit
I’m confused. I am using qwen3.6 35b — is this somehow better?
notgreat@reddit
The general rule of thumb is that a MoE like 35B A3B is roughly equal to a dense model of sqrt(a*b) parameters: sqrt(35B*3B)=10.25B.
This rule doesn't seem to be holding up perfectly anymore as recent MoEs have done better than the rule would suggest, but it's still a useful ballpark estimator and explains why the 27B dense model is significantly better than the 35B A3B model. The dense model is, however, much slower. The MoE only uses 3B parameters per token, which is a massive reduction in compute.
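The heuristic in one helper, for plugging in other configurations (this is purely the rule of thumb described above, not a measured result):

```python
import math

def moe_dense_equivalent(total_b: float, active_b: float) -> float:
    """Geometric-mean rule of thumb: an MoE ~ a dense model of sqrt(total * active)."""
    return math.sqrt(total_b * active_b)

# 35B-A3B: sqrt(35 * 3) ~= 10.25B dense-equivalent by this heuristic
```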
BasicBelch@reddit
yeah why does 35b even exist if the smaller model is somehow better?
Puzzleheaded-Drama-8@reddit
35B is much faster though
jon23d@reddit
Five times as fast for me
mxforest@reddit
Because 35B runs much faster than 27B. Like 4-5x faster at the very least, because it activates only ~3B params per token, not all of them like the 27B.
frzen@reddit
At any moment only 3B parameters are active with 35B-A3B, but with 27B it's constantly the full 27B. Hope that makes sense.
dinerburgeryum@reddit
35B is Mixture of Experts; you only activate ~3B parameters per token. This is a 27B dense model: you hit every weight every token, and unsurprisingly it will almost always outperform the MoE model despite the MoE's larger total size. The MoE model will be significantly faster, however.
SnooPaintings8639@reddit
Everyone is saying "faster" so I'll add: cheaper. My GPU does sweat much, much less in such setup, making the USD per token significantly better in MoE models. Same can be observed in API providers.
So, cheaper and much faster, and only slightly worse.
Aromatic_Bed9086@reddit
Yes, it is better, but it is also a different architecture. "35B" usually means the 35B-A3B when people talk about it. It has 35B trained parameters, but to answer a prompt it routes the data through a number of "experts" (Mixture of Experts), meaning it only uses about 3 billion parameters per token. This means the MoE model is (1) larger, because you still need to store all 35B parameters somewhere, (2) faster, because each token only passes through 3B parameters, and (3) lower performance/intelligence, because each token only passes through 3B parameters.
kondrag@reddit
27b is a dense model. 35b is MOE.
SheepherderSerious51@reddit
I used to pray for times like this
Snoo_27681@reddit
What a time to be alive, and have RAM...
V0dros@reddit
Now praying for the hardware to run this
debackerl@reddit
And the Chinese delivered!
Glad-Pea9524@reddit
Qwen is Chinese?
Bulky_Book_2745@reddit
yes
Maleficent-Ad5999@reddit
Me too.. now our prayers have been answered..
More-Curious816@reddit
God bless the Qwen.
True_Requirement_891@reddit
We need to keep this momentum going
GoTrojan@reddit
35B IQ3 or 27B 4_K_M?
jake-writes-code@reddit
Truth be told I can't get Qwen3.6-35B-A3B to outperform Qwen3-Coder-Next. Running both at bf16 on an M3U 256 in Claude Code (although I think I'm about to swap it out for a customized Pi rather than deal with the closed-source bullshit from Anthropic anymore).
Will try Qwen3.6-27B at 8bit and see how that goes. I'm not really concerned with speed; just intelligence/accuracy. Would love to have a new go-to coding model.
Tamitami@reddit
what about forge code as a harness? it seems to beat claude code with opus too.
I really like Qwen3-Coder-Next as it runs fast and provides good results if you steer it well. I'd like to see it compared to this new Qwen3.6 27B model and the 35B-A3B MoE, but I can't find any good sources.
AuroraFireflash@reddit
Which quant? MoE's are really sensitive about how quant is done from my limited understanding.
jake-writes-code@reddit
bf16
Artistic_Okra7288@reddit
Same. Qwen3-Coder-Next has been my go-to for agentic coding. It's still not great, but I queued up Qwen3.6-27B and am giving it a roll.
social_tech_10@reddit
Opencode works well with Qwen models
You can even pick one of their free models for the first few minutes and have that model set up the opencode config file for you to run your local model.
Blues520@reddit
Let us know how it performs compared to qwen3-coder-next
Weak-Shelter-1698@reddit
But can it do creative writing? please tell me yes.
Technical_Ad_6106@reddit
yes
FusionCow@reddit
gemma 4 still the goat for that
Weak-Shelter-1698@reddit
🔥🔥🔥🔥
Endothermic_Nuke@reddit
Just spent three weeks benchmarking and fine-tuning Qwen3.5-27B and iterating with variants. Now this drops. 🤦♂️ But it makes me happy, of course.
That said, I think there’s some suspect stuff with the benchmarks shown here. Gemma4-31B is absolutely better than the Qwen3.5-27B in my testing in multiple areas.
Kodrackyas@reddit
This is a disruptive level of intelligence gain on every iteration. Ladies and gents, this is how the AI bubble will pop: big AI companies will go to shit and local 4090s are going to get even more expensive
cdshift@reddit
I don't believe these kinds of advancements pop the bubble as long as we are so constrained by power and compute limits in datacenters, especially as capacity demands keep going up.
CyberAttacked@reddit
27B model better than opus ?! Who the fuck hurt Qwen ?
cdshift@reddit
Im always skeptical about benchmarks alone. Im holding out for more tests from the community here on their private benchmarks and personal usecases
bunny_go@reddit
Extremely verbose, keeps on thinking and thinking without making much progress. Even if the results are good (yet to be seen), it's extremely slow due to being verbose and being dense. Meh.
Express_Quail_1493@reddit
Please, if you can, leave a comment on the Hugging Face page. I'm just a regular guy who hopes Qwen never stops helping us out, so comment to your heart's content, please, all.
ihatebeinganonymous@reddit
Is it dense?
x10der_by@reddit
it's densin time
logic_prevails@reddit
Are you dense?
Dany0@reddit
Dense like Patrick Star? No. Not quite. But we flew past worms and we're getting close to the number of neurons in a fruit fly. About 1 or 2 orders of magnitude to go
Jokes aside model seems baller
Finanzamt_Endgegner@reddit
bro this fruit fly is smarter than most people on the internet 😭
Dany0@reddit
Nah it's just that locomotion is a surprisingly cognitively hard task. Each fruitfly has a (current) supercomputer's worth of compute in it just to bumblefuck around
Finanzamt_Endgegner@reddit
Well tbf its not that much compute in reality, biological brains are fairly sparse no?
inddiepack@reddit
Denser than you believe.
reto-wyss@reddit
It is the densest.
mumblerit@reddit
so dense baby
ab2377@reddit
very!
Sevealin_@reddit
Yes
Xeoncross@reddit
If I have 64GB of memory (MacOS) and want a large context window (128k-256k) should I be using
27B Q4_K_M, 27B Q6_K, the 35B A3B Q4_K_M model, or a different configuration? I know running bigger models often gives better results, but sometimes the differences are negligible, and smaller models have much more usable speeds.
the3dwin@reddit
https://www.canirun.ai/
QuantumCatalyzt@reddit
What is your setup with models like this for agentic coding?
tecneeq@reddit
5090:
/home/kst/bin/llama-b8838/llama-server \
  --hf-repo unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL --alias Qwen3.6:27b \
  --no-mmap --host 0.0.0.0 --port 11337 --gpu-layers 99 --fit on --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --temperature 0.6 \
  --top-k 20 --top-p 0.95 --n-predict 32768 --ctx-size 196608
neur0__@reddit
Thank you 🐐
the3dwin@reddit
can one of you explain? I have no idea what I'm looking at in that long terminal command.
Sylvers@reddit
Any idea how it compares to Kimi 2.6?
the3dwin@reddit
When public online benchmarks don't have a good comparison, I use Gemini Deep Research to ask.
Also found the following but have not tested: https://github.com/sammy995/Local-LLM-Arena
RazsterOxzine@reddit
But can it tell what this spells: []D [] []V[] []D [] [][]
Sir-Draco@reddit
Obviously, after thousands of model launches, we know that real-world use is different from benchmarks... but holy shit, this is unbelievable! Just in time for all the companies clamping down on usage!
toothpastespiders@reddit
After reading through the comments I would argue that we, as a collective whole, have not realized that yet. Somehow.
Sir-Draco@reddit
I’m thinking there are quite a few folks who only ever look at the benchmarks and never run the models. If you have used the recent Qwen models you know that actual use matches the benchmarks pretty well but you don’t get as much “out of the box” freedom as they require some tweaking. I think many just use the models a couple of times expecting them to work like a cloud model with harnesses built for them and then are surprised when they do not function that way. My two cents as to where the “collective we” are at.
Healthy-Nebula-3603@reddit
Qwen models are as good as their benchmarks.
No_Weather8173@reddit
While they really are impressive, they have had a tendency to overclaim, best example of which is qwq-32b which at the time was touted to compete with DS V3, which turned out to be false
Healthy-Nebula-3603@reddit
qwq is an ancient model ... I remember that model was overthinking like crazy ;)
It's much better now with qwen 3.5 dense .... I haven't tested with 3.6 dense yet
Only 200 tokens for hello ;)
Weekly_Comfort240@reddit
OK. I hooked it up to Claude Code and let it rip on a text processing problem I had. I've run this same problem on Qwen 3.5 122B, Qwen 3.6 35B-A3B, and now finally Qwen 3.6 27B. It took over 40 minutes to process nine relatively smallish text files, 10-30KB each. Qwen 3.5 122B (MoE) took 30 minutes. I tried Qwen 3.5 397B but... I didn't have the 5-6 hours it would have taken to crunch this project.
Qwen 3.6 27B was the only model to give me a separate file documenting its discrepancy findings for each of the nine source files, and it did exceptionally well to boot. Qwen 3.6 35B-A3B is awesome for super fast code, but Qwen 3.6 27B seemed to have a deeper intellectual grasp of the actual problem. This is honestly a lot of fun.
the3dwin@reddit
Curious to know more about the tasks you ran, the files, and the prompt.
vulcan4d@reddit
Hope they do another 80B or 122B MoE. If you've got a bit of RAM, these are a great balance of speed vs performance.
the3dwin@reddit
https://www.canirun.ai/ may help
pixelpoet_nz@reddit
I so badly want something I can run with 2x 128 GB Strix Halo and have lots of context.
unjustifiably_angry@reddit
Am I a bitch for feeling a bit exhausted by how fast this stuff's moving? I just barely finish benchmarking and tuning my setup and then there's a new thing that makes the previous thing look like shit.
And that's on my PC. I have a Spark cluster that takes even longer to get going, I can run 122b models on my PC so the Sparks... I either need to buy two more so they can run truly huge shit or possibly just sell them, because my $8K GPU is often equally useful but much faster than my $8K worth of Sparks.
Diligent-Detective97@reddit
it's so slow on a 3090, fully in VRAM, with qwen3.6-27b@q3_k_xl using opencode
JsThiago5@reddit
How many t/s and PP do you get with a 3090?
m0lest@reddit
Thanks Qwen! This is the best open model I have seen so far in this "weight-class". The only model so far which actually works and does tool calling perfectly fine!
QuinsZouls@reddit
Running the model with vulkan backend and turboquant setup at 26t/s with 131k context windows using a RX 9070 16gb:
Tool calling seems fine like 90% success rate on cline
IrisColt@reddit
I just started feeding it my benchmarks. Its grasp of literary stylistic commentary is insane. It picks up on everything Gemma 4 does... and then a whole lot more.
toothpastespiders@reddit
Have you tested 3.5 27b on it too? I'm really curious how the two compare with each other on anything related to the humanities.
AloneSYD@reddit
3.5 has overthinking issues
BubrivKo@reddit
Okay, am I the only one who no longer believes in these benchmarks!?
Or is it just that local models don't work as well for me (I don't know how to use them properly), or that these benchmarks are heavily exaggerated?
The Qwen 3.6 MoE model is also theoretically very close to Opus. However, in practice, the responses I get from Qwen are significantly worse than those from Opus. Opus manages to understand me and get the job done with a single prompt, whereas with Qwen, I often have to further clarify what I want to happen, or it simply fails to provide an accurate answer, especially if it's something it hasn't been well-trained on.
toothpastespiders@reddit
I always suggest people make their own benchmarks, based on their own real world needs, and test models against that. I'm willing to bet that anyone who does so will get disillusioned about the worth of the big well known benchmarks pretty quickly. Real world problems are messy with tons of uncontrolled variables that won't have one to one matches in a LLM's training data. Meaning they have more need of, for lack of a better term, intelligence.
Teetota@reddit
Yes, size matters here. Opus has more world knowledge and needs much less guidance. With detailed and precise spec qwen might get close, but such spec is 80% of the job. So benchmarks will be deceiving - they are based on a known set of topics which can be emphasized in the training data. But try to code in a specialised domain like statistics or bioinformatics with a short prompt - qwen will fail and opus will nail.
laterbreh@reddit
Benchmarks can be indicative thats for sure, but there comes a point where youre being gaslit so hard and everyone is falling for it (not you just generally speaking).
Guys, it's a 27B dense model that's scoring the same or better on a repeatable benchmark than a model more than 10x its size, in the SAME generation? C'mon guys, use your heads: apply the 27B against the 397B in serious production tasks, in dynamic environments that require contextual reasoning. The model with 10x the parameters will be innately more intelligent in real-world applications, especially within the same generation.
odikee@reddit
That's true. Single-GPU models are fun to play with until real work needs to be done. Or at least it ends up with Opus researching, developing, and writing a detailed how-to for the stupid local model on how to run those things.
Ok_Mammoth589@reddit
It's a combination of multiple things. These benches are for fp16 weights with an fp16 KV cache, which no one runs on a 3090. They're benchmaxxed to hell. And they're using harnesses specifically designed for successfully completing these bench runs, which no one has access to.
So yeah, you're not crazy
RickyRickC137@reddit
GGUF re-upload when?
Adventurous-Paper566@reddit
Unsloth are available.
mintybadgerme@reddit
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
Blues520@reddit
Based
blazze@reddit
Gemma 4 is getting brutalized by a lot of these releases; it has to improve.
lookitsthesun@reddit
Absolutely not lol. Qwen models are cool but the epitome of benchmark maxxed. And people fall for it every single time
seppe0815@reddit
it is sota ! stop crap talk
surrealerthansurreal@reddit
Anyone tested the ‘preserve thinking’ concept or know how it works technically? I’m trying to understand if it’s K-V caching or actually holding intermediate thinking in context between requests
tarruda@reddit
Before, whenever I ran 3.5 in a coding agent, it would do a task, and when I sent a follow-up message I'd see a lot of the prompt being re-processed due to the deleted thinking
Now the experience is much better with llama.cpp since the caching makes follow up responses start quickly.
Caffdy@reddit
how do I enable the preserve thinking option on llama.cpp?
tarruda@reddit
Here's the full script I'm using:
Caffdy@reddit
does -ngram-mode make any difference?
tarruda@reddit
I haven't tested with 27b yet, but for 35b it makes a lot of difference when the model is repeating things in the context (such as when editing files and outputting the fully modified version)
dry3ss@reddit
If you don't activate it, every time you send a request, the old thinking from the LLM's last response is discarded before answering your new query. Since the last message changed, the KV checkpoint is invalidated and llama.cpp will re-parse all the messages so far (all stripped of their thinking), so you will have a delay before it starts processing the actual new tokens of your request.
With preserve, the thinking is not discarded, so it stays in context, checkpoints work, and there's no delay, but the thinking will eat up some context (however, by discarding it, the model sometimes has to re-think almost the same thing each time, so keeping it is not necessarily a bad thing).
Before that, as a user of pi, I can tell you that the symptom I saw was this: when I prompted it, it would process, start writing, and all the intermediate tool calls would be lightning fast (no matter if much bigger than any small message I sent it) until the task was done, but as soon as I sent a new message there would be a long delay, and I could see in the llama.cpp log that it was re-parsing the entire context.
So I highly recommend you use that configuration, especially if your PP speed is not so great (mine is 100 t/s with 3.6 q8); for agentic use you will see an enormous difference in total wall-clock time
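A toy illustration of the mechanism described above (the "token" sequences are hypothetical; llama.cpp's prefix cache works on real token IDs, but the principle is the same: only the shared leading span of the context can be reused):

```python
def common_prefix_len(a, b):
    """Number of leading tokens two sequences share; a llama.cpp-style
    prefix cache can only reuse KV entries for this span."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical "token" sequences for two consecutive agent requests.
# Without preserve-thinking, the previous reply's <think> block is stripped,
# so the new prompt diverges right after the first user turn:
with_think    = ["sys", "user1", "<think>", "plan", "</think>", "reply1", "user2"]
without_think = ["sys", "user1", "reply1", "user2"]

print(common_prefix_len(with_think, with_think[:-1]))  # 6 -> whole history reused
print(common_prefix_len(with_think, without_think))    # 2 -> cache invalidated early
```

The longer the agent session, the bigger the span that gets needlessly re-processed in the stripped case, which is exactly the delay described above.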
techzexplore@reddit
This one is really close to an Opus 4.5-level model. On agentic coding specifically, 27B beats Qwen 397B across every major benchmark they shared: SWE-Bench Verified goes from 76.2 to 77.2, Terminal-Bench 2.0 jumps from 52.5 to 59.3, and SkillsBench nearly doubles from 30.0 to 48.2. Here's how Qwen 3.6 27B compares to Qwen 3.6 35B-A3B and which one to choose.
autisticit@reddit
Slop
TwoPlyDreams@reddit
No preserve thinking?
aalluubbaa@reddit
Openclaw completely locally with just an rtx 5090????
AvidCyclist250@reddit
Well, I'm already doing Hermes locally with the qwen 3.6 MoE. It's fine. With a 4080 and 64 GB RAM
bilinenuzayli@reddit
What's the TPS
AvidCyclist250@reddit
60
cosmicnag@reddit
Definitely doable now
Fast_Paper_6097@reddit
My poor AI - she’s had a new brain every 2 weeks for the past year
_-_David@reddit
Poor thing. And mine has had 11 new SOTA-in-some-way tts voices in the same time haha
Oh god, that reminds me that I just heard of KokoClone this morning.. Now to have codex 5.4 set up a qwen3.6-27b agent that will go research and install it... Unless Spud drops before I finish browsing.
Caffdy@reddit
in your experience, which one is the best one (tts) or the ones who are worth the trouble trying?
_-_David@reddit
I am currently using faster-qwen3-tts because some absolute legend made it faster and support streaming. But, it does absolutely spike my GPU utilization.
Pocket-TTS has been great. No real complaints there. It just works.
LuxTTS came out like 3 days later and was basically the same thing with more language support if memory serves..
Omnivoice is the new hotness, but I've only tried it for like 10 minutes and thought, "Oh, this is nice" because I don't have any problems that I need to fix when it comes to TTS. With those other options I'm already at faster than real-time generation, streaming support, voice cloning, etcetera.
And KokoClone is the latest thing to catch my attention because Kokoro is UNBELIEVABLY FAST, and clean and pleasant. It isn't emotive, and you can't clone voices. So if KokoClone can do voice cloning well, it instantly becomes the best (non-emotive) tts available just by virtue of the fact that you could run it at like 50x real time on a crap CPU.
If you have any questions about specific scenarios, like maybe you want something expressive but don't need voice cloning, feel free to ask. I have tried at least a dozen tts models recently, and each shines in its own way.
Caffdy@reddit
that would be a good start, yeah
Besides those, I know that not many models offer voice cloning, but, of those who do, which one do you really think is the best currently (before going the paid rute, aka 11Labs)?
_-_David@reddit
Actually, you'd be blown away by how many offer good quality voice cloning these days. I've got to say that omnivoice was very clean and a good quality clone, as well as being quite fast. What are you aiming to use the tts for? I have different standards for a voice assistant versus the program I have narrate books for me.
Caffdy@reddit
which one would you recommend for each case? what about rp
_-_David@reddit
Try omnivoice. It seems to be the gold standard currently. If that doesn't do what you need it to, we can go from there.
kmp11@reddit
top notch from team Qwen... Making a model that fits perfectly in consumer hardware instead of weird sizes that are unusable for most.
AVijha@reddit
Is anyone using it on a MacBook? Can someone tell me how much RAM you'd need to run this with a 100k context length at 4-bit precision without any offloading?
MrPecunius@reddit
What would you offload to on a Mac? Disk?
I had a binned M4 Pro/48GB MBP and I ran 3.5 27b @ 8-bit MLX with 100k+ context just fine. Not fast, but fine. My current M5 Pro/64GB is obviously a step up, especially with prefill.
swoonz101@reddit
What do you mean by not fast? I’ve got an M4 Max with 128 GB of RAM, how many tokens/s can I expect realistically?
MrPecunius@reddit
~8-9 t/s with Qwen3.5 27b 8-bit MLX = not fast
M4 Max should be somewhat less than double that. With 128GB, I'd run something like Qwen3.5 122b a10b or unquantized(!) Qwen3.6 35b a3b.
No_Weather8173@reddit
Have you tried using some of the new speculative decoding methods like DFlash or even DTree? Depending on what you do, coding presumably, they could really speed things up a lot according to preliminary benchmarks.
MrPecunius@reddit
It's a mixed bag with the M5 and new models right now. MLX can yield a prefill speed improvement of more than 3X, but e.g. Qwen 3.6 and some other new hotness doesn't work right (or at all) yet.
My dream is Qwen 3.6 27b 8-bit MLX with working MTP and prefill boost, which should give something over 20t/s generation and several hundred t/s prefill--i.e. 2-3X+ what my M4 Pro was doing with 3.5 27b.
SolitaryShark@reddit
how many tps do you get with the m5 pro?
MrPecunius@reddit
Your best source of information is here:
https://omlx.ai/compare
FunConversation7257@reddit
I was considering getting a 48 gig model to run these new models, any specific reason you decided to jump for the 64 gigs?
MrPecunius@reddit
8-bit models in the 30-35b range are context constrained with 48GB--I basically ran out of RAM before I ran out of CPU. Now I just run whatever I want and don't bother to conserve RAM by closing other programs, etc.
AuroraFireflash@reddit
27B * 0.5 = 13.5 GB, figure another 4-8GB for context, way too limited on a 16GB option. Passable on a 24GB MBP, just fine on 32GB.
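The arithmetic behind that estimate, as a rough sketch (real GGUF quants mix bit widths per tensor, so actual file sizes differ a bit, and context/KV cache comes on top):

```python
def weight_gb(params_b, bits_per_weight):
    """Approximate weight size in GB: billions of params * bytes per param.
    Ignores embeddings, KV cache, and per-tensor overhead in real GGUF files."""
    return params_b * bits_per_weight / 8

print(weight_gb(27, 4))  # 13.5 -> the "27B * 0.5" estimate above
print(weight_gb(27, 8))  # 27.0 for an 8-bit quant
```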
KyrosDesu@reddit
Already? I was just starting to get used to the 35B MoE. I hope this dense one is faster than the previous one.
Darkoplax@reddit
now we wait for the 9b for the poors
SweetBluejay@reddit
If you have 8GB VRAM, the 35B is a better choice than the 9B.
Darkoplax@reddit
i have 6gb vram
SweetBluejay@reddit
I suggest you upgrade your RAM to 32GB. Then, as long as you ensure the 35B Q4 model can fit in RAM, it should be usable. Although RAM is more expensive than before, it's still much cheaper than a graphics card.
Darkoplax@reddit
I have 32GB RAM already. I just thought, from reading every thread here and there, that we should never let the LLM spill into RAM and keep it all in VRAM?
SweetBluejay@reddit
You don't need to put all the weight in VRAM, because that would require too much VRAM capacity. As long as you ensure that the 3B activation parameters can fit completely into VRAM, you can guarantee a relatively decent speed.
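A back-of-envelope for why partial offload stays usable with MoE: decode on a memory-bound system is roughly bandwidth divided by the bytes of *active* weights streamed per token. The numbers below are illustrative assumptions, not measurements:

```python
def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Rough decode-speed ceiling for a memory-bound setup: every generated
    token streams all active weights through memory once, so
    t/s is approximately bandwidth / active-weight bytes."""
    active_gb = active_params_b * bytes_per_param  # GB touched per token
    return bandwidth_gb_s / active_gb

# Illustrative assumptions: 3B active params at ~Q4 (0.5 bytes/param),
# served from dual-channel DDR4-3200 at roughly 50 GB/s.
print(round(tokens_per_sec(3, 0.5, 50), 1))  # ~33 t/s ceiling; real speeds are lower
```

This is why a 3B-active MoE can stay interactive from system RAM while a 27B dense model at the same bandwidth would be roughly 9x slower.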
Whole-Impression-709@reddit
What is the memory footprint on that?
TheArchivist314@reddit
Cool can it run on 12gb of vram ?
kiwibonga@reddit
1000 t/s PP and 35 t/s TG in ik_llama.cpp with 2x RTX5060Ti 16GB (32 GB total), graph split. Using 140k context at q8_0. It's a pretty good number if the compaction triggers at 128k
It's about exactly 3x slower than 35B A3B (~3000 t/s PP and 100 t/s TG)
But quality is better so far. The difference is easy to feel, it's more or less the same as the difference between the two 3.5 models of the same size.
I need a second 32GB AI box so I can run both.
laterbreh@reddit
Are we benchmaxxing yet dad?
27b meets or beats its 397b sibling? Yep. Believable.
IrisColt@reddit
mother of God...
laterbreh@reddit
For those of you thinking you're matching the 397B version on these benchmarks with a 27B dense model, you're smoking crack.
Tried it on 5 tasks; the 397B smokes it in real-world agentic code in real codebases.
keyehi@reddit
how low are they planning to go? 4b?
Technical_Split_6315@reddit
No ways its opus 4 level
logic_prevails@reddit
Jesus Christ it’s jason bourne
Gold-Debt-5957@reddit
Does it run on a 64 GB Mac M1 Max? I'm new :v
ernexbcn@reddit
Yes, but very slow.
AlecTorres@reddit
Good enough for me :v
ricardofiorani@reddit
Nice
PinkySwearNotABot@reddit
Fr can we all just take a moment and say thank you China? While America is trying to fuck the world over left and right, China seems to be the only thing that’s giving us any sort of leverage at all.
Now imagine if they let us have their EVs In America, too.
PinkySwearNotABot@reddit
So according to this — no reason to keep 35B?
jimmytoan@reddit
Qwen keeps shipping at a pace that's genuinely hard to track. The 27B size is particularly interesting because it sits in the sweet spot for consumer GPU deployments - fits comfortably in 24GB VRAM at Q4 with decent throughput. Curious how the instruction following and context handling compares to Qwen2.5-32B which was already a strong performer at the same tier. Does anyone know if this uses the same MoE architecture as the larger variants or is it a dense model?
Adventurous-Paper566@reddit
It's a dense model.
Djagatahel@reddit
You're replying to a bot
TheOriginalOnee@reddit
14B please, I need something that fits into my 16GB peasant VRAM
zsydeepsky@reddit
since Qwen team claimed that 3.6-27B beats 3.5-397B-A17B in every benchmark, and compared to where 3.6-35B-A3B currently stands...
guys, we are literally having a Claude Sonnet 4.6, running locally.
Super_Sierra@reddit
No we don't, this shit is pure sloponium.
Hungry_Audience_4901@reddit
Literally cumming right now
SnooPaintings8639@reddit
Is it that good at ERP too?
Super_Sierra@reddit
Qwen always sucked for that because they overfit the models to shit.
Glum-Atmosphere9248@reddit
Finally no AI comment
Caffdy@reddit
chat are we ballin' yet?
Dany0@reddit
No mistakes!
More-Curious816@reddit
Do it in Ralph loop mode.
Non-Technical@reddit
Wonderful! That’s the sweet spot for dense models on my machine.
mxforest@reddit
Which machine is that?
Non-Technical@reddit
Mac Studio M4 Max with only 36GB ram.
mxforest@reddit
Yeah that should be enough. I have the 128GB M4 Max so I much prefer 122B. Roughly same numbers as 27B dense but much much faster.
Non-Technical@reddit
I like to do story telling with the models so situational awareness and context are very important (like keeping track of who is in the room) so I have been using Q6 models which I *think* handle that better.
mxforest@reddit
You can also try asking the model to maintain notes on the current state of everybody and update it after every interaction.
Non-Technical@reddit
That's a good idea to try. Being stateless, it only knows what it can get from the system prompt or the past visible chat log. Outputting notes would help it keep track at the cost of feeling less immersive and taking up space in an already small context window. I read that Silly Tavern has a RAG database to maintain small details invisibly but haven't tried it yet.
LycanWolfe@reddit
Qwen3.6-27B-DFlash When?
caetydid@reddit
hoooly shite!
Southern_Sun_2106@reddit
Unsloth posted both ggufs, wwufs, and mlx
caetydid@reddit
found them already... iq4_xs is running at ~25 t/s on my rtx3090
Diligent-Detective97@reddit
What are the best settings for a 3090 and 64gb ram?
tecneeq@reddit
llama-server \
  --hf-repo unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL --alias Qwen3.6:27b \
  --no-mmap --host 0.0.0.0 --port 11337 --gpu-layers 99 --fit on --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --temperature 0.6 \
  --top-k 20 --top-p 0.95 --n-predict 32768 --ctx-size 65536
You probably can get away with a bit more context, like maybe 98304.
Diligent-Detective97@reddit
thx :)
stkt_bf@reddit
I know the constraints are tight, but is it possible to switch over from Sonnet 4.6? Should I just go out and buy an RTX 6000 300W?
Cesar55142@reddit
i really wanna see the LocalLLaMA opinion on user benchmarks for this. If it holds the reported benchmarks this will be a fun month.
Wise-Chain2427@reddit
3.6 35B was already received positively; the 27B dense should be even more so
Caffdy@reddit
more winning more better
Constandinoskalifo@reddit
+1
Healthy-Nebula-3603@reddit
Probably very good as 3.5 dense in coding was great ..much better than Gemma 4.
Haeppchen2010@reddit
Oh I was just tweaking my 3.6 MoE settings and enjoying the speed….. 🫣
Spirited_Maybe7374@reddit
How does this compare to the new Qwen3.6-35b-a3b? I also don't really understand the difference between qwen3.6-35b and qwen3.6-35b-a3b. Would be nice if someone can explain the difference between all 3
tecneeq@reddit
qwen3.6-35b and qwen3.6-35b-a3b mean the same model: the Qwen 3.6 35B, a mixture-of-experts model with 3B active parameters. Every input goes through about 3B parameters.
Qwen 3.6 27b is a dense model; every input goes through all 27B parameters. The result is better if all else is equal, but speed is slower. You can get more context with 27b because the weights are a bit smaller.
I get 50 t/s with 27b, 170 t/s with 35b-a3b on a RTX 5090.
VoiceApprehensive893@reddit
why are we comparing a 16 gigabyte vram card model to an expensive ass frontier model from <half a year ago
tecneeq@reddit
Because it's as clever. The only difference is it doesn't have as many facts and thus hallucinates more. Easily solved with web searches.
redpandafire@reddit
Which model for 16GB?
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/tree/main
tecneeq@reddit
Depends on the context. Q3_K_S with 32k context and q8_0 for the K/V cache should fit.
Enki_40@reddit
Would a 9b (if they release weights in this range) be useful for speculative decode to speed up 27b? Or is there some architectural reason in 27b design where speculative decode isn't a thing?
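There's no obvious architectural blocker: generic speculative decoding only needs a smaller draft model with the same tokenizer. A toy sketch of the draft-and-verify loop (greedy version with made-up integer "models"; real implementations compare full probability distributions rather than argmax):

```python
def speculative_step(draft, verify, prefix, k=4):
    """One round of draft-and-verify speculative decoding (greedy sketch).
    `draft` and `verify` are next-token functions for the small and large
    model; real implementations compare full distributions, not argmax."""
    # Small model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Large model checks the proposals in one (conceptually batched) pass.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        big = verify(ctx)
        if big == t:
            accepted.append(t)    # agreement: token comes almost for free
            ctx.append(t)
        else:
            accepted.append(big)  # first disagreement: keep big model's token
            break
    return accepted

# Toy "models" over integer tokens: draft counts up, verifier agrees until 3.
draft  = lambda ctx: ctx[-1] + 1
verify = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0
print(speculative_step(draft, verify, [1]))  # [2, 3, 0]
```

The speedup depends entirely on the acceptance rate, so a hypothetical 9B sibling trained on the same data would likely be a good draft for the 27B, while methods like DFlash/DTree bake the drafting into the model itself.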
bcell4u@reddit
Qwen 3.6 4b for the vram poors please 🥺
LH-Tech_AI@reddit
YEAH!
viperx7@reddit
Our prayers have been answered
IBM296@reddit
How is this better than Qwen 3.6 35B??
corpo_monkey@reddit
Much better. Read the thread or google moe vs dense.
spaceman_@reddit
Can't wait for 122B soon...
Orolol@reddit
So the real question here is: why does Qwen seem to have a hard time scaling up? Qwen 3.6 27B and 35B are amazing, punching way above their weight (hehe), but Qwen 3.6 Max, supposedly the beefiest model of the family, the top-class frontier model, barely gets into SOTA territory.
NikolaTesla13@reddit
My guess would be GPUs, you can do a lot more training on a 27b model than a 1T one. Also more experiments, that can help a lot.
tmactmactmactmac@reddit
this is an arms race and I'm here for it
Blanketsniffer@reddit
Okay, now this will really start to undercut Western SOTA models. Even enterprises will deploy this model in their harnesses where it fits. Not to mention smaller models such as GPT mini, Haiku, and Flash will go kaput (they'll only be used by their own SOTA models in the harness).
MLExpert000@reddit
We are just testing it in Inferx. net. If anyone wants try it’s available. Chek it out.
pixelpoet_nz@reddit
Sorry, I don't take recommendations from spammers (post history) who struggle to write 5-letter words like "check". Thanks though.
MLExpert000@reddit
English is my third language, so yeah . typo on me. Just sharing since we’re testing it. Appreciate the patience.
StateSame5557@reddit
I made a mxfp4 in text mode that would fit on a smaller Mac
The model is really good even at mxfp4
https://huggingface.co/nightmedia/Qwen3.6-27B-mxfp4-Text-mlx
DOAMOD@reddit
Very first test: much better, less thinking and fewer syntax errors than 3.6-A3B
Informal-Victory8655@reddit
Can I use it in place of OpenAI GPT-OSS 20B?
_-_David@reddit
The 27b will require significantly more VRAM at the same 4bit quantization. I would suspect that at 3bit this model would still probably beat OSS 20b in many ways, but be slower. So your answer depends on the answer to the question "What place?"
Kodrackyas@reddit
Holy shit, are we deadass? That is a fucking frontier-model level of intelligence
meca23@reddit
122 B next please!
Mashic@reddit
Let's have a competition between gemma and qwen every month: gemma 4.1 > qwen 3.7 > gemma 4.2 > qwen 3.8 ...
SweetSeagul@reddit
Couldn't help but imagine this.
BasicBelch@reddit
So then why does Qwen3.6 35B exist if a smaller model beats it in every use case?
misha1350@reddit
Because dense models are for dGPUs, whereas MoE models are for mini-PCs, Macbooks and laptops without a dGPU. Qwen3.6 35B A3B runs at 10 tokens/sec on DDR4-3200 on my laptop, whereas Qwen3.6 27B would be unusable.
Hardly anyone has a dGPU with 20GB of VRAM or more to be able to run Qwen3.6 27B on a good level.
kweglinski@reddit
35b is a moe which does laps around 27 dense speed wise while performance wise is not as significant difference (as speed)
Adventurous-Paper566@reddit
35B MoE works with a 16Gb GPU.
charmander_cha@reddit
We urgently need the 9B version
aparamonov@reddit
I was debating this for a while: considering quantization KLD, is it better to run 27B Q4 with Q8 KV, or the 35B MoE at Q8 with f16 cache? Same speed for both on my setup with 20GB VRAM and 64GB RAM. What do you think, does Q8 vs Q4 outweigh the 27B outperforming the MoE?
Beginning-Window-115@reddit
just dont mess with the kv cache
cosmicnag@reddit
Holy fuk
Healthy-Nebula-3603@reddit
Pfff who need sleep ....
Status_Contest39@reddit
wordless... it is crazily amazing
SnooPaintings8639@reddit
I can't wait to get back home and set it up. So hyped. If it's better than the 35B version, it's guaranteed to be awesome!
qwen_next_gguf_when@reddit
Need to switch to this new 27b in all my flows.
Comfortable-Rock-498@reddit
I love this development and can't wait to try the model, but the terminal bench scores are 'non standard'
> * Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs.
Terminal-Bench 2.0 rules explicitly disallow modifying timeouts or the resources available. Each Terminal-Bench task has a timeout (usually under 1h, mostly under 30 mins) and resources configured in the Docker container by the task creator, and they are chosen that way to test specific model aspects.
XE004@reddit
Me want qwen 3.6 VL 4b 8q uncensored.
_-_David@reddit
Let's. Fuckin. GO
robbievega@reddit
after the shit show that is Opus 4.7, I can't wait to see if it really codes as well as these benchmarks promise
odikee@reddit
How could you be so naive?
Adventurous-Gold6413@reddit
YURRRRR
Crafty_Top_9366@reddit
How is it better than the 35B version?
Look_0ver_There@reddit
Dense vs MoE
27B will be much slower though ('cos it's dense)
Crafty_Top_9366@reddit
Phi
Healthy-Nebula-3603@reddit
Wait ...WHAT!!!
acetaminophenpt@reddit
Thanks!!!
random-trader@reddit
Not good at all. Almost useless. I gave it a try many times. Each time it failed to do anything.
silenceimpaired@reddit
Agentic use seems a strong focus.
ab2377@reddit
because the world's software is being built by agents controlled by humans, at speeds unimaginable just a few months ago.
Barry_Jumps@reddit
Are you tired of winning yet?
FullOf_Bad_Ideas@reddit
that's impressive
I think I'll switch over from Qwen 3.5 397B to this, it would be awesome if we get DFLash for it soon (or 3.5 version works fine)
deRTIST@reddit
qwen gods, i'm praying for 14b, make it happen :)
Technical-Earth-3254@reddit
This looks impressive. I wonder if it can now deliver the same real world performance as 4.5 Haiku or GPT 5 mini. Probably not, but we will get there.
DOAMOD@reddit
The king
Opteron67@reddit
holly s..
FunkySaucers@reddit
Holy s**
2muchnet42day@reddit
Whole-y s***
iwannaforever@reddit
The king is here
NoConcert8847@reddit (OP)
Seems to be better than Opus 4.5 😭
2Norn@reddit
benchmaxxing is a real thing so i wouldn't take it too seriously
let's just use it and decide
Dany0@reddit
I can confirm 3.6 35b was definitely benchmaxxed at least a little. Still a good model though
YogurtExternal7923@reddit
Alr why the F is it going toe to toe with opus? There's no way..
axiomatix@reddit
Inflection point #2.. first one was opus 4.5 back in November. If the gap to the moe model is the same in the 3.6 series as it was in 3.5, given how good the 3.6 moe is in a harness.. this is a blessing.
fragment_me@reddit
Someone uploaded a Q6_K GGUF: https://huggingface.co/sm54/Qwen3.6-27B-Q6_K-GGUF/tree/main
Yasuuuya@reddit
Absolutely insane - if it's anything like 3.6 35B was.. then we're entering a new era here.
ambient_temp_xeno@reddit
"we're pleased to share the first open-weight variant of Qwen3.6."
gaslighting but we'll forgive this time.
Borkato@reddit
Holy fucking shit YESSS
__some__guy@reddit
Just as I thought we wouldn't get a 27B 3.6.
Let's see if thinking is actually usable on that one.
Mancho_United@reddit
I, for one, welcome our new overlord
vogelvogelvogelvogel@reddit
woohoooo
Mountain_Chicken7644@reddit
And right when I got to work too...
gamblingapocalypse@reddit
Well… that’s quite the headline to wake up to.
densewave@reddit
Lets gooo
power97992@reddit
So they won't release the weights of 3.6 397B A17B?
Comacdo@reddit
Holy fucking shit
iMrParker@reddit
They never miss
nunodonato@reddit
Interesting that they still don't recommend the qwen3 xml parser
Voxandr@reddit
So do we have hope for 122B?
fragment_me@reddit
Lord have mercy!
ridablellama@reddit
holy crap no way
fishhf@reddit
GGUF when?
Blues520@reddit
That was quick!