The Qwen 3.6 35B A3B hype is real!!!
Posted by The_Paradoxy@reddit | LocalLLaMA | 134 comments
My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code I write for my own academic research. My research is on some pretty niche topics, and I doubt anything like it is substantively present in LLM training sets. A few months ago, small local models' ability to understand my code was nominal at best, with Devstral Small 2 being the top performer. However, several small open-weight models now have methods of accommodating fairly long contexts (gated delta net, hybrid Mamba2, sliding window attention), which makes them dramatically smarter in practice. I can now feed a model an entire academic paper along with the accompanying code and ask it to use the paper to work out what the code is doing.
I just spent a couple days experimenting with:
- Qwen 3.6 35B A3B
- Qwen 3.6 27B
- Gemma 4 26B A4B
- Nemotron 3 Nano
All of them were able to comprehend my code significantly better than any small local model could a few months ago. I did try Devstral Small 2 since I recently went from a single 16GB graphics card to two; however, I simply couldn't fit the long context in 32 GB of VRAM. I hope Mistral releases a new small model with a gated delta net, because I think it could take the throne.
These are my detailed findings from asking local models to explain how my code maps to the research paper it corresponds to.
TLDR: All four models listed above are incredibly capable local models, with Qwen 3.6 35B A3B standing out as the best. I'm also inclined to think that an intelligent human with any of these four models is more capable than something like Opus 4.7 on its own (see the detailed findings).
Please let me know your thoughts!
ai-christianson@reddit
27b thicc is smart but you have to make sure to get the temp/sampling params right and don't quant your kv or model too low.
c_glib@reddit
What do you consider "too low" of a quant for the model as well as the KV cache? Do advanced techniques like Turboquant change things substantially?
atumblingdandelion@reddit
This is great, thanks for sharing your experiment. Also an academic (I guess mid-career now, lol). My interest in local models is to minimize my environmental footprint from using AI. I've been experimenting with local models and agree that Qwen 3.6 35B A3B is good. I also get good results from Gemma 4 26B. Qwen's and Gemma's denser models (27B and 31B, respectively) are a bit too slow for my machine (M4 Pro, 48 GB). Not super slow, but the MoEs are so fast and reliable enough. My conclusion from experimenting with AI is that the LLMs don't matter as much as people think they do. The environment around them (aka the harness) matters much more. Hence, my efforts now go into optimizing the harness for my research purpose/domain. I've had good results using Pi Coding Agent and Continue.dev. However, I'm now experimenting with Hermes Agent (as a coding agent on my laptop, not a 24/7 assistant on virtual machines), and am amazed by how well its self-learning ability works. By the end of a session, it typically adds a new skill focused on my domain! I wish a new model at ~15B comes along (Qwen 3.6?)
The_Paradoxy@reddit (OP)
What subject/department do you teach or do research in? OpenCode and Hermes are the first harnesses I'm going to try.
I wish universities were talking more about training models for a teaching context. Right now, when a student asks an LLM something, they get the full answer without having to think. I would love a model that would help students think through questions without giving them the answers. My dream is to have a Hermes agent with its own email account and access to the course notes, so that students can email it and it can tell them where to look in the notes; if they have follow-up questions, it can help them think through the material instead of simply giving them the answer.
atumblingdandelion@reddit
Climate sciences. Unfortunately, no teaching anymore, just research (I love doing both). I think you can put heavy constraints on a model in the system instructions to encourage thinking rather than blurting out answers directly. If that works, you can build a local AI agent that would do this.
The_Paradoxy@reddit (OP)
Thanks I appreciate the suggestion. What kind of code do you have your agents working on? What's motivating the switch to Hermes?
atumblingdandelion@reddit
Mainly Python-based analysis of satellite data. I'd like Hermes to learn my workflow, my preferred coding style, the functions and packages I prefer to use, the project file systems I have, etc. BTW, I didn't understand your later comment about Devstral Small 2. It takes more RAM than the 35B and the 27B?
Also, I've never tried Nemotron 3 Nano (or Devstral Small 2). I'll give them a try. I liked your benchmark.
The_Paradoxy@reddit (OP)
The model itself takes less memory, but the task I was testing with has a very large context: 120k tokens just on the initial prompt, growing to 180k with follow-up prompts. Devstral Small 2's architecture is such that the amount of memory the KV cache uses explodes as context grows, ultimately making it take up more memory even though the model itself takes up less. Qwen 3.6, Gemma 4, and Nemotron 3 Nano all have mechanisms in their architecture (gated delta net, sliding window attention, and Mamba2, respectively) that keep the KV cache's memory from growing at such a fast rate.
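(For the curious, a rough back-of-the-envelope sketch of why a full-attention KV cache explodes. The layer/head shapes below are made-up placeholders, not the real config of any of these models.)
```
# KV cache bytes ~= 2 (K and V) * layers * kv_heads * head_dim * context * bytes per element.
# Placeholder shapes -- check the actual model config for the real numbers.
layers=40 kv_heads=8 head_dim=128 ctx=180000 bpe=2   # bpe=2 for an fp16 cache
echo "$(( 2 * layers * kv_heads * head_dim * ctx * bpe / 1024 / 1024 )) MiB"
# ~28000 MiB at 180k context -- q8_0 halves it, and it still dwarfs a 16 GB card.
```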
I think this may connect with your reason for moving to Hermes. I'd really like an LLM to program in my style using the methods I've already developed. My plan has been to have the model hold a lot of the code I've already written in its context as a way to get it to generate code in my style. But going with Hermes may be the better option. I wonder if there's a way to integrate Hermes's automatic skill learning into OpenCode? I'd be really interested to hear how Hermes works out for you.
cmndr_spanky@reddit
I dare you to try it as part of a real coding agent harness. You didn't even pick any params like temperature or repeat penalty. Sometimes Qwen 3.6 utterly shits the bed in tool calling unless you use very specific settings.
The_Paradoxy@reddit (OP)
That's my next step, I'll post how it goes.
cmndr_spanky@reddit
Good luck :)
SkyFeistyLlama8@reddit
Maybe I'm crazy, but I run Gemma 26B in thinking mode for quick code fixes and chats, and Qwen 35B in thinking mode for longer contexts and refactoring. Qwen 35B rambles on and on before it spits out the final output, so I only use it for tasks I don't mind waiting for.
It's only 20 GB for Qwen 35B and 15 GB for Gemma 26B at q4 so I can keep both models loaded in RAM simultaneously.
rayzinnz@reddit
How much context are you squeezing into those RAM limits?
Choubix@reddit
Have you tried Qwen3 Coder Next vs Qwen 3.6 35B for coding? Wondering which is more capable for Claude Code on a local setup. Thanks!
SkyFeistyLlama8@reddit
Qwen 3 Coder Next is a lot smarter for long contexts and doesn't think for so long. The problem is that it's a lot bigger at 80B.
Choubix@reddit
Thanks! I have maxed out the RAM on a new config, so I should be able to handle it. I will try the 80B when I am done with my setup (still installing UTM, Ubuntu, etc.). OMLX with memory paging to SSD should help if you are a bit short on VRAM 🙂👍
SkyFeistyLlama8@reddit
How much RAM do you have? I've got 64 GB and a q4 quant of Qwen Next 80B takes up around 50 GB RAM, so I don't have much left over.
Choubix@reddit
I got 128 GB on a Mac M5 Max (I did the unreasonable thing... 😉). I want to use Claude Code locally, and this thing needs to send 16k tokens as a system prompt... The new neural tensors should help.
Check if OMLX (or Jang?) can help since you have 64 GB? Jang looks promising.
my_name_isnt_clever@reddit
Do you use Qwen in a harness with tool calling? From what I've heard they're a lot more reasonable thinkers when given tools. I've used it exclusively in agentic harnesses (Hermes and Pi) and have never noticed overthinking issues.
AcanthocephalaNo3398@reddit
This is it! I have created a custom harness and it thrives when given a decent system prompt.
gpalmorejr@reddit
Qwen3.6-35B is super sensitive to inference settings, so that could be an issue. They have recommended settings in the card. But even then, it does tend to use some serious thinking tokens to come to an answer. That said, I have found its thinking process is usually very good at sussing out its confidence in an answer and figuring things out. I've had mine say some rather self-aware things in its thinking loop while trying to figure out a course of action, until it decided it was not internally capable of an accurate answer and instead deferred to an online calculator designed for the purpose. I was impressed.
WardyJP@reddit
Thanks for sharing this. Can you tell me what GPUs you are using and whether pairing them up works well? I have an RTX 5070 with 12GB VRAM and am wondering what would work as a second GPU. Running Ubuntu.
catalini82@reddit
I would use a 5060 Ti 16GB as the second GPU. It keeps the same Blackwell arch; true, it is slower, but its 16GB of memory is very good price/capacity-wise. I'm lucky enough to have a 4090 myself, and I experimented for a period with a 5060 Ti 16GB as a second GPU. There were small issues here and there, since the Blackwell 5060 Ti can make use of features the Ada Lovelace RTX 4090 can't, such as NVFP4, but no major issues overall. It also depends on what you want to do with your projects and on your overall current setup (PSU, motherboard, PC case, ...), but generally the 5060 Ti 16GB is low-maintenance and low-requirements.
json-bourne7@reddit
Hey! quick question on your dual setups that you experimented on.
When you added the second GPU (4090 + 5060 Ti), did you actually see any slowdown in prefill or decode compared to running on the main 4090 GPU alone, or was it mostly the same? I’m on an RTX 5080 running Qwen 3.6 35B A3B with about 10GB in VRAM and 6GB spilling to CUDA host memory, and surprisingly getting solid performance with ‘--fit on’ flag: around 3000 toks/s prefill and 64 toks/s decode on a fresh context, dropping to 1300 toks/s prefill and 40 toks/s decode near 256k context limit (KV cache in q8_0). I’m thinking of adding a second GPU mainly for VRAM, probably a 5060 Ti 16GB, but it would be on a PCIe 4.0 x4 chipset slot, not CPU lanes, so I’m worried about inter GPU bandwidth and sync during decode. From your experience, does that kind of setup still make sense, or does the slower card + x4 link start dragging performance down enough to outweigh the extra VRAM capacity?
This would help me decide if I should invest in getting this new GPU to add to my rig.
Thank you in advance for any insight you may have. :)
catalini82@reddit
I'm afraid I won't be of much help to you, as I used my setup just to get good isolation for my 4090 running the main LLMs, while the 5060 Ti ran STT, TTS, and embeddings, and sometimes a vision model as well. It was short-lived; I moved on from those plans/ideas :)
There will be people more experienced than me who can answer you directly, but I did do some research before getting the 5060 Ti, as I was thinking about splitting a big-ish local LLM with a big-ish context window between 2 GPUs (I have the same second PCIe 4.0 x4 slot you mention). From that research (not proven by me through tests), there should be a slowdown: your much faster 5080 will wait on the weights that were split to the slower 5060 Ti to finish their work so the results can be pushed in sync. Sorry for the wording; I'm not a guru on the matter. And YES, the slow PCIe 4.0 x4 slot should hinder performance further, not sure by how much, as there will always be data moving between the two GPUs when the weights are split between them.
Saying all this, as you may already know: anything you split between a GPU with its own VRAM and a CPU with system RAM will be much slower than splitting between 2 GPUs.
The_Paradoxy@reddit (OP)
I'm using a 5060 Ti 16GB and a 4060 Ti 16GB on Ubuntu 26.04. With the MoE models, the GPUs are only fully loaded during prompt processing and then sit at about 50% utilization for token generation. So I wouldn't worry about compute power and would focus on VRAM capacity. It probably makes the most sense for you to go with a 5060 Ti 16GB. You can use the --tensor-split flag in llama.cpp to put more of the model on the card with more VRAM; a minimal sketch follows.
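A minimal llama-server sketch of that kind of split, for anyone following along. The GGUF filename is hypothetical and the ratio is illustrative; --split-mode layer and --tensor-split are real llama.cpp flags:
```
# Spread whole layers across both GPUs; --tensor-split sets the weight ratio per card
# (e.g. 16,10 to favor the card with more free VRAM). The filename is a placeholder.
llama-server -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  -ngl 99 --split-mode layer --tensor-split 16,16 \
  -c 180224 -fa on --cache-type-k q8_0 --cache-type-v q8_0
# Note: a quantized V cache needs flash attention; older builds spell the flag plain "-fa".
```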
json-bourne7@reddit
Hey OP! Did you notice any difference in prefill or decode speed when running on a single GPU vs your dual 5060 Ti + 4060 Ti setup (with the second card on the chipset slot), or was performance basically the same and it mainly just increased your available VRAM?
If you have rough numbers, would love to know how much performance changed in % in dual-GPU mode, if at all.
MarcusAurelius68@reddit
5060ti with 16GB could be a nice addition. 28GB of VRAM adds a lot of capability.
Evgeny_19@reddit
Am I missing something, or does it just not say anywhere which settings you used to run those models?
The_Paradoxy@reddit (OP)
I just updated the Models Evaluated section with all of the specific quants I used and all of the llama.cpp flags. Temp, top-p, and top-k were all default for the first pass, and then, once it looked clear that Qwen 3.6 35B A3B gave the best performance, I dialed in the recommended temp, top-p, and top-k. These values are noted in the testing_prompts.md file. The issue with the dense models is that I had to use more aggressive quants to fit them in VRAM. I did experiment with different options, such as q5_1 quants for the KV cache, but performance was terrible because it seemed to force a bunch of CPU processing even though everything could fit in VRAM (no idea why, maybe this was a bug?)
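For anyone replicating: in llama.cpp those sampler knobs are plain flags, e.g. as below. The numbers are placeholders; use whatever the model card's best-practices section recommends.
```
# Sampler settings -- values here are illustrative, not Qwen's official recommendations:
llama-server -m model.gguf -c 180224 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
```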
RedrumRogue@reddit
I'm confused. MoE models don't require less VRAM. If you are using a q6 35B A3B and a q6 27B, the 35B still takes more VRAM, so it would require the more aggressive quant. It simply requires less VRAM per token, making it faster, but the full weights still get loaded at the start. Am I misunderstanding?
danihend@reddit
You can offload the experts to CPU though, and keep everything else on GPU. LM Studio has a toggle for that. Also keep the KV cache on the GPU, and then you can have a nice long context, keeping the stuff that needs more speed on GPU and the rest on CPU. That's why MoEs are so interesting for people with consumer-grade GPUs plus a nice amount of RAM.
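In llama.cpp, the rough equivalent of that LM Studio toggle is a tensor override: keep the attention layers and KV cache on the GPU and push the MoE expert tensors to system RAM. A hedged sketch; the regex targets the usual `ffn_*_exps` tensor names, so verify against your GGUF:
```
# Everything on GPU except the MoE expert FFN tensors, which go to CPU/system RAM:
llama-server -m model.gguf -ngl 99 \
  -ot "\.ffn_(up|down|gate)_exps\.=CPU"
```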
RedrumRogue@reddit
I need to learn how to offload specific bits to the CPU. I have 32 GB of VRAM, so I could see experimenting with a q8 35B A3B. I wish they made a proper 70B MoE so I could try an effective q4 on something bigger while selectively offloading to CPU. Thanks for your comment, I have a lot to learn!
danihend@reddit
32 GB of VRAM is huge :) I've only got an RTX 3080 with 10 GB of VRAM, plus 64 GB of system RAM. When I run MoEs like Qwen 3.6 35B A3B, I usually have my VRAM fairly maxed out, with system RAM 50-80% full depending on the other applications that are open, the particular model, the quantization, etc. You could even run that entire model in your VRAM with a smaller context, but you can definitely benefit from offloading the experts to the CPU and using your VRAM for a large context and a better quantization. I would 100% be running one of the larger quants like q8. Definitely look into it, yw :)
gpalmorejr@reddit
You are 100% correct but forgetting one small part: the 35B also uses smaller Q, K, and V vectors. So at really large contexts, the 27B can't really keep up due to KV cache and prompt caching, even though the model itself is slightly smaller. This is part of why I'll run the 35B on my 6GB GPU: I can fit attention and KV in 6GB and put the slower-but-CPU-friendly MoE layers on CPU/RAM. With the 27B, I'm stuck with sequential offloading, and the size of the KV cache means I can fit almost nothing in VRAM even at lower quants.
crantob@reddit
Praise for the very helpful explanation.
malianx@reddit
35B on 6gb? At what quant?
gpalmorejr@reddit
Q4_K_M. Using MoE split.
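If the regex override above feels fiddly, recent llama.cpp builds also have a convenience flag for the same split; the filename and layer count below are placeholders:
```
# Offload the expert tensors of the first N layers to CPU, keeping attention + KV on GPU:
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 48
```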
RedrumRogue@reddit
Ah i see, thanks for clearing that up
Altruistic-Dust-2565@reddit
I see 262K context there; what about speed at that kind of context?
The_Paradoxy@reddit (OP)
For 3.6 35B A3B with the recommended temp, top-p, and top-k, prompt processing starts at around 1000 tk/s and drops to 500 tk/s, probably averaging out to 600 tk/s. Took about 5 minutes. Generation was 40 tk/s and took 3 minutes. The follow-up prompt was less than a minute of prompt processing and 4 minutes of generation, again at 40 tk/s. The second follow-up prompt was a few seconds of prompt processing and then 4 minutes of generation at around 39 tk/s. I didn't time Claude, but subjectively it was much slower than locally running Qwen 3.6. When I first ran Claude Sonnet 4.6, I didn't realize that having plots in my .ipynb files was exploding the context, and it took 45 minutes to run. I'm not going to burn paid tokens rerunning Claude, but I can say that subjectively local Qwen 3.6 35B A3B is faster. The dense local models are slower than Claude.
Altruistic-Dust-2565@reddit
That's kind of disappointing to me, though; the prefill speed seems to be the bottleneck. Waiting 5 minutes for a prompt is borderline unusable for coding. Not sure why you think Claude Sonnet is slower, though; it provides ~50 tk/s decode and almost instant prefill.
The_Paradoxy@reddit (OP)
This is a 40-page research paper, an appendix that is just as long, and around 5k lines of code. Before I found out that the plots in my notebooks were exploding context, Claude with adaptive thinking took 45 minutes to give me a response. The research paper, appendix, both .py files, both .ipynb files, and the prompt are all available on my GitHub. Go ahead and try it if you think Claude will be faster. I don't think it will be, but you have everything you need to prove me wrong.
UncleRedz@reddit
Did you try different quants of the models themselves as well? (Not just KV cache.) Did that affect results as well?
The_Paradoxy@reddit (OP)
I didn't aggressively try different model quants. Basically, I started with the largest/best 4-bit quant (I think there was one Unsloth dynamic quant that was actually smaller than the step down), and if I couldn't fit everything in VRAM, I stepped down a level. I made the assumption (possibly false) that it wasn't worth going lower than 4-bit quantization, which is why I don't have any working results for Gemma 4 31B. Even at the smallest 4-bit quant, I couldn't fit the large context in VRAM; it would launch but then crash pretty quickly. I only got it past prompt processing once, but it still crashed, and I think that was with a 4-bit KV cache.
UncleRedz@reddit
The reason for asking is that I did something similar: I tested MXFP4 and UD-Q5_K_XL on Qwen 3.6 35B A3B with an FP16 KV cache and did not see any measurable quality difference for my task, which was surprising; I would have expected UD-Q5 to be better. I did see a performance difference (MXFP4 was faster) even though everything was in VRAM. Doing the same test on an older Qwen3, the difference was measurable. Will do some more testing to see what is going on.
I also share the excitement about this new generation of models with more efficient attention; it opens up a lot more interesting data-processing scenarios on local hardware.
The_Paradoxy@reddit (OP)
Cool, good to know that info!
KptEmreU@reddit
Check my post. It has settings for 16 GB VRAM with lots of leeway left over for a game engine and such, or an agent harness.
autisticit@reddit
Where can I download that "intelligent human" ?
Vivarevo@reddit
Insufficient tokens. Renew your intellectual pro human subscription
Real_Ebb_7417@reddit
It's too dangerous to release publicly; only a limited set of companies has access to it for now, to assure security.
-dysangel-@reddit
I hope they know how to control those humans
Altruistic-Dust-2565@reddit
It's currently under preview and only enterprises and institutions can have access. Stay tuned.
More-Curious816@reddit
dario? is this your reddit account?
Ashamed-Mud-7282@reddit
Error: Subroutine Intelligent Human Not Found
matthew_murdock616@reddit
OP is pretty hyped about Qwen 3.6 and all you can read from that post is "intelligent human" 😀😀
HittingSmoke@reddit
Deprecated.
-dysangel-@reddit
unfortunately this one requires a life long subscription.. to yourself
Amazing_Athlete_2265@reddit
Sounds expensive. I hope I don't jack the price up.
davew111@reddit
It's too dangerous to make public
cosmicnag@reddit
Thinking...
YOU_WONT_LIKE_IT@reddit
They quit making them.
Koalateka@reddit
Doesn't exist, it is a myth
Regular-Forever5876@reddit
😅😁🙂
roosterfareye@reddit
Were you able to quantize the K and V cache for Devstral? Could that make the difference?
The_Paradoxy@reddit (OP)
I tried all the way down to q4_0 but still couldn't do it. VRAM consumption really explodes when you need a minimum of 180k context. I really hope Mistral puts out a small model with a gated delta net. There's a YouTube channel, Protorikis, that has done comparisons showing the superiority of Qwen's gated delta net over Gemma's sliding window, and I suspect that's why the Qwen models outperformed the Gemma model in my test.
statewright@reddit
Context window is the underdog in my experience. While attempting to get smaller models doing more useful work, doubling the context window on a 20B model yielded better results than just stepping up to a larger model with the same context. MoE could be advantageous: a smaller KV cache would allow more context-conscious budgeting instead of quantizing it away.
roosterfareye@reddit
Hopefully they do, it's a fairly normal pattern for Mistral.
uti24@reddit
true, but
Yeah, I couldn't make any of the Qwen models work beyond one-shots. One-shots: fantastic and great; I ask it to write some game or whatever and it nails it. Any multi-step problem with OpenCode, and it loops by like the 5th message and I can't fix it. I tried using repetition penalty and presence penalty, tried different quants (well, Q4 and Q6), tried turning KV cache quantization on and off; nothing helps, Qwen loops very quickly.
Gemma 4 31B is OK in that regard; it didn't loop on me at Q4. But its KV cache is not very optimized, so I managed to fit only 50k context into 32GB of VRAM across multiple GPUs.
bobzdar@reddit
I'm running Qwen 3.6 27B Q4_K_M with Claude Code and rarely get loops. It's just about enough to fit on a 4090 without running into context size issues, and as context grows I get to a little over 23GB of VRAM utilization. I've run 30-40-turn iterations and will often resume if I want to add some features to a previous project. Once I had to reload the model when running a ton of turns in a single session, but it picked up where it left off without issue. I have a feeling it's partly model settings and partly prompt...
The_Paradoxy@reddit (OP)
Interesting. I'll probably do another post after I develop an OpenCode workflow that I'm happy with. I'll continue testing all of these models until I've got my entire workflow down.
Top-Rub-4670@reddit
I have a similar issue with Qwen 3.6 35B A3B.
After a while it will start repeating itself. I'm not talking about thinking loops (those happen too, but more rarely), I'm saying that its response to me will eventually become static no matter what I say to try to break it out of it.
I use the recommended samplers and I have tried q4/q6/q8/bf16. Same issue in all.
my_name_isnt_clever@reddit
Are you using the exact inference params recommended on the Qwen model card? These models are agentic workhorses; you have something messed up with your setup if it can't handle more than one turn. Try enabling Preserved Thinking with Qwen 3.6 too; it helps a lot over long conversations.
uti24@reddit
Yeah, I tried to follow all the settings Qwen recommended.
But you know, it's hard to run a model without quantization, so GGUF it is.
Also, there are a lot of complaints about looping; it's not only me.
Last_Mastod0n@reddit
The consensus is generally that qwen 3.6 27b outperforms qwen 3.6 35b a3b across the board by a small margin. The tradeoff is that 27b is quite a bit slower
FolsgaardSE@reddit
Curious, what kind of card can handle a 35B net? Guessing top of the line 5090
The_Paradoxy@reddit (OP)
I'm running a 5060 Ti 16GB + 4060 Ti 16GB and use the llama.cpp flag --split-mode layer. There's no PCIe bandwidth bottleneck even though the 4060 Ti is connected through the chipset. ChatGPT/Claude/Gemini will all tell you that performance with this setup is terrible, but it's all lies. On YouTube, Digitalspaceport showed dual GPUs on PCIe Gen 3 x1 running 35B A3B with no bottleneck, and I was 🤯 because AI had been telling me that was impossible. Nope! There are people running it on dual 3060 12GB cards! During token generation, my GPUs are at 50% utilization. It's just VRAM capacity limitations that you have to worry about. Don't believe any AI telling you not to try running models locally!
enternoescape@reddit
I've got a box running 8 3060 12GB cards, a 22-core Xeon, and 128GB DDR4. The first four GPUs are on a PLX. The next three are on another PLX. The 8th GPU is using an NVMe x4 to PCIe x16 adapter. It's all PCIe 3.0. I can run q8 at around 54 t/s across all 8 using layer split and 262144 context. Running q5 on four GPUs on the same PLX, I get around 70 t/s at the same context size. Running q4 on two on the same PLX using tensor split, I get 90 t/s, but I can't use full context. Tensor split is unstable for me at anything more than 2 GPUs.
I was looking for the cheapest way to hit the highest amount of VRAM while keeping modern features like flash attention when I was speccing this build. The raw VRAM numbers are pretty awesome, but it is definitely a little more work to get everything to fit properly, because often the layers do fit, just not the way llama.cpp thinks they should. I spend so much time with --tensor-split every time I want to try something different.
Talking to an LLM about this specific kind of build spec did nothing but flash warnings about all the problems I was definitely going to have. My build can run Qwen 3.5 397B at q2 on two GPUs, with the rest offloaded to RAM, at 6 t/s. That's borderline-usable speed, but an unfortunately dodgy quant. Right now I'm working on getting MiniMax M2.7 to load in q4 with 128k context. The time spent waiting to find out whether I got it right while I attempt to fit as much as possible into VRAM is absolutely maddening, but it's running at around 25 t/s, which is honestly pretty cool, and this model is so smart.
FolsgaardSE@reddit
Thanks! I'm trying to get into the game using CPU only for now, just to learn (even if crawling). Was thinking a 3090 with 16 GB of RAM might be within a good budget instead of those $1-2k cards.
my_name_isnt_clever@reddit
It kicks ass on unified memory systems, like Strix Halo and Apple Silicon.
g_rich@reddit
Personally, I've gotten better results from Qwen3.6 27B. Initially there was a pretty significant drop in token generation speed compared to the MoE Qwen3.6 35B variant, but pairing the dense 27B with speculative decoding, particularly DFlash, has brought things up to a usable level, and it's now my default go-to model.
The same can now be said for Gemma 4 31B, now that Google has released the assistant companion models to enable MTP for Gemma 4.
However, despite how good the Qwen3.6 and Gemma 4 models are, they can't match the output of the foundation models. They simply do not have the knowledge base to compete effectively. You are comparing a 30-billion-parameter model with ones that are over a trillion. That's like comparing the knowledge in a set of encyclopedias to that of a whole research library. To get something on par with foundation models you'd need something like Kimi K2.6, which is out of reach for most people.
lordekeen@reddit
Have you experimented with the MTP model or just the DFlash?
Feisty_Resolution157@reddit
Oddly, some people have reported quite a speedup with MTP, and others have not. I've tried it on my Blackwell 6000 using the same settings as someone reporting a nice speedup on the same GPU, and I don't see much of a speedup at all.
human_bean_@reddit
For some reason, different weights from different sources seem to behave completely differently. I get +100% tok/s on many, but UD quants, for example, don't get any improvement from MTP even when they're supposed to.
g_rich@reddit
DFlash with Qwen3.6-27b and mtp using the official Google assistant models with Gemma 4; running on a DGX Spark via vLLM.
Gullible-Analyst3196@reddit
I have tested dozens of models on my $500 PC; everything eventually was a disappointment, either too slow or unable to complete tasks successfully. This model changed everything. For my 6GB VRAM and 32GB RAM it is quite fast, 20 tokens/sec, and it has so far completed every task. It even analysed my crypto trader, made recommendations, and implemented them. My config:
FerLuisxd@reddit
What about tk/s? For each model?
Organic_Scarcity_495@reddit
The niche research code test is the real filter. Most benchmarks are contaminated, but if a model can reason about your obscure spec, that's actual reasoning capacity, not memorization. Qwen 3.6 passing that test is what sold me too.
Agreeable_System_785@reddit
OK, so OP used a q4 quant and, as far as I can tell, did not tweak model parameters.
This is important to me. I got a lot more value using bf16 or q8 with the dense models, but also by tweaking a lot.
The_Paradoxy@reddit (OP)
I can't do bigger quants with the dense models. For this task I need 180k context minimum. For the 27B dense, I had to step down to the Q4_K_M quant because context takes up so much more VRAM with the dense models. If you really believe that tweaking things like temp, top-p, top-k, etc. can substantively improve performance for the dense model, let me know your recommendations and I'll try them. The only way I can go larger on the model quant is if I drop to a q4_0 quant on the KV cache.
Agreeable_System_785@reddit
Hi, don't see it as blame but more as a clarification for other users. If they have different hardware, they might end up with different results because of quantization choice.
https://huggingface.co/Qwen/Qwen3.6-35B-A3B: see the best-practices section and test whether it works for you. That is a good starting point.
I have different use cases, and for each one I end up choosing between the Qwen 27B dense model, the 35B MoE, a Gemma variant, or a Mistral one. It really depends on the use case.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
MasterScrat@reddit
Is this comment really necessary in each and every successful post ?
LifeTelevision1146@reddit
How long was your context, 3-4B tokens? How long did it take to give you a verdict? And did the PC hum, or was it being a hot rod? I know, too many questions; insights are always great.
The_Paradoxy@reddit (OP)
Prompt processing was ~120k tokens; after generation with follow-up prompts, ~180k tokens. Initial prompt processing took around 5 minutes, and then 3 minutes for the first generation. Honestly, it feels faster than Claude when using the MoE models. The GPUs intermittently hit full load during prompt processing, but during token generation they sit at 50% utilization.
danalvares@reddit
What are your system's specs?
The_Paradoxy@reddit (OP)
5060ti 16gb + 4060ti 16gb
PairOfRussels@reddit
Try the project with Turboquant enabled to extend your context size.
TheTom/llama-cpp-turboquant
The_Paradoxy@reddit (OP)
I can already extend context size with q5_1 quants. The problem is the computational overhead: I went from 600 tk/s prompt processing to 6 tk/s. It just doesn't feel worth it. Like I said, the hype is real: at q8_0 KV cache I'm getting excellent performance. The YouTube channel Protorikis has a good video testing Turboquant and finding that, similar to q5_1, the computational overhead isn't worth it.
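If anyone wants to reproduce that kind of comparison, llama-bench can sweep KV cache types directly; the model path is a placeholder:
```
# Compare prompt-processing (-p) and generation (-n) throughput across KV cache quants:
llama-bench -m model.gguf -fa 1 -p 16384 -n 256 \
  -ctk q8_0,q5_1 -ctv q8_0,q5_1
```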
Imaginary_Belt4976@reddit
I don't disagree that they are great, but I am surprised 27B didn't beat 35B-A3B. I use both, but generally A3B when I want super fast inference and 27B when intelligence actually matters.
The_Paradoxy@reddit (OP)
I need 180k context minimum for this task, which meant I had to use an inferior quant for the 27B model (the dense model might be smaller, but its KV cache grows at a faster rate than the MoE model's). I suspect that's the explanation.
AltruisticList6000@reddit
In my experience the 27B was either barely better or, in most cases, not better than the 35B. I've only tried the old/manual method of generating fairly basic HTML or other code (so not agentic; I'm not a dev/coder) with minimal tool usage, on textgen webui. Whenever the 35B got stuck, I loaded up the 27B, which doesn't fit in my 16GB VRAM and so is about 4x slower. I waited ages for it, but no matter how many times I regenerated, the 27B wasn't better at solving the issues. However, with me giving some ideas or rewording the prompt, the 35B then pulled ahead and worked it out. So in my experience they are head to head, but because the 27B is significantly slower in my case, it's better to just reroll or give coding ideas/other prompts to the 35B, and then it just works. Same for non-code stuff. So I'm at the point where I'll delete the 27B and keep the 35B.
MoE models have generally seemed quite disadvantaged compared to dense models at understanding complex tasks or prompts (sometimes misinterpreting basic things or questions), but both Qwen 35B and Gemma 26B are way better than previous open MoE models and punch above their size.
HavenTerminal_com@reddit
can't find the intelligent human in any of the usual repos
sophlogimo@reddit
It is, unfortunately, behind a paywall.
The_Paradoxy@reddit (OP)
Yes but the payment is hard work not money
mixedliquor@reddit
I picked up a R9700 last week and spent the weekend using it with Qwen 27B and 35B A3B to do some Python coding because I've wanted to learn Python.
I was blown away. I gave it R code and it reasonably adapted it to Python. I had it write Software Definition Documents off my previous code and also off verbal descriptions, then fed the SDD back to it to draft the program in Python, and it did great. It wasn't wonderful at thinking outside the box, but it did what I told it to and corrected code reasonably well. When I told it what was missing (error handling, prompt memory, etc.), it added it no problem.
I liked the way it approaches coding much better than ChatGPT. It gave me better instructions on updating code blocks than ChatGPT does and was able to correct my code when I made a mistake much more readily.
The_Paradoxy@reddit (OP)
Yeah, if you already know R or a similar high level language then you + Qwen 3.6 will be stronger than a frontier model. I'm sick of people reviewing these models based on their ability to vibe code. The value of these models is in their ability to increase your productivity; it's not in their ability to decrease your critical thinking.
If you don't know any programming language (I'm an intro stat computing professor), then don't try learning from an LLM. Just do it the old hard way or you'll plateau in your abilities really fast.
TonyPace@reddit
You mentioned several context management techniques; I'm unsure which ones were at work in which models. I looked them up and they sound interesting, but I'd like to hear what your experiences were like. I'm working with document processing, and context problems keep me spending money on tokens.
The_Paradoxy@reddit (OP)
If you need recall accuracy, Qwen's gated delta net seems to be superior to Gemma's sliding window. YouTube channel Protorikis has a few videos discussing it that I recommend over anything I could say on the topic. If you don't care about vram consumption or accuracy and just want to use fewer tokens, then I'm not sure what the relevant recommendation is. Maybe disable thinking? Or use a model that naturally thinks less like nemotron? Lmk if you find a solution you really like.
mehyay76@reddit
I'm curious if it can make the last 0.02% of tests pass on https://github.com/mohsen1/tsz
Even GPT 5.5 is struggling
compass-now@reddit
Anyone built a production-grade app with any of these?
DeSibyl@reddit
Curious how Qwen 3.6 27B stacks up against Gemma 4 31B, or even how Qwen 3.6 35B A3B stacks up against Gemma 4 31B… I mainly want it for general assistant stuff, like a "ChatGPT" replacement for work.
my_name_isnt_clever@reddit
My understanding is Qwen 3.6 is better for agentic, tool calls, coding and Gemma 4 is better at writing, creativity, soft skills. So it depends what your priorities are.
jadbox@reddit
I found that Q5_K_S performs a tad better overall than the Q4_K_XL that was used for the 35B in this post. This might be related to the CUDA bug with q4.
audioen@reddit
You tested with Q4_K_M. In my opinion, this quant is worthless on the 27B model; at least, my personal experience says that this model's performance is mediocre at less than 6 bits, and I don't trust even the 6-bit q6_k_xl because I've had it make really bad translations at that quant. q8_0 works fine as far as I can tell, though.
DinoAmino@reddit
100%. There is a lot of non-technical hype for this model and everyone upvotes it.
StandardLovers@reddit
Gemma 4 was released a couple of weeks too early. I think it would hit differently today.
Sabin_Stargem@reddit
I am hoping we get the 3.6 122B soon. That is the biggest model I can run at Q6, and considering all the improvements to llama.cpp, it would be way faster than it used to be.
roninXpl@reddit
I'm getting 64 tok/s on an M3 Max (64GB, 40-core GPU) via LM Studio's distro.
27B gives me ca. 16 tok/s.
What's interesting is that LM Studio's GGUF is faster than MLX.
However, it has an issue with the following prompt, created to benchmark models:
```
curl http://localhost:1234/api/v1/chat \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen3.6-27b",
"system_prompt": "You are a network engineer and systems architect experienced with Synology SRM, Tailscale, and Raspberry Pi deployments.",
"input": "I need to route all traffic from a Raspberry Pi 5 through a Tailscale Exit Node on a Synology RT2600ac mesh network. The Pi must still access a local NAS (192.168.1.x) for Synology C2 backups without going through the tunnel. 1. Provide the exact tailscale up command with necessary flags. 2. Explain the static route configuration in SRM to prevent routing loops. 3. Identify the specific risk of radio crashes on the RT2600ac when handling high-frequency monitoring pings."
}'
```
It always loops indefinitely for me, does not happen on other versions or quants of this same model.
yobigd20@reddit
Is it possible to put the KV cache context on a 2nd GPU? I have a B70 32GB and 3x RTX A4000s. I was wondering if I could run the largest quant I can squeeze on the 32GB but have the cache context go on one of the RTX A4000s.
tarruda@reddit
It also has the best uncensored variant, with only 0.0015 KLD: https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF
L0ren_B@reddit
I have the same experience with the 27B over the last few days!
I found a trick that worked for my 100k+ lines of code: start the project using a smarter model (I use the Pi coding agent) and then switch to Qwen 27B.
The first prompt matters, and also how it approaches the issue, etc.
Between Qwen 27B and DeepSeek V4, I have not noticed much difference.
It did loop a couple of times over hours of usage, and I had to stop it and prompt it to continue. But I've managed to get real work done!
There is no better small model that I've tested that comes even close. Even Gemini Flash seemed worse for me!
If we get the same increments, and Alibaba doesn't stop releasing weights, Qwen 3 to 4.5 will be all I need for daily work!
I also think companies have stopped releasing smaller models now, as it's hard to beat Gemma and Qwen. They are probably rethinking their strategy!
justGuy007@reddit
Which quant?
L0ren_B@reddit
Q4 Autoround, as per here
The_Paradoxy@reddit (OP)
Nice! Have you also used OpenCode? If so, how does Pi compare? Any tips on AI coding workflow are appreciated!
L0ren_B@reddit
I was using both OpenCode and Pi. I love Pi more; it feels less prone to errors? Also, I've made a continue plugin for long projects, where I can say "/continue 5", for example, and after it finishes, it will type "Continue" 5 times.
Some coding tips:
- For Qwen, the first prompt matters. While with Claude and GPT you can talk in "blunt language" (do this, do that), I found the first prompt matters the most here. You need to be very specific. It won't be one-shot, but it will get you there!
- I would start the project with a smarter AI (like DeepSeek, as it's cheap) and switch mid-project. You will barely see the difference.
- Look out for loops. I've noticed they happen around 128k context for me (~50 percent of the time).
- Smaller models tend to hallucinate. In my case, when I had to split a huge 20k-line file, the model failed when it tried to just write it from memory (like read from line to line and write a new file), but aced it when I prompted it to use cut/paste tools. So it's not yet a fire-and-forget model.
Good luck :)
The_Paradoxy@reddit (OP)
thanks 🙏
Rikers88@reddit
I tend to agree with the last statement. If you need Claude Opus 4.7, either you don't know what you are doing, or you don't care and want to autopilot everything.
Will you test the Qwen3.6 27B dense as well?
Alternative_Ad4267@reddit
Opus 4.7 is capable to the point that many people won't want to deal with a less capable model; it almost understands our messy ways of communicating what we want out of it. Other models force you to be more systematic and organized in your thinking.
The_Paradoxy@reddit (OP)
💯. Yeah, that's the dense 27B model referenced in the post and in the detailed analysis. Performance was fairly similar: the 35B was very slightly more accurate and thorough, while the 27B was more holistically clear. Maybe the 27B was suffering from the stronger quantization. VRAM consumption at large contexts grows a lot faster with dense models, so even though the 27B is smaller than the 35B A3B, when just the initial prompt is ~120k tokens, the 27B uses up more VRAM than the 35B A3B. Given how similar the performance was for me, I prefer the 35B A3B for its faster processing. If I ever need more world knowledge and less context for something, I might try a better quant of the 27B rather than the 35B A3B.
Alternative_Ad4267@reddit
I disabled my ComfyUI and Automatic1111 services, and even the Open WebUI Nvidia service (it's running in CPU-only mode; I don't use RAG there), to free up all the memory on my cards for these medium-size models. They are finally that good. Local models are finally delivering what I wanted from them in the first place.
keen23331@reddit
You can run this model at >60 t/s on 12GB VRAM on an RTX 5080 LAPTOP, with full context and minimal loss vs. fp16.
Human-Cherry-1455@reddit
Thank you for sharing.
fasti-au@reddit
Considering it runs on 8-year-old hardware, and Turboquant makes it code on 5-year-old laptops, I think the difference is more like Nvidia crashes.