Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better!
Posted by LocalAI_Amateur@reddit | LocalLLaMA | View on Reddit | 92 comments
A bit of context. I was coding up a little html tower defense game where you can alter the path by placing additional waypoints.
My setup: 32gb ram with 16gb vram 5070 ti. Using AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS on LM Studio with OpenCode. I've graduated from one-shot vibe-coding prompts.
The spec for this game was complicated enough that it couldn't have been done in LM Studio so I tried OpenCode. The project was chugging along, Qwen3.6 35b-a3b was getting things done when 27b dropped. Naturally I had to try it. Only problem is that I couldn't use any of the Q4 models due to vram issues, so I dropped to an IQ3_M model from mradermacher/Qwen3.6-27B-i1-GGUF.
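Rough napkin math on why (assuming roughly 4.25 bits per weight for IQ4_XS and 3.66 for IQ3_M; these are approximations, so actual GGUF sizes differ a bit): 27B x 4.25 / 8 is about 14.3 GB of weights, which leaves almost nothing for context/KV cache on a 16gb card, while 27B x 3.66 / 8 is about 12.4 GB, which leaves a few GB to spare.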
I worried that IQ3_M would be too much compression, but it did fine and was even able to find a difficult bug that the IQ4_XS version of Qwen3.6 35b-a3b couldn't. They say dense models handle compression better than MoE models. Is that the reason for this? What are other people's experiences with the 35b-a3b vs 27b versions of Qwen3.6?
Using LM Studio,
I got 50-60 tokens per second with Qwen3.6 35b-a3b (AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS) but the prompt processing gets real slow sometimes.
I got 40ish tokens per second with mradermacher/Qwen3.6-27B-i1-GGUF IQ3_M but it was decent speed throughout.
How are people's experiences with these two models at 16gb vram?
Oh, the Waypoint Tower Defense game is done and can be found on htmlbin. The save/load doesn't seem to work on their site, but if you download the file and open it in your browser, it'll work fine. It's a self-contained single html game. Meant to be like minesweeper but for tower defense.
Brilliant_Anxiety_36@reddit
im running qwen3.6 27b q4km without vision with turboquant via llamacpp with around 98k context. 7900XT 20gb
without turboquant i can fit 45k context
MistingFidgets@reddit
Can you share your recipe for getting 40 tok/s on a 5070? I have a 5060 and want to see how close I can get to that
LocalAI_Amateur@reddit (OP)
Context Length: 32,768
GPU Offload: 64
Max Concurrent Predictions: 1
Unified KV Cache: on
Offload KV Cache to GPU Memory: on
Keep Model in Memory: off
Try mmap(): off
Flash Attention: on
K Cache Quant: Q8_0
V Cache Quant: Q8_0
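If anyone wants to reproduce this on plain llama.cpp instead of LM Studio, I think the rough equivalent of the settings above is something like the line below (an untested sketch on my part; the model filename and layer count are placeholders, and the flash attention flag spelling differs between llama.cpp versions):
llama-server -m Qwen3.6-27B-IQ3_M.gguf -c 32768 -ngl 64 -np 1 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0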
ridablellama@reddit
It's really impressive and a huge relief to know that if worse comes to worst, this will be the baseline. That can never be taken away from anyone who has 16-24 GB vram. No matter how expensive monthly cloud costs get, this will exist and be readily available. It wasn't as good as Claude Code, but I have done legit work with it, and the speed and context window were totally fine. In a few more weeks it will be fine-tuned and be even better. Wild times
phazei@reddit
But my 3090 died, and I'm very sad :( They cost over 2x what I paid now :'(
ridablellama@reddit
yikes, that's kind of scary knowing my 4090 is 2+ years old. maybe i should sell it soon and get that Max Q ive always wanted
rhythmdev@reddit
This is exactly the reason i built a 5090 machine. I got it, it is mine. By 2030, i will own it and be happy.
ridablellama@reddit
yep, I bought my 4090 over 3 years ago I think at this point? it's still worth most of what i paid for it. The resale factor is always left out of these discussions of cloud API costs. Once you pay, it's gone forever and not coming back.
SkyFeistyLlama8@reddit
In just two years of local LLMs, we've gone from stochastic parrots like Llama to Qwen 3.6 and Gemma 4 in roughly the same amount of RAM. You're right, this is the worst it will ever be.
I can't remember when I last used a cloud LLM over the past week.
crantob@reddit
A3B can't develop emergent thinking, 27B can.
It's like a hamster brain vs a smart dog.
FinalCap2680@reddit
Have not done much testing, just some html/css/js, but so far I like Qwen 3.6 35B-A3B most (UD-Q8_K_XL). It gives much better results for UI, and something that could actually serve as a starting point to build on.
Can't wait to see what Qwen 3.6 122B will be capable of...
76vangel@reddit
Great. Which IDE are you using? And if it's VSCode, which extension are you using LM Studio with?
LocalAI_Amateur@reddit (OP)
just notepad++. it's html and javascript, so no need for much more. LM Studio + OpenCode Desktop.
Weekly_Comfort240@reddit
I am using a QuantTrio/Qwen3.6-27B-AWQ quant with VLLM (2 x 48GB RTX 6000's, full context, 4 parallel). I started with 35B-A3B, but even though it was blistering fast, I absolutely cannot go back to it after experiencing the full thick goodness of this model. Simply put, 27B slays in deep understanding of what you ask it to do, and then doing it. I'm going to provide two examples where I hooked up the Claude Code harness front end to the vllm / Qwen 3.6-27B backend:
First example: Analyze ten word documents pertaining to a project involving healthcare integration, extract certain technical data and transform it, and analyze discrepancies according to my prompt. 1 hour 6 minutes later, it generated a report and deliverables exactly on spec.
Second example: Compare two codebases and give me the list of bugs I fixed in the first one, and ignore all the stuff involved in the platform migration I did from the first to the second codebase, covering hundreds of git commits. It uncovered stuff I completely forgot. 44 minutes and it cooked up a document that told me what bugs I fixed and how to propagate them back to the first codebase.
In my own short-hand personal comparison of these same projects between 35B-a3B and 27B, 35B will complete the projects in half the time but deliver results that do not reflect the depth of understanding that 27B has. Honestly, 27B makes it seem like I got my own mini frontier-model class robot on tap, with zero token budgets and no data leaving my office.
Hot-Business8528@reddit
I’m running the same model on 2x5090. What TPS are you getting, is it the 18tps in the last paragraph?
LocalAI_Amateur@reddit (OP)
looks like I can probably delete the 35b-a3b model to save some space.
Direct_Turn_1484@reddit
35B seemed to hallucinate a lot for me. I had to switch back to other coding models.
Independent-Date393@reddit
MoE models at IQ3 lose more than dense because you're compressing routing logic and expert weights simultaneously. dense models distribute quantization error more gracefully. 35B-A3B IQ4 probably beats 27B IQ3 on most tasks, but if routing was misfiring on your specific problem the switch would feel like an upgrade even at lower quant.
LocalAI_Amateur@reddit (OP)
I know my experience is anecdotal so I'll probably switch back and forth between the two if I come across more bugs. Maybe this was just a fluke. Who knows.
dampflokfreund@reddit
I don't think so. I have compared Q4_K_L to Q3_K_XL (both from Bartowski) and the Q3 did excellently, and frequently better, in many of my tests. I don't know why, it is strange.
KillerX629@reddit
I'm looking to get more tokens per second; the noticeable slowdown gives me a lot of friction for the switch
MasterLJ@reddit
I'm getting 150 tok/sec on a rented H100 that shuts down when I'm not using it. So much cheaper than API hits because it's by the hour and not per token.
Ambitious_Worth7667@reddit
....Login...?
blackashi@reddit
Company?
j_lyf@reddit
Tutorial?
Hot-Employ-3399@reddit
Switch to pythia 14M model. Tokens will fly through the roof.
KillerX629@reddit
Nah, with qwen 35a3b i get 100t/s more or less. But the 27b dense one gets me only 30 tokens per sec
Kiro369@reddit
If you have to prompt it like 3 times to achieve something that the dense model would get right the first time, is it really faster? Quality matters in that equation
CodeDominator@reddit
In coding accuracy > speed.
LocalAI_Amateur@reddit (OP)
this
rpkarma@reddit
It makes up for it by not thinking as much and getting correct answers faster in wall-clock time though.
Chiralistic@reddit
Since qwen3.6 35b is a MoE model you can load a Q8 version of that model without losing much speed. I bet that codes even better.
Danmoreng@reddit
If you’re using it for coding, speculative decoding should improve speed further. Not sure if LMStudio has that though, you will most likely need plain llama.cpp for that. Tested it out today on my Laptop RTX 5080 16GB with IQ3_XXS. I get ~30 t/s normally, and if it repeats lots of pre-existing code that goes up to 50-80 t/s.
If you want to run llama.cpp I got powershell & bash scripts to compile from source and run Qwen3.5/6 models here: https://github.com/Danmoreng/local-qwen3-coder-env
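The classic form of it needs a separate small draft model, something roughly like this (the filenames are placeholders and the draft settings are just starting points, not a tuned setup):
llama-server -m Qwen3.6-27B-IQ3_XXS.gguf -md small-draft-model.gguf -ngl 99 -ngld 99 --draft-max 16 --draft-min 1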
Ill_Barber8709@reddit
LM Studio has had a speculative decoding feature since February 18, 2025
https://lmstudio.ai/changelog/lmstudio-v0.3.10
Danmoreng@reddit
Sorry, let me rephrase: self-speculative decoding without draft model. https://github.com/ggml-org/llama.cpp/pull/22223
suprjami@reddit
Does this actually work for you?
It makes no difference for me.
I see the memory allocation in the logs (16 MiB?) but the best acceptance rate I got was 2 out of 2 drafts.
I've used draft models in the past with noticeable speedup.
Danmoreng@reddit
Yes it does for specific workload: repeating much of the previous input. Tested this yesterday with the Qwen3.6 27B model on my 5080: asked it to create an html website ~27 t/s. Asked it to make an edit to the previously generated website ~50 t/s.
suprjami@reddit
Ah I see. Well, at 16 MiB usage, no harm in leaving it on for occasional speedup. Thanks!
LocalAI_Amateur@reddit (OP)
I'm going to have to read up on this feature. Tho from what I've read so far, I'm not sure I have much vram to fit another smaller model. Maybe after Qwen3.6's smaller models come out, I can give this a try.
Danmoreng@reddit
It’s not done with a smaller model; it re-uses parts of the prompt and does hash matching. Very lightweight, documented here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md#n-gram-mod-ngram-mod
simracerman@reddit
Q4_K_XL is killing it for me.
LocalAI_Amateur@reddit (OP)
what's your vram tho? I only picked IQ3_M because I needed to leave context room.
simracerman@reddit
5070 Ti 16GB. I offload some to iGPU using Vulkan backend:
${llamasvr-vulk} -m ${mpath}\Qwen3.6-27B-UD-Q4_K_XL.gguf --no-mmap -c 72000 ${quantKV-8.0} -np 1 --ngl 63 --chat-template-kwargs "{\"preserve_thinking\":true}" --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0
LocalAI_Amateur@reddit (OP)
ah, well my igpu is pretty weak so I'm not sure if it's worth the trouble to build llama.cpp for it. I'll keep that trick in mind if I really need to squeeze out more tokens/second tho.
simracerman@reddit
Which one do you have?
Mine is AMD 890m. It’s middle of the road, but nothing to brag about
LocalAI_Amateur@reddit (OP)
Ryzen 7 7840U w/ Radeon 780M graphics. I'm using my RTX 5070 ti through oculink. You have a link on how to use vulkan offload? from what I searched they all say I have to build my own llamacpp. I have no idea what your "${llamasvr-vulk}" variable means and which exe file it is.
simracerman@reddit
It’s late here, but I will post a link with more info and exact instructions.
Since you have the 780m (solid iGPU), it’s faster than CPU offload, quieter during inference, and consumes less energy. My setup is also via Oculink.
${llamasvr-vulk} Is a macro I use so I don’t have to type C:/path/to/llama.cpp/llama-server.exe
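Quick version in the meantime: you shouldn't need to compile anything. The llama.cpp GitHub releases page ships prebuilt Windows Vulkan binaries (the win-vulkan-x64 zip), and with a Vulkan build both the 5070 Ti and the 780M show up as devices, so you can split layers between them with --tensor-split. A rough sketch (the split ratio is a guess you'd tune to what fits on each device, and the model path is a placeholder):
C:\path\to\llama-server.exe -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 -ts 0.85,0.15 --main-gpu 0 -c 32768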
yeah_me_@reddit
how much tps do you get with that setup? I am also thinking about going the eGPU + iGPU offload route, I wonder how much potential speedup is there (since I am getting 11-ish tps on Strix Halo)
simracerman@reddit
At empty context I get 15 tps, but with the new spec decoding it does up to 20 sometimes.
LeonidasTMT@reddit
I'm also using a 5070 Ti but settled for IQ3_XXS and 4-bit K/V cache
Q3_K_M was sort of unstable for me when running larger context sizes and trying to load everything to GPU and would freeze up my pc
Bloedbek@reddit
Can I ask what's your context limit set at?
LocalAI_Amateur@reddit (OP)
32,768
Express_Quail_1493@reddit
Modern dense models are usually better than any MoE 3x their size. qwen3.6-27b is on par with qwen3.5-397B. An MoE is still just... an MoE. Raw active params win on coherence, stability, and reliable outputs.
TheyCallMeDozer@reddit
So i noticed something strange using the official models: the 35B-A3B is fast enough in LM Studio to run 4 prompts consecutively with no issue. Switch down to the 27b model and it's incredibly slower, like 5x the time to run a single prompt. The 35B-A3B gets maybe 208-243 tok/s, while the 27b on the same setup, thinking disabled etc., gets 21 tok/s?
uti24@reddit
While Qwen3.6 27b is much, much slower on the same hardware (like, 5× slower?) than Qwen3.6 35b-a3b, it still finishes tasks faster.
You have to babysit Qwen3.6 35b-a3b — it just doesn’t have the capacity to be as creative as Qwen3.6 27b, and it can’t figure out tricky moments. And Qwen3.6 27b is more like point-and-shoot: it will finish tasks without extra hiccups.
So even with a slow-ish AMD AI thing, I am much happier with 27B (although I was already somewhat happy with 35b-a3b, but then 27b dropped). Also funny how Qwen3.5 27b didn't feel that way.
Dany0@reddit
Try the xtremeAI RYS. I swear this time it's even smarter (with caveats ofc)
XccesSv2@reddit
Hmmm, just a 4-bit quant available. I'm using UD_Q8_K_XL, so it wouldn't be smarter, right?
Dany0@reddit
It definitely will be smarter. base Q8 might be more coherent very close to ctx limit
I think if you want you can dequant the q4, 2-3 hours on a good cpu iirc, or you used to be able to do it via google colab
odragora@reddit
18.6 GB, won't fit 16 GB VRAM GPUS unfortunately.
LocalAI_Amateur@reddit (OP)
tried. 8 tokens per second on my setup. no go sir.
UnlikelyTomatillo355@reddit
at these sizes, no a3b or e4b is going to be as good as something dense. the 27b is way better, same with gemma 4 31b.
Ranmark@reddit
I was also daily driving 35b a3b, but since the release of 27b I immediately switched. Even tho it's 2-3 times slower in my setup, it does the job better and with fewer mistakes, so fewer rewrites.
jeremynsl@reddit
You can use a much higher quant of the MoE, and it will probably be faster too. Check my post history; I just had a large discussion on this. I am using Q5 on an 8gb GPU, much faster than IQ4_XS. I'd say you can go Q6 for sure.
LocalAI_Amateur@reddit (OP)
problem is I only have 32gb of ram, not 64. At least with the few bigger ones I tried, I got a significant slowdown in speed. Does a Q5 or Q6 version of Qwen3.6 35b-a3b beat an IQ3_M version of Qwen3.6 27b? I don't know, maybe it does, but it takes quite a bit of time to test all the maybes. I can only speak to my experience switching between the two models I've used.
jeremynsl@reddit
Couldn’t say if it beats 27b, probably not. Will be MUCH faster though. You have 48gb total RAM which should be plenty for Q6.
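For reference, on plain llama.cpp the usual way to fit a big MoE quant on a small GPU is to keep the attention and shared layers in VRAM and push the expert tensors to system RAM with --override-tensor. A rough sketch (the filename is a placeholder and the tensor regex may need adjusting for this model's tensor names):
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 32768
Since only ~3B params are active per token, it stays reasonably fast even with the experts in RAM.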
LocalAI_Amateur@reddit (OP)
I have to repeat what others have said. In coding accuracy > speed.
FortiTree@reddit
Time to upgrade to 32Gb vram :)
I'm on Strix Halo and can run 122B-A10B IQ4XS at 20 tk/s. Reading all this makes me think I should push for Q5 instead due to routing compression. Or fall back to 27B Q8 at the blistering speed of 7 tk/s, or Q4 at 13 tk/s. My projected daily driver is 35B-A3B Q8 for no compression loss at 50 tk/s.
lousyzen@reddit
what's the context window you use?
LocalAI_Amateur@reddit (OP)
32,768
YairHairNow@reddit
q4_0 quants; the 2-GPU setup is 5080+2080. It's beneficial on the 35B MoE (22GB) to prevent offloading.
https://github.com/Danmoreng/local-qwen3-coder-env
Stainless-Bacon@reddit
What is the speed if you offload to CPU instead of second GPU? I’m wondering if I should grab a cheap used GPU with 4 or 8 GB to complement my 16 GB main instead of CPU offload.
YairHairNow@reddit
I'm 1x pcie crippled at the moment, but MOE models are faster if the model can fit in vram. However if the model were to fit on my 5080 alone it would be about twice as fast, with the tradeoff being a q3 quant.
For dense models, CPU is faster, but it suffers more from the pcie 1x penalty compared to moe, which barely has penalty.
I'd say go for it. I've been having fun figuring out ways to use it.
Benchmarks with the 22gb q4 MoE Qwen3.6 35B on 5080 and 5080+2080.
admajic@reddit
I'd say the 27b is way better. I can run the 35b at 110 token/s and the 27b at half that speed, but the 35b will take 30 mins to complete a task due to having to fix stuff at the end, vs the 27b needing less fixing at the end, so it's ultimately faster.
ayylmaonade@reddit
You can get far more visually impressive results out of this model. If you're just messing around, go ask it to generate a ThreeJS Voxel Pagoda world. Or pretty much anything using ThreeJS/WebGL.
LocalAI_Amateur@reddit (OP)
Well, my personal understanding of ThreeJS/WebGL is limited and I didn't want my first test to be a ton of code I totally don't understand. AI code is ugly enough as it is. I went through three rounds of cleanup / optimization to get it into its current state. I've specifically asked it to optimize for human readability.
ayylmaonade@reddit
Hey, fair enough. Just thought I'd let you know! I've seen some really impressive things from Qwen 3.6. I'll include a couple of examples of generations I've had from an MXFP4 quant of 35B-A3B. 27B must be great.
Simulated Browser OS
The pagoda I mentioned
Have fun experimenting!
Independent-Date393@reddit
the dense > MoE compression story holds up consistently. IQ3_M on 27B dense regularly beats IQ4_XS on 35B-A3B on reasoning tasks specifically. MoE routing adds too much noise at high compression ratios.
odragora@reddit
Makes me think that Pro and Flash closed models, however they are named by the creators, are dense and MoE models respectively.
cosmicnag@reddit
I think both are MoEs (pro still larger), no way they have super fatass dense models
odragora@reddit
Maybe.
Perhaps they also have huge dense models for research, something they don't serve publicly but use themselves and give scientists.
TestingTheories@reddit
Thanks for this post... super interesting reading this thread given I have similar constraints.
LocalAI_Amateur@reddit (OP)
You're welcome. I find us 16gb vram users to be lower-middle-class citizens in this local AI society. We only have the 8gb peasants to look down upon. So sharing these tests probably helps somebody.
breadislifeee@reddit
The fact that this runs locally and is actually usable is the real win
Independent-Date393@reddit
27b dense at IQ3_M finding a bug that 35b MoE at IQ4_XS missed is a useful data point. been sitting on the same choice with 16gb vram and this is probably what settles it for me
Independent-Date393@reddit
the dense-handles-compression-better-than-MoE intuition checks out. at IQ3_M the 27B is still mostly intact. the 35B-A3B's routing logic is the first thing to break when you compress it.
JLeonsarmiento@reddit
Why don’t you go Q5 or Q6 with the MoE? Lack of ram?
LocalAI_Amateur@reddit (OP)
generation speed went down quite a bit when I tried. So I stuck with this one.
tomByrer@reddit
Oooh I love me some TD.
In this test, he had an issue de-minifying a large JS file. Got it to work by splitting.
https://youtu.be/In825VzHzbU?t=273
Thanks for testing the heretic model; I've heard that abliterated models are better at agentic coding.
Pyros-SD-Models@reddit
It looks like the one game every LLM on earth somehow wants to implement if you ask it for a small puzzle game: laser-refractor-puzzles :D
but yes, dense qwen best qwen
LocalAI_Amateur@reddit (OP)
actually it didn't look like this at first. It was fairly basic. I had it create an html screenshot of the game, generated various themes (terminal, vaporwave, glassmorphism, monochrome, etc.), and then merged the theme back into the game. It was a fun process. Too bad it still doesn't stand out enough, but the gameplay is slightly different from the usual tower defense games. It mainly only has 5 waves and doesn't take much time.
Great_Guidance_8448@reddit
Yea, I am really impressed with Qwen3.6 27b