Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better!
Posted by LocalAI_Amateur@reddit | LocalLLaMA | View on Reddit | 92 comments
A bit of context. I was coding up a little html tower defense game where you can alter the path by placing additional waypoints.
My setup: 32gb ram with 16gb vram 5070 ti. Using AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS on LM Studio with OpenCode. I've graduated from one-shot vibe-coding prompts.
The spec for this game was complicated enough that it couldn't have been done in LM Studio so I tried OpenCode. The project was chugging along, Qwen3.6 35b-a3b was getting things done when 27b dropped. Naturally I had to try it. Only problem is that I couldn't use any of the Q4 models due to vram issues, so I dropped to an IQ3_M model from mradermacher/Qwen3.6-27B-i1-GGUF.
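Rough napkin math on why (assuming roughly 4.25 bits per weight for IQ4_XS and 3.66 for IQ3_M; these are approximations, so actual GGUF sizes differ a bit): 27B x 4.25 / 8 is about 14.3 GB of weights, which leaves almost nothing for context/KV cache on a 16gb card, while 27B x 3.66 / 8 is about 12.4 GB, which leaves a few GB to spare.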
I worried that IQ3_M would be too much compression, but it did fine and was even able to find a difficult bug that the IQ4_XS version of Qwen3.6 35b-a3b couldn't. They say dense models handle compression better than MoE models. Is that the reason for this? What are other people's experiences with the 35b-a3b vs 27b versions of Qwen3.6?
Using LM Studio,
I got 50-60 tokens per second with Qwen3.6 35b-a3b (AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS) but the prompt processing gets real slow sometimes.
I got 40ish tokens per second with mradermacher/Qwen3.6-27B-i1-GGUF IQ3_M but it was decent speed throughout.
How are people's experiences with these two models at 16gb vram?
Oh, the Waypoint Tower Defense game is done and can be found on htmlbin. The save/load doesn't seem to work on their site, but if you download the file and open it in your browser, it'll work fine. It's a self-contained single html game. Meant to be like minesweeper but for tower defense.
Brilliant_Anxiety_36@reddit
im running qwen3.6 27b q4km without vision with turboquant via llamacpp with around 98k context. 7900XT 20gb
without turboquant i can fit 45k context
MistingFidgets@reddit
Can you share your recipe for getting 40 tok/s on a 5070? I have a 5060 and want to see how close I can get to that
LocalAI_Amateur@reddit (OP)
Context Length: 32,768
GPU Offload: 64
Max Concurrent Predictions: 1
Unified KV Cache: on
Offload KV Cache to GPU Memory: on
Keep Model in Memory: off
Try mmap(): off
Flash Attention: on
K Cache Quant: Q8_0
V Cache Quant: Q8_0
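If anyone wants to reproduce this on plain llama.cpp instead of LM Studio, I think the rough equivalent of the settings above is something like the line below (an untested sketch on my part; the model filename and layer count are placeholders, and the flash attention flag spelling differs between llama.cpp versions):
llama-server -m Qwen3.6-27B-IQ3_M.gguf -c 32768 -ngl 64 -np 1 -fa on --no-mmap --cache-type-k q8_0 --cache-type-v q8_0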
ridablellama@reddit
It's really impressive and a huge relief to know that if worse comes to worst, this will be the baseline. That can never be taken away from anyone who has 16-24 GB vram. No matter how expensive monthly cloud costs get, this will exist and be readily available. It wasn't as good as Claude Code, but I have done legit work with it, and the speed and context window were totally fine. In a few more weeks it will be fine-tuned and be even better. Wild times
phazei@reddit
But my 3090 died, and I'm very sad :( They cost over 2x what I paid now :'(
ridablellama@reddit
yikes, that's kind of scary knowing my 4090 is 2+ years old. maybe i should sell it soon and get that Max Q ive always wanted
rhythmdev@reddit
This is exactly the reason i built a 5090 machine. I got it, it is mine. By 2030, i will own it and be happy.
ridablellama@reddit
yep, I bought my 4090 over 3 years ago I think at this point? it's still worth most of what i paid for it. The resale factor is always left out of these discussions of cloud API costs. Once you pay, it's gone forever and not coming back.
SkyFeistyLlama8@reddit
In just two years of local LLMs, we've gone from stochastic parrots like Llama to Qwen 3.6 and Gemma 4 in roughly the same amount of RAM. You're right, this is the worst it will ever be.
I can't remember when I last used a cloud LLM over the past week.
crantob@reddit
A3B can't develop emergent thinking, 27B can.
It's like a hamster brain vs a smart dog.
FinalCap2680@reddit
Have not done much testing, just some html/css/js, but so far I like Qwen 3.6 35B-A3B most (UD-Q8_K_XL). It gives much better results for UI, and something that could actually serve as a starting point to build on.
Can't wait to see what Qwen 3.6 122B will be capable of...
76vangel@reddit
Great. Which IDE are you using? And if it's VSCode, which extension are you using LM Studio with?
LocalAI_Amateur@reddit (OP)
just notepad++. it's html and javascript, so no need for much more. LM Studio + OpenCode Desktop.
Weekly_Comfort240@reddit
I am using a QuantTrio/Qwen3.6-27B-AWQ quant with VLLM (2 x 48GB RTX 6000's, full context, 4 parallel). I started with 35B-A3B, but even though it was blistering fast, I absolutely cannot go back to it after experiencing the full thick goodness of this model. Simply put, 27B slays in deep understanding of what you ask it to do, and then doing it. I'm going to provide two examples where I hooked up the Claude Code harness front end to the vllm / Qwen 3.6-27B backend:
First example: Analyze ten word documents pertaining to a project involving healthcare integration, extract certain technical data and transform it, and analyze discrepancies according to my prompt. 1 hour 6 minutes later, it generated a report and deliverables exactly on spec.
Second example: Compare two codebases and give me the list of bugs I fixed in the first one, and ignore all the stuff involved in the platform migration I did from the first to the second codebase, covering hundreds of git commits. It uncovered stuff I completely forgot. 44 minutes and it cooked up a document that told me what bugs I fixed and how to propagate them back to the first codebase.
In my own short-hand personal comparison of these same projects between 35B-a3B and 27B, 35B will complete the projects in half the time but deliver results that do not reflect the depth of understanding that 27B has. Honestly, 27B makes it seem like I got my own mini frontier-model class robot on tap, with zero token budgets and no data leaving my office.
Hot-Business8528@reddit
I’m running the same model on 2x5090. What TPS are you getting, is it the 18tps in the last paragraph?
LocalAI_Amateur@reddit (OP)
looks like I can probably delete the 35b-a3b model to save some space.
Direct_Turn_1484@reddit
35B seemed to hallucinate a lot for me. I had to switch back to other coding models.
Independent-Date393@reddit
MoE models at IQ3 lose more than dense because you're compressing routing logic and expert weights simultaneously. dense models distribute quantization error more gracefully. 35B-A3B IQ4 probably beats 27B IQ3 on most tasks, but if routing was misfiring on your specific problem the switch would feel like an upgrade even at lower quant.
LocalAI_Amateur@reddit (OP)
I know my experience is anecdotal so I'll probably switch back and forth between the two if I come across more bugs. Maybe this was just a fluke. Who knows.
dampflokfreund@reddit
I don't think so. I have compared Q4_K_L to Q3_K_XL (both from Bartowski) and the Q3 did excellently, and frequently better, in many of my tests. I don't know why, it is strange.
KillerX629@reddit
I'm looking to get more tokens per second; the noticeable slowdown gives me a lot of friction for the switch
MasterLJ@reddit
I'm getting 150 tok/sec on a rented H100 that shuts down when I'm not using it. So much cheaper than API hits because it's by the hour and not per token.
Ambitious_Worth7667@reddit
....Login...?
blackashi@reddit
Company?
j_lyf@reddit
Tutorial?
Hot-Employ-3399@reddit
Switch to pythia 14M model. Tokens will fly through the roof.
KillerX629@reddit
Nah, with qwen 35a3b i get 100t/s more or less. But the 27b dense one gets me only 30 tokens per sec
Kiro369@reddit
If you have to prompt it like 3 times to achieve something that the dense model would get right the first time, is it really faster? Quality matters in that equation
CodeDominator@reddit
In coding accuracy > speed.
LocalAI_Amateur@reddit (OP)
this
rpkarma@reddit
It makes up for it by not thinking as much and getting correct answers faster in wall-clock time though.
Chiralistic@reddit
Since qwen3.6 35b is a MoE model you can load a Q8 version of that model without losing much speed. I bet that codes even better.
Danmoreng@reddit
If you’re using it for coding, speculative decoding should improve speed further. Not sure if LMStudio has that though, you will most likely need plain llama.cpp for that. Tested it out today on my Laptop RTX 5080 16GB with IQ3_XXS. I get ~30 t/s normally, and if it repeats lots of pre-existing code that goes up to 50-80 t/s.
If you want to run llama.cpp I got powershell & bash scripts to compile from source and run Qwen3.5/6 models here: https://github.com/Danmoreng/local-qwen3-coder-env
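The classic form of it needs a separate small draft model, something roughly like this (the filenames are placeholders and the draft settings are just starting points, not a tuned setup):
llama-server -m Qwen3.6-27B-IQ3_XXS.gguf -md small-draft-model.gguf -ngl 99 -ngld 99 --draft-max 16 --draft-min 1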
Ill_Barber8709@reddit
LM Studio has had a speculative decoding feature since February 18, 2025
https://lmstudio.ai/changelog/lmstudio-v0.3.10
Danmoreng@reddit
Sorry, let me rephrase: self-speculative decoding without draft model. https://github.com/ggml-org/llama.cpp/pull/22223
suprjami@reddit
Does this actually work for you?
It makes no difference for me.
I see the memory allocation in the logs (16 MiB?) but the best acceptance rate I got was 2 out of 2 drafts.
I've used draft models in the past with noticeable speedup.
Danmoreng@reddit
Yes it does for specific workload: repeating much of the previous input. Tested this yesterday with the Qwen3.6 27B model on my 5080: asked it to create an html website ~27 t/s. Asked it to make an edit to the previously generated website ~50 t/s.
suprjami@reddit
Ah I see. Well, at 16 MiB usage, no harm in leaving it on for occasional speedup. Thanks!
LocalAI_Amateur@reddit (OP)
I'm going to have to read up on this feature. Tho from what I've read so far, I'm not sure I have much vram to fit another smaller model. Maybe after Qwen3.6's smaller models come out, I can give this a try.
Danmoreng@reddit
It’s not done with a smaller model; it re-uses parts of the prompt and does hash matching. Very lightweight, documented here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md#n-gram-mod-ngram-mod
simracerman@reddit
Q4_K_XL is killing it for me.
LocalAI_Amateur@reddit (OP)
what's your vram tho? I only picked IQ3_M because I needed to leave context room.
simracerman@reddit
5070 Ti 16GB. I offload some to iGPU using Vulkan backend:
${llamasvr-vulk} -m ${mpath}\Qwen3.6-27B-UD-Q4_K_XL.gguf --no-mmap -c 72000 ${quantKV-8.0} -np 1 --ngl 63 --chat-template-kwargs "{\"preserve_thinking\":true}" --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0
LocalAI_Amateur@reddit (OP)
ah, well my igpu is pretty weak so I'm not sure if it's worth the trouble to build llama.cpp for it. I'll keep that trick in mind if I really need to squeeze out more tokens/second tho.
simracerman@reddit
Which one do you have?
Mine is AMD 890m. It’s middle of the road, but nothing to brag about
LocalAI_Amateur@reddit (OP)
Ryzen 7 7840U w/ Radeon 780M graphics. I'm using my RTX 5070 ti through oculink. You have a link on how to use vulkan offload? from what I searched they all say I have to build my own llamacpp. I have no idea what your "${llamasvr-vulk}" variable means and which exe file it is.
simracerman@reddit
It’s late here, but I will post a link with more info and exact instructions.
Since you have the 780m (solid iGPU), it’s faster than CPU offload, quieter during inference, and consumes less energy. My setup is also via Oculink.
${llamasvr-vulk} Is a macro I use so I don’t have to type C:/path/to/llama.cpp/llama-server.exe
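Quick version in the meantime: you shouldn't need to compile anything. The llama.cpp GitHub releases page ships prebuilt Windows Vulkan binaries (the win-vulkan-x64 zip), and with a Vulkan build both the 5070 Ti and the 780M show up as devices, so you can split layers between them with --tensor-split. A rough sketch (the split ratio is a guess you'd tune to what fits on each device, and the model path is a placeholder):
C:\path\to\llama-server.exe -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 -ts 0.85,0.15 --main-gpu 0 -c 32768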
yeah_me_@reddit
how much tps do you get with that setup? I am also thinking about going the eGPU + iGPU offload route, I wonder how much potential speedup is there (since I am getting 11-ish tps on Strix Halo)
simracerman@reddit
At empty context I get 15 tps, but with the new spec decoding it does up to 20 sometimes.
LeonidasTMT@reddit
I'm also using a 5070 Ti but settled for IQ3_XXS and 4-bit K/V cache
Q3_K_M was sort of unstable for me when running larger context sizes and trying to load everything to GPU and would freeze up my pc
Bloedbek@reddit
Can I ask what's your context limit set at?
LocalAI_Amateur@reddit (OP)
32,768
Express_Quail_1493@reddit
Modern dense models are usually better than any MoE 3x their size. qwen3.6-27b is on par with qwen3.5-397B. An MoE is still just... an MoE. Raw active params win on coherence, stability, and reliable outputs.
TheyCallMeDozer@reddit
So i noticed something strange using the official models: the 35B-A3B is fast enough in LM Studio to run 4 prompts consecutively with no issue. Switch down to the 27b model and it's incredibly slower, like 5x the time to run a single prompt. The 35B-A3B gets maybe 208-243 tok/s, while the 27b on the same setup, thinking disabled etc., gets 21 tok/s?
uti24@reddit
While Qwen3.6 27b is much, much slower on the same hardware (like, 5× slower?) than Qwen3.6 35b-a3b, it still finishes tasks faster.
You have to babysit Qwen3.6 35b-a3b — it just doesn’t have the capacity to be as creative as Qwen3.6 27b, and it can’t figure out tricky moments. And Qwen3.6 27b is more like point-and-shoot: it will finish tasks without extra hiccups.
So even with a slow-ish AMD AI thing, I am much happier with 27B (although I was already somewhat happy with 35b-a3b, but then 27b dropped). Also funny how Qwen3.5 27b didn't feel that way.
Dany0@reddit
Try the xtremeAI RYS. I swear this time it's even smarter (with caveats ofc)
XccesSv2@reddit
Hmmm, just a 4-bit quant available. I'm using UD_Q8_K_XL, so it wouldn't be smarter, right?
Dany0@reddit
It definitely will be smarter. base Q8 might be more coherent very close to ctx limit
I think if you want you can dequant the q4, 2-3 hours on a good cpu iirc, or you used to be able to do it via google colab
odragora@reddit
18.6 GB, won't fit 16 GB VRAM GPUS unfortunately.
LocalAI_Amateur@reddit (OP)
tried. 8 tokens per second on my setup. no go sir.
UnlikelyTomatillo355@reddit
at these sizes, no a3b or e4b is going to be as good as something dense. the 27b is way better, same with gemma 4 31b.
Ranmark@reddit
I was also daily driving 35b a3b, but since the release of 27b I immediately switched. Even tho it's 2-3 times slower in my setup, it does the job better and with fewer mistakes, so fewer rewrites.
jeremynsl@reddit
You can use a much higher quant of the MoE, and it will probably be faster too. Check my post history; I just had a large discussion on this. I am using Q5 on an 8gb GPU, much faster than IQ4_XS. I'd say you can go Q6 for sure.
LocalAI_Amateur@reddit (OP)
problem is I only have 32gb of ram, not 64. At least with the few bigger ones I tried, I got a significant slowdown in speed. Does a Q5 or Q6 version of Qwen3.6 35b-a3b beat an IQ3_M version of Qwen3.6 27b? I don't know, maybe it does, but it takes quite a bit of time to test all the maybes. I can only speak to my experience switching between the two models I've used.
jeremynsl@reddit
Couldn’t say if it beats 27b, probably not. Will be MUCH faster though. You have 48gb total RAM which should be plenty for Q6.
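For reference, on plain llama.cpp the usual way to fit a big MoE quant on a small GPU is to keep the attention and shared layers in VRAM and push the expert tensors to system RAM with --override-tensor. A rough sketch (the filename is a placeholder and the tensor regex may need adjusting for this model's tensor names):
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 32768
Since only ~3B params are active per token, it stays reasonably fast even with the experts in RAM.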
LocalAI_Amateur@reddit (OP)
I have to repeat what others have said. In coding accuracy > speed.
FortiTree@reddit
Time to upgrade to 32Gb vram :)
I'm on Strix Halo and can run 122B-A10B IQ4XS at 20 tk/s. Reading all this makes me think I should push for Q5 instead due to routing compression. Or fall back to 27B Q8 at the blistering speed of 7 tk/s, or Q4 at 13 tk/s. My projected daily driver is 35B-A3B Q8 for no compression loss at 50 tk/s.
lousyzen@reddit
what's the context window you use?
LocalAI_Amateur@reddit (OP)
32,768
YairHairNow@reddit
q4_0 quants; the 2-GPU setup is 5080+2080. It's beneficial on the 35B MoE (22GB) to prevent offloading.
https://github.com/Danmoreng/local-qwen3-coder-env
Stainless-Bacon@reddit
What is the speed if you offload to CPU instead of second GPU? I’m wondering if I should grab a cheap used GPU with 4 or 8 GB to complement my 16 GB main instead of CPU offload.
YairHairNow@reddit
I'm 1x pcie crippled at the moment, but MOE models are faster if the model can fit in vram. However if the model were to fit on my 5080 alone it would be about twice as fast, with the tradeoff being a q3 quant.
For dense models, CPU is faster, but it suffers more from the pcie 1x penalty compared to moe, which barely has penalty.
I'd say go for it. I've been having fun figuring out ways to use it.
Benchmarks with the 22gb q4 MoE Qwen3.6 35B on 5080 and 5080+2080.
admajic@reddit
I'd say the 27b is way better. I can run the 35b at 110 token/s and the 27b at half that speed, but the 35b will take 30 mins to complete a task due to having to fix stuff at the end, vs the 27b needing less fixing at the end, so it's ultimately faster.
ayylmaonade@reddit
You can get far more visually impressive results out of this model. If you're just messing around, go ask it to generate a ThreeJS Voxel Pagoda world. Or pretty much anything using ThreeJS/WebGL.
LocalAI_Amateur@reddit (OP)
Well, my personal understanding of ThreeJS/WebGL is limited and I didn't want my first test to be a ton of code I totally don't understand. AI code is ugly enough as it is. I went through three rounds of cleanup / optimization to get it into its current state. I've specifically asked it to optimize for human readability.
ayylmaonade@reddit
Hey, fair enough. Just thought I'd let you know! I've seen some really impressive things from Qwen 3.6. I'll include a couple of examples of generations I've had from an MXFP4 quant of 35B-A3B. 27B must be great.
Simulated Browser OS
The pagoda I mentioned
Have fun experimenting!
Independent-Date393@reddit
the dense > MoE compression story holds up consistently. IQ3_M on 27B dense regularly beats IQ4_XS on 35B-A3B on reasoning tasks specifically. MoE routing adds too much noise at high compression ratios.
odragora@reddit
Makes me think that Pro and Flash closed models, however they are named by the creators, are dense and MoE models respectively.
cosmicnag@reddit
I think both are MoEs (pro still larger), no way they have super fatass dense models
odragora@reddit
Maybe.
Perhaps they also have huge dense models for research, something they don't serve publicly but use themselves and give scientists.
TestingTheories@reddit
Thanks for this post... super interesting reading this thread given I have similar constraints.
LocalAI_Amateur@reddit (OP)
You're welcome. I find us 16gb vram users to be lower-middle-class citizens in this local AI society. We only have the 8gb peasants to look down upon. So sharing these tests probably helps somebody.
breadislifeee@reddit
The fact that this runs locally and is actually usable is the real win
Independent-Date393@reddit
27b dense at IQ3_M finding a bug that 35b MoE at IQ4_XS missed is a useful data point. been sitting on the same choice with 16gb vram and this is probably what settles it for me
Independent-Date393@reddit
the dense-handles-compression-better-than-MoE intuition checks out. at IQ3_M the 27B is still mostly intact. the 35B-A3B's routing logic is the first thing to break when you compress it.
JLeonsarmiento@reddit
Why don’t you go Q5 or Q6 with the MoE? Lack of ram?
LocalAI_Amateur@reddit (OP)
generation speed went down quite a bit when I tried. So I stuck with this one.
tomByrer@reddit
Oooh I love me some TD.
In this test, he had an issue de-minifying a large JS file. Got it to work by splitting.
https://youtu.be/In825VzHzbU?t=273
Thanks for testing the heretic model; I've heard that abliterated models are better at agentic coding.
Pyros-SD-Models@reddit
It looks like the one game every LLM on earth somehow wants to implement if you ask it for a small puzzle game: laser-refractor-puzzles :D
but yes, dense qwen best qwen
LocalAI_Amateur@reddit (OP)
actually it didn't look like this at first. It was fairly basic. I had it create an html screenshot of the game, generated various themes (terminal, vaporwave, glassmorphism, monochrome, etc.), and then merged the theme back into the game. It was a fun process. Too bad it still doesn't stand out enough, but the gameplay is slightly different from the usual tower defense games. It mainly only has 5 waves and doesn't take much time.
Great_Guidance_8448@reddit
Yea, I am really impressed with Qwen3.6 27b