Qwen3.6 35b-a3b
Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 116 comments
Originally I was a diehard fan of Gemma4 26b-a4b because it really is a remarkably intelligent LLM. I ran Qwen3.6 via Ollama and found it impressive, but still favored Gemma. Ollama did it a disservice, at least on my PC.
Ran it straight through llama.cpp and it is much faster than Gemma4 26b-a4b, roughly equivalent in general intelligence, better in strict prompt adherence, and it doesn't slow down on long context. Like, I'm back to being a Qwen fan.
Just thought I'd share haha
our_sole@reddit
I am just stunned at how well Qwen3.6-35B-A3B MoE is working for me. I have an RTX 3090 (24GB VRAM) and 64GB RAM on a Beelink GTi14 (Ultra 9 185H CPU) with the Beelink eGPU dock.
I switched from LM Studio to llama.cpp (not because LMS had any issues, I had just heard that llama.cpp was faster and very tunable).
I spent some time tuning llama.cpp with the LLM, got the pi.dev harness running, and started getting great results.
Up until now, local AI was just kind of a playtoy and I used Claude for heavy lifting and Copilot VS Code for medium/light stuff.
I'm getting close to 100 tk/s. I have been trying increasingly difficult tests/prompts and it's handling them fine. It feels close to Haiku or maybe Sonnet (but not Opus, obviously). I vibe coded a Flask/JavaScript/Tailwind CSS app with local browser storage and it nailed it. Based on my PRD, it even found and added sample data so I could test things.
If I can use it for 60 or maybe/hopefully 70% of my daily AI coding and start to untether myself from the Anthropic usage circus, I'll be quite happy. Unlimited tokens are awesome.
There are GitHub PRs for a cache invalidation bug and for full MTP support in llama.cpp, which I hope will get merged soon. These should make the setup even better.
Local AI is becoming very powerful. Exciting times!
cheers
sirmeow-meow@reddit
What did you do in llama.cpp to get it to 100 t/s?
siegevjorn@reddit
You can use the -ot flag to keep the expert layers on the CPU. It needs a bit of engineering but works like a charm.
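A minimal sketch of what I mean (the model path and the exact tensor regex are just placeholders; check your GGUF's tensor names):
```
# Offload all layers to GPU, then override just the MoE expert tensors back to CPU/RAM
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "\.ffn_(up|down|gate)_exps\.=CPU" \
  -c 32768
```
Then watch the reported tok/s (or use llama-bench) and adjust from there.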
tylerhardin@reddit
Not an expert, but I think -fitt superseded the need to tune -ot manually a couple months ago
siegevjorn@reddit
Ha, thanks, will look into this.
tylerhardin@reddit
The idea behind the -fit/-fitt args is to automatically offload as much as possible, presumably attempting to offload the fastest layers first, like expert layers. You can sometimes do better, but it tries to do what you do with -ot for you. Before -fit was added, it had to be done manually with -ncmoe or -ot.
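For reference, the older manual route looked something like this (the layer count is just an example you'd tune until you stop OOMing):
```
# --n-cpu-moe N keeps the MoE expert weights of the first N layers on the CPU,
# a coarser version of the -ot regex approach
llama-server -m model.gguf -ngl 99 --n-cpu-moe 20
```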
siegevjorn@reddit
Yes. But -fit doesn't supersede or replace -ncmoe or -ot. That's my point. What -fit does and what -ncmoe/-ot do are functionally different.
tylerhardin@reddit
https://grok.com/share/bGVnYWN5LWNvcHk_7989596b-f702-42e7-af82-2355fc1fba55
That's exactly what it does do. That's the whole point. It's solving the optimization problem algorithmically.
siegevjorn@reddit
Sigh. Referencing AI without verifying? Can't be serious, people.
tylerhardin@reddit
I use it. I'm not doing research into something I'm actively using successfully. Better things to do. You could use it too and save yourself some time. Or not, makes no difference to me.
siegevjorn@reddit
Actually, it makes a lot of difference. Because I'm speaking from experience. If you don't have time to verify yourself, why argue?
tylerhardin@reddit
Decided to find it for you. You're welcome.
See common/fit.cpp:
siegevjorn@reddit
Ok, thanks. I see your point now.
You're right about that:
-fit does do some optimization for MoEs. For instance, -fit does keep all the active weights on the GPU. But what it doesn't do also matters for speed: for instance, all the attention weights and the KV cache. If those are left on the CPU, that's a performance hit, and -ot lets you load all of those weights onto the GPU, so it's a better optimization for speed.
I assume you've got a dual symmetric GPU system. In a dual-GPU system, the performance difference between -fit and -ot may be minor. But when you have a single fast GPU, the advantage of -ot becomes much more significant.
tylerhardin@reddit
I'm running exactly the type of system you are, actually -- GLM 5.1 on a single RTX 6000 and an Epyc. It works as well for me as the tool I wrote before (better actually). The reason it tags different layer types in the code is that there's a priority hierarchy. It puts the most important layers (for speed) on gpu, then moves down the hierarchy, packing as many as possible. It tries to do what you've been doing automatically. The biggest issue I've had with it is that sometimes its estimate is wrong and you get a CUDA OOM error, in which case you have to set -fitt higher. It's actually really hard to predict the exact total CUDA allocation needed for inference (my tool was very tedious to debug, hence why I was happy to abandon it).
Have you tried it since we've been chatting? I bet it'll be basically the same.
siegevjorn@reddit
Oh, I see. Yes, I understand. Using -ot could be really painful with OOMs.
Yes, I have tried it. I have an RTX 4090 and a 5060 Ti. I saw a notable speed difference when running MoEs with -fit vs -ot, like a 20-30% speed difference favoring -ot.
tylerhardin@reddit
I'm not arguing. I'm telling. I wrote a tool myself to deduce optimal -ot args before replacing it with -fit after llama added it. We're not on the same level. You use tools. I make them. I don't debate people below me. You can take the info from your better or not. I'm not wasting more time on you.
Southern-Expert22@reddit
CPU? I load the active weights via GPU and pin the rest in RAM. Is that what you mean, or something else?
siegevjorn@reddit
Yes expert layers are inactive. By definition.
huzbum@reddit
IQ4_NL fits comfortably on my 3090 with 256k context with Q8 KV cache. 110tps.
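For reference, my launch flags look roughly like this (model path is a placeholder; depending on your llama.cpp build the flash-attention flag is -fa or --flash-attn on):
```
# ~256k context on a 24GB card by quantizing the KV cache to Q8
llama-server \
  -m ./Qwen3.6-35B-A3B-IQ4_NL.gguf \
  -ngl 99 \
  -c 262144 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```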
AdIllustrious436@reddit
I hit 90 to 110 tok/sec on a single 3090, 130k context. I could probably go up to 262k with aggressive KV quantization. Gemini Flash 3 intelligence and throughput level. Mind-blowing.
Dependent-Guitar-473@reddit
What exactly did you change when you say "fine-tune it"?
I am a beginner, I used 9B model (20T/s) and 35B (64t/s)
I would love to get more out of them
Public_Umpire_1099@reddit
I have basically your setup minus the GPU for my homelab. Even on that hardware, I still get bare minimum 30 tokens a second on properly quantized MTP models at Q4. Legitimately usable.
FormalAd7367@reddit
Just curious - why do you prefer A3B over 27B?
AdIllustrious436@reddit
3x faster throughput and only a marginal intelligence difference in my testing. The catch is fitting the full context on one 3090.
DeSibyl@reddit
Curious if you think it would be good as a general assistant? Right now I've been using Gemma 4 31B as my daily general assistant, and I only get about 30 t/s. I tried using Qwen 3.6 27B since I can get higher context and also use MTP to get 70 t/s, but it sometimes would get stuck in a thinking loop. Often enough that I switched back to G4... I mainly use it for work: proofreading emails, asking it to create drafts based on pictures of info, and such. Maybe some coding stuff.
jopereira@reddit
It's interesting... I have yet to encounter a single loop with 27B IQ3 XXS (using turbo3 !!!). But I'm ALWAYS in no-thinking mode. It solves every single problem I throw at it!
AdIllustrious436@reddit
I've also had the thinking loop on 35B. It usually recovers, but it's not ideal. Never tried G4 (I mainly do agentic dev), but from what I've read, most people rank G4 26B-A4B as the best all-rounder in this weight class. The 31B just feels too heavy for my setup to run comfortably.
DeSibyl@reddit
Yea, fair enough. I'm just worried a 3B-active or 4B-active MoE isn't going to be smart enough to pull data from pictures correctly. The screenshots I send contain a decent amount of numbers, so I'd like it to be accurate and reliable. (I always double-check, but still.)
lumos675@reddit
It's pretty smart even for coding. I set it up as a service so when my computer turns on, the model loads as my personal assistant for everything. I get 210 tokens/s on a 5090.
AdIllustrious436@reddit
PS :
This alone makes me question the claim tho...
bnightstars@reddit
My 35B is actually working great with Claude Code as the harness, but you need hardware that can handle all the prefill tokens Claude Code loves to spend. And the llama.cpp cache invalidation issues are not helping with that.
our_sole@reddit
You have claude code running against local qwen3.6-35b-A3B running under llama.cpp?
Could you share your claude shell script or bat file that does this (the env vars, --model, config, etc..)?
I tried for quite some time to do this and claude just flatly refused to use the model. It saw the model, but wouldn't use it: "There's an issue with the selected model..it might not exist or..."
bnightstars@reddit
I run on a MacBook, so MLX, but overall it's just a set of ENV vars that point Claude Code to a local LLM; the DISABLE_NONESSENTIAL_TRAFFIC one is key.
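Roughly, it looks like this (the URL, token, and model name are placeholders for whatever your local server exposes; double-check the exact variable names against the Claude Code docs):
```
# Point Claude Code at a local OpenAI/Anthropic-compatible endpoint
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="local-key"             # any non-empty string for a local server
export ANTHROPIC_MODEL="qwen3.6-35b-a3b"            # must match what your server reports
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1   # the "nonessential traffic" switch mentioned above
claude
```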
Opposite-Station-337@reddit
Just the first section variables got me going. Been using pi agent though.
https://www.reddit.com/r/LocalLLaMA/s/u0Fuj1kdBC
our_sole@reddit
Thanks much! I'll test this again today.
Cheers
superdariom@reddit
Which quant are you using?
our_sole@reddit
The unsloth dynamic UD-Q4_K_XL
Altruistic-Dust-2565@reddit
What about 262K context speed? That's like the minimum for usable coding now.
Cyber_Ghost@reddit
I find Gemma the most useful model for most knowledge-related tasks, and it helps me pretty well with translation and grammar (learning Italian).
I wanted to do some evaluations on the models using custom tests, so I let Claude Code build a test suite for doing it. I wanted to compare Gemma4 26B-A4B at FP8, Gemma3 31B at Q5 but now I'll also add Qwen3.6 35B-A3B as well. Sounds like an interesting idea to test.
It's running against 26B now on 2xB70 cards at max context.
nickless07@reddit
Can you add the qwen 122B too?
Cyber_Ghost@reddit
122B-A10B finished; at Q3 it wasn't very good in the tests. You can check the GitHub for Claude's conclusions.
nickless07@reddit
Oof, damn. I thought it was only bad at coding, but would perform better in such tasks due to its world knowledge.
Cyber_Ghost@reddit
I expect it to be better at a reasonable quant.
nickless07@reddit
Idk man. I can only run Qwen3.5-122B-A10B-UD-IQ2_XXS at ~4 tokens/s, and for the few runs I used it, the writing style was much better than Qwen3.6 at Q8. I know MoEs suffer more than dense models from quants, but for me it was pretty decent even at that low quant. Then I read everywhere that it is bad with code (not my use case at all), and it is hard to find any tests that don't aim at coding. I really hope we get that as a Qwen3.6 too, but it looks like we are out of luck there.
Cyber_Ghost@reddit
Did you try Gemma for writing?
nickless07@reddit
Yeah, and whenever that one got stuck, I switched to the 122B to fix the problem. For example, I was working on a prompt with Gemma 4 for hours, a lot of back and forth, then it was running in circles; I switched to Qwen3.5 and got it done in a single shot. It's just that my English isn't the best, and where Gemma beats around the bush, the 122B Qwen jumps in straight with the right phrases.
Cyber_Ghost@reddit
Just an FYI - Managed to get 122B-A10B to load at Q3, going to run the tests now.
Cyber_Ghost@reddit
I don't think I have the VRAM to run it in a good config, but I'll try to run Q3 and see what I get from it.
Cyber_Ghost@reddit
So, Claude took a while to finish all the tests on the 2 Gemma models and one Qwen.
I've had it dump everything into github.
I'll write a short post about it later but in general Gemma 31B seems to be the winner in the types of tests Claude made with Qwen3.6 35B-A3B falling behind. I'll try to get 122B-A10B running on my setup to test too but not sure how it'll perform on my system.
https://github.com/pelegw/llm-eval/blob/main/ANALYSIS.md
No-Juggernaut-9832@reddit
Friends don't let friends use Ollama. llama.cpp, or MLX if you are on Apple.
Southern-Expert22@reddit
Use YaRN and Google turboquant to get a 1 million token context window, and run with --no-mmap. I'm telling you, this model is better than Opus, and you get the 1 million token window without it losing track of big projects.
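If you want to try the YaRN part in llama.cpp, the rope-scaling flags look roughly like this (the scale factor and original-context values are only illustrative; match them to your model's native context):
```
# Stretch a model's native context with YaRN rope scaling (values are examples)
llama-server -m model.gguf -ngl 99 \
  -c 1000000 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 262144 \
  --no-mmap
```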
cell-on-a-plane@reddit
Thanks!
I've been having good luck with vLLM on my 5090.
FIdelity88@reddit
Why not llama.cpp? It supports MTP now
cell-on-a-plane@reddit
Why not vllm?
FIdelity88@reddit
Better performance
Karyo_Ten@reddit
Try running 10 concurrent agents on llama.cpp and watch perf flounder, while vLLM and SGLang can do an aggregate 1000+ tokens/second.
FIdelity88@reddit
On what hardware would you run 10 concurrent agents with 1000+ tk/s?
Karyo_Ten@reddit
2x RTX 5090 can. I can reach 1500 tok/s on 2x RTX Pro 6000.
scooter_de@reddit
Not in the official branch though
BraceletGrolf@reddit
I assume you run it quantized, or can you share your setup ?
cell-on-a-plane@reddit
```
vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
  --dtype half \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 2 \
  --limit-mm-per-prompt '{"image": 0, "video": 0}' \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --port 8080 --host 0.0.0.0 \
  --trust-remote-code \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --compilation-config.cudagraph_mode none \
  --enable-prefix-caching
```
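Quick sanity check once it's up (the port matches the serve command above; the model name is whatever /v1/models reports):
```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Lorbus/Qwen3.6-27B-int4-AutoRound", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 64}'
```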
DaMoot@reddit
Congrats on getting it working for your needs! For me it was a dumb, looping mess even with the correct model card settings. Wasn't even all that fast in llama.cpp.
Help out the rest of us; what are you actually using 35b-a3b for? Where is it succeeding, where is it failing?
For me Qwen3.5 27B is the current best to fit in 32gb for email scanning, classification and alerting, SIEM log processing, ticket generation/changes in ticketing system, light vibe coding (I do all my vibe on Claude), SQL access, web searches... 3.6 seems to have introduced some looping.
cognitium@reddit
Did you try 35b with thinking turned off? It wasn't usable for me until I turned it off. And it's 8x faster than 27b on my card.
DaMoot@reddit
You know what, I have not! I will give that a try this afternoon for my normal workflow and see how things go.
Thanks!
Civil-Reporter7812@reddit
Why are you still using Ollama? And I mean, seriously, why? It's rotten software with worse performance than llama.cpp
https://sleepingrobots.com/dreams/stop-using-ollama/
bighead96@reddit
There's a common belief that Gemma4 is very smart. It's not; it's actually very dumb. It's very good at confidently telling you it's fixed things, and here are the issues and how it resolved them. If you create a bunch of bugs and ask it to fix them, it will confidently tell you it fixed them, and almost none will work properly. It's like that friend you have who is dumb as a box of rocks but tells you they are an expert at everything. And they will look you dead in the eyes and be like, trust me bro. No, I'm not trusting you, because you break everything you touch as far as coding goes.
Maleficent-Ad5999@reddit
I find Qwen 3.6 35B often deviating from my instructions despite not having too much context. For instance, I tell it "ok, just focus on addressing #1 from your suggestion" and it proceeds to fix all of them.
cleversmoke@reddit
Same, that's why I moved to 27B: 27 tok/s vs 80+ tok/s on the 35B-A3B, but 27B is spot-on at following instructions and gives great output. 35B-A3B was a lot of rolling the dice for a good seed, in my early experience with it.
Imaginary-Unit-3267@reddit
Yes, you have to explicitly tell it what not to do, not only telling it what to do. But if you give it explicit "no" instructions, it follows them as long as it doesn't have such a bloated context that it's getting generally degraded cognition.
bighead96@reddit
Oh wow, interesting! I haven't been using it much yet, but I'll have to see if that's the case. I started finding people saying to just use the 27B, that it's much better.
rpkarma@reddit
I was shocked at how consistently bad Gemma 4 31B is at code analysis on my eval suite. Makes up the same fake bugs over and over lol
dim722@reddit
Like many others, I was initially impressed with Gemma on my mediocre setup, but then I realized one thing: Gemma is not smart - it just gives the impression of being smart through fast output and oversized responses. It's all about talking, nothing else.
The model has terrible tool discipline and is basically incapable of applying edits, no matter what harness you try to use - and I've tried them all, including tricked versions of Claude Code and Codex. It seems that all Gemma-4-class models inherited the same tool-related issues, since dense Gemma exhibits the same behavior.
thisguynextdoor@reddit
I'm having continuous reasoning loops with Qwen. It's almost unusable. You could also try Gemma 4 26b with the multi-token prediction assistant. It will speed up Gemma 2-3x.
cmndr_spanky@reddit
You almost certainly have param settings / context window settings wrong with Qwen. It does think for a while but not like that.
thisguynextdoor@reddit
I'm running it on LM Studio with default settings. Context window of 128k. Just today I stopped it from looping after 50k tokens of endless reasoning:
Final answer:...
Am I consice?
Wait, user might mean...
I should reconsider
OK, final answer
No, wait..
cmndr_spanky@reddit
The fact that you couldn't answer my question tells me everything I need to know :)
thisguynextdoor@reddit
You haven't asked anything.
cmndr_spanky@reddit
What parameters are you using?
yeah-ok@reddit
I've been strict about --no-reasoning lately and having plenty of success with one-shot programming extensions for pi agent, etc. I think we have to remember that top-k is, in a sense, a selection out of what is already a latent thought process in the model.
onewheeldoin200@reddit
You for sure have to use the recommended settings from the Qwen team. Unsloth etc. all have them on their HF pages too.
FranticBronchitis@reddit
It sure helps but it's still noticeably loopier than gemma4
DR4G0NH3ART@reddit
Have you put a repeat penalty and a reasonable temperature?
No_Swimming6548@reddit
What default settings? Qwen has its own ideal parameters for both coding and general purpose use.
the_fabled_bard@reddit
Their coding settings (which I use) actually encourage thinking loops. It happens often. I just stop it and say "you were spiraling" and it almost always picks up where it was, recognizes it needed to shift perspective, and gets the task done.
But it does happen often. Letting it run for a long time unattended is a recipe for failure.
Snoo_81913@reddit
Set the reasoning budget to 4096.
EffectiveMedium2683@reddit (OP)
That's interesting. I haven't seen that. I did see it with Qwen3.5. In the past, when that was a major issue with most models, I just used a small fine-tuned model to watch it and, if it seemed to be looping, inject a message to stop thinking or even just force the tag. Have you tried adding something like "If at any point you are uncertain, just ask me. I won't bite."? I know that sounds ridiculous, but these CoT models have anxiety or something haha
Snoo_81913@reddit
What this guy said. Also make sure reasoning is set to 4096 to prevent overthinking; adjust to your needs, of course. Then, if you're using turboquant_plus with Qwen models, there are some K/V settings that will cause ????? or ////// output. Stick with turbo3 for V and Q8 or Q4 for K, and that should stop it.
sid351@reddit
Do you know more about these k/v settings?
I'm running non-turbo quants and getting frequent (6 times today) "terminal thinking loops" where the token generation gets stuck just repeating "/" endlessly until the maximum length is hit for the prompt.
I'm running llama.cpp on Windows, and I have a post where I've detailed my setup and things I've tried so far.
Snoo_81913@reddit
Sure, put the link in the post and I'll take a look.
sid351@reddit
Amazing, thanks:
https://www.reddit.com/r/LocalLLaMA/s/oCHseapcdr
Snoo_81913@reddit
Alright man, I took a look. I'm pretty sure I know what's going on and I've got a pretty decent way to test it that won't take too long. Give me like an hour or two and I'll do a pretty big post over there with everything.
sid351@reddit
Omg, you're a saint.
Snoo_81913@reddit
It's up.
anykeyh@reddit
You need to take the time to tweak the params to your taste. Honestly, I was like you early on, but after a bit of trial and error I am now very happy; no more endless thinking.
Here are my parameters:
--temp 0.7 --top-k 20 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.03 --frequency-penalty 0.05 --presence-penalty 1.5
I also keep the reasoning budget at 4096 tokens.
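Putting it together, a minimal llama-server launch with those samplers looks roughly like this (the model path and context size are placeholders; I leave the reasoning budget to the harness, so it isn't shown here):
```
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -c 65536 \
  --temp 0.7 --top-k 20 --top-p 0.95 --min-p 0.01 \
  --repeat-penalty 1.03 --frequency-penalty 0.05 --presence-penalty 1.5
```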
SnooPaintings8639@reddit
What inference engine are you using? vLLM, llama.cpp, or something else?
DaMoot@reddit
Which Qwen?
Do you have the model launching with the recommended settings? There's a big difference between running it with no settings (just launch defaults) and with the correct settings.
vanbukin@reddit
Try latest vN chat template https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/tree/main/qwen3.6
thisguynextdoor@reddit
Thanks! I'll try it right away.
abnormal_human@reddit
I'm seeing the same. It just thinks for 100k+ tokens, repeatedly backtracking.
siegevjorn@reddit
After the llama.cpp MTP PR, Qwen3.6 speed is truly insane.
JsThiago5@reddit
How much faster is it in comparison to Gemma4 26b?
EffectiveMedium2683@reddit (OP)
On my OptiPlex 3000 with an Intel 12th-gen i5 (Alder Lake) and zero GPU, I'm getting 12 tokens per second on long context. Like, it doesn't slow down. Gemma 4 26b-a4b, once it gets past like 10k context, I start seeing it slow down from ~11 tokens per second all the way down to like 8 tokens per second.
Living-Office4477@reddit
ddr4 ram?
EffectiveMedium2683@reddit (OP)
Yeah. It had a 16GB stick and I added an 8GB stick I found. Shockingly decent computer for LLMs. 75-watt power supply. I can run it literally for days of non-stop inference creating custom datasets and it doesn't even get hot.
boutell@reddit
This setup was not on my bingo card!
Living-Office4477@reddit
Super cool, you kind of gave me hope to try too, any other models you have tried and liked on that hardware?
virtualPNWadvanced@reddit
Are you able to tell the difference between 11 and 8 tk/s? Unless I'm paying SUPER close attention, I don't know if I've dropped speed.
EffectiveMedium2683@reddit (OP)
Honestly, not really. I just know from the output where it shows prefill and decode speed.
Barry_22@reddit
Wait what
Isn't Qwen 3.6 35B MoE MUCH more intelligent than Gemma 26B MoE? Coding-wise, brevity-wise, just in general?
Organic_Scarcity_495@reddit
The Ollama vs llama.cpp performance gap on MoE models is real. Ollama's default settings don't handle the expert routing well on desktop-class hardware. Running through llama.cpp directly with a tuned batch size makes a big difference.
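If you want to experiment with that, the batch knobs in llama.cpp are -b and -ub (these values are just a starting point, not a recommendation):
```
# Larger logical/physical batch sizes mainly speed up prompt processing
llama-server -m model.gguf -ngl 99 -b 2048 -ub 512
```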
DiscipleofDeceit666@reddit
What do you use it for?
koygocuren@reddit
Qwen's answers are not the issue. The issue is its enormous thinking blocks.
yami_no_ko@reddit
Not polluting your system with bloat like Ollama is a valuable lesson learned.
I can't tell for sure whether I prefer Qwen3.6. It's my go-to for programming, but Gemma-4 performs better with language and knowledge tasks in the context of Western culture.
mr_Owner@reddit
Try the reasoning effort flag; a reasoning end message with .cw at the end of the reasoning message works for me so far.
MundanePercentage674@reddit
Same for me. My use case is mostly agentic tasks, and Gemma failed me every time. I switched to Qwen 3.6 and it gets the job done.
nickm_27@reddit
In my experience it is considerably worse at prompt adherence, at least at Q4_K_XL. Gemma4 is much better in that regard, at least for my use case as a voice assistant.