Any news (or hope) of Qwen-3.6 14B and 9B distills for local coding?
Posted by QuchchenEbrithin2day@reddit | LocalLLaMA | View on Reddit | 44 comments
As the title suggests. I'm already testing (with some success and a few challenges) Qwen-3.5 9B on a new work laptop I've received with an RTX 1000 with 6GB VRAM (I know it seems like a joke in this day and age). I am using it with `pi` as the terminal coding harness. The issues I am facing with Qwen-3.5 9B are some (relatively infrequent) problems around:
- How it handles directories/folders: more than once I strangely got a deeply nested folder structure for the final code/test artefacts
- Reporting a test run as a failure when it was actually a success
The same prompts used with gemini-2.5-flash and gemini-2.5-flash-lite don't show such issues, which suggests the problem is not with `pi`. I've read some reports of `pi` sometimes struggling with Qwen-3.5 tool-calling, which is apparently fixed in Qwen-3.6. So I'm wondering if anyone has heard whether Qwen-3.6-27B dense model distillations down to 9B or 14B might also be released, enabling use on smaller GPUs.
Humble_Rabbt@reddit
you should try qwen3.6 35ba3b REAM APEX I quality
QuchchenEbrithin2day@reddit (OP)
BTW can you confirm this (mudler/Qwen3.6-35B-A3B-APEX-GGUF · Hugging Face) is what we are talking about?
Humble_Rabbt@reddit
Yeah, that's the normal qwen3.6-35b-a3b model quantized with mudler's APEX quantization (it's designed around the MoE architecture to preserve quality). It's good, but the one I was talking about is this:
https://huggingface.co/keithnull/Qwen3.6-35B-A3B-REAM-192-heretic-APEX-GGUF
REAM here means that some of the experts were merged, so it has 192 experts instead of the usual 256, which brings down the model size. That's how I'm able to run the q5km (5.30 bpw) version of this model.
QuchchenEbrithin2day@reddit (OP)
Would you mind sharing your full llama_serve CLI? I'm having suspected OOM issues in spite of progressively downsizing -igl, -c, -ub, -b and turning off flash attention. My context is set at 32K, -igl is 9, -b/-ub at 128/64. It is already quite slow. I also suspect some thermal throttling happening, as this is a laptop, although it has a full premium aluminum case (used in an air-conditioned room).
QuchchenEbrithin2day@reddit (OP)
Very interesting claims on the model page, where they compare benchmarks like KL divergence (which I have no knowledge of) against Unsloth Dynamic quants. Is there any evidence of real-life experience being better with tighter quantization?
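(From a quick look, the KL divergence they report appears to measure how far the quantized model's next-token distribution Q drifts from the full-precision model's distribution P over a test corpus, so lower means the quant behaves more like the original:

$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)}$

If that reading is right, it's a proxy for quantization fidelity rather than a direct coding benchmark.)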
gatornatortater@reddit
What kind of magic do you work to get that to run on 6gb vram?
LlamaDelRey10@reddit
If you want to avoid the nested folder issue, you can try explicitly passing the absolute project root in the system prompt (or generally just being explicit about avoiding nesting) and see if that helps.
On the distill question, Alibaba seem to be pushing MoE hard and the 35B-A3B is where the attention is. A 14B might happen if community pressure builds like it did for the DeepSeek Qwen3 8B distill.
Mordimer86@reddit
I don't think so, and there is the 35B, which with MoE offloading can run on small VRAM with Q4_K_M quantization and a decent context size. It can help with coding; I tested it with OpenCode (although that was Q5_K_M) and it did fine with a small Rust desktop app (Iced). It even figured out how to work with a version of Iced it wasn't trained on. I would not expect anything better than the 35B for lower VRAM setups.
n00bmechanic13@reddit
Oh shit is this agi
QuchchenEbrithin2day@reddit (OP)
Your observation with the 35B MoE model is encouraging. I am no expert, but in a 35B MoE model, isn't the limiting factor that only 3B weights' worth of knowledge sits in the experts that fire? That is far less than the knowledge density of a 9B (or 14B) dense model, isn't it?
From what I've understood, MoE gives a performance advantage, but only if the entire 35B worth of weights (once quantized down) fits entirely into the GPU. Wouldn't offloading the majority of layers to CPU+RAM nullify the performance boost?
TinyFluffyRabbit@reddit
If you can fit both into the GPU, the MoE does run faster. However, an additional advantage of MoE is that it's actually usable even if it doesn't fit into the GPU.
GrungeWerX@reddit
I think you've got that wrong. It's 35B worth of knowledge, but only 3B active parameters.
QuchchenEbrithin2day@reddit (OP)
Ah indeed, you are right.
Risen_from_ash@reddit
Yea, just to give you peace of mind, the 35b is nuts. 9b is.. kinda dumb. 14b is less dumb, but dumb. Qwen 3.6 35b a3b UD Q8 K XL at 262k ctx blows my fkn mind every time I use it.
GrungeWerX@reddit
what hardware do you have? Best I used was the q6, but even that bottlenecks at high ctx.
Risen_from_ash@reddit
285k, 5080, 96gb ddr5. I do the classic all moe weights on cpu, rest on gpu. Still pretty fast, don’t remember exactly.. maybe 35-40 tok/s decode?
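If it helps, the usual way to get that split with llama.cpp's llama-server (or whatever your build's binary is called) is to push all layers to the GPU but pin the expert tensors to system RAM with a tensor override. A rough sketch; the model filename is a placeholder and the exact tensor-name regex can differ between GGUFs and builds:

```
# keep MoE expert tensors in system RAM, everything else on the GPU
# (filename is a placeholder; the regex may need adjusting for your GGUF)
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```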
ProfessionalSpend589@reddit
Not quite. Having 3B active parameters, let's say at an 8-bit quant, means that for every token the computer has to read around 3GB of weight data (either on the GPU or the CPU).
For a dense model with, say, 9B parameters, that would mean reading about 9GB of data per token.
On a GPU it doesn't really matter much. But on consumer CPUs with 2-channel DDR5 memory it matters, because system RAM is 10 or more times slower than video RAM.
So, when you have slower RAM, it's good to have to process as little data as possible per token (the active parameters) while keeping the rest ready to be read from RAM (the 35B total parameters).
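To put rough numbers on it (assuming dual-channel DDR5 gives you somewhere around 90 GB/s; that figure is an assumption and varies per kit):

- ~3GB of active weights per token: at most about 90 / 3 ≈ 30 tok/s from RAM
- ~9GB per token for an 8-bit 9B dense model: at most about 90 / 9 ≈ 10 tok/s

Real speeds will be lower once you add compute and KV cache reads, but the ratio is the point.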
Of course, it’s good only if we’re chasing numbers and not what people may call “intelligence” or "performance" on tests/queries.
jacek2023@reddit
there are many finetunes of the 9B, the problem is people here forget about old models a few minutes after a new one is released
https://huggingface.co/models?other=base_model:finetune:Qwen/Qwen3.5-9B
QuchchenEbrithin2day@reddit (OP)
Ah, good point. No Unsloth dynamic quants available though, right? And I'm guessing OmniCoder fares much better with tool-calling?
Will give the Q3_K_M / Q3_K_L a spin, to leave KV cache headroom in my tiny 6GB of GPU VRAM.
jacek2023@reddit
You must learn how to use local models. Try making rules in AGENT.md. It takes time to make everything work correctly, and various models need various rules. Just talk to it after a failure about how to do things better.
QuchchenEbrithin2day@reddit (OP)
Actually "talk to pi" to solve a problem is something that I keep hearing in the `pi` community, but the answer is actually from the same LLM and not really pi, right ? If so, doesn't the quality of response i.e. how pertinent, how on-topic, how helpful and effective the answer is, also depends on quality/capability of the model, right ?
jacek2023@reddit
After every local model release (GLM Flash, Gemma, Qwen) people complain that it thinks too much. I realized that it thinks less in OpenCode than in plain chat, because OpenCode has a system prompt to steer the model toward a clear answer; pi probably does something similar, just like other solutions.
Prof_ChaosGeography@reddit
Those are based off 3.5; OP asked about 3.6.
jacek2023@reddit
There is no 3.6 9B
Prof_ChaosGeography@reddit
Exactly. While 3.5 will work for them, they specifically want 3.6, so they asked about it.
Pleasant-Shallot-707@reddit
And it doesn’t freaking exist
charmander_cha@reddit
I use it, it's really solid.
InteractionSmall6778@reddit
No 3.6 distills at 9B or 14B yet. For 6GB with `pi`, Q4 Qwen-3.5 9B plus explicit AGENT.md rules covering directory depth and test exit codes handles most of what you're hitting: those failure patterns are scaffolding behavior, not model capability limits at that size.
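As a starting point, something along these lines in AGENT.md (purely illustrative wording, and the path is a placeholder; tune it to your project):

```
# AGENT.md (illustrative excerpt)
- The project root is /abs/path/to/project (placeholder). Create all files
  relative to the existing layout under that root.
- Do not create new nested directories unless the task explicitly requires them.
- A test run counts as passing only if the test command exits with code 0;
  otherwise report the exit code and the names of the failing tests.
```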
QuchchenEbrithin2day@reddit (OP)
Thanks, that is encouraging too.
ea_man@reddit
If you want to try something else, Aider has a different approach to tooling, as it mostly just does diffs, so even a small model like Omnicoder 2 won't fuck up file EDITS / APPLY all the time. Also, it's more precise in using selected files from projects.
pmttyji@reddit
Did you try https://huggingface.co/Tesslate/OmniCoder-9B ? It's based on Qwen3.5-9B only.
There's no 14B model in the 3.5 series. Still hoping for a 3.6-9B & 3.6-120B from Qwen sooner or later.
I see many distills (of Qwen3.5-9B) on HF. Dig deep there:
https://huggingface.co/models?sort=trending&search=Qwen3.5-9B+Distill
Thigh_Clapper@reddit
Is omnicoder better than copaw/qwenpaw flash?
philmarcracken@reddit
just tried that omnicoder as a subagent, turboquant. Ran far better than expected...
charmander_cha@reddit
I want the 3.6 version at 9B
It would be incredible
QuchchenEbrithin2day@reddit (OP)
Yes, I agree
Risen_from_ash@reddit
Qwen 3.6 35b a3b will dogwalk the 9b and 14b models. Straight up. If you haven’t tried it, you don’t even know. I’m not trusting anything of mine with the 9b or 14b models.
akram200272002@reddit
Imagine if, instead of 3B active, it were 6B or 9B, with 4 or 8 experts?
necrophagist087@reddit
Qwen3.6 35B-A3B (q4m) runs at 30 tok/s on my laptop with an RTX 4070 8GB VRAM (32GB RAM) for simple tasks (like image recognition and captioning). It's dumber than the 27B dense but outperforms any lower-weight models by miles.
sagiroth@reddit
There is no need for one if there is MOE
Organic_Scarcity_495@reddit
The 35B A3B MoE already runs on 6GB VRAM with q4_k_m and offloading, so I'd be surprised if they bother with smaller distills. The MoE architecture is their answer to the VRAM problem: you get 35B-parameter intelligence while only reading ~3B active parameters per token.
brickout@reddit
The 35b a3b should run fine
ps5cfw@reddit
I would guess they simply don't make sense in terms of performance compared to the 35B (which can at least run fairly speedily with CPU offloading).
QuchchenEbrithin2day@reddit (OP)
Sorry I missed this comment, but don't the 3B effective active parameters have far weaker reasoning and world knowledge compared to a 9B dense model? Also, doesn't CPU offload slow it down significantly?
ps5cfw@reddit
Yes and no. It's not acting like a 3B model, but it may lack some of the awareness of a 9B model in certain focused scenarios.
CPU offload of a 35B MoE may still be significantly faster than a 9B dense model with CPU offload, so there's also that.