Any news (or hope) of Qwen-3.6 14B and 9B distills for local coding?
Posted by QuchchenEbrithin2day@reddit | LocalLLaMA | View on Reddit | 44 comments
As the title suggests. I'm already testing (with some success and a few challenges) Qwen-3.5 9B on a new work laptop I've received with an RTX 1000 with 6GB VRAM (I know it seems like a joke in this day and age). I am using it with `pi` as the terminal coding harness. The issues I am facing with Qwen-3.5 9B are some (relatively infrequent) problems around:
- How it handles directories/folders: more than once I strangely got a deeply nested folder structure for the final code/test artefacts
- Reporting a test run as a failure when it was actually a success
The same prompts used with gemini-2.5-flash and gemini-2.5-flash-lite don't show such issues, which suggests the problem is not with `pi`. I've read some reports of `pi` sometimes struggling with Qwen-3.5 tool-calling, which is apparently fixed in Qwen-3.6. So I'm wondering if anyone has heard whether Qwen-3.6-27B dense model distillations down to 9B or 14B might also be released, enabling use on smaller GPUs.
Humble_Rabbt@reddit
you should try qwen3.6 35ba3b REAM APEX I quality
QuchchenEbrithin2day@reddit (OP)
BTW can you confirm this (mudler/Qwen3.6-35B-A3B-APEX-GGUF · Hugging Face) is what we are talking about?
Humble_Rabbt@reddit
Yeah, that's the normal qwen3.6-35b-a3b model quantized with mudler's APEX quantization (it's designed around the MoE architecture to preserve quality). It's good, but the one I was talking about is this:
https://huggingface.co/keithnull/Qwen3.6-35B-A3B-REAM-192-heretic-APEX-GGUF
REAM here means that some of the experts were merged, so it has 192 experts instead of the usual 256, which brings down the model size. That's how I'm able to run the q5km (5.30 bpw) version of this model.
QuchchenEbrithin2day@reddit (OP)
Would you mind sharing your full llama_serve CLI? I'm having suspected OOM issues in spite of progressively downsizing -igl, -c, -ub, -b and turning off flash attention. My context is set at 32K, -igl is 9, -b/-ub at 128/64. It is already quite slow. I also suspect some thermal throttling happening, as this is a laptop, although it has a full premium aluminum case (used in an air-conditioned room).
QuchchenEbrithin2day@reddit (OP)
Very interesting claims on the model page, where they compare benchmarks like KL divergence (which I have no knowledge of) against Unsloth Dynamic quants. Is there any evidence of real-life experience being better with tighter quantization?
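(From a quick look, the KL divergence they report appears to measure how far the quantized model's next-token distribution Q drifts from the full-precision model's distribution P over a test corpus, so lower means the quant behaves more like the original:

$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)}$

If that reading is right, it's a proxy for quantization fidelity rather than a direct coding benchmark.)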
gatornatortater@reddit
What kind of magic do you work to get that to run on 6gb vram?
LlamaDelRey10@reddit
If you want to avoid the nested folder issue, you can try explicitly passing the absolute project root in the system prompt (or generally just being explicit about avoiding nesting) and see if that helps.
On the distill question, Alibaba seem to be pushing MoE hard and the 35B-A3B is where the attention is. A 14B might happen if community pressure builds like it did for the DeepSeek Qwen3 8B distill.
Mordimer86@reddit
I don't think so, and there is the 35B, which with MoE offloading can run on small VRAM with Q4_K_M quantization and a decent context size. It can help with coding; I tested it with OpenCode (although that was Q5_K_M) and it did fine with a small Rust desktop app (Iced). It even figured out how to work with a version of Iced it wasn't trained on. I would not expect anything better than the 35B for lower VRAM setups.
n00bmechanic13@reddit
Oh shit is this agi
QuchchenEbrithin2day@reddit (OP)
Your observation with the 35B MoE model is encouraging. I am no expert, but in a 35B MoE model, isn't the limiting factor that only 3B weights' worth of knowledge sits in the experts that fire? That is far less than the knowledge density of a 9B (or 14B) dense model, isn't it?
From what I've understood, MoE gives a performance advantage, but only if the entire 35B worth of weights (once quantized down) fits entirely into the GPU. Wouldn't offloading the majority of layers to CPU+RAM nullify the performance boost?
TinyFluffyRabbit@reddit
If you can fit both into the GPU, the MoE does run faster. However, an additional advantage of MoE is that it's actually usable even if it doesn't fit into the GPU.
GrungeWerX@reddit
I think you've got that wrong. It's 35B worth of knowledge, but only 3B active parameters.
QuchchenEbrithin2day@reddit (OP)
Ah indeed, you are right.
Risen_from_ash@reddit
Yea, just to give you peace of mind, the 35b is nuts. 9b is.. kinda dumb. 14b is less dumb, but dumb. Qwen 3.6 35b a3b UD Q8 K XL at 262k ctx blows my fkn mind every time I use it.
GrungeWerX@reddit
what hardware do you have? Best I used was the q6, but even that bottlenecks at high ctx.
Risen_from_ash@reddit
285k, 5080, 96gb ddr5. I do the classic all moe weights on cpu, rest on gpu. Still pretty fast, don’t remember exactly.. maybe 35-40 tok/s decode?
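If it helps, the usual way to get that split with llama.cpp's llama-server (or whatever your build's binary is called) is to push all layers to the GPU but pin the expert tensors to system RAM with a tensor override. A rough sketch; the model filename is a placeholder and the exact tensor-name regex can differ between GGUFs and builds:

```
# keep MoE expert tensors in system RAM, everything else on the GPU
# (filename is a placeholder; the regex may need adjusting for your GGUF)
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```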
ProfessionalSpend589@reddit
Not quite. Having 3B active parameters, let's say at an 8-bit quant, means that for every token the computer has to read around 3GB of weight data (either on the GPU or the CPU).
For a dense model with, say, 9B parameters, that would mean reading about 9GB of data per token.
On a GPU it doesn't really matter much. But on consumer CPUs with 2-channel DDR5 memory it matters, because system RAM is 10 or more times slower than video RAM.
So, when you have slower RAM, it's good to have to process as little data as possible per token (the active parameters) while keeping the rest ready to be read from RAM (the 35B total parameters).
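To put rough numbers on it (assuming dual-channel DDR5 gives you somewhere around 90 GB/s; that figure is an assumption and varies per kit):

- ~3GB of active weights per token: at most about 90 / 3 ≈ 30 tok/s from RAM
- ~9GB per token for an 8-bit 9B dense model: at most about 90 / 9 ≈ 10 tok/s

Real speeds will be lower once you add compute and KV cache reads, but the ratio is the point.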
Of course, it’s good only if we’re chasing numbers and not what people may call “intelligence” or "performance" on tests/queries.
jacek2023@reddit
there are many finetunes of the 9B, the problem is people here forget about old models a few minutes after a new one is released
https://huggingface.co/models?other=base_model:finetune:Qwen/Qwen3.5-9B
QuchchenEbrithin2day@reddit (OP)
Ah, good point. No Unsloth dynamic quants available though, right? And I'm guessing OmniCoder fares much better with tool-calling?
Will give the Q3_K_M / Q3_K_L a spin, to leave KV cache headroom in my tiny 6GB of GPU VRAM.
jacek2023@reddit
You must learn how to use local models. Try making rules in AGENT.md. It takes time to make everything work correctly, and various models need various rules. Just talk to it after a failure about how to do things better.
QuchchenEbrithin2day@reddit (OP)
Actually "talk to pi" to solve a problem is something that I keep hearing in the `pi` community, but the answer is actually from the same LLM and not really pi, right ? If so, doesn't the quality of response i.e. how pertinent, how on-topic, how helpful and effective the answer is, also depends on quality/capability of the model, right ?
jacek2023@reddit
After every local model release (GLM Flash, Gemma, Qwen) people complain that it thinks too much. I realized that it thinks less in OpenCode than in plain chat, because OpenCode has a system prompt to steer the model toward a clear answer; pi probably does something similar, just like other solutions.
Prof_ChaosGeography@reddit
Those are based off 3.5; OP asked about 3.6.
jacek2023@reddit
There is no 3.6 9B
Prof_ChaosGeography@reddit
Exactly. While 3.5 will work for them, they specifically want 3.6, so they asked about it.
Pleasant-Shallot-707@reddit
And it doesn’t freaking exist
charmander_cha@reddit
I use it, it's really solid.
InteractionSmall6778@reddit
No 3.6 distills at 9B or 14B yet. For 6GB with `pi`, Q4 Qwen-3.5 9B plus explicit AGENT.md rules covering directory depth and test exit codes handles most of what you're hitting: those failure patterns are scaffolding behavior, not model capability limits at that size.
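As a starting point, something along these lines in AGENT.md (purely illustrative wording, and the path is a placeholder; tune it to your project):

```
# AGENT.md (illustrative excerpt)
- The project root is /abs/path/to/project (placeholder). Create all files
  relative to the existing layout under that root.
- Do not create new nested directories unless the task explicitly requires them.
- A test run counts as passing only if the test command exits with code 0;
  otherwise report the exit code and the names of the failing tests.
```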
QuchchenEbrithin2day@reddit (OP)
Thanks, that is encouraging too.
ea_man@reddit
If you want to try something else, Aider has a different approach to tooling, as it mostly just does diffs, so even a small model like Omnicoder 2 won't fuck up file EDITS / APPLY all the time. Also, it's more precise in using selected files from projects.
pmttyji@reddit
Did you try https://huggingface.co/Tesslate/OmniCoder-9B ? It's based on Qwen3.5-9B only.
There's no 14B model in the 3.5 series. Still hoping for a 3.6-9B & 3.6-120B from Qwen sooner or later.
I see many distills (of Qwen3.5-9B) on HF. Dig deep there:
https://huggingface.co/models?sort=trending&search=Qwen3.5-9B+Distill
Thigh_Clapper@reddit
Is omnicoder better than copaw/qwenpaw flash?
philmarcracken@reddit
just tried that omnicoder as a subagent, turboquant. Ran far better than expected...
charmander_cha@reddit
I want the 3.6 version at 9B
It would be incredible
QuchchenEbrithin2day@reddit (OP)
Yes, I agree
Risen_from_ash@reddit
Qwen 3.6 35b a3b will dogwalk the 9b and 14b models. Straight up. If you haven’t tried it, you don’t even know. I’m not trusting anything of mine with the 9b or 14b models.
akram200272002@reddit
Imagine if, instead of 3B active, it were 6B or 9B, with 4 or 8 experts?
necrophagist087@reddit
Qwen3.6 35B-A3B (q4m) runs at 30 tok/s on my laptop with an RTX 4070 8GB VRAM (32GB RAM) for simple tasks (like image recognition and captioning). It's dumber than the 27B dense but outperforms any lower-weight models by miles.
sagiroth@reddit
There is no need for one if there is MOE
Organic_Scarcity_495@reddit
The 35B A3B MoE already runs on 6GB VRAM with q4_k_m and offloading, so I'd be surprised if they bother with smaller distills. The MoE architecture is their answer to the VRAM problem: you get 35B-parameter intelligence while only reading ~3B active parameters per token.
brickout@reddit
The 35b a3b should run fine
ps5cfw@reddit
I would guess they simply don't make sense in terms of performance compared to the 35B (which can at least run fairly speedily with CPU offloading).
QuchchenEbrithin2day@reddit (OP)
Sorry I missed this comment, but don't the 3B effective active parameters have far weaker reasoning and world knowledge compared to a 9B dense model? Also, doesn't CPU offload slow it down significantly?
ps5cfw@reddit
Yes and no. It's not acting like a 3B model, but it may lack some of the awareness of a 9B model in certain focused scenarios.
CPU offload of a 35B MoE may still be significantly faster than a 9B dense model with CPU offload, so there's also that.