Did anyone replace the old qwen2.5-coder:7b with qwen3.5:9b in non-thinking mode?
Posted by Impossible_Art9151@reddit | LocalLLaMA | 8 comments
I know, qwen3.5 isn't the coder variant yet.
Nevertheless, I'd guess a current 9b dense model performs better purely from a response-quality perspective, judging by the overall evolution since 2.5 was released.
We are using the old coder for autocomplete and fill-in-the-middle (FIM), load-balanced by nginx.
btw. 2.5 is such a dinosaur! And the fact that it is still such a workhorse in many places is an incredible recommendation for the qwen series.
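For reference, a minimal sketch of what the nginx load-balancing described above could look like, assuming two llama-server instances behind one endpoint; the hostnames and ports here are placeholders, not OP's actual setup:

```nginx
# Minimal sketch: round-robin two llama-server backends serving
# FIM/autocomplete requests. Hosts and ports are hypothetical.
upstream llama_fim {
    least_conn;                   # prefer the less busy box
    server 10.0.0.1:8081;         # strix halo #1
    server 10.0.0.2:8081;         # strix halo #2
}

server {
    listen 8080;
    location / {
        proxy_pass http://llama_fim;
        proxy_read_timeout 300s;  # generation can be slow
        proxy_buffering off;      # pass streamed tokens through
    }
}
```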
tomByrer@reddit
How much VRAM & context window are you using?
Impossible_Art9151@reddit (OP)
The old qwen2.5-coder is running alongside other, bigger models on two Strix Halo boxes.
From memory:
./llama-server with -np 2 -c 64000
Theoretically I can serve 4 concurrent requests (2 parallel slots per server, across the two boxes).
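A sketch of what that invocation might look like spelled out; the model path and port are hypothetical. Note that in llama.cpp the -c context is shared across the -np parallel slots, so each slot effectively sees about half of it:

```sh
# Hypothetical full llama-server invocation (model path and port are placeholders).
# With -np 2, llama.cpp splits the -c 64000 context across the two
# parallel slots, so each concurrent request gets roughly 32k tokens.
./llama-server \
  -m ./models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  -c 64000 \
  -np 2 \
  --host 0.0.0.0 --port 8081
```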
AcidumIrae@reddit
Strix Halo has 128GB of unified memory, but only 96GB of it can be used as VRAM.
RadiantHueOfBeige@reddit
Qwen3.5 is FIM tuned so it can do this, but like you said, there's little left to improve since 2.5. It's a dinosaur but it gets the job done for cheap. We're running it on a silly refact.ai cluster and while we played with qwen3 coder 30B-A3B we all went back to the 7 or 14B 2.5, because it's already doing what we want for half the cost (VRAM).
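For anyone wondering what FIM means in practice here, a sketch of the kind of infill request these autocompletes boil down to, using llama-server's /infill endpoint; the port and code fragment are made up, and I'm assuming the Qwen2.5-Coder FIM token names from its tokenizer config:

```sh
# Hypothetical FIM request against llama-server's /infill endpoint.
# The server wraps input_prefix/input_suffix in the model's FIM
# special tokens (<|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>
# for Qwen2.5-Coder) and the model generates the missing middle.
curl -s http://localhost:8080/infill -d '{
  "input_prefix": "def add(a, b):\n    return ",
  "input_suffix": "\n\nprint(add(1, 2))\n",
  "n_predict": 32
}'
```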
Impossible_Art9151@reddit (OP)
good point.
a qwen3.5-9b or 4b could serve other use cases besides autocomplete/FIM.
There are good reasons to consolidate LLM sprawl.
QuestionMarker@reddit
Tangent, but my bet is that we are unlikely to see a 3.5 coder model unless someone outside Qwen does it. Happy to be wrong, but with the core team leaving, even if they had something in flight they may not have the will or ability to do it justice any more.
Impossible_Art9151@reddit (OP)
that is what I am fearing as well
promobest247@reddit
try this https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled