Qwen3 Coder Next as first "usable" coding model < 60 GB for me

Posted by Chromix_@reddit | LocalLLaMA | View on Reddit | 222 comments

I've tried lots of "small" models < 60 GB in the past. GLM 4.5 Air, GLM 4.7 Flash, GPT OSS 20B and 120B, Magistral, Devstral, Apriel Thinker, previous Qwen coders, Seed OSS, QwQ, DeepCoder, DeepSeekCoder, etc. So what's different with Qwen3 Coder Next in OpenCode or in Roo Code with VSCodium?

I run the model this way:
set GGML_CUDA_GRAPH_OPT=1

llama-server -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -ngl 99 -fa on -c 120000 --n-cpu-moe 29 --temp 0 --cache-ram 0

This works well with 24 GB VRAM and 64 GB system RAM when there's (almost) nothing else on the GPU. Yields about 180 TPS prompt processing and 30 TPS generation speed for me.

OpenCode vs. Roo Code:

Both solved things with the model, yet with OpenCode I've seen slightly more correct answers and solutions. But: Roo asks by default about every single thing, even harmless things like running a syntax check via command line. This can be configured with an easy permission list to not stop the automated flow that often. OpenCode on the other hand just permits everything by default in code mode. One time it encountered an issue, uninstalled and reinstalled packages in an attempt of solving it, removed files and drove itself into a corner by breaking the dev environment. Too autonomous in trying to "get things done", which doesn't work well on bleeding edge stuff that's not in the training set. Permissions can of course also be configured, but the default is "YOLO".

Aside from that: Despite running with only a locally hosted model, and having disabled update checks and news downloads, OpenCode (Desktop version) tries to contact a whole lot of IPs on start-up.