Tested How OpenCode Works with Self-Hosted LLMs: Qwen 3.5, 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash - v2
Posted by rosaccord@reddit | LocalLLaMA | 30 comments
I have run two tests on each LLM with OpenCode to check their basic readiness and convenience:
- Create an IndexNow CLI in Golang (easy task), and
- Create a migration map for a website following the SiteStructure strategy (complex task).
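For context on the easy task: the IndexNow protocol boils down to a single JSON POST to a search engine's endpoint, so the Go CLI mostly wraps this one call. A minimal shell sketch, with a placeholder host and key:

```shell
#!/bin/sh
# Minimal sketch of an IndexNow submission; HOST, KEY and the URL list are
# placeholders. A real deployment also serves the key file at
# https://$HOST/$KEY.txt so the endpoint can verify ownership.
HOST="example.com"
KEY="0123456789abcdef0123456789abcdef"

PAYLOAD=$(cat <<EOF
{
  "host": "$HOST",
  "key": "$KEY",
  "urlList": ["https://$HOST/", "https://$HOST/blog/"]
}
EOF
)

echo "$PAYLOAD"
# The actual submission (commented out so the sketch runs offline):
# curl -s -X POST "https://api.indexnow.org/indexnow" \
#   -H "Content-Type: application/json; charset=utf-8" \
#   -d "$PAYLOAD"
```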
Tested Qwen 3.5 and 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash, and several other LLMs.
Context size used: 25k-50k - varies between tasks and models.
The results are in the table below; most of the exact quant names are in the speed test table.
Hope you find it useful.
---
Here in v2 I added tests of:
- Qwen 3.6 35b q3 and q4 => the result is worse than expected
- Qwen 3 Coder Next => very good result
- and Qwen 3.5 27b q3 Bartowski => disappointing

The speed of most of these self-hosted LLMs on an RTX 4080 (16GB VRAM) is below, to give you an idea of how fast or slow each model is.
I used llama.cpp with the recommended temp, top-p and other params, and default memory and layer params. Tuning these might help you improve speed a bit. Or maybe a bit more than a bit :)
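For illustration, a launch along these lines (the model file name and parameter values are placeholders, not the exact settings from the tests):

```shell
# Illustrative llama-server launch; model path and values are placeholders.
# -c   : context size (the tests above used 25k-50k)
# -ngl : number of layers offloaded to the GPU; lower it if VRAM runs out
llama-server \
  -m Qwen3.5-27B-UD-Q4_K_XL.gguf \
  -c 32768 \
  -ngl 99 \
  --temp 0.7 --top-p 0.8 --top-k 20 \
  --port 8080
```

The `-c` and `-ngl` values are the "memory and layer params" worth tuning first on a 16GB card.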

My takeaway from this test iteration:
- Qwen 3.5 27b is a very decent LLM (Unsloth's quants) that suits my hardware well.
- Qwen3 Coder Next is better than Qwen 3.5 and 3.6 35b.
- Qwen 3.5 and 3.6 35b are good, but not good enough for my tasks.
- Both Gemma 4 26b and 31b showed very good results too, though for self-hosting on 16GB VRAM the 31b variant is too big.
---
The details of each LLM's behaviour in each test are here:
https://www.glukhov.org/ai-devtools/opencode/llms-comparison/
DeltaSqueezer@reddit
Thanks for sharing the results. I'm surprised at the poor performance of the Qwen3.5-9B, especially since you are using it unquantized (not sure if the KV cache is also unquantized).
This 9B has been my daily driver, and I have been using it instead of the 27B for longer context and faster processing. Tool calling was also problem free.
Did you use the same chat template across all runs? I'm wondering if this could also be a factor in the variations e.g. between Bartowski and Unsloth 27B quants.
InternationalNebula7@reddit
This is good to know. I found 27B at q3 felt better than 9B at q8 in my initial testing, but I didn't use it more than a day. Qwen3.6 just released 27B, so perhaps the 9B will follow shortly.
DeltaSqueezer@reddit
I've been using the unquantized 9B for a while now and like it. A few days ago I restarted it and didn't notice that I'd started a version with Int8 quantization on full attention (still full bf16 on linear layers), and noticed degradation. The errors were small, such as extra characters here and there, but over long agentic chains they become noticeable, which is what led me to check and discover I'd loaded the wrong model.
InternationalNebula7@reddit
Do you have 16GB VRAM?
DeltaSqueezer@reddit
I have 24GB VRAM. If it was just text, you would not notice, but I have agentic workflows that literally make 100s of tool calls each, and with quantization, there can be small glitches like adding a '>' at the end of a file etc. which then breaks the flow so it becomes obvious.
rosaccord@reddit (OP)
Qwen3.5:9B is the default (q4) from the Ollama library.
Yes, the same prompt across all the runs.
DeltaSqueezer@reddit
Ah, OK. If Q4, then it's more understandable, as I noticed the 9B is quite sensitive to quantization.
InternationalNebula7@reddit
Please run with Qwen3.6:27B when unsloth releases the quants. Look forward to seeing the results!
Still-Wafer1384@reddit
I'm joining the queue
sittingmongoose@reddit
This^. It's a new release as of today.
Bingo-heeler@reddit
Can you drop your llama.cpp configs for qwen3 coder next and qwen3.6?
korino11@reddit
Your Qwen 3.6 isn't correct. Use this: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Plus-Uncensored-Wasserstein-GGUF - it's the only one that fixed the Qwen layers.
rosaccord@reddit (OP)
you should send this message to unsloth
korino11@reddit
Dude, you don't understand at all. It means you didn't even read it... This model restores what was broken in the official Qwen 3.6 open-weight release! So where does Unsloth come in? They quantized the broken model like everyone else...
jacek2023@reddit
I am wondering how such a small context is usable for other people. I usually need at least 100k (so I set up 200k with Gemma 26B).
Weird_Linux_Nerd_07@reddit
I'm surprised about the GLM-4.7-Flash ranking, because this model was a big disappointment for me. I have tested the same set of models on my dual-GPU Linux desktop: RTX 4060 Ti 16GB (video output), RTX 5060 Ti 16GB (pure LLM usage). Java, Python and SQL are my primary focus. Benchmarks used:
- SQL: https://github.com/nlothian/llm-sql-benchmark
- Aider Polyglot: https://github.com/Aider-AI/aider/tree/main/benchmark
Top models, llama.cpp config params:
- Bartowski Qwen_Qwen3.5-27B-IQ3_XS.gguf - chat-template-kwargs = {"enable_thinking":false}, temp = 0.5
- Unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL - chat-template-kwargs = {"enable_thinking":true}, temp = 0.5
- Unsloth Qwen3.6-35B-A3B-UD-Q4_K_XL - chat-template-kwargs = {"preserve_thinking": true}, temp = 0.5
- Unsloth Qwen3-Coder-Next-UD-IQ4_XS - temp = 1.0
Notes:
- all remaining settings are the default values suggested by Qwen authors.
- ctx-size = 131072
- my personal favorite is Qwen3.5-35B because of the speed/reliability ratio. Qwen3.5-27B is the best one but 3x slower than the MoE brother.
GLM-4.7-Flash and Gemma 4: bad benchmark results, and also many tool-calling issues in OpenCode.
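Written out as a full invocation, the first config above would look roughly like this (a sketch; the model path is a placeholder, and `--chat-template-kwargs` requires a reasonably recent llama.cpp build):

```shell
# Illustrative translation of the Bartowski Qwen3.5-27B config into a
# llama-server command line; only the listed params are non-default.
llama-server \
  -m Qwen_Qwen3.5-27B-IQ3_XS.gguf \
  -c 131072 \
  --temp 0.5 \
  --chat-template-kwargs '{"enable_thinking":false}'
```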
pmttyji@reddit
Hope you're using an optimized llama.cpp already.
Could you test Qwen3.6-35B-A3B's Q4_K_M, since you tested big models? I know you have a VRAM limitation, but it's better to stick to Q4 quants (at least) of 20-40B models when you have 16GB VRAM. Also include IQ4_XS of Qwen3.5-27B.
^(I refuse to use Q3 (and below) for small/medium models even though I have only 8GB VRAM. I'm talking about the Qwen3-30B MoEs, and do use the IQ4_XS quant with the help of RAM (32GB DDR5).)
rosaccord@reddit (OP)
I did use IQ4_XS of Qwen3.6 35b, see the table... there should not be much of a difference from Q4_K_M.
No-Refrigerator-1672@reddit
The Qwen team recommends very specific temperature, top_k and presence-penalty values for coding tasks, which differ from the default parameters. This applies to both 3.5 and 3.6. Did you just use default parameters, or the correct ones?
rosaccord@reddit (OP)
correct ones
Designer_Reaction551@reddit
Qwen3 Coder Next beating the larger 3.5/3.6 general models tracks with what I've seen on agentic coding tasks on similar hardware. Task-tuned models hold structure better at 25-50k context than bigger general ones, and the difference shows up most on the complex task where planning-level reasoning matters. Would be curious how much the gap closes if you bump up repetition_penalty and lock temperature to 0.1-0.2 on the 3.6 35b q4 - in my runs that was the difference between coherent migration maps and rambling ones.
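For anyone wanting to try that comparison without relaunching the server: llama-server's OpenAI-compatible endpoint accepts per-request sampling overrides. A sketch (port and values are illustrative; `repeat_penalty` is a llama.cpp server extension, not a standard OpenAI field):

```shell
# Sketch: per-request sampling overrides against a running llama-server.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Draft the migration map."}],
        "temperature": 0.15,
        "repeat_penalty": 1.1
      }'
```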
ps5cfw@reddit
Honestly I cannot say the same. Using OpenCode, I've always seen Qwen 3.6 doing better than Qwen 3 Coder Next in real-world .NET and JavaScript scenarios; the thinking really does help, and Qwen Coder not having it gimps its ability a lot from my POV.
rosaccord@reddit (OP)
Are you running some kind of summarizing?
Or did you do the migration maps on Qwen 3.6 too?
R_Duncan@reddit
You compared Qwen3.6 in IQ4_XS against Qwen-Coder-Next 4-bit, a model 4 times the size, and won't even raise the quant... OK, that's faster, but for a quality comparison you should use at least Q5_K_XL for Qwen3.6.
rosaccord@reddit (OP)
A Reddit user asked me to test Coder Next, so I did. Probably he was deciding on a local model...
Raising the quant because of the lower param count... no, thank you. I'm not testing all the quants to find where it starts working.
TheItalianDonkey@reddit
Not sure what you're testing for, then.
I mean, you either test for the best result within the space you have (which means different quants based on memory occupation), or you test for the best model at a standard quantization (which means the same quants).
You did neither, so I'm unsure what the results prove or whom they are good for.
DeltaSqueezer@reddit
You mention 16GB GPU. Which model is it?
rosaccord@reddit (OP)
RTX 4080
pulse77@reddit
For complex coding tasks where precision matters, 3-bit quantization is "gambling"...
rosaccord@reddit (OP)
Yes, and for some models q4 is not enough either.
But the restriction I have is 16GB VRAM, so I have tested how something squeezable into it works.
Some think that anything less than Opus or Sonnet is worthless - so, horses for courses.