Qwen3.6 is incredible with OpenCode!

Posted by CountlessFlies@reddit | LocalLLaMA | View on Reddit | 71 comments

I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know). But this genuinely feels like a model I could daily drive for certain tasks instead of reaching for Claude Code.

I gave it a fairly complex task of implementing RLS in postgres across a large-ish codebase with multiple services written in rust, typescript and python. I had zero expectations going in, but it did an amazing job. PR: https://github.com/getomnico/omni/pull/165/changes/dd04685b6cf47e7c3791f9cdbd807595ef4c686e

Now it's far from perfect, there's major gaps and a couple of major bugs, but my god, is this thing good. It doesn't one-shot rust like Opus can, but it's able to look at compiler errors and iterate without getting lost.

I had a fairly long coding session lasting multiple rounds of plan -> build -> plan... at one point it went down a path editing 29 files to use RLS across all db queries, which was ok, but I stepped in and asked it to reconsider, maybe look at other options to minimize churn. It found the right solution, acquiring a db connection and scoping it to the user at the beginning of the incoming request.

For the first time, it felt like talking to a truly capable local coding model.

My setup:

Qwen3.6-35B-A3B, IQ4_NL unsloth quant
Deployed locally via llama.cpp
RTX 4090, 24 GB
KV cache quant: q8_0
Context size: 262k. At this ctx size, vram use sits at \~21GB
Thinking enabled, with recommended settings of temp, min_p etc.

llama server:

```
docker run -d --name llama-server --gpus all -v :/models -p 8080:8080 local/llama.cpp:server-cuda -m /models/qwen3.6-35b-a3b/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf --port 8080 --host 0.0.0.0 --ctx-size 262144 -n 8192 --n-gpu-layers 40 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --parallel 1 --cache-type-k q8_0 --cache-type-v q8_0 --cache-ram 4096
```

Had to set `--parallel` and `--cache-ram` without which llama.cpp would crash with OOM because opencode makes a bunch of parallel tools calls that blow up prompt cache. I get 100+ output tok/sec with this.

But this might be it guys... the holy grail of local coding! Or getting very close to it at any rate.

[-]

GeneralEnverPasa@reddit

He uses OpenCode so beautifully and professionally; I can honestly say he’s the best I’ve used to date. I asked him, "I want to hear your voice—how can we make that happen?" and he presented me with several options. By writing Python code and setting up a text-to-speech engine, he actually started speaking to me! :)

The next step is to take him out of OpenCode and enable communication through a different interface—a portable chatbox on my screen where we can correspond via voice or text. Since he already possesses image processing technology, I’m going to ask him to capture images from my screen whenever I want and click on specific coordinates or perform similar tasks. I’ll also have him set up different systems so he can conduct research on Google and beyond.

In short, I can now say he is at a level where he can handle all of this. With a 264k context window, I finally have exactly the kind of "beast" I was looking for.

[-]

Falagard@reddit

Ummm.

[-]

ailee43@reddit

every day i regret more the 16GB of VRAM on my 5070ti.... should have gone 3090

[-]

grumd@reddit

I've got a 5080 (also 16GB) and the only model I can't run is Qwen 27B and Gemma 31B.

We're good, mate. Just use llama.cpp and offload MoE experts to RAM. I'm running Qwen 3.6 35B-A3B with FULL 262k context (f16 kv cache) right now. 15GB VRAM used + 29GB RAM used by the llama-server process, getting 35 t/s generation speed.

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
  --fit on -fitt 512 -fitc 0 --no-mmap --kv-unified \
  -b 4096 -ub 2048 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0

You can also run Qwen 3.5 122B IQ3_XXS if you have 64GB RAM or even 122B Q4_K_S if you have 96GB RAM

[-]

relmny@reddit

With TheTom-turboquant (which I'm not saying I recommend or not, as I'm still testing it now and then, for whether I see any loss or so), I can run Unsloth's Q3_K_M on 16gb VRAM, but only at " -c 49152" and haven't tested more than 5k total tokens.

[-]

Danmoreng@reddit

Why Q6 instead of Q4? I get ~70 t/s on Q4 with similar params and cache Q8 on a 5080 mobile 16GB. https://github.com/Danmoreng/local-qwen3-coder-env#server-optimization-details

[-]

simracerman@reddit

I have 5070Ti, and fit Q5_K_XL with 128k context window. Getting 50t/s generation and 300t/s for processing. Not the best processing, but this model is fast enough for a 6000 lines repo to clean code, optimize and fixed random bugs here and there within an hour.

[-]

superdariom@reddit

The -ub 4096 and -b 3072 parameters tripled my processing speed

[-]

simracerman@reddit

What sorcery is this..?! I changed these batching params and voila! It’s indeed up significantly!

I’ve experimented with these a lot before with dense models like the 27B but nothing changed.

[-]

superdariom@reddit

Yeah I don't know either but it really made it workable for me. I saw it on another comment on this sub.

[-]

simracerman@reddit

I spoke a bit soon. On shorter context it’s reaaaaallllyy fast. Once you go above 50k it’s tanking for some reason

[-]

TheOriginalOnee@reddit

Same

[-]

ailee43@reddit

Ive been exploring adding a 5060ti in, keeping them both in 8x PCIE-5 slots, and running tensor parallel, or just hosting context on the 5060ti, but thats a painful solution

[-]

No-Name-Person111@reddit

Dual GPU isn't painful at all. I run 2x5060Ti and I never even think about it. I just have 32GB of VRAM for LLMs.

[-]

Material_Rich9906@reddit

Really? I thought it didn't really work to share

[-]

Ranmark@reddit

bro i run 1080 ti + 2060 super xD
and it just works out of the box.

[-]

superdariom@reddit

Presumably you need a board with dual GPU slots?

[-]

dellis87@reddit

Same. I almost pulled the trigger on a 5060ti at Best Buy today and got nervous.

[-]

Xonzo@reddit

I'm in the same situation. Ryzen 9800X3D, 64GB DDR5-6000, 5070TI.

Wondering if I should add a 5060ti. Usable 24-30GB vram would be perfect.

[-]

Corosus@reddit

I've got this dual gpu setup and it's great, 85tps at fresh context with q4 full context, windows so im not even benefitting from pcie lane speed

[-]

ailee43@reddit

while we're wishlisting :D i want a 12 memory channel ddr5 setup with an Epyc

[-]

Due-Project-7507@reddit

I am waiting for the Qwen 3.6 27B. The mradermacher Qwen3.5-27B-i1-GGUF IQ4_XS works with my A5000 laptop GPU (16 GB) with 64k turboquant 3 bit context length very good (around 20 t/s at beginning, around 15 t/s at 10k context).

[-]

pneuny@reddit

Just use the Unsloth UD-IQ3_XXS. I have it set to 190k token context window with 75 t/s on Bazzite Linux using llama.cpp on a 16 GB RX 9070 XT. Sure, it would be faster if it could all fit in VRAM, but GTT overflow is good enough with the Vulkan backend.

[-]

Turbulent_Pin7635@reddit

Sell it and buy a 3090, no?

[-]

andy2na@reddit

I was going to do that but ended up using both 3090 and 5060ti. qwen3.6 35b Iq4 xs with 262k context (q8/turbo cache) fits perfectly in 24gb. I then have TTS and comfy models loaded on the 5060ti

[-]

T3KO@reddit

Q4 still works better than it should on 16gb.

[-]

Durian881@reddit

I was playing with it (Q8) on Qwen Code and it did pretty well too using a "McKinsey-research skill" that involved use of subagents using lots of tool calls (websearch and webfetch). Overall, it ran more than 1.5 hours.

There were some issues along the way (subagents not saving output) but after one reminder, it recovered and checked for subsequent iterations that output files are saved. The other boo-boo was the final presentation where 12 slides were rendered concurrently instead of sequentially.

[-]

SheikhYarbuti@reddit

This is amazing! Could you please give more details about your setup - especially the agent side.

[-]

Durian881@reddit

I am using Qwen Code which is modeled after Claude Code. LM Studio provides the OpenAi endpoint.

[-]

Jaded_Towel3351@reddit

How does opencode compare to Claude code? I’ve been using Claude code + everything Claude code plugin + Qwen locally since GitHub copilot limit student’s plan last month and I’ve never open copilot again. Maybe I will give opencode a try.

[-]

CountlessFlies@reddit (OP)

They’re both really good harnesses, so, model being the same, I doubt there’ll be a huge difference between the two. I somewhat like the OpenCode TUI better, seems more polished.

[-]

Sh1d0w_lol@reddit

Actually there is a difference. The system prompt and tooling of Claude code is superior compared to opencode I’ve tested this many times using same local model for both and CC was able to complete the tasks perfectly and even managed context properly where with opencode it either failed the task or hit context limit mid task

[-]

That_Faithlessness22@reddit

How did you get CC to use the preserve_thinking?

[-]

SmartCustard9944@reddit

The context engineering inside OpenCode is far weaker than Claude Code. The way OpenCode structures the context is a bit garbage.

[-]

mrinterweb@reddit

I did nearly the same experiment last night. I used OpenCode. I used LM Studio to run it, which I think I'll switch to plain llama.cpp. I was getting usually around 100tps. The results weren't as good as I was expecting though. I wasn't sure if the issue was OpenCode, but I compared it to Claude Code (Opus 4.7), and the claude code experiece was much better for me. I am going to try using Qwen 3.6 with claude code next to see if it is an agent or llm difference. I will say that while opencode + qwen didn't beat cc, it was for sure usable. Another thing I will say for it was the average inference speed felt faster. CC's inference speed can vary a lot, but Qwen 3.6 on my RTX 4090 was keeping at a consistent ~100tps. The large 262K context makes it usable.

[-]

That_Faithlessness22@reddit

I've been using it with Claude code, and I'm getting similar speeds. But I won't be measuring the quality on it because you can't have the harness doesn't support the preserve_thinking flag. It is incompatible unless you parse- and that's a little outside my comfort zone for now. I'll probably try to figure it out tonight, or I'll just do the dive into Hermes I've been putting off.

[-]

CountlessFlies@reddit (OP)

Exactly… the context makes a huge difference.

Did you run it with thinking enabled (it’s the default)? I found that it does much better with thinking on. And also, I think there’s a separate flag you need to set to send the thinking traces with each request, that might also help improve performance.

[-]

mrinterweb@reddit

It was definitely thinking. I also tried it with hermes agent, and my results were pretty different. So I think a lot of my subjective evaluation is going to come down to the agent, which is why I think I should point claude code at qwen 3.6, so I can get more of an apples to apples comparison. I don't have a background in evaluating model scores so what I'm doing is just feels. I pay for Claude, but if Qwen 3.6 can get me close, there are plenty of tasks I would much rather use my own hardware.

[-]

SmartCustard9944@reddit

Yes, please try this. I tried Open Code with LM Studio Qwen 3.6 and it didn’t pass simple tests that Gemma 4 passes easily there.

My first test is asking it how many tools it supports. The correct number is 27. Gemma always answers correctly, never misses a beat. Qwen 3.6 hallucinates the number. It says 28 and then proceeds to list 27 items, but one is a duplicate. This happens even with thinking enabled. It is really baffling, especially after seeing everybody praising it here.

The second test is the typical car wash test. Gemma 4 always passes, Qwen 3.6 routinely says to walk. The interesting thing is that Qwen answers correctly when the prompt is at 0 context (without a harness).

[-]

mrinterweb@reddit

I find that many agents trip up when asked introspective questions, so I don't bother with those kinds of prompts. General logic tests are important, but most of what I do with agents is coding specific. So whatever is better at code is what I'll use. I'll try giving Gemma 4 another go locally.

[-]

klenen@reddit

Let us know how it goes using it w cc please!

[-]

donk8r@reddit

Same experience here. The local quality jump is wild.

One thing that helped me get reliable results: giving the agent a "map" of the codebase before it starts coding. Not just files — actual relationships. What imports what, what calls what.

Without that it was guessing based on variable names. With it, it navigates like it built the thing.

Qwen3.6 + structured context = finally dropped my cloud API keys.

[-]

nuhnights@reddit

Nice! Can you provide an example?

[-]

Apart_Fudge1224@reddit

I had claude build a script that I can just run when ever and it prepares a full file tree and json of all the relationships and imported. And an HTML visualizer w a node diagram vibe for me, the meat sac. It's been a game changer honestly cuz it's easy to ID weird patterns that are pretty abstract without visuals. For me any way

[-]

amelech@reddit

If I have a 9070 xt with 16gb vram and 32gb what quant can I run in llama.cpp and what max context size can I safely use? I want to use it for assisting on an android app using opencode

[-]

superdariom@reddit

Is the iq4 quant special? I don't really know what that means. I'm running Q5 with 12 moe layers on cpu

[-]

Interesting_Key3421@reddit

Also with Pi coding agent

[-]

rm-rf-rm@reddit

just saw the dev's excellent talk delivered at AI Engineer Europe. its exactly the solution we need, especially us power users who want to control our workflow.

[-]

Caffdy@reddit

have you tried using the flag --chat-template-kwargs '{"preserve_thinking": true}'?

[-]

GrungeWerX@reddit

Wake me up when they release the 27b…

[-]

Professional_Diver71@reddit

*Cries in 5070ti 16gb *

[-]

anthonyg45157@reddit

Damn I'm running the UD-Q4_K_XL and fighting context 😂 ight need to switch

[-]

Keras-tf@reddit

Is there a reason to go UD-Q8? I tried it yesterday via Cline and it seems good but I feel it is overkill?

[-]

RelicDerelict@reddit

Is someone running this on a 4GB VRAM and 32GB system ram? Just asking for a friend (you don't need to remind me that I am poor).

[-]

matjam@reddit

Nice I’ll have to try it.

[-]

Old-Sherbert-4495@reddit

not so much for me... coz im testing it out in a project and asking it to make hard coded color into a primary color variable in css. damn, it just yaps... yaps.. and after a very long time multiple compactions it finally starts to edit files and then onwards it takes a long time to finish the task. i tried with Q6 and Q5Ks and Q4kxl q6 got to editing and finished the task earlier than other quants.

But the results were not satisfying.

to compare i tried 3.5 27B IQ3xxs and damn it got the point and got to work immediately in a few steps. even though its significantly slower tkps it finished off the task much quicker than all of the 3.6 quants. i dont mind if it missed a few things, i can prompt it again.

I'm using the recommended params for both context 70k coz of vram. this is the reason for frequent compactions

[-]

robertpro01@reddit

For me, it is just on pair with gemini 3 flash, that means I don't need to pay for it anymore.

[-]

Tymid@reddit

How do with Gemini flash 3? For coding? Other tasks?

[-]

robertpro01@reddit

No idea, I use it only for coding

[-]

thejacer@reddit

I am missing the iteration…I’m not a dev so I rely really heavily on the model (entirely really) and I don’t mind that it screws up, but it still sometimes tries to explore directories that just don’t exist and after making any attempt it just completes and waits…I wouldn’t mind it breaking stuff and fixing it, but it just breaks stuff and sits. Is there something I need to do in OpenCode to enable the iterative work other people are getting it to do?

[-]

imgroot9@reddit

I also started with IQ4_NL, then downloaded bartowski Q4_K_M and built Turbo Quant locally to see if it makes any difference. I don't know why, but this setup is like a cheat code. I'm not sure what happened, but anything I try gives me amazing results.

[-]

TheLinuxMaster@reddit

Hi. will this same setup work for me ? I have rtx 3090 and 32gb of ddr5

[-]

Turbulent_Pin7635@reddit

Wow!!! With q4 quant?!?!

I have downloaded it to my M3U, even with access to larger models I preferer the small ones (the softwares I run can easily eat 350 GB RAM).

[-]

CountlessFlies@reddit (OP)

Yes! It’s really good, I’m really interested to try out q6 and beyond to see if they are even better

[-]

FinBenton@reddit

I have been testing it with llama.cpp + cline, works super well with this after just a few tests.

[-]

abmateen@reddit

What is the difference with Q4_NL?

[-]

soyalemujica@reddit

May I ask, how "weak" or "less smart" is UD_IQ4_NL in comparison to 4KM / UD4KM ?

[-]

CountlessFlies@reddit (OP)

I think this might be useful https://www.reddit.com/r/unsloth/s/YyPjuAckGT

[-]

CountlessFlies@reddit (OP)

Haven’t really compared the two yet! Might try it next

[-]

Uncle___Marty@reddit

Saw someone making a reply to another post about qwen 3.6 saying roughly "so many qwen 3.6 posts are getting boring". I TOTALLY disagree. I'm literally swimming in posts with peoples experiences right now and im loving it. Maybe because I didnt try it for myself yet but whatever. Appreciate your thoughts on it!

[-]

CountlessFlies@reddit (OP)

Thank you! Really interesting to see others posts as well