Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude
Posted by sdfgeoff@reddit | LocalLLaMA | View on Reddit | 92 comments
Launched claude code, pointed it at my running Qwen, and, well, it vibe codes perfectly fine. I started a project with Qwen3.6-35B-A3B, and then this morning switched to 27B, and both worked fine!
Running on a dual 3090 rig with 200k context, using Unsloth's Q8_0 quant. No fancy setup, just followed Unsloth's quickstart guide and set the context higher.
```
#!/bin/bash
llama-server \
-hf unsloth/Qwen3.6-27B-GGUF:Q8_0 \
--alias "unsloth/Qwen3.6-27B" \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--ctx-size 200000 \
--port 8001 \
--host 0.0.0.0
```
```
#!/bin/bash
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL="http://192.168.18.4:8001"
claude $@
```
The best part is seeing Claude Code's cost estimate. Over those 8 hours I would have racked up $142 in API calls; instead it cost me <$4 in electricity (assuming my rig pulled 1kW the entire time; in reality it's less, but I don't have my power meter hooked up currently). So to all the naysayers about "local isn't worth it": this rig cost me ~$4500 to build (NZD), and thus has a payback period of ~260 hours of using it instead of Anthropic's APIs.
What did I vibe code? Nothing too fancy. A server in rust that monitors my server's resources, and exposes it to a web dashboard with SSE. Full stack development, end to end, all done with a local model.
Kicking off projects in the evening every now and then, that's a payback period of, what, maybe a couple months?
I'm probably not going to cancel my codex subscription quite yet (I couldn't get codex working with llama-server?), but it may not be long
Kinky_No_Bit@reddit
Very nice !!! how are you liking the dual 3090 setup? decent?
sdfgeoff@reddit (OP)
Honestly, best mistake I ever made. I set out to get a single 3090, a good deal came up and I bought one, then I thought the sale had fallen through, so I bought another, and ended up with both.
Kinky_No_Bit@reddit
Are you using a NVLink bridge on them? or just card to card?
sdfgeoff@reddit (OP)
Being random second hand cards, I am not using nvlink. I just plugged them in and they worked. I haven't put in more effort than that currently!
Kinky_No_Bit@reddit
I think this is why I'm starting to see those open wire mining racks come back from the dead too. People are using those to fudge the spacing a bit to make those bridges work.
TFABAnon09@reddit
I dunno - I reckon several months of Claude Code Max is still cheaper than 2x 3090s
FWIW the cost estimates aren't relevant unless you're using your budget - which isn't something I've needed to do in months, even with 7 concurrent Opus4.6 1M sessions running.
In fact, the only time I've hit my limit since Xmas was yesterday when I decimated my weekly Design limit trying it out 🤣
gthing@reddit
But can Claude Code Max play Crysis?
sdfgeoff@reddit (OP)
IDK. I can hit my limit in codex pretty quick without issues. I'd imagine I could run out of Claude pretty quick too. I guess I should give Qwen3.6 a slightly more torturous test and see how well it can orchestrate subagents.
But yes, at any given point in time, a subscription is pretty cheap compared to a hardware rig. With the caveat that now I have powerful hardware I can do what I like with (e.g. training vision models, generating 3D models with Hunyuan, rendering things in Blender, playing games, etc.).
TFABAnon09@reddit
We're not comparing Codex though. This is a thread about Claude.
Amazon has the RTX 3090Ti for sale at £1,700 - so let's call that £3,500 in GPUs, plus the cost of the rest of the system. At today's prices - let's be super generous and call the total build cost £5k (it would be much more than that).
At £150/month for a 20x MAX plan, the breakeven point is nearly 3 years - and we haven't even factored in electricity costs or upgrades along the way.
I get your point about being able to do other stuff with the hardware - so it does skew the calculus, and only you can put a value on the ownership factor - but it's hardly a slam-dunk win, is all I'm saying.
RealestNagaEver@reddit
What kind of generation speed do you get with 2x3090 and 27b model?
chimph@reddit
I ran exactly the same model (27B 8-bit Unsloth) on a 128GB M5 Max MB and was getting 15 tok/s
For the 36B model @ Q6_K I was getting 85 tok/s
For a coding test they both do really well but 27B was maybe slightly better. At the slower speed it took the 27B 5mins of thinking vs 45secs with the 36B.
Qwen3.6 seems to be very impressive.
migsperez@reddit
I just gave it a spin: "build a website for an electrician". I've tested many models including some of the big ones. Qwen 3.6's output was one of the best, I'd put it in second place. It used about 27GB of VRAM.
ukieninger@reddit
I got ~9 t/s on a 48GB M4 Pro MB with the 27B-8bit-MLX version
Apprehensive-Cap7551@reddit
Are you using via mlx, or with ollama + claude code ?
chimph@reddit
using the MLX version in LM Studio
antunes145@reddit
Got the same results and I thought I was doing something wrong.
TheLexikitty@reddit
Happy cake day!
chimph@reddit
I’m thinking of running my coding sessions with the 36B and then offloading review to the 27B. I honestly didn’t think I’d be replacing Claude/Codex, and I’m sure I still won’t, but I’m now thinking that I’ll actually be using local for coding a lot more than anticipated.
sdfgeoff@reddit (OP)
27tk/s on the 27B, but I haven't tried to optimize anything.
Icy_Restaurant_8900@reddit
I’m getting 30 t/s at 4k context on RTX 3090 + 3060 Ti OC (505GB/s) with the Unsloth Q4_K_M of Qwen3.6 27B. Just using the auto fit, and I assume it’s using some kind of tensor parallel so it doesn’t drop to the speed of a single 3060 Ti.
Single_Ring4886@reddit
Please, please, how are your prefill speeds (the context reading)? Mine are like 50 t/s!!!! which is so slow
Danmoreng@reddit
You can most likely improve that by a lot. The simplest probably is the speculative decoding with default values which got added to llama.cpp just two days ago: https://github.com/ggml-org/llama.cpp/pull/22223
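Alongside the ngram approach in that PR, llama-server has long supported draft-model speculative decoding. A rough sketch (the model paths here are hypothetical placeholders, and you should verify the flags against `llama-server --help` on your build):

```shell
#!/bin/bash
# Sketch of classic draft-model speculative decoding with llama-server.
# A small draft model proposes batches of tokens that the big model then
# verifies in one pass, which mostly helps decode (not prefill) speed.
# Both GGUF paths below are made-up placeholders.
llama-server \
  -m /models/Qwen3.6-27B-Q8_0.gguf \
  -md /models/Qwen3.6-0.6B-Q8_0.gguf \
  --draft-min 4 \
  --draft-max 16 \
  --ctx-size 65536 \
  --port 8001
```

The draft model needs a compatible tokenizer/vocab with the main model for this to work.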
CheatCodesOfLife@reddit
Does this work for Qwen in mainline llama.cpp:
`-sm tensor`? If so you'll get like 2x t/s.
SexyAlienHotTubWater@reddit
Presumably that's without DFlash/DDTree?
Cold_Tree190@reddit
Do we know when those are coming to llamacpp?
Wise-Hunt7815@reddit
vLLM: 48 t/s; llama.cpp: 27 t/s
zipperlein@reddit
~80 t/s with MTP 3 spec decoding.
MrMisterShin@reddit
27t/s @ Q8 and 37t/s @ Q4_K_M
jikilan_@reddit
Can reach 23 in Linux
szansky@reddit
So if I want to compare Qwen 3.6 to Claude Code or Codex what will be the results in coding?
vr_fanboy@reddit
Anyone else having issues in CC with repetitions?, for example.
● Starting bench-judge-lb13 for judgement 68a1311850b28dc5b2f7c. Let me load the bench data and MongoDB document in parallel.
Searched for 1 pattern (ctrl+o to expand)
● Let me correct the path and retry.
● Running bench_fetch.py and loading MongoDB doc in parallel.
● The path got mangled - let me find the correct skill directory and script.
● I made typos in the path - let me find the correct directory and file.
● Let me find the correct paths first.
● I made typos in paths - let me correct.
It will go on for a while trying to 'find the correct path'.
It happens with other skills too, this is my current config with a single 3090:
```
exec "$LLAMA_SERVER" \
  --model /models/Qwen3.6-27B-UD-Q5_K_XL.gguf \
  --alias "dev_ml_model" \
  --spec-type ngram-mod --spec-ngram-size-n 16 --draft-min 4 --draft-max 32 \
  --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 \
  --jinja --ctx-size 65536 --parallel 1 \
  --fit on --fit-target 0 -fa on -ctk q8_0 -ctv q8_0 \
  -b 4096 -ub 1536 --cache-ram 0 --ctx-checkpoints 12 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --reasoning-format deepseek \
  --presence-penalty 0.1 \
  --repeat-penalty 1.0 \
  --host 0.0.0.0 \
  --port 8001
```
sdfgeoff@reddit (OP)
Claude code doesn't handle short context windows /at all/. It assumes every model has 200k IIRC. So that could be your issue.
chimph@reddit
I had this exact issue with using superpowers plugin in opencode. Stuck in a loop saying it had made typos. Considering superpowers is meant for CC, you think that’s the same issue?
FullOf_Bad_Ideas@reddit
is this honestly counting the prefix cache discount rate that Anthropic has?
or that you could be using Qwen 27B from OpenRouter?
there are many ways to tweak the presentation of LLM costs to show or hide them.
Sad-Arrival46@reddit
The cost math is compelling. $142 in API calls vs $4 in electricity is the kind of comparison that makes the case for local clearly.
What I find interesting is the middle ground, not everything needs a dual 3090 rig, and not everything needs $142 in API calls. I've been running a setup where a Conductor model classifies each request and routes it. Simple tasks go to local (free), complex tasks go to paid APIs (cheap per-token).
The real savings come from not sending simple tasks to expensive models at all. Your $142 estimate assumes every request goes to Anthropic. In practice, probably 60-70% of those requests were straightforward enough for a local model, and only the complex architecture decisions needed frontier capability.
Built an engine that handles this routing automatically if anyone's curious: https://github.com/hlk-devs/nadiru-engine. It discovers your local models through Ollama and your paid APIs, then the Conductor decides where each request goes. Your dual 3090 setup would be the primary target for most tasks, with paid APIs as the fallback for the genuinely hard stuff.
Great writeup on the payback calculation though. That kind of concrete math is what people need to see.
Canchito@reddit
Qwen 3.6 is not only really usable for coding, but also writing as well as other applications. I thought I was done being pleasantly surprised for the month after Qwen 3.5 and Gemma 4, but damn...
These improvements in smaller models are very welcome at a time when the large api providers are collectively shitting their pants.
Real_Ebb_7417@reddit
Qwen3.6 is useful for writing? I have to try, because tbh while Qwen3.5 was the best at coding from local models, it was definitely worse at creative writing/rp than Gemma/Nemotron/Mistral.
mxmumtuna@reddit
Qwen would love to tell you a story about Elara and her detecting the smell of ozone.
SkyFeistyLlama8@reddit
Toasted GPU? LOL.
I keep a big case fan pointed at my laptop to keep it cool during inference so I'm paranoid about inference potentially wrecking hardware.
necile@reddit
No. Ozone is a reference to how bad writing models constantly say everything smells like ozone.
Real_Ebb_7417@reddit
It wasn't Ozone. It was
mxmumtuna@reddit
No, inference engine.
Canchito@reddit
I don't want to get your hopes up too much, but let's say I find it definitely a solid step above Qwen 3.5. I prefer it easily over Nemotron and Mistral, but unsure for Gemma 4.
Kahvana@reddit
Gemma 4 31B even at Q5_K_M takes the crown. It's genuinely great, at least on par with Deepseek v3.2.
Gemma 4 26B-A4B is super fast, but lacks nuance and emotional intelligence. Seems to me more like a workhorse (like OCR, translations) than a writer.
Was pleasantly surprised by Qwen3.5/3.6 27B at Q5_K_M for writing too, it's a large improvement over older Qwen models.
overand@reddit
G4-31B:Q5_K_M clocks in at just under 22 GB - sounds like there's at least a little hope for folks with a single 24 GB or dual 12GB cards, which is nice to hear! I'm spoiled on dual 24G 3090s, I should try some of these models cut down to one just to see!
Are you doing inference with llama.cpp on Linux?
Kahvana@reddit
Nope, stuck on windows sadly. Do use llama.cpp though!
ThisGonBHard@reddit
Qwen 3.5 Heretic was better than Gemma 4 in my experience.
Serprotease@reddit
Writing? I don’t know about 3.6, but 3.5 and 3 were really bland. The 27B, 122B and 235/397B.
Mistral, GLM and especially Gemma 4 (Gemma 3 was okay too) are “good” at writing.
TBH, most LLMs are actually very poor at writing, but some are not as bad as the others.
cleversmoke@reddit
Yea! I'm using Qwen3.6-35B-A3B Q4_K_M to research the stocks I own too. It's quite fascinating. Been pitting it up against Claude Sonnet 4.6, and it's holding up quite well in its research and recommendations.
SkyFeistyLlama8@reddit
Heck yeah. I tossed an old code base at Qwen 3.6 35B, waited half an hour for it to finish generating a bunch of documentation (on a freaking laptop CPU of all things) and it came up with something immediately usable.
If you have time, then these new MOEs seriously whip the llama's ass.
jrodder@reddit
This comment brought to you by Winamp.
Aham_bramhasmmi@reddit
What kind of hardware does it need? Can I run it on a Mac mini, and how are the results for coding and agentic tasks?
ScaredyCatUK@reddit
"Running on a dual 3090 rig with 200k context. "
Did you factor in the equipment cost, because that's a factor.
gargoyle777@reddit
Why did you choose the 27B instead of the 35B MoE? Execution time is waaaaay better for a very similar result
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
danigoncalves@reddit
honest question, why using Claude code with Open models and not use opencode? never used Claude code that's why I am asking.
sdfgeoff@reddit (OP)
Because I had it installed, and didn't have opencode.... In other words, no real reason.
Steus_au@reddit
try pi - it's fancy and a bit faster, and a bit extreme by default
FullOf_Bad_Ideas@reddit
I'm sometimes using Claude Code with local Qwen 3.5 397B. I have Claude subscription from employer. If my Qwen messes up I can easily switch to Opus and recover. Same thing when Claude goes offline. In practice I was doing the above but last few days I've been using OpenCode with Qwen since CC was invalidating prefix cache and it was prefilling over and over, I haven't fixed that yet. I could probably use CC subscription in OpenCode too, didn't set that up yet.
UnbeliebteMeinung@reddit
I just took a day for one feature i guess?
sdfgeoff@reddit (OP)
Took a day for a complete greenfield program. A simpleish one, but still a lot more than one feature
AdOrnery4151@reddit
These improvements in smaller models are very welcome
car_lower_x@reddit
What resource monitor tool is that?
sdfgeoff@reddit (OP)
The one that qwen vibe-coded for me. I'm pretty happy with how it turned out!
car_lower_x@reddit
Looks impressive. Is it on Linux?
sdfgeoff@reddit (OP)
Yep, it's linux. I'll chuck it on github tomorrow
inaem@reddit
Why not fp8 with vllm?
sdfgeoff@reddit (OP)
Setup. Llama.cpp is super easy. I should probably try vLLM again, but when I tried it a year or so ago it was a big faff
anonutter@reddit
Totally agree. I've been using it on a 3090 Ti and Roo Code. Now I only use Claude Code for really complex tasks that would need Opus 4.6
car_lower_x@reddit
Have to say I was a little perturbed at how long it took to think about coding tasks but the output was brilliant.
Smokeey1@reddit
Stop showing me stats and graphs, show me what you built!
sdfgeoff@reddit (OP)
That's the second picture! It's not (just) me showing off the rig, it's ... the app that qwen vibecoded for me:
It's a web dashboard for a server. A rust web backend records system performance stats and a frontend displays a graph of the past four hours. Updates over SSE every 5 seconds. Packaged into a single binary file with no dependencies.
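For anyone unfamiliar with SSE: it's just a long-lived HTTP response that the server keeps appending events to, so you can watch a stream like that dashboard's straight from the terminal (the URL and `/events` path below are made-up placeholders, not the OP's actual endpoint):

```shell
# -N disables curl's output buffering so each event prints as it arrives.
# The endpoint here is a hypothetical placeholder for an SSE stream.
curl -N -H "Accept: text/event-stream" http://localhost:3000/events
```

Each message arrives as `data: ...` lines separated by blank lines, which is why it packages so neatly into a single-binary dashboard with no websocket machinery.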
count023@reddit
what about for planning or debugging, how does qwen stack up to claude?
sdfgeoff@reddit (OP)
Some of the later prompts were debugging, and it seemed to handle it fine. Not a legacy codebase though
Zestyclose839@reddit
How did you get Qwen to run for 8 hours with just five prompts? Feel like that's impressive in itself.
I've never gotten any agent to run for longer than an hour before either failing a tool call, getting stuck in a thought loop, or just finding a reason to prematurely call it done haha.
sdfgeoff@reddit (OP)
Try the "grill-me" skill.
AdOdd4004@reddit
Man, can you share your rig components? I just got two 3090s and wanna build one too!
sdfgeoff@reddit (OP)
EvilGuy@reddit
Yeah I agree. This 3.6 27b is decent.
Seems smart enough to be useful and when it runs on your own hardware it's at least consistent.
Good for a backup at least. I don't know that I am going to be fully switching to it.
_-_David@reddit
"Good for backup" I'm a retired, hobbyist coder. And when my $20 codex subscription runs out for the first time, this will be the first time I'll consider using a local model to keep going instead of just walking away for a little while.
sdfgeoff@reddit (OP)
Yep, that's how I feel too. Maybe not replace it entirely, but no longer quite so dependent on the subscription.
PandemicGrower@reddit
You can easily add 3742 lines of code and remove 1077 lines on the $20 plan; my first app has way more than this, and it was made in my free time
exaknight21@reddit
https://i.redd.it/0hdyg9amzuwg1.gif
Anthropic right now with their “investors”.
NNN_Throwaway2@reddit
So was qwen 3.5
Ok-Measurement-1575@reddit
3.6 is significantly better, somehow.
I dunno if 3.5 was actually broken but it feels like they could have released this as Qwen 4, easily.
Far-Low-4705@reddit
eh, idk.
i feel like on the benchmarks there were no significant differences really between 3.6 and 3.5, except for agentic benchmarks
Ok-Measurement-1575@reddit
Exactly.
The benchmarks look near identical but real workloads they are worlds apart.
Far-Low-4705@reddit
what do you use local models for? what harness do you use?
I havent really noticed a major difference in some of my use cases
Ok-Measurement-1575@reddit
App development, diagnostics, data mangling.
Custom agent, opencode and roocode.
sdfgeoff@reddit (OP)
IMO Qwen3.5 was kinda useful. Qwen3.6 is definitely useful. But yep, Qwen3.5 also did pretty good inside the agentic harnesses!
SleepsWithAMachete@reddit
How does it compare to Sonnet 4.6?
NineBiscuit@reddit
Agentic coding*