Devs using Qwen 27B seriously, what's your take?

[-]

Unlucky-Message8866@reddit

i've been exclusively using it since release, this one is already "good enough" for my needs. here's a massive deslop refactor it did to the pi extensions opus wrote for me a while ago, just asking it to fix errors from eslint/fallow: https://github.com/knoopx/pi/commit/0a31b9ac241ea4949e8403cf02473b01e7911f1b

[-]

mtomas7@reddit

I hope you will not mind if I "borrow" some stuff for my own Pi setup ;)

[-]

Unlucky-Message8866@reddit

go ahead :)

[-]

Fragrant_Scale6456@reddit

Very cool. I have a 5090 and tried for a while to get the qwen and gemma4 moe models to work at an acceptable quality since I was getting over 200tok/sec with those but qwen 27b is just so much smarter that I have to use it for my vibe coding level skills. It can be frustrating though since I only get around 60t/s with it

[-]

formlessglowie@reddit

Is 60 t/s not enough for you? That’s around what you get with GPT/Opus in regular use, if not more, unless you use fast mode, which is much more expensive. Codex can generate at faster rates, but it stalls in between bits of work enough that the overall experience feels slower than what I get with Qwen at 50 tk/s.

[-]

Fragrant_Scale6456@reddit

60 is definitely enough to feel productive although with some larger tasks I end up having to take breaks or watch YouTube or whatever while I wait. 200 felt positively fast and could work faster than I could think about what the next steps should be.

Either way I’m grateful to be able to run such powerful software on my home pc at usable speeds

[-]

ang3l12@reddit

Here I am with my strix halo machine getting about 15-20 tps max with Qwen 3.6 27b and thinking it is good enough…

[-]

Oshden@reddit

Same

[-]

FinalCap2680@reddit

And here I am, mostly on CPU, getting 1.5-2 t/s and waiting to try Qwen 3.6 122B when released 😄

Are you running BF16 or quant?

[-]

brother_spirit@reddit

Maybe the enforced breaks are a feature, not a bug? I find those 5 minutes windows of waiting for agent are good times to reflect and consider.

[-]

u23043@reddit

60t/s would be enough if these smaller models were as good as the larger ones but in my experience they need more thinking tokens and more try/fail/iterate loops to produce working code versus larger models. So even though they generate about as fast as bigger models on better hardware, they take longer to do just about everything.

[-]

mr_zerolith@reddit

Problem with 60 tokens/sec is that it can easily become 20-30 tokens/sec as that context window gets loaded up ( IE you are really using your LLM ).

[-]

windwardmist@reddit

Right as long as it’s faster than I can read I don’t care. I’d rather power cap my gpu and save the heat and power.

[-]

QuinQuix@reddit

Wsl or Linux?

Question, if all the programmers are on Linux how do you develop windows games and apps?

Seems like a hassle and less reliable to do the testing to remotely or in a vm.

[-]

u23043@reddit

Windows games and apps are tiny percentage of total software, so even if most devs were on Linux (not at all true) it wouldnt really matter because there are plenty of Windows devs in comparison to the amount of Windows software.

[-]

thrownawaymane@reddit

A well set up VM is no hassle at all compared to the inverse/having a whole separate box.

Definitely needs to have its own gpu though.

[-]

Fragrant_Scale6456@reddit

I’m using Linux but I’m developing web apps and data analysis tools for my home business so it’s an appropriate environment for that. At some point I do need to make mobile apps but I have a lot to learn before attempting that.

I did start with LM studio in windows but wasn’t happy with it. Opencode and llama.cpp is treating me well so far.

[-]

qudat@reddit

Wait you have a 5090 and only get 60tps output?? That doesn’t sound right. I have a titan rtx 24gb and get 25tps Q4 quant with Q8 kv cache llama.cpp

[-]

Fragrant_Scale6456@reddit

I’m using Qwen3.6-27B-UD-Q5_K_XL.gguf with q8 kv cache and 256k context in llama.cpp

[-]

Joscar_5422@reddit

May you please share your settings for qwen 35b a3b and qwen 27? i got a 5090 but never topped more than 180 ish with 35b.

I used it for word tasks so the speed of 35b is too addictive.

my current gives 170-180:

llama.cpp/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \

--host 0.0.0.0 \

--port 8080 \

--ctx-size 200000 \

--n-gpu-layers -1 \

--threads 16 \

--batch-size 512 \

--parallel 4 \

--flash-attn on \

--temp 0.7 \

--repeat-penalty 1.1

[-]

Fragrant_Scale6456@reddit

I don't have it anymore but main difference is I was using parallel 2. I also had ubatch 256, --cont-batching on, and q8 kv cache

[-]

phazei@reddit

I've seen a bunch of posts about people getting it to go at over double that with the right setup even on a 3090. Have you see those or made the attempt?

[-]

layer4down@reddit

https://youtu.be/vbPGvvSB8IQ?si=565O8s6ykt738hwY

[-]

teachersecret@reddit

I managed to get it up over 100t/s on the 4090 with 4 bit + dflash (I tested at 128k context and worked fine) but when I did some heavy testing at that level, the greedy decoding etc had enough of a negative effect that I went back to the slower 40t/s 128k llama.cpp setup I have going instead.

Hoping for the same thing - this thing's a lil' beast and I just want it to go faster.

[-]

n00b001@reddit

Is there a guide for dflash qwen3.6 27b 4bit on 4090?

[-]

teachersecret@reddit

I ended up following their github (well, I threw it at Codex, and chatgpt did most of the heavy lifting). It's mostly identical to how they set up the 3090. Minimal changes.

I can tell you the model is degraded running in dflash with turboquant. IDK what the benchmarks would say, but it's noticeably worse and I went back to the slower 40t/s llama.cpp setup I was using almost immediately. Your mileage may vary. Let me know if you do go that route.

[-]

CptZephyrot@reddit

Which quant are you using for the 40t/s setup?

[-]

teachersecret@reddit

4 bit llama.cpp pulls over 40t/s single user, and over 100 with multi-user batching.

[-]

ArtfulGenie69@reddit

I'm not sure either what he was using. My guess is vllm but that's because I don't think any llamacpp works with dflash. I haven't heard about the degredation from dflash yet but who knows maybe the combo with turbo quant does something nasty to the model or it's implementation is messed up.

When I tried ik llama a while back I had a lot of issues with batching and it getting really weird and stupid when I would continuous batch with it. Mtp didn't work back then on it but it did work on vllm. I think now mtp works on ik for qwen3.5+

[-]

n00b001@reddit

Yeah I am getting about 40tps with just llamacpp, 200k context, --parallel 1

[-]

qudat@reddit

That’s my problem: I need speed. I want 100 tps with the 27b

[-]

The_RedWolf@reddit

Out of curiosity, what's your GPU?

[-]

Unlucky-Message8866@reddit

5090

[-]

jacek2023@reddit

more t/s are possible in the future, look at WIP MTP PR https://github.com/ggml-org/llama.cpp/pull/20700

[-]

Admirable_Reality281@reddit (OP)

Honestly, that might actually be possible with the right hardware and a bit of setup trickery

[-]

Unlucky-Message8866@reddit

Yeah definitely coming soon, there's a few things for sm120 and speculative decoding being cooked by community

[-]

Virtamancer@reddit

Speculative decoding should be built into either the models or their releases and the runtimes that support them. If it's anything like the hype, then it can radically increase generation speed so it feels weird that it's not prioritized more 🤷‍♂️

[-]

nunodonato@reddit

Runs fine on "my" H200

[-]

Admirable_Reality281@reddit (OP)

Ahah no! I don't even have a dedicated GPU but I've seen some wild stuff on X lately:
https://x.com/pupposandro/status/2046264488832213174
https://github.com/Luce-Org/lucebox-hub

[-]

starkruzr@reddit

interesting. I didn't see a lot about how stupid this makes it. also wonder how well this approach would work on a pair of 16GB 5060Tis.

[-]

theveganite@reddit

Honestly, it's been really great provided that I organize my code properly, ensure lots of documentation, keep individual file sizes smaller, and check the code. It needs rules and proper prompting. With that, it's been really great at automating work I would normally do myself or put into a frontier model. It's plenty fast and I haven't run into any egregious problems with my current setup, but like I said have to be realistic and work with its capabilities and not beyond them. Can't just give it absolutely everything and expect miracles like you kind of can with Opus.

Qwen3.6-27B-Q_6_K 160,000 context on a 5090.

[-]

Admirable_Reality281@reddit (OP)

Can you please give a concrete example of the kind of task where it breaks down compared with Opus?

By the way, might be useful for you: 160k is huge already, but just in case you ever want more headroom for longer sessions, I was able to run the same quant (Q6) with 262k context and Q8 KV cache on a rented 5090 using a simple llama-server setup

[-]

Embarrassed_Adagio28@reddit

I find qwen3.6 27b q5 to be on par with sonnet 4.6. As others have mentioned, it can have knowledge gaps in edge cases but allowing it to websearch does wonders to fix this issue. I am currently setting up Hermes agent with it, since it has self improving skills I think it surpass sonnet 4.6 over time.

[-]

itroot@reddit

I use qwen3.6 27b (q4) on llama.cpp with pi. That works extremely well for me. Also, I'm a Claude Code user. I would say that 27b could be substitute for Claude Code if:

- you are willing to break down to smaller tasks. So more hand-holding is needed. (Bad and good, depends how you look at it. Good as you will get better at breaking down)
- it has knowledge gaps. So you'd better provide it with docs access, or - give it a ability to ask for a help from a bigger cloud model

Doing these 2 things, I can't really distinguish it from Claude. So.... 😁

[-]

dinerburgeryum@reddit

This is exactly correct. Q6 27B user. Proper access to documentation is a must. Smaller focused tasks with more human in the loop. Since both of those things were already my style it fit like the missing piece in a puzzle for me. (Still looking for a better harness, also for better up-to-date docs scraping.)

[-]

Admirable_Reality281@reddit (OP)

Haven't really had the chance to test it on truly obscure stuff, but I'm definitely going to keep using it and see how it holds up.

As for the hand-holding.. I'm honestly leaning more and more toward that approach. Every time I let go too much, even with GPT-5.5, I've seen some awful stuff: unhandled cases, unnecessary levels of complexity, or just terribly convoluted stuff.

[-]

Zestyclose839@reddit

Same here. Feel like Qwen is finally helping me become a better software architect, since it's amazingly smart, but I actually need to be part of the process. Also nice that I can debate with it all day on design choices without racking up any usage limits -- had to ditch Claude Code for that reason, since an hour of planning (not even reading files or anything) is enough to hit the 5-hour limit.

[-]

superdariom@reddit

Codex broke my code base today then told me I was out of credit and try again later

[-]

dinerburgeryum@reddit

Classic.

[-]

dinerburgeryum@reddit

I’ve never used one of the big providers, in part because of stories like this. When you take the human out of the loop it seems to end in a shambling nightmare of code strung together with bird shit and duct tape. I’ve been experimenting with late recently, which really feels like it would be a home run for this kind of work, but it’s a little bare bones. Still, the idea feels exactly correct. https://github.com/mlhher/late

[-]

Admirable_Reality281@reddit (OP)

Time to merge with the machines 🧠😂

Glad to know I’m not the only one noticing this. Never heard of Late, seems like a cool concept

[-]

dinerburgeryum@reddit

Yeah it gets a ton of stuff right. I’m trying to flex it to plan out a VSCode extension which captures the best parts.

[-]

Tamitami@reddit

Try forge code as a harness, Opus etc, perform even better with it than with native claude code. I'm pretty happy with it as it integrates so well into zsh.

[-]

dinerburgeryum@reddit

It looks nice for sure, but are you using the ForgeServices add-on? Can I evaluate the source for it? The sparse documentation on it is kind of a bummer.

[-]

horrorpages@reddit

Curious. When you say documentation are you referring to detailed specs/tasks only or something else like programming/framework/design guides, both? I do the former with as much detail as possible (limited to no ambiguity) but always looking for more edge in my process.

[-]

dinerburgeryum@reddit

I mean literal API calls. 27B can retain “smarts” but it just can’t be asked to learn every method signature of every framework in every language. Getting the exact right calls with the exact right signatures is one of my last big hills to climb.

[-]

Aggressive-Fan-3473@reddit

Did you try giving it context7? Seems like that would be an easy solution.

[-]

dinerburgeryum@reddit

Context7 seemed nice at first blush but always gives more than you want or need, bafflingly sometimes not what you need at all. Maybe that’s a byproduct of working in Unity, but I’ve found it extremely underwhelming.

[-]

LeucisticBear@reddit

Which do you use or have tried? I've heard mostly negative things about Claude code with other models, wondering if pi or open code would work better.

[-]

toad-leech@reddit

Is there a reason not to use Qwen3.6 27B with Claude Code? I am running the model also using llama.cpp and so far it has been pretty solid using Claude Code.

[-]

our_sole@reddit

Can you tell me more about your claude/llama.cpp config that runs local Claude Code?

Here's my llama-server.bat cmd (Windows):

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL ^
--alias qwen36_35B ^
--host 0.0.0.0 ^
--port 8000 ^
-ngl 999 ^
--threads 8 ^
-c 65536 ^
-b 2048 ^
-ub 1024 ^
--parallel 1 ^
-fa on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--jinja ^
--keep 1024 ^
--no-context-shift ^
--reasoning off ^
--temp 0.7 ^
--top-p 0.8 ^
--top-k 20 ^
--min-p 0.00 ^
--no-mmap

And here's my Claude shell script (Linux) ANTHROPIC_BASE_URL=http://wagner:8000 \ ANTHROPIC_AUTH_TOKEN=llama \ CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \ CLAUDE_CODE_ATTRIBUTION_HEADER=0 \ ANTHROPIC_API_KEY="sk-no-key-required" \ claude --model qwen36_35B --dangerously-skip-permissions "$@"

I have an RTX3090 with 24GB VRAM and 64GB RAM. Claude is v2.1.122.

When I try to run Claude locally with that script, I always get: There's an issue with the selected model (qwen36_35B). It may not exist or you may not have access to it. Run --model to pick a different model.

This curl http://wagner:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "qwen36_35B","messages": [{"role": "user", "content": "hello"}] }' works great

This curl http://wagner:8000/v1/models | jq works great.

But not Claude.

Any ideas? I have successfully run Claude locally with Ollama cloud and a similar claude shell script. It seems like its maybe a llama.cpp issue more than a Claude issue? Any help greatly appreciated.

[-]

No_Ad_8807@reddit

Curious to know about your pi configuration and how much is costs.

[-]

SnooPaintings8639@reddit

As someone who works with the same setup (vllm+qwen_3.6_27b_q8+pi) I 100% agree.

It is *great at coding*, but it is not SOTA on high level planning or analysis, and also does have some dark spots when it comes to instructions following.

So as long as you keep all the architectural decisions on your side, and you **know** what you want to build, it is very capable model.

Currently I just team it up with either Opus or Codex to help me manage the higher level tasks, and delegate coding to the pi.

A cherry on top: When I do use Opus for code, I like to pass the output for other models to review. I used to use Codex, but now I believe Qwen 3.6 27B is as good at this task, and it can find high to critical issues on Opus code in nearly each non-trivial code output. It is that good.

[-]

mp3m4k3r@reddit

I find the same to be true, similar setup and its bounce between 35B-A3B and 27B and theyre both very very good for something you can run locally fairly well.

[-]

philmarcracken@reddit

damm now I really want more than 12gb of vram lol

[-]

1millionnotameme@reddit

It's good. I got it running on a Mac and a 5090 and like others have said it requires more hand holding and harnesses but imo local models that are run much cheaper are going to be getting more popular

[-]

Cruel_Tech@reddit

I've been using it as my main model in OpenCode. I have it act as a team lead coordinating a bunch of subagents that run either 27b or 9b depending on how much context is needed.

For 27b on my 3090 I can only fit about 64k tokens context. Whereas 9b I can fit the max context with room to spare.

I've been using it continuously to build a workout coach app and so far I haven't found anything it's failed at. I've had to iterate and start it in the right direction for hard features like streaming data with Server-Sent Events, but it will get it eventually. It definitely takes way more iterations than something like Claude and it's UI design can be quite trash, but it gets the job done.

One thing that particularly impressed me was how long it'll work on a problem. I had it write E2E tests and it literally worked for 8 hours writing, running, and debugging tests until everything was green.

Running locally is significantly slower than a cloud API which is a bummer but its nice knowing I could completely unplug the Internet and it would still work.

TL;DR I find it to be quite competent but don't expect SOTA performance.

[-]

nikhilprasanth@reddit

How is 9B performing compared to 27B and 35B?

[-]

suprjami@reddit

The second thing I ever said to Qwen 9B Q8 in Hermes Agent:

"Search Duckduckgo for this thing and provide a summary."

It spins for AGES. Returns the response: "Sorry, I used all the tool-calling for this session searching the local filesystem for that thing, now I can't use the web search tool to look online."

I read the web search skill: "Do not use the local filesystem search too. If you are using local filesystem search, stop and use web search."

So I won't touch 9B again for agent tasks.

The small Qwen models have always been fine at text summarisation, though I prefer the writing style of Mistral Nemo 12B.

[-]

nikhilprasanth@reddit

The 9B model is very overthinking.

[-]

Cruel_Tech@reddit

I use 9B (because of the max context length) for the explore agent and for writing documentation so it doesn't have a lot of thinking to do. It can read tons of files in the repo to report back where exactly the coding agent needs to look for what it's working on.Because of that I can't really compare the two. However I find the setup to be very useful. Before I was just using 27B and it had to compact the session constantly.

[-]

quantier@reddit

what settings have you found is optimal for coding? I understand that this is very very important for quality.

[-]

layer4down@reddit

Qwen 27B: Bring your brain

[-]

apeapebanana@reddit

WebDev here, having a freaking blast with it. Ask my pi to connect ssh to my old T430 laptop with linux mint which i installed years ago, update, secure, and install PI that uses my local model. Then I asked the laptop pi to create a mini-game with p5js.

for work wise, I had to use local to brainstorm, send to gemini-pro for analysis, then reevaluate the plan. then send off to kimi-k2.6 to build out the things (i find minimax a little lacking on following instruction)

for non-essential and personal usage, Qwen 27B is lifting a lot of weights, not perfect, hoping for less repetition thinking loop tho

[-]

AlwaysLateToThaParty@reddit

It has to be said that we dev, for good and bad reasons, is a focus of many coding benchmarks. It does seem to do that well

[-]

apeapebanana@reddit

i believe for web developer, their pivot would be into their own circle of interests, and webdev services as secondary

[-]

Flamenverfer@reddit

Its good.

[-]

MasterLJ@reddit

It's performing the tasks I ask it to do quite well, it can rival paid SOTA models with the right harness. It's even correcting designs made by SOTA models.

I'm using a full vLLM setup on an H100 and FP8.

Can't say enough good things about it, I'm trying to cut the cable from Anthropic... messing around with Mistral Medium 128B as the orchestrator this morning.

[-]

dtdisapointingresult@reddit

it can rival paid SOTA models with the right harness

These words doing 99% of the lifting here. 95% of the people on this sub won't be using Qwen 27B anywhere close to paid SOTA models. I'll believe it when I see it.

If you disagree, write a guide on how to setup said harness. I don't mean "Well, just install Pi and experiment and improve it on your own." That's not an answer. That's like saying "Just get a bank loan and build your million-dollar business."

[-]

MasterLJ@reddit

I'm not here to make you a believer.

I take some from these subs, and I give some... OP asked for a survey of opinion and I gave it. I included my exact settings and I gave a reasonable over view of what my harness looks like.

A harness that works is valuable. The harness I built also leverages my 25+ years of software building experience. It's not for you.

[-]

riceinmybelly@reddit

The way he asked was so entitled and you answered correctly but still you got me very interested! You built your own harness?!?

[-]

rpkarma@reddit

I recommend doing so! It’s easier than you’d think and it’s a great way to learn how these things work.

[-]

MasterLJ@reddit

I mean, yeah, it's not as sexy as it sounds I guess, but an LLM generates a plan (I call them Missions), they are negotiated, and then executed by "lesser" models. For me, Qwen3.6 27B dense as mentioned above are the workers. One H100 can support 3 sequences. I think I have a mismatch between my token window size and KV cache size, but it's all working well.

Large LLM, could be Opus, could be local (I'm testing Mistral Medium, the new one, right now) generates the plan, initializes the runtime, then lords over the run.

[-]

riceinmybelly@reddit

I was a bit bummed we got a flash today instead of the 122B but I plan on using that instead of the 27B, I now run 35B for tool calls, lookups and terminal cmds, once info is gathered it moves to 27B to plan, then the plan gets a second opinion from glm 5.1 and is pushed as an implementation proposal to opus in CC CLI. Opus has access to the initial ask and resources and my vault has some methodology/skills/strategies specific for the project

[-]

MasterLJ@reddit

I have Mistral Medium 122B dense up and running on a B200 right now, getting to know it. I'm impressed. It's very verbose and "planny" but that's kinda what I want.

Debugging a context size mismatch but so far I'm impressed.

I was about to let it rip as the orchestrator of my harness, but it's been little by little.

I have successfully run Qwen3.6 35B MoE as the orchestrator and it did pretty poorly.

Opus 4.6 or even 4.7 are extremely good at playing the role of orchestrator. Will you let you know more about Mistral Medium as I learn more.

[-]

riceinmybelly@reddit

Thanks!

[-]

riceinmybelly@reddit

Oh and I doubt a dense 122B wouldn’t make me cry, my bandwidth is only 400GB/s and the prefill on Macs is excruciating

[-]

Admirable_Reality281@reddit (OP)

Uh! Mistral Medium 128B, not to switch topics, but how did it go?
On paper the numbers don't look that impressive.

[-]

MasterLJ@reddit

I've not had enough time with it to make a good review. It's up and running on a B200, I'm trying to get it to work with my harness (I can specify a model to work as the Orchestrator).

Tooling isn't working properly with Cursor and it's driving me a little nuts.

Very very prematurely, it's fast and the quality seems pretty reasonable for reasoning.

[-]

Admirable_Reality281@reddit (OP)

You'll get there! Always rooting for Mistral. Hopefully it ends up being better than the paper numbers make it look

[-]

nunodonato@reddit

vLLM and H200 here. 🤓

[-]

Kahvana@reddit

It's "good enough" and that's all it needs to be to be useful.

You can't "vibe code" with it in the sense that you can be vague and expect a good result. But when you give it small tasks, review it's output like it's a starting intern and give it very specific instructions, it works for what it needs to do and can really save time.

Personally I found that using a bigger model (GPT 5.3 Codex) for planning and then having Qwen3.6 27B execute the task didn't save me time. Without steering it directly, the quality was lacking (more cleanup work).

Higher quants might perform better, I am using Bartowski's Q4_K_L quant with Q8_0 for KV cache to run full context.

[-]

SirBardBarston@reddit

What is you performance like?

[-]

Kahvana@reddit

Slow! (but I don't mind considering the low cost of my machine compared to others here, if I want it to be faster I'd use Qwen3.6-35B-A3B or Gemma4-26B-A4B instead).

Qwen3.6 27B:

Processing: 1280 t/s at 32k, 710 t/s at 100k
Generation: 20 t/s at 32k, 14 t/s at 100k, 9 t/s at 260k

Gemma4 31B

Processing: 970 t/s at 32k, 620 t/s at 100k
Generation: 17 t/s at 32k, 9 t/s at 100k

[-]

thirteen-bit@reddit

You can check llama-server switch --no-mmproj-offload if you need to conserve vram and use image input occasionally.

[-]

SirBardBarston@reddit

I was thinking about the same setup, but have looked into a 3090 instead now.

[-]

Kahvana@reddit

Fair enough!

If raw performance is what you're after, then 2x 3090s will give you more for sure (especially using nvlink bridge). You might want to consider AMD Radeon AI 9700 Pro's instead; roughly the same power consumption with more VRAM and blower-style cooling for when you stack them without spacing. Not sure for which the bandwidth is higher though.

Personally I was far more concerned with energy consumption (which is very low for 2x RTX 5060 Ti's, \~250W for heavy inference), noise levels (the ASUS PRIME RTX 5060 Ti 16GB is very quiet) and heat generation. On that end my cards do extremely well, going off-the-grid for them is actually possible but much tougher for 3090's.

[-]

SirBardBarston@reddit

Is AMD equally well performing? I only really see people using NVIDIA it seems.

[-]

Techngro@reddit

"Personally I found that using a bigger model (GPT 5.3 Codex) for planning and then having Qwen3.6 27B execute the task didn't save me time. Without steering it directly, the quality was lacking (more cleanup work)."

Did you mean that using 5.3 for planning did save you time?

[-]

Kahvana@reddit

Sorry, english isn't my native language! How would you word it instead?

And no, it sadly didn't.

[-]

Techngro@reddit

No, I completely misunderstood what you were saying. Apologies.

I am also planning to use Chatgpt and Claude to plan and then use local models to execute. Do you have any hypothesis as to why Qwen couldn't follow the plan well?

[-]

Kahvana@reddit

Can't say much for Opus as I haven't tested that due to costs, you might have better luck with it!

For GPT 5.3 Codex (from Copilot):

It REALLY likes smoke tests written in powershell, I have to actively correct it each time to use dotnet tests instead (even if I've specified multiple times to not write powershell tests).

It tends to "one-up" / overthink / overplan / finding too many edge-cases that don't need resolving, not sure how to word it.

Codex can feel overconfident sometimes too, and a little stubborn.

Because I have to correct codex multiple times, the time it would save writing a plan and then having qwen run it, is lost. It's really not qwen's fault.

[-]

octopus_limbs@reddit

It is perfect for development but not for vibe-coding - you really have to describe what you need and make it think less. It helps with cutting down on coding for me; I use it to do the small things when I run out of Claude tokens

[-]

Rerouter_@reddit

Its great at getting tasks done, its a bit too eager when its not a clearly defined task, If push comes to shove it will take some wild paths to accomplish a task,

I've been trying to tweak my tooling to get it working how I'd like, the biggest issue is the slow token rate, and you do need BF16 for reliable operation over 100K context, so \~25t/s on a 6000 pro,

I've been giving it mainly directed small scope tasks, e.g. here is a folder your working in, write a program to accomplish X, or get data from 2 API's figure out how to map A to B, and find all the edge cases,

[-]

Party-Log-1084@reddit

Been daily driving the 27b for a bit now. It handles actual implementation and bug fixing surprisingly well, but you definitely have to steer the ship. It falls apart on high-level planning or massive codebase-wide refactors where Opus or Claude usually excel. If you just break tasks down into smaller steps and keep a tight TDD loop, it easily replaces the paid APIs for like 90% of your daily grunt work.

[-]

Due_Net_3342@reddit

not great, lacks the world knowledge of bigger models

[-]

Orolol@reddit

It's good for it's size, and it's amazing that I can run it at home, on my own GPU, at 100+ tok/s.

But i'll be honest, I won't use it for something more complicated than basic programmation (simple website, APIs, etc ...)

I'm working on Deeplearning projects and I won't trade Opus for this.

[-]

suprjami@reddit

What are you using to run at 100+ tok/sec?

[-]

thecodeassassin@reddit

honest take; I've been using it for close to a week on serious tasks and what I've seen is:

It sometimes starts things but doesn't finish them, adds stubs even though my agents.md forbids thar
Secure coding is not a thing, needs a review pass from 5.3 codex most of the time
Not suitable for large tasks, needs to work on a small feature, review, next feature. Otherwise too much gets lost.

For example it implemented a whole API but forgot about all UI aspects.

Good but needs hand holding and very strict task definitions. Always check the output of any model but especially a small dense one.

Actually still happy with it, best model for its size.

[-]

AlwaysLateToThaParty@reddit

As with all things LLMs, your workflows influence how successful or unsuccessful they will likely be. Technical considerations aside, different people can use certain models very effectively, and some less effectively. If you have a workflow that takes away a process, especially from a person whose time is expensive, they are a productivity multiplier.

I use mine mainly for pure professional language analysis of private documents, so I got the rtx 6000 pro. But what I'm also able to validate is whether the 'smaller' models do that. My experience is that if it isn't greater than 30 or so GB at full context, it's frustrating. More than that, and it's about which model suits that workflow where you're trying to introduce an efficiency. Different models have different capabilities so it's about your use case.

I use the qwen 3.5 122b/a10b heretic mxfp4_MOE quant. I've tested most of those other ones, and they were good enough. The programming side they did well. The professional and detailed analysis, it didn't do well in. As in, validated that there's something at that level it doesn't do. Too many inaccuracies, no matter how you fashioned its prompt.

But what is does do well, it does well. I'm talking about full quant 27B taking 70 or so GB of VRAM. A quarter of the speed and not as good quality as the 122b model. I recognized that it needed structured inputs, but good if it got them.

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

kmouratidis@reddit

People forget than Claude 3.5/3.7 Sonnet were useful models and pretend only 4.5+ exists. Those that remember using 3.5/3.7 will tell you that even Qwen3 is useful. I personally used everything from (Llama) Nemotron 70B, Mistral Small 3.1, even gpt-3.5-turbo, and even though nothing was ever "good enough", everything could be useful and save time.

[-]

Admirable_Reality281@reddit (OP)

True, the threshold for "good" just keeps moving up

[-]

Evanisnotmyname@reddit

Imagine talking about this two years ago.

[-]

Admirable_Reality281@reddit (OP)

Can easily imagine the catastrophic headlines 😂

[-]

Evanisnotmyname@reddit

AI TAKES OVER HUNDREDS OF THOUSANDS OF COMPUTERS AND CONTROLS THEM REMOTELY, SENDING CHATS TO EACH OTHER

[-]

Intelligent_Ice_113@reddit

I'm sure in two years most of us even forget openAI ever existed.

[-]

XccesSv2@reddit

yes good is relative because when you can work way more efficient with paid sota models why should you pain yourself with local models. A bit less intelligence can mean that you have spent hours of more time into debugging something the sota model may can fix. So good just means, when the gap is close enough that its cheaper to run local models.

[-]

rmhubbert@reddit

I'm very impressed, so far. It's been edging out Qwen3-Coder-Next for me recently, which is high praise.

For anything other than minor tweaks, my workflow involves a web search assisted research phase, planning phase, and task breakdown phase for anything I ask an LLM to do, regardless of the size of model, before I let it write any code. Within that workflow, 27B really shines, the quality of code is excellent, certainly on par for any of the frontier models I've tried.

Outside of that workflow is no doubt a different story. My advice is to use the tools that best suit the way you want to work. I only switched to local only LLM because those models suit my workflow, if they didn't, I wouldn't sacrifice the quality of the work.

[-]

windictive@reddit

What are you using for web searches?

[-]

rmhubbert@reddit

SearXNG via MCP. I run a private instance on one of my servers. Before that, I also got good results from Tavily via MCP.

[-]

Substantial_Swan_144@reddit

As long as you don't expect it to keep coherence on very large files (or on large multiple files) you should be fine.
Also, keep in mind smaller models are also less accurate on more obscure knowledge.

[-]

Ueberlord@reddit

The point you mention that all models do not really care for a good code housekeeping is the one thing which really hinders me from always blindly using LLM for coding. It will very often introduce duplicate new methods where it should have rather re-used some older method (it had read the helper class before so it came across it for sure).

This is why I have a clause in my global AGENTS.md for usage with OpenCode where I instruct the model to conduct a review for duplicated and prunable code each time it is done with its current task. But it does not work well enough.

Maybe we need a dedicated janitor/housekeeper trained model which cleans up after the construction troop went over our codebase...

[-]

Admirable_Reality281@reddit (OP)

It did trip up for me today on a pretty annoying refactor across multiple files.

Around file 4 or 5, it started losing the thread a bit or it tried to automate parts of the refactor instead of handling them one by one.

That's the only major negative experience I've had. Outside of that, it's been surprisingly solid. Even with large context windows (\~150k), coherence has held up nicely.

[-]

duhd1993@reddit

What coding agent are you using?

[-]

Admirable_Reality281@reddit (OP)

I've tested it only with Kilo so far.

[-]

ScoreUnique@reddit

I suggest try goose , even 35b runs well on it

[-]

juanchob04@reddit

Could you give us more information about your setup (hardware, software) and model quant/speed?

[-]

Admirable_Reality281@reddit (OP)

I tested it on a single 5090 with a pretty basic llama.cpp setup, Q6 with Q8 KV cache.
Didn't benchmark the speed, but it felt like roughly 60 tok/s.

[-]

yensteel@reddit

Local models have always been terrible with powershell 5.1 for me, as there's a ton of traps. They're certainly decent with basic C++ and Python though.

[-]

StatusSociety2196@reddit

I think it's just an overcorrection from when you asked for anything and it would immediately start writing from scratch even if there's clearly specified scripts that do 70% of what you need and you only needed that last 30%.

[-]

Voxandr@reddit

3.5 122BMoe is much better

[-]

formlessglowie@reddit

Experience depends a lot on what you expect from it. For me, the intelligence in 3.5 was already obviously mesmerizing for its size, but I never got to use it much because it was so painfully slow when you put together sub 30 tok/s for the larger quants + the never ending thinking tokens. I kinda just resorted to Gemma4 26b instead and was mildly satisfied.

Fast forward to 3.6, benchmarks and anecdotal reports were good enough that I decided to make a serious effort at improving my setup and extracting all I could from the model. Switched to vLLM, learned to set up MTP speculative decoding, a few more tweaks, and voila: INT4 running in full 262k FP8 context at 50+ tok/s, prefill is also way faster than what I remember getting from GGUFs in llama.cpp. Now, I can say the model is AMAZING. I still use GPT 5.5 extensively for the harder stuff, but most of what I do was already way below SOTA like months ago, and having used stuff like Sonnet 3.7 and Gemini 2.5 Pro daily for months in the past, I can confidently declare this model clearly superior to those in most of my tasks. Which is absolutely nuts, because those were SOTA less than a year ago, and now I get more power in a potato PCIe 3.0 motherboard from China and two used 3090s. I mean, how awesome is that?

Qwen3.6 27b is not close to GPT/Opus current levels, don’t listen to anyone who claims that. But it’s absolutely at least as smart as SOTA from one year ago (although not as knowledgeable, if that matters for you), but comes equipped with the modern agentic capabilities the big guys lacked in 2025. You could describe it as “Sonnet 3.7/4 if it were made today for running in an agent harness”. For me, it’s absolutely amazing and I no longer fear the prospects of SOTA no longer being subsidized in the near term.

[-]

sdfgeoff@reddit

I'm also running a dual-3090 rig, and I've been considering switching to vLLM over llama.cpp for the performance/MTP. I think I'm in the same boat where it's crossed some sort of magic threshold of "Ok, local is now viable even if GPT/Claude are better"

Any suggestions/blog posts that were helpful?

[-]

Imaginary-Unit-3267@reddit

I'm using 3.6 35B, on an RTX 3060 (yes, that's all I've got) and 64 GB of RAM. I get 20-24t/s decoding, about 200t/s prefill (except with llama.cpp's inability to handle hybrid attention, it has to reprocess the entire prompt every string of tool calls, which takes minutes every time). Would you say it's worth the bother to try to switch over to vLLM like you did, or would I not likely get much improvement with my system?

[-]

Admirable_Reality281@reddit (OP)

Yes! It's truly mind-blowing, made me believe in local setups again.

> Qwen3.6 27b is not close to GPT/Opus current levels,

I honestly need more time with it. Right now, I genuinely can't tell if there's a real gap (and how bit it is) or if I'm just overanalyzing every little thing cause I'm trying to evaluate it.

[-]

formlessglowie@reddit

There is a gap. Test it with more esoteric stuff and you will see, like functional programming. But that’s pretty much model size capped, people keep coming up with these examples to diminish the accomplishment that these Qwen models represent, but it’d be alien technology for 2026 to have 27b models internalizing the same knowledge as current SOTA, which now goes well above 1T parameters. Intelligence wise, though, 3.6 27b gets pretty close to that level in most domains, and that’s simply impressive.

[-]

superdariom@reddit

It wrote me a massive detailed python module from a vague specification and some sample data and implemented examples anticipating my use case and I was sceptical it would work at all but it was flawless. So cool to have a tool like this sitting in the corner of my room.

[-]

wu3000@reddit

I switched last week and use 35B-A3B as my primary backend (fp8 model, fp16 kv cache) for python+web development. i would say 60% of the opencode tasks are solved reasonably well, but the other ones require lots of cleanup to make it work. I still find myself switching to minimax or codex for large code reviews and planning complex features.

[-]

truthzealot@reddit

I’ve been using qwen3.5 9b and it’s been pretty great. It the same model, but still relevant ;)

[-]

Enough-Astronaut9278@reddit

I've been on Qwen 27B for like two weeks doing real work, mostly Python backend and some React. Gotta say it's surprisingly good at single-file stuff. Refactoring, writing tests, yeah it handles that fine. Where it starts to struggle is when you need it to reason across multiple files, like understanding how different services connect in your codebase. I still switch to Claude or GPT for that.

That said, running it locally the value is insane. I basically treat it like a solid junior dev. Give it clear instructions and it delivers. Just don't expect it to make architecture calls for you.

[-]

DonkeyBonked@reddit

I'm primarily using a modified version of the uncensored 3.6 35B A3B. It needs some hand holding, but especially when paired with a good agent it can do alright depending on the task.

I think to use these models well for coding, you really need to build use specific (Q)LoRAs and use a RAG index for your code, the performance difference is night and day. I've been turning my own code base into LoRAs and the more I force myself to keep updating this the less headaches I get from using the models overall.

The hardest part for me has been the discipline of turning my own database of my work into training data. Even a thousand pairs can significantly impact a model's performance, and I have do much more than that, but I keep getting hung up on wanting to perfect everything before I convert it into data, then not having the time to do it.

It really is worth it though, especially if you put the right kinds of meta tags and watch out for deprecated code.

Even just taking all your past work and sorting it by language though and indexing it as RAG vectors is huge. It's really how these models were meant to be used. You can't possibly put enough data in a 27B to 35B model to make it great, but all the underlying logic is there, so if you just expand what it knows and what it prioritizes then you can see results very quickly.

[-]

turbotunnelsyndrome@reddit

What's your process for tuning these models exactly? Was there a guide that you used?

[-]

super_g_sharp@reddit

!Remind me in 2 days

[-]

RemindMeBot@reddit

I will be messaging you in 2 days on 2026-05-02 03:38:43 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

yeah61794@reddit

Using it on Windows (straight - no WSL) with llamma.cpp with Claude Code harness and a 3090 and it works great IMO.

Context length is my biggest issue for my project as I'm getting \~91k with it on llamma.cpp and it breaks for longer tasks. Not sure tok/s but it's plenty for what I need.

[-]

teenaxta@reddit

Tried using it opencode and honestly it's just ok. Thing about Claude and codex is that I just tell them to do something and off they go but this guy hell no you have to hold it's hand at every step of the way. That might look like a small issue but doing it again and again makes it really annoying to the point that I go like here's my $100

[-]

dmter@reddit

seems to me they have bad math training data, gemma4 dense is much more knowledgeable.

[-]

neo123every1iskill@reddit

It was good for simpler tasks. This seems promising tho: https://github.com/itayinbarr/little-coder

[-]

someguy@reddit

It's useful to provide me with working beginner-level code for ancient WinAPI stuff or for areas I had yet no experience in.

That output is of course vile slop and it's 5x faster/cleaner/smaller after I've rewritten it, but it is a solid base to start with.

It also helped me by explaining circular deadlocks - something I had not encountered yet and had trouble visualizing in my head.

Overall I don't use it much.

It completely shits the bed when I paste it my actual code or ask more advanced questions.

However, sometimes its retarded hallucinations give me good design ideas ("No, that's nonsense, however... maybe I could...") and it's always good when I quickly need some dumb labor.

It's a great model for beginners and that's it IMO.

For anything even mid-level I've only received satisfactory answers from massive proprietary models or models I can't run without 16 GPUs.

[-]

leftovercarcass@reddit

I dont trust it for major debug or design choices, for explaining codebase and inplementation according to specifications, yes it works pretty decently but i do not delegate understanding to it, i still rely on opus and glm 5.1 for larger debugging and complicated stacktraces. Tracing stacktraces and explaining a backtrace is something sonnet can do aswell and likewise qwen but that when i interact with the agent, not when an agent interacts with an agent.

[-]

Admirable_Reality281@reddit (OP)

Have you actually tried, or did you just not trust it enough? I haven't really had the chance to push it much over the last day and a half.

[-]

leftovercarcass@reddit

I have tried sonnet + opus and glm 5.1, they all leave dead stale code, they tend to reinvent synchronization primitives or conflict with each with certain synchronization primitives. All theee have managed to cause deadlocks, corruption of file system. So no i havent tried, mostly because the other models are more capable and still make stupid choices due to context pollution or losing context and they forget what they did. Opus at one poitn burnt my token reinventing pythons native logging when i came back i wondered how.. how could it go so bad?

[-]

ComfyUser48@reddit

I am using it a lot. Like, a LOT.

It's doing 95% of what I need. The remaining 5% I fill with basic $20 Codex plan.

It completely changing the way I work with agentic coding, bcs now I have the freedom to use it as much as I want without overthinking it.

I'm blown away on how good it is.

[-]

_raydeStar@reddit

whats a good context size that works? I can do up to 64k safely, past that, it's iffie on my card

[-]

uriejejejdjbejxijehd@reddit

Which coding agent and LLM server do you use for qwen?

[-]

soyalemujica@reddit

This is my case, however I have not relied in codex at all whatsoever or any paid models at all anymore. I wish I had more than 30t/s with my 7900xtx with 160k context with q4 quant plus q8 kv cache

[-]

ComfyUser48@reddit

I'm mostly on Q8_K with 105k context with q8 kv cache, but when I need 256k context I can do that with Q6 XL. On the 5090 I'm getting 50tps with Q8 and 60 with Q6

What I use Codex for final verification of code reviews. All code changes are now done by Qwen 3.6 27b, which is insane to me. 2 weeks ago I coding rely on any local model.

[-]

dwrz@reddit

It's the first local model I've been able to use for code at work, and I use it extensively, sometimes with multiple agents in parallel, all running locally.

I will occasionally use larger models to discuss architecture and evaluate system design, but for actually shaping the code, it works just as well, if not better, than hosted models. It's great to know I'm getting the precision and configuration I want.

It will sometimes be quite dumb, and need handholding. But I have that happen with the hosted models, too.

[-]

kant12@reddit

It's definitely usable but I can't decide if I prefer it or 3.5 122B. Both do reasonably well. My workflow has been let those two provide a solution. Pick the one I prefer. Cleanup/fix 20% of it myself and ask GPT 5.5 about the 2% I actually need real advice for.

[-]

crombo_jombo@reddit

Performance is better than 32B model and seems smarter at the same time. Gemma 4 is better at short tasks IMO but Qwen 3.6 27b 3a manages my rust monorepo workspace better than anything else I have tried at this point tho. You just have to try not to bloat the prompt with agent files and pay attention to how it handles your agent skills.md and adjust as necessary. Qwen 27 also seems to handle parallel hand offs better but I haven't exactly measured it, just seems to lag a bit less and pick back up faster. I have 16 g of cream and 64g of ddr5 5400 so unified kv caching helps tons

[-]

HongPong@reddit

it did a nice refactor and could check for api documents to find a better strategy for cache management. this was not perfect but i was impressed indeed. with opencode

[-]

viperx7@reddit

I’ve been using this setup for a variety of tasks over the past couple of weeks, and honestly, it just works. That said, I’m still a little hesitant—mainly because it’s a local model and 27B is definitely smaller than what Opus is running on but the more I use it, the more I realize I might be discriminating against it just because it’s running on my own hardware.

I catch myself over-monitoring prompts and outputs because I’m subconsciously worried about it making mistakes… which, ironically, it hardly ever does.

My setup:

Model: Qwen 3.6 27B (Q8 quant)
Context: 262k tokens (ctk format, no context compression)

I’ll be the first to admit that cloud models have the edge on wildly complex or highly specialized problems or things that require a lot of knowledge.

But I’m not solving quantum puzzles every day, and for my actual workflow, this local setup has been more than enough.

I mainly use the model for Agentic workflow and coding.

[-]

hesperaux@reddit

Using imatrix NEOCODE at Q6. It's impressively good. It has a claude feel and does a lot of work autonomously. It makes way, way fewer mistakes than the 35B. I tried 4 or 5 different 35B quants and I can't get good results. 27B is a model I can actually use. It's not perfect, and if course it's not a 4T model, but it behaves better than 3.5 122B a10B and it's better than full precision glm4.7. I am super impressed and excited about this model and I cannot wait for other variants to be released (fingers crossed for 122B and smaller ones for spec decoding).

[-]

spencer_kw@reddit

been running it for two weeks on actual production code, not benchmarks. here's the honest split: anything touching 3 files or fewer it's indistinguishable from opus. refactors, test generation, boilerplate, all clean. the moment you hit 4+ files in a single edit it falls apart. starts hallucinating imports, loses track of which module it already touched, occasionally writes code that references functions it deleted two steps ago. my setup now is qwen for the 80% that's mechanical and opus for the 20% that actually requires holding the whole codebase in its head. anyone telling you it matches frontier across the board hasn't tried it on a repo they know well enough to catch the mistakes.

[-]

Few_Water_1457@reddit

qwen 27b + vscode + kilocode + cline ---> All I need

[-]

ThePixelHunter@reddit

Kilo and Cline? You prefer one over the other at times?

[-]

spencer_kw@reddit

been running it for about two weeks on real codebases, not benchmarks. it handles the bread and butter stuff (refactors, test generation, boilerplate) at maybe 85% of opus quality. where it falls apart is multi-file reasoning across a large repo. anything touching 4+ files at once and it starts hallucinating imports or losing track of which module it already edited. my workflow now is qwen for everything under 3 files, opus for the architectural stuff. saves a ton on API costs and honestly the output quality on simple tasks is indistinguishable.

[-]

kiwibonga@reddit

Pretty good. I asked (free tier) Claude and ChatGPT for their opinion on a crash that happens in Windows but not Linux in my app and they both suggested different things. I went back and forth between them and Claude accused ChatGPT of "gold plating".

Finally they both converged on the same solution. One line to add to call a Close function on a socket after calling Stop.

Qwen pointed out that Close already calls Stop internally and the cleanup we need is in Close anyway, so we replace Stop with Close, we don't keep both.

I went back to Claude and ChatGPT. Claude was out of free credits, ChatGPT tried to gaslight me into believing that calling both functions is safer and better. In the end it admitted it was a baseless claim and that the local model was right.

ChatGPT disagreed with my characterization that the two frontier models "just got bodied".

[-]

Blaze344@reddit

q2_k_xl is finally good enough to replace gps-oss-20b for me! Finally!

I run it with opencode. It's *slower* than OSS 20b on my rig (20GB VRAM with a 7900XT) but it's finally a qwen model that 1) doesn't think forever and 2) actually is more precise and codes better than OSS-20B, and it also doesn't randomly bug out and error the format output for the agent harness.

[-]

Ok-Measurement-1575@reddit

Q2 is under rated on 35b for sure.

[-]

superdariom@reddit

The 3.6 moe model runs a lot faster and I can use a higher quant (Q5) with the same vram at maximum context - so that's what I'm using. I previously was in love with 3.5 27b but it's just to slow now for my data processing desires.

[-]

natermer@reddit

That said, I'm still not sure whether I'd fully trust it enough to move away from the big players.

Don't.

It isn't a either or situation. I run Qwen MoE versions locally and use that... I also have OpenCode Zen access configured and switch between models as I need them.

Think about using them strategically. Like setup a Ralph loop to use Qwen locally over night, but have it also start up a agent with a flagship LLM to audit its changes and tweak things like every 5th or 10th loop or something like that.

That way you keep the context size for long running things small as possible to keep your local LLM in it's "sweet spot". But you can leverage the flagship LLMs to make it so you don't have to baby sit it.

Or use a flagship LLM to build tooling, prompts and skills for local Qwen to use. That way you dedicate long running or more experimental stuff to local LLM so you don't burn through your tokens.

I am sure that if you think about what you are doing and what you want do to with a LLM you can think of something. Like maybe use Qwen for your chat bot or something.

As long as you are using agents and services that are not tied to a particular subscription then this sort of thing is something you can do to increase your LLM usage budget without costing you a arm and a leg.

[-]

buildingstuff_daily@reddit

been running qwen 27b for about 3 weeks now for actual coding work not just benchmarks and heres my honest take

its surprisingly good at understanding existing codebases. you give it a few files of context and it can follow the patterns and conventions already in place which is something a lot of models struggle with. it doesnt try to rewrite your entire architecture just to add a button

where it falls apart is complex multi-step reasoning. if you ask it to "refactor this module to use dependency injection, update the tests, and make sure the CI config still works" it'll do step 1 great, step 2 okay, and completely forget about step 3. you have to break things down more than you would with claude or gpt4

the sweet spot ive found is using it for focused single-file tasks. write this function, fix this bug, add error handling to this endpoint. it absolutely crushes those. and the speed advantage of running it locally means the feedback loop is way tighter than waiting for API responses

one thing nobody mentions is how good it is at reading and explaining code compared to writing it. i use it constantly for "what does this function actually do" type questions on unfamiliar codebases and its better than most models twice its size for that

[-]

uriejejejdjbejxijehd@reddit

Has anyone managed to use one of the 3.6es for coding with Xcode? Which coding agent and LLM server do you use?

[-]

StatusAnxiety6@reddit

I code a lot professionally. I like it a lot

[-]

D2OQZG8l5BI1S06@reddit

I'm using 35BA3B Q4KM, but I read experience will be consistent, just better, with 27B.

In a nutshell I'd say it's not always that good fixing wild bugs you would have no idea how to fix in the first place but excellent at following instructions and doing basically anything you want, the way you want.

[-]

Spiritual_Tie_5574@reddit

With linux + vllm + sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP around 100-116 tok/sec

[-]

shokuninstudio@reddit

Claims that a 27B model are as good as Claude are either lying to themselves or part of a marketing campaign to upsell you to a cloud subscription later on. They give you the little taster model and when you're disappointed they push you to use API. It's a classic sales technique used in many sectors.

If anyone, especially someone anonymous on reddit, claims a sub 100B local model is amazing at x ask them to do an uncut live stream demo with viewer requests otherwise they are not producing evidence of x. It only takes a few minutes to start a live stream.

So take all claims with a pinch of salt and only believe your own eyes.

[-]

Blaze6181@reddit

It's not the same. It's good, but it's missing that extra something. Extremely strong for 27b though.

[-]

Admirable_Reality281@reddit (OP)

Yeah, that's pretty much where I'm at too. I just can't tell if it's a real gap or me overthinking it cause I can easily remember times where 5.4/5.5 completely dropped the ball as well

[-]

rpkarma@reddit

It's a real gap, but it's smaller than one would expect at this size I think. Impressive stuff really, actually usable I find.

[-]

Zc5Gwu@reddit

I think it’s a real gap. It’s kind of like an intuition gap. It understands but not always what you mean.

[-]

rpkarma@reddit

It's decent. Misses some more complex/deep architectural things, but honestly it's more than good enough.

Way I use it is with Opus 4.7 or GLM 5.1 generating a PLAN.md then getting 27B to follow and implement said plan. Both through Pi (or obvs Claude Code for Opus. Annoying...)

I'm testing GPT 5.5 as the "planner" now too.

[-]

beedunc@reddit

Full Quant?

[-]

StardockEngineer@reddit

Love it. Great with Pi coding agent and it's making lots of good code. I often pause and do a code review with GPT 5.x and Opus, and they rarely have any complaints. They complain about each other as much as they do 27b

[-]

cenderis@reddit

I've been using it (in opencode) for a few things. I think I prefer 35B slightly because it's a bit faster but they've both been good. This was making relatively simple changes to a large codebase so very much the kind of thing I'd hope a local model (with a good harness) could do, but both models have been fine.

[-]

kcksteve@reddit

I use it with opencode and it does pretty well. It gets stuck in thinking once in a while and I'm trying to tweak some parameters to mitigate that. I am running q4_k_m with no Quant on context. With ctx size at 200k I get about 35tps. This is my favorite model so far but I will mess with gemma a bit more soon. I try not to switch too often so I put them through their paces. I have access to copilot at work and have used opus 4.5 a fair amount. It is better but it's not perfect either... looking forward to seeing how gemma 4 stacks up.

[-]

Manitcor@reddit

I am hearing good results from 27b @ minimum 64k context width. Usually paired with a context stack to guide the session.

[-]

donmario2004@reddit

Daily driver, I use it for sensitive data analytics and analysis. Can have amazing output, but at times can get confused… it doesn’t see all and that’s where I have to do my own share of analysis to get things back on track… but it truly is great.

[-]

Evanisnotmyname@reddit

What have you used to reduce hallucinations and keep it grounded in the data?

[-]

dyeusyt@reddit

I think if paired with better context engineering, this model could excel at niche tasks; by that I mean framework-specific MCP servers serving the latest documentation, as well as a skills.md defining model harness. (With code related WebSearch tools it'll be cherry on top)

This could turn out to be a monster if used in the correct way for users' specific needs. People who've got the hardware for it — have you tried development like that?

[-]

Evanisnotmyname@reddit

Have you looked into Karpathy’s LLMwiki, hybrid RAG, etc?

[-]

drwebb@reddit

Like all LLMs you gotta be on top of it. Some are smarter than others, 27B is not the sharpest tool in the shed, but it's a capable model. I have access to quite a bit of AI at work, and yeah right now I'd rather pay for DeepSeek v4 Pro

[-]

Equivalent_Job_2257@reddit

After using Claude Code it is uncomfortable. But still usable, you just should know its limits. On the bright say, you now MUST di code review, which is good in the long run, I hope...

[-]

suprjami@reddit

Using it for code explanation on an enormous established C codebase. It's very strong.

More or less the same answers as Claude but not as nicely explained as Claude.

Like all LLMs I verify everything it says but so far Qwen hasn't steered me wrong or given a false answer.

27B is easily stronger than 35B at this. 35B is not even worth the bother.

[-]

Admirable_Reality281@reddit (OP)

Didn't try the 35B. Good to know!

[-]

ieatdownvotes4food@reddit

both the 27b and 35b have their place for sure. very capable, and great for agentic work.. pushed 35b as high as 650 t/s with a 6000 pro which really lets you (or it) iterate super quickly, and run parallel operations.

I'm not too attached because there's a great new model every week.. but I don't mind hanging out with this one for a while.

[-]

jablokojuyagroko@reddit

I have been having a lot of success dumping it into my codebase to debug weird bugs that would have taken me tons of tokens with claude, also very decent for code reviews.

But i dont use it for main implementations, but to be honest it wouldnt be an issue

[-]

mateszhun@reddit

I would say it is near Sonnet 4.5 level. It can one-shot simple bugs, and some medium complex bugs when I try to debug with it. (Bug description+log dump)

[-]

ravage382@reddit

I'm using it for system agent work and basics like flask interfaces for various system panels and it's a beast at it. Its doing all the work in bash and a playwright mcp.

[-]

itsyourboiAxl@reddit

I am addicted to claude code, but I want to try qwen 3.6 since it released. Next time I hit my rate limit I will try it out

[-]

z0_o6@reddit

Just connect the local model to the Claude code CLI.

[-]

Admirable_Reality281@reddit (OP)

Let me know how it goes! I've been running it on a 5090, and now I'm seriously thinking about buying one 😂

[-]