Opinion: Qwen 3.6 27b Beats Sonnet 4.6 on Feature Planning
Posted by Zestyclose839@reddit | LocalLLaMA | View on Reddit | 28 comments
I keep hearing the argument that large models are better for high-level planning and task orchestration, since they have more general knowledge to work from when making decisions. However, I've been testing Qwen 3.6 27b (Unsloth Q5_K_M) quite a lot since its release, and it's consistently outperforming larger models on attention to detail and foresight.
Side-by-side comparison attached of Qwen (running in Pi, a lightweight harness that tends to benefit small models) and Sonnet 4.6 (in Claude Code) given the same "plan review" task using identical prompts and `Claude.md` files.
Qwen thoroughly explored the code I'd already written, catching significantly more potential issues. It better understood what I'd already built and how this feature would fit in. It also suggested an efficiency improvement, `search_and_read()`, to eliminate a round-trip, plus new categories to add to the plan.
Claude did highlight access control and made points about native vs. custom tool parsing, but completely missed the mark on understanding how the feature would fit into the existing system -- an odd shortcoming, since it has a dense memory file it's been filling in for months now.
I theorize that Qwen was trained to be less blindly self-confident and to spend more time reviewing what already exists, since token budgets matter less for a 27b model. Large models like Claude seem to cut that checking short for the sake of token efficiency.
Wondering if this stacks up with your experience of the Qwen 3.6 series.
m_mukhtar@reddit
I just finished testing a web app for generating a moon visibility map based on two research papers that show the math and calculations. I used Qwen 3.6 27b Q5_K_L in Claude Code and Sonnet in Claude Code, and gave both models the same input prompt and the two research papers. Qwen took forever because it was slow, but it implemented it perfectly, while Sonnet failed miserably in a way that I don't think it can easily fix. So I share the same experience as yours.
Puzzleheaded_Base302@reddit
If you run Qwen 3.6 27b on a 5090 or RTX PRO 6000 with MTP properly enabled, it can run twice as fast as Sonnet. It's amazing.
Zestyclose839@reddit (OP)
Sounds like a much more high-intellect project than what I was building, haha. Especially impressive considering Q5 27b Qwen had to handle the enormous system prompt in Claude Code while creating it -- that alone broke anything under 80b params a year ago.
perfopt@reddit
Will it run with 24GB VRAM?
Zestyclose839@reddit (OP)
Easily; just use Unsloth Q4_K_M so you have enough memory left for longer contexts.
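A rough back-of-envelope supports this. The bits-per-weight figures below are typical llama.cpp averages for these quant types, not exact file sizes, so treat the numbers as ballpark:

```python
# Estimate weight memory for a 27b model at two common quantization levels.
# Assumptions: ~4.85 bits/weight for Q4_K_M and ~5.7 for Q5_K_M
# (typical llama.cpp averages; actual GGUF sizes vary per model).
PARAMS = 27e9
GIB = 1024**3

def weight_gib(bits_per_weight):
    return PARAMS * bits_per_weight / 8 / GIB

q5 = weight_gib(5.7)   # roughly 17.9 GiB -- tight on a 24 GB card
q4 = weight_gib(4.85)  # roughly 15.2 GiB -- leaves several GiB for KV cache
print(round(q4, 1), round(q5, 1))
```

The difference is what frees up room for a longer context window's KV cache.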
nunofgs@reddit
Which coding agent are you using and how are you connecting it to qwen locally?
Zestyclose839@reddit (OP)
Using Pi as the coding harness and LM Studio for inference. Importantly, you need to keep Qwen 3.6's thinking blocks in the context window (they normally get cut out between messages). The Jinja template needs to contain `{%- set preserve_thinking = true %}`. Also, in Pi's settings.json, you need to tell it to preserve the thinking blocks by adding this for your given provider:
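For anyone wondering what that flag changes mechanically: chat templates usually strip the model's earlier `<think>…</think>` spans when re-rendering the history for the next turn. Here's a toy illustration in plain Python -- not Pi's or LM Studio's actual template code, which does this in Jinja:

```python
import re

# Matches a complete reasoning span plus trailing whitespace.
THINK_RX = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def render_history(messages, preserve_thinking=False):
    """Illustrative only: mimic what a chat template's preserve_thinking
    switch changes when the history is re-sent to the model."""
    out = []
    for m in messages:
        text = m["content"]
        if m["role"] == "assistant" and not preserve_thinking:
            text = THINK_RX.sub("", text)  # drop old reasoning spans
        out.append(f"{m['role']}: {text}")
    return "\n".join(out)
```

With the flag off, the model loses its own prior reasoning on every turn, which is exactly what the setting above is meant to prevent.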
lemon07r@reddit
Sonnet 4.6 is junk so not a high bar to clear tbh.
meissullo@reddit
Not junk, but it got nerfed really hard in the last 2 weeks. I noticed smaller context, it's very prone to forgetting things, and just plain hallucination. It did not behave like this a month ago.
lemon07r@reddit
It's always been bad, whether or not that's because they've nerfed it. And I won't judge a model at its peak if the provider has shown a propensity to nerf it whenever they feel like it just to save money, because you know they will do it again.

People are only just now starting to notice these models are shit, and it's only because Anthropic themselves have finally come out and publicly confirmed that they gimp these models depending on demand. Until then, people calling it out and saying it felt dumb were constantly getting gaslit and told things like "skill issue."

I've A/B tested these models over hundreds of millions of tokens, and if you don't know what to look for, they are extremely convincing and will confidently tell you the wrong information. I've never seen models that hallucinate and try to gaslight the user as badly as the Gemini 3/3.1 Pro models, Opus 4.7, and Sonnet 4.6. Opus 4.5 was probably the last good model, and Opus 4.6 was good too at the start. GPT 5.4 has been pretty decent as well.

The worst part is that these paper-tiger models still score highly in evals, so it's impossible to tell until you actually try to use them for work and know exactly what to look for.
Long_comment_san@reddit
Qwen 3.6 MOE was doing some crazy structuring in thinking for my roleplay. I was kinda baffled and thought that it was my system prompt. I switched it off and it just kept doing it.
I was forced to make a system prompt to force it not think so hard (so it would be faster).
I had to nerf its thinking structure.
Rude_Marzipan6107@reddit
Yes but did you smell ozone? 😂
Long_comment_san@reddit
A lingering smell of ozone, yeah
Zestyclose839@reddit (OP)
That's just MoEs in general. The price you pay for lower active-param density is longer thinking. The 27b thinks much more concisely.
Caffdy@reddit
what was your prompt in this case?
CalligrapherFar7833@reddit
This is a very generalized plan with no actual implementation and verification steps. Try it with a more detailed plan that specifies what code to execute and where, and then you will see Qwen falling short.
Dany0@reddit
I've been using Qwen 3.5 27b to plan and Opus to implement since it came out. People thought I was joking, but this is exactly why I've been using it. It's a good generalized steering device for MoEs, which are famously unsteerable and easily poisoned by bad tokens (and Opus is almost certainly an MoE).
somerussianbear@reddit
What a day, Qwen architect and Opus blue collar worker.
Dany0@reddit
The more senior you get, the more you appreciate old/simple, boring but reliable tooling
Dense models are inherently more predictable. It's nice knowing that the industry has changed a lot, but Kolmogorov complexity stays.
somerussianbear@reddit
Agree 100%, just found it funny because people usually do it the other way around.
Dany0@reddit
If anyone wants the recipe, it's simple. You can try compacting often, but poison tokens can and will survive. You can either edit history if your tooling allows it, or just do the hard work of rolling back any time the MoE gets into a bad state.
A simple way of thinking about it is: if your convo history looks like training data, you're good to go
It's a good approach because it's almost a tautology and just goes back to ML fundamentals: most ML algorithms assume the answer must be part of training distribution
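The rollback recipe above can be sketched as a toy checkpoint/restore loop. Everything here is hypothetical -- it's not any harness's actual API, just the shape of the idea:

```python
# Toy sketch of the rollback recipe: checkpoint the conversation while it
# still "looks like training data", and restore the last good state
# instead of trying to talk the model out of a poisoned context.

class Conversation:
    def __init__(self):
        self.messages = []     # full chat history sent to the model
        self.checkpoints = []  # indices of known-good states

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})

    def checkpoint(self):
        # Call before each risky step (big refactor, ambiguous request).
        self.checkpoints.append(len(self.messages))

    def rollback(self):
        # Drop everything after the last good checkpoint.
        good = self.checkpoints.pop() if self.checkpoints else 0
        self.messages = self.messages[:good]

convo = Conversation()
convo.add("user", "Implement the feature as planned.")
convo.checkpoint()
convo.add("assistant", "…some derailed, poisoned response…")
convo.rollback()
print(len(convo.messages))  # 1 -- back to the known-good state
```

Editing history (when the tooling allows it) is the surgical version of the same move; rollback is the blunt but reliable one.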
Dany0@reddit
On the contrary, if your convo history is full of "Claude what the fuck did you do", then you're just a pleb vibe coder and not an engineer.
Zestyclose839@reddit (OP)
Trying it now and will report back with how it goes. Currently using the "grill me" skill so it asks like 15 questions before actually building anything.
ProfessionalSpend589@reddit
I’m playing with opencode to start a project with three files in Org format:
- spec.org
- implementation.org
- todo.org
The last one is used to track progress as the LLM is following the instructions in implementation for the specification.
I do get a simple service running quickly, along with some history and TODO lists to check.
Zestyclose839@reddit (OP)
Is Qwen working well for you in opencode? I was using opencode with local models for the past year, but the prefill times were driving me mad (it took over a minute just to process the system message and tool definitions on the first prompt). I switched to Pi since its system messages and tools are so much lighter.
But maybe they've better optimized Qwen 3.6 for the opencode harness -- is it getting tool calls correct?
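The minute-long prefill is consistent with simple arithmetic. Both numbers below are illustrative assumptions, not measurements from either harness:

```python
# Prefill time scales linearly with prompt size, so a heavy harness
# costs a local model dearly before it generates a single token.
system_tokens = 20_000  # hypothetical size of sys prompt + tool definitions
prefill_tps = 300       # hypothetical prompt-processing speed, tokens/sec

prefill_seconds = system_tokens / prefill_tps
print(round(prefill_seconds))  # 67
```

Halving the harness's system prompt halves that wait, which is the whole appeal of a lightweight harness for local models.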
ProfessionalSpend589@reddit
Qwen 3.5 397B UD-Q4_K_XL works well. Mine is running distributed, so it’s a bit slow. With the context size set to 200k tokens, I’m waiting up to 45-60 minutes.
Honestly, the LLM writes code faster than I can think of features for my personal product. But it needs a bit of hand-holding and pointing when fixing bugs.
NNN_Throwaway2@reddit
Are you sure they're real issues, though? I've done similar things and like 7/10 of the "issues" it found were not real.
Zestyclose839@reddit (OP)
100% real issues. In fact, I had a separate Obsidian note open for brainstorming, and a lot of what Qwen suggested were issues I'd already started to note down but hadn't figured out a clear way to define.
If I'm asking it to look for bugs, it does like to make things up there. But that's an issue with every model tbh -- Opus, Kimi, GLM; they're all eager to impress you with their bug-finding ability. The point is that the local model performs as well as the cloud models.