Switching from Opus 4.7 to Qwen-35B-A3B

[-]

Confident_Ideal_5385@reddit

I tried vibe coding on the weekend (as someone who tends to the "artisinally produced systems code" side of the clanker coding divide, fwiw) because i didn't want to context switch from rust to TS to write a throwaway web UI. I'd assumed I'd give qwen 3.6 a shot, watch it totally fail, try the 27B, and eventually ask my wife to use her claude sub to do the thing for me.

I was .. surprised.

qwen 3.6 a3b managed to get the UI right straight away, plumb the websocket connection using io-ts codecs generated from the rust models that i put a copy of in a markdown file for it, and wrote more unit tests than god. Ended up with a coverage of like 80% (etc etc coverage isn't a real metric i know).
qwen 3.5 27B did even better, writing a mock websocket server and a bunch of integration tests for the thing as well.

I tried the a3b out on a simple rust prompt to see if it was any good - "using tokio and axum, write an api with a login endpoint that returns a token, and an authenticated logout endpoint. Store user data in sqlite and use argon2 password hashing.". Results were middling - the code it came back with would block the async thread when calling the db, and it couldn't figure out argon2 so it swapped that crate out for bcrypt without asking me.

So that's not so good, altho tbh i have no idea how opus goes with rust.

But if you're vibe coding TS/JS or python you're gonna have a good time. The 27b still beats the a3b though from what i saw.

Grain of salt, YMMV, etc.

[-]

Flinchie76@reddit

I'm doing exactly that.

The thing is, Opus will do your thinking for you, but that comes with downsides: you'll end up generating 50k lines of code in a 4-day streak and understand very little of it, so you'll end up spending another week asking Opus to explain how everything works while you try to absorb the architecture, and then you'll find all sorts of nasty shortcuts, breaking encapsulation, mocked out stuff in what should be functional tests, etc.

Having a less capable model which can execute well, means you stay on top of what is being built. You think, it executes and you just keep tight control over the direction by inspecting the diffs. These little models are so fast that you can iterate very quickly.

In the end, you're much more likely to own and understand what you've created.

[-]

sonicnerd14@reddit

This is pretty much how we should always be doing it even with the frontier API's. The bigger models do the thinking and planning. The smaller models like qwen 35b simply does the execution of the plan.

[-]

loady@reddit

man I get so excited when Opus is fanning out my project to dizzying heights and many times I get stuck with a mess of a project that ate a day or week for nothing

[-]

New-Implement-5979@reddit

Yes agreed

[-]

ANTIVNTIANTI@reddit

this is. the way

[-]

wasnt_in_the_hot_tub@reddit

You think, it executes

That's how I try to keep it too

[-]

qwen_next_gguf_when@reddit

You will be disappointed

[-]

Excellent_Koala769@reddit (OP)

Right now I have a tiered system. Opus is the architect and visionary, then builds a spec for Codex to wirte. Codex has the autonomy to dig deep and reason about the repository, but for the most part follows whatever spec Opus wrote. I really rely on Opus to dig into my repo and create ideas for new implementations.

If I replaced both models for Qwen-35B-A3B, would it be a huge downgrade, or would Qwen still get the job done well?

[-]

indicava@reddit

I don’t use Claude much but gpt-5.4-codex (especially on xhigh) is a beast, and there is absolutely no open weights model that gets close.

[-]

Kamimashita@reddit

I found that Codex tends to over engineer things and make things more complicated than they should be. Its really good at scrutinizing other plans and doing deep dives though.

[-]

AliMas055@reddit

I am new to this and am just discovering codex's tendency to over engineer. The thing is rewriting the entire methods instead of calling super()

[-]

GreenGreasyGreasels@reddit

Half my agents.md files is instructions for gpt/codex to calm the fuck down and not over engineer. Removing feature with a hard cutover means to cut it immediately root and branch - no extra checks, fallbacks, no shims, no catches, no silent redirections , no soak periods before removal, no text monologues is comments about why and what is was - list is endless.

[-]

nomorebuttsplz@reddit

have you tried glm 5.1?

[-]

sir_turlock@reddit

For your reference Opus is Anthropic’s flagship model. Meanwhile the open-weight model in question has a total of 35B params.
Other open-weight models have:

MiniMax-M2.7 -> 229B-A10B
Largest Qwen3.5 -> 397B-A17B
GLM.51 -> 754B(744B?)-A40B
Kimi K2.5 -> 1T-A32B

GLM-5.1 and Kimi K2.5 are current SOTA in the open-weight space. Qwen3.5 is close and also has smaller versions alongside minimax for simpler tasks.
For example Qwen3.5-122B-10B can use web search well enough for my uses case to understand the results and properly extract information without hallucination when it needs to read a web page converted to markdown. Qwen3.6-35B-A3B fucks it up regularly.

In other words if we look at GLM-5.1 it has 13 times the active params and 21 times the total params of the small Qwen3.6. So there is no way in hell that it can replace Opus.
Although, to be fair, it entirely depends on what you store in your repo and the complexity of the tasks. If Opus was a severe overkill then it might get usable results, but I wouldn't expect anything coherent.

[-]

Caffdy@reddit

For example Qwen3.5-122B-10B can use web search well enough for my uses case to understand the results and properly extract information without hallucination when it needs to read a web page converted to markdown

how do you make qwen use web search?

[-]

GreenGreasyGreasels@reddit

I really rely on Opus to dig into my repo and create ideas for new implementations.

I suggest giving Gemini 3.1 Pro a shot at this. Superceeds Opus at this. Gemini throws up ideas, Opus triages and refines them and then GPT to hard check on Opus work and rationalization against the code base and generate a detailed implimentation plan.

PS : Don't let Gemini pro touch your codebase, let it read and report only.

[-]

epicycle@reddit

I use my $20/mo Opus to make prds and plans and then Qwen 122B or 35B to implement. Works great. I haven’t gotten Qwen 3.6 to not misbehave for me since it came out but once I figure it out till probably be my new driver for implementation. You can run claude code and opencode using your local models as well. I don’t know if claude will work that way forever but it works now.

[-]

EuphoricPenguin22@reddit

It's funny you mentioned that because I thought I was super clever for using Opus to write implementation plans and using smaller models to do the actual legwork. I sort of figured other people would catch on to that, especially since Opus is ludicrously expensive.

[-]

Caffdy@reddit

I thought I was super clever for using Opus to write implementation plans and using smaller models to do the actual legwork

even Gemini advises that approach all the time, heck, if you check your session after using the cli, most of the grunt work it's done by the lesser models by default

[-]

EuphoricPenguin22@reddit

Not sure what agent you're using, but Cline just uses whatever model you have selected for everything. I like it for that, though, as it's super predictable and is designed around the user manually deciding what model to use. It works super well with local models for that reason, and it's open source. I just use the OpenRouter Chat client to do a 1-3 API calls to Opus and then use a smaller model after that. It's more expensive to load a model like Opus or Gemini Pro through a full agent as it has to load a bunch of preprompt junk that is superfluous for writing a plan. If you're on a subscription, that's different, but I prefer pay-as-you-go as this new Qwen model seems to be my new gruntwork model and I only really need a frontier model for initial planning.

[-]

siegevjorn@reddit

This sounds like a very smart way to leverage both frontier and open source. Do you use claude code for open models, or use other (e.g. open code)?

[-]

epicycle@reddit

I’ve been using both. I want to use opencode more but it’s rough around the edges for sure.

[-]

siegevjorn@reddit

I see, so I guess open code isn't the best option ans claude code often works better... I wonder how Does claude code work for open models. Do you just set up llama server & give claude code api end point, or do you need extra harness for open models on claude code? I'd love to try your setup.

[-]

rsatrioadi@reddit

just set up llama server & give claude code api end point

Not who you’re asking to, but pretty much this.

[-]

higglesworth@reddit

I’ve been thinking of doing exactly the same using Claude frontier models as architect/ta/spec generator, and then using Hermes agent with local qwen3.6 to do the tasks that Claude speced out.

[-]

Quick-Penalty4883@reddit

As a somewhat of a novice witg agentic development, how do you create persisted plans that another model can use?

Do you have smy good leaning resources?

[-]

JamesEvoAI@reddit

Tell it to write it to one or more markdown files and then hand those over to the other model. If you're looking for educational resources you actually want to look into traditional software project management, like behavioral driven development and how to write a good PRD. We've all been promoted/demoted (depending on your background) to project management.

[-]

EuphoricPenguin22@reddit

If you ask for a PRD or implementation plan, you can save it as a markdown/text file and then have the smaller model read it before each session. If you need to change something down the line, you can ask the smaller model to edit the document to reflect those changes.

[-]

edsonmedina@reddit

> I haven’t gotten Qwen 3.6 to not misbehave

What temperature are you using?

[-]

droning-on@reddit

Are you running both qwen's locally?

What hardware?

[-]

autoencoder@reddit

misbehave

What does it do that you don't want?

[-]

eleqtriq@reddit

This will work pretty well as long as the spec has enough detail. It’s the same pattern I use.

[-]

notlongnot@reddit

Try it on a new project. Speed and code quality and refactor will give you first hand info on what’s what.

[-]

9gxa05s8fa8sh@reddit

if you build a good enough plan, and provide good enough documentation, a dumb agent can succeed at a programming task, yes

[-]

drraug@reddit

What is your role then?

[-]

Excellent_Koala769@reddit (OP)

I lead the projects and steer the ship.

[-]

arcanemachined@reddit

would it be a huge downgrade

Yes.

[-]

stormy1one@reddit

Wild that you have Codex to the implementation. I tried the same and got absolute garbage and lies from Codex. I now have Codex only do QA, and then vet the output through Opus before sending off to Qwen for iteration. Works well enough on Typescript/Python base of around 15k lines

[-]

zYKwn@reddit

I would still use Opus for this same job it already does in your setup.

Then you can feed Qwen to make the same job as Codex. Its not in the level of codex, but its highly competent when you take the architect/planner job from it and use it as a coding agent foloowing strict instructions

[-]

IrisColt@reddit

heh

[-]

Agile-Orderer@reddit

Your handle 👏 Yes! When 3.6? And when MLX?

[-]

Fresh-Resolution182@reddit

on M5 Max 128gb the 35B A3B is leaving compute on the table. at minimum try the 122B before deciding — the A3B quant is optimized for memory-constrained setups, not yours

[-]

Korici@reddit

Instead of the Qwen3.6 35B-A3B, I would recommend trying a 3-bit UNSLOTH Quant of Minimax M2.7 that released recently.

[-]

Euphoric_Emotion5397@reddit

do you have time to create your own agentic loop and connect up all the different connectors that Claude has already done like Excel, Internet Search, coding sandbox, and all the other stuff that make Claude or any frontier models online good. The one online is actually an app that has an LLM as an orchestrator to all the tools and spawning its own subagents.

[-]

sn2006gy@reddit

i’ve been building this and it’s a lot of work for sure :) qwen does great with a strong governor - which just leads me to believe a lot of what people love about opus is really in its api layer and not just in the model as people seem to think

[-]

Euphoric_Emotion5397@reddit

Yup. Anthropic does a great job documenting everything from mcp connectors to orchestrator agent architecture and even the prompts. You can literally copy the whole design and adapt to local use.

[-]

Hodler-mane@reddit

5 trillion active parameters vs 3 billion. just no.

[-]

datbackup@reddit

The idea that opus is 5 trillion active is completely implausible

5 trillion total? Maybe… but active is likely going to be somewhere in the billions. 30, 50, a hundred… 5T active even on the fastest hardware would be ridiculously slow… would destroy their business model even more than it already is

[-]

sn2006gy@reddit

that many total params would be overfitting and useless for development - MoE for sure with strong router and solid experts as needed is best for coder models.

[-]

Needacupoficedtea@reddit

Honestly I think replacing Codex with Qwen makes more sense than replacing Opus with Qwen.

Based on your setup, Opus is doing the expensive but high-value part: understanding the repo, deciding what should be built, and writing the spec. That’s the exact role where a downgrade hurts the most. Qwen might do fine on implementation if you give it tight instructions, but I wouldn’t trust it to be both architect and implementer unless you’re okay with worse decisions on messy, ambiguous stuff.

So yeah, I’d keep Opus as brain, use Qwen as hands.

That seems like the sweet spot.

[-]

Excellent_Koala769@reddit (OP)

thank you, i agree.

[-]

mjuevos@reddit

if your previous experience is say a 9 with opus/codex >> it will now be 6 or 6.5 with qwen3.6. lots more babysitting, smaller sprints, iterating, etc.. atleast thats my experience

[-]

ResearcherFantastic7@reddit

Sure anything is possible. You're just replacing an engineer with a 6 grade kid.

Can it react to the same command - yes, but can it deliver the same result... I guess you'll have to try it out

[-]

-dysangel-@reddit

Hey guys, there's this thing that I only I can really decide for myself, and that has a very low barrier to entry to find out the answer - could you just tell me the answer instead?

[-]

books-r-good@reddit

Just scrolling by without comment is always an option if you don't have anything to add.

OP asked what others' experience has been. They can't decide what other people's experience was...

What is the harm in asking, in being social on social media?

[-]

ComfyUser48@reddit

I downgraded from x5 max regular pro plan. I am not replacing it, I am using it besides it for certain tasks. It can't replace it completely as of today.

[-]

Techngro@reddit

I'm thinking of doing this as well. Have the $20 Claude and ChatGPT plans for designing and prompting, use local for implementation.

Now all I need is a 3090 that isn't $1200. My 4080 Super +64GB won't cut it to run these Qwen 3.6 models.

[-]

Competitive-Job-1431@reddit

Would qwen3.6 run on an AMD 7900xtx?

[-]

Soarin123@reddit

I would say it could depending on your standards, I am running mine on a 7800 XT with Q3, and this is with a Gnome desktop and applications running with full GPU offloading + 64K context.

I reduced from 41/41 layers offloaded to 35, and I haven't noticed a massive speed decreased but it gave me enough VRAM headroom to not have my desktop freeze often. Your memory bandwidth being >300GB/s faster than my 7800 XT and having 24GB VRAM with better INT4/INT8 acceleration leads me to believe it would be a good experience.

I get typically around 40-50 tp/s with my setup for reference.

[-]

ComfyUser48@reddit

As of today, I am mainly using qwen3.6 for code reviews. It's really good at it.

[-]

National_Cod9546@reddit

What Opus can do in an hour will take Qwen a day. And over that day, you'll need to guide it a lot. If you have the time and patience for that, it's fine.

[-]

k0zakinio@reddit

My experience, which is obviously personal so take with a grain of salt. I currently have the 5x max, and will certainly be downgrading/cancelling next month because this changes how I work quite a bit.

Most of my coding is plumbing, 95% of it is not particularly interesting (add a new endpoint for this, wire this service up and add a new dependency), I don't need the smartest model in the world most of the time, but I think I've just become accustomed to the tooling of CC that I have just used it for everything.

Qwen 3.6 on my 2x3090's is running at Q6 @ ~120 t/s, with full context and prompt caching. It is blazing fast. I love seeing stuff happening at lightning speed.. it's really hard to go back to Opus after that.. no more alt tabbing for 20 minutes.

Now I get Opus/other big model to do the plan, and then feed that plan into qwen to implement, but qwen is also great at planning/exploring when you have a quick question you need answering. It is certainly not as smart as Opus, i.e. it doesn't know all the niche frameworks and syntax as well as Opus off the bat, so needs either an example or some hand holding or me having to build some skills to assist with some of the gotchas. But it gets stuff done, the results so far have looked good and I have only see it get stuck in a loop once at high context but it managed to dig itself out of it. I'm not building fully autonomous teams, I am generally just sat with 1/2 terminal windows open which is whizzing away at one task at a time, and for that, it is great.

I think these sorts of models will be great for developers, as you can build in your 'style' and add knowledge in context relatively easily. Excited to see how this changes things, as for me this is a Deepseek moment for locally runnable LLM's.

[-]

Downtown-Pear-6509@reddit

what coding tool do you use for 3.6?

[-]

k0zakinio@reddit

I've mainly been using opencode, but it also worked well via hermes agent, and claude code

[-]

CountlessFlies@reddit

Could you please share some details about the Claude code setup? How do you make CC work with an OpenAI compatible API? And what about the preserve_thinking flag to send back full thinking context with each call. I don’t suppose CC does that already?

[-]

Blues520@reddit

Are you using llamacpp and do you mind sharing your config?

I've used the Q6 quant on 2x 3090's and it's been mid on my side so maybe the config needs adjustment.

[-]

ImportantFollowing67@reddit

You can't do that. Two different things Not explaining or reading. Just.. Consider giving your Opus a tool... That uses Qwen ....

[-]

cmndr_spanky@reddit

It’s not even close friend. Like comparing Albert Einstein to Homer Simpson

[-]

Excellent_Koala769@reddit (OP)

haha.

it was enlightening reading the comments on this post. learned a lot.

[-]

dreamai87@reddit

I canceled my Claude subscriptions. I have this qwen 3.6 as my daily driver explorations and building small projects and fitting bugs in larger one whereas keeping still codex if need something quick and complex to handle on large repo. Qwen 3.6 is the best for tool calling and works amazing with vibe qwen code cli

[-]

neonwatch@reddit

What hardware do you run it on?

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Potential-Wrap-303@reddit

Consider Gemma4-31b as an alternative. MoE is faster, but you need consistency for daily work. And for coding alone, use at least a qwen-coder version.

[-]

tednoob@reddit

It gets stuck in silly loops where it tries to add more and more code to solve a problem and it does not have enough tools out of the box to do well. I think the agent's ability to search for external information is vital. I would say that that it can work independently on some tasks, but not all.

[-]

Longjumping_Virus_96@reddit

local models are not on Opus 4.7 level yet, but the gap is closing

[-]

Ok-Bill3318@reddit

Local models are nowhere close to sonnet in my experience never mind opus

[-]

TapAggressive9530@reddit

I’ve seen the published numbers on the gap closing - but I’m unable to find any open weight model that comes close to opus or gpt . Not even in the same ball park. Maybe in a few years current open weight model might get close . I’ve done side by side comparisons and I can’t use any open weight for real production grade software . For me open weights are fine to play around with - make a few demos , etc. but nothing more . The good news about open weight LLM, you won’t lose your coding skills - because you’ll need them to fix what it’s produced. Or if you get really stuck , just take whatever has been created and pass to opus and say “ pls fix this garbage …) and presto , it will refactor and make it much better

[-]

Mirczenzo@reddit

Try glm 5.1 or m2.7

[-]

Longjumping_Virus_96@reddit

few years is really not realistic imo. a few months, and we'll get a model as good as a frontier model. maybe :).

[-]

traveddit@reddit

Yes it will suffice for you because you're not doing anything that requires Opus if you think this is a serious question.

[-]

Excellent_Koala769@reddit (OP)

Clever.

I do believe that I have tasks that require Opus, just wondering if I can get the same level out of Qwen since everyone has been raving about this new release.

[-]

Ok-Bill3318@reddit

Have you tried your prompts in sonnet or haiku or just go straight to opus?

[-]

SettingAgile9080@reddit

Qwen 3.6 is impressive in the way that you are impressed when your teenage child paints something - it's actually pretty good when a few years ago they were eating crayons, and it's cool that this happened in your house. But that painting ain't making it into a museum. Opus is Picasso, Qwen is your kid.

[-]

ambient_temp_xeno@reddit

Come back tomorrow, the shills don't work on Sundays.

[-]

Cute_Obligation2944@reddit

😅💀

[-]

svachalek@reddit

Dude. Opus is the world’s premiere coding agent, probably on the order of 100X the size of the qwen model. The little guy is impressive for its size but you’re replacing a bulldozer with a crowbar. It’s not even the best qwen.

[-]

medialoungeguy@reddit

No. Worlds apart.

[-]

Ok-Bill3318@reddit

Maybe drop to sonnet or haiku first, both likely better than a local 35b model.

[-]

VihmaVillu@reddit

why not 8b model while you're at it lol

[-]

Captain2Sea@reddit

Same spec. opus 4.7 + qwen3.6 is sweet spot.

[-]

SecondSnek@reddit

Why not use it as subagents for now with something like minimax 2.7?

[-]

FatheredPuma81@reddit

I'd like to say you'd be disappointed but if Opus is anything like Sonnet these days then... idk man.

[-]

grumd@reddit

M5 Max 128gb? You should run Qwen 3.5 122B

[-]

Excellent_Koala769@reddit (OP)

How does it compare to Qwen-35B-A3B?

[-]

mp3m4k3r@reddit

Just above the anchor this will land you on shows the differences in benchmark scores https://qwen.ai/blog?id=qwen3.5#performance

Personally id recommend going Qwen3.6-35B-A3B, it isn't the same as Opus, but its been great for me (as 3.5 was)

[-]

TheItalianDonkey@reddit

thing about this, is that it does tell you the difference between full models, not quantized ones.
While he can run a 35b a3b at 8q, he will not be able to run the 122b at 8q, probably need to go to 5 or 6q.

Which chips away a bit, basically landing in the same results coding wise.

Real world knowledge is probably better on the 122b, but that's not what he needs.

Thats the reason why i - with an ultra 128gb - i'm waiting for the 3.6 122b.

[-]

mp3m4k3r@reddit

Yep for sure, its always best to test out a few and see which works for your flow and needs. I've found that with q8 kv cache and 3.6 i can hit 256k context on 3.6 with an unsloth Q5 XL quant. It hits great (imo) processing throughput and has been fairly smart/competent in a lot of the coding tasks i do. Its also handy with documents and such. Though definitely lacks a lot of the internal trained information of the bigger models but I offset that with accessing the internet and searches if/when necessary, or usually actually run into things niche enough it wouldnt have been trained on them likely

[-]

TheItalianDonkey@reddit

i just wish that quantizers would do benchmark testing on the different quants following the benchmarks of the original model card, so that the quality can be more accurately measured than a 'very high quality', 'high quality', etc that we have now.

Because today i'm stuck on losing 3 days on new models for rebenching and understanding if the bigger model @ q5 is better than the smaller model @ q8...

[-]

mp3m4k3r@reddit

I'm guessing youre meaning more in depth than like this chart

Which goes over the deviation from original for the weights in an attempt to relay how 'close' they are to original weights which does show that they have a difference but doesn't really describe how itll impact user(s)?

From: https://unsloth.ai/docs/models/qwen3.6

[-]

Far-Low-4705@reddit

no id think 122b is better.

i think 3.5 is good enough with agentic tasks, but your just going to get much better nuanced understanding with 122b, and you dont need to be as direct or have as perfectly crafted prompts.

[-]

glad-k@reddit

It's better, way closer to what you can find on a cloud model

Qwen3.6 35B is the top of what you expect local models to do from 3 digits on your touching cloud models, not opus 4.6 tho

[-]

CreamPitiful4295@reddit

Not opus. But very good. And, Gemma4 is very good too! I think both get close to sonnet which is cool. Still leaning on Claude

[-]

glad-k@reddit

Gemma 4 is good for text Gen, I need good instruct which I kinda found disappointing

[-]

idnvotewaifucontent@reddit

I've found Gemma 4 to be absolutely miserable for coding / tool calling.

[-]

glad-k@reddit

Yeah...

I use qwen for that reason, donno if other models might compete but they seem on top on all of this

[-]

GrungeWerX@reddit

It’s still below 27b…or so Ive been told.

[-]

AVX_Instructor@reddit

attention capacity, you can afford to do more in one agent iteration

[-]

bigsybiggins@reddit

The prompt processing speed with be dogshit

[-]

edsonmedina@reddit

Define dogshit and realisticly sized context

[-]

Beautiful-Floor-5020@reddit

GLM 5.1 coding sub in all honesty kills it for me. Otherwise does things well. Manage your 200k context well and it absolutely hits. Opus...im just so disappointed in 4.7 it forgets so fast every 5 mins unless you bloat it with memory. Makes up facts. It sent me on a whole goose chase for made up SKU for DDR5 Ram and the same with other things. It assumed many things and I wont deal with it anymore to be honest.

[-]

jinnyjuice@reddit

Others already mentioned it, but 122B at 4bit quant would be your best bet for your hardware. You can dedicate about 100GB (~80%) of that machine to get max tokens (256K).

Other comments are saying that there will be a big difference, but don't really mention big difference in what. What differentiates proprietary/paid services to the open weights is the skills/tools/prompts/MCPs that come along with the models. They have full-time/over-time teams working on them, as well as connecting them to various services, as well as machine-learning benchmarked prompts. Intellectually, you won't notice their difference at all.

For example, if you can set up a system prompt that forces them to read relevant documentation as RAG before they start thinking, that's already 20% there.

So even if you had 512GB memory and run GLM 5.1 at 4bit quant (which according to benchmarks are very close to 4.6 Opus, even beating Gemini and ChatGPT), you still won't be able to get performance as good as Gemini/ChatGPT.

Other than that, the real bonus of running your own local AI is no limits. You can have your agents running 24/7 without a hiccup even when you're asleep. That's the biggest bonus for most people, I think. No downtime when Cloudflare/AWS/Google are down, no throttling at peak hours, no next-version transitions (like right now, people are complaining about 4.6 Opus because Anthropic allocated resources to 4.7), etc.

[-]

Thump604@reddit

You can run the 122b, but it’s not Opus 4.7 level. That may be fine.

[-]

Excellent_Koala769@reddit (OP)

How does 3.6 compare to 122b?

[-]

RealisticNothing653@reddit

Overall qwen3.6-35b slightly outperforms qwen3.5-122b. It's very close. There are some benchmarks which 122b performs better, but overall being smaller and faster, 3.6 has the edge

[-]

jinnyjuice@reddit

It's the other way around. 122B wins in more metrics than 3.6-35B, and some of them are by a big margin, whereas whenever 3.6-35B is better than 122B at something, it's by a small margin.

[-]

RealisticNothing653@reddit

yes, but overall Artificial Analysis ranks 3.6-35b higher that's been my impression too, albeit with different quantizations. Perhaps the big difference in hallucination is the reason https://artificialanalysis.ai/models/qwen3-5-122b-a10b?intelligence=artificial-analysis-intelligence-index&model-filters=open-source&models=qwen3-5-122b-a10b%2Cqwen3-6-35b-a3b&intelligence-comparison=intelligence-vs-output-speed#artificial-analysis-intelligence-index

[-]

uti24@reddit

Sounds about right. So there is no reason to chose 122B instead of 35B.

[-]

vex_humanssucks@reddit

The model migration calculus is real. At some point the math shifts from best-in-class to good-enough-local-and-free, and Qwen has gotten surprisingly good at making that trade feel painless. Would be curious how it handles your edge cases over a few weeks - that tends to be where the cracks show up.

[-]

cdshift@reddit

This for me. Im cheap and dont mind tinkering so qwen3.6 has been an absolute gem. I can fit 200k context into it with my setup and still get 60 tps.

I keep the iterations on my code small and to the point, and I build with large planning sessions.

Its nowhere close to opus in performance but you cant beat free, and private so I dont mind taking the extra cycles

[-]

Confusion_Senior@reddit

Opus 4.7 is basically a super human that solves your problems for you, you just delegate tasks

Local qwen allows you to use english but you still must be the coder

Very different use cases

[-]

KURD_1_STAN@reddit

I was skeptical of qwen3.6 but it seems to be much much better for coding compared to 3.5, but still u will be frustrated with it and might fall back to using at least free sonnet 4.6 many times, but as i said, it is a much better improvement than what I thought so wait for qwen3.6 27b, if they treat it like 35b then it should be very good for coding and might give u enough confidence to switch for most tasks apart from needing mythos to hack nasa and cia or whatever

[-]

helios_csgo@reddit

Im running Qwen 3.6 35B-A3B-Q5KXL on my local 5090 with native 256 context on llama.cpp - getting around 200 tok/s.

I have wired it upto Opencode and created an MCP for claude-code to use Opencode as a subject.

Now I run the full build workload from Claude code with Opus 4.7 on high, it hands off many tasks to opencode and then runs verification. Now i can code all day.

It comes close to 80-90% usage on my Clade max 5x subscription. Very much impressed by Qwen.

[-]

rebelSun25@reddit

No, please don't assume it's even close. I tested the unsloth 4 bit quant and 5 bit quant. It's good. Don't get me wrong. But then after using it to create a tiny library to call openrouter, I going few glaring omissions. So, the verdict is, it's perfectly fine for private, non production work. It's not reliable enough to give me working code, just yet. Maybe the 8 bit quant behaves better, no idea. Maybe we need to wait for something larger.

[-]

Far-Low-4705@reddit

in my anecdotal experience, for short contexts, i cant tell the difference at all between Q4 and Q8.

maybe you could tell a difference in the instruct mode? but idk.

If you really need better performance or that tiny bit of extra nuanced understanding, go for 27b. but even then it still has these quirks that pretty much all qwen models have and i dont think its a question of quality at that point.

[-]

pulse77@reddit

For precise coding use the best quant you can put on you machine. 8-bit quants are MUCH better than 4-bit quants for coding!

[-]

nunodonato@reddit

Maybe don't judge based on 4 bit quants... Just saying

[-]

Agile-Orderer@reddit

Qwen 3.5 35b is more akin for Sonnet 4.6 with thinking... maybe!

It’s a mixture of experts model, so while it has 35v total params it only activates 3b at a time. Which is super efficient and still has great output, but since you’re aiming for near Opus level you may want to use the 27b dense model instead.. even though it’s less total params, they all activate, which is more resource intensive but especially on an M5 at 128gb RAM, you have the headroom.. plus you can try the Opus 4.6 distilled version by Jackrong which is only available on the 27b (I think v2 is the latest, but check huggingface).

You won’t get anywhere close to Opus 4.7, you probably won’t even get Opus 4.6, but you’ll probably come close for most use cases especially if you have skills migrated from your Claude setup.

I’d say, still use real Opus 4.6/4.7 for top tier needs and then find the use cases where Qwen 27b with Opis Distill will work for you as your “local opus”. You can keep the 35b as your “local sonnet” and then maybe have a lower param version of Gemma 4 as your “local haiku” (or anything that is writing specific since it’s more western trained and aligned with Google. Better for creative text).

Full transparency, I’m on a 64gb RAM M1-Max, and personally I’m not using any of the 27b dense models because they’re painfully slow. Token per second is like 7 at best, where as the 35b MoE has a tps of over 50. RAM isn’t the bottle next for me and the M1-Max is phenomenal but I guess having a dense model fully run through with all params loaded a just pushing it too far. Your M5 should be better equipped to handle that level of throughput. If you can get like 15+ tps it might be usable.

Good luck 🤞

[-]

Holiday_Purpose_3166@reddit

Switching from Opus 4.7 to Qwen3.6* 35B-A3B will be a terrible experience.

Whilst Qwen is really good, especially tool-equipped, it will fall short on some edges where Opus can reach. You could adapt and workaround its limits, but won't feel the same - requires more hand-holding to keep that edge.

I've got a Codex sub which I've been barely using past couple weeks just because of personal experience with local models. SOTA cloud models make you lazy, but it's a good turn-key solution. Working local requires more brain to keep it sharp.

[-]

mr_Owner@reddit

When you add mcp for knowledge like context7 or other via docker mcp or any web mcp tool, it should be as good as it gets when your harness (cline, opencode, claude code etc) is also tuned (prompt engineering.. just very very good prompts for instruction following).

This applies to many, and with enough context size and patience, even a qwen3.5 9b gets very usable.

If your goal is to just vibe it out, with no understanding with what your doing, then stay with cloud frontier models untill.

[-]

vikarpa@reddit

I was thinking about that as well. Please report later how it is going - would be very interesting to know the outcome.

[-]

metigue@reddit

If you have 128gb you might get good results with a big quant MiniMax M2.7

[-]

RealisticNothing653@reddit

If you're one to rely on the model entirely and just want it to do literally everything, then it'll be a struggle without Opus.

If you are instead one who would massage it, and the code, yourself through the process, you'll be just fine. I've been using 3.5-122b on my spark and just recently 3.6-35b. My impression is they're pretty evenly matched, but qwen3.6 may not work with all agents at the moment until things are sorted out. With Qwen Code, 3.6 works great. With Mistral Vibe, 3.5 works great.

The spark handles concurrency pretty well, so the smaller size of 3.6-35b frees more memory for caching, and I can have plenty of subagents or parallel worktrees going.

[-]

CreamPitiful4295@reddit

I don’t know. Scared to use it on my code directly. We are talking 3.6 I assume. 3.5 was good. 3.6 is a serious step up. Last night I asked 3.6 to make a utility to convert png to svg pixel for pixel. That was the whole prompt. Flawless. Gemma4 is very nice too.

[-]

stormy1one@reddit

Why scared? Assuming everything in git and proper permissions what’s your main worry?

[-]

CreamPitiful4295@reddit

Yes it is. Just a drama queen. Been on Claude for 8 months and It’s gotten so used to me I just need to give it vague prompts and only recently has let me down. Though, I want an offline solution too

[-]

Borkato@reddit

It can do way more than these people claim but way less than you’re used to either opus.

It’s replaced about 95% of my calls.

[-]

davekilljoy@reddit

How’re you delineating between local calls bs Opus calls? Manual or automatic kinda thing?

[-]

Borkato@reddit

I just manually do it by going to the dir and opening my custom harness built in python for local and I use aider when my local models can’t figure something out

[-]

davekilljoy@reddit

Appreciate you

[-]

Safe-Buffalo-4408@reddit

I would suggest qwen 3.5 27B in this case. And you need to have a open mind 😊 it's capable if you accept the slower speeds.

[-]

SettingAgile9080@reddit

Running Qwen 3.6-35B-A3B and have a Claude Max sub. The Max sub isn't going anywhere.

What is new with this release is that Qwen 3.6 is actually useful for agentic coding (using OpenCode) and long-running tasks, and I can run it on my 20GB 4000 Ada at \~55 tok/sec which is almost enough to be useful.

That it is free (disregarding the cost of electricity+hardware... but we don't talk about that here) means I've been experimenting with it doing long-running QA runs, both against test plans and just futzing around like a simulated user, without worrying about usage limits or plans.

It takes a lot longer to get there than Opus does but who cares when it's free and just runs in the background. Reliable tool usage is a game-changer, as is the multi-modal ability where I can hook it up to chrome devtools mcp and just have it crawl around my web app's dev environment all day trying to break shit and it can analyze screenshots it takes.

Also using it for simple command-line stuff where it feels wasteful to burn paid tokens. It is good at that, and pretty fast.

You can probably run one of the larger models on M5 128GB. Or at least run this at crazy speeds. But it still feels like a "glimpse at the future" rather than the actual future here today.

[-]

pedronasser_@reddit

Do not switch completely. I highly recommend you proceed slowly through this process. You need to set up the harnessing correctly to achieve a good result with Qwen3.5 at the Opus 4.7 levels.

For coding, I am still asking Opus to plan the task, and the rest is handled by Qwen3.6. So it's basically Qwen3.6 advised by Opus 4.7.

[-]

Endurance_Beast@reddit

Use Qwen3.6-35B-A3b instead, you will be amazed.

[-]

truthputer@reddit

You have the hardware. The software is free. Why don’t you try it instead of asking strangers who don’t have the same priorities and expectations as you?

[-]

BinarySplit@reddit

I wouldn’t normally recommend Ollama over building llama.cpp

Recommend LM Studio instead! It's an easy-to-use interface over llama.cpp that includes a model browser, automatic settings for offloading, and a toggleable local API.

Ollama has such a long history of problems... You can never trust that it's using the right prompt template or good quants. Stuff will just silently not work well.

[-]

Potential-Leg-639@reddit

Qwen3.6-35B-A3B is nowhere near Opus, why do you think it can compete against it? It‘s really dumb compared to Opus or other Frontier Models.

[-]

Rich_Artist_8327@reddit

Why not, if there are some languages which Gemma4-31B is better than any large closed model, then why not smaller models could beat larger in some certain tasks?Coding is not rocket science

[-]

Unlucky-Message8866@reddit

i do mix and match all the time, qwen is excellent for doing the bulk work. for large stuff i do: opus PLAN.md -> qwen execute PLAN.md -> qwen fix all the type/lint errors -> opus figure out and fix remaining stuff

[-]

acecile@reddit

For people like me who have never used any local model, how do you do this ? Can you somehow integrate this model switching into Claude code?

[-]

stormy1one@reddit

I do this with two terminal window tabs. Claude Code running Opus on one, OpenCode with Qwen in the other. You could use the same devtool for both if you want. The point is to have the architect/orchestrator have some way to communicate (file system) but you are the human in the middle to ensure to execute to your liking. It’s easy enough to have Opus bully Qwen off the side rails into doing something you don’t want.

[-]

Unlucky-Message8866@reddit

i have not automated this process, i like to have control over what does what, i usually default to local and switch when underperforms. i use pi coding agent with a very custom setup https://github.com/knoopx/pi

[-]

jacek2023@reddit

this is bad place to ask questions like that, most people here strongly hate local LLMs they just want to use cheap Chinese cloud LLMs

[-]

Longjumping_Virus_96@reddit

Qwen 3.6 and Gemma 4 are really good, but not on Opus 4.7 level (yet)

[-]

AppealSame4367@reddit

qwen3.6 (very important difference, not 3.5) is quite smart. If you are a programmer and know what you want to work on a bunch of files: perfect.

If you wanna have a whirlwind go through your code, write 20 files at once and create whole apps and plugins: it's not enough.

[-]

sagiroth@reddit

People really need to do research. There us really high expectations

[-]

pj-frey@reddit

Depends a lot on the complexity. I use Qwen (3.6) 35B as the first shot, because it is fast. It often gives a solution. But you need GLM 5.1 and Opus 4.7 for the harder things.

Big problem: Qwen gives you an aswer, but it needs a lot of experience or good gut feeling, if this is nonsense or not.

Qwen3.6 as a substitue?. No. As a companion? Absolutely, yes.

[-]

nrauhauser@reddit

I'm on Claude Max and I just got Ollama cloud on a whim. I put GLM-5.1 to work using the second-opinion skill from Trail Of Bits, and it found one series bug in 55k lines of code and another dozen things that needed attention. Even running on the cloud, it's slow compared to Claude direct. Previously I did this experiment with RTX 4090 + pair of RTX 4060 using glmctxsmol (19GB on disk) and it wasn't impressive, but it could do testing well enough. I have unit/smoke/live test stuff in the repo.

Not what you asked, but nearby, and I'm commenting so I can find this in the future. My 16GB M1 machine is going to give way to *something* more potent, and I'm curious if 128GB Mac is doing to take some of the load, or if I need something more potent. The guy who lets me remote use the 4090 rig says he races another guy who has an M3 Ultra Mac Studio, and anything that fits in memory, the 4090 always wins ...

[-]

Caffdy@reddit

unit/smoke/live test stuff

what's that, if you don't mind me asking?

[-]

Ok-Internal9317@reddit

Tag me up or dm me about your experience if you actually did.

[-]

PvB-Dimaginar@reddit

Qwen can’t replace Opus or Sonnet for the heavy lifting. I still use Claude to prepare delegated tasks for Qwen. I have custom skills to guide and guard this. But even then approximately 1 or 2 tasks out of 10 are not implemented 100% correctly. Then I create a task for Qwen to fix it or let Claude do it.

[-]

Prudent-Ad4509@reddit

It generated a very reasonable plan for refactoring a large file. But I would be very wary of giving it large tasks. Small tasks with subagents - yeah, probably. I'm still on the fence about letting it to actually change the code, I still prefer 122B or gpt5.3-codex, depending on the complexity of the task.

Basically, expect it to require more handholding where your coding harness have already required plenty of handholding.

PS. And I've managed to make it loop by instructing it to run 10 rounds of first criticizing its design from a certain viewpoint, then suggesting and implementing the solution. This was not about the code, but about the certain engineering task. So, recursive self-reflection and autoiterative solution improvement has a pretty low ceiling with this model.

[-]

Excellent_Koala769@reddit (OP)

Damn.

[-]

vick2djax@reddit

There’s nothing local that touches Opus 4.7. Two completely different universes.