I'm running qwen3.6-35b-a3b with an 8-bit quant and 64k context through OpenCode on my MBP M5 Max 128GB and it's as good as Claude
Posted by Medical_Lengthiness6@reddit | LocalLLaMA | View on Reddit | 312 comments
Of course this is just a "trust me bro" post, but I've been testing various local models (a couple Gemma4s, qwen3 coder next, Nemotron) and I noticed the new qwen3.6 show up on LM Studio, so I hooked it up.
VERY impressed. It's super fast to respond, handles long research tasks with many tool calls (I had it investigate why R8 was breaking some serialization across an Android app), responses are on point. I think it will be my daily driver (prior was Kimi k2.5 via OpenCode zen).
FeelsGoodman, no more sending my codebase to rando providers and "trusting" them.
New_Slice_1580@reddit
How does it compare to Gemma4 a4b?
Because as per the name, this has 4B active parameters compared to Qwen a3b's 3B
RegularImportant3325@reddit
I’ve been very impressed as well. The speed is nuts and the quality, especially in research, is really great.
Going to up the context tonight.
ohthetrees@reddit
How long does prompt processing take for a 30-40k token prompt? I have a lesser machine, an M5 MacBook Air with 32GB of RAM, and I can run ~30B 4-bit models at quite acceptable speeds, around 20 tokens per second, but prompt processing takes forever. So it's not really usable when I hook it up to Claude Code, only good for short chats. This is running on LM Studio with mostly defaults because I don't really know what I'm doing.
cosmicnag@reddit
It's the best local model so far IMO. On a 5090, the friggin speed gives an overall experience unmatched by any cloud model. The speed is insane. Haven't even tried an NVFP4 yet lol.
Sort-Aromatic@reddit
How are you running it? vLLM? What's your setup config? I was getting memory issues when trying to set mine up last night.
cosmicnag@reddit
Running llama.cpp with dual GPUs, a 5090 + 4090; sorry, forgot to mention that in the original comment. I use the Q8_XL unsloth quant with a Q8 KV cache and the full 262k context. On only a 5090 I'd prolly go Q6 for both and drop the context a bit, to 200k or something.
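For reference, a launch command along these lines would match that description (the model path, layer count, and split mode are illustrative guesses, not the commenter's actual invocation):

```shell
# Hypothetical llama-server launch for a Q8_XL quant split across two GPUs.
# Only the quant, Q8 KV cache, and 262144 context come from the comment
# above; everything else is a guess.
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --ctx-size 262144 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --n-gpu-layers 99 \
  --split-mode layer
```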
mannydelrio1@reddit
What's your setup for the 5090?
Glittering-Call8746@reddit
Is there an nvfp4 quant?
DistanceSolar1449@reddit
Yes, but nvfp4 Qwen 3.6 is braindead.
Running Qwen 3.6 35b on a 5090 is fast enough already, just stick with a Q4 gguf.
FinBenton@reddit
Q5_XL fits into a 5090 with the full 256k context nicely, maybe even Q6.
duyleekun@reddit
I'm doing Q5_XL; is Q6 worth the bigger size? :)
kozer1986@reddit
Can you share your llama.cpp configuration?
FinBenton@reddit
That's part of my systemd service file that auto-launches the model.
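A minimal unit file for that arrangement might look like this (service name, paths, and flags are illustrative guesses, not the commenter's actual file):

```ini
# /etc/systemd/system/llama-qwen.service  (illustrative sketch)
[Unit]
Description=llama.cpp server for Qwen3.6-35B-A3B
After=network.target

[Service]
ExecStart=/usr/local/bin/llama-server -m /opt/models/Qwen3.6-35B-A3B-Q6_K.gguf --ctx-size 262144 --n-gpu-layers 99 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now llama-qwen` once the paths match your system.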
Glittering-Call8746@reddit
Maybe cos the static weights is fp4 ?
DistanceSolar1449@reddit
It's because nvfp4 quants quantize attention and the SSM down to 4-bit, whereas most gguf quants keep them at q6/q8/bf16. The SSM especially needs to be bf16. That kills model perf while only saving a few megabytes.
Glittering-Call8746@reddit
Yes, that's why; it needs to be hybrid. Ty for confirming this.
fullouterjoin@reddit
https://huggingface.co/models?sort=trending&search=qwen3.6+nvfp4
ranting80@reddit
Yes there's a few of them out now: https://huggingface.co/mmangkad/Qwen3.6-35B-A3B-NVFP4/
CapeChill@reddit
How are you running it on a 5090? What quant and context? Been wanting to try it on that but wasn’t sure it would fit at a reasonable quant/context.
Still-Wafer1384@reddit
It doesn't have to fit entirely; a sparse MoE model like this can easily offload some to CPU / system memory without too much of a performance penalty. Also, Q8 is probably overkill; something like Q6 is probably the sweet spot.
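With llama.cpp, that partial offload is a flag or two. A hedged sketch follows; the values are guesses, and `--n-cpu-moe` needs a reasonably recent build:

```shell
# Keep attention and shared weights on the GPU; push the expert (MoE)
# tensors of some layers to system RAM. Numbers are illustrative only.
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --n-cpu-moe 12
# Older builds use a tensor-override regex instead, e.g.:
#   -ot "blk\.(1[0-9]|2[0-9])\.ffn_.*_exps\.=CPU"
```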
snikurtv@reddit
Based on the benchmarks, Q5_K_M is the sweet spot. Can fit Q5_K_M with ~160k context on a single RTX 5090 without any extra tweaking. Need to drop the context down to enable vision though.
FinBenton@reddit
Im running Q6_K on 5090 with full 256k context on llama.cpp no problem.
dreamai87@reddit
Bro, you have enough VRAM to fit this model. If you have 32GB, great; even a laptop 5090 gives you 24GB of VRAM. You can easily run the UD-Q4_K_XL quant and get around 100 t/s.
CapeChill@reddit
Q4 is where you start to see drop-off even if it's the XL, and running Q5 or Q6 only leaves ~16k of context headroom. Gemma 4 seems pretty workable with Q6 and 100k+ context. That said, my Strix Halo box just started testing it: 25ish tok/s by the look of things on Vulkan + llama.cpp.
FortiTree@reddit
Which model are you getting 25 t/s with?
I also have a Strix Halo, with 96GB on Ubuntu 24.04.3, trying out various models and combos like LM Studio + Vulkan and Ollama + ROCm 7.2.1. So far Vulkan outperforms ROCm in raw speed: 55 t/s vs 42 t/s for Qwen3.6-35B-A3B Q6_K_XL at 32K context.
Qwen3.5-122B-A10 Q3_K_M is a lot slower at 12 t/s on Vulkan.
I'll try llama.cpp next to see how much of a jump it can get, but I'm figuring out a prompt system for it first so I can compare apples to apples.
AlwaysLateToThaParty@reddit
Genuinely curious how you found the quality at that level of quantisation?
FortiTree@reddit
I'm also curious, but I'm not benchmarking model quality yet. So far I'm just using a list of simple logic questions (the car wash test, string on a nail), listing complex but well-known knowledge like the networking stack and its principles, and asking it to correlate them all, just to get a feel for speed and quality. Those tests won't mean much since everyone's use case is different, and the "sweet spot" really depends on your own codebase, hardware size, and requirements.
But the raw hardware speed and context memory overhead are the same for any question/task, so that's what I'm trying to benchmark first.
My takeaway so far is that I want speed more than the highest quality, because for the super hard stuff I'd trust Opus 4.6 the most, and I can reserve my cloud subscription for those tasks. Then I offload anything Haiku or Sonnet could do to the local server for speed and cost. This requires me to break up my agentic tasks to interact with multiple systems, which is entirely possible.
My goal is to find a setup that makes the best use of the AMD shared VRAM: a lot slower than Nvidia, but more capacity at the lowest price.
Regarding your quant-quality question, I usually just ask Opus which variant is proven best for a given model and have it check the latest community feedback. But the real cap is your memory size, so I can't really go to Q4 without sacrificing speed, and I'm not even sure I can load it yet. At Q3 it's already too slow for my taste; 13 t/s is barely usable. So I haven't engaged with it as much as Sonnet and Opus.
I think at the end of the day everyone has to try things out for themselves and find their own sweet spot. Code quality for production is what matters in the end, so none of these benchmarks mean anything if the model can't do what you need it to do.
mycycle_@reddit
Forgive me, but what benchmarks are we using? I find it really hard to trust the standard benchmarks since so many models are trained on them, but for comparing quants maybe they're still useful. Regardless, very happy to learn about new resources!
AlwaysLateToThaParty@reddit
My work computer has 16GB of VRAM, and I haven't been happy with the performance of many models at that capacity, but I've been reluctant to use bigger models at smaller quantisation. My work environment is locked down, so no cloud anything, really. You're right about use cases; they're all different. Who knew? Thanks for the insight.
FortiTree@reddit
Ofc, I'm happy to share. I think 16GB is already below the minimum for usefulness; that's your most limiting factor.
I'd gun for 24-32GB of Nvidia VRAM + CPU + 32GB of RAM as a minimum, so you have enough GPU memory to run the entire set of active parameters on the card (10x faster than AMD shared memory) and offload the MoE expert weights to CPU memory. That gets you the best speed plus the most capable MoE model at that spec. Picking the exact model is less important: this is the consumer sweet spot all the model companies are targeting, so the models will keep improving for you, and you just try whichever one the community recommends.
I have a different setup for work that the company paid for, so my local setup is mostly for experimenting and learning. I'm capped by the AMD memory speed, but I get a bigger memory space as the trade-off, to try different model sizes, and it's a lot cheaper ($2-3k range vs $5-10k).
If your company has you stuck with a mediocre laptop, I'd say invest in your own local setup and bypass it. In this AI era you need to move fast to stay at the top of the food chain; we can't afford to be stuck with ancient hardware.
AlwaysLateToThaParty@reddit
I'm afraid you mistake me. At home I have an rtx 6000 pro. I'm really talking about my locked down work computer. Good info though. Thanks.
cosmicnag@reddit
Well actually I have a multi-GPU setup: a 5090 plus my older gaming 4090 (which I didn't sell cause AI happened lol). I use the best unsloth quant, Q8 XL, with the max context window in Q8. It's amazingly fast and really usable. It's not the latest Opus, but damn, definitely a watershed moment for local AI. The raw speed makes up for however much better SOTA is.
CapeChill@reddit
My old gpu is AMD :/ but that makes sense, multi gpu.
karimusben@reddit
It runs well on my 7900xtx
CapeChill@reddit
Are you talking about Nvidia + AMD multi gpu here or just qwen 3.6 q4?
karimusben@reddit
Qwen
Free-Combination-773@reddit
IQ4_S fits into a 7900 XTX with max context
john0201@reddit
The latency and general experience is just better. I bought the perplexity search api credits and I just straight up prefer it for many things.
Opus 4.7 is still much better for coding. I sometimes use the 122B which gets close but it’s just too slow on an M5 max. Once Apple releases an M5 ultra and there’s a qwen3.6 122B q6 to try I’ll give it another shot.
Smooth_Bus_3010@reddit
Sorry to burst your bubble: cheap GPUs will never happen. There are endless AI applications these are useful for beyond LLMs. Smaller companies like mine will always want more GPUs, no matter the power draw.
john0201@reddit
I don’t think you have a good understanding of how the semiconductor business works.
-p-e-w-@reddit
There are plenty of cheap GPUs available today: Any that don’t support BF16. The same thing will happen in the future with newer technologies. Once a GPU is missing a must-have feature, it’s worthless.
techdevjp@reddit
Which will be mighty interesting, except that setup draws a peak of 10,000W through 6 x 3300W PSUs. Plus the cooling requirements.
But, 1.1TB of H200 memory would be quite a trip.
john0201@reddit
That is exactly why old server gear is so cheap, power.
techdevjp@reddit
Yup, which was my point. $10k is easily affordable for a lot of people. Furnishing it with 10,000W of power and the required cooling is not.
xorgol@reddit
That definitely requires changing contracts, but 6kW is fairly standard around here, it doesn't seem entirely unfeasible.
techdevjp@reddit
It's certainly not standard where I live. 10G fiber is cheap and easy to get, but a 10kW power hookup would be nearly impossible for a house, and very expensive.
No_Communication7072@reddit
Especially because the police will ask you why you need so much power, and whether you're gonna grow weed
No-Budget2376@reddit
DGX Station will be available in 6 months and that will smoke anything; it's way faster and cheaper. Frontier models will be scaling up parameters by then, so the 1T models will be local and the 10-100T ones cloud
john0201@reddit
A B300 will not be faster than 8xH200. It will be more efficient though.
techdevjp@reddit
DGX Station has appeared for order already, just under $100k. I've no doubt it will be incredibly powerful and of course it has a truckload of memory. And it's actually designed to run under someone's desk. But if anything that helps make /u/john0201's point that the price of 8xH200 is going to fall, and fast.
darktotheknight@reddit
You can easily cap H200 at 400W. Even at the minimum cap of 200W they will still crush anything you've ever seen so far.
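The cap itself is one line of stock Nvidia tooling (device index assumed):

```shell
# Enable persistence mode, then cap GPU 0 at 400 W.
sudo nvidia-smi -i 0 -pm 1
sudo nvidia-smi -i 0 -pl 400
```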
That being said, these systems will stay unaffordable for quite some time. VRAM is king, and 141GB of HBM3e will stay relevant for quite some time. It would absolutely surprise me if a single H200 were listed on eBay at $10,000 within the next 3 years, let alone 8.
xrvz@reddit
You're using those words wrong. Correct is: within the next 8 years, let alone 3.
mesasone@reddit
Read the sentence again. They were talking about the number of cards available on eBay, not the time span.
Polite_Jello_377@reddit
No, you are wrong
techdevjp@reddit
As /u/No-Budget2376 pointed out, DGX Station will be arriving soon. Single device under your desk with 748GB of memory and performance far exceeding GB200.
https://www.reddit.com/r/nvidia/comments/1jh2exl/whats_the_update_from_gb200_to_gb300_sorry_if/
They're going to be $100k, but even at that price they will dramatically drive down the price of H200 systems.
Ok_Clue5241@reddit
I HOPE you’re right about the price of enterprise gpus, but come on…..
john0201@reddit
I have 2x5090s and can train GPT-2 in an hour. That model was SOTA when we were freaking out about covid. Wasn’t all that long ago. The servers they trained it on are now not quite doorstops but close.
fullouterjoin@reddit
Which mechanism are you using to use perplexity with opencode?
john0201@reddit
You just add it as a tool
AlwaysLateToThaParty@reddit
I would like this very much.
MongoWithBongoss@reddit
Hopefully, photonic computing will bring down GPU prices.
Likeatr3b@reddit
This is the thesis of my company
k0zakinio@reddit
This is what I think too: the tipping point has been reached where models capable enough for most day-to-day coding tasks can be run locally. You'd still want the 'smart' model to help with more zany things, but this massively reduces the value proposition of the big model companies.
I wonder whether there will be a DeepSeek moment in the markets once they get wind of this. There was about a week's delay after DeepSeek launched before the market actually reacted last year.
9gxa05s8fa8sh@reddit
concur, and we're 1 year away
john0201@reddit
I think it’s probably towards the end of the Rubin lifecycle, so 18-24 months. By then AMD, Intel, and the other secondary players like Google/Meta/Apple will have caught up.
Medical_Lengthiness6@reddit (OP)
That's what I noticed right away. Immediately responds with quality
NaabSimRacer@reddit
5090? does it fit on 1?
wen_mars@reddit
Q4 fits easily on 5090 with max context
weedebest@reddit
better than gemma4?
jacobpederson@reddit
Try this prompt please. Frustrated yet? Now boot up Gemma-4-26b-a4b and watch for a 1 minute one-shot :D
msaraiva@reddit
Can you please share the quant and settings you used with the 5090?
cosmicnag@reddit
Read the other comments first: I actually have a 5090 + 4090 setup, so I can run Q8XL with a Q8 KV cache at max context. With only a 5090, I suppose you can do the same with Q6 quants for the model. See other people's posts in this thread as well. Cheers.
nomorebuttsplz@reddit
by "local" you mean can fit a 3090?
cristoper@reddit
The Q4 quantized ggufs can fit and work well on a 3090
YairHairNow@reddit
Nvfp4 cut 5s off my prefill/first token. I'm on a 5080, getting 60 tk/s output through qwen code, and it's running my Discord swarm. Somehow I'm still able to fire off 30s generations in ComfyUI/zimage with the model loaded in my swarm. I'm kind of blown away. I'm using the gguf because I'm having fun with the uncensored version, but the MXFP4 I'm using is fast.
Best local model I've experienced yet.
lolwutdo@reddit
Are you fully offloaded on gpu? What quant?
neur0__@reddit
Running it with a 5090 and I get very good performance as well, can’t wait for the future
dilberx@reddit
Imo rtxs are a bad decision being energy hungry
Glittering-Call8746@reddit
So what's the alt ..
dilberx@reddit
Mac or strix halo system
Free-Combination-773@reddit
Both have slow prompt processing, though.
max123246@reddit
What quant are you using? Isn't nvfp4 slower than fp4 anyways? It's just better accuracy
Worldly-Plastic-2516@reddit
What are you running with on the 5090? Love to give it a spin
cosmicnag@reddit
Well actually I have a multi-GPU setup: a 5090 plus my older gaming 4090 (which I didn't sell cause AI happened lol). I use the best unsloth quant, Q8 XL, with the max context window in Q8 (llama.cpp, CachyOS). It's amazingly fast and really usable. I mostly use it with the Pi coding agent, and it's just freakin awesome.
Specter_Origin@reddit
It still struggles on complex issues and ends up looping for me. Definitely not as good as Claude Opus or Sonnet, but the best local model for sure. Pretty close to Minimax 2.7.
StardockEngineer@reddit
I’ve been using it for 6 hrs straight without a loop.
CircularSeasoning@reddit
You're the loop now. Human in the loop, haha!
cosmicnag@reddit
Lol
Varmez@reddit
What kind of settings are you using? What harness, etc.? I'm using 8-bit in oMLX with their recommended 0.6-temperature settings and am getting constant looping in Claude Code, qwen-code, and OpenCode.
epicycle@reddit
Mine seems to get stuck using oMLX as well. Did you try the settings u/StardockEngineer shared? Was gonna try tomorrow as I wanted to give 3.6 a go. Tried other things and failed. I hope it's not an oMLX bug.
Varmez@reddit
I was using the "Thinking mode for precise coding tasks" settings and getting the looping a lot. Trying the general-tasks ones now and it doesn't seem to happen. I also tried them with the temperature lowered to 0.8 and the looping started again right away.
StardockEngineer@reddit
llama.cpp and pi coding agent. Using unsloth's recommended settings per https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF:
dreamai87@reddit
Enable preserve-thinking mode. It makes the model way better.
Specter_Origin@reddit
Thanks, I already did and it sure helps, but it still gets into loops, especially when using it for agentic coding where there are chained todo items for it to work through!
Surprisingly, Gemma 4 doesn't have that issue, but it's definitely not as good at problem solving as 3.6.
dreamai87@reddit
Okay don’t know why the case, though I use this model without even with reasoning and for me I never felt any looping issue till 123k context size that I tested. You can see some examples I posted in my previous post. Building an app from research paper that holds a lot of information but still this model pulled off better than many
Weary_Long3409@reddit
Yes. It's on par with minimax 2.7 with better problem solving on agentic coding.
StardockEngineer@reddit
I just got a loop lol
H_DANILO@reddit
It used to loop at temp 0.6; I'm using temp 1.0 and it stopped looping altogether.
There's no such thing as a "complex" issue. In computer science, complexity refers to the asymptotic computational cost of functions.
Complicated is something else: something complicated hasn't been well defined and hasn't yet been broken down into smaller, manageable problems.
segmond@reddit
99% of the people in this world are not solving complex issues.
Specter_Origin@reddit
I agree! It is the best local model for chat etc.
dasNavy@reddit
Guys, what do you want to accomplish? I keep hearing "it's better, it's better, it's better", but what is actually your daily driver and the benefit? I'm tired, boss. Just let me live under the clouds on this planet and not care about AI and stuff.
Acceptable-Opinion56@reddit
M5 max vs 2 5090
florinandrei@reddit
lol
Maybe for very simple things. Give it more complex agentic tasks and you will see the difference.
That being said, for a 35b it's pretty good.
hyperdikmcdallas@reddit
Yeah, but technically, wouldn’t it figure it out eventually?
florinandrei@reddit
If that were true, then put any model in a Ralph loop and ask it to solve string theory.
Looping over partial solutions does yield some improvements. But if the problem is sufficiently hard, any model will get stuck, eventually.
hyperdikmcdallas@reddit
Or have it try to solve the three-body problem. I wonder if in 30 years they can use quantum AI to solve it.
Maleficent-Pea-3494@reddit
My MBP M5 Max 128GB is scheduled to deliver on Friday and I can't wait! I built a UI and am running parallel OCR (Chandra 2) and inference testing (Qwen 3.6) in Claude, hoping to run it all locally as soon as I get the machine. If it's as good as you're saying, I'll hop in the queue for an M5 Studio asap.
H_DANILO@reddit
You can easily go to 256k; context is VERY CHEAP on Qwen, and this model is REALLY good with context.
snowglowshow@reddit
Can you explain why it's any better than any other model at context? I've made a very strict habit out of starting new windows as often as possible, often 16k or less. If Qwen is special in this way I would love to know!
CircularSeasoning@reddit
For me, Qwen models have always respected my context in a certain way that's noticeable compared to other models. It's like it retains the inherent shape or quirkiness of my code better. Sometimes this is actually a pain when I want it to use more initiative in changing things up all at once to experiment with variations, but this is the double-edged sword.
wen_mars@reddit
Something about a "hybrid attention" architecture with "gated delta network attention" whatever that means.
H_DANILO@reddit
You can just try it for yourself. Myself, I'd been strict about 64k, same as you, always starting a new session; then recently I got comfortable at 128k.
This is the first time I went to 256k and I'm happy about it.
I've seen someone else posting here on Reddit that they ran a needle-in-a-haystack test on this qwen 3.6 35b a3b and it got a very high score. I can't remember exactly, but that prompted me to test 256k, and I'm very, very surprised.
Ofc it won't give the same attention to very old context as to new context, but tbh this is desired; as long as it can remember when I ask it to, I'm happy.
This actually gave me hope that one day we'll have 1M, perhaps who knows 2M context, where models can actually read the whole codebase and be much more succinct, rather than having to read chunks of the codebase and losing cohesion.
Snoo_48368@reddit
Thank you for this comment! I've been using qwen3.6-35B-A3B-Q5_K_M with a 128k context split across two slots on a 4080 Super (16GB VRAM) + 7950X3D (16 cores) + 96GB DDR5. I was getting ~45 tokens/sec and great results with OpenCode, but hitting compaction quickly, which caused tons of pain.
Just upped the overall context to 256k, so each parallel run gets 128k, and it's keeping the same speed!
I do notice some issues from time to time (loops, hallucinations), but it's blowing me away. I pointed it at an insanely complex codebase (think a custom flavor of bison parser definitions) and it not only managed to understand it, but worked quite well!!!
H_DANILO@reddit
Glad it helped. I stopped having loop problems when I set the temperature to 1.0; try that.
Artistic_Okra7288@reddit
Yea, I push mine to 1M context with a Q4_0 KV cache (probably could get away with Q8_0, but it's working well enough).
Octopotree@reddit
Does that work? The model has a context window of 256k
Artistic_Okra7288@reddit
You can scale it using the YaRN rope settings. llama.cpp has a bug where you have to pass a specific kv override as well, but yea, I've gotten it working.
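A hedged sketch of what that looks like on the llama-server command line; the scaling factors are assumptions, and the specific `--override-kv` workaround mentioned is deliberately not reproduced here:

```shell
# Hypothetical 1M-token context via YaRN scaling (4x over the native 256k),
# with a Q4_0 KV cache to keep memory in check. All values illustrative.
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --ctx-size 1048576 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 262144 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0
```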
bwjxjelsbd@reddit
Does it support compaction?
H_DANILO@reddit
It does
Medical_Lengthiness6@reddit (OP)
I just bumped it up and it seems like you're right
youngishgeezer@reddit
I'm running the Q8_0 version on my M5 Max 128GB MBP. It's amazingly fast and seems to do a good job with coding, though not quite at the level of what I'm getting from GPT5.4 in Codex. However, I just gave it 4 handwritten (in cursive) recipes snapped with my phone, and it extracted all 4 in under 20 seconds with basically perfect accuracy. I'm very impressed.
epicycle@reddit
What harness and settings are you using? I have the same machine and can't get a stable Qwen 3.6.
youngishgeezer@reddit
I'm running LM Studio with Qwen3.6-35B-A3B-Q8_0.gguf with the stock parameters except maxing out the context. The coding is done through codex with the same settings. I ran the image processing experiment using LM Studio's chat. I'm new to all this so I'm making no claims this is optimal.
What are you running?
epicycle@reddit
I run oMLX on an M5 Mac 128GB and use MLX models optimized for Apple silicon. I'm gonna give LM Studio a try, I think. I had tried them all and settled on oMLX.
spawncampinitiated@reddit
"as good as Claude"
I mean, yeah right.
You and I fucking wish
tarruda@reddit
It really depends on your use case. 3.6 is very good at agentic coding, can explore and understand codebases, edit files, etc.
Yea, it's not going to match frontier models at understanding very complex things, but I think most users don't have that need.
I've been using 3.6 since it was released and it became my favorite local model for agentic coding. Even though I can run larger models, the fact that 3.6 is good enough for most things + being super fast makes me prefer it.
If 3.6 122b follows the same improvement curve, there's a chance it could match sonnet for everything.
power97992@reddit
Dude, even 3.6 Plus probably can't do what Sonnet does. From my experience, 3.6 Plus is even worse than Gemini 3 Flash: it misunderstands directions sometimes and presumes stuff, and its output is way worse, so you have to correct it and reprompt it.
Loud_Key_3865@reddit
48 tps effective with a single 12GB 5070 Ti (15K context), running Qwen3.6-35B-A3B IQ2_M via llama.cpp:
model = Qwen3.6-35B-A3B-UD-IQ2_M.gguf
ctx = 15360
parallel = 2
n-gpu-layers = 99
fit = on
cache-type-k = q4_0
cache-type-v = q4_0
threads = 8
batch-size = 1024
ubatch-size = 256
reasoning = off
I'm having other LLMs, like Codex, Gemini & Claude, send tasks to a router. The router checks this model's history to see whether similar tasks have failed; if so, it routes back to the calling model for a high-scoring answer, and if the tests pass, it sends the result back to the local model so it learns the fix for that prompt. Then, when a similar previously-failing prompt comes in, the learned context is tried again and scored: low means send it back to a paid model in the future, high means the local LLM can handle it.
Then I have the paid models review and test.
Keeping things small (plugins, for example) helps reduce context needs and seems to work well with the small 15K context for small tasks.
I also have Opus/Codex/Gemini: one of them creates a plan, they all review it for consensus, and I take the best of all the results. Then one of them breaks everything into small tasks, with complete directions, so the tasks are clear, easy, and can be done in parallel by smaller, dumber models with the same quality.
Once that's done, I tell the model to implement the plan and hand tasks off to the local model for learning and load reduction.
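The score-based routing described above can be sketched roughly like this; every name, the similarity key, and the threshold are invented for illustration, not taken from the actual setup:

```python
# Illustrative sketch of a failure-aware task router between a local LLM
# and paid models. Names and thresholds are made up, not the real system.
from dataclasses import dataclass, field


@dataclass
class TaskRouter:
    threshold: float = 0.8                      # min score to trust local
    scores: dict = field(default_factory=dict)  # prompt signature -> score

    def signature(self, prompt: str) -> str:
        # Crude "similar task" key: the first few normalized tokens.
        return " ".join(prompt.lower().split()[:8])

    def route(self, prompt: str) -> str:
        # Send to the local model only if similar prompts scored well before.
        sig = self.signature(prompt)
        return "local" if self.scores.get(sig, 0.0) >= self.threshold else "paid"

    def record(self, prompt: str, score: float) -> None:
        # After tests run, remember the score so future similar prompts
        # can be handled (or avoided) locally.
        self.scores[self.signature(prompt)] = score


router = TaskRouter()
print(router.route("add a unit test for the parser"))  # no history -> paid
router.record("add a unit test for the parser", 0.95)
print(router.route("add a unit test for the parser"))  # high score -> local
```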
The 15K context, temperature, and all the other settings were arrived at through trial and error to get the highest-quality results at a minimum of 50-60 tps on my machine. Strangely, I discovered that higher context settings can actually degrade performance in some situations. This gave me the optimal balance: over 55 tps, and it even scored 100% on my home-grown coding tests, which I built to check the relevancy of my workflow.
### Benchmark Results (Coding Suite - 31 Tasks)
| Model / Config | Quality | Effective TPS | Total Time | Notes |
|----------------------------|--------:|--------------:|-----------:|--------------------------------------------|
| QWEN36-35B-IQ2M | 85.48 | 51.04 | 82.18s | Best overall balance in the leaderboard |
| QWEN36-35B-IQ2M-R2 | 83.87 | 49.50 | 83.37s | Slightly lower quality, still very fast |
| QWEN36-35B-IQ2M-NC13 | 85.48 | 46.64 | 100.20s | Same quality as top, but slower |
| QWEN36-35B-IQ2M-CTX24K | 82.26 | 44.45 | 105.48s | Older high-context run |
| QWEN36-35B-IQ3S | 82.26 | 38.96 | 110.74s | Lower quality and slower than IQ2M |
| QWEN36-35B-Q2KXL-CTX24K | 85.48 | 36.90 | 104.97s | Good quality, but significantly slower |
New_Slice_1580@reddit
Is this for a business or your personal stuff?
Loud_Key_3865@reddit
Laptop for business
New_Slice_1580@reddit
How up to date are Opus and the other commercial models?
As qwen3.6-35b-a3b says, its training data only goes up to 2025.
sleepy_quant@reddit
Same here. Running Qwen3.6-35B-A3B-8bit on M1 Max 64GB via MLX, and the speed + context handling is genuinely impressive. Love that we can finally keep our codebases local without sacrificing quality
New_Slice_1580@reddit
Is the MLX version bf16? If so, try gguf or MLX fp16 and see if it's faster (M1 and M2 don't handle bf16 natively).
sleepy_quant@reddit
Ran the benchmarks — you were spot on. All on M1 Max, 64GB, Q8 quant, same 65-token prompt + 200 token greedy generation:
| Path | tok/s | vs bf16 |
|------------------------|-------|---------|
| MLX Q8 (bf16 default) | 21.18 | 1.00x |
| MLX Q8 forced fp16 | 26.22 | +24% |
| GGUF Q8_0 (llama.cpp) | 34.08 | +61% |
Dtype probe confirmed mlx_lm was loading the non-quantized params (scales, norms, embeddings) as bf16 — 1245/1757 params were bf16 on M1 Max, i.e. the emulated path. Casting bf16 → fp16 after `load()` is a ~10-line patch. Already shipped as the new default on my side. llama.cpp on Metal wins another 30% though: genuinely better-tuned MoE kernels, and honestly a more mature stack than mlx_lm right now. Not switching immediately (priority queue + per-agent thinking-mode + memory hygiene layer are all MLX-coupled, and a rewrite is 1-2 weeks I'd rather spend on the actual product), but noted for the next major refactor. Thanks again — 24% from a 10-line patch is the best ROI I've had in a while.
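The cast-after-load patch described above is essentially a tree-map over the parameter tree. A generic, library-free sketch of the logic follows; dtypes are string stand-ins so it runs without mlx, and the only mlx APIs assumed are that in mlx_lm you would tree-map `model.parameters()` with `x.astype(mx.float16)` where `x.dtype == mx.bfloat16`, then call `model.update(...)`:

```python
# Generic sketch of the bf16 -> fp16 cast-after-load patch described above.
# Leaves are (dtype, shape) stand-ins for real arrays so the logic is
# runnable anywhere; quantized params (non-bf16) are left untouched.
def cast_bf16_to_fp16(tree):
    """Recursively cast every bf16 leaf in a nested parameter tree to fp16."""
    if isinstance(tree, dict):
        return {k: cast_bf16_to_fp16(v) for k, v in tree.items()}
    if isinstance(tree, list):
        return [cast_bf16_to_fp16(v) for v in tree]
    dtype, shape = tree
    return ("fp16", shape) if dtype == "bf16" else (dtype, shape)


params = {
    "embed_tokens": ("bf16", (151936, 2048)),  # non-quantized: gets cast
    "layers": [{
        "norm": ("bf16", (2048,)),             # non-quantized: gets cast
        "mlp": {"w1": ("q8", (2048, 5632))},   # quantized: untouched
    }],
}
print(cast_bf16_to_fp16(params)["embed_tokens"][0])  # -> fp16
```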
New_Slice_1580@reddit
Glad I could help 👍
sleepy_quant@reddit
Good catch, I hadn't checked. Running `Qwen3.6-35B-A3B-8bit` on M1 Max, so if MLX is defaulting to bf16, that's the emulated path. Will benchmark current vs fp16-forced MLX vs a GGUF build and reply with numbers. Thanks
ContextDNA@reddit
How many t/s? Is it worthwhile?? Thanks
sleepy_quant@reddit
~10 tok/s decode at Q8 on M1 Max 64GB, around 44-47GB total unified RAM. Q4 was 49-60 tok/s, but I swapped back to Q8 because the slowdown brought sharper outputs (fabrication detection). Worth it if the stack has graders in the loop; for simple use cases Q4 is the better trade.
KarenBoof@reddit
How big is your context?
sleepy_quant@reddit
64K in practice. Native goes up to 256K per the config, but the cache on a 64GB Mac makes anything past ~128K uncomfortable. Agent prompts are typically under 16K, so 64K has plenty of headroom.
AraAraKunn@reddit
How do you give it access to the local project? Cursor, for example, changes many files at a time; is there a way to do that using a local LLM? As of now, Qwen can't even look into the contents of a file, and we have to manually copy-paste the file contents ourselves.
New_Slice_1580@reddit
Are you using open code?
sturmen@reddit
I'm using LM Studio as well and it's unbearably slow due to prompt processing taking forever. Is there some sort of KV caching in LM Studio that I need to enable?
New_Slice_1580@reddit
Have you changed the context setting to the full 256k?
jomaha23@reddit
Have you tried running qwen3-coder-next? How does this one compare?
Medical_Lengthiness6@reddit (OP)
Ya, I was using that before this. I found that it was just not smart enough, hallucinating wrong answers. It was OK as far as calling tools and stuff.
jacobpederson@reddit
Really though? Running the 8-bit quant in OpenCode, and it can't even get the cube prompt running at all. A prompt Gemma-4-26b-a4b one-shot in 1 minute btw :D
AlwaysLateToThaParty@reddit
Qwen3.5 122B-A10 heretic mxfp4_MOE (75GB VRAM) one-shot that on my RTX 6000 Pro.
llitz@reddit
Can you share the settings you are using for the 122b?
AlwaysLateToThaParty@reddit
-m G:\ai\llama.cpp\models\120B\Qwen3.5-122B-A10B-heretic.mxfp4_moe-00001-of-00002.gguf --mmproj G:\ai\llama.cpp\models\media\mmproj-F32.gguf --jinja -ngl 200 --ctx-size 262144 --host 0.0.0.0 --port 13210 --no-warmup --temp 0.4 --top-k 20 --top-p 0.95 --min-p 0.00 --repeat_penalty 1.05 --presence_penalty 0.0 --cache-type-k q8_0 --cache-type-v q8_0 --image-min-tokens 1024 --image-max-tokens 4096
llitz@reddit
Ty, will give a try
jacobpederson@reddit
OP is talking about a qwen3.6-35b-a3b with 8 bit quant - which doesn't even get to a working version after an hour of debugging in opencode :D Gemma-4-26b-a4b seems way smarter - but can't do the tool calling in opencode . . . we always seem to be on the verge of something great in OS models, but never quite getting there. RTX 6000 Pro not realistic for local hardware - hell my 5090, 4090, 3090 setup isn't realistic either :D We need a tool harness capable model that can run on single 5090 class hardware.
GMP10152015@reddit
How does it compare to Gemma 4? Recently, I’ve been using Gemma4, and it was very good, especially for tool calling and code (using at least a 128K context window).
tarruda@reddit
3.6 feels much stronger than gemma 4
GMP10152015@reddit
Could you provide some examples? Thx
rm-rf-rm@reddit
As good as claude? Claude Haiku? Sonnet 4.5? Sonnet 3.7? Opus 4.6?
Claude is not a model.
Immediate-Rooster926@reddit
It’s as good as haiku 1
Agile-Orderer@reddit
AMAZE 👏
I’ve been waiting for 3.6, cause as good as 3.5 is, I still felt the need for Claude often enough to keep me from daily driving Qwen. This should reduce my dependence, cost and usage at least 🙌or maybe even get me fully local and secure at best🤞
Thanks for sharing👌
Krillian58@reddit
I don't know, I just switched from Opus to Qwen 3.6 Plus and it's substantially worse at everything I was doing. Maybe it's because it's picking up Opus's loose ends; would be nice to know.
Quartich@reddit
As someone without GPT or Claude subscriptions, 3.6 35b A3B Q6KL outperformed the free versions for my complex test program prompts.
Tested 2 different prompts based on programs I had made. Free GPT couldn't do it right even with some handholding, and Claude just ran out of tokens. Qwen took 30-45 minutes and got it right on my first prompt (with OpenCode). I'm sure the paid tiers are much better, especially if I ran them with OpenCode.
No-Name-Person111@reddit
Qwen 3.6 is somewhere below sonnet in my experience thus far.
That's not a bad thing. It's impressive. But in head to head contests for day to day networking and system infrastructure work, it's a B+ compared to Opus at A+ and Sonnet at A-.
You just can't compare something like Opus against a local model. The scales are different.
Late_Film_1901@reddit
Yep it's just wild to compare a tiny local model with what I was paying openai for just a year and a half ago.
Krillian58@reddit
I use Gemma 4 / ministral 3 / qwen 3.5 for any sustained subagent tasks like scraping with tandem and to summarize every conversation at 10k token marks for reinjection on compression. Works swimmingly. Definitely took some opus turns to figure out how to effectively prompt them though. I havent tried to recreate my 2024/25 chatgpt experience yet.
Krillian58@reddit
I think I would agree with that.
I have tried Xiaomi's model and was able to do most of what I was doing with Opus, and at the time Qwen 3.6 Plus felt like 3rd, but now that 4.7 Opus is out I think it was just because they had watered down Opus at the time. Starting on Opus spoiled me.
n4pst3r3r@reddit
Apples to Oranges, kind of. Opus is in another league, probably much larger and super expensive. If anything, Qwen Plus should be compared to Sonnet.
Quartich@reddit
I have a 3090 + 3060ti + 64GB RAM and I am thoroughly impressed. With Open Code I am "one shotting" working programs that the free tiers of ChatGPT and Claude can't write; ChatGPT couldn't even with handholding.
sob727@reddit
Yeah. About Qwen.
caetydid@reddit
does it matter for coding?
sob727@reddit
I'm not sure. But I trust the model less, that's certain.
Savantskie1@reddit
This is why I use uncensored versions of 3.6
Limp_Classroom_2645@reddit
I'm running it with 250k context on an RTX 3090, and for my use cases I have the same experience as you; at some point I just forgot I had switched OpenCode to a local model while I was working the other day.
GravitasIsOverrated@reddit
What’s the advantage of manually compiling llama.cpp?
Limp_Classroom_2645@reddit
Being able to pull the latest changes as soon as they're released, and probably better performance.
llitz@reddit
That, plus compiling the turboquant version with support for CUDA and ROCm/Vulkan in a single binary.
TheOnlyBen2@reddit
Interesting, thanks for sharing.
How many t/s do you reach on a 3090? Also, how do you reach 250k context?
Limp_Classroom_2645@reddit
Between 60-70
bobspadger@reddit
Can you provide your config for this? I’ve got a 3090 as well and have just been using LmStudio for speed of testing, but would prefer to use it headless on my Ubuntu server
Covert-Agenda@reddit
I’ve got the same machine and been running the same model via mlx - this is the first time I’m actually impressed with local AI.
tken3@reddit
I’ve considered getting into this. How well or poorly would this run on an RX 6800 with 16GB VRAM?
TopChard1274@reddit
FeelsGoodman
Saul Goodman
Artistic_Okra7288@reddit
I've been running Unsloth's Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf with Q4_0 KV cache at 1M context on my MacBook Pro M5 Max 128GB. It's been working OK but it's definitely no Claude Sonnet 4.6.
Still-Wafer1384@reddit
Why would you run the model at q8, but your cache at q4?
(Not that it's gonna bring you up to Sonnet 4.6, but anyways)
Artistic_Okra7288@reddit
Well because I started with what I could run at ~200k context with minimal quant. So I started with the Q8_K_XL with unquantized kv cache at 256k and decided to push it as high as I could, ended up at 1M with Q4_0 and haven't wanted to change it yet. I'll probably revert back to probably 512k because 1M is just too large for my needs, just wanted to see if I could do it.
Still-Wafer1384@reddit
Understand, thanks. Have you tested running the model at q6? Supposedly there shouldn't be much difference with q8.
Artistic_Okra7288@reddit
From the graph that was posted the other day, it looks like the Q6 is probably the sweet spot for size vs perplexity. I have the RAM so figured I'd go higher. Probably not going to be playing much more with Qwen3.6 until other models come out. It just still isn't very good at agentic coding for my needs. Probably will try to get MiniMax M2.7 working... struggling getting to 200k context with that one... might go back to Qwen3-Coder-Next... actually just remembered Devstral-2-123b, I wanted to try that when it first came out. I think MacBooks unified RAM, it's basically geared for MoE models so it will probably struggle with dense models like Devstral. Hopefully Qwen3.6 larger models are decent! The multimodal on Qwen3.6-35B-A3B has been pretty nice.
Still-Wafer1384@reddit
Did you find qwen3 coder next better?
Artistic_Okra7288@reddit
For some specific things it does equally well or better. Some things not so well. I’m doing custom ansible roles and automation using Proxmox and kubernetes and Qwen3.6 is struggling with things outside its training set. Even when I give it full access to source and documentation. Qwen3-Coder-Next does an adequate job but still not as good as Claude Sonnet 4.5
an0maly33@reddit
Been rubbing it here for a couple of days too and it's better than any other local model I've tried. Gem4 was decent, especially after the template fixes and llama updates. But qwen3.6-35-a3b has been a rockstar with pi so far. It's not perfect but when I do hit an error, I can paste the output and it figures it out on the first try. The self linting catches almost any problems before it turns things back over to me anyway.
With gem4 I was getting the "I'll do this thing." Then it would just sit there so I'd have to nudge it with "ok... so do it then." Also inevitably hit thought loops. So far on qwen, I've had no instances of needing nudged to continue and had only one thought loop. Cancelled generation and had it try again and it got through it the second time.
I wonder if people having trouble with it aren't getting the parameters set correctly when loading the model. There's a page on unsloth's site that tells you the settings both for conversation and coding/agentic work.
Medical_Lengthiness6@reddit (OP)
I was seeing some weird stuff with Gemma 4 as well. Like it would say " I need to see x and y files to do that." Despite them being available in the directory. Like it wasn't really trained to use tools very well. Like bro you can grep them..
an0maly33@reddit
My biggest problem was gem would edit a file but it seemed like it was making assumptions on what was in it. It would constantly fail because the pattern it was looking for didn't match. It worked better when I told it to always read the file before editing but it would forget to do that a lot. So instead of changing 1 line, it would rewrite the whole file.
Techngro@reddit
"Been rubbing it here for a couple of days..."
Ouch. Take a day off, bro.
an0maly33@reddit
lol. Gotta love mobile keyboards/correction.
goldin_pepe@reddit
Cope
OkBase5453@reddit
Me too, same model on a 3090 Ti with spillover on my 512GB server... but with Qwen-Code... I don't know if it's "as good as", but I don't complain, since there are no significant errors or hallucinations... then again, I don't vibe code with it, just scripting and debugging...
evilbarron2@reddit
This is my setup - been running qwen3.5 since its release and been happy, planning on moving to 3.6 once it settles down. You’re getting good results from the 35b moe?
OkBase5453@reddit
So far very impressive. I am starting to learn how to use the qwen code for my use cases, but it is super fast at agentic work.... also some benchmarks: RTX 3090 Ti + 512GB + 2x E5696v4:
root@llama-cpp:\~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/nvme-llm/models/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf -ngl 14,18,999 --n-cpu-moe 999 -t 78 -ser 1,8 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | ser | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---------: | ---: | ------------: | ---------------: |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 14 | 78 | 1,8 | 0 | pp1024 | 420.66 ± 28.34 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 14 | 78 | 1,8 | 0 | tg128 | 21.34 ± 0.04 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 18 | 78 | 1,8 | 0 | pp1024 | 442.93 ± 10.75 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 18 | 78 | 1,8 | 0 | tg128 | 22.25 ± 0.04 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 999 | 78 | 1,8 | 0 | pp1024 | 534.87 ± 13.81 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 999 | 78 | 1,8 | 0 | tg128 | 42.67 ± 0.26 |
danihend@reddit
I am running Q2_K_XL on RTX3080 10GB+64 GB RAM and it's amazing. 30 tps, max context window. It is not as good as frontier models for sure, but it is seriously capable of helping out. Not too long ago we were dreaming of running something like Deepseek R1 locally at 2 tps and this is better than it for coding and we can run it on a regular computer. Pace of improvement is mind blowing.
HockeyDadNinja@reddit
I'm also using the 8 bit quant. I have a rtx 5060 and 4060 with a total of 32G vram, 64G system ram.
I used opencode today to start a project and I'm so impressed. 27 t/s isn't blazing fast but I wasn't annoyed with the wait. I have some upgrades planned too.
caetydid@reddit
is it really worth to use q8 instead of having e.g. q6 and doubled speed?
HockeyDadNinja@reddit
This would be my attempt at max quality. I also intend to run Q4 for speed but I may test Q6 as well.
fromage9747@reddit
What settings are you running?
HockeyDadNinja@reddit
I'm running llama-server like this in order to switch models, etc.
llama-server --host 0.0.0.0 --models-preset ./models.ini
And in there I have:
; Qwen3.6-35B-A3B (MoE: 35B total, ~3B active)
; Q8: 35.8GB model, MoE expert offload to CPU RAM, target ~96K ctx
; --fit auto-picks n-cpu-moe per device (handles dual-GPU split that fixed N can't)
; fit-target 512 MiB headroom per device; KV at q8_0 halves footprint
[Qwen3.6-35B-A3B-Q8]
model = /vol2/LLM/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf
c = 98304
fit = on
fit-ctx = 98304
fit-target = 512
no-mmap = true
mlock = true
; Put faster 5060 Ti (CUDA1) first so it holds layers 0-15;
; layers execute sequentially, so the faster card starts every token.
device = CUDA1,CUDA0
cache-type-k = q8_0
cache-type-v = q8_0
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.00
fromage9747@reddit
Thanks mate!
NNN_Throwaway2@reddit
Yes, it seems based on reports to be about as good as Qwen3.5 27B, which was already competitive with Claude 4.5 models for a lot of stuff. The 3.6 versions of 27B and 122B will be crazy if they see a similar jump in performance. My expectation is that the 122B will be a powerhouse, as all the MoE from 3.5 felt a little undercooked compared to the 27B dense. The 35B being as good as it is now seems to be bearing out that hypothesis.
H_DANILO@reddit
No way, it's way better than Qwen3.5 27B.
I was running Qwen 397B locally and this is on the same level but faster.
john0201@reddit
What hardware and tps for 397?
H_DANILO@reddit
rtx 5090 + 128gb ram ryzen 9900x3d.
TG: 20 TPS
PP: 1200 TPS
Usable at Q2
FullstackSensei@reddit
You're running Q2 397B and comparing to 27B?!!! Of course Q2 won't be much better
H_DANILO@reddit
Is there a rule saying Q2 can't be compared?
Up to this point, Qwen 3.5 397B Q2 would beat anything local, including Gemma 4.
It doesn't beat a model that is 1/4 of its size now(Qwen 3.6 35b)
NNN_Throwaway2@reddit
The 397B basically suffers zero quality loss down to Q2. Settle down.
FullstackSensei@reddit
Sorry but no. I run Q4_K_XL and even that is not the same as Q8, which I also tried.
Maybe if you're doing trivial stuff, but for anything remotely complicated the difference is there.
NNN_Throwaway2@reddit
Sorry but yes.
FullstackSensei@reddit
You're changing the goal posts. Your claim was that Q2 has zero quality loss.
NNN_Throwaway2@reddit
Its a very small amount however you want to slice it.
NNN_Throwaway2@reddit
Well there you go, then.
outthemirror@reddit
8 bit on my 3090 and epyc rig can confirm it is good
_int10h@reddit
Did anyone try to get their hands on a GB200 system? People sometimes sell these on eBay.
ContextDNA@reddit
I have a couple of 2011 MacBook Pros with 64GB RAM and a 2025 M5 with 32GB RAM: any chance these could do anything at all anywhere even close to as amazingly cool as what you all are doing?
Should I even attempt to run qwen3.6-35b-a3b?
Every-Comment5473@reddit
Anybody running this with vllm with 8bit on RTX Pro 6000? If yes it will be very helpful to share the command for it.
AlwaysLateToThaParty@reddit
I've used the full qwen3.6 35b/a3 quantisation, but so far the 122b/10 heretic mxfp4_MOE model is better. For me at least. Gave them the same prompts too.
xignaceh@reddit
Perhaps to add if I may, anyone got a vllm command for Gemma 4 31b? I got it running but the context is eating vram like crazy.
bandman614@reddit
Been a while since I looked at local models - what's the memory footprint for an 8 bit quant 35b model?
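Rough answer to the footprint question: quantized weights cost roughly (bits per weight)/8 bytes per parameter, plus a little for quantization scales, so a 35B model at Q8 lands in the mid-30s of GiB before KV cache and runtime overhead. A sketch (the effective-bits figures are common ballpark values for GGUF quants, not exact for any specific file):

```python
def weight_gib(n_params_b, bits_per_weight):
    # bits_per_weight: effective bits including quant scales
    # (Q8_0 is ~8.5, Q4_K_M is ~4.8 in typical GGUF files)
    return n_params_b * 1e9 * bits_per_weight / 8 / (1024**3)

print(f"Q8_0  : {weight_gib(34.66, 8.5):.1f} GiB")
print(f"Q4_K_M: {weight_gib(34.66, 4.8):.1f} GiB")
```

That puts Q8 around 34 GiB, which lines up with the 35.80 GiB file size in the llama-bench output earlier in the thread (the _K_XL variants keep a few tensors at higher precision).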
sakuser@reddit
This is the exact set up I’m looking to get but with the release of Gemma and more efficient models it was making me question if I should go for 64gb or invest in the 128gb. Wished a 94gb version existed, would be the sweet spot lol do you have the 16inch for less throttle or the 14 for portability?
logic_prevails@reddit
I can assure you it is not as good as claude, but it is quite good
mister2d@reddit
Which Anthropic model are you referring to? Claude Code is just the harness, which you can use to run Qwen!
logic_prevails@reddit
Sonnet or opus 4.6
xrvz@reddit
Thank you for your useful answer!
Personally, I think Gemma E2B outperforms Cunthropic's Mythos.
herovals@reddit
you can’t even have access to mythos so why just make it up?
xrvz@reddit
Exactly, I can use Gemma, and can't use Mythos at all. So Gemma is infinitely better for me. 👍
GasolinePizza@reddit
You didn't say "better for you", you said "outperforms".
So unless you have some access that the rest of us don't, that's also known as "just making stuff up"
NgtfRz@reddit
M5 Pro and the same. To be honest, 8 bit quant seems overkill.
br_web@reddit
For conversational (q&a) type of chats is it better than Gemma 4 26B MoE?
rz2000@reddit
I think it’s better. I thought the 26B MoE was a moron compared to the Gemma 4, but the 31B is pretty slow on an M5 Ultra: 10 t/s @ bf16, 20 t/s @ Q8_K_XL. The same quants of this new Qwen are 50 t/s and 60 t/s, and it seems much smarter than the Gemma 4 MoE, which wasn't as fast either.
Top-Rub-4670@reddit
For casual role play? no, 26b is more pleasant and "moldable". For technical questions? Yes, but only with thinking enabled.
With thinking disabled it still fails the car wash question, for example.
Savantskie1@reddit
The model works great with a system prompt instead of zero prompt. Especially with thinking turned off.
scythe000@reddit
Which would be the best version for me to run on a 24 gig 3090?
autonomousdev_@reddit
Tried that exact setup last month. For my use case (code review automation), it hallucinated function signatures 20% of the time. Claude 3.5 Sonnet still wins on consistency for me, but the cost difference is massive.
bwjxjelsbd@reddit
It’s insane that a model this size can be as good as, or even close to, Claude with its trillions of parameters.
Rubixu@reddit
it's not great at c++ / reverse engineering
It produces code that doesn't even compile, and its reverse-engineering approach is completely dumb... Claude 4.5 is miles ahead.
milpster@reddit
You might want to use a q6 quant instead, i don't think there is anything to be gained between a q6 quant and a q8 one.
Potential-Leg-639@reddit
For more serious agentic coding tasks you need much more context and it‘s for sure not as good as Claude
Emotional-Insect1060@reddit
squeezing for computational efficiency and control here not really something quick on the fly. Perhaps a stronger foundation and understanding for weathering a storm and if need to temporarily generate something on the fly, then yeah there’s the higher context at a time models/setups. I can go buy and do things on the fly, but it’s painful blabla.. at least it feels as such.. not as confident in the price/performance and considerations. Not feeling bulletproof. Sure a bit of adventure is good yeah, but vulnerability. Live longer.
sn2006gy@reddit
wut
Emotional-Insect1060@reddit
okay, okay. Some things people can do on the fly and it’s solidly amazing
sn2006gy@reddit
that's easier to read than your prior paragraph
ExplorerPrudent4256@reddit
The qwen3.6 release really caught everyone off guard - suddenly that '3.5 was final' narrative looks a bit awkward. But hey, at least they're shipping instead of endlessly teasing.
sammcj@reddit
It is good, but it is nowhere near as good as Claude, not even Sonnet. I suspect for simple things it may be practically indistinguishable but it confidently misunderstands more complex problems. At the end of the day it's a very small 35B parameter model with only 3B active - it's amazingly good for that size, capable at tool calling and a huge leap from where we were a year ago but it's not as good as the much larger Sonnet / Opus models.
evilbarron2@reddit
Tbf I find sonnet confidently misunderstands complex problems, and happily lies to cover those misunderstandings, especially with longer tasks and conversations. Starting to suspect the bigger resources aren’t always an advantage.
bigsybiggins@reddit
How fast is the prompt processing when context fills up? Like at around the ctx limit are you waiting minutes?
Medical_Lengthiness6@reddit (OP)
I think I haven't gotten to that point yet. I tend to keep things pretty laser focused. If something is taking so long that it needs the whole context I just do it myself or break it apart into smaller tasks.
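For a rough feel of the wait being asked about: prompt processing time is roughly context tokens divided by pp rate. The rates below are assumptions pulled from numbers reported elsewhere in this thread (~450 t/s pp on a 3090-class card with CPU offload; Apple silicon is typically lower):

```python
def pp_wait_s(ctx_tokens, pp_tok_per_s):
    # time to prefill a full context, ignoring any KV cache reuse
    return ctx_tokens / pp_tok_per_s

for ctx in (32768, 65536, 131072):
    print(f"{ctx} tokens @ 450 t/s -> {pp_wait_s(ctx, 450):.0f}s")
```

So a cold prefill at 128K is already minutes; in practice harnesses reuse the KV cache across turns, so only new tokens get processed and the wait is usually far shorter than this worst case.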
Guilty_Spray_6035@reddit
I found gemma-4-31b-it better in complex tasks
Medical_Lengthiness6@reddit (OP)
I tried it but for some reason it was taking forever to process the initial prompt, like super slow, and then the output was not even as good as Qwen.
Tomr750@reddit
how does 8 bit compare to 4 bit?
fittyscan@reddit
Much better. Compared with the Qwen 3.5 models, the gap between Q4 and Q8 is much more significant, especially for tool calling and reasoning.
Certain-Cod-1404@reddit
64k context is very low for agentic coding no?
fittyscan@reddit
The harness makes quite a difference. I use Swival, which is built for small context windows, and in practice it really feels like a 16K context works just as well as a 256K one, even over long sessions without any manual compaction.
I get much better and more consistent results with it than I do with Claude Code and Opencode when running local models.
Even if you have enough RAM, local model performance degrades pretty quickly as the context grows. So even though I could push it further, I usually stick to 16K or 32K max to keep things reasonably fast
SnooPaintings8639@reddit
Very, very low for this model especially.
Medical_Lengthiness6@reddit (OP)
So far 64k has been ok today; it was at 32k, which wasn't enough. What do you use?
I don't do full agentic though, more research and medium targeted refactors.
StardockEngineer@reddit
Go full 256k. You have the RAM
Medical_Lengthiness6@reddit (OP)
I'll try it, thanks
Certain-Cod-1404@reddit
>I don't do full agentic though, more research and medium targeted refactors.
then yeah, 64k is probably good enough, especially since you get to use an 8 bit quant. I use 240k with q8 KV cache for agentic coding with OpenCode. It's really impressive what these Qwen 3.5/3.6 models can do, man what a time to be alive.
Medical_Lengthiness6@reddit (OP)
ya good times. I had tried the prior qwen coder on a MacBook air and even though it wasn't great beyond simple stuff, I could tell things would get a lot better, esp with more RAM. Glad it's proving true and my mbp purchase is paying off
Cherlokoms@reddit
Last sentence about sending the codebase to randos hit hard
MeganDryer@reddit
I used the same system on an H100 yesterday using Qwen Coder. It was _not_ as good as Claude, but it was more than good enough to do coding tasks. Absolutely amazing and the first system I've seen actually work as a local agent.
iamn0@reddit
I tested Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my system with 4x RTX 3090, using up to around a 200K context window. I can confirm that, for me right now, it’s a viable alternative to opus 4.7 in opencode (although it's worth noting that opus is currently nerfed).
Compared to larger models, you should be as precise as possible with your prompts. otherwise, qwen can get stuck in a thinking loop. For example, if you tell the model that a file exists when it actually doesn't, it may enter a thinking loop. On the other hand, it's often smart enough to catch mistakes in the prompt as well.
Longjumping_Virus_96@reddit
have you tried q4 ggufs of larger local models?
Medical_Lengthiness6@reddit (OP)
I havent
Hood-Boy@reddit
Is anyone running it in AMD Strix Halo and show me their llama.CPP command?
I'm still trying to understand the optimizations
yehyakar@reddit
ykarout/Qwen3.6-35B-A3B-NVFP4
CriticalCup6207@reddit
The M5 Max unified memory story is genuinely compelling for local inference. We run similar setups for internal tools: keeping model weights in memory alongside application data without PCIe bandwidth bottleneck changes the latency profile significantly. How's thermal behavior holding up on extended sessions? M-series chips have a tendency to throttle after 20-30 minutes of sustained inference load.
chankeypathak@reddit
Will it work on my gtx 1650 4gb, ryzen 5 3600, 32gb ram 3200mhz?
Alex_1729@reddit
But 64k... Can you actually do any serious work with this?
Also, Claude is dogshit rn
Heavy-Focus-1964@reddit
i must be doing something wrong. i just tried using this exact model with pi on omlx, and watched it get into a loop of panic about having to fix 200 lint errors of the same 7 types.
it would just say it was going to break them down by file, see there was 200 and start freaking out again. rinse and repeat. i was going to take a screenshot but i was too annoyed
cafedude@reddit
It's as good as what version of Claude? I could see it being as good as Claude from about a year ago, so Claude 4.0?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
valkiii@reddit
What kind of tasks are you running? Are you doing coding?
Medical_Lengthiness6@reddit (OP)
Ya just coding. I mostly send it off on research or refactor tasks in the background.
ranting80@reddit
It's not better than claude. It's extremely good for a local model especially at this weight and especially as an MoE. I've used M2.7 and honestly I'd say it's near par to that which is incredible for how small and fast it is.
nakedspirax@reddit
Running 8 bit quant with 250k context on strix halo with 128gb ram. Surely you can up the context.
Medical_Lengthiness6@reddit (OP)
I ended up upping it to max after someone else said the same as you and it's still running great
riceinmybelly@reddit
Am running it via LM Studio on a Mac Studio in Hermes (in Docker) and it keeps stopping once the context is 50% full, when it should compress. I even use my previous working model (qwen coder next) to compress the context now. That helps because it unloads the 3.6 model, but only for a while.
Anyone else running it with lm studio and hermes?
Medical_Lengthiness6@reddit (OP)
Fwiw you have to crank the context up in LM Studio on the model if you haven't already. also in opencode config have to set a limit
riceinmybelly@reddit
Yeah I’ve given it full context and hermes detects it as being 262k correctly. Maybe the rolling window is the problem or the jinja is having problems
Blues520@reddit
I tried the Q6 unsloth quant for a day and ended up going back to qwen3-coder-next.
Aroochacha@reddit
It’s not in my experience. It’s good, but for the big task, I still rely on MiniMax-M2.5 running locally. Even then, Claude is just on another level.
My experience with the Qwen models is that it takes engineering effort with prompts to get it to perform. For work on my workstation, I can give it a series of prompts to complete a task that also includes verification prompts after each task. Even then, sometimes I have to break down a step even more because the model produces gibberish or times out.
That said, I do like reading the positive feedback on Qwen3.6. I’m excited to put it to work on Monday.
totallynotmyfakename@reddit
It's good, but there's no way it's as good as Claude lol. Pretty far apart in my experience, even with free Claude.
DraconPern@reddit
I am using it with LM Studio and Continue; what am I missing by not using OpenCode?
Medical_Lengthiness6@reddit (OP)
I haven't kept up with continue. That's like cursor right? Opencode is more like Claude code or codex where it's just a terminal (although I think they have a cursor like option..)
RazsterOxzine@reddit
No, you're like spot on. I've been using it through LM Studio to review my older Javascript/CSS, which is all I need, and it is perfect. With Claude I would sit and go through a few changes to get it right. Qwen3.6 one-shot fixed and created some UI elements that I had a hard time explaining to Claude. I'm so sold! I need to save up and get a 5090.
onyxlabyrinth1979@reddit
Performance is great but the real win is control. Once you stop shipping your code and data out, a lot of hidden risk disappears. Same lesson on the data side, owning the pipeline and rights matters more than raw model quality if this ever becomes part of a product workflow.
Luke2642@reddit
What tok/s?
Medical_Lengthiness6@reddit (OP)
About 80
Savantskie1@reddit
It's probably because I'm using MI50'S But I get 50 t/s. Still good for me
picosec@reddit
I've been pretty impressed with qwen3.6-35b-a3b, it is a big improvement over 3.5. It can perhaps do some things as well as Claude, but there are almost certainly things Claude will do better on.
Medical_Lengthiness6@reddit (OP)
For sure. I usually have a daily driver and then only use codex or Claude for things where my daily can't manage, but that range has been decreasing
PlayfulLingonberry73@reddit
In my experience, if you are building simple websites or generating content, it's fine. But if you are building something complex, the difference is definitely noticeable. And the context length will also matter.
Potential-Leg-639@reddit
Way too much hype about it. It's still only a 3B-active MoE model, so I don't expect too much; for serious stuff it probably won't be able to compete with bigger or dense models.
dreamai87@reddit
That’s what I thought but no it’s pulling a lot above its weight.
MR_Weiner@reddit
Eagerly awaiting 3.6 27b. I hope!
Medical_Lengthiness6@reddit (OP)
That's what I said about the old qwen3 coder on a Macbook Air. This one seems to be capable of complex workflows, not sure what you're seeing. I throw complex shit at claude and codex and they are retarded so there's a limit that even the top models struggle with.
LocoMod@reddit
They are retarded because your prompting is retarded. Garbage in, garbage out. This is easy to prove, so put your pride where your mouth is. Show us snapshots of the same workflow running against Qwen3.6 and gpt-5.4/claude-sonnet. It's really that simple.
I won't hold my breath. Saving face is more important than seeking truth, isn't it?
antunes145@reddit
It's superior at coding compared to MiniMax 2.7 imo.
ComplexJellyfish8658@reddit
Have they supported the MLX backend yet? Hard for me to move off qwen3.5, since it supports MLX with some builds.
sskarz1016@reddit
Many quants from mlx-community on HF.
SomeOrdinaryKangaroo@reddit
Yeah, this model is wayyy better for coding than Claude opus 4.6 and opus 4.7
Easily best coding model available right now
ryfromoz@reddit
i would actually believe it now they nerfed opus. Actually no i dont 🤣🤣🤣
BP041@reddit
This is awesome! Running Qwen3.6-35B locally with that context on a MBP M5 Max sounds like a beast setup. I'm curious, what's been your experience with the 8-bit quant for specific tasks or models? At CanMarket, we've explored different quantization strategies for our Style Genome models, especially balancing performance and precision. OpenCode is a great choice too. Any specific workflows you've found particularly optimized with it?
themostsuperlative@reddit
What hardware are you running this on?
Medical_Lengthiness6@reddit (OP)
mbp m5 max 128 gb
rebelSun25@reddit
I'm not impressed so far. It didn't know to call getContents() on psr request body returned by guzzle... Which is shocking because it got a lot right. I guess that's the price for smaller total parameter models
CircularSeasoning@reddit
No one understands what you said because we don't read or write code anymore. Have an upvote!
ikkiho@reddit
the a3b architecture is what makes this viable on apple silicon - 3b active keeps tok/s close to a cloud-hosted dense 30b while you still get the full 35b knowledge breadth. 64k is light for true agentic chains but for typical multi-file edits it's plenty - context degradation past ~40k tends to bite harder than raw model size does.
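That active-parameter point is the whole story on unified memory: decode is roughly memory-bandwidth-bound, and a MoE only streams its active expert weights per token. A rough ceiling model (the bandwidth figure is a ballpark for Max-class Apple silicon, and real throughput always lands below this bound):

```python
def decode_ceiling_tok_s(active_params_b, bytes_per_weight, mem_bw_gb_s):
    # Upper bound: each generated token requires one full read of the
    # active weights from memory; attention/KV traffic is ignored.
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return mem_bw_gb_s * 1e9 / bytes_per_token

# ~3B active at Q8 (~1 byte/weight) on ~546 GB/s unified memory
print(f"MoE, 3B active: {decode_ceiling_tok_s(3, 1, 546):.0f} tok/s ceiling")
# A dense 35B at Q8 on the same bandwidth for comparison
print(f"Dense 35B     : {decode_ceiling_tok_s(35, 1, 546):.0f} tok/s ceiling")
```

The ~180 vs ~16 tok/s gap between the two shapes is why OP's reported ~80 tok/s at Q8 is plausible for an a3b model but would be impossible for a dense 35B on the same machine.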