I'm running qwen3.6-35b-a3b with an 8-bit quant and 64k context through OpenCode on my MBP M5 Max 128GB and it's as good as Claude
Posted by Medical_Lengthiness6@reddit | LocalLLaMA | View on Reddit | 312 comments
Of course this is just a "trust me bro" post, but I've been testing various local models (a couple Gemma4s, qwen3 coder next, Nemotron) and I noticed the new qwen3.6 show up on LM Studio, so I hooked it up.
VERY impressed. It's super fast to respond, handles long research tasks with many tool calls (I had it investigate why R8 was breaking some serialization across an Android app), responses are on point. I think it will be my daily driver (prior was Kimi k2.5 via OpenCode zen).
FeelsGoodman, no more sending my codebase to rando providers and "trusting" them.
New_Slice_1580@reddit
How does it compare to Gemma4 a4b?
Because as per the name, this has 4B active parameters compared to Qwen a3b's 3B
RegularImportant3325@reddit
I’ve been very impressed as well. The speed is nuts and the quality, especially in research, is really great.
Going to up the context tonight.
ohthetrees@reddit
How long does prompt processing take for a 30-40k token prompt? I have a lesser machine, an M5 MacBook Air with 32GB of RAM, and I can run ~30B 4-bit models at quite acceptable speeds, around 20 tokens per second, but prompt processing takes forever. So it's not really usable when I hook it up to Claude Code, only good for short chats. This is running on LM Studio with mostly defaults because I don't really know what I'm doing.
cosmicnag@reddit
It's the best local model so far IMO. On a 5090, the friggin speed gives an overall experience unmatched by any cloud model. The speed is insane. Haven't even tried an NVFP4 yet lol.
Sort-Aromatic@reddit
How are you running it? vLLM? What's your setup config? I was getting memory issues when trying to set mine up last night.
cosmicnag@reddit
Running llama.cpp with dual GPUs, a 5090 + 4090; sorry, forgot to mention that in the original comment. I use the Q8_XL unsloth quant with a Q8 KV cache and the full 262k context. On only a 5090 I'd prolly go Q6 for both and drop the context a bit, to 200k or something.
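For reference, a launch command along these lines would match that description (the model path, layer count, and split mode are illustrative guesses, not the commenter's actual invocation):

```shell
# Hypothetical llama-server launch for a Q8_XL quant split across two GPUs.
# Only the quant, Q8 KV cache, and 262144 context come from the comment
# above; everything else is a guess.
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --ctx-size 262144 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --n-gpu-layers 99 \
  --split-mode layer
```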
mannydelrio1@reddit
What's your setup for the 5090?
Glittering-Call8746@reddit
Is there an nvfp4 quant?
DistanceSolar1449@reddit
Yes, but nvfp4 Qwen 3.6 is braindead.
Running Qwen 3.6 35b on a 5090 is fast enough already, just stick with a Q4 gguf.
FinBenton@reddit
Q5_XL fits into a 5090 with the full 256k context nicely, maybe even Q6.
duyleekun@reddit
I'm doing Q5_XL; is Q6 worth the bigger size? :)
kozer1986@reddit
Can you share your llama.cpp configuration?
FinBenton@reddit
That's part of my systemd service file that auto-launches the model.
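A minimal unit file for that arrangement might look like this (service name, paths, and flags are illustrative guesses, not the commenter's actual file):

```ini
# /etc/systemd/system/llama-qwen.service  (illustrative sketch)
[Unit]
Description=llama.cpp server for Qwen3.6-35B-A3B
After=network.target

[Service]
ExecStart=/usr/local/bin/llama-server -m /opt/models/Qwen3.6-35B-A3B-Q6_K.gguf --ctx-size 262144 --n-gpu-layers 99 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now llama-qwen` once the paths match your system.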
Glittering-Call8746@reddit
Maybe cos the static weights is fp4 ?
DistanceSolar1449@reddit
It's because nvfp4 quants quantize attention and the SSM down to 4-bit, whereas most gguf quants keep them at q6/q8/bf16. The SSM especially needs to be bf16. That kills model perf while only saving a few megabytes.
Glittering-Call8746@reddit
Yes, that's why; it needs to be hybrid. Ty for confirming this.
fullouterjoin@reddit
https://huggingface.co/models?sort=trending&search=qwen3.6+nvfp4
ranting80@reddit
Yes there's a few of them out now: https://huggingface.co/mmangkad/Qwen3.6-35B-A3B-NVFP4/
CapeChill@reddit
How are you running it on a 5090? What quant and context? Been wanting to try it on that but wasn’t sure it would fit at a reasonable quant/context.
Still-Wafer1384@reddit
It doesn't have to fit entirely; a sparse MoE model like this can easily offload some to CPU / system memory without too much of a performance penalty. Also, Q8 is probably overkill; something like Q6 is probably the sweet spot.
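With llama.cpp, that partial offload is a flag or two. A hedged sketch follows; the values are guesses, and `--n-cpu-moe` needs a reasonably recent build:

```shell
# Keep attention and shared weights on the GPU; push the expert (MoE)
# tensors of some layers to system RAM. Numbers are illustrative only.
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --n-cpu-moe 12
# Older builds use a tensor-override regex instead, e.g.:
#   -ot "blk\.(1[0-9]|2[0-9])\.ffn_.*_exps\.=CPU"
```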
snikurtv@reddit
Based on the benchmarks, Q5_K_M is the sweet spot. Can fit Q5_K_M with ~160k context on a single RTX 5090 without any extra tweaking. Need to drop the context down to enable vision though.
FinBenton@reddit
Im running Q6_K on 5090 with full 256k context on llama.cpp no problem.
dreamai87@reddit
Bro, you have enough VRAM to fit this model. If you have 32GB, great; even a laptop 5090 gives you 24GB of VRAM. You can easily run the UD-Q4_K_XL quant and get around 100 t/s.
CapeChill@reddit
Q4 is where you start to see drop-off even if it's the XL, and running Q5 or Q6 only leaves ~16k of context headroom. Gemma 4 seems pretty workable with Q6 and 100k+ context. That said, my Strix Halo box just started testing it: 25ish tok/s by the look of things on Vulkan + llama.cpp.
FortiTree@reddit
Which model are you getting 25 t/s with?
I also have a Strix Halo, with 96GB on Ubuntu 24.04.3, trying out various models and combos like LM Studio + Vulkan and Ollama + ROCm 7.2.1. So far Vulkan outperforms ROCm in raw speed: 55 t/s vs 42 t/s for Qwen3.6-35B-A3B Q6_K_XL at 32K context.
Qwen3.5-122B-A10 Q3_K_M is a lot slower at 12 t/s on Vulkan.
I'll try llama.cpp next to see how much of a jump it can get, but I'm figuring out a prompt system for it first so I can compare apples to apples.
AlwaysLateToThaParty@reddit
Genuinely curious how you found the quality at that level of quantisation?
FortiTree@reddit
I'm also curious, but I'm not benchmarking model quality yet. So far I'm just using a list of simple logic questions (the car wash test, string on a nail), listing complex but well-known knowledge like the networking stack and its principles, and asking it to correlate them all, just to get a feel for speed and quality. Those tests won't mean much since everyone's use case is different, and the "sweet spot" really depends on your own codebase, hardware size, and requirements.
But the raw hardware speed and context memory overhead are the same for any question/task, so that's what I'm trying to benchmark first.
My takeaway so far is that I want speed more than the highest quality, because for the super hard stuff I'd trust Opus 4.6 the most, and I can reserve my cloud subscription for those tasks. Then I offload anything Haiku or Sonnet could do to the local server for speed and cost. This requires me to break up my agentic tasks to interact with multiple systems, which is entirely possible.
My goal is to find a setup that makes the best use of the AMD shared VRAM: a lot slower than Nvidia, but more capacity at the lowest price.
Regarding your quant-quality question, I usually just ask Opus which variant is proven best for a given model and have it check the latest community feedback. But the real cap is your memory size, so I can't really go to Q4 without sacrificing speed, and I'm not even sure I can load it yet. At Q3 it's already too slow for my taste; 13 t/s is barely usable. So I haven't engaged with it as much as Sonnet and Opus.
I think at the end of the day everyone has to try things out for themselves and find their own sweet spot. Code quality for production is what matters in the end, so none of these benchmarks mean anything if the model can't do what you need it to do.
mycycle_@reddit
Forgive me, but what benchmarks are we using? I find it really hard to trust the standard benchmarks since so many models are trained on them, but for comparing quants maybe they're still useful. Regardless, very happy to learn about new resources!
AlwaysLateToThaParty@reddit
My work computer has 16GB of VRAM, and I haven't been happy with the performance of many models at that capacity, but I've been reluctant to use bigger models at smaller quantisation. My work environment is locked down, so no cloud anything, really. You're right about use cases; they're all different. Who knew? Thanks for the insight.
FortiTree@reddit
Ofc, I'm happy to share. I think 16GB is already below the minimum for usefulness; that's your most limiting factor.
I'd gun for 24-32GB of Nvidia VRAM + CPU + 32GB of RAM as a minimum, so you have enough GPU memory to run the entire set of active parameters on the card (10x faster than AMD shared memory) and offload the MoE expert weights to CPU memory. That gets you the best speed plus the most capable MoE model at that spec. Picking the exact model is less important: this is the consumer sweet spot all the model companies are targeting, so the models will keep improving for you, and you just try whichever one the community recommends.
I have a different setup for work that the company paid for, so my local setup is mostly for experimenting and learning. I'm capped by the AMD memory speed, but I get a bigger memory space as the trade-off, to try different model sizes, and it's a lot cheaper ($2-3k range vs $5-10k).
If your company has you stuck with a mediocre laptop, I'd say invest in your own local setup and bypass it. In this AI era you need to move fast to stay at the top of the food chain; we can't afford to be stuck with ancient hardware.
AlwaysLateToThaParty@reddit
I'm afraid you mistake me. At home I have an rtx 6000 pro. I'm really talking about my locked down work computer. Good info though. Thanks.
cosmicnag@reddit
Well actually I have a multi-GPU setup: a 5090 plus my older gaming 4090 (which I didn't sell cause AI happened lol). I use the best unsloth quant, Q8 XL, with the max context window in Q8. It's amazingly fast and really usable. It's not the latest Opus, but damn, definitely a watershed moment for local AI. The raw speed makes up for however much better SOTA is.
CapeChill@reddit
My old gpu is AMD :/ but that makes sense, multi gpu.
karimusben@reddit
It runs well on my 7900xtx
CapeChill@reddit
Are you talking about Nvidia + AMD multi gpu here or just qwen 3.6 q4?
karimusben@reddit
Qwen
Free-Combination-773@reddit
IQ4_S fits into a 7900 XTX with max context
john0201@reddit
The latency and general experience is just better. I bought the perplexity search api credits and I just straight up prefer it for many things.
Opus 4.7 is still much better for coding. I sometimes use the 122B which gets close but it’s just too slow on an M5 max. Once Apple releases an M5 ultra and there’s a qwen3.6 122B q6 to try I’ll give it another shot.
Smooth_Bus_3010@reddit
Sorry to burst your bubble: cheap GPUs will never happen. There are endless AI applications these are useful for beyond LLMs. Smaller companies like mine will always want more GPUs, no matter the power draw.
john0201@reddit
I don’t think you have a good understanding of how the semiconductor business works.
-p-e-w-@reddit
There are plenty of cheap GPUs available today: Any that don’t support BF16. The same thing will happen in the future with newer technologies. Once a GPU is missing a must-have feature, it’s worthless.
techdevjp@reddit
Which will be mighty interesting, except that setup draws a peak of 10,000W through 6 x 3300W PSUs. Plus the cooling requirements.
But, 1.1TB of H200 memory would be quite a trip.
john0201@reddit
That is exactly why old server gear is so cheap, power.
techdevjp@reddit
Yup, which was my point. $10k is easily affordable for a lot of people. Furnishing it with 10,000W of power and the required cooling is not.
xorgol@reddit
That definitely requires changing contracts, but 6kW is fairly standard around here, it doesn't seem entirely unfeasible.
techdevjp@reddit
It's certainly not standard where I live. 10G fiber is cheap and easy to get, but a 10kW power hookup would be nearly impossible for a house, and very expensive.
No_Communication7072@reddit
Especially because the police will ask you why you need so much power, and whether you're gonna grow weed
No-Budget2376@reddit
DGX Station will be available in 6 months and that will smoke anything; it's way faster and cheaper. Frontier models will be scaling up parameters by then, so the 1T models will be local and the 10-100T ones cloud
john0201@reddit
A B300 will not be faster than 8xH200. It will be more efficient though.
techdevjp@reddit
DGX Station has appeared for order already, just under $100k. I've no doubt it will be incredibly powerful and of course it has a truckload of memory. And it's actually designed to run under someone's desk. But if anything that helps make /u/john0201's point that the price of 8xH200 is going to fall, and fast.
darktotheknight@reddit
You can easily cap H200 at 400W. Even at the minimum cap of 200W they will still crush anything you've ever seen so far.
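The cap itself is one line of stock Nvidia tooling (device index assumed):

```shell
# Enable persistence mode, then cap GPU 0 at 400 W.
sudo nvidia-smi -i 0 -pm 1
sudo nvidia-smi -i 0 -pl 400
```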
That being said, these systems will stay unaffordable for quite some time. VRAM is king, and 141GB of HBM3e will stay relevant for quite some time. It would absolutely surprise me if a single H200 were listed on eBay at $10,000 within the next 3 years, let alone 8.
xrvz@reddit
You're using those words wrong. Correct is: within the next 8 years, let alone 3.
mesasone@reddit
Read the sentence again. They were talking about the number of cards available on eBay, not the time span.
Polite_Jello_377@reddit
No, you are wrong
techdevjp@reddit
As /u/No-Budget2376 pointed out, DGX Station will be arriving soon. Single device under your desk with 748GB of memory and performance far exceeding GB200.
https://www.reddit.com/r/nvidia/comments/1jh2exl/whats_the_update_from_gb200_to_gb300_sorry_if/
They're going to be $100k, but even at that price they will dramatically drive down the price of H200 systems.
Ok_Clue5241@reddit
I HOPE you’re right about the price of enterprise gpus, but come on…..
john0201@reddit
I have 2x5090s and can train GPT-2 in an hour. That model was SOTA when we were freaking out about covid. Wasn’t all that long ago. The servers they trained it on are now not quite doorstops but close.
fullouterjoin@reddit
Which mechanism are you using to use perplexity with opencode?
john0201@reddit
You just add it as a tool
AlwaysLateToThaParty@reddit
I would like this very much.
MongoWithBongoss@reddit
Hopefully, photonic computing will bring down GPU prices.
Likeatr3b@reddit
This is the thesis of my company
k0zakinio@reddit
This is what I think too: the tipping point has been reached where models capable enough for most day-to-day coding tasks can be run locally. You'd still want the 'smart' model to help with more zany things, but this massively reduces the value proposition of the big model companies.
I wonder whether there will be a DeepSeek moment in the markets once they get wind of this. There was about a week's delay after DeepSeek launched before the market actually reacted last year.
9gxa05s8fa8sh@reddit
concur, and we're 1 year away
john0201@reddit
I think it’s probably towards the end of the Rubin lifecycle, so 18-24 months. By then AMD, Intel, and the other secondary players like Google/Meta/Apple will have caught up.
Medical_Lengthiness6@reddit (OP)
That's what I noticed right away. Immediately responds with quality
NaabSimRacer@reddit
5090? does it fit on 1?
wen_mars@reddit
Q4 fits easily on 5090 with max context
weedebest@reddit
better than gemma4?
jacobpederson@reddit
Try this prompt please. Frustrated yet? Now boot up Gemma-4-26b-a4b and watch for a 1 minute one-shot :D
msaraiva@reddit
Can you please share the quant and settings you used with the 5090?
cosmicnag@reddit
Read the other comments first: I actually have a 5090 + 4090 setup, so I can run Q8XL with a Q8 KV cache at max context. With only a 5090, I suppose you can do the same with Q6 quants for the model. See other people's posts in this thread as well. Cheers.
nomorebuttsplz@reddit
by "local" you mean can fit a 3090?
cristoper@reddit
The Q4 quantized ggufs can fit and work well on a 3090
YairHairNow@reddit
Nvfp4 cut 5s off my prefill/first token. I'm on a 5080, getting 60 tk/s output through qwen code, and it's running my Discord swarm. Somehow I'm still able to fire off 30s generations in ComfyUI/zimage with the model loaded in my swarm. I'm kind of blown away. I'm using the gguf because I'm having fun with the uncensored version, but the MXFP4 I'm using is fast.
Best local model I've experienced yet.
lolwutdo@reddit
Are you fully offloaded on gpu? What quant?
neur0__@reddit
Running it with a 5090 and I get very good performance as well, can’t wait for the future
dilberx@reddit
Imo rtxs are a bad decision being energy hungry
Glittering-Call8746@reddit
So what's the alt ..
dilberx@reddit
Mac or strix halo system
Free-Combination-773@reddit
Both have slow prompt processing, though.
max123246@reddit
What quant are you using? Isn't nvfp4 slower than fp4 anyways? It's just better accuracy
Worldly-Plastic-2516@reddit
What are you running with on the 5090? Love to give it a spin
cosmicnag@reddit
Well actually I have a multi-GPU setup: a 5090 plus my older gaming 4090 (which I didn't sell cause AI happened lol). I use the best unsloth quant, Q8 XL, with the max context window in Q8 (llama.cpp, CachyOS). It's amazingly fast and really usable. I mostly use it with the Pi coding agent, and it's just freakin awesome.
Specter_Origin@reddit
It still struggles on complex issues and ends up looping for me. Definitely not as good as Claude Opus or Sonnet, but the best local model for sure. Pretty close to Minimax 2.7.
StardockEngineer@reddit
I’ve been using it for 6 hrs straight without a loop.
CircularSeasoning@reddit
You're the loop now. Human in the loop, haha!
cosmicnag@reddit
Lol
Varmez@reddit
What kind of settings are you using? What harness, etc.? I'm using 8-bit in oMLX with their recommended 0.6-temperature settings and am getting constant looping in Claude Code, qwen-code, and OpenCode.
epicycle@reddit
Mine seems to get stuck using oMLX as well. Did you try the settings u/StardockEngineer shared? Was gonna try tomorrow as I wanted to give 3.6 a go. Tried other things and failed. I hope it's not an oMLX bug.
Varmez@reddit
I was using the "Thinking mode for precise coding tasks" settings and getting the looping a lot. Trying the general-tasks ones now and it doesn't seem to happen. I also tried them with the temperature lowered to 0.8 and the looping started again right away.
StardockEngineer@reddit
llama.cpp and pi coding agent. Using unsloth's recommended settings per https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF:
dreamai87@reddit
Enable preserve-thinking mode. It makes the model way better.
Specter_Origin@reddit
Thanks, I already did and it sure helps, but it still gets into loops, especially when using it for agentic coding where there are chained todo items for it to work through!
Surprisingly, Gemma 4 doesn't have that issue, but it's definitely not as good at problem solving as 3.6.
dreamai87@reddit
Okay don’t know why the case, though I use this model without even with reasoning and for me I never felt any looping issue till 123k context size that I tested. You can see some examples I posted in my previous post. Building an app from research paper that holds a lot of information but still this model pulled off better than many
Weary_Long3409@reddit
Yes. It's on par with minimax 2.7 with better problem solving on agentic coding.
StardockEngineer@reddit
I just got a loop lol
H_DANILO@reddit
It used to loop at temp 0.6; I'm using temp 1.0 and it stopped looping altogether.
There's no such thing as a "complex" issue. In computer science, complexity refers to the asymptotic computational cost of functions.
Complicated is something else: something complicated hasn't been well defined and hasn't yet been broken down into smaller, manageable problems.
segmond@reddit
99% of the people in this world are not solving complex issues.
Specter_Origin@reddit
I agree! It is the best local model for chat etc.
dasNavy@reddit
Guys, what do you want to accomplish? I keep hearing "it's better, it's better, it's better", but what is actually your daily driver and the benefit? I'm tired, boss. Just let me live under the clouds on this planet and not care about AI and stuff.
Acceptable-Opinion56@reddit
M5 max vs 2 5090
florinandrei@reddit
lol
Maybe for very simple things. Give it more complex agentic tasks and you will see the difference.
That being said, for a 35b it's pretty good.
hyperdikmcdallas@reddit
Yeah, but technically, wouldn’t it figure it out eventually?
florinandrei@reddit
If that were true, then put any model in a Ralph loop and ask it to solve string theory.
Looping over partial solutions does yield some improvements. But if the problem is sufficiently hard, any model will get stuck, eventually.
hyperdikmcdallas@reddit
Or have it try to solve the three-body problem. I wonder if in 30 years they can use quantum AI to solve it.
Maleficent-Pea-3494@reddit
My MBP M5 Max 128GB is scheduled to deliver on Friday and I can't wait! I built a UI and am running parallel OCR (Chandra 2) and inference testing (Qwen 3.6) in Claude, hoping to run it all locally as soon as I get the machine. If it's as good as you're saying, I'll hop in the queue for an M5 Studio asap.
H_DANILO@reddit
You can easily go to 256k; context is VERY CHEAP on Qwen, and this model is REALLY good with context.
snowglowshow@reddit
Can you explain why it's any better than any other model at context? I've made a very strict habit out of starting new windows as often as possible, often 16k or less. If Qwen is special in this way I would love to know!
CircularSeasoning@reddit
For me, Qwen models have always respected my context in a certain way that's noticeable compared to other models. It's like it retains the inherent shape or quirkiness of my code better. Sometimes this is actually a pain when I want it to use more initiative in changing things up all at once to experiment with variations, but this is the double-edged sword.
wen_mars@reddit
Something about a "hybrid attention" architecture with "gated delta network attention" whatever that means.
H_DANILO@reddit
You can just try it for yourself. Myself, I'd been strict about 64k, same as you, always starting a new session; then recently I got comfortable at 128k.
This is the first time I went to 256k and I'm happy about it.
I've seen someone else posting here on Reddit that they ran a needle-in-a-haystack test on this qwen 3.6 35b a3b and it got a very high score. I can't remember exactly, but that prompted me to test 256k, and I'm very, very surprised.
Ofc it won't give the same attention to very old context as to new context, but tbh this is desired; as long as it can remember when I ask it to, I'm happy.
This actually gave me hope that one day we'll have 1M, perhaps who knows 2M context, where models can actually read the whole codebase and be much more succinct, rather than having to read chunks of the codebase and losing cohesion.
Snoo_48368@reddit
Thank you for this comment! I've been using qwen3.6-35B-A3B-Q5_K_M with a 128k context split across two slots on a 4080 Super (16GB VRAM) + 7950X3D (16 cores) + 96GB DDR5. I was getting ~45 tokens/sec and great results with OpenCode, but hitting compaction quickly, which caused tons of pain.
Just upped the overall context to 256k, so each parallel run gets 128k, and it's keeping the same speed!
I do notice some issues from time to time (loops, hallucinations), but it's blowing me away. I pointed it at an insanely complex codebase (think a custom flavor of bison parser definitions) and it not only managed to understand it, but worked quite well!!!
H_DANILO@reddit
Glad it helped. I stopped having loop problems when I set the temperature to 1.0; try that.
Artistic_Okra7288@reddit
Yea, I push mine to 1M context with a Q4_0 KV cache (probably could get away with Q8_0, but it's working well enough).
Octopotree@reddit
Does that work? The model has a context window of 256k
Artistic_Okra7288@reddit
You can scale it using the YaRN rope settings. llama.cpp has a bug where you have to pass a specific kv override as well, but yea, I've gotten it working.
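A hedged sketch of what that looks like on the llama-server command line; the scaling factors are assumptions, and the specific `--override-kv` workaround mentioned is deliberately not reproduced here:

```shell
# Hypothetical 1M-token context via YaRN scaling (4x over the native 256k),
# with a Q4_0 KV cache to keep memory in check. All values illustrative.
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --ctx-size 1048576 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 262144 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0
```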
bwjxjelsbd@reddit
Does it support compaction?
H_DANILO@reddit
It does
Medical_Lengthiness6@reddit (OP)
I just bumped it up and it seems like you're right
youngishgeezer@reddit
I'm running the Q8_0 version on my M5 Max 128GB MBP. It's amazingly fast and seems to do a good job with coding, though not quite at the level of what I'm getting from GPT5.4 in Codex. However, I just gave it 4 handwritten (in cursive) recipes snapped with my phone, and it extracted all 4 in under 20 seconds with basically perfect accuracy. I'm very impressed.
epicycle@reddit
What harness and settings are you using? I have the same machine and can't get a stable Qwen 3.6.
youngishgeezer@reddit
I'm running LM Studio with Qwen3.6-35B-A3B-Q8_0.gguf with the stock parameters except maxing out the context. The coding is done through codex with the same settings. I ran the image processing experiment using LM Studio's chat. I'm new to all this so I'm making no claims this is optimal.
What are you running?
epicycle@reddit
I run oMLX on an M5 Mac 128GB and use MLX models optimized for Apple silicon. I'm gonna give LM Studio a try, I think. I had tried them all and settled on oMLX.
spawncampinitiated@reddit
"as good as Claude"
I mean, yeah right.
You and I fucking wish
tarruda@reddit
It really depends on your use case. 3.6 is very good at agentic coding, can explore and understand codebases, edit files, etc.
Yea, it's not going to match frontier models at understanding very complex things, but I think most users don't have that need.
I've been using 3.6 since it was released and it became my favorite local model for agentic coding. Even though I can run larger models, the fact that 3.6 is good enough for most things + being super fast makes me prefer it.
If 3.6 122b follows the same improvement curve, there's a chance it could match sonnet for everything.
power97992@reddit
Dude, even 3.6 Plus probably can't do what Sonnet does. From my experience, 3.6 Plus is even worse than Gemini 3 Flash: it misunderstands directions sometimes and presumes stuff, and its output is way worse, so you have to correct it and reprompt it.
Loud_Key_3865@reddit
48 tps effective with a single 12GB 5070 Ti (15K context), running Qwen3.6-35B-A3B IQ2_M via llama.cpp:
model = Qwen3.6-35B-A3B-UD-IQ2_M.gguf
ctx = 15360
parallel = 2
n-gpu-layers = 99
fit = on
cache-type-k = q4_0
cache-type-v = q4_0
threads = 8
batch-size = 1024
ubatch-size = 256
reasoning = off
I'm having other LLMs, like Codex, Gemini & Claude, send tasks to a router. The router checks this model's history to see whether similar tasks have failed; if so, it routes back to the calling model for a high-scoring answer, and if the tests pass, it sends the result back to the local model so it learns the fix for that prompt. Then, when a similar previously-failing prompt comes in, the learned context is tried again and scored: low means send it back to a paid model in the future, high means the local LLM can handle it.
Then I have the paid models review and test.
Keeping things small (plugins, for example) helps reduce context needs and seems to work well with the small 15K context for small tasks.
I also have Opus/Codex/Gemini: one of them creates a plan, they all review it for consensus, and I take the best of all the results. Then one of them breaks everything into small tasks, with complete directions, so the tasks are clear, easy, and can be done in parallel by smaller, dumber models with the same quality.
Once that's done, I tell the model to implement the plan and hand tasks off to the local model for learning and load reduction.
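The score-based routing described above can be sketched roughly like this; every name, the similarity key, and the threshold are invented for illustration, not taken from the actual setup:

```python
# Illustrative sketch of a failure-aware task router between a local LLM
# and paid models. Names and thresholds are made up, not the real system.
from dataclasses import dataclass, field


@dataclass
class TaskRouter:
    threshold: float = 0.8                      # min score to trust local
    scores: dict = field(default_factory=dict)  # prompt signature -> score

    def signature(self, prompt: str) -> str:
        # Crude "similar task" key: the first few normalized tokens.
        return " ".join(prompt.lower().split()[:8])

    def route(self, prompt: str) -> str:
        # Send to the local model only if similar prompts scored well before.
        sig = self.signature(prompt)
        return "local" if self.scores.get(sig, 0.0) >= self.threshold else "paid"

    def record(self, prompt: str, score: float) -> None:
        # After tests run, remember the score so future similar prompts
        # can be handled (or avoided) locally.
        self.scores[self.signature(prompt)] = score


router = TaskRouter()
print(router.route("add a unit test for the parser"))  # no history -> paid
router.record("add a unit test for the parser", 0.95)
print(router.route("add a unit test for the parser"))  # high score -> local
```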
The 15K context, temperature, and all the other settings were arrived at through trial and error to get the highest-quality results at a minimum of 50-60 tps on my machine. Strangely, I discovered that higher context settings can actually degrade performance in some situations. This gave me the optimal balance: over 55 tps, and it even scored 100% on my home-grown coding tests, which I built to check the relevancy of my workflow.
### Benchmark Results (Coding Suite - 31 Tasks)
| Model / Config | Quality | Effective TPS | Total Time | Notes |
|----------------------------|--------:|--------------:|-----------:|--------------------------------------------|
| QWEN36-35B-IQ2M | 85.48 | 51.04 | 82.18s | Best overall balance in the leaderboard |
| QWEN36-35B-IQ2M-R2 | 83.87 | 49.50 | 83.37s | Slightly lower quality, still very fast |
| QWEN36-35B-IQ2M-NC13 | 85.48 | 46.64 | 100.20s | Same quality as top, but slower |
| QWEN36-35B-IQ2M-CTX24K | 82.26 | 44.45 | 105.48s | Older high-context run |
| QWEN36-35B-IQ3S | 82.26 | 38.96 | 110.74s | Lower quality and slower than IQ2M |
| QWEN36-35B-Q2KXL-CTX24K | 85.48 | 36.90 | 104.97s | Good quality, but significantly slower |
New_Slice_1580@reddit
Is this for a business or your personal stuff?
Loud_Key_3865@reddit
Laptop for business
New_Slice_1580@reddit
How up to date are Opus and the other commercial models?
As qwen3.6-35b-a3b says, its training data only goes up to 2025.
sleepy_quant@reddit
Same here. Running Qwen3.6-35B-A3B-8bit on M1 Max 64GB via MLX, and the speed + context handling is genuinely impressive. Love that we can finally keep our codebases local without sacrificing quality
New_Slice_1580@reddit
Is the MLX version bf16? If so, try gguf or MLX fp16 and see if it's faster (M1 and M2 don't handle bf16 natively).
sleepy_quant@reddit
Ran the benchmarks — you were spot on. All on M1 Max, 64GB, Q8 quant, same 65-token prompt + 200 token greedy generation:
| Path | tok/s | vs bf16 |
|------------------------|-------|---------|
| MLX Q8 (bf16 default) | 21.18 | 1.00x |
| MLX Q8 forced fp16 | 26.22 | +24% |
| GGUF Q8_0 (llama.cpp) | 34.08 | +61% |
Dtype probe confirmed mlx_lm was loading the non-quantized params (scales, norms, embeddings) as bf16 — 1245/1757 params were bf16 on M1 Max, i.e. the emulated path. Casting bf16 → fp16 after `load()` is a ~10-line patch. Already shipped as the new default on my side. llama.cpp on Metal wins another 30% though: genuinely better-tuned MoE kernels, and honestly a more mature stack than mlx_lm right now. Not switching immediately (priority queue + per-agent thinking-mode + memory hygiene layer are all MLX-coupled, and a rewrite is 1-2 weeks I'd rather spend on the actual product), but noted for the next major refactor. Thanks again — 24% from a 10-line patch is the best ROI I've had in a while.
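The cast-after-load patch described above is essentially a tree-map over the parameter tree. A generic, library-free sketch of the logic follows; dtypes are string stand-ins so it runs without mlx, and the only mlx APIs assumed are that in mlx_lm you would tree-map `model.parameters()` with `x.astype(mx.float16)` where `x.dtype == mx.bfloat16`, then call `model.update(...)`:

```python
# Generic sketch of the bf16 -> fp16 cast-after-load patch described above.
# Leaves are (dtype, shape) stand-ins for real arrays so the logic is
# runnable anywhere; quantized params (non-bf16) are left untouched.
def cast_bf16_to_fp16(tree):
    """Recursively cast every bf16 leaf in a nested parameter tree to fp16."""
    if isinstance(tree, dict):
        return {k: cast_bf16_to_fp16(v) for k, v in tree.items()}
    if isinstance(tree, list):
        return [cast_bf16_to_fp16(v) for v in tree]
    dtype, shape = tree
    return ("fp16", shape) if dtype == "bf16" else (dtype, shape)


params = {
    "embed_tokens": ("bf16", (151936, 2048)),  # non-quantized: gets cast
    "layers": [{
        "norm": ("bf16", (2048,)),             # non-quantized: gets cast
        "mlp": {"w1": ("q8", (2048, 5632))},   # quantized: untouched
    }],
}
print(cast_bf16_to_fp16(params)["embed_tokens"][0])  # -> fp16
```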
New_Slice_1580@reddit
Glad I could help 👍
sleepy_quant@reddit
Good catch, I hadn't checked. Running `Qwen3.6-35B-A3B-8bit` on M1 Max, so if MLX is defaulting to bf16, that's the emulated path. Will benchmark current vs fp16-forced MLX vs a GGUF build and reply with numbers. Thanks
ContextDNA@reddit
How many t/s? Is it worthwhile?? Thanks
sleepy_quant@reddit
~10 tok/s decode at Q8 on M1 Max 64GB, around 44-47GB total unified RAM. Q4 was 49-60 tok/s, but I swapped back to Q8 because the slowdown brought sharper outputs (fabrication detection). Worth it if the stack has graders in the loop; for simple use cases Q4 is the better trade.
KarenBoof@reddit
How big is your context?
sleepy_quant@reddit
64K in practice. Native goes up to 256K per the config, but the cache on a 64GB Mac makes anything past ~128K uncomfortable. Agent prompts are typically under 16K, so 64K has plenty of headroom.
AraAraKunn@reddit
How do you give it access to the local project? Cursor, for example, changes many files at a time; is there a way to do that using a local LLM? As of now, Qwen can't even look into the contents of a file, and we have to manually copy-paste the file contents ourselves.
New_Slice_1580@reddit
Are you using open code?
sturmen@reddit
I'm using LM Studio as well and it's unbearably slow due to prompt processing taking forever. Is there some sort of KV caching in LM Studio that I need to enable?
New_Slice_1580@reddit
Have you changed the context setting to the full 256k?
jomaha23@reddit
Have you tried running qwen3-coder-next? How does this one compare?
Medical_Lengthiness6@reddit (OP)
Ya, I was using that before this. I found that it was just not smart enough, hallucinating wrong answers. It was OK as far as calling tools and stuff.
jacobpederson@reddit
Really though? Running the 8-bit quant in OpenCode, and it can't even get the cube prompt running at all. A prompt Gemma-4-26b-a4b one-shot in 1 minute btw :D
AlwaysLateToThaParty@reddit
Qwen3.5 122B-A10 heretic mxfp4_MOE (75GB VRAM) one-shot that on my RTX 6000 Pro.
llitz@reddit
Can you share the settings you are using for the 122b?
AlwaysLateToThaParty@reddit
-m G:\ai\llama.cpp\models\120B\Qwen3.5-122B-A10B-heretic.mxfp4_moe-00001-of-00002.gguf --mmproj G:\ai\llama.cpp\models\media\mmproj-F32.gguf --jinja -ngl 200 --ctx-size 262144 --host 0.0.0.0 --port 13210 --no-warmup --temp 0.4 --top-k 20 --top-p 0.95 --min-p 0.00 --repeat_penalty 1.05 --presence_penalty 0.0 --cache-type-k q8_0 --cache-type-v q8_0 --image-min-tokens 1024 --image-max-tokens 4096
llitz@reddit
Ty, will give a try
jacobpederson@reddit
OP is talking about a qwen3.6-35b-a3b with 8 bit quant - which doesn't even get to a working version after an hour of debugging in opencode :D Gemma-4-26b-a4b seems way smarter - but can't do the tool calling in opencode . . . we always seem to be on the verge of something great in OS models, but never quite getting there. RTX 6000 Pro not realistic for local hardware - hell my 5090, 4090, 3090 setup isn't realistic either :D We need a tool harness capable model that can run on single 5090 class hardware.
GMP10152015@reddit
How does it compare to Gemma 4? Recently, I’ve been using Gemma4, and it was very good, especially for tool calling and code (using at least a 128K context window).
tarruda@reddit
3.6 feels much stronger than gemma 4
GMP10152015@reddit
Could you provide some examples? Thx
rm-rf-rm@reddit
As good as claude? Claude Haiku? Sonnet 4.5? Sonnet 3.7? Opus 4.6?
Claude is not a model.
Immediate-Rooster926@reddit
It’s as good as haiku 1
Agile-Orderer@reddit
AMAZE 👏
I’ve been waiting for 3.6, cause as good as 3.5 is, I still felt the need for Claude often enough to keep me from daily driving Qwen. This should reduce my dependence, cost and usage at least 🙌or maybe even get me fully local and secure at best🤞
Thanks for sharing👌
Krillian58@reddit
I don't know, I just switched from Opus to Qwen 3.6 Plus and it's substantially worse at everything I was doing. Maybe it's because it's picking up Opus's loose ends; would be nice to know.
Quartich@reddit
As someone without GPT or Claude subscriptions, 3.6 35b A3B Q6KL outperformed the free versions for my complex test program prompts.
Tested 2 different prompts based on programs I had made. Free GPT couldn't do it right even with some handholding, and Claude just ran out of tokens. Qwen took 30-45 minutes and got it right on my first prompt (with OpenCode). I'm sure the paid tiers are much better, especially if I ran them with OpenCode.
No-Name-Person111@reddit
Qwen 3.6 is somewhere below sonnet in my experience thus far.
That's not a bad thing. It's impressive. But in head to head contests for day to day networking and system infrastructure work, it's a B+ compared to Opus at A+ and Sonnet at A-.
You just can't compare something like Opus against a local model. The scales are different.
Late_Film_1901@reddit
Yep it's just wild to compare a tiny local model with what I was paying openai for just a year and a half ago.
Krillian58@reddit
I use Gemma 4 / ministral 3 / qwen 3.5 for any sustained subagent tasks like scraping with tandem and to summarize every conversation at 10k token marks for reinjection on compression. Works swimmingly. Definitely took some opus turns to figure out how to effectively prompt them though. I havent tried to recreate my 2024/25 chatgpt experience yet.
Krillian58@reddit
I think I would agree with that.
I have tried Xiaomi's model and was able to do most of what I was doing with Opus, and at the time Qwen 3.6 Plus felt like 3rd, but now that 4.7 Opus is out I think it was just because they had watered down Opus at the time. Starting on Opus spoiled me.
n4pst3r3r@reddit
Apples to Oranges, kind of. Opus is in another league, probably much larger and super expensive. If anything, Qwen Plus should be compared to Sonnet.
Quartich@reddit
I have a 3090 + 3060ti + 64GB RAM and I am thoroughly impressed. With Open Code I am "one shotting" working programs that the free tiers of ChatGPT and Claude can't write; ChatGPT couldn't even with handholding.
sob727@reddit
Yeah. About Qwen.
caetydid@reddit
does it matter for coding?
sob727@reddit
I'm not sure. But I trust the model less, that's certain.
Savantskie1@reddit
This is why I use uncensored versions of 3.6
Limp_Classroom_2645@reddit
I'm running it with 250k context on an RTX 3090, and for my use cases I have the same experience as you; at some point I just forgot I had switched OpenCode to a local model while I was working the other day.
GravitasIsOverrated@reddit
What’s the advantage of manually compiling llama.cpp?
Limp_Classroom_2645@reddit
Being able to pull the latest changes as soon as they're released, and probably better performance.
llitz@reddit
That, plus compiling the turboquant version with support for CUDA and ROCm/Vulkan in a single binary.
TheOnlyBen2@reddit
Interesting, thanks for sharing.
How many t/s do you reach on a 3090? Also, how do you reach 250k context?
Limp_Classroom_2645@reddit
Between 60-70
bobspadger@reddit
Can you provide your config for this? I’ve got a 3090 as well and have just been using LmStudio for speed of testing, but would prefer to use it headless on my Ubuntu server
Covert-Agenda@reddit
I’ve got the same machine and been running the same model via mlx - this is the first time I’m actually impressed with local AI.
tken3@reddit
I’ve considered getting into this. How well or poorly would this run on an RX 6800 with 16GB VRAM?
TopChard1274@reddit
FeelsGoodman
Saul Goodman
Artistic_Okra7288@reddit
I've been running Unsloth's Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf with Q4_0 KV cache at 1M context on my MacBook Pro M5 Max 128GB. It's been working OK but it's definitely no Claude Sonnet 4.6.
Still-Wafer1384@reddit
Why would you run the model at q8, but your cache at q4?
(Not that it's gonna bring you up to Sonnet 4.6, but anyways)
Artistic_Okra7288@reddit
Well because I started with what I could run at ~200k context with minimal quant. So I started with the Q8_K_XL with unquantized kv cache at 256k and decided to push it as high as I could, ended up at 1M with Q4_0 and haven't wanted to change it yet. I'll probably revert back to probably 512k because 1M is just too large for my needs, just wanted to see if I could do it.
Still-Wafer1384@reddit
Understand, thanks. Have you tested running the model at q6? Supposedly there shouldn't be much difference with q8.
Artistic_Okra7288@reddit
From the graph that was posted the other day, it looks like the Q6 is probably the sweet spot for size vs perplexity. I have the RAM so figured I'd go higher. Probably not going to be playing much more with Qwen3.6 until other models come out. It just still isn't very good at agentic coding for my needs. Probably will try to get MiniMax M2.7 working... struggling getting to 200k context with that one... might go back to Qwen3-Coder-Next... actually just remembered Devstral-2-123b, I wanted to try that when it first came out. I think MacBooks unified RAM, it's basically geared for MoE models so it will probably struggle with dense models like Devstral. Hopefully Qwen3.6 larger models are decent! The multimodal on Qwen3.6-35B-A3B has been pretty nice.
Still-Wafer1384@reddit
Did you find qwen3 coder next better?
Artistic_Okra7288@reddit
For some specific things it does equally well or better. Some things not so well. I’m doing custom ansible roles and automation using Proxmox and kubernetes and Qwen3.6 is struggling with things outside its training set. Even when I give it full access to source and documentation. Qwen3-Coder-Next does an adequate job but still not as good as Claude Sonnet 4.5
an0maly33@reddit
Been rubbing it here for a couple of days too and it's better than any other local model I've tried. Gem4 was decent, especially after the template fixes and llama updates. But qwen3.6-35-a3b has been a rockstar with pi so far. It's not perfect but when I do hit an error, I can paste the output and it figures it out on the first try. The self linting catches almost any problems before it turns things back over to me anyway.
With gem4 I was getting the "I'll do this thing." Then it would just sit there so I'd have to nudge it with "ok... so do it then." Also inevitably hit thought loops. So far on qwen, I've had no instances of needing nudged to continue and had only one thought loop. Cancelled generation and had it try again and it got through it the second time.
I wonder if people having trouble with it aren't getting the parameters set correctly when loading the model. There's a page on unsloth's site that tells you the settings both for conversation and coding/agentic work.
Medical_Lengthiness6@reddit (OP)
I was seeing some weird stuff with Gemma 4 as well. Like it would say " I need to see x and y files to do that." Despite them being available in the directory. Like it wasn't really trained to use tools very well. Like bro you can grep them..
an0maly33@reddit
My biggest problem was gem would edit a file but it seemed like it was making assumptions on what was in it. It would constantly fail because the pattern it was looking for didn't match. It worked better when I told it to always read the file before editing but it would forget to do that a lot. So instead of changing 1 line, it would rewrite the whole file.
Techngro@reddit
"Been rubbing it here for a couple of days..."
Ouch. Take a day off, bro.
an0maly33@reddit
lol. Gotta love mobile keyboards/correction.
goldin_pepe@reddit
Cope
OkBase5453@reddit
Me too, same model on a 3090 Ti with spillover on my 512GB server... but with Qwen-Code... I don't know if it's "as good as", but I don't complain, since there are no significant errors or hallucinations... then again, I don't vibe code with it, just scripting and debugging...
evilbarron2@reddit
This is my setup - been running qwen3.5 since its release and been happy, planning on moving to 3.6 once it settles down. You’re getting good results from the 35b moe?
OkBase5453@reddit
So far very impressive. I am starting to learn how to use the qwen code for my use cases, but it is super fast at agentic work.... also some benchmarks: RTX 3090 Ti + 512GB + 2x E5696v4:
root@llama-cpp:\~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/nvme-llm/models/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf -ngl 14,18,999 --n-cpu-moe 999 -t 78 -ser 1,8 -p 1024 -n 128 --mmap 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | threads | ser | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---------: | ---: | ------------: | ---------------: |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 14 | 78 | 1,8 | 0 | pp1024 | 420.66 ± 28.34 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 14 | 78 | 1,8 | 0 | tg128 | 21.34 ± 0.04 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 18 | 78 | 1,8 | 0 | pp1024 | 442.93 ± 10.75 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 18 | 78 | 1,8 | 0 | tg128 | 22.25 ± 0.04 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 999 | 78 | 1,8 | 0 | pp1024 | 534.87 ± 13.81 |
| qwen35moe 35B.A3B Q8_0 | 35.80 GiB | 34.66 B | CUDA | 999 | 78 | 1,8 | 0 | tg128 | 42.67 ± 0.26 |
danihend@reddit
I am running Q2_K_XL on RTX3080 10GB+64 GB RAM and it's amazing. 30 tps, max context window. It is not as good as frontier models for sure, but it is seriously capable of helping out. Not too long ago we were dreaming of running something like Deepseek R1 locally at 2 tps and this is better than it for coding and we can run it on a regular computer. Pace of improvement is mind blowing.
HockeyDadNinja@reddit
I'm also using the 8 bit quant. I have a rtx 5060 and 4060 with a total of 32G vram, 64G system ram.
I used opencode today to start a project and I'm so impressed. 27 t/s isn't blazing fast but I wasn't annoyed with the wait. I have some upgrades planned too.
caetydid@reddit
is it really worth to use q8 instead of having e.g. q6 and doubled speed?
HockeyDadNinja@reddit
This would be my attempt at max quality. I also intend to run Q4 for speed but I may test Q6 as well.
fromage9747@reddit
What settings are you running?
HockeyDadNinja@reddit
I'm running llama-server like this in order to switch models, etc.
llama-server --host 0.0.0.0 --models-preset ./models.ini
And in there I have:
; Qwen3.6-35B-A3B (MoE: 35B total, ~3B active)
; Q8: 35.8GB model, MoE expert offload to CPU RAM, target ~96K ctx
; --fit auto-picks n-cpu-moe per device (handles dual-GPU split that fixed N can't)
; fit-target 512 MiB headroom per device; KV at q8_0 halves footprint
[Qwen3.6-35B-A3B-Q8]
model = /vol2/LLM/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf
c = 98304
fit = on
fit-ctx = 98304
fit-target = 512
no-mmap = true
mlock = true
; Put faster 5060 Ti (CUDA1) first so it holds layers 0-15;
; layers execute sequentially, so the faster card starts every token.
device = CUDA1,CUDA0
cache-type-k = q8_0
cache-type-v = q8_0
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.00
fromage9747@reddit
Thanks mate!
NNN_Throwaway2@reddit
Yes, it seems based on reports to be about as good as Qwen3.5 27B, which was already competitive with Claude 4.5 models for a lot of stuff. The 3.6 versions of 27B and 122B will be crazy if they see a similar jump in performance. My expectation is that the 122B will be a powerhouse, as all the MoE from 3.5 felt a little undercooked compared to the 27B dense. The 35B being as good as it is now seems to be bearing out that hypothesis.
H_DANILO@reddit
No way, it's way better than Qwen3.5 27B.
I was running Qwen 397B locally and this is on the same level but faster.
john0201@reddit
What hardware and tps for 397?
H_DANILO@reddit
rtx 5090 + 128gb ram ryzen 9900x3d.
TG: 20 TPS
PP: 1200 TPS
Usable at Q2
FullstackSensei@reddit
You're running Q2 397B and comparing to 27B?!!! Of course Q2 won't be much better
H_DANILO@reddit
Is there a rule saying Q2 can't be compared?
Up to this point, Qwen 3.5 397B Q2 would beat anything local, including Gemma 4.
It doesn't beat a model that is 1/4 of its size now(Qwen 3.6 35b)
NNN_Throwaway2@reddit
The 397B basically suffers zero quality loss down to Q2. Settle down.
FullstackSensei@reddit
Sorry but no. I run Q4_K_XL and even that is not the same as Q8, which I also tried.
Maybe if you're doing trivial stuff, but for anything remotely complicated the difference is there.
NNN_Throwaway2@reddit
Sorry but yes.
FullstackSensei@reddit
You're changing the goal posts. Your claim was that Q2 has zero quality loss.
NNN_Throwaway2@reddit
Its a very small amount however you want to slice it.
NNN_Throwaway2@reddit
Well there you go, then.
outthemirror@reddit
8 bit on my 3090 and epyc rig can confirm it is good
_int10h@reddit
Did anyone try to get their hands on a GB200 system? People sometimes sell these on eBay.
ContextDNA@reddit
I have a couple of 2011 MacBook Pros with 64GB RAM and a 2025 M5 with 32GB RAM: any chance these could do anything at all anywhere even close to as amazingly cool as what you all are doing?
Should I even attempt to run qwen3.6-35b-a3b?
Every-Comment5473@reddit
Anybody running this with vllm with 8bit on RTX Pro 6000? If yes it will be very helpful to share the command for it.
AlwaysLateToThaParty@reddit
I've used the full qwen3.6 35b/a3 quantisation, but so far the 122b/10 heretic mxfp4_MOE model is better. For me at least. Gave them the same prompts too.
xignaceh@reddit
Perhaps to add if I may, anyone got a vllm command for Gemma 4 31b? I got it running but the context is eating vram like crazy.
bandman614@reddit
Been a while since I looked at local models - what's the memory footprint for an 8 bit quant 35b model?
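Rough answer to the footprint question: quantized weights cost roughly (bits per weight)/8 bytes per parameter, plus a little for quantization scales, so a 35B model at Q8 lands in the mid-30s of GiB before KV cache and runtime overhead. A sketch (the effective-bits figures are common ballpark values for GGUF quants, not exact for any specific file):

```python
def weight_gib(n_params_b, bits_per_weight):
    # bits_per_weight: effective bits including quant scales
    # (Q8_0 is ~8.5, Q4_K_M is ~4.8 in typical GGUF files)
    return n_params_b * 1e9 * bits_per_weight / 8 / (1024**3)

print(f"Q8_0  : {weight_gib(34.66, 8.5):.1f} GiB")
print(f"Q4_K_M: {weight_gib(34.66, 4.8):.1f} GiB")
```

That puts Q8 around 34 GiB, which lines up with the 35.80 GiB file size in the llama-bench output earlier in the thread (the _K_XL variants keep a few tensors at higher precision).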
sakuser@reddit
This is the exact set up I’m looking to get but with the release of Gemma and more efficient models it was making me question if I should go for 64gb or invest in the 128gb. Wished a 94gb version existed, would be the sweet spot lol do you have the 16inch for less throttle or the 14 for portability?
logic_prevails@reddit
I can assure you it is not as good as claude, but it is quite good
mister2d@reddit
Which Anthropic model are you referring to? Claude Code is just the harness, which you can use to run Qwen!
logic_prevails@reddit
Sonnet or opus 4.6
xrvz@reddit
Thank you for your useful answer!
Personally, I think Gemma E2B outperforms Cunthropic's Mythos.
herovals@reddit
you can’t even have access to mythos so why just make it up?
xrvz@reddit
Exactly, I can use Gemma, and can't use Mythos at all. So Gemma is infinitely better for me. 👍
GasolinePizza@reddit
You didn't say "better for you", you said "outperforms".
So unless you have some access that the rest of us don't, that's also known as "just making stuff up"
NgtfRz@reddit
M5 Pro and the same. To be honest, 8 bit quant seems overkill.
br_web@reddit
For conversational (q&a) type of chats is it better than Gemma 4 26B MoE?
rz2000@reddit
I think it’s better. I thought the 26B MoE was a moron compared to the Gemma 4, but the 31B is pretty slow on an M5 Ultra: 10 t/s @ bf16, 20 t/s @ Q8_K_XL. The same quants of this new Qwen are 50 t/s and 60 t/s, and it seems much smarter than the Gemma 4 MoE, which wasn't as fast either.
Top-Rub-4670@reddit
For casual role play? no, 26b is more pleasant and "moldable". For technical questions? Yes, but only with thinking enabled.
With thinking disabled it still fails the car wash question, for example.
Savantskie1@reddit
The model works great with a system prompt instead of zero prompt. Especially with thinking turned off.
scythe000@reddit
Which would be the best version for me to run on a 24 gig 3090?
autonomousdev_@reddit
Tried that exact setup last month. For my use case (code review automation), it hallucinated function signatures 20% of the time. Claude 3.5 Sonnet still wins on consistency for me, but the cost difference is massive.
bwjxjelsbd@reddit
It’s insane that a model this size can be as good as, or even close to, Claude with its trillions of parameters.
Rubixu@reddit
it's not great at c++ / reverse engineering
It produces code that doesn't even compile, and its reverse-engineering approach is completely dumb... Claude 4.5 is miles ahead.
milpster@reddit
You might want to use a q6 quant instead, i don't think there is anything to be gained between a q6 quant and a q8 one.
Potential-Leg-639@reddit
For more serious agentic coding tasks you need much more context and it‘s for sure not as good as Claude
Emotional-Insect1060@reddit
squeezing for computational efficiency and control here not really something quick on the fly. Perhaps a stronger foundation and understanding for weathering a storm and if need to temporarily generate something on the fly, then yeah there’s the higher context at a time models/setups. I can go buy and do things on the fly, but it’s painful blabla.. at least it feels as such.. not as confident in the price/performance and considerations. Not feeling bulletproof. Sure a bit of adventure is good yeah, but vulnerability. Live longer.
sn2006gy@reddit
wut
Emotional-Insect1060@reddit
okay, okay. Some things people can do on the fly and it’s solidly amazing
sn2006gy@reddit
that's easier to read than your prior paragraph
ExplorerPrudent4256@reddit
The qwen3.6 release really caught everyone off guard - suddenly that '3.5 was final' narrative looks a bit awkward. But hey, at least they're shipping instead of endlessly teasing.
sammcj@reddit
It is good, but it is nowhere near as good as Claude, not even Sonnet. I suspect for simple things it may be practically indistinguishable but it confidently misunderstands more complex problems. At the end of the day it's a very small 35B parameter model with only 3B active - it's amazingly good for that size, capable at tool calling and a huge leap from where we were a year ago but it's not as good as the much larger Sonnet / Opus models.
evilbarron2@reddit
Tbf I find sonnet confidently misunderstands complex problems, and happily lies to cover those misunderstandings, especially with longer tasks and conversations. Starting to suspect the bigger resources aren’t always an advantage.
bigsybiggins@reddit
How fast is the prompt processing when context fills up? Like at around the ctx limit are you waiting minutes?
Medical_Lengthiness6@reddit (OP)
I think I haven't gotten to that point yet. I tend to keep things pretty laser focused. If something is taking so long that it needs the whole context I just do it myself or break it apart into smaller tasks.
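For a rough feel of the wait being asked about: prompt processing time is roughly context tokens divided by pp rate. The rates below are assumptions pulled from numbers reported elsewhere in this thread (~450 t/s pp on a 3090-class card with CPU offload; Apple silicon is typically lower):

```python
def pp_wait_s(ctx_tokens, pp_tok_per_s):
    # time to prefill a full context, ignoring any KV cache reuse
    return ctx_tokens / pp_tok_per_s

for ctx in (32768, 65536, 131072):
    print(f"{ctx} tokens @ 450 t/s -> {pp_wait_s(ctx, 450):.0f}s")
```

So a cold prefill at 128K is already minutes; in practice harnesses reuse the KV cache across turns, so only new tokens get processed and the wait is usually far shorter than this worst case.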
Guilty_Spray_6035@reddit
I found gemma-4-31b-it better in complex tasks
Medical_Lengthiness6@reddit (OP)
I tried it but for some reason it was taking forever to process the initial prompt, like super slow, and then the output was not even as good as Qwen.
Tomr750@reddit
how does 8 bit compare to 4 bit?
fittyscan@reddit
Much better. Compared with the Qwen 3.5 models, the gap between Q4 and Q8 is much more significant, especially for tool calling and reasoning.
Certain-Cod-1404@reddit
64k context is very low for agentic coding no?
fittyscan@reddit
The harness makes quite a difference. I use Swival, which is built for small context windows, and in practice it really feels like a 16K context works just as well as a 256K one, even over long sessions without any manual compaction.
I get much better and more consistent results with it than I do with Claude Code and Opencode when running local models.
Even if you have enough RAM, local model performance degrades pretty quickly as the context grows. So even though I could push it further, I usually stick to 16K or 32K max to keep things reasonably fast
SnooPaintings8639@reddit
Very, very low for this model especially.
Medical_Lengthiness6@reddit (OP)
So far 64k has been ok today; it was at 32k, which wasn't enough. What do you use?
I don't do full agentic though, more research and medium targeted refactors.
StardockEngineer@reddit
Go full 256k. You have the RAM
Medical_Lengthiness6@reddit (OP)
I'll try it, thanks
Certain-Cod-1404@reddit
>I don't do full agentic though, more research and medium targeted refactors.
then yeah, 64k is probably good enough, especially since you get to use an 8 bit quant. I use 240k with q8 KV cache for agentic coding with OpenCode. It's really impressive what these Qwen 3.5/3.6 models can do, man what a time to be alive.
Medical_Lengthiness6@reddit (OP)
ya good times. I had tried the prior qwen coder on a MacBook air and even though it wasn't great beyond simple stuff, I could tell things would get a lot better, esp with more RAM. Glad it's proving true and my mbp purchase is paying off
Cherlokoms@reddit
Last sentence about sending the codebase to randos hit hard
MeganDryer@reddit
I used the same system on an H100 yesterday using Qwen Coder. It was _not_ as good as Claude, but it was more than good enough to do coding tasks. Absolutely amazing and the first system I've seen actually work as a local agent.
iamn0@reddit
I tested Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my system with 4x RTX 3090, using up to around a 200K context window. I can confirm that, for me right now, it’s a viable alternative to opus 4.7 in opencode (although it's worth noting that opus is currently nerfed).
Compared to larger models, you should be as precise as possible with your prompts. otherwise, qwen can get stuck in a thinking loop. For example, if you tell the model that a file exists when it actually doesn't, it may enter a thinking loop. On the other hand, it's often smart enough to catch mistakes in the prompt as well.
Longjumping_Virus_96@reddit
have you tried q4 ggufs of larger local models?
Medical_Lengthiness6@reddit (OP)
I havent
Hood-Boy@reddit
Is anyone running it in AMD Strix Halo and show me their llama.CPP command?
I'm still trying to understand the optimizations
yehyakar@reddit
ykarout/Qwen3.6-35B-A3B-NVFP4
CriticalCup6207@reddit
The M5 Max unified memory story is genuinely compelling for local inference. We run similar setups for internal tools: keeping model weights in memory alongside application data without PCIe bandwidth bottleneck changes the latency profile significantly. How's thermal behavior holding up on extended sessions? M-series chips have a tendency to throttle after 20-30 minutes of sustained inference load.
chankeypathak@reddit
Will it work on my gtx 1650 4gb, ryzen 5 3600, 32gb ram 3200mhz?
Alex_1729@reddit
But 64k... Can you actually do any serious work with this?
Also, Claude is dogshit rn
Heavy-Focus-1964@reddit
i must be doing something wrong. i just tried using this exact model with pi on omlx, and watched it get into a loop of panic about having to fix 200 lint errors of the same 7 types.
it would just say it was going to break them down by file, see there was 200 and start freaking out again. rinse and repeat. i was going to take a screenshot but i was too annoyed
cafedude@reddit
It's as good as what version of Claude? I could see it being as good as Claude from about a year ago, so Claude 4.0?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
valkiii@reddit
What kind of tasks are you running? Are you doing coding?
Medical_Lengthiness6@reddit (OP)
Ya just coding. I mostly send it off on research or refactor tasks in the background.
ranting80@reddit
It's not better than claude. It's extremely good for a local model especially at this weight and especially as an MoE. I've used M2.7 and honestly I'd say it's near par to that which is incredible for how small and fast it is.
nakedspirax@reddit
Running 8 bit quant with 250k context on strix halo with 128gb ram. Surely you can up the context.
Medical_Lengthiness6@reddit (OP)
I ended up upping it to max after someone else said the same as you and it's still running great
riceinmybelly@reddit
Am running it via LM Studio on a Mac Studio in Hermes (in Docker) and it keeps stopping once the context is 50% full, when it should compress. I even use my previous working model (qwen coder next) to compress the context now. That helps because it unloads the 3.6 model, but only for a while.
Anyone else running it with lm studio and hermes?
Medical_Lengthiness6@reddit (OP)
Fwiw you have to crank the context up in LM Studio on the model if you haven't already. also in opencode config have to set a limit
riceinmybelly@reddit
Yeah I’ve given it full context and hermes detects it as being 262k correctly. Maybe the rolling window is the problem or the jinja is having problems
Blues520@reddit
I tried the Q6 unsloth quant for a day and ended up going back to qwen3-coder-next.
Aroochacha@reddit
It’s not in my experience. It’s good, but for the big task, I still rely on MiniMax-M2.5 running locally. Even then, Claude is just on another level.
My experience with the Qwen models is that it takes engineering effort with prompts to get it to perform. For work on my workstation, I can give it a series of prompts to complete a task that also includes verification prompts after each task. Even then, sometimes I have to break down a step even more because the model produces gibberish or times out.
That said, I do like reading the positive feedback on Qwen3.6. I’m excited to put it to work on Monday.
totallynotmyfakename@reddit
It's good, but there's no way it's as good as Claude lol. Pretty far apart in my experience, even with free Claude.
DraconPern@reddit
I am using it with LM Studio and Continue; what am I missing by not using OpenCode?
Medical_Lengthiness6@reddit (OP)
I haven't kept up with continue. That's like cursor right? Opencode is more like Claude code or codex where it's just a terminal (although I think they have a cursor like option..)
RazsterOxzine@reddit
No, you're like spot on. I've been using it through LM Studio to review my older Javascript/CSS, which is all I need, and it is perfect. With Claude I would sit and go through a few changes to get it right. Qwen3.6 one-shot fixed and created some UI elements that I had a hard time explaining to Claude. I'm so sold! I need to save up and get a 5090.
onyxlabyrinth1979@reddit
Performance is great but the real win is control. Once you stop shipping your code and data out, a lot of hidden risk disappears. Same lesson on the data side, owning the pipeline and rights matters more than raw model quality if this ever becomes part of a product workflow.
Luke2642@reddit
What tok/s?
Medical_Lengthiness6@reddit (OP)
About 80
Savantskie1@reddit
It's probably because I'm using MI50'S But I get 50 t/s. Still good for me
picosec@reddit
I've been pretty impressed with qwen3.6-35b-a3b, it is a big improvement over 3.5. It can perhaps do some things as well as Claude, but there are almost certainly things Claude will do better on.
Medical_Lengthiness6@reddit (OP)
For sure. I usually have a daily driver and then only use codex or Claude for things where my daily can't manage, but that range has been decreasing
PlayfulLingonberry73@reddit
In my experience, if you are building simple websites or generating content, it's fine. But if you are building something complex, the difference is definitely noticeable. And the context length will also matter.
Potential-Leg-639@reddit
Way too much hype about it. It's still only a 3B-active MoE model, so I don't expect too much; for serious stuff it probably won't be able to compete with bigger or dense models.
dreamai87@reddit
That’s what I thought but no it’s pulling a lot above its weight.
MR_Weiner@reddit
Eagerly awaiting 3.6 27b. I hope!
Medical_Lengthiness6@reddit (OP)
That's what I said about the old qwen3 coder on a Macbook Air. This one seems to be capable of complex workflows, not sure what you're seeing. I throw complex shit at claude and codex and they are retarded so there's a limit that even the top models struggle with.
LocoMod@reddit
They are retarded because your prompting is retarded. Garbage in, garbage out. This is easy to prove, so put your pride where your mouth is. Show us snapshots of the same workflow running against Qwen3.6 and gpt-5.4/claude-sonnet. It's really that simple.
I won't hold my breath. Saving face is more important than seeking truth, isn't it?
antunes145@reddit
It's superior at coding compared to MiniMax 2.7 imo.
ComplexJellyfish8658@reddit
Have they supported the MLX backend yet? Hard for me to move off qwen3.5, since it supports MLX with some builds.
sskarz1016@reddit
Many quants from mlx-community on HF.
SomeOrdinaryKangaroo@reddit
Yeah, this model is wayyy better for coding than Claude opus 4.6 and opus 4.7
Easily best coding model available right now
ryfromoz@reddit
i would actually believe it now they nerfed opus. Actually no i dont 🤣🤣🤣
BP041@reddit
This is awesome! Running Qwen3.6-35B locally with that context on a MBP M5 Max sounds like a beast setup. I'm curious, what's been your experience with the 8-bit quant for specific tasks or models? At CanMarket, we've explored different quantization strategies for our Style Genome models, especially balancing performance and precision. OpenCode is a great choice too. Any specific workflows you've found particularly optimized with it?
themostsuperlative@reddit
What hardware are you running this on?
Medical_Lengthiness6@reddit (OP)
mbp m5 max 128 gb
rebelSun25@reddit
I'm not impressed so far. It didn't know to call getContents() on psr request body returned by guzzle... Which is shocking because it got a lot right. I guess that's the price for smaller total parameter models
CircularSeasoning@reddit
No one understands what you said because we don't read or write code anymore. Have an upvote!
ikkiho@reddit
the a3b architecture is what makes this viable on apple silicon - 3b active keeps tok/s close to a cloud-hosted dense 30b while you still get the full 35b knowledge breadth. 64k is light for true agentic chains but for typical multi-file edits it's plenty - context degradation past ~40k tends to bite harder than raw model size does.
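That active-parameter point is the whole story on unified memory: decode is roughly memory-bandwidth-bound, and a MoE only streams its active expert weights per token. A rough ceiling model (the bandwidth figure is a ballpark for Max-class Apple silicon, and real throughput always lands below this bound):

```python
def decode_ceiling_tok_s(active_params_b, bytes_per_weight, mem_bw_gb_s):
    # Upper bound: each generated token requires one full read of the
    # active weights from memory; attention/KV traffic is ignored.
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return mem_bw_gb_s * 1e9 / bytes_per_token

# ~3B active at Q8 (~1 byte/weight) on ~546 GB/s unified memory
print(f"MoE, 3B active: {decode_ceiling_tok_s(3, 1, 546):.0f} tok/s ceiling")
# A dense 35B at Q8 on the same bandwidth for comparison
print(f"Dense 35B     : {decode_ceiling_tok_s(35, 1, 546):.0f} tok/s ceiling")
```

The ~180 vs ~16 tok/s gap between the two shapes is why OP's reported ~80 tok/s at Q8 is plausible for an a3b model but would be impossible for a dense 35B on the same machine.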