Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM
Posted by boutell@reddit | LocalLLaMA | 44 comments
TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac.
So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my personal take on how well it actually works.
This is a snapshot in time. I'll keep posting revised versions as my setup improves.
HOW-TO
* We're going to use llama.cpp to run the model locally. But, these models are really new and bugs are constantly being fixed. So we need to build llama.cpp from source. This is easier than it sounds.
If you have never done it, install the macOS command line developer tools:
xcode-select --install
Now you can build llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
export PATH="$HOME/llama.cpp/build/bin:$PATH"
* Add that export line to .bashrc or .zshrc so you have access to it every time.
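For example, a minimal way to do that (assuming you cloned llama.cpp into your home directory as above; adjust the path otherwise):
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc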
* Download the model itself. I prefer to just download these directly:
* Create a models subdirectory within your home directory.
* Go to https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
* Click UD-IQ4_XS
* Click Download
* Move the downloaded file to models
* Go to https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/mmproj-BF16.gguf to download the matching vision adapter
* Click Download (it's there, look closer)
* Move that file into models too
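If you'd rather script the downloads than click around in the browser, something like this should work too (a sketch; the exact .gguf filename inside the repo is an assumption based on the commands used later in this post):
mkdir -p ~/models
curl -L -o ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
curl -L -o ~/models/mmproj-BF16.gguf https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/mmproj-BF16.gguf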
* CLOSE ALL YOUR APPS except Chrome and Terminal. Yes including vscode. Close as many browser tabs as you can. For long overnight sessions, close Chrome too. Understand that Chrome uses a lot of RAM and wasted RAM is the enemy. This model just... barely... fits.
* Test it:
llama-cli -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1
I'll explain why I used each of these options later.
This will launch a simple chat interface, running entirely on your own machine.
Your first query will take a long time! But as long as you don't leave it idle, later responses will start much faster. llama.cpp is designed to stand down and return resources to the system when you're not using it, so after a long idle stretch expect another slow warm-up.
* Now add aliases to your .bashrc or .zshrc so you can run either the chat interface or an OpenAI-compatible API server at any time:
alias qwen-server='llama-server -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 --host 127.0.0.1 --port 8899'
alias qwen-chat='llama-cli -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1'
* Run source ~/.bashrc or open a new terminal so we can start using these aliases now.
* Start qwen-server.
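Before pointing opencode at it, you can sanity-check the OpenAI-compatible endpoint with curl (a quick sketch; with a single loaded model, llama-server answers regardless of what you put in the model field):
curl http://127.0.0.1:8899/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Qwen3.6-35B-A3B-UD-IQ4_XS", "messages": [{"role": "user", "content": "Say hello in one word."}]}'
If you get a JSON completion back, the server side is working.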
* In a new terminal window, install opencode. The quickest way to get the latest release is:
curl -fsSL https://opencode.ai/install | bash
Again, things are changing fast, so the latest release is a good idea. If you want to install by other means or make sure I'm not giving you weird advice, just check out the opencode site.
* I think I had to manually add opencode to my PATH by adding this line to .bashrc or .zshrc (adjust the username to yours):
export PATH=/Users/boutell/.opencode/bin:$PATH
* Configure opencode to talk to your local model.
Create ~/.config/opencode/opencode.json and populate it:
{
  "$schema": "https://opencode.ai/config.json",
  "tools": {
    "task": false
  },
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8899/v1"
      },
      "models": {
        "Qwen3.6-35B-A3B-UD-IQ4_XS": {
          "name": "Qwen3.6-35B-A3B-UD-IQ4_XS",
          "limit": {
            "context": 131072,
            "output": 49152
          },
          "attachment": true,
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          }
        }
      }
    }
  }
}
I'll explain each setting later.
* Now cd into one of your projects and run opencode:
opencode
* As soon as the opencode UI comes up, CHOOSE THE RIGHT MODEL. Do NOT spend half an hour working with the free default cloud model by mistake. Not that I know anyone who did that. Um.
Specifically, choose this model:
Qwen3.6-35B-A3B-UD-IQ4_XS
If you don't see it, you probably didn't configure opencode.json correctly.
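A quick way to rule out a simple syntax mistake (just a generic JSON check, not an opencode feature) is to run the config through a JSON parser:
python3 -m json.tool ~/.config/opencode/opencode.json
If that prints the config back, the file is at least well-formed JSON; if it errors, fix the reported line and restart opencode.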
* Say "hello" and wait for a response (again, the first may be very slow, later responses are faster).
* You're all set! Work with opencode much as you would with Claude Code.
THINGS THAT GO WRONG
* If you forget and waste a lot of RAM on electron apps or even browser tabs, it'll be very slow, or llama-server will crash with out of memory errors.
* Once in a while it'll print some XML-flavored thinking trace and just... stop. You can prompt it to continue. This is most likely qwen flubbing the tool call and opencode not having code to gracefully recognize that flavor of response and try again.
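When things feel sluggish, it helps to watch memory instead of guessing. macOS ships a couple of command line tools for this (a minimal sketch; Activity Monitor's Memory tab works too):
# one-shot report of system-wide memory pressure
memory_pressure
# repeat every second; rising "Pageouts" means you're swapping
vm_stat 1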
"WHY DID YOU CHOOSE THAT QUANTIZED MODEL?"
Macs are incredible because they have unified RAM. Both the CPU and the GPU can see 100% of it. But, 32GB RAM is just super, super tight for these models. It's a miracle they fit at all. You simply must choose a quantized model, even though that means trading off some intelligence and accuracy.
The full-size model would never fit. So first I tried Q4_K_M, which is mentioned in most guides. And that technically fit, but I didn't have enough memory left over for an adequate context size.
The UD-IQ4_XS (Extra Small) quant gets us back several additional GB of RAM, and we need every one of 'em.
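Related: by default macOS only lets the GPU wire up a portion of unified RAM (about 24GB on a 32GB machine, as a commenter points out below). Some people raise that limit with a sysctl on recent macOS versions. A sketch of that approach, which I haven't validated myself; it resets on reboot, so leave headroom for the OS:
sudo sysctl iogpu.wired_limit_mb=26624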
"WHY ARE YOU USING EACH OF THOSE OPTIONS?"
That command again:
llama-server -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 --host 127.0.0.1 --port 8899
* -m picks the model, of course.
* --mmproj picks the "vision projector" file. You need this if you want to be able to paste screenshots into opencode. With this feature opencode can also potentially take screenshots with playwright and look at them to debug issues.
* -c 131072 sets the context size to 128K. This model goes up to 256K, but memory is just too tight on this machine for that. However, Qwen says you shouldn't go below 128K or the model will get confused. So that is my compromise.
* --batch-size 256 helps limit the system requirements for vision. You can skip it if you leave out --mmproj and the projector file.
* -ngl 99 loads all model layers into VRAM (unified RAM, in the case of a Mac) for best performance.
* -np 1 ensures llama.cpp doesn't try to handle more than one request simultaneously. It will queue them instead. This is important when memory and context are both tight. You might experiment with "-np 2" but I wouldn't go higher.
* --host 127.0.0.1 allows connections only from your own computer.
* --port 8899 selects a port not usually taken by some other service. Just make sure opencode.json matches.
"WHY DO YOU USE THESE OPENCODE SETTINGS?"
Most of that is clearly just pointing to the right place (the right API URL with the right port, the right model name).
These settings are more interesting:
"limit": {
"context": 131072,
"output": 49152
},
"attachment": true,
"modalities": {
"input": ["text", "image"],
"output": ["text"]
}
limit is telling opencode what the context size is and how big a single response from qwen might be, so it can figure out when to compact the session. With a small context window, compaction is obviously mandatory, and if it doesn't happen soon enough, the session fails. I found that without setting a high value for output, the model frequently ran out of context and gave up. Setting output to 49152 solves this.
attachment and modalities are just declaring what this model supports. Without these, plus the mmproj option, opencode won't be able to read your pasted screenshots or look at images created by playwright during testing. If you don't care about image support, you can skip these.
"WHY DON'T YOU JUST..."
* Use Claude Code? I ran into problems because it isn't optimized for small context windows, and long-running tasks that complete large pieces of work independently matter to me. So no Claude Code.
* Use pi.dev? Yeah I know: it's even better for limited context windows. And saving context is always the dream. It's next on my list.
* Provide a web search tool to the agent? Also on my list.
* Use mlx? The gap between llama.cpp and mlx is getting pretty small, especially if you only have an M2. Also things tend to get solved for mlx later, and I'm working with qwen 3.6 which is very new. It might be a little faster but it won't solve any fundamental problems for me.
GREAT! BUT... HOW GOOD IS IT?
Well...
I've given it two real world, fair challenges from my actual recent work. These are things that Claude Code was able to complete with Opus 4.6. And from recent experience, I think it would have worked back as far as Opus 4.5. The famous November release. The day a lot of experienced developers like me stopped typing code and started directing Claude Code instead.
One is a pretty simple web app for creating greeting cards. I asked it to find an old bug I'd been too lazy to figure out. The bug had to do with a discrepancy in the positioning of images on the card between the web-based, CSS-driven editor and the pdfkit-based PDF support.
The other is adding SQLite support as an alternative database backend for ApostropheCMS, which defaults to MongoDB.
Now, you would think the first task would be a lot easier. But this model just can't quite wrap its head around the geometry of it. It often names the actual problem (which I know, because Opus already nailed it), but then flails wildly with the implementation. Multiple times now, it has created an implementation that causes the size of the editor to strobe vigorously between two sizes... yes it was painful (but funny). Just once, it kinda fixed it, but added an extra visible space at the bottom of the images and couldn't get rid of it.
So I went on to the second problem. And that, too, was a disappointment at first.
Qwen went through a similar chain of reasoning to Opus: catalog the existing uses of mongodb's Node.js API in ApostropheCMS, create an emulation with the same API.
But the first implementation failed to use real JSONB operations, even though I told it to. It would fetch the entire database, then filter documents in RAM. Um... no.
Qwen also flailed trying to get all of the ApostropheCMS unit tests to pass... or really any of them. It would try to trace where various properties came from, but always get stuck, and it started to modify the CMS code itself. Oh HELL no.
I instructed Qwen to NEVER touch the unit tests or the application code, but only the adapter code itself, because if it passes with mongodb, it can pass with an acceptable emulation. Qwen accepted that direction but still couldn't track down the issues.
Honestly the codebase was probably just too much to fathom in this limited context window, although Claude did fine with just twice as much context (256K).
So I gave Qwen a hint, something Opus figured out on its own: start by writing your own test suite for the mongodb API operations, and make sure both adapters pass it. Obviously, if mongodb doesn't pass, you botched the tests themselves.
And... that worked a lot better. Qwen built a real adapter using real JSONB operations. There is a decent little test suite and those tests do pass with both sqlite and real mongodb.
So now I've asked it to go back to iterating on passing the actual apostrophecms tests. These are mocha tests too, but they are much closer to functional tests than unit tests because they exercise much of the system. My theory is that, now that the simple stuff has been debugged, Qwen will have more luck tracing down issues at this level of integration.
Or it may just be overwhelmed. We'll see.
So... is it useful?
For some tasks, I'd say yes.
My second task is actually a classic win for AI coding agents: the adapter pattern. "Here's a thing that works, and a huge test suite. Build a compatible thing that passes the same test suite. You're not done until the tests all pass."
And I think Qwen did OK on it, eventually. It required more guidance than Claude Code, but I would still choose it over grinding out that much MongoDB-like query logic by hand.
But my first task was a stumper and shows Qwen can still get stuck in thinking loops, at least at this quantization and context size (I need to be fair here).
My next steps
* Try pi.
* Try providing a web search tool, for reading documentation.
* Try using cloud-hosted Qwen 3.6 35B A3B, without quantization, in order to see what I could get from better but still realistic home hardware.
As we watch the AI financing bubble start to shrink, my wife and I are both asking questions like "can we run this at home? If not, are there other sustainably affordable options?"
It's already cool and useful that my Mac can do this. But running on a dedicated box with a little more RAM (OK, twice as much) and a stronger GPU, it might make the leap from "cool and useful" to routinely offloading some of our tasks from expensive cloud AI providers. My task is to find out if it's good enough to justify the cost... especially when cheap cloud API options like DeepSeek 4 also exist.
NoFaithlessness951@reddit
I'm getting 50t/s for the 4 bit mlx quant using lmstudio. MacBook pro m3 36gb ram.
cocacokareddit@reddit
i heard that oMLX is even better than LMStudio at prefill.
uti24@reddit
Here are my findings with Qwen3.5 35B and Qwen3.6 27B.
So Qwen3.5 35B is really fast, as it should be, and Qwen3.6 27B is smart but slow.
Now here comes the interesting part:
Qwen3.6 27B gets the job done faster after all. Yeah. I can just leave it to itself and it will finish the task. It will figure out the tricky moments by itself. I agree, it's 5 times slower, but at the same time it doesn't need constant babysitting. Just pleasant to work with.
I mean, there must be tasks where the faster model will do the job, too.
led76@reddit
Did you get 27B running on a MacBook with 32GB ram? I tried initially and it didn’t work. How are you running it
boutell@reddit (OP)
Quant?
led76@reddit
I used 4 bit. I’m running unsloth ud-mlx-4bit in LM studio.
boutell@reddit (OP)
What context size? I kept getting ground down to 5 t/s in 27b with the 4 XS quant even with q8 for KV cache and a little extra memory authorized for the GPU. This was with 128k context, the minimum recommendation for the model.
uti24@reddit
I don't even have a MacBook. The idea is, if they can run 35B then they can run 27B, and it might be faster in the end.
tmvr@reddit
You have 24GiB default VRAM allocation and the Q4 quants are under 17GiB:
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
FlyingInTheDark@reddit
How many tokens per second do you get?
boutell@reddit (OP)
Should have mentioned. 19 tokens/sec.
AtomicInteger@reddit
i am able to run 8 instances on a single epyc milan server, each at 11.8 tk/s with 256k context, cpu only. only downside is that saved slots are not restored. i think you can get better if you play around
boutell@reddit (OP)
What are the specs on that thing
AtomicInteger@reddit
server: 7713P 64-core cpu, 256GB ddr4 (2Rx8 registered 3200) ram, no gpu, 8-channel only. hardware numa enabled but software numa balancing disabled, linux with a few more known tunings, no transparent or /dev hugepages. i run each at 8 cores per numa node with a script; the 8th llama instance is only for embedding, but i am still studying architectures now and might change a few to agent/tool:
CMD="nice numactl --cpunodebind=$i --membind=$i llama-server \
-m ${MODEL_PATH} \
--host 127.0.0.1 \
--port ${PORT} \
--threads 8 --threads-batch 8 \
-b 16384 -ub 16384 \
--ctx-size 262144 \
--no-mmap \
-fa on \
-cmoe \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--top-p 0.95 \
--min-p 0.00 \
--top-k 20 \
--temp 0.6 \
--offline \
--parallel 1 \
--repeat_penalty 1.0 \
--presence_penalty 0.0 \
--jinja \
-lv 2 \
-to 30000 \
--swa-full \
--keep -1 \
-kvu \
-cram -1 \
--slot-save-path ./slots \
--no-webui \
--metrics &>$LOG_FILE &"
Aggravating-Draw9366@reddit
I’m getting more tk/s running 3.6 35b mlx 4 bit on lm studio. Mbp m2 32gb
90hex@reddit
Can confirm, getting 19 as well on MacBook Air M2 24 GB in LMStudio, same model.
spencer_kw@reddit
the 27b for planning and 35b-a3b for execution split is where i landed too. 27b catches things the moe model misses but at 4x the speed cost you can't justify it for every task. been using 27b as a reviewer after a3b does the implementation and the catch rate is surprisingly good.
Chinmay101202@reddit
super interesting!
Chinmay101202@reddit
i aint reading all that.
mantafloppy@reddit
I didn't and you shouldn't. Starting with that tells me the guy has no idea what he is doing...
Building from source only gets you a release a couple of hours ahead of Brew or any other popular solution, and it makes it so you constantly need to rebuild manually.
This sounds more like "The blind leading the blind" than a guide.
Like seriously?
Click UD-IQ4_XS
Click Download
It's like he asked an AI, and he had so much trouble following instructions that it needed to dumb it down real low.
keyboardwarriord1st@reddit
Have you tried running it on omlx? I’m getting around 40tokens/sec on m3pro 36gig with Qwen3.6-35B-A3B-mxfp4
itsyourboiAxl@reddit
Thanks I wanted to try qwen as local ai instead of claude code. How easy would you say it is to work with it compared to claude? Have you kept using it after the tests? Claude works great because you can give it quite vague requests and it will still do the job. How does qwen compare? I feel you need to be way more concise in the prompts for it to actually do the work. I will use your post and try it with pi, thanks for sharing
boutell@reddit (OP)
It's not as smart as Claude Opus, which shouldn't be any surprise. But I would say it is smart enough to be genuinely useful for coding, especially if Claude Code pricing is becoming a problem for you. Which is notable.
In my work on this I've iterated through a lot of the annoyances to arrive at a fairly stable setup, but it needs its work cut into smaller chunks for sure. It doesn't have that "hey assistant, just jump into our big ol' company codebase and figure shit out" vibe.
audioen@reddit
You should probably give the 27b model a spin. It is much slower, for sure, but it is also much better, you can use higher accuracy quant like q5_k or even q6_k, maybe. People suspect that the 35b model needs to be running at the very minimum 6 bits, and preferably 8 bits or even at the full bf16 accuracy to not be damaged, which makes it relatively unfriendly in constrained VRAM.
The -ctk q8_0, -ctv q8_0, -fa on options may be something for you in a limited setup. People seem to think that the 8-bit KV cache does no harm, especially if running the 27b, but possibly it is the same with the 35b.
The issue with the 27b is of course that it wants much more compute. If it is possible to enable multitoken prediction using the built-in predictor, do so. 1 real + 2 speculated yields something like 2.x tokens for each inference round, roughly doubling the model speed.
boutell@reddit (OP)
Thank you for the advice!
itsyourboiAxl@reddit
I need to experiment with it. It's not a monetary issue; if I can, I prefer to run something local and promote open source. It will also force me to put more thought into my work instead of the dumb requests claude is smart enough to go through.
blackhawk00001@reddit
You can use local qwen as the backend for Claude cli. It’s context heavy at the start but I’ve had success with 200k limits, rarely go over 80-120k. Qwen 3.6 27b q8 is impressive with Claude for planning. I switch to 35B a3b q8 for implementation speed and then verify again with 27B.
I’m hosting on workstations that can handle the large context though. My 24gb Mac air struggles with any recent qwen but Gemma e4b is working great. I wish I had gone with more Mac ram but it’s great for hosting ide while inference is done elsewhere. I’m working my way towards checking out pi.
boutell@reddit (OP)
Hmm, does 27b actually require less RAM? I know I can't fit that much context with 35b a3b. I would assume 27b must be slower than the MoE model...
blackhawk00001@reddit
Yes 27B deploys with a smaller footprint than 35B A3B, but each request uses all experts compared to only a subset with a3b. 27B is the brain with more eyes on each request but 35B A3B is 3-4 times faster and usually good enough with only 3B active per request.
Primary workstation has dual R9700 gpus. With Q8 27B I get 500-1500 pp and 15-20tg and Q8 35B A3B is 2000-3000 pp and 45-70 tg, with speeds dropping off as the context buffer fills. I'm needing to try vllm as it should have much better tensor splitting than llama.cpp and give me a decent speed boost for tg.
boutell@reddit (OP)
Got it. Running 27b on this Mac would be an interesting flex but 4x slower than this would be... slow.
boutell@reddit (OP)
(Results may vary if you have enough RAM for 256K context and/or less quantization)
TheTerrasque@reddit
You should try with q8 for kv cache, and q4 xl as model quant.
boutell@reddit (OP)
RAM gets really really tight. But some have suggested ways to get the OS to cough up more RAM...
TheTerrasque@reddit
Hence q8 for KV cache, should halve the amount of ram the context needs, and allow bigger context / higher quants.
After this PR was merged into llama.cpp, lower kv cache quants have become a lot more useful, and you could maybe even go down to q4 without much loss, for another halving of context ram size. But q8 should be very near baseline fp16 and should be indistinguishable in practice.
You should of course check how it affects your workload, but it could be a worthwhile trade to get a bit higher quant on the model.
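Concretely, that would mean something like this variant of the qwen-server command from the post (a sketch, reusing the flags quoted elsewhere in this thread):
llama-server -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 -fa on --cache-type-k q8_0 --cache-type-v q8_0 --host 127.0.0.1 --port 8899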
Velocita84@reddit
Personally i prefer setting an alias for llama-server's router mode instead so you can load and switch different models on the fly without having to use a different command
Then-Topic8766@reddit
I have no mac, just a linux PC, but bookmarked this post. A lot of useful info. Thanks.
minkyuthebuilder@reddit
this is actually a goated write-up. i was struggling with qwen 3.6 on my m2 too and kept getting those weird xml loops. definitely gonna try dropping the context to 128k and switching to IQ4_XS. tbh running this locally feels like trying to fit a v8 engine into a lawnmower but when it actually works and passes a test suite it's pure dopamine. rip to your browser tabs though lol.
boutell@reddit (OP)
Thanks! Yeah the xml loops are not gone gone but they are infrequent, work can be done. My main question now is whether it's smart enough for a decent subset of my tasks, and whether it would be sufficiently smarter without the quantization, which is something I'll test using a cloud hosted provider of the same model.
JLeonsarmiento@reddit
Where do you pass model flags in llama.cpp? {preserve _thinking = true} kind of stuff?
Elusive_Spoon@reddit
You’re looking for chat_template.jinja
boutell@reddit (OP)
Not sure.
Jeidoz@reddit
FYI: Use the llama.cpp provider plugin instead of manually configuring each model in opencode.json. Simplifies life a bit with the release of new models, quants, different projects...
thisguynextdoor@reddit
There's no such quant. Do you mean XS?
boutell@reddit (OP)
Yes. Fixed. Thanks.