Field report: Qwen 3.6 27B on an M2 MacBook Pro with 32GB RAM
Posted by boutell@reddit | LocalLLaMA | 20 comments
This post is a lot shorter than my 35B-A3B field report because almost everything is the same. But if you want to know how to reproduce it, see my earlier post.
Tried this out over my lunch break. To be clear, I realize this machine is totally under-spec'd for 27B in practice. But why not give it a try? It has enough RAM to run it. Sort of!
I'm running Qwen 3.6 27B, the 4-bit IQ4_XS Unsloth quant, downloaded from Hugging Face.
How it started: 80 t/s pp (prompt processing), 7.9 t/s tg (token generation).
How it's going: 4 t/s pp (!!!), 3.1 t/s tg.
4 is not a typo.
Wow that's slow! And I was only up to 52,000 tokens of context at that point.
That's when I hit control-C.
I didn't see any indications that the system was swapping. Memory pressure never went past the yellow range. I think I was simply getting clobbered by low memory bandwidth... pretty much as expected. Memory bandwidth is key when running a dense model like this.
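Back-of-envelope, a bandwidth-bound dense model can't generate tokens faster than (memory bandwidth) / (bytes of weights streamed per token). A minimal sketch, assuming an M2 Pro's ~200 GB/s unified memory and a ~14 GB IQ4_XS weight file (both assumptions, not measurements):

    # Rough token-generation ceiling for a bandwidth-bound dense model.
    # Assumptions, not measurements: ~200 GB/s unified memory bandwidth
    # (M2 Pro) and a ~14 GB IQ4_XS 27B weight file, with every generated
    # token streaming the full weight set once.
    bandwidth_gb_s = 200.0
    weights_gb = 14.0
    ceiling_tps = bandwidth_gb_s / weights_gb
    print(f"theoretical tg ceiling: {ceiling_tps:.1f} t/s")  # ~14 t/s

7.9 t/s at short context sits comfortably under that ceiling; the later slide to 3.1 t/s is consistent with KV cache and attention work growing on top of the fixed weight-streaming cost.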
However! The code it generated up to that point in OpenCode looks excellent, particularly considering that I gave it no further input after the initial prompt and it had to analyze a significant codebase to figure out what to do.
It worked much better than 35B-A3B, as expected. But it was also much slower, as expected... you just can't get something for nothing.
Here was my llama-server command. As you can see, I did turn on ngram-mod speculative decoding. Based on the logs, I doubt I gained much from it. But subjectively, comparing against an earlier run without it that I also eventually had to interrupt, I doubt I lost much either. I think the reason is simple: 27B is like your older, wiser friend. It speaks when it has something to say, and it rarely repeats itself.
    llama-server \
      -m ~/models/unsloth/Qwen3.6-27B-IQ4_XS.gguf \
      --mmproj ~/models/unsloth/Qwen3.6-27B-mmproj-BF16.gguf \
      -c 131072 --batch-size 256 -ngl 99 -np 1 \
      --host 127.0.0.1 --port 8899 \
      -ctk q8_0 -ctv q8_0 \
      --spec-type ngram-mod --spec-ngram-size-n 24 \
      --draft-min 12 --draft-max 48
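For the curious, here's a toy sketch of the idea behind ngram drafting (a simplification for illustration, not llama.cpp's actual ngram-mod implementation): the speculator proposes a draft only when the most recent tokens have appeared before, which is exactly what a non-repetitive model denies it.

    # Toy sketch of ngram-style speculative drafting (illustrative only;
    # not llama.cpp's actual ngram-mod implementation).
    def draft_from_history(history, n=3, max_draft=8):
        """If the last n tokens appeared earlier, propose what followed them;
        the target model then verifies the draft in one batched pass."""
        tail = tuple(history[-n:])
        for i in range(len(history) - n - 1, -1, -1):
            if tuple(history[i:i + n]) == tail:
                return history[i + n:i + n + max_draft]
        return []  # nothing to propose; decode normally

    # Repetitive text drafts well; fresh prose gives the speculator nothing.
    tokens = "the cat sat on the mat and the cat sat on".split()
    print(draft_from_history(tokens))  # ['the', 'mat', 'and', ...]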
I continue to limit simultaneous processes to 1 (-np 1) because I don't see much of a win in asking it to run two at once. Instead it just queues them up and knocks them down. I have started to allow OpenCode to run agent tasks again, because I see the massive impact on context size for a typical request if I don't. But there's no point in asking the GPU to actually run them simultaneously when it obviously doesn't have the power to spare.
I now understand why people see this model as a slow but effective self-hosted Sonnet. Even Claude Opus 4.7 was impressed with the output and compared it to what could be expected from Sonnet.
Next I plan to evaluate it personally on a cloud-hosted card with specs at least comparable to the R9700 (the R9700 itself isn't available in the cloud). I do have useful field reports from others (thank you!), but it's important to get a sense of it on my own programming tasks.
Tyme4Trouble@reddit
If I remember correctly, as context length grows, calculating attention becomes more and more compute-heavy, which tanks throughput.
boutell@reddit (OP)
Yes. Using a dense model sacrifices all the hacks that make 35B-A3B surprisingly cheap... and gets you back all the smarts that the hacks give up.
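A back-of-envelope sketch of that shape (toy constants chosen purely for illustration, not measured): per-token attention work grows linearly with how much context is already in the KV cache, so prompt processing slows the deeper into the prompt you get.

    # Toy model of pp slowdown: each token pays a fixed weight-streaming
    # cost plus attention over everything already in the KV cache.
    # The attention constant below is made up to illustrate the shape.
    def relative_pp_speed(ctx_tokens, fixed_cost=1.0, attn_cost_per_kv=0.0004):
        return 1.0 / (fixed_cost + attn_cost_per_kv * ctx_tokens)

    base = relative_pp_speed(0)
    for ctx in (0, 8_000, 52_000):
        print(f"{ctx:>6} tokens deep: {relative_pp_speed(ctx) / base:.2f}x")

With those made-up constants, throughput drops roughly 20x by 52,000 tokens, about the shape of the 80 to 4 t/s pp collapse in the post.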
This_Maintenance_834@reddit
Working MTP will bring a lot of the speed back on dense models.
boutell@reddit (OP)
I would really like that to be true. However, I'm seeing prompt processing collapse to four tokens per second over long context, and my understanding is that multi-token prediction is not relevant during prompt processing.
This_Maintenance_834@reddit
DFlash has difficulty handling long context. MTP does not have this problem. DFlash was only trained for a 4096 context length.
boutell@reddit (OP)
What's your MTP config? And are you using llama.cpp or something else? Thanks
This_Maintenance_834@reddit
mtp=3 on vLLM v0.20. I don't use llama.cpp anymore. Once vLLM works with MTP, llama.cpp is no match.
boutell@reddit (OP)
I'll have to try it with the Mac plugin, just for fun.
Due_Duck_8472@reddit
Use stupidly cheap computers, at stupid prices.
danigoncalves@reddit
I have colleagues with M4s; I wonder how much they can squeeze out of this model.
BustyMeow@reddit
About 6 to 7 t/s tg for 4-bit, if you're wondering.
danigoncalves@reddit
Yep, I was betting on something like that (~10), but I think it's a little slow for what they're used to.
boutell@reddit (OP)
I'm hearing you need the Max to get a big boost in memory bandwidth, which is the limiting factor here.
2Norn@reddit
Could have just calculated it without running it, tbh, unless you wanted the fun of trying it.
boutell@reddit (OP)
Basically yes, plus the ngram Hail Mary.
poobear_74@reddit
From recent tests, 35B-A3B is actually a fine model, good enough for most coding tasks, and it runs really well on a Strix Halo 395: a cost-effective local coding solution. Qwen 3.6 27B is a marginal improvement that requires moving from a $2k setup to a $13k setup.
Finanzamt_Endgegner@reddit
You need better ngram flags; try something like this.
boutell@reddit (OP)
You seem to be recommending the older ngram speculator over the newer one. That doesn't mean you're wrong, of course.
The rest of your settings aren't that far from what I'm doing.
Finanzamt_Endgegner@reddit
Some guy said the newer ones are more for parallel stuff, but idk, it just works better for me 🤔
FigAltruistic2086@reddit
27B is dense, so all weights have to be touched for every token — unlike the MoE counterpart where only ~3B are active. That's why memory bandwidth becomes the bottleneck, not compute.
I see almost exactly the same behavior on a DGX Spark with 128GB unified memory. Plenty of capacity to load the model, but the bandwidth ceiling hits a dense 27B the same way: more RAM doesn't help when every token still has to stream the full weight set.
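In numbers (sizes are rough assumptions for ~4-bit quants, not measurements):

    # Why MoE decodes faster at the same bandwidth: fewer bytes per token.
    # Sizes below are rough assumptions for ~4-bit quants.
    bandwidth_gb_s = 200.0     # assumed unified-memory bandwidth
    dense_gb = 14.0            # ~all weights of a dense 27B
    moe_active_gb = 2.0        # ~3B active params of the A3B MoE
    for name, gb in (("dense 27B", dense_gb), ("MoE ~3B active", moe_active_gb)):
        print(f"{name:>14}: ~{bandwidth_gb_s / gb:.0f} t/s ceiling")

Same bandwidth, roughly a 7x difference in the token-generation ceiling, which is the whole MoE trade in one line.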