Running Qwen3.5 35B-A3B on a 16GB M3 MacBook Air at 8.9 TPS!
Posted by Sufficient-Bid3874@reddit | LocalLLaMA | 15 comments
Preface: I actually write my posts myself, no slop in this post.
I managed to get Qwen3.5 35B-A3B working on my 15" 16GB M3 MBA through mmap, and I must say that given how large the model is compared to my RAM, 9 TPS is not bad at all.
So, how did I do it?
Step one, download the model itself:
pip3 install huggingface-hub
python3 -c "from huggingface_hub import hf_hub_download; import os; \
hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF', \
'Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf', \
local_dir=os.path.expanduser('~/.local/share/llama-models'))"
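(Note: the os.path.expanduser call is needed because Python does not expand ~ on its own. Optional sanity check, this just lists the local_dir from above to confirm the file landed:)
ls -lh ~/.local/share/llama-models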
After it has been downloaded, run it through this command:
llama-server \
--model PATH_TO_MODEL \
--port 8081 \
--ctx-size 4096 \
--n-gpu-layers 0 \
--parallel 1 \
--mmap \
--flash-attn on \
--threads 6 \
--batch-size 512 \
--ubatch-size 128 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-warmup
Note: The q4_0 K/V cache types are optional. Quantizing the cache trades a bit of precision for a smaller memory footprint, which is worth it for less serious work when RAM is this tight.
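Tip: while the model is generating you can watch paging activity with the built-in vm_stat; if the "Pageouts" counter keeps climbing, you're leaning on swap:
vm_stat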
Finally, use the model with either API or the llama.cpp webUI!
API: http://127.0.0.1:8081/v1/
WebUI: http://127.0.0.1:8081
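For a quick smoke test of the API (llama-server exposes an OpenAI-compatible chat endpoint; with a single model loaded you can leave the model field out):
curl http://127.0.0.1:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Say hi in five words."}],"max_tokens":32}'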
If anyone better versed in llama.cpp can suggest possible improvements for more TPS, please let me know, as these are just settings I tried and found to work pretty well.
po_stulate@reddit
Why --n-gpu-layers 0 though? You should be able to get way better performance with --n-gpu-layers 99?
Sufficient-Bid3874@reddit (OP)
That would mean the whole model stays in memory, which we do not have enough of on a 16GB MBA, so it's better to do CPU-only inference. mmap also works better with CPU-only inference.
Evening_Ad6637@reddit
The model is loaded into memory in either case, even if only partially. You have unified RAM, which means it’s physically the same memory. But you’re letting the CPU do more of the processing than the GPU. That doesn’t make sense. It’s slower, and causes the Mac’s temperature to rise faster and higher.
The only problem is that not all of the Unified RAM is allocated to the GPU cores.
However, there is a command you need to run with sudo that frees up more RAM for the GPU cores.
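For reference, the command usually meant here is the iogpu wired-limit sysctl (macOS 14+); the 14336 MB value below is just an example, and the setting resets on reboot:
sudo sysctl iogpu.wired_limit_mb=14336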
Sufficient-Bid3874@reddit (OP)
This is true, afaik the default max for the GPU is around 66% of total system RAM?
Seeing as you are a llama.cpp contributor, you will probably know better:
Does allocating more to the GPU with the terminal command have any downsides?
How would you modify my parameters for better performance?
(Also, unified RAM is the same memory; however, my hunch was that the operation is memory-bandwidth-bound and that the attention layers would benefit the most from GPU inference during prefill, but correct me if I'm wrong here please.)
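One way to test that hunch empirically would be llama.cpp's bundled llama-bench, which reports prefill (pp) and generation (tg) speeds separately; the invocations below just reuse the model path from the download step, comparing CPU-only against full GPU offload:
llama-bench -m ~/.local/share/llama-models/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf -ngl 0
llama-bench -m ~/.local/share/llama-models/Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf -ngl 99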
po_stulate@reddit
Their llama.cpp flair probably just means that it's the software they use, you can apply that flair too if you want.
AFAIK if you leave too little memory for your system then it could freeze and you'd need to force reboot it. But other than that, I don't think there's any downside.
I do remember that memory allocated to the GPU needs to be wired memory, which means it can't live in swap, so maybe there is actually a difference between inferencing with CPU and with GPU, since with CPU you could probably swap it.
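(If you want to check whether the weights are actually being paged out, macOS shows current swap usage with:)
sysctl vm.swapusage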
Crystalagent47@reddit
Hey I have an M3 MBA as well, may I dm you please?
Sufficient-Bid3874@reddit (OP)
You can if you wish, but if it's not personal I would prefer to help you here so others with the same problem can find help!
Crystalagent47@reddit
Ah sure so basically I'm a noob to running local AI, and I only know a bit of LM Studio, can I try this using that or do I gotta use ollama and terminal?
Sufficient-Bid3874@reddit (OP)
You should use llama.cpp, it's only two commands:
brew install llama.cpp
llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ3_XXS \
--port 8081 \
--ctx-size 4096 \
--n-gpu-layers 0 \
--parallel 1 \
--mmap \
--flash-attn on \
--threads 6 \
--batch-size 512 \
--ubatch-size 128 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-warmup
Crystalagent47@reddit
Any major slowdowns while loading/running the model?
Sufficient-Bid3874@reddit (OP)
No, thanks to mmap
Crystalagent47@reddit
Got it, thanks dude
Sufficient-Bid3874@reddit (OP)
Ofc
Sufficient-Bid3874@reddit (OP)
[Image: TPS from OpenWebUI]