Running Qwen3.5 35B-A3B on a 16GB M3 MacBook Air at 8.9 TPS!

Posted by Sufficient-Bid3874@reddit | LocalLLaMA | View on Reddit | 15 comments

Preface: I actually write my posts myself, no slop in this post.

I managed to get Qwen3.5 35B-A3B running on my 15" 16GB M3 MBA thanks to mmap, and I must say that for a model this large relative to my RAM, roughly 9 TPS is not bad at all.

So, how did I do it?
Step one, download the model itself:
pip3 install huggingface-hub

python3 -c "from huggingface_hub import hf_hub_download; \
hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF', \
'Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf', \
local_dir='~/.local/share/llama-models')"
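
If you want to double-check that the GGUF actually landed in that directory and how big it is, a quick one-liner works (this just inspects the same local_dir used above):

python3 -c "import os, glob; \
paths = glob.glob(os.path.expanduser('~/.local/share/llama-models/*.gguf')); \
print(*(f'{p}: {os.path.getsize(p)/1e9:.1f} GB' for p in paths), sep='\n')"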

After it has been downloaded, run it with llama-server. Note that mmap is llama.cpp's default behaviour (there is only a --no-mmap flag to turn it off), so nothing extra is needed to enable it:
llama-server \
--model PATH_TO_MODEL \
--port 8081 \
--ctx-size 4096 \
--n-gpu-layers 0 \
--parallel 1 \
--flash-attn on \
--threads 6 \
--batch-size 512 \
--ubatch-size 128 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-warmup

Note: You don't have to use the q4_0 K/V cache types; they're only there so the KV cache takes up less of your precious memory if you're doing less serious work. (Quantizing the V cache does require flash attention, which is why --flash-attn is set to on.)
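
Loading a model this size off SSD can take a while, so before pointing anything at the server it's worth checking that it has finished loading. A minimal sketch using only the Python standard library, assuming llama-server's usual /health endpoint (it returns HTTP 200 once the model is ready):

import urllib.request, urllib.error

# Assumption: llama-server exposes /health on the port set via --port.
# It returns HTTP 200 once the model has finished loading, 503 while loading.
try:
    with urllib.request.urlopen("http://127.0.0.1:8081/health", timeout=5) as resp:
        print("server ready:", resp.status == 200)
except urllib.error.URLError as err:
    print("server not reachable or still loading:", err)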

Finally, use the model through either the API or the llama.cpp web UI!
API: http://127.0.0.1:8081/v1/
WebUI: http://127.0.0.1:8081
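
For the API route, here's a minimal chat completion sketch against llama-server's OpenAI-compatible endpoint, using only the Python standard library (the model name is basically informational, since the server only has the one model loaded):

import json, urllib.request

# Minimal chat completion request against the OpenAI-compatible endpoint
# exposed by llama-server on the --port chosen above.
payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8081/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])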

If anyone better versed in llama.cpp can suggest tweaks for squeezing out more TPS, please let me know; these settings are just ones I tried and found to work pretty well.