I benchmarked 37 LLMs on MacBook Air M5 32GB — full results + open-source tool to benchmark your own Mac

Posted by evoura@reddit | LocalLLaMA | 41 comments

So I got curious about how fast different models actually run on my M5 Air (32GB, 10-core CPU, 10-core GPU). Instead of testing just one or two, I ran 37 models across 10 families through llama-bench, all at Q4_K_M quantization, and recorded everything.

The goal: build a community benchmark database covering every Apple Silicon chip (M1 through M5, base/Pro/Max/Ultra) so anyone can look up performance for their exact hardware.

The Results (M5 32GB, Q4_K_M, llama-bench)

Top 15 by Generation Speed

| Model | Params | tg128 (tok/s) | pp256 (tok/s) | RAM |
|---|---|---|---|---|
| Qwen 3 0.6B | 0.6B | 91.9 | 2013 | 0.6 GB |
| Llama 3.2 1B | 1B | 59.4 | 1377 | 0.9 GB |
| Gemma 3 1B | 1B | 46.6 | 1431 | 0.9 GB |
| Qwen 3 1.7B | 1.7B | 37.3 | 774 | 1.3 GB |
| Qwen 3.5 35B-A3B MoE | 35B | 31.3 | 573 | 20.7 GB |
| Qwen 3.5 4B | 4B | 29.4 | 631 | 2.7 GB |
| Gemma 4 E2B | 2B | 29.2 | 653 | 3.4 GB |
| Llama 3.2 3B | 3B | 24.1 | 440 | 2.0 GB |
| Qwen 3 30B-A3B MoE | 30B | 23.1 | 283 | 17.5 GB |
| Phi 4 Mini 3.8B | 3.8B | 19.6 | 385 | 2.5 GB |
| Phi 4 Mini Reasoning 3.8B | 3.8B | 19.4 | 393 | 2.5 GB |
| Gemma 4 26B-A4B MoE | 26B | 16.2 | 269 | 16.1 GB |
| Qwen 3.5 9B | 9B | 13.2 | 226 | 5.5 GB |
| Mistral 7B v0.3 | 7B | 11.5 | 183 | 4.2 GB |
| DeepSeek R1 Distill 7B | 7B | 11.4 | 191 | 4.5 GB |

The "Slow but Capable" Tier (batch/offline use)

| Model | Params | tg128 (tok/s) | RAM |
|---|---|---|---|
| Mistral Small 3.1 24B | 24B | 3.6 | 13.5 GB |
| Devstral Small 24B | 24B | 3.5 | 13.5 GB |
| Gemma 3 27B | 27B | 3.0 | 15.6 GB |
| DeepSeek R1 Distill 32B | 32B | 2.6 | 18.7 GB |
| QwQ 32B | 32B | 2.6 | 18.7 GB |
| Qwen 3 32B | 32B | 2.5 | 18.6 GB |
| Qwen 2.5 Coder 32B | 32B | 2.5 | 18.7 GB |
| Gemma 4 31B | 31B | 2.4 | 18.6 GB |

Key Findings

MoE models are game-changers for local inference. The Qwen 3.5 35B-A3B MoE runs at 31 tok/s, roughly 12x faster than the dense 32B models (~2.5 tok/s), with a similar memory footprint (20.7 GB vs ~18.6 GB). You get a 35B-parameter model at about the speed of a small dense one (Qwen 3.5 4B manages 29.4 tok/s).
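If you want to check the speedup yourself, it falls straight out of the tg128 numbers in the tables. A quick sketch using only figures from this post (the variable names are just for illustration):

```python
# tg128 generation speeds (tok/s) copied from the results tables in this post.
moe_tg = 31.3  # Qwen 3.5 35B-A3B MoE
dense_32b_tg = {
    "DeepSeek R1 Distill 32B": 2.6,
    "QwQ 32B": 2.6,
    "Qwen 3 32B": 2.5,
    "Qwen 2.5 Coder 32B": 2.5,
}

# Average the dense 32B results, then compare against the MoE model.
dense_avg = sum(dense_32b_tg.values()) / len(dense_32b_tg)
speedup = moe_tg / dense_avg

print(f"dense 32B average: {dense_avg:.2f} tok/s")
print(f"MoE speedup: {speedup:.1f}x")
```

Averaging the four dense 32B models gives about 2.55 tok/s, so the ratio lands a little above 12x.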

Sweet spots for 32GB MacBook:

The 32GB wall: Every dense 32B model lands at ~2.5 tok/s using ~18.6 GB. Usable for batch work, not for interactive chat. MoE architecture is the escape hatch.
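To put "not for interactive chat" in concrete terms, here is a rough sketch of how long a reply takes at these speeds. The 500-token reply length is my assumption for illustration, not something the benchmark measures, and prompt processing time is ignored:

```python
def seconds_to_generate(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a steady generation speed.

    Ignores prompt processing, so real latency will be somewhat higher.
    """
    return n_tokens / tok_per_s


reply_tokens = 500  # assumed reply length, not a figure from the benchmark

# tg128 speeds taken from the results tables in this post.
for name, speed in [("Qwen 3 32B (dense)", 2.5), ("Qwen 3.5 35B-A3B MoE", 31.3)]:
    print(f"{name}: {seconds_to_generate(reply_tokens, speed):.0f} s")
```

Under that assumption the dense model needs over three minutes per reply, while the MoE model finishes in well under half a minute.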

All 37 Models Tested

10 model families: Gemma 4, Gemma 3, Qwen 3.5, Qwen 3, Qwen 2.5 Coder, QwQ, DeepSeek R1 Distill, Phi-4, Mistral, Llama

How It Works

All benchmarks use llama-bench, which is standardized, content-agnostic, and reproducible. It measures raw prompt processing (pp) and token generation (tg) speed at fixed token counts. No custom prompts, no subjectivity.
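For anyone who wants to reproduce a single row by hand: the pp256 and tg128 columns correspond to llama-bench's `-p 256` and `-n 128` settings. Below is a minimal sketch of driving it from Python, assuming `llama-bench` is on your PATH. The function names are hypothetical, and this is not necessarily how mac-llm-bench invokes it. The JSON field names (`n_prompt`, `n_gen`, `avg_ts`) are what I expect from llama-bench's JSON output, so check them against your llama.cpp build.

```python
import json
import subprocess


def build_command(model_path: str, n_prompt: int = 256, n_gen: int = 128) -> list[str]:
    """Build a llama-bench command line that emits JSON results."""
    return [
        "llama-bench",
        "-m", model_path,     # GGUF model file
        "-p", str(n_prompt),  # prompt-processing test size (pp256)
        "-n", str(n_gen),     # text-generation test size (tg128)
        "-o", "json",         # machine-readable output
    ]


def summarize(raw_json: str) -> dict[str, float]:
    """Map test labels such as 'pp256' / 'tg128' to average tokens per second.

    Assumes each result row has n_prompt, n_gen and avg_ts fields.
    """
    results = {}
    for row in json.loads(raw_json):
        if row["n_gen"] == 0:
            label = f"pp{row['n_prompt']}"  # prompt-processing test
        else:
            label = f"tg{row['n_gen']}"     # text-generation test
        results[label] = row["avg_ts"]
    return results


def run_bench(model_path: str) -> dict[str, float]:
    """Run llama-bench on one model and return the summarized speeds."""
    proc = subprocess.run(
        build_command(model_path), capture_output=True, text=True, check=True
    )
    return summarize(proc.stdout)
```

Calling `run_bench` with the path to any Q4_K_M GGUF file you have locally should return a dict with a `pp256` and a `tg128` entry.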

The tool auto-detects your hardware, downloads models that fit in your RAM, benchmarks them, and saves the results in a standardized format. Submit a PR and your results show up in the database.
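The "fits in your RAM" step could look something like the sketch below. To be clear, this is an illustration and not the tool's actual logic: the function name and the 8 GB headroom are my assumptions, and the footprints are the RAM figures from the tables above.

```python
# RAM footprints (GB) at Q4_K_M, taken from the results tables in this post.
FOOTPRINT_GB = {
    "Qwen 3 0.6B": 0.6,
    "Qwen 3.5 9B": 5.5,
    "Gemma 4 26B-A4B MoE": 16.1,
    "Qwen 3 32B": 18.6,
    "Qwen 3.5 35B-A3B MoE": 20.7,
}


def models_that_fit(total_ram_gb: float, headroom_gb: float = 8.0) -> list[str]:
    """Return models whose footprint leaves headroom_gb free for the OS and apps.

    headroom_gb is an illustrative guess, not a measured requirement.
    """
    budget = total_ram_gb - headroom_gb
    return [name for name, gb in FOOTPRINT_GB.items() if gb <= budget]


print(models_that_fit(32))  # 24 GB budget: all five models fit
print(models_that_fit(16))  # 8 GB budget: only the two smallest fit
```

With a larger headroom, the 35B MoE would drop off the list on a 32GB machine, so the right value depends on what else you keep running.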

Especially looking for: M4 Pro, M4 Max, M3 Max, M2 Ultra, and M1 owners. The more hardware configs we cover, the more useful this becomes for everyone.

GitHub: https://github.com/enescingoz/mac-llm-bench

Happy to answer questions about any of the results or the methodology.