I benchmarked 37 LLMs on MacBook Air M5 32GB — full results + open-source tool to benchmark your own Mac
Posted by evoura@reddit | LocalLLaMA | View on Reddit | 41 comments
So I got curious about how fast different models actually run on my M5 Air (32GB, 10 CPU/10 GPU). Instead of just testing one or two, I went through 37 models across 10 different families and recorded everything using llama-bench with Q4_K_M quantization.
The goal: build a community benchmark database covering every Apple Silicon chip (M1 through M5, base/Pro/Max/Ultra) so anyone can look up performance for their exact hardware.
The Results (M5 32GB, Q4_K_M, llama-bench)
Top 15 by Generation Speed
| Model | Params | tg128 (tok/s) | pp256 (tok/s) | RAM |
|---|---|---|---|---|
| Qwen 3 0.6B | 0.6B | 91.9 | 2013 | 0.6 GB |
| Llama 3.2 1B | 1B | 59.4 | 1377 | 0.9 GB |
| Gemma 3 1B | 1B | 46.6 | 1431 | 0.9 GB |
| Qwen 3 1.7B | 1.7B | 37.3 | 774 | 1.3 GB |
| Qwen 3.5 35B-A3B MoE | 35B | 31.3 | 573 | 20.7 GB |
| Qwen 3.5 4B | 4B | 29.4 | 631 | 2.7 GB |
| Gemma 4 E2B | 2B | 29.2 | 653 | 3.4 GB |
| Llama 3.2 3B | 3B | 24.1 | 440 | 2.0 GB |
| Qwen 3 30B-A3B MoE | 30B | 23.1 | 283 | 17.5 GB |
| Phi 4 Mini 3.8B | 3.8B | 19.6 | 385 | 2.5 GB |
| Phi 4 Mini Reasoning 3.8B | 3.8B | 19.4 | 393 | 2.5 GB |
| Gemma 4 26B-A4B MoE | 26B | 16.2 | 269 | 16.1 GB |
| Qwen 3.5 9B | 9B | 13.2 | 226 | 5.5 GB |
| Mistral 7B v0.3 | 7B | 11.5 | 183 | 4.2 GB |
| DeepSeek R1 Distill 7B | 7B | 11.4 | 191 | 4.5 GB |
The "Slow but Capable" Tier (batch/offline use)
| Model | Params | tg128 (tok/s) | RAM |
|---|---|---|---|
| Mistral Small 3.1 24B | 24B | 3.6 | 13.5 GB |
| Devstral Small 24B | 24B | 3.5 | 13.5 GB |
| Gemma 3 27B | 27B | 3.0 | 15.6 GB |
| DeepSeek R1 Distill 32B | 32B | 2.6 | 18.7 GB |
| QwQ 32B | 32B | 2.6 | 18.7 GB |
| Qwen 3 32B | 32B | 2.5 | 18.6 GB |
| Qwen 2.5 Coder 32B | 32B | 2.5 | 18.7 GB |
| Gemma 4 31B | 31B | 2.4 | 18.6 GB |
Key Findings
MoE models are game-changers for local inference. The Qwen 3.5 35B-A3B MoE runs at 31 tok/s, which is 12x faster than dense 32B models (2.5 tok/s) at similar memory usage. You get 35B-level intelligence at the speed of a 3B model.
Sweet spots for 32GB MacBook:
- Best overall: Qwen 3.5 35B-A3B MoE, 35B quality at 31 tok/s. This is the one.
- Best coding: Qwen 2.5 Coder 7B at 11 tok/s (comfortable), or Coder 14B at 6 tok/s (slower, better)
- Best reasoning: DeepSeek R1 Distill 7B at 11 tok/s, or R1 Distill 32B at 2.5 tok/s if you're patient
- Best tiny: Qwen 3.5 4B — 29 tok/s, only 2.7 GB RAM
The 32GB wall: Every dense 32B model lands at ~2.5 tok/s using ~18.6 GB. Usable for batch work, not for interactive chat. MoE architecture is the escape hatch.
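The dense-vs-MoE gap above follows from a simple bandwidth roofline: each generated token has to stream (roughly) all the *active* weights from unified memory once. A rough sketch, where the ~150 GB/s base-M5 bandwidth and the ~4.85 bits/weight average for Q4_K_M are my assumptions, not numbers from the post:

```python
# Rough roofline for token generation on unified memory.
# Assumptions (not from the post): base M5 bandwidth ~150 GB/s,
# Q4_K_M averages ~4.85 bits per weight.
BANDWIDTH_GBS = 150.0
BYTES_PER_PARAM = 4.85 / 8

def tg_upper_bound(active_params_b: float) -> float:
    """Upper bound on tok/s if generation is purely bandwidth-limited:
    every token streams all active weights once from RAM."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GBS * 1e9 / bytes_per_token

# Dense 32B: all 32B weights are read for every token.
print(f"dense 32B bound: {tg_upper_bound(32):.1f} tok/s")  # 7.7
# MoE 35B-A3B: only ~3B active weights are read per token.
print(f"MoE A3B bound:  {tg_upper_bound(3):.1f} tok/s")   # 82.5
```

The measured numbers (2.5 and 31 tok/s) sit well below these bounds, which suggests the base M5's 10-core GPU is also compute-limited, but the roughly 10x dense-vs-MoE ratio the roofline predicts matches what the tables show.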
All 37 Models Tested
10 model families: Gemma 4, Gemma 3, Qwen 3.5, Qwen 3, Qwen 2.5 Coder, QwQ, DeepSeek R1 Distill, Phi-4, Mistral, Llama
How It Works
All benchmarks use llama-bench, which is standardized, content-agnostic, and reproducible. It measures raw prompt processing (pp) and token generation (tg) speed at fixed token counts. No custom prompts, no subjectivity.
The tool auto-detects your hardware, downloads the models that fit in your RAM, benchmarks them, and saves the results in a standardized format. Submit a PR and your results show up in the database.
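A minimal sketch of the benchmark-and-parse step, assuming llama-bench's `-m`/`-p`/`-n`/`-o json` flags and its usual JSON field names (`n_prompt`, `n_gen`, `avg_ts`); see the repo for the actual implementation:

```python
import json

def bench_cmd(model_path: str, pp: int = 256, tg: int = 128) -> list[str]:
    """Build the llama-bench invocation used for every model (pp256/tg128)."""
    return ["llama-bench", "-m", model_path,
            "-p", str(pp), "-n", str(tg), "-o", "json"]

def parse_results(raw_json: str) -> dict:
    """Pull the two headline numbers out of llama-bench's JSON output.
    The field names match recent llama.cpp builds, but treat them as an
    assumption rather than a stable API."""
    rows = json.loads(raw_json)
    out = {}
    for r in rows:
        # pp rows generate 0 tokens; tg rows process 0 prompt tokens.
        key = f"pp{r['n_prompt']}" if r["n_gen"] == 0 else f"tg{r['n_gen']}"
        out[key] = r["avg_ts"]
    return out

# Example with a fabricated llama-bench output record:
sample = ('[{"n_prompt": 256, "n_gen": 0, "avg_ts": 2013.0},'
          ' {"n_prompt": 0, "n_gen": 128, "avg_ts": 91.9}]')
print(parse_results(sample))  # {'pp256': 2013.0, 'tg128': 91.9}
```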
Especially looking for: M4 Pro, M4 Max, M3 Max, M2 Ultra, and M1 owners. The more hardware configs we cover, the more useful this becomes for everyone.
GitHub: https://github.com/enescingoz/mac-llm-bench
Happy to answer questions about any of the results or the methodology.
sirfitzwilliamdarcy@reddit
It's much faster with MLX. I'm getting 55 tok/s for Qwen 3.5 35B-A3B at 4 bits.
UnhingedBench@reddit
Here's my own image, illustrating one year of experimentation on a 128GB MacBook M4 Max.
Tested in ideal situation: Empty context, and avoiding any thermal throttle.
You can see how MoE models are game changers, once you have fast RAM to spare.
nvidiabookauthor@reddit
Are the fire emoji the models you recommend?
UnhingedBench@reddit
Yes, they are. I benchmark models for role-playing, including erotic and NSFL scenarios. 🔥Unhinged ERP Benchmark
evoura@reddit (OP)
These are very nice visuals 🔥 Would you consider running the same benchmarks with my repo and uploading the results, so we can create a centralized community benchmark? Then in the future we can create visuals like that.
UnhingedBench@reddit
For comparison, Linear vs Logarithmic scales, same data
Sweet-Argument-7343@reddit
Happy to contribute with my M2 24GB Mac Mini. However, to me it only makes sense to test MLX!
evoura@reddit (OP)
MLX support added now, if you want to benchmark models on your setup :)
evoura@reddit (OP)
Thank you so much for your interest, and I'm really happy to hear that :) MLX support will be added very shortly. Once it's added I'll let you know, and I'd be very happy if you could check my MLX implementation.
BeneficialVillage148@reddit
This is super useful 🙌
Love how clean and practical the benchmarks are, especially highlighting how much MoE models outperform dense ones on 32GB Macs. This kind of data is exactly what people need before choosing a model.
evoura@reddit (OP)
Thank you so much for your feedback. I think these kinds of benchmarks are very useful for anyone looking to buy a new Mac, or wanting to run sweet-spot models on their setup.
port888@reddit
I concur with the observation that Qwen 3.5 35B A3B is the best at the moment. I have the exact same laptop and configuration, and no matter what local model I try, I always find myself back with the 35B for the balance between speed and output quality.
evoura@reddit (OP)
Thank you so much for this great information. Real user experiences are just as important as these kinds of benchmarks.
srigi@reddit
I cannot forgive Apple for not giving us a 64GB Air this generation. Even if people mention thermal throttling on Airs, 64GB would allow a whole new class of quantizations to be loaded into RAM.
BeneficialVillage148@reddit
This is super useful 👏
Having real benchmarks across so many models on Apple Silicon is exactly what the community needed. The MoE speed vs dense models difference is honestly wild.
LoSboccacc@reddit
You don't
SirBardBarston@reddit
Can you elaborate on this a bit? New to the game.
Pawderr@reddit
it means that the model will never be "as smart" as a dense model where the 35B parameters are all active at once, but it will run very fast and be way more intelligent than a dense 3B model
LoSboccacc@reddit
You get idk 14b intelligence
Specter_Origin@reddit
True, It's close but not quite close enough...
VitSoonYoung@reddit
Newbie here. I wonder what tg tok/s is fast enough for coding tasks, and how much context is needed to be comparable to Claude's cloud solutions?
reery7@reddit
For the Devstral Small 24B MLX MoE I'm getting 9.3 t/s on an M5 MacBook Air 24 GB. Power usage for GPU is about 5.5W, 6.5W package, no throttling at all.
matt-k-wong@reddit
this is awesome. Would love to see how the Pro, Max, and (future) Ultra models stack up, but my mental model is simply to multiply the t/s by the bus-speed increase, with a minor coefficient for the enhanced GPU and neural accelerator speeds.
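That mental model is easy to write down. A sketch, where the bandwidth figures are commonly cited numbers I'm assuming (not measured in this thread) and the GPU coefficient is pure guesswork:

```python
# "Scale by memory bandwidth" mental model from the comment above.
# Bandwidths are commonly cited figures (assumptions), in GB/s.
BANDWIDTH = {"M5": 153, "M4 Pro": 273, "M4 Max": 546}

def estimate_tg(tg_m5: float, chip: str, gpu_coeff: float = 1.0) -> float:
    """Scale a measured base-M5 tg number by the bandwidth ratio,
    times a minor coefficient for the bigger GPU (guesswork)."""
    return tg_m5 * (BANDWIDTH[chip] / BANDWIDTH["M5"]) * gpu_coeff

# e.g. a dense 32B at 2.5 tok/s on the base M5:
print(f"M4 Max guess: {estimate_tg(2.5, 'M4 Max'):.1f} tok/s")  # 8.9
```

As the OP notes below, throttling and real-world conditions mean measured numbers are better than these estimates; this just gives a first-order guess.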
sammcj@reddit
I have a M5 Max 128GB, I've benchmarked across a few LLMs here if it helps: https://omlx.ai/my/fadc2127d384283f5df1fcc2c093a9f95700c6a52594bf9db837a81d3418b5ec
| Model (4-bit) | Context | PP (tok/s) | TG (tok/s) |
|---|---|---|---|
| Qwen3.5-122B-A10B | 1k | 911.1 | 64.3 |
| | 4k | 1,480 | 62.2 |
| Qwen3.5-27B | 1k | 756.3 | 30.6 |
| | 4k | 894.8 | 28.4 |
| | 8k | 825.4 | 27.2 |
| | 16k | 722.1 | 26.6 |
| Qwen3.5-35B-A3B | 1k | 1,698 | 131.8 |
| | 4k | 3,424 | 119.6 |
| | 32k | 3,082 | 85.5 |
| Qwen3.5-9B | 1k | 1,983 | 96.2 |
| | 4k | 2,706 | 92.2 |
| Qwen3.5-4B | 1k | 2,819 | 165.3 |
| | 4k | 4,336 | 153.0 |
| | 8k | 4,644 | 141.9 |
| | 16k | 4,535 | 123.3 |
| Qwen3.5-2B | 1k | 3,438 | 326.7 |
evoura@reddit (OP)
Yeah, since memory bandwidth is the main factor, results for other chips can be roughly estimated. But the goal of this repo is sharing real-life results rather than estimations, because thermal throttling, shared memory, and other everyday factors can produce different results than the theoretical ones. Also, once we can see other chips' results, we'll be able to see how accurate the estimations are against real-life measurements.
Wey_Gu@reddit
dense qwen3.5 27b is probably the best "slow but capable" one i think
Moist_Recognition321@reddit
Great benchmark! Really useful to see how different models perform on Apple Silicon. Would love to see memory bandwidth impact analysis too. Thanks for sharing this!
No_Individual_8178@reddit
nice work, been looking for something like this. i'm on an m2 max 96gb and the 32gb wall you described just doesn't exist at that tier obviously, but the tradeoff is you're paying for bandwidth you only use on the bigger models. i daily drive qwen 2.5 72b q4 through llama.cpp and it's usable for interactive work but definitely on the slower side. happy to run your bench tool and submit a PR when i get a chance, would be cool to see how 96gb compares.
CSlov23@reddit
Thanks for posting this. Did Mac start slowing down due to the lack of fans? Or get really warm? I’m debating between the air and pro, so I was curious
evoura@reddit (OP)
Honestly no, I didn't notice any thermal throttling or the machine getting very hot during the benchmarks. And since the Air has no fans, it stays completely silent which is nice. That said I was running the benchmarks with most other apps closed. If you're running VMs, Docker, or heavy background processes at the same time, your experience might be different.
t4a8945@reddit
I don't understand one-dimensional benchmarks. If a model produces nonsense at 90 tps vs. one five times slower that actually solves problems, the winner is the slower one.
evoura@reddit (OP)
Fair point. The idea is that most people already know which models are good from quality benchmarks. What they don't know is whether their Mac can actually run those models at a usable speed, or which Mac they need to buy to get comfortable tok/s on the model they want. That's the gap this fills. The main idea here is finding the sweet spot between knowledge and speed.
CliveBratton@reddit
Would love to see this done for an M5 16GB.
Shot-Buffalo-2603@reddit
Is this only using GGUF? I don't think llama.cpp supports MLX. I'm on an M5 Air 32GB and I'm getting almost double some of those tok/sec using MLX models with vllm-mlx or LM Studio.
evoura@reddit (OP)
Yes, all GGUF with llama.cpp only. MLX is definitely faster; I've seen people report 30-50% better tok/s compared to llama.cpp for the same model. The repo is set up to support adding other runtimes like MLX in the future. Which models are you running? Would be cool to have a direct GGUF vs MLX comparison on the same M5 Air.
r15km4tr1x@reddit
Did you include e4b?
evoura@reddit (OP)
Yep, Gemma 4 E4B is in there with 8 tok/s generation and 5.2 GB RAM usage.
r15km4tr1x@reddit
Cool, I was doing some rough testing of speed vs accuracy on photo VLM on an aged box, and E4B appeared better quality-wise and not much slower.
Ill_Barber8709@reddit
That's a little bit surprising. I'm a daily user of Devstral-Small-2 24B 4bit MLX on an M2 Max MBP (32GB, 400GB/s) and I get 18 t/s on average. I use it extensively as an agent in VSCode and Xcode, so no small tasks or small contexts. I understand that MLX is faster than GGUF and 4bit is a little smaller than Q4_K_M, but I would have expected at least 6-ish t/s on the MBA (150 GB/s).
I'm about to try using vLLM with the same GGUF. I hope it won't be too slow.
evoura@reddit (OP)
Yeah, 3.5 t/s is definitely on the low side. I think the base M5 only has 10 GPU cores, which really hurts at 24B. Your M2 Max with 30 GPU cores and 400 GB/s bandwidth is just a completely different beast for these larger models. MLX being faster than llama.cpp probably adds another 20-30% on top of that. Would love your M2 Max numbers in the repo :)