Gemma 4 26b a4b - MacBook Pro M5 MAX. Averaging around 81tok/sec

Posted by Bderken@reddit | LocalLLaMA | 60 comments

Pretty fast! Power draw peaks at around 114 W, but only in short bursts, since responses usually finish quickly.
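As a rough sanity check, the two figures in the post can be combined into an energy-per-token estimate. This is only a sketch: it assumes the 114 W peak is sustained for the whole generation, so the result is an upper bound, and the function name is made up for illustration.

```python
# Figures reported in the post. Peak power is used, so this is an upper bound.
peak_power_watts = 114.0     # reported peak draw
decode_tok_per_sec = 81.0    # reported average generation speed


def joules_per_token(power_watts: float, tok_per_sec: float) -> float:
    """Watts are joules per second, so dividing by tokens per second gives joules per token."""
    return power_watts / tok_per_sec


energy = joules_per_token(peak_power_watts, decode_tok_per_sec)
print(f"~{energy:.2f} J per token at peak draw")
```

That works out to roughly 1.4 J per generated token at peak draw.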


60 Comments


fisherwei@reddit

Could you try running Gemma 31B BF16 via omlx, and then benchmark its prompt processing (PP) and token generation (TG) performance with a context window of approximately 32K–64K? As far as I know, omlx is currently the fastest framework available on Apple Silicon. [https://huggingface.co/mlx-community/gemma-4-31b-bf16](https://huggingface.co/mlx-community/gemma-4-31b-bf16) [https://github.com/jundot/omlx](https://github.com/jundot/omlx) BTW: omlx comes with a built-in benchmarking feature.


Bderken@reddit (OP)

Setting all that up now. Thanks for linking oMLX, I had never seen it before.


cryingneko@reddit

oMLX just updated to 0.3.3. If you're going to use Gemma 4, I'd recommend using the updated version. [https://github.com/jundot/omlx/releases/tag/v0.3.3](https://github.com/jundot/omlx/releases/tag/v0.3.3)


310dweller@reddit

Curious if you have heard anything about oMLX not correctly reporting the context limit? I'm having issues with Hermes: left in auto context-limit detection mode, it falls back to an ultra-low context window, despite my setting both the model and server-wide limits to 64k in oMLX. Aside from that, holy shit, what an incredible thing you have created!!


Bderken@reddit (OP)

https://preview.redd.it/eouv9sdj79tg1.png?width=1536&format=png&auto=webp&s=b0b67ea6f266cf3fb47aa12abc93b1b49d6274a5

Here's the results on the oMLX website:

### Gemma 31B bf16 on M5 Max (128GB): OMLX Benchmark Summary

**Short context performance**

- ~7 tok/s decode up to 4k context
- ~550–680 tok/s prefill

Solid for a 31B bf16 model on a laptop.

**Scaling with context**

- Decode drops as expected: 7 → 4.9 (16k) → 4.0 (32k) → 2.5 (64k) → 1.2 (200k)
- No abnormal bottlenecks; standard attention cost behavior.

**Memory is the real constraint**

- ~60 GB base (model)
- ~89 GB at 200k context
- Swap occurs even on 128GB with large context + batching

KV cache dominates, not weights.

**Batching**

- 1x → 7 tok/s
- 4x → 17.1 tok/s (~2.4x)
- 8x → 27.9 tok/s (~4x)

Good scaling, but latency increases significantly. 2x result likely impacted by swap.

**Long context reality**

- 131k: ~7.8 min TTFT
- 200k: ~17 min TTFT

Technically works, but not practical.

**Conclusion**

Strong performance up to ~16k context. Beyond that, memory pressure (not compute) becomes the limiting factor. This setup is viable for local 30B inference, but not for extreme
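The batching and long-context figures in this summary can be cross-checked with a little arithmetic. The sketch below is not part of the oMLX tooling, and the function names are made up for illustration. It assumes "131k" and "200k" mean 131,000 and 200,000 tokens, and that time to first token (TTFT) is dominated by prefill.

```python
# Aggregate decode throughput (tok/s) by batch size, copied from the summary.
aggregate_tok_s = {1: 7.0, 4: 17.1, 8: 27.9}


def batch_scaling(aggregate: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {batch: (speedup vs. batch 1, tok/s seen by each request)}."""
    base = aggregate[1]
    return {b: (rate / base, rate / b) for b, rate in aggregate.items()}


def effective_prefill_tok_s(context_tokens: int, ttft_minutes: float) -> float:
    """Average prefill rate implied by a time-to-first-token measurement."""
    return context_tokens / (ttft_minutes * 60.0)


scaling = batch_scaling(aggregate_tok_s)
for batch, (speedup, per_request) in scaling.items():
    print(f"{batch}x: {speedup:.1f}x aggregate, {per_request:.1f} tok/s per request")

# ASSUMPTION: "131k" and "200k" are treated as 131,000 and 200,000 tokens.
for ctx, ttft in [(131_000, 7.8), (200_000, 17.0)]:
    rate = effective_prefill_tok_s(ctx, ttft)
    print(f"{ctx} tokens: ~{rate:.0f} tok/s effective prefill")
```

Under these assumptions, each request at 8x batching sees about 3.5 tok/s instead of 7, which is the latency cost mentioned under Batching. The implied long-context prefill rate (roughly 280 and 196 tok/s) is well below the 550–680 tok/s measured at short context.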


fisherwei@reddit

Thank you very much for the benchmarking; I hope Apple finds a way to improve MLX performance. Otherwise, Macs will be unable to deploy dense models of this scale.


Bderken@reddit (OP)

https://preview.redd.it/5eflo9os79tg1.png?width=1536&format=png&auto=webp&s=01ff3a66cbdc9ab415b9df21c44254b9e34dc81a


br_web@reddit

My M1 Max 64GB gives me 40 t/s. To me it's not worth investing $6K+ for double the performance; I'd need at least 4x to justify that investment.


Historical-Curve-235@reddit

Especially since once you have a large context, these machines become unusable.


vivekpola@reddit

The full MoE model? Can you give me more details? Have you tried to run the 31b model?


New-Ad6482@reddit

What can I run on M4 Pro 16GB? Will Gemma 4 run?


ps-73@reddit

E4B runs not too badly on my base 16GB M1. Prompt processing is pretty shit though.


Fit-Horse-3100@reddit

Sure, but I don't think this one is worth trying on 16GB: context will be around 4k-6k, which is too small. On 24GB you get 32k-48k, and you can increase that at the cost of token speed (my opinion).


ShelZuuz@reddit

That's pretty good. I average around 61 t/s on an M1 Ultra 128 GB with that model. And around 180 t/s on a 5090.


someone_12321@reddit

99 t/s on 3090. 128k context


PricePerGig@reddit

Which model did you use please? Quant etc.


ShelZuuz@reddit

That is awesome!


spaceman_@reddit

What quants are you guys talking about?


someone_12321@reddit

Sorry: q4 XL, with fp8 for the KV cache.


shveddy@reddit

Can’t speak for everyone else here, but I was using the 26b GGUF at q8 on an M1 Ultra with 128GB and getting 60 tok/s at the beginning, and then it dropped to about 52 with large contexts (I was asking it to analyze images and write Python scripts to visualize its results, all in one prompt).


Bderken@reddit (OP)

It’s wild how the laptop chip finally passed the M1 Ultra


Chilalala@reddit

I get around 50 tok/s on my M1 Max for Gemma 4 26b a4b, and around 8 tok/s for the 31b model.


PapaRizkallah@reddit

Assuming this is a GGUF because MLX support for Gemma 4 isn’t in LM Studio yet, right?


Bderken@reddit (OP)

Yes, exactly! There is this model; I haven't done any research and don't know what it is, but I will test it. Downloading a buttload of models rn: https://preview.redd.it/2bd6n5ysawsg1.png?width=1276&format=png&auto=webp&s=72c142e4a72b040835c189b665575b01f89f4aff


alitadrakes@reddit

Test 31b and let us know please


Some_Ad_6784@reddit

The 31b is tricky to run in LM Studio. I have a 64GB M4 Max and it will freeze the whole system due to memory exhaustion. I have to keep the context length around 24K and set the K and V cache quantization to q4. If I don't, system memory is exhausted on load. The 26b model does not exhibit this behavior. I'm getting around 8 tokens/s using opencode.
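To see why context length and KV cache quantization matter so much here, a back-of-the-envelope estimate helps. The sketch below uses the standard formula for a full-attention KV cache. The layer count, KV head count, and head dimension are HYPOTHETICAL placeholders chosen for illustration, not the real Gemma 4 31B configuration, and the q4 figure ignores the small overhead of quantization scales.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_element: float) -> float:
    """Size of a full-attention KV cache in GiB.

    Every token stores one key and one value vector per layer and per KV head,
    hence the leading factor of 2.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


# HYPOTHETICAL architecture, for illustration only (not Gemma 4's real config).
LAYERS, KV_HEADS, HEAD_DIM = 60, 16, 128
CONTEXT = 24 * 1024  # the ~24K context mentioned in the comment

fp16 = kv_cache_gib(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, 2.0)  # 16-bit cache
q4 = kv_cache_gib(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, 0.5)    # ~4 bits per element
print(f"fp16: {fp16:.2f} GiB, q4: {q4:.2f} GiB")
```

With these placeholder numbers, the cache shrinks from 11.25 GiB to about 2.8 GiB when going from 16-bit to 4-bit, and it grows linearly with context length. Models that mix in sliding-window attention layers would need less than this formula suggests.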


Bderken@reddit (OP)

Tried it and got 8 tok/sec, but it didn't have thinking mode, and I couldn't figure out the profile and all that in LM Studio right now. I'll need more time, or the LM Studio preset needs to mature for this model. I'm not smart enough to get these results fast.


alitadrakes@reddit

Wait, I noticed that too. So 26b is thinking and 31b isn't? I could be wrong.


Bderken@reddit (OP)

I think the preset for these isn’t set up? Idk tho. My Gemma preset works for 26b but not 31b.


Bderken@reddit (OP)

Downloading right now! It’s 62GB so I have some time (internet is like 650 Mbps max rn)
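For a sense of what "some time" means, a quick estimate from the two numbers above. It assumes decimal gigabytes and a link that sustains its full rate, so real downloads will take longer; the function name is made up for illustration.

```python
def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Minutes to transfer size_gb (decimal gigabytes) at link_mbps (megabits per second)."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / link_mbps / 60


eta = download_minutes(62, 650)
print(f"~{eta:.0f} minutes at full line rate")
```

That gives about 13 minutes as a best case.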


Original_Finding2212@reddit

MLX supports NVFP4, which should give the best results.


Bderken@reddit (OP)

It fails to load for some reason. Will try again later


Fit-Horse-3100@reddit

LM Studio won't work with Gemma 4 26B on my MacBook M4 Pro 24GB. I think this happens because of macOS 15.7.2, but I'm not sure. Can you describe your experience with this kind of problem? The error is: "This message contains no content. The AI has nothing to say."


Bderken@reddit (OP)

I mean why not just update? Highly doubt they’re testing LM Studio on macOS 15. It loaded and worked fine for me. The MLX version doesn’t work at all for me tho


Fit-Horse-3100@reddit

I can't adapt to changes, and I love this version. If that's the cause, is it a problem with LM Studio or with the llama.cpp core itself?


Bderken@reddit (OP)

Can’t adapt to changes, but you’ll troubleshoot and change the way you use AI all the time? Brother.


ClydeDroid@reddit

Have you tried Qwen3.5-122B-A10B yet? I’d be interested to see how fast the 4 bit mlx version runs on your hardware: https://huggingface.co/mlx-community/Qwen3.5-122B-A10B-4bit


Bderken@reddit (OP)

I haven’t yet but I will!


jay-mini@reddit

15 tok/s on a random laptop with 32GB of RAM.


Bderken@reddit (OP)

Nice!


equatorbit@reddit

How much RAM does the MBP have?


Bderken@reddit (OP)

If you see on the bottom right, I have the 128GB option.


elie2222@reddit

How much ram does your machine have?


Ill_Barber8709@reddit

You can see it in the screenshot. 128GB physical memory.


Bderken@reddit (OP)

Let me know if there’s another model you want me to try and what to ask it (ANY MODEL ANY QUESTION)


dash_bro@reddit

Does it have Qwen's overthinking problem or no? I really like using Qwen 27B (dense) for synthetic data gen at 80k context (I know, it's a large context length), but the overthinking and speed really put me off. Running q4 with LM Studio on an M4 Max (128GB RAM).


Bderken@reddit (OP)

I think the Nvidia 80-90GB model is the best thinker. It thinks like the frontier models. But I haven’t tested it enough to conclude that it doesn’t overthink. In my initial testing it seems fine for a small model.


ShelZuuz@reddit

Qwen3-Coder-Next / Qwen3.5-122B-A10B: >


sammcj@reddit

https://omlx.ai/benchmarks?chip=&chip_full=M5%7CMax%7C40&model=122&quantization=&context=&pp_min=&tg_min=


Bderken@reddit (OP)

Will try that in the AM


Kawaiiwaffledesu@reddit

31b?


atmafatte@reddit

Is Gemma trained for tool calling?


hoantv1990@reddit

I tried using **Picoclaw** with **Gemma 4 N2B**. I expected **2 turn tool calls**, but **Qwen 3.5 4B** can handle **4 turns** for the same question and prompt. This is not sufficient for me.


Bderken@reddit (OP)

Seems like most of the latest models like Kimi 3 are. But I don't trust them after the stuff I have seen with openclaw setups. Not the best test but that's the litmus test for me.


illforgetsoonenough@reddit

Yes


Citadel_Employee@reddit

How do you like the quality? Is the intelligence a noticeable jump from other models of similar size?


SignificanceBest3073@reddit

I've tried it and it's really good. In fact I use it a lot more than ChatGPT or Claude now.


Bderken@reddit (OP)

Honestly I don't test for that; I don't have a good system. I always test with this question: "How does DLSS work and how does it take so little VRAM?" Most models are able to spit that out fine. I mainly test these local models for other things like small document summarization, data extraction, and stuff like that for bigger tools. I have systems for testing that for the tools I develop. The model I have been surprised by is Nemotron 3 Super 120B. Obviously a much bigger model.


ComfortablePlenty513@reddit

how is it with long contexts?


Bderken@reddit (OP)

I don’t have a good test for that myself, as far as accuracy goes. But I asked it how DLSS works and all that, and the output was good. It did use a lot of context for the output, though, which is fine for this one chat and one question.
