TheaterFire

Gemma 4 26b a4b - MacBook Pro M5 MAX. Averaging around 81tok/sec

Posted by Bderken@reddit | LocalLLaMA | View on Reddit | 60 comments

Pretty fast! Uses around 114watts at its peak, short bursts as the response is usually pretty fast.


fisherwei@reddit

Could you try running Gemma 31B BF16 via omlx, and then benchmark its PP and TG performance with a context window of approximately 32K–64K? As far as I know, omlx is currently the fastest framework available on Apple Silicon. [https://huggingface.co/mlx-community/gemma-4-31b-bf16](https://huggingface.co/mlx-community/gemma-4-31b-bf16) [https://github.com/jundot/omlx](https://github.com/jundot/omlx) BTW: omlx comes with a built-in benchmarking feature.
View on Reddit #82454072

Bderken@reddit (OP)

Setting all that up now. Thanks for linking oMLX, never seen that before.
View on Reddit #82485343

cryingneko@reddit

oMLX just updated to 0.3.3. If you're going to use Gemma 4, I'd recommend using the updated version. [https://github.com/jundot/omlx/releases/tag/v0.3.3](https://github.com/jundot/omlx/releases/tag/v0.3.3)
View on Reddit #82488854

310dweller@reddit

Curious if you have heard anything about oMLX not correctly reporting context limit? Having issues with Hermes left in auto-context limit detection mode and failing out into an ultra low context window, despite setting both model and server-wide limits at 64k in oMLX. Aside that, holy shit what an incredible thing you have created!!
View on Reddit #86023144

Bderken@reddit (OP)

https://preview.redd.it/eouv9sdj79tg1.png?width=1536&format=png&auto=webp&s=b0b67ea6f266cf3fb47aa12abc93b1b49d6274a5

Here's the results on the oMLX website:

### Gemma 31B bf16 on M5 Max (128GB): OMLX Benchmark Summary

**Short context performance**
- ~7 tok/s decode up to 4k context
- ~550–680 tok/s prefill

Solid for a 31B bf16 model on a laptop.

**Scaling with context**
- Decode drops as expected: 7 → 4.9 (16k) → 4.0 (32k) → 2.5 (64k) → 1.2 (200k)
- No abnormal bottlenecks; standard attention cost behavior.

**Memory is the real constraint**
- ~60 GB base (model)
- ~89 GB at 200k context
- Swap occurs even on 128GB with large context + batching

KV cache dominates, not weights.

**Batching**
- 1x → 7 tok/s
- 4x → 17.1 tok/s (~2.4x)
- 8x → 27.9 tok/s (~4x)

Good scaling, but latency increases significantly. 2x result likely impacted by swap.

**Long context reality**
- 131k: ~7.8 min TTFT
- 200k: ~17 min TTFT

Technically works, but not practical.

**Conclusion**
Strong performance up to ~16k context. Beyond that, memory pressure (not compute) becomes the limiting factor. This setup is viable for local 30B inference, but not for extreme
View on Reddit #82492986
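
The ratios in the benchmark summary above can be sanity-checked with a few lines of Python. This is only a rough sketch built from the numbers quoted in that comment, not output from oMLX itself. The function names are made up for illustration, "131k" and "200k" are assumed to mean 131,000 and 200,000 tokens, and the per-token memory figure assumes all growth beyond the ~60 GB base is context-related, which the comment does not confirm.

```python
# Figures copied from the benchmark summary; everything derived from them is an estimate.
DECODE_TOK_S = {1: 7.0, 4: 17.1, 8: 27.9}    # batch size -> aggregate decode tok/s
TTFT_MINUTES = {131_000: 7.8, 200_000: 17.0}  # prompt tokens (assumed exact) -> TTFT in minutes
BASE_GB, FULL_GB, FULL_CTX = 60.0, 89.0, 200_000


def batch_speedup(batch: int) -> float:
    """Aggregate throughput relative to a single stream."""
    return DECODE_TOK_S[batch] / DECODE_TOK_S[1]


def per_stream_tok_s(batch: int) -> float:
    """Throughput each individual request sees when `batch` requests share the GPU."""
    return DECODE_TOK_S[batch] / batch


def effective_prefill_tok_s(prompt_tokens: int) -> float:
    """Average prefill rate implied by the reported time to first token."""
    return prompt_tokens / (TTFT_MINUTES[prompt_tokens] * 60)


def approx_kb_per_token() -> float:
    """Memory growth per context token, assuming all growth over the base is context-related."""
    return (FULL_GB - BASE_GB) * 1e9 / FULL_CTX / 1e3


if __name__ == "__main__":
    for batch in (4, 8):
        print(f"{batch}x: {batch_speedup(batch):.2f}x total, "
              f"{per_stream_tok_s(batch):.2f} tok/s per stream")
    for ctx in TTFT_MINUTES:
        print(f"{ctx} tokens: ~{effective_prefill_tok_s(ctx):.0f} tok/s effective prefill")
    print(f"~{approx_kb_per_token():.0f} KB of extra memory per context token")
```

The per-stream figure is what backs up the "latency increases significantly" remark: aggregate throughput goes up with batching, but each individual request decodes more slowly than it would alone.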

fisherwei@reddit

Thank you very much for the benchmarking; I hope Apple finds a way to improve MLX performance. Otherwise, Macs will be unable to deploy dense models of this scale.
View on Reddit #82523360

Bderken@reddit (OP)

https://preview.redd.it/5eflo9os79tg1.png?width=1536&format=png&auto=webp&s=01ff3a66cbdc9ab415b9df21c44254b9e34dc81a
View on Reddit #82493056

br_web@reddit

My M1 Max 64GB gives me 40 t/s. To me it's not worth investing $6K+ for double the performance; I'd need at least 4x to justify that investment.
View on Reddit #82389233

Historical-Curve-235@reddit

Especially since once you have a large context, the machines are unusable.
View on Reddit #83974015

vivekpola@reddit

The full MoE model? Can you give me more details? Have you tried to run the 31b model?
View on Reddit #82413778

New-Ad6482@reddit

What can I run on M4 Pro 16GB? Will Gemma 4 run?
View on Reddit #82342899

ps-73@reddit

E4B runs on my 16gb M1 base not too bad. Prompt processing is pretty shit though
View on Reddit #83850220

Fit-Horse-3100@reddit

Sure, but I don't think this one is worth trying on 16GB: context will be around 4k-6k, which is too small. On 24GB you get 32k-48k, and you can increase that at the cost of token speed (my opinion).
View on Reddit #82355274

ShelZuuz@reddit

That's pretty good. I average around 61 t/s on an M1 Ultra 128 GB with that model. And around 180 t/s on a 5090.
View on Reddit #82329103

someone_12321@reddit

99 t/s on 3090. 128k context
View on Reddit #82613805

PricePerGig@reddit

Which model did you use please? Quant etc.
View on Reddit #83542129

ShelZuuz@reddit

That is awesome!
View on Reddit #82621531

spaceman_@reddit

What quants are you guys talking about?
View on Reddit #82330805

someone_12321@reddit

Sorry q4 xl with 8fp for kv
View on Reddit #82994112

shveddy@reddit

Can’t speak for everyone else here, but I was using 26b GGUF q8 on a m1 ultra with 128gb and getting 60 per sec at the beginning, and then it dropped to about 52 with large contexts (was asking it to analyze images and write python scripts to visualize its results all in one prompt)
View on Reddit #82332715

Bderken@reddit (OP)

It’s wild how the laptop chip finally passed the M1 Ultra
View on Reddit #82330496

Chilalala@reddit

i get around 50tok/s on my m1 max for gemma 4 26b a4b, and around 8t/s for the 31b model
View on Reddit #82621081

PapaRizkallah@reddit

Assuming this is a GGUF because MLX support for Gemma 4 isn’t in LM Studio yet, right?
View on Reddit #82326808

Bderken@reddit (OP)

Yes exactly! There is this model but I haven't done any research and don't know what it is but will test it. Downloading a butt load of models rn https://preview.redd.it/2bd6n5ysawsg1.png?width=1276&format=png&auto=webp&s=72c142e4a72b040835c189b665575b01f89f4aff
View on Reddit #82327267

alitadrakes@reddit

Test 31b and let us know please
View on Reddit #82327482

Some_Ad_6784@reddit

The 31b is tricky to run in LMstudio. I have a 64GB M4 Max and it will freeze the whole system due to memory exhaustion. I have to keep the context length around 24K and the K and V cache quant size at q4. If I don't do this the system memory is consumed on load. The 26b model does not exhibit this behavior. Getting around 8 tokens/s using opencode.
View on Reddit #82404047
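
For a feel of why context length and K/V cache quantization matter so much here, the usual back-of-the-envelope estimate for a plain transformer KV cache is sketched below. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's real configuration, and the 0.5 bytes per element for q4 ignores the per-block scale overhead that real quantized formats carry. It also ignores any architecture-specific savings such as sliding-window layers.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: float) -> float:
    """Estimate KV cache size in GiB for a standard attention stack."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the scaling.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    ctx = 24 * 1024  # roughly the 24K context mentioned above
    for label, width in (("f16", 2.0), ("q8 (approx.)", 1.0), ("q4 (approx.)", 0.5)):
        size = kv_cache_gib(context_tokens=ctx, bytes_per_element=width, **shape)
        print(f"{label}: {size:.3f} GiB")
```

The estimate is linear in both context length and bytes per element, so halving either one halves the cache. That is consistent with the workaround described above of capping context and dropping the cache to q4.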

Bderken@reddit (OP)

Tried it, got 8tok/sec but it didn't have thinking mode and I couldn't figure out the profile and all that in LM Studio right now, will need more time or for the LM Studio preset to mature for this model. Not smart enough to get these results fast.
View on Reddit #82328803

alitadrakes@reddit

Wait i noticed that too, so 26b is thinking and 31b isnt? I could be wrong
View on Reddit #82328865

Bderken@reddit (OP)

I think the preset for these isn’t set up? Idk tho. My Gemma preset works for 26b but not 31b
View on Reddit #82329100

Bderken@reddit (OP)

Downloading right now! It’s 62GB so I have some time (internet is like 650mbps max rn)
View on Reddit #82327803

Original_Finding2212@reddit

MLX supports NVFP4, which should give the best results
View on Reddit #82330906

Bderken@reddit (OP)

It fails to load for some reason. Will try again later
View on Reddit #82335468

Fit-Horse-3100@reddit

LM Studio won't work with Gemma 4 26B on my MacBook M4 Pro 24GB. I think this happens because of macOS 15.7.2, but I'm not sure. Can you describe your experience with this kind of problem? "This message contains no content. The AI has nothing to say."
View on Reddit #82364586

Bderken@reddit (OP)

I mean why not just update? Highly doubt they’re testing LM Studio on macOS 15. It loaded and worked fine for me. The MLX version doesn’t work at all for me tho
View on Reddit #82365405

Fit-Horse-3100@reddit

Can't adapt to changes, love this version. If so, is it a problem with LM Studio or with the llama.cpp core itself?
View on Reddit #82371565

Bderken@reddit (OP)

Can’t adapt to changes, but you’ll troubleshoot and change the way you use AI all the time? Brother
View on Reddit #82371824

ClydeDroid@reddit

Have you tried Qwen3.5-122B-A10B yet? I’d be interested to see how fast the 4 bit mlx version runs on your hardware: https://huggingface.co/mlx-community/Qwen3.5-122B-A10B-4bit
View on Reddit #82366636

Bderken@reddit (OP)

I haven’t yet but I will!
View on Reddit #82368271

jay-mini@reddit

15 tok/s on a random laptop with 32 GB RAM.
View on Reddit #82340805

Bderken@reddit (OP)

Nice!
View on Reddit #82359737

equatorbit@reddit

How much RAM does MBP have?
View on Reddit #82357153

Bderken@reddit (OP)

If you see on the bottom right, I have the 128GB option.
View on Reddit #82359705

elie2222@reddit

How much ram does your machine have?
View on Reddit #82342176

Ill_Barber8709@reddit

You can see it in the screenshot. 128GB physical memory.
View on Reddit #82346177

Bderken@reddit (OP)

Let me know if there’s another model you want me to try and what to ask it (ANY MODEL ANY QUESTION)
View on Reddit #82325890

dash_bro@reddit

Does it have qwen's overthinking problem or no? I really like using qwen27B (dense) for synthetic data gen @80k context (i know, it's a large context length) but the overthinking and speed really puts me off. Running q4 with lmstudio on M4 Max (128G RAM)
View on Reddit #82335344

Bderken@reddit (OP)

I think the nvidia 80-90GB model is the best thinker. It thinks like the frontier models. But I haven’t tested it out enough to conclude it doesn’t over think. My initial testing it seems fine for a small model.
View on Reddit #82336158

ShelZuuz@reddit

Qwen3-Coder-Next / Qwen3.5-122B-A10B: >
View on Reddit #82329696

sammcj@reddit

https://omlx.ai/benchmarks?chip=&chip_full=M5%7CMax%7C40&model=122&quantization=&context=&pp_min=&tg_min=
View on Reddit #82330836

Bderken@reddit (OP)

Will try that in the AM
View on Reddit #82330508

Kawaiiwaffledesu@reddit

31b?
View on Reddit #82326368

atmafatte@reddit

Is Gemma trained for tool calling?
View on Reddit #82326382

hoantv1990@reddit

I tried using **Picoclaw** with **Gemma 4 N2B**. I expected **2 turn tool calls**, but **Qwen 3.5 4B** can handle **4 turns** for the same question and prompt. This is not sufficient for me.
View on Reddit #82335649

Bderken@reddit (OP)

Seems like most of the latest models like Kimi 3 are. But I don't trust them after the stuff I have seen with openclaw setups. Not the best test but that's the litmus test for me.
View on Reddit #82326741

illforgetsoonenough@reddit

Yes
View on Reddit #82326582

Citadel_Employee@reddit

How do you like the quality? Is the intelligence a noticeable jump from other models of similar size?
View on Reddit #82326139

SignificanceBest3073@reddit

I've tried it and its really good. In fact I use it a lot more than chatgpt or claude now.
View on Reddit #82335591

Bderken@reddit (OP)

Honestly I don't test for that. I don't have a good system. I always test with this question: "How does DLSS work and how does it take so little VRAM". Most models are able to spit that out fine. I mainly test these local models for other things like small document summarization, data extraction, and stuff like that for bigger tools. I have systems for testing that for the tools I develop. The model I have been surprised by is Nemotron 3 Super 120B. Obviously a much bigger model.
View on Reddit #82326863

ComfortablePlenty513@reddit

how is it with long contexts?
View on Reddit #82328699

Bderken@reddit (OP)

I don’t have a good test for that myself, as far as accuracy goes. But I asked it how DLSS works and all that, and the output was good. It did use a lot of context for the output, though, which is fine for this one chat and one question
View on Reddit #82329174