Gemma4 26B A4B runs easily on 16GB Macs

Posted by FenderMoon@reddit | LocalLLaMA | 57 comments

Typically, models in the 26B class are difficult to run on 16GB Macs, because GPU acceleration requires the accelerated layers to sit entirely within wired memory. It's possible with aggressive quants (2-bit, or maybe a very lightweight IQ3_XXS), but quality degrades significantly.

However, if the model is run entirely on the CPU instead (which is much more feasible with MoE models), it's possible to use really good quants even when the model ends up larger than the entire available system RAM. There is some performance loss from swapping experts in and out, but I find it's much smaller than I would have expected.
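To see why this works, it helps to compare the total weight footprint with the weights actually touched per token. The numbers below are rough assumptions read off the model name (26B total parameters, ~4B active per token) together with an assumed ~4.5 effective bits per weight for a typical 4-bit GGUF quant; a back-of-the-envelope sketch:

```python
# Rough memory-footprint estimate for a 26B-total / ~4B-active MoE model.
# All parameter counts and bits-per-weight here are illustrative
# assumptions, not measured values.

def gguf_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate in-memory size of a quantized model in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total_gb = gguf_size_gb(26)   # full model: every expert's weights
active_gb = gguf_size_gb(4)   # weights actually used per token (~A4B)

print(f"total quantized weights:  ~{total_gb:.1f} GB")
print(f"active weights per token: ~{active_gb:.1f} GB")

# The full ~14-15 GB can't all be wired for the GPU on a 16GB Mac,
# but per token the CPU only reads a small slice of the weights, so
# the OS can page inactive experts in and out of the file cache.
```

This is why the swapping penalty is smaller than intuition suggests: most memory pressure comes from experts that simply aren't touched on a given token.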

I was able to easily achieve 8-10 tps with an 8-16K context window on my M2 MacBook Pro, using a good 4-bit quant. Far from fast, but good enough to be perfectly usable for folks used to running local models on this kind of hardware.

It's still a lot slower than using smaller models or more aggressive quants, but if 6-10 tps is tolerable, it might be worth trying. It runs quite well on mine.
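To get a feel for what 6-10 tps means in wall-clock terms, here's a quick sketch for a few response lengths (the token counts are illustrative assumptions, not measurements):

```python
# Wall-clock decode time at a given tokens-per-second rate.
def gen_seconds(tokens: int, tps: float) -> float:
    return tokens / tps

# short reply, long reply, small document (illustrative lengths)
for tokens in (150, 500, 1500):
    fast = gen_seconds(tokens, 10)  # best case from the post
    slow = gen_seconds(tokens, 6)   # worst case from the post
    print(f"{tokens:>5} tokens: {fast:.0f}-{slow:.0f} s")
```

So a typical answer lands in the tens of seconds, which is slow but workable if you're not generating very long outputs.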

Thinking fix for LMStudio:

Also, for fellow LM Studio users: none of the currently published quants have thinking enabled by default, even though the model supports it. To enable it, go into the model settings and add the following line at the very top of the Jinja prompt template (under the Inference tab):

{% set enable_thinking=true %}

Also change the reasoning parsing strings:

Start string: <|channel>thought

End string:

(Credit for this fix goes to @Guilty_Rooster_6708; I didn't come up with it myself, and I've linked to the post I got it from.)