Gemma4 26B A4B runs easily on 16GB Macs
Posted by FenderMoon@reddit | LocalLLaMA | View on Reddit | 57 comments
Typically, models in the 26B-class range are difficult to run on 16GB Macs because any GPU acceleration requires the accelerated layers to sit entirely within wired memory. It's possible with aggressive quants (2 bits, or maybe a very lightweight IQ3_XXS), but quality degrades significantly by doing so.
However, if run entirely on the CPU instead (which is much more feasible with MoE models), it's possible to run really good quants even when the models end up being larger than the entire available system RAM. There is some performance loss from swapping in and out experts, but I find that the performance loss is much less than I would have expected.
I was able to easily achieve 8-10 tps with a context window of 8-16K on my M2 MacBook Pro. This was on a good 4-bit quant. Far from fast, but good enough to be perfectly usable for folks used to running on this kind of hardware.
Still a lot slower than using smaller models or more aggressive quants, but if 6-10tps is tolerable, it might be worth trying. It runs quite well on mine.
Thinking fix for LMStudio:
Also, for fellow LMStudio users, none of the currently published versions have thinking enabled by default, even though the model supports it. To enable it, go into the model settings and add the following line at the very top of the Jinja prompt template (under the inference tab).
{% set enable_thinking=true %}
Also change the reasoning parsing strings:
Start string: <|channel>thought
End string:
(Credit for this fix goes to @Guilty_Rooster_6708; I didn't come up with it, I've linked to the post I got it from.)
mrkouhadi@reddit
Any ideas if it can run fast enough on my Intel Xeon CPU (16 cores) with 26GB RAM?
chicky-poo-pee-paw@reddit
Think this will work with Qwen 3.5 122B A10B with a 64GB Mac Studio M2? I am not sure I want to waste the bandwidth on the download. Think the performance would be worth the size compared with the 27B or 35B?
FenderMoon@reddit (OP)
I don't have a 64GB system to test this on, but in theory, yes, it should work absolutely fine so long as you leave "use mmap" turned on and uncheck "keep model in memory" (this allows the system to stream experts straight from disk rather than constantly swapping in and out if the model is larger than the available system RAM).
Frankly I don't see any reason you shouldn't very easily be able to get that one to run on a 4 bit quant. The only difficulty you might have is that you might need to raise the wired memory limit if you want to use more than 48GB of your RAM for VRAM, so if you want GPU acceleration, I'd look into how to do that.
Again I don't have a system to test this on, but I don't see any reason you shouldn't very easily be able to get this to work.
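For anyone reproducing this with llama.cpp instead of LMStudio, the equivalent of "use mmap" is llama.cpp's default mmap loading. A hedged sketch (the model filename is a placeholder, and flag spellings can vary between llama.cpp versions):

```shell
# mmap is on by default, so weights are demand-paged from disk instead of
# being fully resident; avoid --mlock, which would pin everything in RAM.
# -ngl 0 keeps all layers on the CPU, matching the setup described above.
llama-cli -m gemma4-26b-a4b-IQ4_NL.gguf -ngl 0 -c 8192 -p "Hello"
```

Passing --no-mmap would force a full load up front, which defeats the purpose here.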
Olbas_Oil@reddit
Everyone in this thread might be interested in this:
https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026
FenderMoon@reddit (OP)
This is a great read! My guess is that mmap is why this works so well on MoE models.
Curious how 17 tokens per second was achieved on A35B. I got nowhere near that on my system when I tried it, though admittedly I was almost certainly using higher quants.
Olbas_Oil@reddit
I considered posting it, but I'm not the author of the blog. I just tried it with simple curl commands via the CLI, nothing heavy at all, and it's giving 15.9 tok/s. It also goes through a thinking phase, so maybe that adds to it. Still new to all this myself.
gnnr25@reddit
I can't believe this fucking worked!
Same system, but tried with llama.cpp
[ Prompt: 2.9 t/s | Generation: 5.2 t/s ]
FenderMoon@reddit (OP)
Yea it works surprisingly well! So well that I had to share it on Reddit, I was like, plenty of people need to know that this is an option. It's probably one of the best models folks with 16GB macs can realistically run.
You'll be able to get it to go even faster by enabling flash attention and KV cache quantization. My test system uses unified KV cache too (not exactly sure what that setting does but I left it enabled).
I used to use Gemma3 27B for this kind of thing (at IQ3_XS). Gemma4 26B at IQ4_NL far outperforms it in output quality, and nails many of the prompts that Gemma3 failed. It's a welcome improvement.
DeepOrangeSky@reddit
Regarding OSS 20B, are you running it the same way (GPU layers set to 0 so it runs on the CPU)? Or are you using the memory-limit trick of raising it so the model can use nearly the full 16GB of unified memory (instead of the default ~65-70%), which just barely gives it enough room to run with a tiny context for one or two replies? (I saw a YouTube video where the guy just barely got it running normal-style by raising the memory limit.)
Also, for this CPU-plus-swap method of running Gemma4, roughly how much swap are we talking per amount of usage? And as far as it being bad for the SSD, is it bad merely in the normal sense of burning through the drive's total lifetime write endurance (which is supposed to be around 150TB before it might die), or is it bad in other ways beyond just eating into the lifetime write amount?
FenderMoon@reddit (OP)
I don't remember needing to do anything special in the past; OSS 20B ran really well. But I tried it just yesterday after updating LMStudio, and for whatever reason I now do need to peg it completely to the CPU.
Likely we have a bug in LMStudio somewhere.
As for swap usage, it's using 1-2GB when I try this on 26B. I'm not sure what the actual writes are during one session. It will hurt the SSD over time, but you'd have to use these models constantly to peg the SSD that hard. They're more resilient than people realize.
I used to use an 8GB system and would peg swap constantly on heavy workloads. We're talking constant 5-10GB swap usage throughout the workday. When I went to check endurance ratings on the SSD, it still had 95% life left after a year of doing this. Unless you're using the model for hours daily, it's likely to be fairly negligible over the life of the SSD.
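As a rough back-of-the-envelope check (the 150TB endurance figure and the high end of the daily swap churn are the numbers from the comments above; swap churn is only a loose proxy for actual writes):

```shell
# Assumed: ~10 GB of swap writes per day, 150 TB rated write endurance.
DAILY_GB=10
ENDURANCE_TB=150
DAYS=$(( ENDURANCE_TB * 1000 / DAILY_GB ))    # days to exhaust endurance
echo "$(( DAYS / 365 )) years of daily use"   # prints: 41 years of daily use
```

Even at sustained heavy use, the write budget lasts far longer than the rest of the machine is likely to.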
gnnr25@reddit
Which quant are you using for OSS 20B? Going to try that next.
FenderMoon@reddit (OP)
Just the official OpenAI one. It's already quantized to 4 bit.
I did end up figuring out how to get it to run on current LMStudio versions again. In the past it would run fine as long as you left a few layers on the CPU, but now partial offloads no longer work unless the model fits entirely in RAM. If you accelerate with the GPU at all, the whole model has to fit.
To do it, I ran the following in the terminal: sudo sysctl iogpu.wired_limit_mb=14400
This raises the max GPU memory limit to a high enough level to allow the model to fit. You probably don't want to leave a bunch of extra stuff open while the model is running. It won't crash the system but it'll slow everything else down.
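A hedged sketch of that workflow in full (the 14400MB value is the one from above; iogpu.wired_limit_mb is the relevant sysctl on Apple Silicon, and exact behavior can vary by macOS version):

```shell
# Check the current GPU wired-memory limit (in MB).
sysctl iogpu.wired_limit_mb

# Raise it so the model plus KV cache fits in GPU-wired memory.
sudo sysctl iogpu.wired_limit_mb=14400

# Setting it back to 0 should restore the system default; a reboot
# also resets it, since the change doesn't persist.
sudo sysctl iogpu.wired_limit_mb=0
```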
Then I just loaded the model with a context length of 8192, all 24 layers offloaded to the GPU, and with KV cache quantization on at Q8_0. I also unchecked "keep KV cache in memory".
Got the official OpenAI OSS-20B to run entirely GPU-accelerated on my system this way. It used to be easier: you just told it to send 17 or 18 of the 24 layers to the GPU and the model would figure out the rest, but now there are more gotchas to getting it to work on 16GB Macs.
gnnr25@reddit
Nice, got another working model. Tried the unsloth gpt oss 20b UD-Q8_K_XL variant in llama.cpp with following
[ Prompt: 43.8 t/s | Generation: 13.1 t/s ]
FenderMoon@reddit (OP)
Nice! That's even faster than it runs on my system. I'm gonna give that quant a try.
GPT-OSS-20B is honestly a fantastic model. For reasoning, coding, difficult tasks, etc, it's frankly about as good as o3-mini was. It's unbelievably good for its size.
For world knowledge I find it doesn't punch above its weight so much; it feels like a 20B model there. Gemma still does better at 27B and 26B, but those are larger models, so that's to be expected.
You might be able to get 26B to run with an IQ3_XS quant too.
gnnr25@reddit
Squeezed some more by adding -ctk q8_0 -ctv q8_0
[ Prompt: 4.6 t/s | Generation: 7.3 t/s ]
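Putting the flags from this thread together, a hypothetical llama.cpp invocation might look like the following (the model filename and context size are placeholders, and the flash-attention flag spelling varies between llama.cpp versions):

```shell
# -fa enables flash attention; -ctk/-ctv quantize the KV cache to Q8_0,
# roughly halving its footprint versus the default f16.
llama-server -m gpt-oss-20b-UD-Q8_K_XL.gguf \
  -c 8192 -fa -ctk q8_0 -ctv q8_0
```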
FenderMoon@reddit (OP)
Nice!
DeepOrangeSky@reddit
u/fendermoon
Regarding this:
Do I only add the "{% set enable_thinking=true %}" line to the very top of the JINJA and then I have to add the "Start string: <|channel>thought" and the "End string:" line somewhere else in the JINJA? Or do I just put all three of those lines stacked one after the other at the top of the JINJA? (even the channel thought lines, too, at the very top, that is)?
FenderMoon@reddit (OP)
So for that first line, yes, you add it as the very first line of the Jinja template.
The start string and end string are under a different section in the model settings. You gotta go to the “reasoning parsing” section for that (at least I believe that’s what it’s called).
DeepOrangeSky@reddit
thanks, I'll give it a try
FenderMoon@reddit (OP)
It should look like this when you're done (pay attention to the reasoning parsing section and the first line of the Jinja prompt template. Those are the only things you need to change).
FenderMoon@reddit (OP)
Also got 31B to work! (Same test system).
Posted a guide on Reddit here.
FenderMoon@reddit (OP)
Update: I got the 31B version to work also (albeit on an anemic IQ3_XXS quant, used Unsloth’s version) using the same trick. Same 16GB M2 Pro system.
Ironically, to get 31B to work I had to leave “keep in memory” checked, whereas 26B only works if you leave it unchecked. Both required running the model entirely on the CPU to avoid generation failures.
Of the two, 26B is still way more usable in terms of speed on this system. And the 31B legitimately hallucinates a lot more on such an anemic quant than the 26B does on a much better 4 bit quant.
vytcus@reddit
How is it with tool calling?
FunConversation7257@reddit
I’m confused, this isn’t running on the SSD or something right?
FenderMoon@reddit (OP)
It still loads into RAM, but the system will swap out experts that aren't being actively used if you uncheck "keep in memory". I find it doesn't swap very much on 4 bit quants. Memory compression takes care of most of it, swap is barely hit.
Performance is surprisingly usable in spite of this as long as the settings are kept sane. We're talking 6-10tps, definitely not "fast" (especially not for an MoE model), but it's legitimately usable for folks who are used to these kinds of speeds.
lambdawaves@reddit
What system is deciding what to swap in/out of disk?
FenderMoon@reddit (OP)
Activity Monitor. It also lets you watch compressed memory stats per process if you enable it in the header.
Not sure what the actual swap-to-disk paging rates are. Swap is used, but seems to be less than I expected. I may do a more scientific test later to see how much data is actually getting shuffled back and forth.
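On macOS the actual paging traffic can be measured from the terminal rather than eyeballed; a sketch using standard macOS tools:

```shell
# Total swap currently in use:
sysctl vm.swapusage

# Cumulative pageins/pageouts since boot; snapshot these before and
# after a generation session and diff them to estimate real disk traffic.
vm_stat | grep -E "Pageins|Pageouts"
```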
lambdawaves@reddit
Oh so you're just letting the kernel swap in-and-out of memory as needed.
You'll see big gains when moving to model-aware pre-caching into RAM. The kernel doesn't know anything about the structure of the program, so it can't do swapping effectively.
FenderMoon@reddit (OP)
Interesting! Need to look into this.
lambdawaves@reddit
They call them "Flash MoE" now
https://github.com/danveloper/flash-moe
https://arxiv.org/abs/2506.04667
FenderMoon@reddit (OP)
Interesting! I'll look more into this. I can't tell if it can be applied to any model or if it has to be compiled for each specific model, but the potential here looks fantastic.
lambdawaves@reddit
As of today, someone has to surgically do it to a model. But I’m sure the formats will be generalized enough eventually that it can be a sort of “execution mode”. That seems like a natural progression from here
dobkeratops@reddit
i thought experts were selected at extremely fine grain , per layer.. surprised that works
FenderMoon@reddit (OP)
I believe they're selected per token in this model, if I'm not mistaken. It works much better than I expected.
My hypothesis is that some experts must be used more frequently than others within any given stretch of the output, so they don't necessarily need to fully swap in and out on every single token.
It does seem to behave this way during generation, as it will often generate several tokens very quickly, stutter slightly on one token for a split second, then generate another fast string of them.
dobkeratops@reddit
i mean it is possible that they could try to bias the training to reduce the amount of experts used across a specific paragraph say.. then that would tend to get them more like actual domain experts. i've not seen any stats on expert utilization
FenderMoon@reddit (OP)
I think you were right about the per-layer thing. This write-up is a little confusing, but it seems to suggest that the MoE routing is in fact per-layer, with one additional shared expert always activated through the whole model.
I've been reading up on that, and I'm really surprised they actually pulled this off if that's what they did.
https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4
My guess is that there are enough statistical patterns to where a percentage of them can be swapped out without murdering performance too badly. The speed decreases to less than half when I go up to a 5 bit quant (trying to throw a 22GB model into 16GB of VRAM versus around ~19GB at Q4_K_M or ~16GB at IQ4_NL). Which suggests it really does need most of the experts in RAM, but there seem to be some it can get away with swapping without them being needed too often.
On 16GB Macs, there seems to be enough margin to load ~20GB models without it being catastrophic. Anything beyond that and you're gonna get like 2-3tps.
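The quant sizes quoted elsewhere in this thread make that margin concrete; a quick sketch of how far each one overshoots a 16GB Mac's physical RAM:

```shell
# Overflow past physical RAM is what has to be compressed or swapped.
# Sizes (GB) are the quant sizes quoted in this thread; RAM is 16 GB.
for entry in IQ4_NL:15.5 Q4_K_M:19 Q5_K_S:22; do
  name=${entry%%:*}; size=${entry##*:}
  awk -v n="$name" -v s="$size" -v r=16 \
    'BEGIN { printf "%-7s %4.1f GB over RAM\n", n, (s > r ? s - r : 0) }'
done
```

IQ4_NL fits outright, Q4_K_M overshoots by about 3GB, and Q5_K_S by about 6GB, which lines up with the sharp slowdown reported at 5 bits.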
Acceptable_Home_@reddit
What's the on-disk size of the quantized model you downloaded? The 18.21GB Q4 of Gemma4 26B A4B uses 22GB of system RAM (out of 23.7GB) and 6GB of VRAM (out of 8GB) on my system at a 48K context window (Windows, latest llama.cpp).
FenderMoon@reddit (OP)
I'm using the Unsloth variants, which are a couple GB smaller.
IQ4_NL (the best variant I've tried so far and the one I would recommend) is about 15.5GB. Runs at about 10tps on a 16GB Mac.
Q4_K_M is about 19GB and also runs at usable speeds, although it has no real quality gain over just using IQ4_NL.
Q5_K_S is 22GB and is much slower. About 2tps.
HealthyCommunicat@reddit
Not sure if you care - but I made a 10gb version that runs just fine https://huggingface.co/JANGQ-AI/Gemma-4-26B-A4B-it-JANG_2L
FenderMoon@reddit (OP)
Interesting! Giving it a try now.
Acceptable_Home_@reddit
Please try llama.cpp directly. I know it isn't as easy as LMStudio for switching models and such, but the speed difference could be worth it.
At least for me with Gemma4 26B A4B, I get about 15 tk/s in LMStudio (latest app version and latest llama.cpp runtime),
while I get 24 tk/s with llama.cpp directly (both at the same context size and other settings).
But I might be wrong since I'm a Windows user with an Nvidia GPU; still, the difference is too great to never try llama.cpp directly.
FenderMoon@reddit (OP)
I'll give it a try. I have it installed on another machine. Super curious to see.
What's the difference between M4 and L2? Might M4 outperform IQ4_NL?
FenderMoon@reddit (OP)
I've tested this using various 4 bit and 5 bit quants. Unsloth's IQ4_NL is probably the best one. Q4_K_M works too, but it's larger with virtually no gain in quality. 5 bits runs, but at less than half the speed.
If you have a 16GB mac, Unsloth's IQ4_NL is probably the one you want. It runs great.
I want to do some more testing on this and see if it can be further optimized somehow, as Gemma models have always been very sensitive to quantization. You usually don't want to run these at any less than 4 bits. The 3 bit ones are nowhere near as good in comparison.
I noticed something interesting when I jumped to 5 bits for testing though. The difference was much less pronounced, but I did notice that it would sometimes answer in slightly more detail when asked about obscure or oddly specific topics. Perhaps more interestingly, it seems that the 4 bit models still do a really good job of not hallucinating information to fill in the gaps, and tend to just be more general in their answers rather than being specific and getting it wrong.
That's somewhat surprising behavior, frankly. Usually the quantized models just start hallucinating if they're tested on "fringe" knowledge. Gemma4 seems to be tuned to not do this.
Confusion_Senior@reddit
What if you keep a few layers on ram instead of zero
Important_Quote_1180@reddit
On my Linux box I run Mixture-of-Experts models like this with as many attention layers on the GPU as possible and all the expert layers in RAM. You also need to keep some room in VRAM for the context's KV cache. The first goal is to get 2-3 experts on the GPU + ALL attention layers + SWU cache + KV. As you get more VRAM you can put more experts on, and it won't spend as much time swapping out the experts. I have to pin experts manually. I use llama.cpp, and Claude Code helped design my dashboard where I optimize for MoE, since I think they're better than dense models for my workflows. I just use Claude for orchestration and OpenAI for research.
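In llama.cpp this kind of split is typically done with tensor-override rules; a hedged sketch (the model filename is a placeholder, and the tensor-name regex depends on the model's GGUF naming, so inspect yours with a tool like gguf-dump before relying on it):

```shell
# Offload everything to the GPU (-ngl 99), then override: keep the
# per-expert FFN tensors on the CPU so only attention and shared
# layers occupy VRAM.
llama-server -m gemma4-26b-a4b-Q4_K_M.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU"
```

To pin some experts on the GPU as well, narrower regexes can route specific layer ranges to CPU while leaving the rest resident in VRAM.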
FenderMoon@reddit (OP)
This might be a better solution than the one I described. Is this on Apple Silicon or is this on x86/PC?
I was unable to get anything like this to work on my Mac and was forced to run entirely on the CPU. My guess is it would easily be 2-3x faster if it's possible to offload some of the experts to the GPU another way.
Important_Quote_1180@reddit
Its Linux, here are the specs:
Ubuntu 24: Radeon 9070 16GB (soon to be a 3090 24GB), 9000-series AMD CPU, 192GB DDR5-6000. I design games and agentic architecture. I need uncensored models to make graphic content, and MoE is amazing for creative and refinement work.
FenderMoon@reddit (OP)
Ah. How many experts do you find you're able to run on the GPU? What quants do you use?
This might influence my decisions as far as what kind of system to build when I go for building a desktop.
Important_Quote_1180@reddit
Usually I need to go with Q4 weights and a Q8 KV cache. Perplexity and speed will both be better with the 3090. For Gemma 4 26B A4B I can get 15 of 30 experts on the GPU with a Q6 quant. 20-25 tok/s is totally usable for tasks, and the context window is 125K, so it's my new daily driver over the 122B A10B, which was only 12-15 tok/s generation. Prompt processing is usually around 2500 tok/s with flash attention.
Confusion_Senior@reddit
This my setup as well but with 3090. Have you tried to run giant models streaming experts from ssd like glm5 q4?
Important_Quote_1180@reddit
No go on the SSD; that's why I need the 192GB of RAM. It's the only thing fast enough for me, but it's theoretically very possible.
FenderMoon@reddit (OP)
There isn't really a way to tell it explicitly "keep this many layers in RAM". What you're doing is you're forcing a certain number of layers to run on the GPU, which in-effect, does keep them in RAM.
However, I find that when the model is too large to load into RAM, even just loading one layer into the GPU causes the entire model to fail to load. I've only ever gotten them to work by running them entirely on the CPU in such cases.
Confusion_Senior@reddit
Thank you
lambdawaves@reddit
What’s the pre-fill rate?
FenderMoon@reddit (OP)
I'm unsure how to measure. Time to first token is only a few seconds on the first prompt. It takes a couple minutes if you saturate a 16K context window.
I'd recommend keeping the context windows around 6-8K to keep it sane.
lambdawaves@reddit
Put in a first short prompt to get the model warmed up in ram.
Then grab a document of about 2000 tokens, paste it in, and ask it to summarize. The prefill rate is then roughly 2000 / TTFT.
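That works out to a one-liner (the 90-second TTFT below is a made-up example, not a measured number):

```shell
# Prefill rate ≈ prompt tokens / time-to-first-token.
# Example: a 2000-token document that takes 90 s to first token.
awk 'BEGIN { printf "%.1f tok/s\n", 2000 / 90 }'   # prints: 22.2 tok/s
```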
Specter_Origin@reddit
and how much context do you get? 1k ?