Questions about running Gemma 4 on Apple Silicon
Posted by TaylorHu@reddit | LocalLLaMA | View on Reddit | 14 comments
Hello all,
Just picked up a used Mac Studio, M1 Ultra, 64GB. Pretty new to running local models. I wanted to play around with Gemma 4 31B through Ollama, but I'm running into some trouble. When I load it my memory usage jumps to ~53GB at idle, and if I try to interact with the model at all the memory peaks and Ollama crashes.
According to this, it should only take ~20GB of memory, so I should have plenty of room: https://ollama.com/library/gemma4
Google's model card, on the other hand, lists it at ~58GB at the full 16-bit precision: https://ai.google.dev/gemma/docs/core
So neither of those lines up exactly with what I'm seeing, though the "official" model card does seem closer. Why the discrepancy, and is there anything in general I should know about running these kinds of models through Ollama?
Konamicoder@reddit
It sounds like you’ve accidentally pulled a high-precision version (like Q8 or FP16) instead of the standard 4-bit quantization. The 20GB estimate refers to 4-bit; if you're at 53GB at idle, you're already hitting the ceiling.
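A quick back-of-envelope check makes both published numbers plausible. This is just a sketch: the 31e9 parameter count comes from the model name, and the bits-per-weight figures are typical for these GGUF formats rather than Gemma-specific.

```python
# Rough weight memory: parameter count * bits per weight / 8.
# bits-per-weight values are typical for each format, not exact:
# q8_0 stores ~8.5 bits/weight, q4_K_M averages ~4.8 bits/weight.
def weight_gib(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1024**3

for label, bits in [("fp16", 16), ("q8_0", 8.5), ("q4_K_M", 4.8)]:
    print(f"{label:7s} ~{weight_gib(31e9, bits):.1f} GiB")
```

fp16 lands right around Google's ~58GB figure, while q4_K_M plus runtime overhead is in the ballpark of Ollama's ~20GB estimate.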
The crash happens because when you interact with the model, the system must allocate additional memory for the KV Cache (the context window). Since you're already near the 64GB limit, that extra allocation pushes you over the edge and triggers the crash.
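To see why the context allocation is what tips you over, here is a rough KV-cache sizing sketch. The layer/head numbers below are illustrative placeholders, not Gemma's actual architecture:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * context_length * bytes per element.
# Hypothetical hyperparameters, for illustration only.
def kv_cache_gib(layers, kv_heads, head_dim, context, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * context * dtype_bytes / 1024**3

# A 48-layer model with 8 KV heads of dim 128, fp16 cache:
print(f"{kv_cache_gib(48, 8, 128, 262144):.1f} GiB")  # huge default context
print(f"{kv_cache_gib(48, 8, 128, 8192):.1f} GiB")    # trimmed context
```

The cache scales linearly with context length, which is why a 262144-token default can eat tens of gigabytes on its own while an 8K context costs comparatively little.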
Try explicitly pulling the 4-bit version to leave some headroom for the context:
`ollama pull gemma4:31b-instruct-q4_K_M`
TaylorHu@reddit (OP)
I just ran it with `ollama run gemma4:31B`. I assumed it would pull the right model matching https://ollama.com/library/gemma4.
But `ollama ps` shows `gemma4:31b 6316f0629137 81 GB 32%/68% CPU/GPU 262144`
Konamicoder@reddit
Assume nothing. You need to specify the precise model you want Ollama to pull.
TaylorHu@reddit (OP)
Where's the best place to find the exact model names?
Konamicoder@reddit
Ollama.com > Models > Gemma4 > View All
HealthyCommunicat@reddit
Use osaurus.ai. It gives roughly a 20% increase in Gemma4 speeds compared to anything else. oMLX, Ollama, LM Studio, etc. all can't compete when it comes to Gemma4.
Ok_Technology_5962@reddit
Correct... but please do not use Ollama on Mac. Even LM Studio is better, though preferably oMLX. Anyway, back to the answer:
There is a bug where the cache takes a ton of space. New patches fix it, but Ollama might be behind. ...
Faktafabriken@reddit
Why not? Could you describe or share a link so that I can learn?
Are there other options that are as easy to connect other tools to?
/running ollama on my Mac Studio
Danfhoto@reddit
I’m with the others:
Llama.cpp is the "engine" that runs inference on GGUF models. Ollama is a program that eases the burden of using Llama.cpp for entry-level folks. The problem is that the main folks behind Ollama have not been great to the community, including the people behind Llama.cpp, without whom they have nothing. On top of that, the version of Llama.cpp that Ollama integrates usually lags pretty far behind Llama.cpp's releases.
LM Studio is closed-source, but they move a bit faster AND they also maintain an MLX engine that uses MLX_LM and MLX_VLM, which is the most native inference available for Mac devices. This means that in LM Studio you get access to stable releases of Llama.cpp faster, plus more flexibility in the models you can use.
At an entry level, I highly recommend LM Studio for serving local models above Ollama.
Ok_Technology_5962@reddit
This is correct.
I just wanted to throw in that there is currently a community push for a serving backend on Mac. oMLX is the GitHub project with a massive push to integrate all the MLX pieces into a GUI easy installer and caching system, plus quantization and benchmarking, and to serve Claude Code, OpenCode, OpenClaw, etc. It's been updating like crazy, but it will still be behind llama.cpp if you need sampling or logit access. For example, I am testing best-of-n with Python wrappers and can't use oMLX for that test, but llama.cpp would work even on Mac. Same with RPC: combining two PCs' memory pools works on llama.cpp. My next test is combining a Mac and a custom PC for a trillion-param model at q8 (since I'm measuring what's lost, an edge use case).
So the learning curve is: LM Studio (lots of guides, very flexible, includes a great model downloader) -> oMLX/JANG/vMLX or some backend for stable use that has turbo-quant caching and is faster -> llama.cpp (depending on needs).
Danfhoto@reddit
This is great to hear! I tried mlx-lm.serve and tools parsing was a big problem. I’ll play with oMLX a bit!
Faktafabriken@reddit
Thank you for taking the time to explain. I will try LM Studio more, then.
I'm running Ollama on my Mac but using it mostly from my Windows laptop via OpenWebUI, in a way Claude told me was smart a few months ago :p
Will ask ChatGPT/Claude if I can use LM Studio the same way as well.
Once again thank you!
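For what it's worth, LM Studio has a built-in local server that speaks the OpenAI-compatible API (port 1234 by default), so OpenWebUI on another machine can point at it much like it points at Ollama. A sketch with assumed values: swap in your Mac's actual LAN address, and use whatever model id `GET /v1/models` reports on your setup.

```shell
# Chat completion against LM Studio's local server; the host IP and
# model id below are placeholders for your own setup.
curl http://192.168.1.20:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-31b-instruct",
        "messages": [{"role": "user", "content": "Hello from my laptop"}]
      }'
```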
chibop1@reddit
You might be loading the model with more context than your memory can handle. Check `ollama ps`. Also check the context setting in the client you're using, as well as in Ollama UI > Settings.
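If you're calling the model over Ollama's HTTP API instead of a GUI, the context cap can also be set per request via `options.num_ctx`. A minimal sketch of the request body for `POST /api/generate`; the model tag and context value are just examples:

```python
import json

# Request body with an explicit, smaller context window instead of
# the model's huge default (which `ollama ps` shows as 262144 here).
payload = {
    "model": "gemma4:31b-instruct-q4_K_M",  # example tag
    "prompt": "Hello",
    "options": {"num_ctx": 8192},  # caps the KV cache allocation
}
print(json.dumps(payload, indent=2))
```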
FoxiPanda@reddit
IMO, after testing on my M3 Ultra Mac Studio, the 31B variant just isn't worth the speed penalty compared to the 26B-A4B variant. You can get 60-70 tok/s on your M1 Ultra with the 26B-A4B on llama.cpp with launch settings similar to mine. Ollama is just a wrapper around llama.cpp that is slower and usually several builds behind (which actually matters a lot for Gemma4, because there have been several Gemma4-specific fixes in llama.cpp in the last 72 hours).
Here are my launch settings in llama.cpp: