What is the current solution to running Gemma 4 locally?
Posted by mihirlifehacks@reddit | LocalLLaMA | View on Reddit | 26 comments
Hi everyone,
I'm hearing very good things about Gemma 4, and I appreciate this community's posts on how it's still not perfect, with tool-call issues and plenty of others. But now that it's been about a week since its release, I'm curious whether anyone has had any success, and how?
I'm hearing that ollama had issues up until v0.20.0-rc1, but even that had tool-call issues. And now I'm seeing ollama has newer release candidates like v0.20.6-rc1, and I'm not sure whether that fixes everything?
And then there's a whole other camp that says it's better to use llama.cpp, but is that really perfect either?
And what CLI / coding client are y'all using to help code with the model? I think OpenCode is quite popular, but are y'all having a better experience with the open-source Claude Code (https://github.com/anthropics/claude-code) or any other CLI/IDE?
...unless I'm super wrong and Gemma 4 is still a disaster to run locally :D
Thank you for your help community!
Anonymous_Cyber@reddit
Used Antigravity to build my own Tauri app; I suggest doing the same. Comes in handy.
Plastic-Parsley3094@reddit
Has anyone run E4B and 26B on something like a 5070 Ti laptop with 12 GB VRAM?
Because using llama.cpp I get only 7 tokens/second on both E4B and 26B for some reason.
I will try ollama and see.
I'm on Arch Linux, by the way, and have tried CUDA 13.2 and downgraded to 13.1, as I read somewhere there were bugs.
But if someone has been able to run this at decent speed with ollama or llama.cpp, please write the command here, as I am lost.
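One starting point worth trying on a 12 GB card: offload the MoE expert weights to CPU and keep everything else on the GPU. A rough sketch below; the model filename is a placeholder, the flags are from recent llama.cpp builds (check `llama-server --help` on yours), and the numbers are guesses to tune from:

```shell
# Sketch: GPU for attention/dense layers, system RAM for MoE experts.
# Model path and layer/thread counts are placeholders, not tested values.
llama-server \
  -m ./gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 99 \
  --ctx-size 8192 \
  --flash-attn on \
  --threads 8
```

If 7 t/s persists with experts on CPU, check that the CUDA build is actually being used (`llama-server --list-devices`) before blaming the CUDA version.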
_Motoma_@reddit
Both Gemma 4 26B MoE and 31B Dense run well on my system with 2x RTX 3060 12 GB (24GB VRAM) in Llama.cpp. I use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 for Dense, but don’t notice any difference performance wise versus a smaller context window. The bartowski IQ4_XS model is my go-to for most of the models I try on this rig.
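For anyone unfamiliar with that env var, here's roughly what the dense-model launch looks like. `GGML_CUDA_ENABLE_UNIFIED_MEMORY` is a real llama.cpp environment variable that lets CUDA spill past dedicated VRAM into system RAM; the model path and context size below are placeholders:

```shell
# Sketch: unified memory lets the 31B dense model overflow 24 GB VRAM
# into system RAM instead of failing to allocate. Paths are placeholders.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server \
  -m ./gemma-4-31B-it-IQ4_XS.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384
```

Expect spilled layers to run at system-RAM speed, which matches the observation that it behaves about like a smaller context window.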
Optimal_Guava5390@reddit
Yep, only got Gemma 4 26B MoE since I only have a 7900 XT, but it's extremely impressive. I also run it in llama.cpp; the bloat of LM Studio isn't worth it, so I just serve it on localhost for my GUI, for docs etc.
ForsookComparison@reddit
Gemma 4 is free
Llama CPP is free
Don't hear/see/read sides. Go try it, man.
remghoost7@reddit
Also, just a heads up: run llama.cpp with these flags if you're running Gemma 4:
I went from using 50+GB of system RAM in longer contexts / repeated conversations down to almost nothing.
Here's a related reddit comment section on why.
dampflokfreund@reddit
Isn't the drawback that you will have much more reprocessing this way?
remghoost7@reddit
See, I figured that was going to be an issue too.
But here's an explanation of what the context checkpoints actually do:
The default is every 8192 tokens with up to 32 checkpoints per slot.
But apparently the context checkpoints for Gemma 4 are insanely huge compared to other models (500MB+ compared to around 60MB with Qwen), so it balloons RAM out.
So if you're bouncing back and forth between prior branches, then yes.
But I typically run "linear" conversations, so I haven't noticed any slowdowns.
Someone in that comment section mentioned running it at 4 instead of 1, so you can try that if necessary.
But for my use cases, literally no difference in prompt processing for way less system RAM.
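To put numbers on the ballooning: the 500 MB and 60 MB per-checkpoint figures are the ones quoted above, and 8192/32 are the stated defaults, so the worst case per slot is just multiplication:

```python
# Back-of-the-envelope worst-case RAM for llama.cpp context checkpoints
# per slot. Per-checkpoint sizes are the figures quoted in the thread,
# not measured here.
def checkpoint_ram_gb(checkpoint_mb: float, max_checkpoints: int = 32) -> float:
    """Worst-case checkpoint RAM in GiB for one slot."""
    return checkpoint_mb * max_checkpoints / 1024

gemma_gb = checkpoint_ram_gb(500)  # ~500 MB each for Gemma 4
qwen_gb = checkpoint_ram_gb(60)    # ~60 MB each for Qwen

print(f"Gemma 4: ~{gemma_gb:.1f} GiB, Qwen: ~{qwen_gb:.1f} GiB")
```

So a single full slot is ~15.6 GiB for Gemma 4 versus ~1.9 GiB for Qwen, which is why a few long conversations can plausibly eat 50+ GB while Qwen never makes you notice.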
BannedGoNext@reddit
```ini
[gemma-4-26B-A4B-it-MXFP4_MOE]
model = /home/user/models/router-models/gemma-4-26B-A4B-it-MXFP4_MOE.gguf
ctx-size = 4096
temp = 1.0
top-p = 0.95
top-k = 64
repeat-penalty = 1.0
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
# Keep MoE expert weights on CPU and trim the layer offload to fit 8 GiB VRAM.
cpu-moe = 1
n-gpu-layers = 8
parallel = 1
threads = 8
batch-size = 512
ubatch-size = 256
chat-template-file = /home/vmlinux/models/chat-templates/google-gemma-4-26B-A4B-it-official-chat_template.jinja
chat-template-kwargs = {"enable_thinking": false}
reasoning = off
reasoning-budget = 0
```
This is currently running great on an 8 GB GPU machine with 64 GB of RAM; it's processing about 5 prompts per minute at around 12 t/s. Granted, I'm using it as a dialectical check on a brainstorming LLM process, but it's running rock fucking solid. Great little model.
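Those two throughput figures imply an average response length, which is a quick sanity check you can run on any setup (assuming generation, not prompt processing, dominates each prompt's wall time):

```python
# Implied average response length from the throughput figures quoted above.
prompts_per_minute = 5
tokens_per_second = 12

seconds_per_prompt = 60 / prompts_per_minute          # 12 s per prompt
avg_tokens = seconds_per_prompt * tokens_per_second   # tokens per response

print(f"~{avg_tokens:.0f} tokens per response")  # ~144
```

~144 tokens per answer is consistent with short verdict-style outputs, i.e. the dialectical-check use case rather than long-form generation.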
--Spaci--@reddit
Claude Code isn't open source; a version just leaked. I would still just use OpenCode. I don't trust people vibe-coding local model support, and you won't get any future support.
TheRealSol4ra@reddit
Claude Code has had local model support for months. I hate idiots thinking that this isn't the case.
Lie-Prior@reddit
Seems a bit harsh to call someone an idiot for not knowing this don’t you think? Please just reconsider how you talk to people as you move forward in life.
TheRealSol4ra@reddit
Spreading misinformation in the age of information is idiotic.
To act like you know something and willfully spread false information, when you can find out whether you're right or wrong in less than 10 seconds, shows either blatant ignorance or idiocy.
I'm not going to mince words because it hurts someone's feelings. Boo fucking hoo. What do you do, mute or block people you disagree with in real life?
--Spaci--@reddit
Don't worry, you didn't hurt my feelings. I know your life is miserable and you try to make others' worse. I don't know why you still try, Solara; people genuinely just don't like how you act.
Crafty-Celery-2466@reddit
Makes sense... based on this argument, go call your coworker an idiot when they're wrong and confident, and let's see you defend it.
Disposable110@reddit
It works in latest Oobabooga in Chat Mode by setting it to chat-instruct format in the UI.
Which incidentally also removes any sense of censorship.
Haven't managed to get it to work with OpenCode, though; I'm running into memory exceptions and llama.cpp crashing completely when serving through the Oobabooga API.
shanehiltonward@reddit
Llama.cpp, Msty,...
CelvestianNesy@reddit
"The more money you save, the more GPUs you buy" -- Jensen
Jk XD
RedParaglider@reddit
Locally, and some would argue even for SOTA models, pi is the best harness.
gandazgul@reddit
I'm using Pi Agent with the plannotator extension. Configured to use Gemini or Claude for planning then Gemma4 for execution. I might switch to Gemma for planning too.
benevbright@reddit
It's too bad as a coding agent (tested with the latest fix).
ttkciar@reddit
Current llama.cpp just works with Gemma 4. I've been evaluating it for a couple of days now.
You'll still need Google's chat template update if you want to use tool-calling, but I don't, so it's pretty smooth sailing.
Just be sure to use current llama.cpp.
DygusFufs@reddit
I'm maintaining a personal fork of ollama (don't hate on me, I'm just used to it) with updated llama.cpp and a few fun tweaks. The killer for me was setting swa_full=false, which actually allowed it to fit into 24 GB of VRAM at Q4.
Unlucky-Bunch-7389@reddit
If you're trying to use agentic coders with open-weight models and act like they're gonna be Opus, you're in for a bad time.
total-context64k@reddit
The latest release of CLIO supports Gemma 4. Launch it with Llama.cpp, connect to it from CLIO, and off you go.
FoxiPanda@reddit
Here's what I've found working the best for me: