What is the current solution to running Gemma 4 locally?
Posted by mihirlifehacks@reddit | LocalLLaMA | View on Reddit | 26 comments
Hi everyone,
I'm hearing very good things about Gemma 4, and I appreciate this community's posts on how it's still not perfect, with tool-call issues and plenty of others. But now that it's been about a week since its release, I'm curious whether anyone has had any success, and how?
I'm hearing that ollama had issues up until v0.20.0-rc1, but even that had tool-call issues. And now I'm seeing ollama has newer release candidates like v0.20.6-rc1, and I'm not sure whether that fixes everything?
And then there's a whole other camp that says it's better to use llama.cpp, but is that really perfect either?
And what CLI / coding client are y'all using to help code with the model? I think OpenCode is quite popular, but are y'all having a better experience with the open-source Claude Code (https://github.com/anthropics/claude-code) or any other CLI/IDE?
...unless I'm super wrong and Gemma 4 is still a disaster to run locally :D
Thank you for your help community!
Anonymous_Cyber@reddit
Used Antigravity to build my own Tauri app; I suggest doing the same. Comes in handy.
Plastic-Parsley3094@reddit
Has anyone run E4B and 26B on something like a 5070 Ti laptop with 12 GB VRAM?
Because using llama.cpp I get only 7 tokens/second on both E4B and 26B for some reason.
I will try ollama and see.
I'm on Arch Linux, by the way, and have tried CUDA 13.2 and downgraded to 13.1, as I read somewhere there were bugs.
But if someone has been able to run this at decent speed with ollama or llama.cpp, please write the command here, as I am lost.
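One starting point worth trying on a 12 GB card: offload the MoE expert weights to CPU and keep everything else on the GPU. A rough sketch below; the model filename is a placeholder, the flags are from recent llama.cpp builds (check `llama-server --help` on yours), and the numbers are guesses to tune from:

```shell
# Sketch: GPU for attention/dense layers, system RAM for MoE experts.
# Model path and layer/thread counts are placeholders, not tested values.
llama-server \
  -m ./gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 99 \
  --ctx-size 8192 \
  --flash-attn on \
  --threads 8
```

If 7 t/s persists with experts on CPU, check that the CUDA build is actually being used (`llama-server --list-devices`) before blaming the CUDA version.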
_Motoma_@reddit
Both Gemma 4 26B MoE and 31B Dense run well on my system with 2x RTX 3060 12 GB (24GB VRAM) in Llama.cpp. I use GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 for Dense, but don’t notice any difference performance wise versus a smaller context window. The bartowski IQ4_XS model is my go-to for most of the models I try on this rig.
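For anyone unfamiliar with that env var, here's roughly what the dense-model launch looks like. `GGML_CUDA_ENABLE_UNIFIED_MEMORY` is a real llama.cpp environment variable that lets CUDA spill past dedicated VRAM into system RAM; the model path and context size below are placeholders:

```shell
# Sketch: unified memory lets the 31B dense model overflow 24 GB VRAM
# into system RAM instead of failing to allocate. Paths are placeholders.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server \
  -m ./gemma-4-31B-it-IQ4_XS.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384
```

Expect spilled layers to run at system-RAM speed, which matches the observation that it behaves about like a smaller context window.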
Optimal_Guava5390@reddit
Yep, only got Gemma 4 26B MoE since I only have a 7900 XT, but it's extremely impressive. I also run it in llama.cpp; the bloat of LM Studio isn't worth it, so I just serve it on localhost for my GUI, for docs etc.
ForsookComparison@reddit
Gemma 4 is free
Llama CPP is free
Don't hear/see/read sides. Go try it, man.
remghoost7@reddit
Also, just a heads up: run llama.cpp with these flags if you're running Gemma 4:
I went from using 50+GB of system RAM in longer contexts / repeated conversations down to almost nothing.
Here's a related reddit comment section on why.
dampflokfreund@reddit
Isn't the drawback that you will have much more reprocessing this way?
remghoost7@reddit
See, I figured that was going to be an issue too.
But here's an explanation of what the context checkpoints actually do:
The default is every 8192 tokens with up to 32 checkpoints per slot.
But apparently the context checkpoints for Gemma 4 are insanely huge compared to other models (500MB+ compared to around 60MB with Qwen), so it balloons RAM out.
So if you're bouncing back and forth between prior branches, then yes.
But I typically run "linear" conversations, so I haven't noticed any slowdowns.
Someone in that comment section mentioned running it at 4 instead of 1, so you can try that if necessary.
But for my use cases, literally no difference in prompt processing for way less system RAM.
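To put numbers on the ballooning: the 500 MB and 60 MB per-checkpoint figures are the ones quoted above, and 8192/32 are the stated defaults, so the worst case per slot is just multiplication:

```python
# Back-of-the-envelope worst-case RAM for llama.cpp context checkpoints
# per slot. Per-checkpoint sizes are the figures quoted in the thread,
# not measured here.
def checkpoint_ram_gb(checkpoint_mb: float, max_checkpoints: int = 32) -> float:
    """Worst-case checkpoint RAM in GiB for one slot."""
    return checkpoint_mb * max_checkpoints / 1024

gemma_gb = checkpoint_ram_gb(500)  # ~500 MB each for Gemma 4
qwen_gb = checkpoint_ram_gb(60)    # ~60 MB each for Qwen

print(f"Gemma 4: ~{gemma_gb:.1f} GiB, Qwen: ~{qwen_gb:.1f} GiB")
```

So a single full slot is ~15.6 GiB for Gemma 4 versus ~1.9 GiB for Qwen, which is why a few long conversations can plausibly eat 50+ GB while Qwen never makes you notice.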
BannedGoNext@reddit
```ini
[gemma-4-26B-A4B-it-MXFP4_MOE]
model = /home/user/models/router-models/gemma-4-26B-A4B-it-MXFP4_MOE.gguf
ctx-size = 4096
temp = 1.0
top-p = 0.95
top-k = 64
repeat-penalty = 1.0
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
# Keep MoE expert weights on CPU and trim the layer offload to fit 8 GiB VRAM.
cpu-moe = 1
n-gpu-layers = 8
parallel = 1
threads = 8
batch-size = 512
ubatch-size = 256
chat-template-file = /home/vmlinux/models/chat-templates/google-gemma-4-26B-A4B-it-official-chat_template.jinja
chat-template-kwargs = {"enable_thinking": false}
reasoning = off
reasoning-budget = 0
```
This is currently running great on an 8 GB GPU machine with 64 GB of RAM; it's processing about 5 prompts per minute at around 12 t/s. Granted, I'm using it as a dialectical check on a brainstorming LLM process, but it's running rock fucking solid. Great little model.
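Those two throughput figures imply an average response length, which is a quick sanity check you can run on any setup (assuming generation, not prompt processing, dominates each prompt's wall time):

```python
# Implied average response length from the throughput figures quoted above.
prompts_per_minute = 5
tokens_per_second = 12

seconds_per_prompt = 60 / prompts_per_minute          # 12 s per prompt
avg_tokens = seconds_per_prompt * tokens_per_second   # tokens per response

print(f"~{avg_tokens:.0f} tokens per response")  # ~144
```

~144 tokens per answer is consistent with short verdict-style outputs, i.e. the dialectical-check use case rather than long-form generation.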
--Spaci--@reddit
Claude Code isn't open source; a version just leaked. I would still just use OpenCode. I don't trust people vibe-coding local model support, and you won't get any future support.
TheRealSol4ra@reddit
Claude Code has had local model support for months. I hate idiots thinking that this isn't the case.
Lie-Prior@reddit
Seems a bit harsh to call someone an idiot for not knowing this don’t you think? Please just reconsider how you talk to people as you move forward in life.
TheRealSol4ra@reddit
Spreading misinformation in the age of information is idiotic.
To act like you know something and willfully spread false information, when you can find out whether you're right or wrong in less than 10 seconds, shows either blatant ignorance or idiocy.
I'm not going to mince words because it hurts someone's feelings. Boo fucking hoo. What do you do, mute or block people you disagree with in real life?
--Spaci--@reddit
Don't worry, you didn't hurt my feelings. I know your life is miserable and you try to make others' worse. I don't know why you still try, Solara; people genuinely just don't like how you act.
Crafty-Celery-2466@reddit
Makes sense... based on this argument, go call your coworker an idiot when they're wrong and confident, and let's see you defend it.
Disposable110@reddit
It works in latest Oobabooga in Chat Mode by setting it to chat-instruct format in the UI.
Which incidentally also removes any sense of censorship.
Haven't managed to get it to work with OpenCode, though; I'm running into memory exceptions and llama.cpp crashing completely when serving through the Oobabooga API.
shanehiltonward@reddit
Llama.cpp, Msty,...
CelvestianNesy@reddit
"The more money you save, the more GPUs you buy" -- Jensen
Jk XD
RedParaglider@reddit
Locally, and some would argue even for SOTA models, pi is the best harness.
gandazgul@reddit
I'm using Pi Agent with the plannotator extension. Configured to use Gemini or Claude for planning then Gemma4 for execution. I might switch to Gemma for planning too.
benevbright@reddit
It's too bad as a coding agent (tested with the latest fix).
ttkciar@reddit
Current llama.cpp just works with Gemma 4. I've been evaluating it for a couple of days now.
You'll still need Google's chat template update if you want to use tool-calling, but I don't, so it's pretty smooth sailing.
Just be sure to use current llama.cpp.
DygusFufs@reddit
I'm maintaining a personal fork of ollama (don't hate on me, I'm just used to it) with updated llama.cpp and a few fun tweaks. The killer for me was setting swa_full=false, which actually allowed it to fit into 24 GB of VRAM at Q4.
Unlucky-Bunch-7389@reddit
If you're trying to use agentic coders with open-weight models and act like they're gonna be Opus, you're in for a bad time.
total-context64k@reddit
The latest release of CLIO supports Gemma 4. Launch it with Llama.cpp, connect to it from CLIO, and off you go.
FoxiPanda@reddit
Here's what I've found working the best for me: