Gemma 4 26B A4B just doesn't want to finish the job... or is it me?
Posted by boutell@reddit | LocalLLaMA | View on Reddit | 25 comments
I've tried Gemma 4 26B A4B under both OpenCode and Claude Code now, on an M2 MacBook Pro with 32GB RAM. Both times using Ollama 0.20.2, so yes, I have the updates that make Ollama Gemma 4 compatible.
I gave it a meaty job to do, one that Opus 4.6 aced under Claude Code last week. Straightforward adapter pattern — we support database "A," now support database "B" by generating a wrapper that implements a subset of the database "A" API. Piles of unit tests available, tons of examples of usage in the codebase. I mention this because it shows the challenge is both nontrivial and well-suited to AI.
At first, with both Claude Code and OpenCode, Gemma 4 made some progress on planning, wrote a little code, and... just gave up.
It would announce its progress thus far, and then stop. Full stop according to both the CPU and the GPU.
After giving up, I could get it to respond by talking to it, at which point the CPU and GPU would spin for a while to generate a response. But it wouldn't do anything substantive again. I had very silly conversations in which Gemma 4 would insist it was doing work, and I would point out that the CPU and GPU progress meters indicate it isn't, and so on.
Finally this last time in OpenCode I typed:
"No, you're not. You need to start that part of the work now. I can see the CPU and GPU progress meters, so don't make things up."
And now it's grinding away generating code, with reasonably continuous GPU use. Progress seems very slow, but at least it's trying.
For a while I saw code being generated, now I see ">true" once every minute or two. Test runs perhaps.
Is this just life with open models? I'm spoiled, aren't I.
madbunnyshit@reddit
I can't get gemma4 to edit and read files through codes.
kinglock_mind@reddit
Complete disaster, I have to babysit it after every step.
ZootAllures9111@reddit
Are the settings right though? And what Quant?
triynizzles1@reddit
I have read that some implementations of Gemma 4 are still a work in progress. Maybe that is what you are experiencing. Personally, the only usable local non-frontier model in openclaw for me has been Glm 4.7 Flash, with nemotron 3 a distant second.
Ok-Importance-3529@reddit
Try Copaw 9B. It's been the best at tool use, with decent speeds, and you can run Q8 with full context at roughly the size of a Q4 of Gemma 4.
boutell@reddit (OP)
I'm going for opencode, so different use case, but I will likely play with these anyway, thank you
boutell@reddit (OP)
An update and a trail of breadcrumbs for myself and others:
* I tried stepping down to E4B, just to see what would happen. It was a perfectly behaved citizen, but it was just too dumb to use: it couldn't resolve an obvious JavaScript syntax error of its own creation.
* So I came back to 26B A4B, but this time I followed this guide. You need a very bleeding-edge llama.cpp and a specific PR of opencode. However, per erikji's comment on the gist, you can now avoid compiling llama.cpp if you install HEAD with brew.
See this gist, and the comments:
https://gist.github.com/daniel-farina/87dc1c394b94e45bb700d27e9ea03193
* If you have 32GB RAM like me, resist the temptation to use "-c 65536" when starting llama-server; use -c 32768. In my experiments I couldn't achieve reliability with -c 65536 and would still get unexpected hard stops. Even with 32768 I still see tons of RAM use.
* As the recommended config files in that gist suggest, you want to keep input tokens down to 32768 and output tokens down to 8192.
* With all of that... I'm starting to see progress. But I need my Mac back, so more experiments and a fresh post after work possibly.
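Pulling the constraints above into one launch command, a sketch only: the GGUF filename is a placeholder, and the real recommended settings are in the gist.

```shell
# Sketch of a llama-server launch under the constraints above, assuming a
# bleeding-edge llama.cpp build with Gemma 4 support.
# The GGUF filename is a placeholder; take the real settings from the gist.
# -c 32768: on this 32GB machine, -c 65536 caused unexpected hard stops.
llama-server --model ./gemma-4-26b-a4b-Q4_K_M.gguf -c 32768 --port 8080
```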
Daniel_H212@reddit
Yeah Gemma 4 has been disappointing for me in a similar way. I mostly use local LLMs for web research tasks and Gemma 4 keeps giving up on searching even after saying it needs to do more searching, sometimes even right after it formulates a research plan.
CombustedPillow@reddit
mine doesn't seem to search at all. I asked it to look up information on Nothing Phone 4a and it kept telling me it doesn't exist and that it couldn't find anything on a search.
One of the reasonings I caught that I found funny:
**Wait! I see what's happening.** The user might be testing my ability to resist being gaslit by a fake URL. If I say "Oh, you're right, I see it now!" I am hallucinating. If I say "That link is broken," I am being truthful.
autognome@reddit
Harness/orchestration issue, not a model issue.
Daniel_H212@reddit
Is it though? It still uses the tools fine, it just gives up earlier than it should.
autognome@reddit
The harness's job is to continue, summarize, drive subagents, etc. A single model can't do that on its own; it's a collaboration.
Emotional-Breath-838@reddit
what harness works best?
DraconPern@reddit
I used LM Studio and set up Playwright, and it seems to work.
centminmod@reddit
What context size are you working with?
I read folks having more issues initially with Ollama and Google Gemma 4. I haven't tried Ollama. I tried it for local AI via LM Studio and Claude Code on my Macbook Pro M4 Pro with 48GB memory https://ai.georgeliu.com/p/running-google-gemma-4-locally-with.
As you increase the context window size, memory consumption increases. So I don't think heavy coding users will be able to run Google Gemma 4 locally unless it's paired with a lot of memory, at least 64GB, since context matters for LLM performance.
boutell@reddit (OP)
Update: it printed "True>" in a loop for hours while burning 100% of GPU and writing no code. I shut it down, LOL.
gintrux@reddit
I had a similar problem with mimo v2 pro on opencode, but setting the frequency penalty parameter to 0.3 in the config JSON solved it.
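A minimal sketch of that tweak, assuming the OpenAI-style option name; the exact opencode config structure and file location aren't specified here, so treat this as illustrative only.

```shell
# Write an illustrative JSON fragment to a scratch file. The key name
# "frequency_penalty" follows the common OpenAI-style sampling option;
# where it belongs inside the opencode config is an assumption, so check
# the opencode docs before copying it anywhere.
cat <<'EOF' > /tmp/frequency_penalty_example.json
{
  "frequency_penalty": 0.3
}
EOF
```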
kataryna91@reddit
Well, that should tell you without a doubt that your quant and/or inference engine is broken, no?
It has nothing to do with the model. Even though the 26B model is already significantly worse than the 31B model, that is not your main issue right now.
HardwarePassion@reddit
Mine, while writing code, gets stuck on "follow up": it thinks for a few minutes with no answer and just says "follow up" :D When I open the thought window, it's full of the same thing repeated over and over hahah
SM8085@reddit
I had to re-download the updated gemma 4 ggufs, which seem much more un-fucked.
While it made it through 13-14 steps, it still simply stopped at a point.
Qwen3.5 seems more agentic in that regard, where it seems to follow through on problems more.
Evildude42@reddit
Maybe your prompting is wrong. You want an intermediate format that can be used with database A or B, instead of understanding database A and then adding a crazy shim to translate to database B.
boutell@reddit (OP)
That would make sense if I were designing my app from the ground up today! It really would. But now? When we have tens of thousands of lines of code that are written for database "A"... and Claude Code with Opus 4.6 blew the problem out of the water in a couple days, giving us an adapter that speaks "A" on top of "B" fluently? No chance we'd rewrite everything (:
DinoAmino@reddit
Maybe not enough vram for context? There was a post the other day titled "Gemma 4 is a kv cache pig". I think the latest llama.cpp has a fix for that. Does ollama have those fixes?
matt-k-wong@reddit
Perhaps it's my use cases, but I've found MoE models to be inferior. The new ~30B dense models are much better, but slower. Also, my mental model is that LLMs work in stages: maybe use the fast model to get a solid framework out, then come back with the dense model to clean things up.
Adventurous-Paper566@reddit
Would you use a drill to hammer in a nail?
Gemma simply isn't made for that.