Gemma4:26b's reasoning capabilities are crazy.
Posted by Mrinohk@reddit | LocalLLaMA | View on Reddit | 60 comments
Been experimenting with it, first on my buddy's compute he let me borrow, and then with the Gemini SDK so that I don't need to keep stealing his MacBook from 600 miles away. Originally my home agent was run through Gemini-3-Flash because no other model I've tried has been able to match its reasoning ability.
The script(s) I have it running through are a re-implementation of a multi-speaker smart home speaker setup, with several Raspberry Pi Zeros functioning as speaker satellites for a central LLM hub, right now a Raspberry Pi 5, soon to be an M4 Mac mini prepped for full local operation. It also has a dedicated Discord bot I use to interact with it from my phone and PC for more complicated tasks, and those requiring information from an image, like connector pinouts I want help with.
I've been experimenting with all sorts of local models, optimizing my scripts to reduce token input from tools and RAG so local models can function without getting confused, but none of them have been able to keep up. My main benchmark, "send me my grocery list when I get to Walmart," requires a solid six different tool calls to get right:

- learning which Walmart I mean from the memory database (especially challenging if RAG fails to pull it up),
- getting GPS coordinates for that Walmart by finding its address and feeding it into a dedicated tool that returns coordinates from an address or general location (Walmart, [CITY, STATE]),
- finding my grocery list within its lists database, and
- setting up a phone notification event with that list, nicely formatted, for when I approach those coordinates.

The only local model that was able to perform that task was GPT-OSS 120B, and I'll never have the hardware to run that locally. Even OSS still got confused, only succeeding with a completely clean chat history. Mind you, I keep my chat history limited to 30 entries shared between user, model, and tool inputs/returns. Most of its ability to hold a longer conversation comes from aggressive memory database updates and RAG.
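To give a rough idea of what the benchmark asks of the model, the chain of calls looks something like this sketch (all tool names, signatures, and return values here are hypothetical stand-ins, not the actual code):

```python
# Hypothetical sketch of the "grocery list at Walmart" tool chain.
# Every function below is a stub standing in for a real tool.

def recall_memory(query: str) -> str:
    """Stand-in for the RAG memory lookup ("which Walmart does the user mean?")."""
    return "Walmart, Springfield, IL"

def geocode(location: str) -> tuple:
    """Stand-in for the address/place -> GPS coordinates tool."""
    return (39.78, -89.65)

def fetch_list(name: str) -> list:
    """Stand-in for the lists-database lookup."""
    return ["milk", "eggs", "bread"]

def schedule_geofence_notification(coords, message: str) -> dict:
    """Stand-in for the phone-notification scheduler."""
    return {"coords": coords, "message": message, "armed": True}

# The calls the model has to chain together correctly, in order:
store = recall_memory("which walmart does the user usually shop at?")
coords = geocode(store)
items = fetch_list("groceries")
event = schedule_geofence_notification(
    coords, "Grocery list:\n- " + "\n- ".join(items)
)
print(event["armed"])  # True once the reminder is set
```

The hard part for a small model isn't any single call; it's threading the output of each step into the next one without losing the plot.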
Enter Gemma4, the 26B MoE specifically. It handles the Walmart task beautifully. I started trying other agentic tasks: research on weird stuff for my obscure project car, standalone ECU crank trigger stuff, among other topics. A lot of the work is done through dedicated planning tools to keep it fast with CoT/reasoning turned off while providing a sort of pseudo-reasoning, plus my tools and semantic tool injection to keep it focused, but even with all that helping, no other model family has been able to begin to handle what I've been throwing at it.
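The "semantic tool injection" part is basically a retrieval step over tool descriptions, so the model only sees a short, relevant tool list per request. A toy sketch (a real system would rank by embedding similarity; simple word overlap stands in here, and all tool names are invented):

```python
# Toy "semantic tool injection": rank tool descriptions against the
# user request and inject only the top-k into the model's context.

TOOL_DESCRIPTIONS = {
    "geocode": "turn an address or place name into gps coordinates",
    "grocery_list": "read or update the user's grocery shopping list",
    "play_music": "play a song or playlist on a speaker",
}

def score(request: str, description: str) -> int:
    # Word-overlap stand-in for cosine similarity over embeddings.
    return len(set(request.lower().split()) & set(description.split()))

def inject_tools(request: str, k: int = 2) -> list:
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda t: score(request, TOOL_DESCRIPTIONS[t]),
        reverse=True,
    )
    return ranked[:k]

print(inject_tools("send me my grocery list when I get to walmart"))
```

Shrinking the tool list this way is also what keeps token input low enough for small local models to stay coherent.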
It's wild. Interacting with it feels almost exactly like interacting with 3 Flash. It's a little bit stupider in some areas, but usually only to the point where it needs a little more nudging, rather than fully laid-out instructions to the point where I might as well do it all myself, like I have to with other models.
Just absolutely beyond impressed with its capabilities for how small and fast it is.
Naiw80@reddit
Ok, I don't know. It succeeds at "traditional LLM trippers," but so far my tests using it as an agent have been nothing but a disaster.
It's completely useless with Claude Code/Qwen Code etc., and just "discussing" with it, it gets stuck in loops where it repeats itself over and over. Sure, maybe it intentionally decided to win by repetition, but it certainly made no useful contribution to the discussion at all.
I find Gemma4 worse than Gemma3 in general...
Mrinohk@reddit (OP)
Which model size are you using? I'd be curious to know what kind of stuff you're feeding it and what version you're talking to to get those results. I've had a couple of minor hallucinations and one reasoning issue where it failed to use a tool it should have, but generally it's been fantastic for me. I've not used it for heavy coding, though, or things that require insane context.
Naiw80@reddit
I've been using 26B mostly, as it's what fits comfortably in my P100 + RTX 4070 combo.
I can't even get it to complete the simplest coding task; it just stops mid-task. It keeps (just like when chatting with it) repeating the same action over and over, i.e. performing the same edit to the very same file.
It hallucinates both method calls and arguments, and it keeps inserting comments in the code with thoughts like "// I hallucinated here earlier, I need to be precise now," etc.
And eventually Claude Code says "churned for N minutes..." and nothing happens until I re-engage, at which point it keeps repeating itself again until it randomly stops.
Same with Qwen Code. It's great that it works for some people, but to me it looks like a model heavily trained to perform well on benchmarks, with all the classic LLM twisters in its training dataset.
Speed is good though, about 50 t/s on this setup... but then again, that just means it spits out more bullshit faster...
AdExpress6498@reddit
It's very important what kind of data Qwen and Claude send! I, too, recently suffered from poor answers; then I read the Gemma 4 card again and discovered that with Gemma 4 you don't need to send thoughts back to the model! I removed the thought-sending and also added these settings:
```
temperature: 1,
top_p: 0.95,
top_k: 64,
```
And thanks to this, the answers became MUCH better.
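If you're serving through Ollama, these samplers go in the request's `options` field; a sketch of the request body (the model tag and message are placeholders):

```python
import json

# Sampler settings in an Ollama /api/chat request body.
# The "options" field names follow the Ollama REST API;
# model tag and message content are just placeholders.
payload = {
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "hello"}],
    "options": {
        "temperature": 1,
        "top_p": 0.95,
        "top_k": 64,
    },
    "stream": False,
}

body = json.dumps(payload)
print(body)
```

Other servers take the same knobs under different names (llama.cpp's server accepts them as top-level request fields), so check your backend's docs before copying these verbatim.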
Naiw80@reddit
I've experimented a bit, and so far it seems much more reliable if I set the context window to 65536, temperature to 1.5, and top_p to 0.95, and also set the stop token to
It's been going strong for 25 minutes on a fairly advanced coding task (I have no idea if it will succeed at all), but at least it's not randomly stopping like it was before.
AdExpress6498@reddit
I highly recommend rebuilding your llama.cpp if you haven't done so in a while; they're constantly releasing fixes for Gemma 4! Also, they recently updated the Unsloth weights (I haven't tried the new ones).
I use Gemma-4-26B-A4B-it-UD-Q5_K_S.gguf with a fresh llama.cpp,
context 6000
Naiw80@reddit
Just did, and unfortunately the model is back to being unreliable and just stops randomly... I guess I'll wait another few weeks or so...
AdExpress6498@reddit
What do you run it on?
Naiw80@reddit
llama.cpp
Truth-Does-Not-Exist@reddit
are you using openclaw?
Mrinohk@reddit (OP)
Nope. I didn't realize it existed until after I'd built a family of Python scripts where the agent lives, recreating as much of the Jarvis experience as possible.
Finanzamt_Endgegner@reddit
How does it compare to 35B? 35B at Q8 offloaded to RAM gives me roughly 42 t/s with 32K context, while 26B Q8 gives me 33, which is quite a bit slower for a smaller model /:
Specter_Origin@reddit
42 t/s on a dense model? What kind of hardware have you got? Also, a dense model running faster than a MoE doesn't add up...
Finanzamt_Endgegner@reddit
35B meaning the Qwen3.5 MoE, and 26B the Gemma 4 MoE.
Specter_Origin@reddit
That makes sense; active params are higher on Gemma than on Qwen. Active params are what's used at inference to answer your queries...
Finanzamt_Endgegner@reddit
I mean, sure, just saying that Qwen 35B is probably worth it if you have enough RAM, because it's still faster and might be smarter too (;
Specter_Origin@reddit
Gemma is much more efficient with reasoning tokens; I've consistently seen it need 50-60 fewer thinking tokens to get to an answer, and Gemma also doesn't have looping issues.
Flashy-Split-8602@reddit
I'm facing the looping issue though:
https://github.com/google-deepmind/gemma/issues/610
Danmoreng@reddit
That is expected tbh, since Qwen only has 3B active while Gemma has 4B active.
Mrinohk@reddit (OP)
On my buddy's MacBook it was averaging around 80 t/s with reasoning off, though I'm not sure what context size he has set on his end. I intentionally keep my context limited since the agent is meant to run 24/7. I'm currently running through the Gemini SDK so I don't steal his computer while he's playing a Minecraft pack that needs 40 GB of RAM; t/s doesn't really matter there since it's running on an overpowered GPU a million miles away.
I don't know if anyone's posted benchmarks for the M4 (non-Pro) on these models yet. I'm hoping for the 40-50 t/s range for usability.
I was using the 26B Q4 quantization specifically, on his MacBook and through the Gemini SDK, to keep things realistic for the model I'll actually be using once I get my own fast-enough hardware. I haven't tried 31B or anything else; that's more RAM than I can afford, and probably too slow for what I'm doing.
Finanzamt_Endgegner@reddit
Forgot to mention which 35B I mean: the Qwen3.5 one, which is a bit bigger but might also be better and faster.
Mrinohk@reddit (OP)
I'll bug my buddy to see if he'll let me steal his compute again this evening and see how Qwen3.5 35B does. He hasn't had any reasoning/agentic issues with GPT-OSS 120B, so he thinks my prompting is just better optimized for the Gemini/Gemma family of models rather than Qwen or GPT. I don't know how true that can be; instructions are instructions IMO, but this whole project has been built and tested mostly under Gemini models, so it could make sense.
Awkward_Rabbit_9618@reddit
I tested Qwen 3.5 35B MoE against Gemma 4 on a large set of tasks from my production workload. Gemma 4 was both faster and better in quality on the same machine for most tasks (on some coding tasks the results were very similar), so it replaced Qwen in production after a few hours of evaluation. Since it was faster, all outputs were either the same quality or better (mostly better), and it has a lower memory footprint, so I moved to Gemma 4. Also, Gemma 4 didn't show any degradation in output even close to the max context window (~260K), and on my machine the lower RAM footprint lets me run llama.cpp with parallel 2 at full context, or parallel 3 with 128K context windows. No-brainer. My bottleneck is RAM throughput, not CPU and not RAM size; it's an 8845HS Ryzen mini PC and I get 14 t/s.
createthiscom@reddit
gemma-4-26B-A4B-it-UD-Q8_K_XL is not as good at reasoning as DeepSeek-V3.2-light-GGUF:671b-q4_k_m, which in turn is nowhere near as good as GPT 5.4-Thinking. It's like a social hierarchy of machines. I am very impressed by gemma-4-26B-A4B-it-UD-Q8_K_XL's OCR capabilities though. Much better than the original DeepSeek-OCR (I think there is a new one, but I haven't tried it).
Mrinohk@reddit (OP)
If I could begin to dream of running a nearly 700B-parameter model, I don't think I'd be making a post praising a 26B model's performance lmao. I find it extremely impressive across the board though. I'm sure I've not pushed Gemini 3 Flash as hard as others have, but everything I've thrown at Gemma4 26B it has handled almost as well as 3 Flash, only requiring very minor correction in specific cases where the RAG tool injection doesn't immediately give it the tools it needs and it has to pull them in itself.
No-Setting8461@reddit
I think its a bot lol
createthiscom@reddit
lol
Mrinohk@reddit (OP)
this man single handedly caused the ram shortage
Brief_Consequence_71@reddit
This 26B model is no joke; it sits alongside way bigger models, in my opinion.
Far-Low-4705@reddit
Honestly, I found Qwen 3.5 to be stronger than Gemma 4, especially in agentic use cases.
I'm surprised Qwen 3.5 35B-A3B didn't work for you.
Far_Cat9782@reddit
Never had such an excellent tool caller. Blows every local model I had out of the water.
Borkato@reddit
How are you doing tool calls? llama.cpp seems to yield malformed tool calls with Gemma even after the updates. Maybe I'm forgetting a setting or doing it wrong in Python?
Specter_Origin@reddit
llama.cpp tool calls were fixed with yesterday's patch!
whatever462672@reddit
Oh time to rebuild.
Borkato@reddit
I should rebuild. I wonder if the models themselves have issues too.
Specter_Origin@reddit
The models themselves don't have issues with tool calls. I was also having the same issues, and after the patch it single-shotted a multi-thousand-line, multi-file codebase with hundreds of tool calls without any failure for me...
Borkato@reddit
Can I ask what you’re using it with? Like what frontend? And llama cpp backend?
Specter_Origin@reddit
Cline.
I had issues with OpenCode, especially the UI one.
Borkato@reddit
Omg it seems to be working 👀 👀 👀
Borkato@reddit
Oh fantastic! I’m gonna update right now!
RegularRecipe6175@reddit
Same issue using 0-day llama.cpp / Vulkan for this and 31B, Q8. It will shoot off dozens of malformed calls lightning fast, the result being my search API gets shut down from hitting the rate limit.
IrisColt@reddit
Spot on. When the bot fails, it becomes clear that we are still heavily reliant on search APIs that reject our automated queries.
Borkato@reddit
1000% exactly the same here. Someone said it needs the token, but even with that it's glitched.
Mrinohk@reddit (OP)
I have an executeTool function that takes the function name and the arguments. For every item in the tool_calls part of the Ollama response, it feeds the function name and arguments into it as a dictionary. It's pretty simple, but I'm running through the Ollama API, which abstracts a lot of it for me. I do know that Gemma4 had to have specific updates, and there's a very real chance they were rushed and you're running into some weird bug in the way it's being interpreted that Ollama corrects for but isn't fixed further upstream in llama.cpp.
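A minimal sketch of that dispatch pattern, assuming Ollama's response shape (a `tool_calls` list whose entries carry `function.name` and `function.arguments` as a dict); the tool registry and the weather function are made up for illustration:

```python
# Hypothetical tool registry; each entry maps a tool name the model
# can call to a plain Python function.
def get_weather(city: str) -> str:
    return f"sunny in {city}"

TOOLS = {"get_weather": get_weather}

def execute_tool(name: str, arguments: dict):
    """Dispatch one model-requested tool call to the matching function."""
    fn = TOOLS.get(name)
    if fn is None:
        # Feed the error back to the model instead of crashing the agent.
        return f"error: unknown tool {name!r}"
    return fn(**arguments)

# Trimmed-down shape of an Ollama chat response containing tool calls:
response = {
    "message": {
        "tool_calls": [
            {"function": {"name": "get_weather",
                          "arguments": {"city": "Austin"}}}
        ]
    }
}

for call in response["message"]["tool_calls"]:
    fn = call["function"]
    print(execute_tool(fn["name"], fn["arguments"]))  # sunny in Austin
```

Returning an error string for unknown tools (rather than raising) matters in an always-on agent: the model gets a chance to correct itself on the next turn.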
Borkato@reddit
This was helpful. Thank you, the people of this sub like you are amazing!
Mrinohk@reddit (OP)
Truthfully, at this point most of the codebase is AI-written (Claude Code), but I make a point to understand what it's doing and keep a strong map of the architecture, because I want to be able to share how I'm doing things so other people can build similar systems. The system I've been building has largely been built around Gemini 3 as the frontal lobe, but Gemma4 26B just slotted right in like I never changed models.
It blows my mind that a relatively small model that can run fully locally, quickly, on not-horribly-expensive hardware is capable of running within my system, whose whole goal is to recreate as much of the functionality of Jarvis as portrayed in the MCU as possible. As the project grew, it started to feel like no local model on hardware I'd ever be able to afford would keep up, but here we are. Browsing the web, finding obscure parts for me, building pinout mappings from one system to another. Insane shit.
Borkato@reddit
It really is!! I’m right there with you, literally coding stuff rn. I just discovered llama-server’s webui which really helps with sending images instead of curling through the terminal lmao, but other than that I’m rolling my own!
admajic@reddit
I found Gemma a bit too chatty; try Qwen 3.5 27B, it also rocks.
YouCantMissTheBear@reddit
He let you bring a computer home? This is still LocalLLaMA /s
Cold_Tree190@reddit
Huh, I keep having reasoning issues with Qwen but haven't tried Gemma 4 yet. Sounds like I need to switch over and try it out.
Mrinohk@reddit (OP)
I've only tried the smaller Qwen 3.5 models (4B), and I wasn't really impressed. I found even Gemma 3 4B to be more effective, with my biggest issue being the lack of native tool calling. Fixed in Gemma4, obviously.
tableball35@reddit
Usually with Qwen3.5, most people focus on 9B, 27B, and 35B-A3B for local use, with 27B generally seen as the best.
Specter_Origin@reddit
I was also having reasoning issues with Qwen. Gemma is much better; atm I only use it with vLLM or llama.cpp though, since MLX and LM Studio are busted.
triynizzles1@reddit
In general QA, I have found 26B to have more logical reasoning traces compared to 31B. 31B feels a bit too short, maybe overfitted and not creative enough. Could be the inference engine deployment though; I haven't updated since launch.
Kuarto@reddit
Are you running it on a MacBook? LM Studio MLX? What tokens/sec?
Mrinohk@reddit (OP)
I first started playing with it last night, using my friend's MacBook as an Ollama server that the Raspberry Pi these scripts live on would call for its model. M4 Pro MacBook Pro; he was getting 83 t/s on his machine. I've since switched over to the Gemini API, but selected gemma4:26b through that so it's the same model I tried on his machine and intend to run locally. I'm hoping to get in the 40-50 t/s range on the M4 Mac mini I have in the pipeline to run all of this on in the future.
It was run through Ollama, which does use llama.cpp with Metal support, but notably not MLX, so there is likely performance to be gained outside of Ollama/llama.cpp. Once the script is made macOS-native and using vLLM or some other backend that supports MLX, I hope to make the agent quite responsive locally.
DoorStuckSickDuck@reddit
Mine failed the car wash test once and then succeeded on the next rerun.
FenderMoon@reddit
Do you have reasoning enabled? Mine has always passed when it's allowed to think. (Unsloth 26B A4B at IQ4_NL)
danigoncalves@reddit
Did you try the mixture ones, like E4B?
Mrinohk@reddit (OP)
Extremely limited testing. I dumped the full input prompt that I feed to the larger models into a sanitized, non-tool-augmented instance, but with the tool definitions included, for as close to an apples-to-apples comparison as possible, then looked at how it output on its own and compared results. It was surprisingly close on information synthesis, needle-in-a-haystack requests, and discarding irrelevant information sitting at the edge of my RAG embedding threshold, but I've not tried running it in my full, tool-enabled environment with any fully agentic task like research or the Walmart benchmark I do.