Gemma 4 26b is the perfect all-around local model and I'm surprised how well it does.
Posted by pizzaisprettyneato@reddit | LocalLLaMA | 169 comments
I got a 64GB Mac about a month ago and I've been trying to find a model that is reasonably quick, decently good at coding, and doesn't overload my system. The test I've been running is having it create a Doom-style raycaster in HTML and JS.
I've been told Qwen 3 Coder Next was the king, and while it's good, the 4-bit variant always put my system near the edge. Also, I don't know if it was because of the 4-bit quant, but it would always miss tool uses and get stuck in a loop guessing the right params. In the Doom test it would usually get there and make something decent, but only after getting stuck in a loop of bad tool calls for a while.
Qwen 3.5 (the near-30b MoE variant) could never do it in my experience. It always got stuck in a thinking loop and then would become so unsure of itself it would just end up rewriting the same file over and over and never finish.
But Gemma 4 just crushed it, making something working after only 3 prompts. It also limited its thinking and didn't get too lost in details, it just did it. It's the first time I've run a local model and been actually surprised that it worked great, without any weirdness.
It makes me excited about the future of local models, and I wouldn't be surprised if in 2-3 years we'll be able to use very capable local models that can compete with the sonnets of the world.
Tunashavetoes@reddit
Does anybody else's Gemma 4 26B or 31B get stuck in a search loop when you ask it to look things up? Like it'll fire off 30 different searches and queue them up until they're all finished before giving me a response.
RainierPC@reddit
I had this, but I added instructions to the system prompt limiting the use of the search tool to once per turn, and it worked after that.
sagefields123@reddit
What instruction? I tried it, but it still made 45 web search tool calls, while Qwen3.5 kept it minimal (I was just asking for the weather).
RainierPC@reddit
Literally "Use at most one tool call per turn."
sagefields123@reddit
I literally only had this in the system prompt; it worked sometimes but not reliably.
sagefields123@reddit
Same. Gemma 4 on LM Studio, both the 26B and 31B models. System prompt or temperature settings did not solve it for me. I'm keeping Qwen3.5.
port888@reddit
Yep, facing this issue on LM Studio with Gemma 4 26B A4B. I'll just revisit Gemma 4 once this issue is ironed out.
littlle@reddit
I use the 31b on my laptop and I have no issues. I run it in the console with Ollama.
arman-d0e@reddit
Lm studio? Gguf still feels broken to me rn
gpalmorejr@reddit
That's interesting. My Qwen3.5-35B-A3B did great with coding. The only issue I had was a weird context glitch somewhere between Qwen and Roo talking one time. Other than that it has been flawless.
pizzaisprettyneato@reddit (OP)
Yeah I dunno I just ended up having problems and I don’t know why. It’s very possible I didn’t test the setting enough. Gemma just worked for me without any adjustments
johnfkngzoidberg@reddit
Gemma 4 is very slow for me. Qwen3.5 just works out of the box. I also get a lot more context from Qwen in the same size VRAM
gpalmorejr@reddit
Same. But are you using the Gemma 4 MoE model or the dense model? It'll make a huge difference.
johnfkngzoidberg@reddit
Not sure, which works better? I’ll check later.
gpalmorejr@reddit
Interesting. I'd be looking at settings and runtime updates. I half expected you to say dense model as I have seen that one a lot. (People seem to confuse that a lot and not realize the difference in speed).
gpalmorejr@reddit
I tried Gemma4 in LM Studio and it failed to even load the model. I tried a bunch of times and with different versions/quants/sources and never got it to work. So I just stuck with Qwen, since it's been the only one that has given me so little trouble and shown so much potential while fitting my hardware. I'm hoping some runtime update or something fixes it soon so I can try it, though.
Jeidoz@reddit
Check the latest beta update of LM Studio. There were fixes for Gemma.
raindownthunda@reddit
This. You need the latest runtime. I got the CPU (slow) version with the latest app update, but when I switched to the beta channel the CUDA version showed up. Works great!
gpalmorejr@reddit
I just noticed that, too. I'll check it out soon.
AromaticBear777@reddit
There was an update to fix Gemma failing at load time. Make sure you are on 0.4.9+1; that fixed it for me.
gpalmorejr@reddit
I did just update. My current daily driver model has been busy refactoring all day, so I'll have to try it later. (Although this is more a hardware speed problem than a bad model problem lol)
El_Hobbito_Grande@reddit
I got it to work well with Ollama
gpalmorejr@reddit
I may have to try that. Ollama uses llama.cpp too, doesn't it? I know LM Studio does, and they JUST updated the runtime, so it may be time to try again.
El_Hobbito_Grande@reddit
Yeah, the pace of updates for just about everything AI is nuts right now. I just updated oMLX to run Gemma 4. So far so good, but I've only tested the 4b model on it.
gpalmorejr@reddit
How is it compared to Qwen3.5-4B? Or do you know?
jonnyglobal@reddit
When loading with Ollama, Ollama prompted me to update the version and from there it was seamless.
gpalmorejr@reddit
Yeah, LM Studio just got a llama.cpp update too.
DepictWeb@reddit
Just update LM Studio. The first-day release had some issues.
gpalmorejr@reddit
Yeah, I saw that the llama.cpp runtime updated. I need to try it again. I assume it has something to do with the unusual PLE architecture but who knows. I haven't looked into it, yet.
StardockEngineer@reddit
The Unsloth team has a guide on the optimal settings.
Voxandr@reddit
With Cline (which Roo was forked from), Gemma 4 fails hard from time to time.
cmenke1983@reddit
Did you adjust frequency, presence and repetition penalty parameters?
rm-rf-rm@reddit
It's most likely that you haven't provided a big enough system prompt (and/or used the official recommended params for thinking vs non-thinking, agentic vs chat use cases) - do a search on this sub, there were tons of posts in the past few weeks about this.
relmny@reddit
Me neither, and I run 27b, 122b, 397b and 35b, and after the (Unsloth and Bartowski) quants were fixed, never had any issue. And I run them almost daily...
But I use llama.cpp/ik_llama.cpp and follow the default settings by Qwen (or Unsloth)...
gpalmorejr@reddit
I also use the Unsloth variants and have been loving them. Also using llama.cpp through LM Studio. I don't have nearly the hardware (Ryzen7 5700, 33GB DDR4, GTX1060 6GB) for those larger models, but similarly, 35B has been loaded up for me for at least a month now with little interruption and has done everything I needed. The only intermittent issues have come from me tweaking it too much and causing freezes and crashes, or from running something really big beside it until one or the other gives up lol.
roosterfareye@reddit
Mine as well. I use it to quickly plan and iterate and develop test plans, test, fix, then review with qwen 3.5 27b. Both models are very good.
gpalmorejr@reddit
Unfortunately for me (Ryzen7 5700, 32GB DDR4 RAM, GTX1060), I'll be waiting until the heat death of the universe for any complex agentic coding solution to finish, probably including any realistic length of code and such. I had 27B generate a python script to flatten a bunch of directories and move all the photos and videos to another folder from an old backup of another computer. It took like 15 or 20 minutes lol.
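For reference, that kind of flatten-and-move script is only a few lines of Python. A rough sketch, with the source/destination paths and the extension list as placeholders:

```python
from pathlib import Path
import shutil

# Placeholder paths -- adjust to the actual backup and destination folders.
SRC = Path("old_backup")
DST = Path("flattened_media")
MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mov", ".avi"}

DST.mkdir(exist_ok=True)
for f in SRC.rglob("*"):
    if f.is_file() and f.suffix.lower() in MEDIA_EXTS:
        target = DST / f.name
        # Avoid overwriting files that share a name across directories.
        stem, n = f.stem, 1
        while target.exists():
            target = DST / f"{stem}_{n}{f.suffix}"
            n += 1
        shutil.move(str(f), target)
```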
GiomiaGS@reddit
Is a Mac mini M4 with 16GB RAM enough to run it?
Felixo22@reddit
True except for the part where we will be able to buy computers with 64gb of ram.
FigZestyclose7787@reddit
Just a few friendly observations:

1) The harness/serving you're using makes ALL the difference in the type of experience you have with these models. Qwen 3.5 models up to the 35B MoE were getting very confused, falling into loops, and barely usable beyond 30k tokens of context or so. After investigating more thoroughly, thinking tokens were being reinserted into every new message and it was confusing the model - something to do with jinja templates/thinking tags for Qwen models. Once I solved it for the pi coding agent I was using, these 3.5 models, even the small ones, are unbeatable in my daily use. I'm talking several hundred tool calls and ralph loops a day. I'm using llama.cpp and the pi coding agent with extensions/fixes for Qwen tool calls/thinking tags.

2) Gemma4 models, in my testing, are very good as well, but consume significantly more memory and are still actively being fixed/baked into llama.cpp. Yesterday's llama.cpp update provided the first decent run of Gemma4 on my system. Overall, comparing Qwen 35B vs Gemma4 26B (MoE models), I haven't found a scenario where Gemma4 was better than Qwen 3.5. Just my 2 cents.

Check your agent harness and model quantization as well. Bartowski has had the MOST stable quants for me. Even at 200k+ tokens, the model maintains strong coherence (Q5_K_L is my favorite quant).
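To make the first observation concrete: the failure mode is old thinking blocks being re-fed to the model with every new request. The actual fix described here lives in the pi extension and the jinja template, but a rough client-side sketch of the same idea looks like this (the `<think>` tag name is an assumption; it varies by model template):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_stale_thinking(messages):
    """Remove reasoning blocks from earlier assistant turns so they are
    not re-fed to the model on every new request."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "List the files in src/"},
    {"role": "assistant", "content": "<think>I should call the ls tool...</think>Calling ls on src/"},
    {"role": "user", "content": "Now read main.py"},
]
print(strip_stale_thinking(history))
```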
alphabetasquiggle@reddit
Thanks for sharing. Would you mind explaining a bit more how you fixed Qwen for pi? My experience so far with cline, roo, zed agent aren't great and I'm interested in trying pi and see how well this would work. I've tried Qwen 3.5 122b and 27b.
FigZestyclose7787@reddit
Sure! See my latest reply above and link to the full write up post on these issues, and what to do.
YudhisthiraMaharaaju@reddit
I wanted to ask the same question too. Please shed some light on this OC.
_VirtualCosmos_@reddit
More than unsloth?
FigZestyclose7787@reddit
Yes for me! Noticeably so, especially after context window gets > 50% usage.
FigZestyclose7787@reddit
As promised, here's a little more context. I wrote a longer response/report on these issues if you're having trouble with Qwen 3.5 models - https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
Basically I'm running the latest llama.cpp + pi coding agent. I noticed basic tool calls were not working, or working only intermittently, and the model was getting confused after the 4th or 5th message, so I grabbed message logs and traced it with Opus 4.6. After tens, more likely hundreds, of back-and-forths, an extension for pi was produced that reads tool calls correctly for Qwen models (as well as Kimi, Minimax and a few others I had serious issues with, served by Nanogpt, which had not fixed it server-side at the time; I don't know if they've fixed it now).
It has been working without a fault since then. I've built my own cowork+openclaw type of app and I run ralph loops almost 24/7, as well as regular chat and minor coding tasks. I'm using the pi sdk agent for that, and the pi coding agent for my personal use. The limitations I find with the Qwen models now are related to knowledge/training rather than missed tool calls. And I'm on Windows, which, if I understand correctly, is a big disadvantage for these models, which are mainly trained on *nix.
My system is very limited - 32GB RAM on i7 + 1080TI (11GB VRAM) but I run Qwen 3.5 35B Q5K_L, 131k context window at 27tps which is really enough for my needs.
As far as my comment on the quants, it has just been my experience, but it also seems like a consensus here on Reddit: if you want bleeding-edge features and speed, go with Unsloth. They're fast, first, and fun. But things might break. I had spent about 4 hours trying to make the first few Unsloth quants work on my system with no success - awful looping, poor quality overall. Later I learned that the first batch had issues. So I tried Bartowski's and never needed to try anything else. It just works, even when context gets used up to the max window. If you want stability, go with Bartowski imho.
I'm far from being an expert, but I'm persistent and have learned a few things along the way. So feel free to ask and I'll share more, if it's useful to anyone. Good luck.
Final comment - What a time to be alive! To have this power on your local machine! What's better than intelligence? (albeit artificial?) I'm very grateful to this whole community.
The_LSD_Soundsystem@reddit
What harness and jinja templates are you using?
FigZestyclose7787@reddit
pi coding agent, vanilla llama.cpp, custom fixes after back-and-forth with Opus 4.6 and reviewing chat logs, tool-use messages, etc. Nothing special.
FigZestyclose7787@reddit
I'll do a more thorough write-up later tonight.
GrungeWerX@reddit
Is the q5kxl bartowski better than unsloth UD q5kxl? <— that’s my daily driver.
Bamny@reddit
I’ve been rocking Gemma4:26b-a4b under the Hermes agent, running on llama.cpp across two 3060 12GB GPUs, and MAN - this thing cranks. Very functional, feels Claude-ish, tool calls are consistent and correct. Really really happy with this one.
qnixsynapse@reddit
Yeah, it is awesome. I also edited the default chat template to include the current date, and manually quantized just the experts to MXFP4 while keeping the rest at their original precision (GPT-OSS style). The resulting size is 16GB and it works the best IMO.
florinandrei@reddit
Yeah. Models getting confused about the calendar makes for amusing, but sometimes annoying, hallucinations.
qnixsynapse@reddit
Yeah, include the date and they become awesome. Here is an example: it searched reports from the last three years before responding.
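If you'd rather not edit the chat template itself, injecting the date into the system message at request time achieves much the same thing. A minimal sketch against an OpenAI-compatible endpoint (the URL and model name below are placeholders; llama.cpp's llama-server, LM Studio, and Jan all expose a local /v1/chat/completions route):

```python
import datetime
import requests

# Placeholder endpoint/model -- point at whatever local server you use.
URL = "http://localhost:8080/v1/chat/completions"

today = datetime.date.today().strftime("%B %d, %Y")
payload = {
    "model": "gemma-4-26b",
    "messages": [
        {"role": "system", "content": f"Today's date is {today}."},
        {"role": "user", "content": "Find reports from the last three years."},
    ],
}
print(requests.post(URL, json=payload, timeout=120).json())
```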
hotcornballer@reddit
How much is exa costing you?
qnixsynapse@reddit
It’s free
oxygen_addiction@reddit
You haven't reached the rate limit yet for free accounts.
KldsSeeGhosts@reddit
What web search are you using if you don’t mind me asking?
qnixsynapse@reddit
It's the default (Exa) MCP that comes with jan.ai.
KldsSeeGhosts@reddit
I’ll take a look into it, wasn’t sure since Exa looked like it could have been cut off or something. Thank you!
Su1tz@reddit
Literally in the image
llama-impersonator@reddit
mxfp4 is not a particularly great choice
Firepal64@reddit
I've come to understand that (for models not trained for MXFP4) IQ4_NL is the best 4-bit quant if you can fully offload the model, followed by Q4_K_* which CPUs can handle better. Is that right?
llama-impersonator@reddit
well, i've tested MSE on a random weight tensor for various quant formats, but i didn't do the IQ quants because it's always been claimed that you need a calibrated imatrix for those to do well. without the calibration, i would expect it to do worse than q4km on an apples to apples comparison, but IQ quants perform well in real situations with calibration.
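For anyone wondering what that MSE comparison looks like in practice, here is a toy version: a symmetric 4-bit block quantizer round-tripped over a random tensor. This is not the real Q4_K or IQ4_NL codec, just an illustration of the methodology:

```python
import numpy as np

def blockwise_q4_roundtrip(w, block=32):
    """Toy symmetric 4-bit block quantization: per-block scale, integer values in [-8, 7]."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0            # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)     # dequantized weights

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=1024 * 1024).astype(np.float32)  # stand-in weight tensor
deq = blockwise_q4_roundtrip(weights)
mse = np.mean((weights - deq) ** 2)
print(f"toy Q4 round-trip MSE: {mse:.3e}")
```

The real formats differ (non-linear grids, super-block scales, importance matrices), but comparing round-trip MSE per format is the same idea.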
Toastti@reddit
Has anyone done an in depth comparison between the Gemma 4 26b and Qwen 3.5 27b? Primarily for coding and agentic work like open code?
Wondering which one works better. I'm sure qwen is slower as it's dense but on a 5090 the speed is quick enough if you have prompt caching on in VLLM
soferet@reddit
I'd read that Qwen3.5-27b was still better at coding than Gemma-4, so this is great news!
How is it conversationally versus Gemma-3?
m3kw@reddit
No man, Gemma 4 did what no model before it could do on my personal coding test prompt.
Radiant-Video7257@reddit
what model are you running specifically ?
m3kw@reddit
26b
FenderMoon@reddit
Gemma4 seems way smarter and way more nuanced.
Gemma3 27B was already really good but Gemma4 leaves you with a much bigger sense of the model having depth and intention behind what it’s saying.
In terms of world knowledge they’re similar. In terms of reasoning, there isn’t really any comparison. It’s like what GPT-4 was to GPT-3.
geringonco@reddit
First test I did on my Android phone (16GB memory): quantized Google Gemma 4 (my own version), running on Google's own app, failed to pass the test. Qwen 3.5 passed it on MNN Chat.
IrisColt@reddit
I'm genuinely perplexed by the downvotes here. This has me eager to conduct head-to-head evaluations to gauge whether Gemma 4 can elicit even a flicker of surprise, especially given that Qwen 3.5 left me thoroughly awestruck when I first put it through its paces. Full disclosure: I'm really a fan of Gemma models but I must give Caesar his due.
GrungeWerX@reddit
Gemma fans are gonna Gemma. Everyone knows Qwen 3.5 is better outside of RPG and translation.
bjodah@reddit
So far I've found gemma 4 26b to be substantially better than its e4b counterpart (which is what I'm guessing you tried?)
soferet@reddit
This is very, very hopeful! Thank you!
misha1350@reddit
No, Gemma 4 is better at coding, but only really at coding. Meanwhile, Gemma 4 26B didn't even know that LeetCode #412 was FizzBuzz and hallucinated a problem for me, whereas Qwen 3.5 35B knew it well. Apparently Gemma 4 has weak internal knowledge and is bad outside of Google AI Studio, which they force-enabled Google searches on for a reason.
Fit_Concept5220@reddit
Literally a real-world example of why so many people are so amazed by Qwen while it's literally a dogshit benchmaxxed model which cannot produce anything coherent outside of what it has memorized (and my guess is LeetCode 412 was in there).
H_DANILO@reddit
That's not true. I had a small home project with 4 bugs that I wanted to fix.
I tried Gemma4 (both the 20 and 30b models) and Qwen.
Qwen fixed all 4. Gemma failed 3.
Fit_Concept5220@reddit
In my experience, giving Qwen a task which is very likely outside of its training (do something in an unpopular language on an unpopular architecture) produces a "code in the air" style response: something that looks like code and reads like code but is absolute bullshit. In my experience this is true of almost every open source model, including the extremely large ones like GLM5 (I ran my tests on 8-bit GGUF quants on an M3 Ultra).
The only models which produce coherent results (not great, but actual code, with signs of architecture and logic) are gpt-oss 20/120b, and now Gemma 4 (I still tend to think that nothing yet beats oss20b in terms of speed/quality, but I need more time to test Gemma once backends and frontends are adapted and fixed).
That being said, that does not mean your results aren't true. It's just likely they are based on tests within some very popular ecosystem (Python, TypeScript/JS, etc.), and these, in my opinion, don't reveal the true nature of an LLM's ability to think/reason about a problem, merely its ability to remember stuff, and they fall apart when the task/question is outside of its training.
Give gpt-oss a proper context, an agentic CLI, and access to docs and it will beat any open source model. The only (major) downside is that these gpt models were trained without the notion of skills, so the model often gets quite confused and it takes a lot of effort to properly bring that knowledge into their contexts. This is why I think Gemma models have higher potential to become the real horsepower of local-first agentic coding (unless OpenAI updates their gpt-oss family, which I doubt they will).
H_DANILO@reddit
My man, LLMs do not think. They infer. It's an inference model. They need prompt and context.
Your point is moot. You're saying Gemma is smarter because it has the capability of spitting out Brainfuck code, but fails at fixing Python and JavaScript bugs. And in the end you're claiming it's smarter because Brainfuck is less mainstream.
Do you get the problem now? Being niche is not being smart. Don't be that guy.
__s@reddit
I was trying to get Qwen 122b-a10b to fix my befunge jit in rust, was amusing watching it actually trace through befunge program thinking about failure
It had some trouble: it kept thinking p should move the pc to the write destination, and when I told it to try to produce smaller reproductions it wrote Befunge code with d for dup instead of :
I even have a cfg interpreter it can use to compare correctness against, but in the end it decided the fix was to disable the jit & always use the interpreter
misha1350@reddit
I think AI bros aren't going to like swallowing this pill.
misha1350@reddit
Qwen3.5 is benchmaxxed while Gemma 4 isn't, right?
WhiskeyNeat123@reddit
Is a 48gb MacBook Pro m5 pro good enough?
I want to build a local exec assistant
Jemito2A@reddit
Running gemma4:e4b 24/7 in a multi-agent system on a 5070 Ti — some real-world notes:
- Gemma4 is genuinely better for introspective/creative tasks. I switched my evening reflection routine from qwen3.5:9b to gemma4:e4b and the quality difference is night and day — deeper analysis, less formulaic output.
- One gotcha nobody mentions: gemma4 requires think: true in the Ollama API, otherwise the response field comes back empty. And the thinking tokens eat into your num_predict budget — set it to 2048+ or you'll get thinking but no actual response. Learned this the hard way today.
- For coding tasks though, I still prefer qwen2.5-coder:14b. Gemma4 tends to be too "philosophical" when you need precise code edits. Different tools for different jobs.
- VRAM note: if you're running gemma4 (9.6GB) and another model back-to-back, watch your VRAM — Ollama keeps models cached for 5 min by default. On 16GB that can cause TDR crashes. Use keep_alive: "30s" in your API calls.
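A minimal sketch of what that looks like against the Ollama chat API, using the think, num_predict, and keep_alive settings mentioned above (the model tag is whatever you pulled; the prompt is just an example):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:e4b",                # whatever tag you pulled
        "messages": [{"role": "user", "content": "Summarize today's journal entry."}],
        "think": True,                        # required, or the response field comes back empty
        "options": {"num_predict": 2048},     # leave room for thinking plus the actual answer
        "keep_alive": "30s",                  # unload quickly to avoid VRAM pressure
        "stream": False,
    },
    timeout=300,
)
data = resp.json()
print(data["message"]["content"])
```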
FusionCow@reddit
31b is still much better. I get that the speed is much worse, but imo I always run the smartest model I can.
tvmaly@reddit
Do you have an estimate of how many input and output tokens it took to build that working project in 3 prompts?
xrvz@reddit
Care to share the prompt?
Necessary-Summer-348@reddit
Been testing it against Llama 3.1 70b for code generation and honestly surprised how well 26b punches above its weight. The instruction following is solid and inference speed makes it actually usable for iterative work. Curious if anyone's hit edge cases where it falls apart though.
locutus1of1@reddit
I was testing it in AI Studio. It did quite well with my (simple) coding prompts, but it failed at translating a simple sentence to English. The dense 31B model translated the same sentence correctly.
Rich_Artist_8327@reddit
I am using gemma4 with vLLM and it's amazing.
swagonflyyyy@reddit
How??? What's your setup?
Rich_Artist_8327@reddit
What do you mean, how? Just like I used gemma3 with vLLM.
I have used it with 2x 5090, 2x 7900 XTX, and even my laptop HX 370 can run it, all with vLLM.
There are Gemma-4-specific vLLM docker containers available; with those, everything just works.
swagonflyyyy@reddit
I see. I was just wondering about running gemma-4 on vllm with turboquant but supposedly turboquant wasn't supported yet on vllm. That's why I held off on it.
jonnyglobal@reddit
Gemma 4 has been a bit of a gamechanger for my OpenClaw. I was using Qwen 3.5 9B at a Q4 for some log analysis and reporting routines. It would succeed on about every other cron and time out on the others. Running these now with Gemma 4 and the output is more consistent while inference seems to be faster as well. Does a better job with strict prompt adherence than Qwen 3.5 (for me anyway). Going to let these go for a few days and see how consistently it performs.
spidLL@reddit
On my 5060 Ti 16GB VRAM, I’m running 26b-a4b:Q4 with 65k tokens of context and offloading 22 layers to the GPU. I get between 20 and 30 t/s. It’s usable, but qwen3.5-9b:Q8 is faster. But I like Gemma 4's “personality”. Very similar vibes to gpt5.4.
Relative_Jackfruit39@reddit
I got 47 t/s with a 5060 8GB. I put the experts on the CPU and was able to get over 100k context. I swapped to an apex it quant that was a little larger at 18GB and I'm getting 42 tok/sec, but the output is much better.
spidLL@reddit
That’s interesting. Would you mind sharing the llama.cpp options?
VoiceApprehensive893@reddit
i get 20t/s on an igpu
3dom@reddit
I have a visual test with a picture of a woman holding a bouquet with 3 types of flowers (dahlias, ranunculus, bunny tail). Ranunculus look like dense roses. Qwen 31B Q4 correctly identifies the flowers; Gemma 26B Q6 calls them roses and only recalls ranunculus after being asked whether those are really roses.
No-Educator-249@reddit
Yeah, I noticed that too. In my case, I had it describe an official illustration of Emilia from Re:Zero (without telling the model her identity) and it did so successfully, but when I asked it to describe a different character, it wasn't able to identify her until I gave it hints. Qwen3.5 35B was able to identify the character successfully without hints.
glenrhodes@reddit
26B MoE at 4B active params is a sweet spot I wish more labs were targeting. Running it at Q4_K_M on a 7900 XT and it crushes most of what I was using Mistral 7B for six months ago. The multimodal capability is the real surprise though. Not frontier quality but way better than I expected from an open weight at this size.
Voxandr@reddit
CLINE Agentic coding is pretty bad with it
Ayumu_Kasuga@reddit
Your template might be wrong, at least if you're using LM Studio, there's a fixed template in the community discussions on huggingface.
Potential-Leg-639@reddit
Thanks for confirming that Qwen3 Coder Next is still the best for local agentic coding - that's also my experience. I haven't tested Gemma4 intensively with agentic coding yet, but I guess it will still be behind Qwen3 Coder Next in agentic workflows, meaning taking on the role of doing the coding itself from a detailed plan. Qwen3 Coder Next still does the job - fast and accurate. I'm not talking about a single-prompt coding task, for example; no idea how it would compare in something like that, but that's not how to use coding agents properly, and there are possibly other models that can do something like that better…
Ayumu_Kasuga@reddit
I haven't tested gemma4 directly, but I've run the coding livebench on it, and it scored higher than even the full-precision version of qwen3 coder next.
Potential-Leg-639@reddit
Oh wow, sounds nice! So probably in a few weeks and some llama.cpp updates later it can get really interesting!
Voxandr@reddit
What I have tested:
- Qwen 3.5 122b = better in some frameworks like Svelte, which is a bit niche.
- Qwen 3 Coder Next = same quality overall, faster, fails at Svelte 5.
- Gemma4 = a lot of tool call format errors.
Bondyevk@reddit
I’ve encountered a lot of math problems with Gemma 4. For example, counting the days between now and 35 years ago. Qwen3.5 is so much better at that.
Teshier-Asspool@reddit
Why would you ever want an LLM to compute that instead of having it code a script that gives you the answer?
Bondyevk@reddit
If building an in-memory script for this question is better, the LLM should have come up with that.
I’m building a memory system and one of the round-trip tests is asking different questions about the same information.
For example:
- My birthdate is the 6th of August 1989.
- How old am I?
- On what day of the week was I born?
- How many days have I lived?
And Gemma 4 completely screwed up the calculations.
Odysseyan@reddit
LLMs are notoriously bad at math of all kinds. It really is best to give it a tool or something to calculate it programmatically.
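For illustration, the kind of snippet the model should be reaching for here is trivial, using the birthdate from the comment above:

```python
from datetime import date

birthdate = date(1989, 8, 6)
today = date.today()

days_lived = (today - birthdate).days
# Subtract 1 if this year's birthday hasn't happened yet.
age = today.year - birthdate.year - ((today.month, today.day) < (birthdate.month, birthdate.day))

print(f"Age: {age}")
print(f"Day of week born: {birthdate.strftime('%A')}")   # 6 Aug 1989 was a Sunday
print(f"Days lived: {days_lived}")
```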
Positive-Power725@reddit
Its memory dates to "today's date (May 23, 2024)". Did you tell it today's date?
Bondyevk@reddit
Do you really think training data and fetching the current date are the same thing? 😅
petuman@reddit
Have you given it tools to execute code?
Bondyevk@reddit
Yes, just like all other models I’ve tested, it has access to tools that can run Bun.
TastyStatistician@reddit
to test its intelligence
phazei@reddit
I mean, I'd expect the LLM to code that script to give the answer, maybe that's what he was doing.
koloved@reddit
Why not ? Its basic math
Im_Still_Here12@reddit
Hmm... I just did this with Gemma4 E4B and ChatGPT for comparison. Both came up with the same answer and Gemma did it faster.
Mollan8686@reddit
Using Gemma4 with Hermes but it’s very messy
CATLLM@reddit
What do you mean?
Mollan8686@reddit
It’s more complex than I thought (my bad). I assumed it would be easier, with many more local services possible, but I’m finding that too many options require paid APIs for interconnecting different services.
garg-aayush@reddit
Is it M4 pro/M5? What kind of tok/s generation are you able to get on your setup?
pizzaisprettyneato@reddit (OP)
m5 pro. Not sure of the exact tokens per second, but it's very fast. It can do a whole thinking block in about a second or two.
BringOutYaThrowaway@reddit
I think the release notes for Ollama 0.20.1 added MLX processing for Apple Silicon. Should be quite speedy.
Beginning-Window-115@reddit
just use omlx
trusty20@reddit
Does anybody have actual side by side comparisons to share or just exuberant hype posts declaring gemma CURRENT_VERSION is the best open source model ever?
HekpoMaH@reddit
I am sorry, can you share what exactly you ran? I have no idea about Qwen, but Gemma 4 is failing miserably at agentic coding for me, and I've gone as far as q8 quants.
The dense model is a bit better, in the sense that its tool calls don't fail, but the agentic coding experience is also bad -- repetitive, doesn't get to the point, only wastes energy.
ZenaMeTepe@reddit
Even Opus does that for me from time to time.
FinancialBandicoot75@reddit
I’m curious if my m3 max 36 will power it
veramaz1@reddit
Thank you for sharing your experience, it is very useful
misha1350@reddit
For CODING it is. Meanwhile, Gemma 4 26B didn't even know that LeetCode #412 was FizzBuzz and hallucinated a problem for me, whereas Qwen 3.5 35B knew it well. Apparently Gemma 4 has weak internal knowledge and is bad outside of Google AI Studio, which they force-enabled Google searches on for a reason.
eek04@reddit
Of all the things for my model to spend its few billion parameters on, why would I want it to prioritize that little bit of trivia?
Not knowing this might just reflect better training/retention priorities.
misha1350@reddit
Because if it's bad at knowledge like this for coding, how much worse do you think it's going to be for things outside of coding?
eek04@reddit
That's not knowledge for coding. I'd actively filter out crap like that from the training set if I was going for a model that's good for coding.
Hell, I've been coding for over 40 years, and I had no idea about leetcode even existing; it's just not meaningful.
RainierPC@reddit
Agreed. Now if you gave it the requirements instead of just a name that could or could not be in its training data, I'm fairly sure it will get the code right.
SatoshiNotMe@reddit
The 26B-A4B variant has the best TG and PP speeds of all the recent open-weight models. E.g., in Claude Code via llama-server I’m able to get 40 tok/s TG, nearly double what I got with the comparable Qwen MoE (35B-A3B) on my M1 Max MacBook Pro 64GB. Full instructions and comparisons here.
However, my biggest concern is agentic/tool abilities: on tau2-bench Gemma4 does much worse than Qwen3.5 (68% vs 81%):
https://news.ycombinator.com/item?id=47616761
Designer_Reaction551@reddit
The 128k context is what changes the equation for me. Longer context means you can pass more state into the pipeline without chunking - that's genuinely useful for agent workflows. The multimodal capability is also surprisingly solid for a model this size. What hardware are you running it on?
Limp_Classroom_2645@reddit
Hype slop
daronjay@reddit
Comment slop
Difficult-Drummer407@reddit
Curious to know if you’re using ollama or llama.cpp or LMstudio to load it. I loaded it in ollama on a 64gb M1 Max studio and it took 12 minutes to answer a simple question. Still scratching my head. Any help appreciated.
Emotional-Breath-838@reddit
Your 64GB makes all the difference in the world. My 24GB Mini is struggling to hit the sweet spot of speed, context, and intelligence. You've got room to optimize all three, and the models you can run are jaw-dropping vs just six months ago.
congrats!
the_renaissance_jack@reddit
I experimented today with running 26B-a4b on my 24 GB M4 through oMLX. I'm getting surprisingly good results and speed talking to the model through Obsidian Copilot
pizzaisprettyneato@reddit (OP)
Yeah I’ve been waiting for a good time to upgrade my old Mac and the improved llm performance on the m5 convinced me
weiyong1024@reddit
26b MoE on a 64gb mac is kind of the sweet spot right now. only loads the active expert weights so you get way more usable context than you'd expect from the param count. qwen 3.5 27b is still better for pure code imo but gemma handles everything else without choking
kweglinski@reddit
It loads the whole model, so the available space is the same. The gain is in speed, not size. 27b-a4b will weigh the same as 27b.
weiyong1024@reddit
right, I was thinking of inference speed, not memory. The active params per token is what's smaller; the total weight is still the full 27b in RAM.
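Rough back-of-the-envelope numbers for the 26B-A4B case (the ~4.5 bits/weight figure is an assumed average for a Q4-style quant):

```python
# Approximate memory footprint: all 26B weights must be resident,
# regardless of how many are active per token.
total_params = 26e9
active_params = 4e9
bits_per_weight = 4.5          # rough average for a Q4_K-style quant (assumption)

resident_gb = total_params * bits_per_weight / 8 / 1e9
active_gb = active_params * bits_per_weight / 8 / 1e9

print(f"weights in RAM:         ~{resident_gb:.1f} GB")   # what you must fit (~14.6 GB)
print(f"weights read per token: ~{active_gb:.1f} GB")     # what drives speed (~2.3 GB)
```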
Small-Challenge2062@reddit
Is that 64GB of RAM or VRAM?
Waste-Intention-2806@reddit
You can turn off thinking in Qwen models.
m_tao07@reddit
Also tried Gemma 26B at UD-IQ2. It fits in my RTX A2000 at 24k context. I get around 45 tokens/second and it feels good to use for normal tasks. I even asked it about an assignment in my native language; it understood and came up with genuinely great feedback, except for grammar, where it "corrected" a word to the same word. For vision it runs on the CPU. I like how self-aware it is: I sent it a screenshot of the conversation and it recognized that it was our conversation. I asked it about my CPU, and all the other models in that range told me the wrong specs apart from this one. With the Qwen3.5 35B model I experienced the same issue others describe, where it would reason infinitely, repeating itself in an unsure way.
Parliament5@reddit
How are you running it on your Mac? I have the same 64gb configuration and I've been trying to get it to work with llama.cpp, but it's not quite working.
pizzaisprettyneato@reddit (OP)
I’m running on ollama with 8 bit variant. I also had problems with llama cpp a couple of days ago. Thought I’d give it a try in ollama and it did amazing
1asutriv@reddit
Llama.cpp has a CUDA bug with Gemma 4 if you're using the latest version. You can get around it by using CUDA 13 instead of 13.2. Works like a charm.
LeonTheTaken@reddit
Nope, 13.0 and 13.1 still have the CUDA illegal memory access bug for me. Qwen3.5 runs fine with no problem on 13.2, by the way.
LeonTheTaken@reddit
Does 13.1 not work either? I’m currently using 13.2.
FluentFreddy@reddit
I use ollama too but it doesn’t invoke tools or shell commands. What am I missing?
pizzaisprettyneato@reddit (OP)
I was using it with github copilot in vscode. Maybe it's editor/terminal related?
4xi0m4@reddit
The MoE architecture really is a game changer for local inference. Gemma 4 26B hits a nice balance between capability and resource usage, making it feel like the first truly practical daily driver for folks without workstation-grade hardware. Curious how it handles longer debugging sessions though, since that tends to stress memory in ways short prompts don't reveal.
JohnMason6504@reddit
Gemma 4 26b has been surprisingly good for tool-calling and agentic coding on my setup too. Running Q8 on 64GB and the context handling is noticeably cleaner than Qwen 3.5. Less looping, fewer hallucinated file paths. The 48k effective context window also helps when you have large codebases to reason over. Only downside is GGUF quantization support is still rough in some backends.
IsThisStillAIIs2@reddit
yeah gemma 4 26b feels like it hits a really nice balance point right now, especially for “just get it done” tasks where overthinking hurts more than it helps. i’ve seen the same thing with qwen variants where they’re technically strong but can spiral into tool loops or second guessing, especially when quantized. gemma seems more decisive, which ironically makes it more useful day to day even if it’s not topping every benchmark. honestly feels like we’re entering that phase where model “personality” matters as much as raw capability for local use.
usrnamechecksoutx@reddit
I do somewhat simple text-based work (feed LLMs my interview notes and ask them to write an interview report). Used to do this with SOTA models and since ChatGPT5 results were great. However, I needed to redact all PII which was a PITA. Bought a Macbook Air with 32GB, tried Qwen3.5, results were subpar. Two days later Gemma4 was released. 31B-IQ4_XS is incredible, results are 95% of ChatGPT and very much usable - on a Macbook Air! 3-4t/s is slow but I don't mind it in my workflow, as I do something else in the meantime and just come back once it's done after a few minutes. Will get the maxed out M5U MacStudio once it releases; I think in the next few months we'll see local models that reach SOTA levels with manageable hardware setups.
DeepOrangeSky@reddit
Did you also try Qwen3.5 27b (dense) and Gemma4 31b (dense) to see how those compare against the Qwen3.5 MoE model and the Gemma4 MoE model?
I know they are of course a lot slower than the similarly sized MoE counterparts, but people were saying they are quite a bit stronger than the MoE ones. Thus, in terms of total time spent on an overall task, they can potentially be "faster" sometimes, if they can do things in fewer attempts (or do the thing at all vs not being able to), compared to the MoE ones, even if the MoE ones run at more tokens/second. Obviously it varies depending on the specific task at hand and the type of use case (and occasionally just luck from attempt to attempt, I guess).
Anyway, curious if you tried those as well and how they compared in your opinion and for what you tried on them.
robertpro01@reddit
I guess I need to try it again, because for my tests, it was terrible at coding.
I tested the same day it was released at Q6 and 128k context
Dense_Business_6570@reddit
I know, right? I just started using gemma4 3 days ago and cannot believe how much better it performs on both reasoning and speed due to its MoE. I tried a bunch of others before, up to 30b models that would fit on my 24GB VRAM card, and the difference is night and day.
florinandrei@reddit
Which harness do you use? OpenCode? Something else?
Johnwascn@reddit
Gemma4 seems to currently have an issue with excessive memory consumption for its key-value cache; I haven't tried it yet.
However, I found Qwen3 Next Coder (q8) and Qwen3.5-122b (q4) to be very accurate in their tool usage, consistently running dozens of times without errors. I've already integrated them with Claude Code, and the results are quite good.
My experience with configuration is that the key-value cache is best configured with F16 precision; otherwise, performance will be severely impacted.
CryptoUsher@reddit
gemma's efficiency on mac metal is wild, but how does it handle longer debugging sessions? i'm still stuck on smaller models for sustained work.