Gemma 4 26b is the perfect all-around local model and I'm surprised how well it does.
Posted by pizzaisprettyneato@reddit | LocalLLaMA | 169 comments
I got a 64GB Mac about a month ago and I've been trying to find a model that is reasonably quick, decently good at coding, and doesn't overload my system. The test I've been running is having it create a Doom-style raycaster in HTML and JS.
I've been told Qwen 3 Coder Next was the king, and while it's good, the 4-bit variant always put my system near the edge. Also, I don't know if it was because of the 4-bit quant, but it would always miss tool uses and get stuck in a loop guessing the right params. In the Doom test it would usually get there and make something decent, but only after getting stuck in a loop of bad tool calls for a while.
Qwen 3.5 (the near-30b MoE variant) could never do it in my experience. It always got stuck in a thinking loop and then would become so unsure of itself it would just end up rewriting the same file over and over and never finish.
But Gemma 4 just crushed it, making something working after only 3 prompts. It also limited its thinking and didn't get too lost in details, it just did it. It's the first time I've run a local model and been actually surprised that it worked great, without any weirdness.
It makes me excited about the future of local models, and I wouldn't be surprised if in 2-3 years we'll be able to use very capable local models that can compete with the sonnets of the world.
Tunashavetoes@reddit
Does anybody else's Gemma 4 26B or 31B get stuck in a search loop when you ask it to look things up? Like it'll fire off 30 different searches and queue them up until they're all finished before giving me a response.
RainierPC@reddit
I had this, but I added instructions to the system prompt limiting the use of the search tool to once per turn, and it worked after that.
sagefields123@reddit
What instruction? I tried it, but it still made 45 web search tool calls, while Qwen3.5 kept it minimal (I was just asking for the weather).
RainierPC@reddit
Literally "Use at most one tool call per turn."
sagefields123@reddit
I literally only had this in the system prompt; it worked sometimes but not reliably.
sagefields123@reddit
Same. Gemma 4 on LM Studio, both the 26B and 31B models. System prompt or temperature settings did not solve it for me. I'm keeping Qwen3.5.
port888@reddit
Yep, facing this issue on LM Studio with Gemma 4 26B A4B. I'll just revisit Gemma 4 once this issue is ironed out.
littlle@reddit
I use the 31b on my laptop and I have no issues. I run it in the console with Ollama.
arman-d0e@reddit
Lm studio? Gguf still feels broken to me rn
gpalmorejr@reddit
That's interesting. My Qwen3.5-35B-A3B did great with coding. The only issue I had was a weird context glitch somewhere between Qwen and Roo talking one time. Other than that it has been flawless.
pizzaisprettyneato@reddit (OP)
Yeah I dunno I just ended up having problems and I don’t know why. It’s very possible I didn’t test the setting enough. Gemma just worked for me without any adjustments
johnfkngzoidberg@reddit
Gemma 4 is very slow for me. Qwen3.5 just works out of the box. I also get a lot more context from Qwen in the same size VRAM
gpalmorejr@reddit
Same. But are you using the Gemma 4 MoE model or the dense model? It'll make a huge difference.
johnfkngzoidberg@reddit
Not sure, which works better? I’ll check later.
gpalmorejr@reddit
Interesting. I'd be looking at settings and runtime updates. I half expected you to say dense model as I have seen that one a lot. (People seem to confuse that a lot and not realize the difference in speed).
gpalmorejr@reddit
I tried Gemma4 in LM Studio and it failed to even load the model. I tried a bunch of times and with different versions/quants/sources and never got it to work. So I just stuck with Qwen, since it's been the only one that has given me so little trouble and shown so much potential while fitting my hardware. I'm hoping some runtime update or something fixes it soon so I can try it, though.
Jeidoz@reddit
Check the latest beta update of LM Studio. There were fixes for Gemma.
raindownthunda@reddit
This. You need the latest runtime. I got the CPU (slow) version with the latest app update, but when I switched to the beta channel the CUDA version showed up. Works great!
gpalmorejr@reddit
I just noticed that, too. I'll check it out soon.
AromaticBear777@reddit
There was an update to fix Gemma failing at load time. Make sure you are on 0.4.9+1; that fixed it for me.
gpalmorejr@reddit
I did just update. My current daily driver model has been busy refactoring all day, so I'll have to try it later. (Although this is more a hardware speed problem than a bad model problem lol)
El_Hobbito_Grande@reddit
I got it to work well with Ollama
gpalmorejr@reddit
I may have to try that. Ollama uses llama.cpp too, doesn't it? I know LM Studio does, and they JUST updated the runtime, so it may be time to try again.
El_Hobbito_Grande@reddit
Yeah, the pace of updates for just about everything AI is nuts right now. I just updated oMLX to run Gemma 4. So far so good, but I've only tested the 4b model on it.
gpalmorejr@reddit
How is it compared to Qwen3.5-4B? Or do you know?
jonnyglobal@reddit
When loading with Ollama, Ollama prompted me to update the version and from there it was seamless.
gpalmorejr@reddit
Yeah, LM Studio just got a llama.cpp update too.
DepictWeb@reddit
Just update LM Studio. The first-day release had some issues.
gpalmorejr@reddit
Yeah, I saw that the llama.cpp runtime updated. I need to try it again. I assume it has something to do with the unusual PLE architecture but who knows. I haven't looked into it, yet.
StardockEngineer@reddit
The Unsloth team has a guide on the optimal settings.
Voxandr@reddit
With Cline (which Roo was forked from), Gemma 4 fails hard from time to time.
cmenke1983@reddit
Did you adjust frequency, presence and repetition penalty parameters?
rm-rf-rm@reddit
It's most likely that you haven't provided a big enough system prompt (and/or used the official recommended params for thinking vs non-thinking, agentic vs chat use cases) - do a search on this sub, there were tons of posts in the past few weeks about this.
relmny@reddit
Me neither, and I run 27b, 122b, 397b and 35b, and after the (Unsloth and Bartowski) quants were fixed, never had any issue. And I run them almost daily...
But I use llama.cpp/ik_llama.cpp and follow the default settings by Qwen (or Unsloth)...
gpalmorejr@reddit
I also use the Unsloth variants and have been loving them. Also using llama.cpp through LM Studio. I don't have nearly the hardware (Ryzen7 5700, 33GB DDR4, GTX1060 6GB) for those larger models, but similarly, 35B has been loaded up for me for at least a month now with little interruption and has done everything I needed. The only intermittent issues have come from me tweaking it too much and causing freezes and crashes, or from running something really big beside it until one or the other gives up lol.
roosterfareye@reddit
Mine as well. I use it to quickly plan and iterate and develop test plans, test, fix, then review with qwen 3.5 27b. Both models are very good.
gpalmorejr@reddit
Unfortunately for me (Ryzen7 5700, 32GB DDR4 RAM, GTX1060), I'll be waiting until the heat death of the universe for any complex agentic coding solution to finish, probably including any realistic length of code and such. I had 27B generate a python script to flatten a bunch of directories and move all the photos and videos to another folder from an old backup of another computer. It took like 15 or 20 minutes lol.
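For reference, that kind of flatten-and-move script is only a few lines of Python. A rough sketch, with the source/destination paths and the extension list as placeholders:

```python
from pathlib import Path
import shutil

# Placeholder paths -- adjust to the actual backup and destination folders.
SRC = Path("old_backup")
DST = Path("flattened_media")
MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mov", ".avi"}

DST.mkdir(exist_ok=True)
for f in SRC.rglob("*"):
    if f.is_file() and f.suffix.lower() in MEDIA_EXTS:
        target = DST / f.name
        # Avoid overwriting files that share a name across directories.
        stem, n = f.stem, 1
        while target.exists():
            target = DST / f"{stem}_{n}{f.suffix}"
            n += 1
        shutil.move(str(f), target)
```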
GiomiaGS@reddit
Is a Mac mini M4 with 16GB RAM enough to run it?
Felixo22@reddit
True except for the part where we will be able to buy computers with 64gb of ram.
FigZestyclose7787@reddit
Just a few friendly observations:

1) The harness/serving you're using makes ALL the difference in the type of experience you have with these models. Qwen 3.5 models up to the 35B MoE were getting very confused, falling into loops, and barely usable beyond 30k tokens of context or so. After investigating more thoroughly, thinking tokens were being reinserted into every new message and it was confusing the model - something to do with jinja templates/thinking tags for Qwen models. Once I solved it for the pi coding agent I was using, these 3.5 models, even the small ones, are unbeatable in my daily use. I'm talking several hundred tool calls and ralph loops a day. I'm using llama.cpp and the pi coding agent with extensions/fixes for Qwen tool calls/thinking tags.

2) Gemma4 models, in my testing, are very good as well, but consume significantly more memory and are still actively being fixed/baked into llama.cpp. Yesterday's llama.cpp update provided the first decent run of Gemma4 on my system. Overall, comparing Qwen 35B vs Gemma4 26B (MoE models), I haven't found a scenario where Gemma4 was better than Qwen 3.5. Just my 2 cents.

Check your agent harness and model quantization as well. Bartowski has had the MOST stable quants for me. Even at 200k+ tokens, the model maintains strong coherence (Q5_K_L is my favorite quant).
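To make the first observation concrete: the failure mode is old thinking blocks being re-fed to the model with every new request. The actual fix described here lives in the pi extension and the jinja template, but a rough client-side sketch of the same idea looks like this (the `<think>` tag name is an assumption; it varies by model template):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_stale_thinking(messages):
    """Remove reasoning blocks from earlier assistant turns so they are
    not re-fed to the model on every new request."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "List the files in src/"},
    {"role": "assistant", "content": "<think>I should call the ls tool...</think>Calling ls on src/"},
    {"role": "user", "content": "Now read main.py"},
]
print(strip_stale_thinking(history))
```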
alphabetasquiggle@reddit
Thanks for sharing. Would you mind explaining a bit more how you fixed Qwen for pi? My experience so far with cline, roo, zed agent aren't great and I'm interested in trying pi and see how well this would work. I've tried Qwen 3.5 122b and 27b.
FigZestyclose7787@reddit
Sure! See my latest reply above and link to the full write up post on these issues, and what to do.
YudhisthiraMaharaaju@reddit
I wanted to ask the same question too. Please shed some light on this OC.
_VirtualCosmos_@reddit
More than unsloth?
FigZestyclose7787@reddit
Yes for me! Noticeably so, especially after context window gets > 50% usage.
FigZestyclose7787@reddit
As promised, here's a little more context. I wrote a longer response/report on these issues if you're having trouble with Qwen 3.5 models - https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
Basically I'm running the latest llama.cpp + pi coding agent. I noticed basic tool calls were not working, or working only intermittently, and the model was getting confused after the 4th or 5th message, so I grabbed message logs and traced it with Opus 4.6. After tens, more likely hundreds, of back-and-forths, an extension for pi was produced that reads tool calls correctly for Qwen models (as well as Kimi, Minimax and a few others I had serious issues with, served by Nanogpt, which had not fixed it server-side at the time; I don't know if they've fixed it now).
It has been working without a fault since then. I've built my own cowork+openclaw type of app and I run ralph loops almost 24/7, as well as regular chat and minor coding tasks. I'm using the pi sdk agent for that, and the pi coding agent for my personal use. The limitations I find with the Qwen models now are related to knowledge/training rather than missed tool calls. And I'm on Windows, which, if I understand correctly, is a big disadvantage for these models, which are mainly trained on *nix.
My system is very limited - 32GB RAM on i7 + 1080TI (11GB VRAM) but I run Qwen 3.5 35B Q5K_L, 131k context window at 27tps which is really enough for my needs.
As far as my comment on the quants, it has just been my experience, but it also seems like a consensus here on Reddit: if you want bleeding-edge features and speed, go with Unsloth. They're fast, first, and fun. But things might break. I had spent about 4 hours trying to make the first few Unsloth quants work on my system with no success - awful looping, poor quality overall. Later I learned that the first batch had issues. So I tried Bartowski's and never needed to try anything else. It just works, even when context gets used up to the max window. If you want stability, go with Bartowski imho.
I'm far from being an expert, but I'm persistent and have learned a few things along the way. So feel free to ask and I'll share more, if it's useful to anyone. Good luck.
Final comment - What a time to be alive! To have this power on your local machine! What's better than intelligence? (albeit artificial?) I'm very grateful to this whole community.
The_LSD_Soundsystem@reddit
What harness and jinja templates are you using?
FigZestyclose7787@reddit
pi coding agent, vanilla llama.cpp, custom fixes after back-and-forth with Opus 4.6 and reviewing chat logs, tool-use messages, etc. Nothing special.
FigZestyclose7787@reddit
I'll do a more thorough write-up later tonight.
GrungeWerX@reddit
Is the q5kxl bartowski better than unsloth UD q5kxl? <— that’s my daily driver.
Bamny@reddit
I’ve been rocking Gemma4:26b-a4b under the Hermes agent, running on llama.cpp across two 3060 12GB GPUs, and MAN - this thing cranks. Very functional, feels Claude-ish, tool calls are consistent and correct. Really really happy with this one.
qnixsynapse@reddit
Yeah, it is awesome. I also edited the default chat template to include the current date, and manually quantized just the experts to MXFP4 while keeping the rest at their original precision (GPT-OSS style). The resulting size is 16GB and it works the best IMO.
florinandrei@reddit
Yeah. Models getting confused about the calendar makes for amusing, but sometimes annoying, hallucinations.
qnixsynapse@reddit
Yeah, include the date and they become awesome. Here is an example: it searched reports from the last three years before responding.
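If you'd rather not edit the chat template itself, injecting the date into the system message at request time achieves much the same thing. A minimal sketch against an OpenAI-compatible endpoint (the URL and model name below are placeholders; llama.cpp's llama-server, LM Studio, and Jan all expose a local /v1/chat/completions route):

```python
import datetime
import requests

# Placeholder endpoint/model -- point at whatever local server you use.
URL = "http://localhost:8080/v1/chat/completions"

today = datetime.date.today().strftime("%B %d, %Y")
payload = {
    "model": "gemma-4-26b",
    "messages": [
        {"role": "system", "content": f"Today's date is {today}."},
        {"role": "user", "content": "Find reports from the last three years."},
    ],
}
print(requests.post(URL, json=payload, timeout=120).json())
```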
hotcornballer@reddit
How much is exa costing you?
qnixsynapse@reddit
It’s free
oxygen_addiction@reddit
You haven't reached the rate limit yet for free accounts.
KldsSeeGhosts@reddit
What web search are you using if you don’t mind me asking?
qnixsynapse@reddit
It's the default (Exa) MCP that comes with jan.ai.
KldsSeeGhosts@reddit
I’ll take a look into it, wasn’t sure since Exa looked like it could have been cut off or something. Thank you!
Su1tz@reddit
Literally in the image
llama-impersonator@reddit
mxfp4 is not a particularly great choice
Firepal64@reddit
I've come to understand that (for models not trained for MXFP4) IQ4_NL is the best 4-bit quant if you can fully offload the model, followed by Q4_K_* which CPUs can handle better. Is that right?
llama-impersonator@reddit
well, i've tested MSE on a random weight tensor for various quant formats, but i didn't do the IQ quants because it's always been claimed that you need a calibrated imatrix for those to do well. without the calibration, i would expect it to do worse than q4km on an apples to apples comparison, but IQ quants perform well in real situations with calibration.
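For anyone wondering what that MSE comparison looks like in practice, here is a toy version: a symmetric 4-bit block quantizer round-tripped over a random tensor. This is not the real Q4_K or IQ4_NL codec, just an illustration of the methodology:

```python
import numpy as np

def blockwise_q4_roundtrip(w, block=32):
    """Toy symmetric 4-bit block quantization: per-block scale, integer values in [-8, 7]."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0            # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)     # dequantized weights

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=1024 * 1024).astype(np.float32)  # stand-in weight tensor
deq = blockwise_q4_roundtrip(weights)
mse = np.mean((weights - deq) ** 2)
print(f"toy Q4 round-trip MSE: {mse:.3e}")
```

The real formats differ (non-linear grids, super-block scales, importance matrices), but comparing round-trip MSE per format is the same idea.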
Toastti@reddit
Has anyone done an in depth comparison between the Gemma 4 26b and Qwen 3.5 27b? Primarily for coding and agentic work like open code?
Wondering which one works better. I'm sure qwen is slower as it's dense but on a 5090 the speed is quick enough if you have prompt caching on in VLLM
soferet@reddit
I'd read that Qwen3.5-27b was still better at coding than Gemma-4, so this is great news!
How is it conversationally versus Gemma-3?
m3kw@reddit
No man, Gemma 4 did what no model before it could do on my personal coding test prompt.
Radiant-Video7257@reddit
what model are you running specifically ?
m3kw@reddit
26b
FenderMoon@reddit
Gemma4 seems way smarter and way more nuanced.
Gemma3 27B was already really good but Gemma4 leaves you with a much bigger sense of the model having depth and intention behind what it’s saying.
In terms of world knowledge they’re similar. In terms of reasoning, there isn’t really any comparison. It’s like what GPT-4 was to GPT-3.
geringonco@reddit
First test I did on my Android phone (16GB memory): quantized Google Gemma 4 (my own version), running on Google's own app, failed to pass the test. Qwen 3.5 passed it on MNN Chat.
IrisColt@reddit
I'm genuinely perplexed by the downvotes here. This has me eager to conduct head-to-head evaluations to gauge whether Gemma 4 can elicit even a flicker of surprise, especially given that Qwen 3.5 left me thoroughly awestruck when I first put it through its paces. Full disclosure: I'm really a fan of Gemma models but I must give Caesar his due.
GrungeWerX@reddit
Gemma fans are gonna Gemma. Everyone knows Qwen 3.5 is better outside of RPG and translation.
bjodah@reddit
So far I've found gemma 4 26b to be substantially better than its e4b counterpart (which is what I'm guessing you tried?)
soferet@reddit
This is very, very hopeful! Thank you!
misha1350@reddit
No, Gemma 4 is better at coding, but only really at coding. Meanwhile, Gemma 4 26B didn't even know that LeetCode #412 was FizzBuzz and hallucinated a problem for me, whereas Qwen 3.5 35B knew it well. Apparently Gemma 4 has weak internal knowledge and is bad outside of Google AI Studio, which they force-enabled Google searches on for a reason.
Fit_Concept5220@reddit
Literally a real-world example of why so many people are so amazed by Qwen while it's literally a dogshit benchmaxxed model which cannot produce anything coherent outside of what it has memorized (and my guess is LeetCode 412 was in there).
H_DANILO@reddit
That's not true. I had a small home project with 4 bugs that I wanted to fix.
I tried Gemma4 (both the 20 and 30b models) and Qwen.
Qwen fixed all 4. Gemma failed 3.
Fit_Concept5220@reddit
In my experience, giving Qwen a task which is very likely outside of its training (do something in an unpopular language on an unpopular architecture) produces a "code in the air" style response: something that looks like code and reads like code but is absolute bullshit. In my experience this is true of almost every open source model, including the extremely large ones like GLM5 (I ran my tests on 8-bit GGUF quants on an M3 Ultra).
The only models which produce coherent results (not great, but actual code, with signs of architecture and logic) are gpt-oss 20/120b, and now Gemma 4 (I still tend to think that nothing yet beats oss20b in terms of speed/quality, but I need more time to test Gemma once backends and frontends are adapted and fixed).
That being said, that does not mean your results aren't true. It's just likely they are based on tests within some very popular ecosystem (Python, TypeScript/JS, etc.), and these, in my opinion, don't reveal the true nature of an LLM's ability to think/reason about a problem, merely its ability to remember stuff, and they fall apart when the task/question is outside of its training.
Give gpt-oss a proper context, an agentic CLI, and access to docs and it will beat any open source model. The only (major) downside is that these gpt models were trained without the notion of skills, so the model often gets quite confused and it takes a lot of effort to properly bring that knowledge into their contexts. This is why I think Gemma models have higher potential to become the real horsepower of local-first agentic coding (unless OpenAI updates their gpt-oss family, which I doubt they will).
H_DANILO@reddit
My man, LLMs do not think. They infer. It's an inference model. They need prompt and context.
Your point is moot. You're saying Gemma is smarter because it has the capability of spitting out Brainfuck code, but fails at fixing Python and JavaScript bugs. And in the end you're claiming it's smarter because Brainfuck is less mainstream.
Do you get the problem now? Being niche is not being smart. Don't be that guy.
__s@reddit
I was trying to get Qwen 122b-a10b to fix my befunge jit in rust, was amusing watching it actually trace through befunge program thinking about failure
It had some trouble: it kept thinking p should move the pc to the write destination, and when I told it to try to produce smaller reproductions it wrote Befunge code with d for dup instead of :
I even have a cfg interpreter it can use to compare correctness against, but in the end it decided the fix was to disable the jit & always use the interpreter
misha1350@reddit
I think AI bros aren't going to like swallowing this pill.
misha1350@reddit
Qwen3.5 is benchmaxxed while Gemma 4 isn't, right?
WhiskeyNeat123@reddit
Is a 48gb MacBook Pro m5 pro good enough?
I want to build a local exec assistant
Jemito2A@reddit
Running gemma4:e4b 24/7 in a multi-agent system on a 5070 Ti — some real-world notes:
- Gemma4 is genuinely better for introspective/creative tasks. I switched my evening reflection routine from qwen3.5:9b to gemma4:e4b and the quality difference is night and day — deeper analysis, less formulaic output.
- One gotcha nobody mentions: gemma4 requires think: true in the Ollama API, otherwise the response field comes back empty. And the thinking tokens eat into your num_predict budget — set it to 2048+ or you'll get thinking but no actual response. Learned this the hard way today.
- For coding tasks though, I still prefer qwen2.5-coder:14b. Gemma4 tends to be too "philosophical" when you need precise code edits. Different tools for different jobs.
- VRAM note: if you're running gemma4 (9.6GB) and another model back-to-back, watch your VRAM — Ollama keeps models cached for 5 min by default. On 16GB that can cause TDR crashes. Use keep_alive: "30s" in your API calls.
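A minimal sketch of what that looks like against the Ollama chat API, using the think, num_predict, and keep_alive settings mentioned above (the model tag is whatever you pulled; the prompt is just an example):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:e4b",                # whatever tag you pulled
        "messages": [{"role": "user", "content": "Summarize today's journal entry."}],
        "think": True,                        # required, or the response field comes back empty
        "options": {"num_predict": 2048},     # leave room for thinking plus the actual answer
        "keep_alive": "30s",                  # unload quickly to avoid VRAM pressure
        "stream": False,
    },
    timeout=300,
)
data = resp.json()
print(data["message"]["content"])
```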
FusionCow@reddit
31b is still much better. I get that the speed is much worse, but imo I always run the smartest model I can.
tvmaly@reddit
Do you have an estimate of how many input and output tokens it took to build that working project in 3 prompts?
xrvz@reddit
Care to share the prompt?
Necessary-Summer-348@reddit
Been testing it against Llama 3.1 70b for code generation and honestly surprised how well 26b punches above its weight. The instruction following is solid and inference speed makes it actually usable for iterative work. Curious if anyone's hit edge cases where it falls apart though.
locutus1of1@reddit
I was testing it in AI Studio. It did quite well with my (simple) coding prompts, but it failed at translating a simple sentence to English. The dense 31B model translated the same sentence correctly.
Rich_Artist_8327@reddit
I am using gemma4 with vLLM and it's amazing.
swagonflyyyy@reddit
How??? What's your setup?
Rich_Artist_8327@reddit
What do you mean, how? Just like I used gemma3 with vLLM.
I have used it with 2x 5090, 2x 7900 XTX, and even my laptop HX 370 can run it, all with vLLM.
There are Gemma-4-specific vLLM docker containers available; with those, everything just works.
swagonflyyyy@reddit
I see. I was just wondering about running gemma-4 on vllm with turboquant but supposedly turboquant wasn't supported yet on vllm. That's why I held off on it.
jonnyglobal@reddit
Gemma 4 has been a bit of a gamechanger for my OpenClaw. I was using Qwen 3.5 9B at a Q4 for some log analysis and reporting routines. It would succeed on about every other cron and time out on the others. Running these now with Gemma 4 and the output is more consistent while inference seems to be faster as well. Does a better job with strict prompt adherence than Qwen 3.5 (for me anyway). Going to let these go for a few days and see how consistently it performs.
spidLL@reddit
On my 5060 Ti 16GB VRAM, I’m running 26b-a4b:Q4 with 65k tokens of context and offloading 22 layers to the GPU. I get between 20 and 30 t/s. It’s usable, but qwen3.5-9b:Q8 is faster. But I like Gemma 4's “personality”. Very similar vibes to gpt5.4.
Relative_Jackfruit39@reddit
I got 47 t/s with a 5060 8GB. I put the experts on the CPU and was able to get over 100k context. I swapped to an apex it quant that was a little larger at 18GB and I'm getting 42 tok/sec, but the output is much better.
spidLL@reddit
That’s interesting. Would you mind sharing the llama.cpp options?
VoiceApprehensive893@reddit
i get 20t/s on an igpu
3dom@reddit
I have a visual test with a picture of a woman holding a bouquet with 3 types of flowers (dahlias, ranunculus, bunny tail). Ranunculus look like dense roses. Qwen 31B Q4 correctly identifies the flowers; Gemma 26B Q6 calls them roses and only recalls ranunculus after being asked whether those are really roses.
No-Educator-249@reddit
Yeah, I noticed that too. In my case, I had it describe an official illustration of Emilia from Re:Zero (without telling the model her identity) and it did so successfully, but when I asked it to describe a different character, it wasn't able to identify her until I gave it hints. Qwen3.5 35B was able to identify the character successfully without hints.
glenrhodes@reddit
26B MoE at 4B active params is a sweet spot I wish more labs were targeting. Running it at Q4_K_M on a 7900 XT and it crushes most of what I was using Mistral 7B for six months ago. The multimodal capability is the real surprise though. Not frontier quality but way better than I expected from an open weight at this size.
Voxandr@reddit
CLINE Agentic coding is pretty bad with it
Ayumu_Kasuga@reddit
Your template might be wrong, at least if you're using LM Studio, there's a fixed template in the community discussions on huggingface.
Potential-Leg-639@reddit
Thanks for confirming that Qwen3 Coder Next is still the best for local agentic coding - that's also my experience. I haven't tested Gemma4 intensively with agentic coding yet, but I guess it will still be behind Qwen3 Coder Next in agentic workflows, meaning taking on the role of doing the coding itself from a detailed plan. Qwen3 Coder Next still does the job - fast and accurate. I'm not talking about a single-prompt coding task, for example; no idea how it would compare in something like that, but that's not how to use coding agents properly, and there are possibly other models that can do something like that better…
Ayumu_Kasuga@reddit
I haven't tested gemma4 directly, but I've run the coding livebench on it, and it scored higher than even the full-precision version of qwen3 coder next.
Potential-Leg-639@reddit
Oh wow, sounds nice! So probably in a few weeks and some llama.cpp updates later it can get really interesting!
Voxandr@reddit
What I have tested:
- Qwen 3.5 122b = better in some frameworks like Svelte, which is a bit niche.
- Qwen 3 Coder Next = same quality overall, faster, fails at Svelte 5.
- Gemma4 = a lot of tool call format errors.
Bondyevk@reddit
I’ve encountered a lot of math problems with Gemma 4. For example, counting the days between now and 35 years ago. Qwen3.5 is so much better at that.
Teshier-Asspool@reddit
Why would you ever want an LLM to compute that instead of having it code a script that gives you the answer?
Bondyevk@reddit
If building an in-memory script for this question is better, the LLM should have come up with that.
I’m building a memory system and one of the round-trip tests is asking different questions about the same information.
For example:
- My birthdate is the 6th of August 1989.
- How old am I?
- On what day of the week was I born?
- How many days have I lived?
And Gemma 4 completely screwed up the calculations.
Odysseyan@reddit
LLMs are notoriously bad at math of all kinds. It really is best to give it a tool or something to calculate it programmatically.
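For illustration, the kind of snippet the model should be reaching for here is trivial, using the birthdate from the comment above:

```python
from datetime import date

birthdate = date(1989, 8, 6)
today = date.today()

days_lived = (today - birthdate).days
# Subtract 1 if this year's birthday hasn't happened yet.
age = today.year - birthdate.year - ((today.month, today.day) < (birthdate.month, birthdate.day))

print(f"Age: {age}")
print(f"Day of week born: {birthdate.strftime('%A')}")   # 6 Aug 1989 was a Sunday
print(f"Days lived: {days_lived}")
```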
Positive-Power725@reddit
Its memory dates to "today's date (May 23, 2024)". Did you tell it today's date?
Bondyevk@reddit
Do you really think training data and fetching the current date are the same thing? 😅
petuman@reddit
Have you given it tools to execute code?
Bondyevk@reddit
Yes, just like all other models I’ve tested, it has access to tools that can run Bun.
TastyStatistician@reddit
to test its intelligence
phazei@reddit
I mean, I'd expect the LLM to code that script to give the answer, maybe that's what he was doing.
koloved@reddit
Why not ? Its basic math
Im_Still_Here12@reddit
Hmm... I just did this with Gemma4 E4B and ChatGPT for comparison. Both came up with the same answer and Gemma did it faster.
Mollan8686@reddit
Using Gemma4 with Hermes but it’s very messy
CATLLM@reddit
What do you mean?
Mollan8686@reddit
It’s more complex than I thought (my bad). I assumed it would be easier, with many more local services possible, but I’m finding that too many options require paid APIs for interconnecting different services.
garg-aayush@reddit
Is it M4 pro/M5? What kind of tok/s generation are you able to get on your setup?
pizzaisprettyneato@reddit (OP)
m5 pro. Not sure of the exact tokens per second, but it's very fast. It can do a whole thinking block in about a second or two.
BringOutYaThrowaway@reddit
I think the release notes for Ollama 0.20.1 added MLX processing for Apple Silicon. Should be quite speedy.
Beginning-Window-115@reddit
just use omlx
trusty20@reddit
Does anybody have actual side by side comparisons to share or just exuberant hype posts declaring gemma CURRENT_VERSION is the best open source model ever?
HekpoMaH@reddit
I am sorry, can you share what exactly you ran? I have no idea about Qwen, but Gemma 4 is failing miserably at agentic coding for me, and I've gone as far as q8 quants.
The dense model is a bit better, in the sense that its tool calls don't fail, but the agentic coding experience is also bad -- repetitive, doesn't get to the point, only wastes energy.
ZenaMeTepe@reddit
Even Opus does that for me from time to time.
FinancialBandicoot75@reddit
I’m curious if my m3 max 36 will power it
veramaz1@reddit
Thank you for sharing your experience, it is very useful
misha1350@reddit
For CODING it is. Meanwhile, Gemma 4 26B didn't even know that LeetCode #412 was FizzBuzz and hallucinated a problem for me, whereas Qwen 3.5 35B knew it well. Apparently Gemma 4 has weak internal knowledge and is bad outside of Google AI Studio, which they force-enabled Google searches on for a reason.
eek04@reddit
Of all the things for my model to spend its few billion parameters on, why would I want it to prioritize that little bit of trivia?
Not knowing this might just reflect better training/retention priorities.
misha1350@reddit
Because if it's bad at knowledge like this for coding, how much worse do you think it's going to be for things outside of coding?
eek04@reddit
That's not knowledge for coding. I'd actively filter out crap like that from the training set if I was going for a model that's good for coding.
Hell, I've been coding for over 40 years, and I had no idea about leetcode even existing; it's just not meaningful.
RainierPC@reddit
Agreed. Now if you gave it the requirements instead of just a name that could or could not be in its training data, I'm fairly sure it will get the code right.
SatoshiNotMe@reddit
The 26B-A4B variant has the best TG and PP speeds of all the recent open-weight models. E.g., in Claude Code via llama-server I’m able to get 40 tok/s TG, nearly double what I got with the comparable Qwen MoE (35B-A3B) on my M1 Max MacBook Pro 64GB. Full instructions and comparisons here.
However, my biggest concern is agentic/tool abilities: on tau2-bench Gemma4 does much worse than Qwen3.5 (68% vs 81%):
https://news.ycombinator.com/item?id=47616761
Designer_Reaction551@reddit
The 128k context is what changes the equation for me. Longer context means you can pass more state into the pipeline without chunking - that's genuinely useful for agent workflows. The multimodal capability is also surprisingly solid for a model this size. What hardware are you running it on?
Limp_Classroom_2645@reddit
Hype slop
daronjay@reddit
Comment slop
Difficult-Drummer407@reddit
Curious to know if you’re using ollama or llama.cpp or LMstudio to load it. I loaded it in ollama on a 64gb M1 Max studio and it took 12 minutes to answer a simple question. Still scratching my head. Any help appreciated.
Emotional-Breath-838@reddit
Your 64GB makes all the difference in the world. My 24GB Mini is struggling to hit the sweet spot of speed, context, and intelligence. You've got room to optimize all three, and the models you can run are jaw-dropping vs just six months ago.
congrats!
the_renaissance_jack@reddit
I experimented today with running 26B-a4b on my 24 GB M4 through oMLX. I'm getting surprisingly good results and speed talking to the model through Obsidian Copilot
pizzaisprettyneato@reddit (OP)
Yeah I’ve been waiting for a good time to upgrade my old Mac and the improved llm performance on the m5 convinced me
weiyong1024@reddit
26b MoE on a 64gb mac is kind of the sweet spot right now. only loads the active expert weights so you get way more usable context than you'd expect from the param count. qwen 3.5 27b is still better for pure code imo but gemma handles everything else without choking
kweglinski@reddit
It loads the whole model, so the available space is the same. The gain is in speed, not size. 27b-a4b will weigh the same as 27b.
weiyong1024@reddit
right, I was thinking of inference speed, not memory. The active params per token is what's smaller; the total weight is still the full 27b in RAM.
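Rough back-of-the-envelope numbers for the 26B-A4B case (the ~4.5 bits/weight figure is an assumed average for a Q4-style quant):

```python
# Approximate memory footprint: all 26B weights must be resident,
# regardless of how many are active per token.
total_params = 26e9
active_params = 4e9
bits_per_weight = 4.5          # rough average for a Q4_K-style quant (assumption)

resident_gb = total_params * bits_per_weight / 8 / 1e9
active_gb = active_params * bits_per_weight / 8 / 1e9

print(f"weights in RAM:         ~{resident_gb:.1f} GB")   # what you must fit (~14.6 GB)
print(f"weights read per token: ~{active_gb:.1f} GB")     # what drives speed (~2.3 GB)
```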
Small-Challenge2062@reddit
Is that 64GB of RAM or VRAM?
Waste-Intention-2806@reddit
You can turn off thinking in Qwen models.
m_tao07@reddit
Also tried Gemma 26B at UD-IQ2. It fits in my RTX A2000 at 24k context. I get around 45 tokens/second and it feels good to use for normal tasks. I even asked it about an assignment in my native language; it understood and came up with genuinely great feedback, except for grammar, where it "corrected" a word to the same word. For vision it runs on the CPU. I like how self-aware it is: I sent it a screenshot of the conversation and it recognized that it was our conversation. I asked it about my CPU, and all the other models in that range told me the wrong specs apart from this one. With the Qwen3.5 35B model I experienced the same issue others describe, where it would reason infinitely, repeating itself in an unsure way.
Parliament5@reddit
How are you running it on your Mac? I have the same 64gb configuration and I've been trying to get it to work with llama.cpp, but it's not quite working.
pizzaisprettyneato@reddit (OP)
I’m running on ollama with 8 bit variant. I also had problems with llama cpp a couple of days ago. Thought I’d give it a try in ollama and it did amazing
1asutriv@reddit
Llama.cpp has a CUDA bug with Gemma 4 if you're using the latest version. You can get around it by using CUDA 13 instead of 13.2. Works like a charm.
LeonTheTaken@reddit
Nope, 13.0 and 13.1 still have the CUDA illegal memory access bug for me. Qwen3.5 runs fine with no problem on 13.2, by the way.
LeonTheTaken@reddit
Does 13.1 not work either? I’m currently using 13.2.
FluentFreddy@reddit
I use ollama too but it doesn’t invoke tools or shell commands. What am I missing?
pizzaisprettyneato@reddit (OP)
I was using it with github copilot in vscode. Maybe it's editor/terminal related?
4xi0m4@reddit
The MoE architecture really is a game changer for local inference. Gemma 4 26B hits a nice balance between capability and resource usage, making it feel like the first truly practical daily driver for folks without workstation-grade hardware. Curious how it handles longer debugging sessions though, since that tends to stress memory in ways short prompts don't reveal.
JohnMason6504@reddit
Gemma 4 26b has been surprisingly good for tool-calling and agentic coding on my setup too. Running Q8 on 64GB and the context handling is noticeably cleaner than Qwen 3.5. Less looping, fewer hallucinated file paths. The 48k effective context window also helps when you have large codebases to reason over. Only downside is GGUF quantization support is still rough in some backends.
IsThisStillAIIs2@reddit
yeah gemma 4 26b feels like it hits a really nice balance point right now, especially for “just get it done” tasks where overthinking hurts more than it helps. i’ve seen the same thing with qwen variants where they’re technically strong but can spiral into tool loops or second guessing, especially when quantized. gemma seems more decisive, which ironically makes it more useful day to day even if it’s not topping every benchmark. honestly feels like we’re entering that phase where model “personality” matters as much as raw capability for local use.
usrnamechecksoutx@reddit
I do somewhat simple text-based work (feed LLMs my interview notes and ask them to write an interview report). Used to do this with SOTA models and since ChatGPT5 results were great. However, I needed to redact all PII which was a PITA. Bought a Macbook Air with 32GB, tried Qwen3.5, results were subpar. Two days later Gemma4 was released. 31B-IQ4_XS is incredible, results are 95% of ChatGPT and very much usable - on a Macbook Air! 3-4t/s is slow but I don't mind it in my workflow, as I do something else in the meantime and just come back once it's done after a few minutes. Will get the maxed out M5U MacStudio once it releases; I think in the next few months we'll see local models that reach SOTA levels with manageable hardware setups.
DeepOrangeSky@reddit
Did you also try Qwen3.5 27b (dense) and Gemma4 31b (dense) to see how those compare against the Qwen3.5 MoE model and the Gemma4 MoE model?
I know they are of course a lot slower than the similarly sized MoE counterparts, but people were saying they are quite a bit stronger than the MoE ones. Thus, in terms of total time spent on an overall task, they can potentially be "faster" sometimes, if they can do things in fewer attempts (or do the thing at all vs not being able to), compared to the MoE ones, even if the MoE ones run at more tokens/second. Obviously it varies depending on the specific task at hand and the type of use case (and occasionally just luck from attempt to attempt, I guess).
Anyway, curious if you tried those as well and how they compared in your opinion and for what you tried on them.
robertpro01@reddit
I guess I need to try it again, because for my tests, it was terrible at coding.
I tested the same day it was released at Q6 and 128k context
Dense_Business_6570@reddit
I know, right? I just started using gemma4 3 days ago and cannot believe how much better it performs on both reasoning and speed due to its MoE. I tried a bunch of others before, up to 30b models that would fit on my 24GB VRAM card, and the difference is night and day.
florinandrei@reddit
Which harness do you use? OpenCode? Something else?
Johnwascn@reddit
Gemma4 seems to currently have an issue with excessive memory consumption for its key-value cache; I haven't tried it yet.
However, I found Qwen3 Next Coder (q8) and Qwen3.5-122b (q4) to be very accurate in their tool usage, consistently running dozens of times without errors. I've already integrated them with Claude Code, and the results are quite good.
My experience with configuration is that the key-value cache is best configured with F16 precision; otherwise, performance will be severely impacted.
CryptoUsher@reddit
gemma's efficiency on mac metal is wild, but how does it handle longer debugging sessions? i'm still stuck on smaller models for sustained work.