If you haven't yet given Gemma 4 a go...do it today

[-]

Significant_Pay_9834@reddit

I run gemma-4-26B-A4B-it on my work m4 pro through zed with turboquant llama.cpp fork and it is incredible. Can search the web, it can use MCP, can run scripts locally, can help with code and dig through codebases, and can handle a pretty large context (128k). It is amazing and lightning quick, and pretty damn intelligent, it hallucinates on things occasionally but it proves we are getting really close to local llms being viable over cloud ones.

[-]

riceinmybelly@reddit

Could you point to the model card? Also, I got a M2 Max 96GB as well as a M4 Pro 24GB so larger models would ne great too, I just can’t find the ones that do tool calls well

[-]

Significant_Pay_9834@reddit

https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF I use 4 bit medium quantization. This is the fork I run. https://github.com/TheTom/llama-cpp-turboquant I run the server locally, then I create a provider in zed with localhost:8080 (or whatever port you use to run it)

[-]

Steus_au@reddit

i found tavily works better

[-]

No-Anchovies@reddit (OP)

I also plugged in searxng, works great. To speed up searches, auto-generating the title for the chat etc etc you can also plug in a small 1b instruct model that jumps in instead of dragging out the "big model" for that mundane stuff. Makes a massive difference in waiting times

[-]

capnspacehook@reddit

I'm very new to all of this, how did you give models access to searxng and route a small model to do things like name chats?

[-]

year2039nuclearwar@reddit

unRAID makes this very easy

[-]

Borkato@reddit

I personally do it with python and bash! If you don’t know python/bash that may be hard, but honestly it’s super worth it to learn. You can even have Claude write up a basic intro

[-]

capnspacehook@reddit

Ah gotcha, I'm familiar with Python and Bash but was wondering if everyone is writing custom scaffolding for this kinda stuff or if there's a standard tool to route/give access to searxng or something. I'm a fairly experienced programmer but am just dipping my toes into the local llm water.

I know routers exist to select the best model for a task, was wondering if that's what was used to select a smaller model for extremely simple tasks. I would also assume an MCP server was used to grant access to searxng, but maybe ppl are doing something else?

[-]

Borkato@reddit

Ah I see! There’s a few options too for things like coding with opencode, aider, and a few others. Not sure if they do web search but they work great for editing files, but I dislike them because although they work with local, they also push the online stuff religiously. There may be some more ideas on this thread a made for coding https://www.reddit.com/r/LocalLLaMA/s/gIuzxZ2YN8

[-]

Significant_Pay_9834@reddit

With zed, models can do web requests, so you just tell it to do web requests / searches against localhost and the port of the searxng container

[-]

riceinmybelly@reddit

Oh they must have updated it since I tried, I’ll have a look, thanks!

[-]

Borkato@reddit

Is it better than qwen 35B A3B? Gemma is hallucinating and looping like crazy even after all the fixes.

[-]

SmartCustard9944@reddit

Try updating the templates

[-]

Borkato@reddit

Already did ☹️

[-]

MerePotato@reddit

Writing this comment as a reminder to give some advice to you when I wake up in the morning (its 3 am rn), ran into another guy having your issue and helped him the other day but I'm too tired to remember what specific steps resolved it

[-]

Borkato@reddit

No worries!! I tried a lot of stuff so maybe it’s just my crazy tool calling setup haha.

[-]

nemomode7@reddit

!remindme 2 days

[-]

RemindMeBot@reddit

I will be messaging you in 2 days on 2026-04-14 08:08:17 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

alexx_kidd@reddit

How much go of memory do you have?

[-]

the__storm@reddit

What arguments are you using with llama.cpp? I can't get it to work with Zed (tool calls) no matter what I try.

[-]

GrehgyHils@reddit

Would you be willing to show an example script?

[-]

No-Anchovies@reddit (OP)

Yeah this was my reaction too. Finally local llm is useable on local gear in the way we use cloud

[-]

mystery_biscotti@reddit

I'm having some trouble with Gemma 4 on my little ROCm setup. Haven't had time to troubleshoot.

[-]

Striking_Wishbone861@reddit

I have been too sometimes it works great and I really love it just not as stable as 3 was in my rig.

[-]

mystery_biscotti@reddit

Are you using the official model? Are you also on a less supported GPU?

I'm still trying to figure out if it's "me" or something weird happening with the way the model and LM Studio are interacting. The old RX 6600 has more support friction than I'm happy with.

[-]

Striking_Wishbone861@reddit

Give it access to the web it can find current topics still unless I am misunderstanding you. Also i went with docker over lm studio due to AMD limitations. I have now tried multiple iterations of Gemma 4 and have settled on the official. No need to keep the 2 other versions anymore

[-]

mystery_biscotti@reddit

I wanted to see how it performed without tools as part of the evaluation workflow. But unfortunately I became busy and well, evaluation and troubleshooting went by the wayside for a bit.

[-]

Emotional-Card-2177@reddit

Qwen3.6-35B is out now. Has anyone compared it with Gemma 4 yet?

[-]

No-Anchovies@reddit (OP)

I only downloaded it yesterday - will test it with this https://github.com/witness-taco/ollama-benchmark-ui

[-]

killerstreak976@reddit

any takeaway on your end? I dont have the hardware to do a fair benchmark but am still curious.

[-]

Tomr750@reddit

what would you use with m4 max 128gb? tried gemma-4-26b-a4b-it-4bit (by mlx community) using mlx_vlm -> about 105 tps. mlx-community/gemma-4-31b-it-4bit gives me 25 tps.

25 seems to low to use with an agent especially when you want to run them in parallel

need to try it with a Claude Code fork

[-]

tmvr@reddit

You have plenty of memory, don't use the 4bit MLX version, use at least 6bit.

[-]

Tomr750@reddit

is there that much performance difference?

[-]

Wild24@reddit

Please suggest me model for my rtx 3060 12 gb & 64 gb ram.

[-]

SomeOrdinaryKangaroo@reddit

u/Wild24 Gemma 4 has actually changed my life, i'm not kidding. This is the first model i can run on my poor hardware that is actually GREAT!!

This thing has become my personal assistant, it helps me schedule my day, reply to emails, search the web, manage my life, it has earned and saved me a lot of money and helped me inprove my relationships. I have more time now thanks to this thing that i can spend on things that actually matter.

Please bro, you need to try Gemma 4, if it changed my life then it can change yours too, i truly believe so

[-]

peanutbuttergoodness@reddit

Can you tell us about your software and hardware you run it with?

[-]

Outside-Positive-578@reddit

gemma4:e4b

[-]

Bubbly_Breakfast6805@reddit

I have the same setup here, 3060 and 64gb RAM. I can run Gemma 4 26b a4b, the speed is around 25 tokens per second. Not great but not really bad

[-]

ectomorphicThor@reddit

I’m getting 21 per second on my 12gb 3080… I have tried optimizing like crazy and nothing works

[-]

Bubbly_Breakfast6805@reddit

What about your CPU and memory? 12gb is not enough to fully load that model so it needs to partly use cpu and ram

[-]

ectomorphicThor@reddit

5800x and 32gb of DDR4

[-]

Bubbly_Breakfast6805@reddit

Yea I think it's the reason. I'm running on i9-14900k with 64gb DDR5. So it's a bit faster even with 3060

[-]

ixdx@reddit

Qwen3.5-35B-A3B with partial offloading to RAM

[-]

-Ellary-@reddit

Gemma 4 26b a4b with partial offloading.

[-]

Adventurous-Paper566@reddit

Qwen3.5 9B Q4_K_XL or Gemma 4 E4B Q6_K_L

[-]

Anonymous_Cyber@reddit

I built a Tauri frontend for it 😅 have nvidia 1650 I was shocked I could get it to work

[-]

LeekInternational857@reddit

excellent on iPhone all considered.

https://youtu.be/08xgpNj1XSA

Installation and usage video link here.

[-]

Waarheid@reddit

No comments on speed since obviously an A4B will be faster than a dense 27B, but with Gemma 4 26A4B I finally have something I can run at great speed locally (32gb M1 Max) that works as a proper agent. It's been able to write scripts (~200 line Python, haven't used it for working on big projects because that's what cloud models are for), accurately chain tool use together to answer complex questions and perform actions based on its research, etc. Super happy with it. I could probably get similar results with the MoE Qwen 3.5 but I only ever tried out the 9B and Qwen3 30A3B, and never really set up a local agent with those. Just so pleased with Gemma 4 though. I am using the pi.dev agent with some custom skills and system prompt etc.

[-]

Borkato@reddit

I feel the same way but about qwen 3.5 35B!

I don’t know why, but Gemma STILL isn’t working well for me. It loops like crazy with tool calls even after I tried different jinja templates, updating llama cpp, adding bos token, etc. Qwen is stable as hell, but Gemma…

[-]

ItilityMSP@reddit

try bartowski gemma 26B q4, it's really quite amazing,

[-]

amaugofast@reddit

Same here, tried couple times Gemma on m4 max 48go but always end up in endless tool calls looping even with the latest lmstudio patch…

Well, sticking to qwen3.5 27b, never betrayed me

[-]

Interestedguy85@reddit

The endless loop is a known issue, I'm assuming you're using ollama. They've come very far in supporting gemma4 but llama.cpp and ollama are still working on making it stable. I say give them a couple more weeks and give it another try.

[-]

amaugofast@reddit

As I said, I am using lmstudio

[-]

the__storm@reddit

Yeah ime Gemma (26B) cannot use the edit tools in Zed or Copilot to save its life. It has a really hard time even attempting to call them - it'll write diffs as regular responses over and over and over again, and then when (if) it does actually call the tool it often sends bad arguments. I feel like it thinks just writing the diff in the response is how it calls the tool? Very strange.

Might be an issue in llama.cpp? I'm on latest; both Vulkan and ROCm, and both the new Google chat template and the llama.cpp workaround template have the same problem. Qwen works fine on the same setup.
Maybe I'll try vllm I guess.

[-]

Waarheid@reddit

Others have mentioned CUDA issues so I can only speak to my experience on M1 Max, but I really like the pi.dev agent for local models since the system prompt is so extremely small.

[-]

Borkato@reddit

What I’m not understanding is why the hell my home grown one that is also extremely small and just a few prompts isn’t working well. I’m going to minify it. If this doesn’t work I’m giving up lol

[-]

Waarheid@reddit

It's probably an issue with the configuration at some point. What's your whole set up? Hardware, runtime, exact GGUF you're using, etc.

[-]

Borkato@reddit

Will get back to you once I try a more minimal version! :) I have hope this one will work

[-]

Pepper_pusher23@reddit

Yeah it's totally broken on pi.dev. Edit never works. Which makes it dead to me. It does brilliantly until that point.

[-]

Waarheid@reddit

I would watch out for updates to ggufs and llama.cpp then as it sounds like some kind of configuration problem. The model itself on my set up performed fine with edits in pi.dev.

[-]

SmartCustard9944@reddit

Prompt issue. With Claude Code harness, it struggles sometimes to do edits and even gets frustrated. Once it got so frustrated that it decided to rewrite the full file from scratch rather than editing it. It went well, but was definitely fascinating to see. Other times it fails until it learns how to actually do edits.

[-]

Borkato@reddit

Does Qwen 3.5 have the same issues? I typically use the 35B and it has almost 0 issues. I’m wondering if Gemma just truly isn’t as smart

[-]

Th3Sim0n@reddit

Gemma does not like CUDA 13.2 according to unsloth, maybe thats why? Try downgrading to cuda 13.1.

[-]

MoffKalast@reddit

Meanwhile I literally only get this out of the 26B on Intel lmfao:

de de de de de de

Strangely the E2 and E4 seem to work fine.

[-]

Borkato@reddit

I just encountered this error and it was a problem with it trying to autocomplete for the user. It cannot do completions of a user turn unless you’re using the base model instead of the it

[-]

Borkato@reddit

I’m on 12.0, so I’m gonna upgrade to 13.1 and see if that helps!

[-]

obey_kush@reddit

Please report back, would be interesting.

[-]

Borkato@reddit

Seemed to not help :(

[-]

Thrustball@reddit

I am running it on CUDA 13.1 and still have massive looping problems when using tool calls

[-]

annodomini@reddit

Yeah, I've just been testing out a range of different models.

Gemma 4 31B dense works great, but is kinda slow and the llama's context checkpoints caused me to hit OOM a bunch on a system with 128 GiB of unified RAM. I can reduce the number of checkpoints, but then I worry about having even more issue with this: https://www.reddit.com/r/LocalLLaMA/comments/1sjsejm/pi_qwen35_with_llamacpp_doing_a_lot_of_prompt/

I then tried the 26B A4B MoE, and it's quick, but not as smart, I get a lot of tool call errors, and it hits looping thinking issues after a while.

I'm on a Strix Halo (Ryzen AI MAX+ 395) so certainly not a CUDA issue.

I really wish Google would release that 124B (I think it was) model that they hinted about in their original announcement.

[-]

Certain-Cod-1404@reddit

Try the newest updated ggufs ? Did you update llama cpp as well?

[-]

relmny@reddit

If you use gguf, have you updated them? (both Bartowski and Unsloth updated them yesterday. Also update the chat template, that it was updated as well.

That said, I still prefer Qwen3.5 (unless for language tasks).

[-]

chickN00dle@reddit

I tried updated unsloth and the interleaved template but it still gets stuck in loops, constantly double checking the same few things

[-]

habachilles@reddit

Same! I can’t wait for 3.6. 3.5 35 is the perfect model Minus over thinking.

[-]

Borkato@reddit

Is there any word on a release date for qwen 3.6? I know they had that poll, I hope they release a qwen 3.6 35B-A3B

[-]

habachilles@reddit

Me too! And not certain yet

[-]

gpalmorejr@reddit

Yeah. Qwen3.5-35B-A3B is definitely the winner for me.

[-]

Waarheid@reddit

What quants and harnesses did/are you using? I think Gemma being 9B smaller is a selling point, but it would still be interesting to try 35b

[-]

gpalmorejr@reddit

Qwen3.5-35B-A3B-Q4_K_M LM Studio (Settings in picture)

20tok/s Ryzen 7 5700 32GB DDR4 3600MT/s RAM GTX 1060 6GB

[-]

ectomorphicThor@reddit

What app is this?

[-]

gpalmorejr@reddit

LM Studio. It's like most of the command line tools except pretty. Runa on the same backend/runtime.

[-]

KubeKidOnTheBlock@reddit

What customizations do you have the system prompt?

[-]

Waarheid@reddit

I took out all of the lines about the pi harness itself, because I'm not having the agent work on or extend itself. It's like the default system prompt's goal was to make a pi.dev developer, rather than an general purpose agent. Probably works fine with huge/cloud models but just adds too much fluff for a small local model.

In https://github.com/badlogic/pi-mono/blob/main/packages/coding-agent/src/core/system-prompt.ts, I took out lines 137-143, and reworded the first line of the prompt.

[-]

WillyTheWoo@reddit

Are you using it via Ollama? I spent most of yesterday figuring in out how to serve and ended up using omlx via Cline. I have a M1Max with 64GB and for some reason it was much faster via omlx than ollama. I’m open to changing to something else that would be faster though (tried out vllm and mlx-vlm and those did not work out).

[-]

Waarheid@reddit

omlx is faster I think, but ollama just sucks and you should ditch it regardless. llama.cpp is super easy to use. I just did brew install llama.cpp and ran the llama-server command from unsloth:

llama-server \
    --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
    --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64 \
    --alias "unsloth/gemma-4-26B-A4B-it-GGUF" \
    --port 8001 \
    --chat-template-kwargs '{"enable_thinking":true}'

[-]

Thistlemanizzle@reddit

Are you running Q4?

[-]

Waarheid@reddit

unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

[-]

Mutaclone@reddit

What are you using to run it (ollama/llama.cpp/lmstudio/kobold/etc?)

[-]

Waarheid@reddit

I haven't tried the dense model, it might be too big/slow? But I just use a brew installed llama.cpp, and run llama.cpp's llama-server.

[-]

fatboy93@reddit

Can you sharee your setup with the m1? I have the m1 pro 32gb and I'm wondering between using llamacpp or mlx

[-]

Waarheid@reddit

My setup is very simple and can probably be improved, but I just did brew install llama.cpp and ran the llama-server command from unsloth:

llama-server \
    --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
    --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64 \
    --alias "unsloth/gemma-4-26B-A4B-it-GGUF" \
    --port 8001 \
    --chat-template-kwargs '{"enable_thinking":true}'

[-]

jemdiggity@reddit

check out oMLX

[-]

Billysm23@reddit

Idk why people are so in love to compare qwen3.5 27b vs gemma4 26b a4b speed. Cmon, that's stupid and uneven. Honestly i haven't seen a proper comparison between gemma4 26b a4b vs qwen3.5 35b a3b, I would like to see it. I'm sceptical if gemma can win

[-]

Waarheid@reddit

That would be the right comparison to do; I think people (like myself) are just excited to have a properly working small MoE.

[-]

GrehgyHils@reddit

Would you be willing to share an example script you've been finding useful with Gemma 4?

[-]

No-Anchovies@reddit (OP)

Real world Adversarial GDPR information requests based on Har audit analysis with the constraints to use generic sounding lingo for copy paste feel while strictly stuffing/referring to the same violations already found in the audit, creating a no escape scenario basically. In the few examples I fed it, not only it nailed the putput but when presented with a dismissive reply from a DPO, it immediately assumed, without instruction, a very real "F you and your non-compliance" Northern European lawyer tone when drafting the reply for right of access. I almost shed a tear reading it. It's been handling all the scumbag big tech/corp DPOs like a charm

[-]

IrisColt@reddit

By the way, heretically ablated versions of Gemma 4, such as llmfan46s, are worth paying attention to. I actually felt a brief, unexpected emotional reaction to something those models said; it was the first time that has ever happened. When a model can genuinely affect you like that, you know it is doing something right.

[-]

IrisColt@reddit

Northern European lawyer tone when drafting the reply for right of access. I almost shed a tear reading it.

Gemma 4 is that good. I am not even surprised.

[-]

GrehgyHils@reddit

Ah gotcha that does sound interesting. I was mostly curious about the actual example source code. Ie what libraries you were using to implement this and language (assuming python) but I get why that isn't possible ha. Thank you

[-]

Weird_Llama_317@reddit

I like qwens 3.5 quality. But it thinks too long. Gemma 4 tps are just slightly faster, but it used much less time to think. Using unsloth models for both. So I switched to gemma 4 for now. Quality was good enough to build a simple mcp from scratch.

[-]

ai_without_borders@reddit

yeah the thinking overhead is the main thing pushing me toward gemma lately too. running both on a 5090 and the raw tps is similar but qwen will spend 15-20 seconds deliberating on stuff that doesnt need it. gemma just answers.

the tradeoff ive noticed is qwen still handles larger context way better. anything involving multi-file refactors or long codebases, gemma tends to lose the thread past a few hundred lines. so ive been switching between them depending on the task

[-]

anthonyg45157@reddit

You might find this graphic interesting. I had Claude come up with 6 tests based on my personal memories and real world things I do. I tested Gemma and qwen thinking and non thinking just to see how they would answer each. Here are the token results

Qwen 3.5 27b thinks sooooo much more than Gemma 4 , interestingly enough turning off thinking doesn't make it that much worse overall.

[-]

Waarheid@reddit

I agree, I've found Gemma 4s thinking time to not be long at all when it doesn't need to be. That was my issue with 3.5 but that could've been an issue with my settings too

[-]

raumgleiter@reddit

same here. the speed is noticeably faster than other models. first thing I noticed. sometimes feels instant after typing a prompt.

[-]

FightOnForUsc@reddit

I have the same machine and tried both the 26b and 31b models. The 26b was just so much faster I decided to stay with it. It’s faster than I read. And if it’s anything it can’t handle it’s fine to use a cloud model. It makes me super excited for what gemma5 will bring because at that point it may be able to do just about the same gemini 3.1 pro outside of world knowledge.

[-]

tobi418@reddit

Tried gemma 4 35b, it was very dumb to code

[-]

lundrog@reddit

How so?

[-]

Danmoreng@reddit

Do yourself a favour and use llama.cpp directly instead of Ollama. It has a great WebUI and while installing isn’t as simple the better speed and latest features you get are absolutely worth it. Best performance if you build it from source for your hardware, but I believe there are bundled downloads as well. If you’re on windows you can build from source with my PowerShell scripts: https://github.com/Danmoreng/llama.cpp-installer

[-]

mr_tolkien@reddit

Ollama has MLX support for Mac OS though

[-]

AdUnlucky9870@reddit

This is an incredibly well-structured benchmark post — the GGUF version trap section alone is going to save people hours of debugging.

One thing I'd add from my experience building local LLM inference pipelines: the content-dependent speedup pattern you documented (code/math ~50%, creative ~10%) maps almost perfectly to what we see with KV cache hit rates in production. Structured outputs have higher token-level predictability, which is essentially what speculative decoding exploits.

The --parallel 1 gotcha is especially important. I've seen multiple people in Discord servers complaining about spec decoding being "broken" when it was just the KV cache allocation multiplying. Worth bolding that even more.

Question: have you tested with different quantization levels for the main model while keeping the draft at Q4? I'm curious whether Q6 or Q8 on the 31B changes the acceptance rate meaningfully, since the draft model's predictions would be compared against a "sharper" probability distribution.

[-]

No-Anchovies@reddit (OP)

Im letting it run as is for now. After the anthropic breach + gemma/turboquant developments I guarantee we will see a ton of REALLY cool shit very soon. Not worth putting in too much effort when the release cycles are this short

[-]

AdUnlucky9870@reddit

Fair point honestly. The release cadence right now is insane — by the time you properly tune something, the next version drops and changes the game. We've basically stopped fine-tuning anything and just keep our eval harness ready to benchmark whatever comes next.

[-]

eddytw@reddit

What's your suggestion for first time local guys. Qwen or Gemma. I have a great big PC that was underutilized, threadripper in there etc. And I've played withbl Gemma 4 after the hype. It's still super slow compared to codex and Claude, obviously, but what's your suggestion?

[-]

Foreign_Yard_8483@reddit

I asked him if it was possible to make a nuke the size of a matchstick (I know it's not). He refused.

I also asked him to complete the sequence: MI BI LI KI (from my head)... he reasoned about music and didn't come up with any sequence as a hypothesis.

We have to accept that we aim for different things: some want optimal commercial models, others want speed, others want novelty per se, others are still bit-by-biting in search of something that resembles automata (in analogy to the first AI discipline).

[-]

BringOutYaThrowaway@reddit

What are the settings that Google recommends, as you mentioned?

[-]

No-Anchovies@reddit (OP)

Its in the HF repo from google, you gotta look it up

[-]

BringOutYaThrowaway@reddit

OK, so everyone, I'll save you a click, and copy the parameters from Huggingface that Google recommends:

Use the following standardized sampling configuration across all use cases:

temperature=1.0
top_p=0.95
top_k=64

[-]

DeepOrangeSky@reddit

I can't get it to work properly on LM Studio as of so far. Its memory usage climbs indefinitely after every single reply, no matter how much ram your computer has, until it uses up all the memory. It adds like 5GB more memory use per each reply, over and over, indefinitely, until it uses all your memory, even if you have 100+ GB of memory.

It is fixable in llama.cpp apparently by using --cache-ram 0 --ctx-checkpoints 1

But, no clue how to make it stop happening on LM Studio. I use the most recently updated runtimes and version, and most recent quants and so on, but that doesn't help. It still just keeps doing it.

Not sure if it will ever get fixed or not, since 9 days have gone by and it still keeps happening on LM Studio. Is it just not fixable on LM Studio or something?

It really sucks, because I really like the model, and it would definitely be my main, go-to model if it wasn't for this issue.

[-]

totallynotmyfakename@reddit

Gemma4 is actually split between CPU & GPU based on my exp, I'm running 26B q8 and it takes about 35GB of VRAM and the speed was very fast. 31B is a little bit of a struggle, like it's usable but definitely won't be enough for coding tasks. I have a strix halo with 128GB RAM so it does help, but again you don't need that much to begin with

26B q8 as as been my daily driver for a week now, it just amazing

[-]

AnOnlineHandle@reddit

I've noticed that with the 26B when trying massive contexts, I think as it was caching the attention cache for each token and wasn't clearing when changing to a different interaction. I got around it by just ejecting and reloading the model which was reasonably fast from my nvme, but yeah seems to be a bug.

[-]

DeepOrangeSky@reddit

Lol, that's actually what I ended up doing, was ejecting the model and just reloading it over and over again once every few replies. On the 31b model + long token counts it is not a realistic solution though as it takes forever and just seems ridiculous, so, at some point it would be nice if there was some actual solution. Not sure if there is something LM Studio would be able to do on their end to fix it, or how it would get fixed, since I don't know much about computers, but yea, really hoping it somehow gets fixed at some point.

[-]

AnOnlineHandle@reddit

Yeah I couldn't even get the 31b to manage medium contexts and just assumed it was due to the dense weights on a 24gb GPU. Could that normally be managed?

In my use case of story writing, the 26b seemed a lot better, though it seems to have degraded after recent updates and I'm not quite sure what to fix, but will likely try a refreshed model at some point.

[-]

Bulky-Priority6824@reddit

--no mmap ?

[-]

Embarrassed-Option-7@reddit

Try starting an LMStudio server and using it through AnythingLLM instead, maybe that’ll fix it

[-]

Adventurous-Paper566@reddit

Yes it's a great model, I wanted to use it as a daily driver but finally 31B took this place...

[-]

totallynotmyfakename@reddit

your hardware must be great, I have a strix halo and 31B is usable but quite slow, definitely can't use for any coding tasks, just "talk"

[-]

Cequejedisestvrai@reddit

Is it noticeable better than the 27b?

[-]

Adventurous-Paper566@reddit

Yes.

[-]

Adventurous-Paper566@reddit

Yes, as a general assistant, because Gemma is much better in french than Qwen.

For coding task and tools I think Qwen still better.

[-]

infinitelylarge@reddit

…what deepseek brought to the table years ago…

DeepSeek R1 came out last year

[-]

No-Anchovies@reddit (OP)

Ha, feels like it had been longer. Every week something new

[-]

infinitelylarge@reddit

Yeah, things are moving so fast now

[-]

Aggressive-Permit317@reddit

Finally fired up Gemma 4 today after seeing this and damn it’s snappy even on modest hardware. The confidence it has on code tasks without going off the rails is a breath of fresh air. Way better than I expected for local. What’s your current daily driver local model right now?

[-]

No-Anchovies@reddit (OP)

Using it as daily driver plus generic base for rag and documentation/prompt tuning for very specific tasks. Before I was using an 9B abliterated qwen opus mix from Crownelius - this one was was a true hidden gem

[-]

Aggressive-Permit317@reddit

Nice, using it as your daily driver plus RAG setup makes total sense. That 9B abliterated Qwen mix you came from sounds solid too. I’ve heard good things about Crownelius tweaks. Gemma 4 just feels way more confident out of the box on code without me having to babysit it. You running it quantized or full precision right now? Curious how it handles longer context for you.

[-]

CriticalCup6207@reddit

Been running it for structured extraction from financial documents. On a 3090 it handles batches of 50 transcripts in about 8 minutes at Q5_K_M. The quality of extraction is genuinely good for a 26B model, it picks up on nuance in language that surprised me.

The catch is consistency. Run the same extraction prompt on the same document twice and you can get meaningfully different results. For creative tasks that probably doesn't matter.

For anything quantitative where you're building downstream analysis on the outputs, you need to account for it or you'll get burned.

[-]

No-Anchovies@reddit (OP)

I wouldn't trust such a small model in that use case, otherwise you risk ending up with a turd polish 'omatic

[-]

hyperschlauer@reddit

The good thing is that Google can't nerf it

[-]

No-Anchovies@reddit (OP)

Give it two more weeks and our local gemma will look like Einstein next to the "soon to be" lobotomised 3.1 Pro

[-]

keradius@reddit

I tried on plane a couple days ago, and can't get anything useful form it. Used with ollama Claude and codex harness too and the thing can't even write to a file properly.

[-]

ElKorTorro@reddit

When I'm in LM Studio and search for "Gemma 4", I see a long list of Gemma-4 models that seem to be different versions/modifications of it? What's the difference in all these permutations? Why are some models like 900MB and others 15GB?

[-]

year2039nuclearwar@reddit

I think CRACK is uncensored, the it means it's been trained to do chatting and following directions better. Some models will have lower quants which might be heavily lobotomized and therefore quality heavily reduced, I usually always go with Q4 while I wait for my GPUs to arrive

[-]

Borkato@reddit

Unfortunately unless they write what they’re doing in the readme, it can be impossible to know. I think part of it is that some of the notebooks they use to train with include an automatic “push to huggingface” function and so they may be uploading it and not even realizing it lmao.

I was also wondering wth that crack one was!

[-]

ElKorTorro@reddit

Yeah it's kinda sus. So basically just stick with the official Google DeepMind version right?

[-]

Borkato@reddit

I was wondering the same actually! Personally I do just try to stick with the official ones, or the ones made by well known people like bartowski, unsloth, mradermacher, the drummer, etc :)

[-]

andres_garrido@reddit

I had a similar impression, but what stood out to me wasn’t just speed. It feels more “usable” because it makes fewer weird jumps in reasoning compared to some other local models. Less backtracking, more direct answers.

Curious if that holds up on longer sessions or larger contexts though.

[-]

dr0yd@reddit

I tried Gemma 4 26b and the responses look really good. Tool calling seemed to have a lot of problems though when I try to integrate it with an agent (I tried continue and vscode)

[-]

Borkato@reddit

Yes! Exactly my issue :(

[-]

Zealousideal-Yard328@reddit

I ran enterprise benchmarks on Gemma 4 E4B across 8 suites — function calling, RAG grounding, classification, code gen, summarization, multilingual, and more. The 4B model scored 83.6% overall, beating the 3x larger Gemma 3 12B (82.3%). Multi-step tool chains failed across every model in the family regardless of size. Full data and methodology: https://aiexplr.com/post/gemma-4-e4b-enterprise-benchmark

[-]

viperbe@reddit

I went back to qwen 3.5 after testing Gemma 4 with Hermes and openclaw and Claude code running on llama.cpp

What I noticed is small things like summaries need qwen. Example if you ask the weather qwen would give me a summary and temps like a real forecast where Gemma would just throw a website for a browser at me and move on.

[-]

nickm_27@reddit

that just sounds like a basic prompting problem. Gemma has no problem giving great weather summaries with the following portion of my prompt:

Precipitation values represent chance, not intensity. Above 34 degrees is rain, at or below is snow.

Order of information (as a connected natural response):
1. Current temperature — today only.
2. Conditions and precipitation — describe how the day unfolds. Note transitions and when precipitation starts or ends. No temperatures in this step. Skip precipitation if none or unlikely.
3. High and low temperatures.

For multi-day forecasts: do not list every day. Summarize the general trend, range of highs and lows, and name any outlier days. Two to three sentences max.

[-]

viperbe@reddit

Probably so but that’s a lot to type on a phone when really all I need to ask is what is the weather forecast for zipcode with does the same thing in qwen with out all that

[-]

nickm_27@reddit

You don't type it every time. You define it in the system prompt.That's the whole point of a system prompt, it knows how you want it to act.

[-]

viperbe@reddit

Ah I guess I’m using it wrong. Should I add more to system prompt? How many different things do you have In Your system prompt

[-]

nickm_27@reddit

depends exactly what you use it for. I have multiple "system prompts" as there are multiple systems that use the LLM. For example my smart home agent has a system prompt but my basic chat (llama.cpp WebUI) has its own system prompt.

What gets defined in a system prompt depends on what it is used for, but what I would say a basic template that fits what I have found to work well is:

basic explanation of tasks and personality.
General desired behavior when calling tools (only answer the question asked, don't include extra filler info, don't call tools more than once unless the first call doesn't answer the question, etc.)
any specific tools, how they should be used, how to answer different types of questions, etc.

[-]

Techngro@reddit

Gemma: "Google is your friend."

[-]

shbong@reddit

Yeah I saw the benchmarks and also that on some benchmarks it's better than sonnet 4.. and it runs on your local machine..

[-]

NigaTroubles@reddit

So if im using qwen 35 and 27 what do recommend for gemma then ?

[-]

Opening-Broccoli9190@reddit

For my 5090 it's as fast as my 35B Qwen. Curiously enough, I find Gemma4's answers to be rather short and laconic. Not sure I like it.

[-]

Dazzling_Equipment_9@reddit

Last time I saw someone say gemma used a temperature of 1.5. The effect will be better.

[-]

swagonflyyyy@reddit

I love how good it is as a general all-rounder model.

It can chat realistically, it can perform image tasks beyond OCR and captioning, it has hybrid reasoning capabilities, and its really good at instruction-following on top of that.

Is it better than qwen3.5? Probably not but it does beat it in chat, roleplay and writing so there's that. I just like this model more even if it might not be as good.

[-]

jackass95@reddit

Which version do you guys recommend having 128GB VRAM to use?

[-]

whysee0@reddit

Noticed this myself. Gemma 4 is consistently faster and seems better than Qwen 3.5 on the same/similar sized models on Home Assistant

[-]

lundrog@reddit

What hardware you running it on

[-]

No-Anchovies@reddit (OP)

Minisforum UM890 Pro, Ryzen 9 32GB DDR5 + RTX 5060 Ti WINDFORCE 16G GDDR7

[-]

starkruzr@reddit

this is heartening. mine isn't far away from this.

[-]

No-Anchovies@reddit (OP)

This is currently the best plug&play bang for buck out there. About 1500$ all in, with the oculink dock & Corsair RM750e included

[-]

mbrodie@reddit

I dunno I just picked up 2 brand new 7900xtx for $2000 48gb ram and it’s quick, currently running week 3,5 35b Q8 on 320k context in parallel streams 2 x 160k concurrent sessions at 70tps

I’d definitely say there is some good cheap inference options

I’m testing out the 80b and 122b today to see what they feel like but I think I’ll probably settle on the 80b with a smaller quant, not sure though have to wait and see how the tests look

[-]

markole@reddit

What is your tg on dual 7900xtx?

[-]

-LaughingMan-0D@reddit

What kinda speeds are you getting? And how far are you pushing context?

[-]

No-Anchovies@reddit (OP)

With that model I mention, it generates text faster than I can read it. With the google settings applied it slows down to "reading speed" but better output

[-]

caetydid@reddit

I really am impressed by its thinking abilities. Unfortunately MoE is kind of legasthenic if I talk to it in German. I guess this might be the quants not being optimized yet. The dense 31B is rock solid, but slow.

[-]

dannydeetran@reddit

Are you comparing 26B-a4b's speed to qwen's 27B?

[-]

bnm777@reddit

In running Gemma 31b on a 3080

[-]

No-Anchovies@reddit (OP)

Not comparing, praising the model for its reliability as a MoE

[-]

MerePotato@reddit

It still exhibits some of that trademark MoE answer variance/instability but I agree its definitely sharper and more consistent than a lot of small-medium MoE efforts

[-]

prescorn@reddit

Gemma is the one

[-]

KittyPigeon@reddit

If you are able to run Qwen3.5 35b (moe), did you get a chance to test gemma4 31b (dense) and see how does that fare for your coding problems?

[-]

Borkato@reddit

It’s just SO slow!

[-]

No-Anchovies@reddit (OP)

Have them both here but havent yet done any tests. Very curious since Qwen has been quite surprising so far. Any way, it's exciting to see how the anthropic breach open up the floodgates (for now) in availability to all the good stuff. Should be a great year

[-]

Upset_Page_494@reddit

Even if you have already given it a go, give it another with a new download/version.

[-]

9kSs@reddit

How practical is it to run on a 32GB M4? Would other stuff, for eg IDE and browsing slow down to a crawl? What about a 24gb m4 pro with swapping?

[-]

thatgreekgod@reddit

am on a 32gb M4 macbook air, it's been a delight. i've been fuckin around with this lil guy:

https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF

and it's been chugging along very nicely

[-]

droning-on@reddit

I have yet to get hardware to run locally. Here trying to keep tabs on what everyone is having success with. But Gemma 4 on open-router is inferior to minimax 2.7 or Kimi lk 2.5.

It's actually frustrating.

I hope when I get hardware the local models are as good as these cloud models. Going to give Gemma another try. But for tool calling and understanding of big picture tasks it was failing me.

[-]

RonJonBoviAkaRonJovi@reddit

you understand that minimax and kimi are gigantic models compared to gemma 4 locally right?

[-]

droning-on@reddit

From what I was reading the gemma4 31b it model isn't that much smaller than minimax.

Maybe I'm wrong.

I'm using Gemma in open-router

[-]

_-_David@reddit

As it turns out, Gemma4 31b is 31b parameters while MiniMax is 229b parameters.

[-]

No-Anchovies@reddit (OP)

But that's some big boy stuff already. They managed to pack Gemma quite nicely for consumer hardware, the biggest surprise here is being able to run something this quick locally that as a bonus also doesnt suck big time lol

[-]

RonJonBoviAkaRonJovi@reddit

I agree the model is really good for a small local model but it won't compete with minimax and kimi, it's not a realistic expectation. A lot of people just jump on the hype and expect a miracle

[-]

ptear@reddit

Can confirm runs on my mid hardware, best local model I've tried so far.

[-]

No-Anchovies@reddit (OP)

Which model?

[-]

Guilty-Astronaut-696@reddit

Doesn’t work properly in Ollama or LMStudio so I run it via MLXCore https://ddalcu.github.io/mlx-serve/

[-]

fabyao@reddit

What is your use case with Gemma? For coding, i find it really poor.

[-]

No-Anchovies@reddit (OP)

For generic queries & reasoning. I still use qwen code to review my half assed python lol

[-]

habachilles@reddit

Can someone link the model they are using on mlx. I am just finding garbage

[-]

Horror-Veterinarian4@reddit

I saw a YouTube video where someone was using an old hp workstation with the gemma 4 moe model and they were getting north of 20 tokens a sec inference with only the cpu which is mindblowing to me. I am looking to replace claide with all the recent pain its caused for a local model and witb 32gb vram I assume it would be blazingly fast.

[-]

dsartori@reddit

Blazing fast prompt processing on strix halo hardware makes it compelling. A real step down from the 122B Qwen3.5, roughly on par with the 35B, maybe a few more goofs and errors from the Google model, but the speed makes the tradeoff worth considering. I’ve been using 122B to write the docs and plans and 35B to execute them, but Gemma4 is replacing Qwen3.5-35B in this workflow.

[-]

mga02@reddit

I tried the Gemma models and they are quite inferior compared to Qwen 27b. The gap is huge with vision related tasks.

[-]

hawxxer@reddit

I tried on RTX PRO 6000 in ollama the 27b one. Ollama said 100% GPU but cpu was at 100% all the time. No issue with other models… do i need to do a specific config or was there a fix?

[-]

pimpedoutjedi@reddit

I got base 27b running. I'm pleased. Would like some better tok/s (25-30 on a Titan XP) but pretty pleased.

[-]

pepedombo@reddit

I gave it a try for a third time lately and it feels not bad. Suits me far more than qwen. I worked with these models via claude code but due to laggy/hidden behavior of CC I switched to opencode to try these models in more chat-like cooperation with my structured laravel project.

Gemma tends to reach my point easier, though still may fail at some tools, sometimes it's just removing whole routes from the file to put new ones :) For a short time testing gemma feels better than qwen, at least for me.

On the other side: gemma 4 30BQ8 -> a bit slow and completely jumps over plan/build modes as they don't exist. For other cases I'm trying to work with 26BQ8 and qwen 35BQ8. Gemma's faster, 10-15tps more I think.

Anyway - in my environment the most reliable model was glm4.7-Q6, at least at tool usage and general problem solving, but I've started to look for something stronger at keeping code better. Still no good findings :) I thought I would be happy with Qwen next 80b or coder next 80b but nope, gemma feels better. I'll give them a try because I haven't tested them that intense.

[-]

No_Information9314@reddit

I’ve also switched from qwen 27b to Gemma4 24b as a daily driver on dual 3060s. I think there is a slight drop in code quality but I’m not doing anything heavy with it, and its versatility makes it useful for more use cases. Still have to battle test it but going from 20tps to 80tps is hard to resist, and the quality is pretty close imo. Code wise I’m mostly using it for bash scripting, some flask and python. I imagine doing more complex stuff would reveal a large gap.

[-]

No-Anchovies@reddit (OP)

Yeah same but if I want to do more complex tasks you gotta go with cloud anyway

[-]

No_Information9314@reddit

Exactly

[-]

putrasherni@reddit

good model but worse than qwen 3.5