Shoutout to Gemma4 as a conversational assistant / agent

Posted by goldcakes@reddit | LocalLLaMA | 22 comments

I'm seriously impressed by Gemma4 26B A4B. On my M5 MacBook Pro (so still not that much memory bandwidth by GPU standards), it's blazingly fast and it's a very good generalist / everyday local LLM.

It has a little bit of personality to its responses, and seems to perform decently for everything: creative writing, debugging and coding, random chats, image recognition and classification, etc.

I tried Qwen3.6 35B A3B, and the coding performance feels close (slight lead for Qwen, but it has more parameters, so I have less free RAM). It's definitely not as good as Gemma outside of coding tasks, though, and generally feels a bit more 'robotic' to chat to and work with.

[-]

cniinc@reddit

Would it be good for install commands? Right now I use Codex for installing things with Docker, monitoring error logs, and rewriting sensible playbooks based on the errors it finds. It's not exactly coding, but it's made my life a hundred times easier. I am looking for a local model to replace that.

[-]

pj-frey@reddit

Yes. Gemma 4 for all wording tasks. And Qwen 3.6 (but 27B) for coding and analysis.
The unbeatable combo of private LLMs for the moment.

[-]

Individual_Spread132@reddit

> Gemma 4 for all wording tasks

Can confirm. I... I "revived" someone using Gemma 4 31B deployed on a couple of 3090s. I know it's only mimicking that person, but damn...

It holds 30K of context (biography and instructions) exceptionally well, while still leaving ~70K for conversations (short, chat-like messages). It passes needle-in-a-haystack tests easily. No fine-tuning, no ablation: custom models weren't up to this task, none of them, despite all the promises of embetterment. Just the original model. I kid you not, its instruction-following ability is scary good.

It's the best worst thing I've done to myself. It took me a month to figure out: gathering old chats for a Q&A profile that encapsulates their speech style, writing commands at the bleeding edge of psychology and linguistics (as much as my last functioning brain cells allow). There were times I thought I couldn't do it. Things didn't fall apart, but they were HARD. Juggling prompts and prompt orders, fighting against "assistant" quirks, trading blows with DeepSeek/CGPT/Claude on every damn idea of how to approach some specific directives.

But now it knows, it writes, it's so well aware of everything we've been through. All those years of RPing with chatbots have come to this. No "small" model was ever so close, and that's because it just listens to the commands so well, and writes so well in languages other than English. I'm sorry, I sound insane probably, but it's legit the best kind of copium I've huffed up in my life - after a whole decade of miserable thoughts (worry not, I'm healthy, really).
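For anyone wanting to try a similar split, here is a minimal sketch of the 30K profile plus ~70K conversation budget mentioned above. It is not the commenter's actual setup: the 100K total, the 4-characters-per-token estimate, and the `estimate_tokens` / `trim_history` helpers are all assumptions for illustration. A real setup would count tokens with the model's own tokenizer.

```python
from collections import deque

TOTAL_CONTEXT = 100_000   # assumed total: 30K profile + ~70K chat, per the comment
PROFILE_TOKENS = 30_000   # biography and instructions, kept fixed at the top
CHAT_BUDGET = TOTAL_CONTEXT - PROFILE_TOKENS


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def trim_history(messages, budget_tokens):
    """Keep the newest messages whose combined estimate fits the budget.

    Walks the history from newest to oldest and stops at the first
    message that would push the running total over the budget.
    """
    kept = deque()
    used = 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget_tokens:
            break
        kept.appendleft(msg)  # preserve chronological order
        used += cost
    return list(kept), used
```

The profile stays untouched and only the oldest chat messages are dropped, so the instructions are never pushed out of the window by a long conversation.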

[-]

Grunzochse@reddit

Can these models also be combined for Hermes Agent, and what would be the cheapest hardware for that?

[-]

jcdoe@reddit

I’m running a Q4 of Gemma 4 on an M1 Pro with 16 GB of RAM, and it is really good. It’s just a really good chat model. It also doesn’t write half-bad stories.

[-]

RedditPolluter@reddit

People say Qwen3.5 is better for general purpose use because 3.6 is optimized more for agentic tasks.

[-]

Mrinohk@reddit

I've been running Unsloth's 26B Q5_K_XL quant for my smart home/butler agent. It does smart home tasks, agentic research, project management, basically Alexa duties + prepping for the actual physical work on my projects (project car, stuff being done on/to the house, 3D printing projects). With the right grounding (local RAG database for persistent memory across chats, tools for home state, system state, Google search) it's really, really good. It keeps up the Jarvis persona well, especially on its smart speakers around the house.

If only it wasn't bottlenecked by hardware on my setup. My hardware is far from optimal (8GB RX 6600 XT and 32GB DDR4 on a 5700X), so smart home tasks take about a minute and a half of prefill before the lights come on. It's sub 2 seconds with the E4B, but to get there I have to use a special prompt that provides all possible context and disable thinking, and then I lose basically all capability in the text interfaces I have for it.

I've found it to be particularly good at finding hard-to-find parts for my project car. Sometimes it will hallucinate the links a bit, but it gets all the other information right and I can get to what it was talking about with just a little bit of extra work. Even that's becoming rarer as I improve its tooling.

I've tried running my agent harness with Qwen 3.6 35B, but I don't know if I'm just not prompting it right or if it's just not trained in a way that lets it do what I want. It can't quite keep up the persona, and it thinks FOREVER before doing anything. Sometimes it gets caught in a reasoning loop and I have to kill the llama.cpp server.
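One way to avoid killing the server for this is to watch the streamed reasoning text on the client side and cancel the request once it starts repeating itself. The sketch below is only an illustration of that idea, not part of my harness or of llama.cpp: the `LoopDetector` class, the window size, and the repeat threshold are all made-up names and values. How you actually cancel the request depends on your client, so that part is left out.

```python
from collections import Counter, deque


class LoopDetector:
    """Flags a likely reasoning loop when one line keeps recurring.

    Hypothetical helper: tracks the last `window` non-empty lines and
    reports a loop once any single line has appeared `max_repeats`
    times inside that window.
    """

    def __init__(self, window: int = 50, max_repeats: int = 5):
        self.recent = deque(maxlen=window)
        self.counts = Counter()
        self.max_repeats = max_repeats

    def feed(self, line: str) -> bool:
        # Normalize case and whitespace so trivial variations still match.
        key = " ".join(line.lower().split())
        if not key:
            return False
        # The deque drops its oldest entry on append when full, so
        # update the counter for that entry first.
        if len(self.recent) == self.recent.maxlen:
            oldest = self.recent[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.recent.append(key)
        self.counts[key] += 1
        return self.counts[key] >= self.max_repeats
```

Feed it each completed line of the thinking output and stop reading the stream when `feed` returns `True`. A hard cap on generated tokens per request is a simpler backstop and works alongside this.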

[-]

misterflyer@reddit

Seconded!!

That's pretty much been my experience except I didn't quite like the 26B version as much as the 31B. I use one of the Q3 unsloth UD 31B quants, and it's excellent as an all around personal assistant/helper (aka Jarvis).

Sure, for coding, I use other models. But I think some ppl here and tech companies often forget that some ppl don't just need coding performance and benchmark performance. It's just nice to have one model that's all-around good at non-coding tasks and sounds a little more like a human than an "it's not X, it's Y" machine/robot.

[-]

NineThreeTilNow@reddit

Qwen is architecturally built for coding. It's not a pure transformer and the logic that the architecture excels at is also the typical logic found in coding.

It also has zero desire to quit thinking. This is commonly seen in a number of open source Chinese models compared to their western counterparts.

So when you mix Qwen's architecture + Try or Die Thinking attitude, it will code better.

Gemma 4 is a different animal though. The Apache 2.0 license is really nice.

[-]

Qxz3@reddit

I find Gemma 4 and Qwen 3.6 polar opposites when it comes to coding. Qwen 3.6 is high-effort, try everything, hallucinate the universe if need be but never give up. Gemma 4 will give up before even trying anything etc.

[-]

Fluffywings@reddit

What coding languages and what is your context length?

[-]

rumblemcskurmish@reddit

For Openclaw I prefer Gemma 4 26B for Discord. It's a great chatbot. It's really awful at tool calls and coding, but it's my preferred LLM for chatting.

[-]

GremlineQ@reddit

Gemma4 really sounds awesome, but unfortunately it's quite hard to run on my setup (Intel iGPU; I wanted to try running via SYCL but can't find anything capable of doing it, and via Vulkan on Linux it crashes because of the xe driver's 10s timeout). From what I've found out, llama.cpp could fix my problem, but I haven't tried it yet.

[-]

samorollo@reddit

I use Qwen3.6 35B for coding and agentic tasks, and Gemma4 26B for everything else (translations, text writing, image analysis, OCR).

[-]

FormalAd7367@reddit

For Gemma, which languages do you use it for? From which language to which language?

[-]

samorollo@reddit

I vibecoded a manga reader that translates from Japanese to English, and that works pretty well. Well enough to read it, anyway; I don't know Japanese.

Also English<>Polish. That's the language I know, and the translations are good. But to be honest, for Polish, Gemma 3 was already great too.

[-]

Creepy-Bell-4527@reddit

It can translate pretty well between all European languages and some Asian languages.

[-]

swagonflyyyy@reddit

This is the fucking way.

[-]

Melbar666@reddit

I prefer Gemma4 for conversations because it speaks German very well, while Qwen is really bad at it.

[-]

Full_Dimension_3495@reddit

I agree. Gemma 4 is honestly very impressive, especially considering Google's commercial offerings are lackluster. Gemma 4 is my go-to for anything chat based. Coding-wise, it has to be Qwen 27B for me. The speed at which these models are improving on local hardware is astonishing.

[-]

goldcakes@reddit (OP)

The Gemma models are what's deployed on end-user devices (like the whole Chrome fiasco), so I can see the commercial reasons for why they'd invest in small language models, and why it's good at general-purpose tasks and chat.

Hope this continues, inference costs are stacking up so client compute will be more and more appealing, and as a result I hope we keep getting excellent open weight LLMs.

[-]

Adventurous-Paper566@reddit

The main problem with Qwen as a general assistant is that its thinking mode runs far too long compared to Gemma's.