Gemma 4 as a replacement for Qwen 27b
Posted by Jordanthecomeback@reddit | LocalLLaMA | View on Reddit | 32 comments
Hey all, I have a long-form-context companion/advisor running on Qwen 27b through LM Studio and openclaw. I really like Gemini for conversations, so I'm interested in Gemma 4, but I know it's taking some time to get into good shape with updates to LM Studio and whatnot.
I'm just wondering if anyone with similar use cases has given Gemma 4 a try and, if so, what they think of it as a replacement. Would appreciate any feedback; openclaw makes model swaps kind of a PITA.
JustSayin_thatuknow@reddit
It is working amazingly well with latest lcpp!
DeepOrangeSky@reddit
Well, so far I have preferred Gemma 4 31b's responses to Qwen 27b's, so I would like to switch to it in LM Studio if I could.
The problem is I still keep having this issue:
https://www.reddit.com/r/LocalLLaMA/comments/1sdqvbd/llamacpp_gemma_4_using_up_all_system_ram_on/?utm_source=reddit&utm_medium=usertext&utm_name=LocalLLaMA
As far as I am aware, everyone else using it in LM Studio also still has this issue, right? Like it isn't solved yet?
In llama.cpp you can solve it by using --cache-ram 0 --ctx-checkpoints 1 apparently.
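For anyone who does run llama.cpp directly, a minimal sketch of what that fix might look like, with the flags as described above; the model filename and port are placeholders, not a confirmed setup:

```shell
# Hedged sketch: --cache-ram 0 and --ctx-checkpoints 1 per the comment above;
# substitute your own local GGUF path.
llama-server \
  -m ./gemma-4-31b-it-Q5_K_XL.gguf \
  --cache-ram 0 \
  --ctx-checkpoints 1 \
  --port 8080
```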
But, I don't have llama.cpp/don't know how to use that. I only use LM Studio so far, so, I have no clue how to implement that fix.
So, is everyone using it in LM Studio still having this issue where memory explodes once you get past about 5-10 replies and roughly ~10k tokens of interaction length, to the point where it uses up all your memory?
Is LM Studio ever going to fix the issue, or is Gemma 4 going to remain basically unusable for anything other than really short interactions on LM Studio, forever?
It seems crazy to me that they wouldn't fix it, since isn't it like the most popular model in the world at this point, and LM Studio presumably the most popular way to use it?
Presumably it would be a very quick and easy fix for them, and it's the biggest outstanding issue with Gemma 4 on LM Studio right now, right?
:(
Paradigmind@reddit
You could give Kobold.cpp a try. It's built on llama.cpp but you don't have to mess with terminals.
AppealSame4367@reddit
Run it with llama-server directly instead. It's been fixed for days now, with other speed and template improvements landing almost every day.
I run E4B at 20-40 tps on a laptop RTX 2060 with 6GB VRAM, without vision, so adapt this to your needs. Also, this is the latest build of llama.cpp with updated chat templates and the latest Unsloth GGUF from today:
llama-server \
-hf unsloth/gemma-4-E4B-it-GGUF:IQ4_XS \
--no-mmproj \
-c 64000 \
-b 2048 \
-ub 512 \
-fit on \
--port 8129 \
--host 0.0.0.0 \
--no-mmap \
-t 6 \
--temp 1.0 \
--top-p 0.95 \
--top-k 65 \
--min-p 0.0 \
--jinja
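Once a server like the one above is running, llama-server exposes an OpenAI-compatible API, so any client can talk to it. A hedged usage sketch (port matching the command above; the prompt is just an example):

```shell
# Query the local llama-server via its OpenAI-compatible chat endpoint.
curl http://localhost:8129/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 1.0
  }'
```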
overand@reddit
You could read the documentation on llama.cpp. It's not super simple, but I promise it's easier than you think. (It may also be harder than you think, but that's the paradox of learning.)
leonbollerup@reddit
Well... one big annoyance is that you have to constantly change workflows because things don't work as expected. LM Studio has lots of features that are half-baked or nonexistent in llama-server, and not everyone wants to sit in a console as soon as they have to do anything. I personally don't mind, but I have heard the argument and completely understand it.
Ell2509@reddit
Yeah, or ask claude/chatgpt to help you set it up.
SingleProgress8224@reddit
I tried a couple of days ago and it seems to be fixed. Make sure to update the runtime libraries (Settings -> System -> Runtime)
DeepOrangeSky@reddit
Alright, nice. Do I need to re-download Gemma, too? Which exact model and quant are you using (in case it somehow matters)?
And were you doing fairly long interactions, or just short ones? It seems fine at first but starts ballooning like crazy once it gets past a certain tipping point, so if your interactions were short enough, maybe you wouldn't notice the issue.
SingleProgress8224@reddit
I use Unsloth Gemma 4 31B Q5_K_XL. It worked before the quant updates, but it can't hurt to update it; they refreshed their quants 2 days ago.
I use it for code reviews, so it's mostly autonomous reasoning and tool calls until it gives the final summary. So not extremely long, but it often went above 100K tokens in the context. I use a 32 GB GPU and can fit a max of 190K context (mmproj disabled since I don't need vision capabilities).
DeepOrangeSky@reddit
Dang... I checked that the runtime and version stuff was all up to date, experimented with it a bunch, and it is still messed up in the same exact way as before.
Just to be clear, btw, the RAM explosion doesn't really happen if you just get a single reply from it (even with an extremely big token count). Rather, it happens if you go back and forth with it over multiple replies in an interaction with a medium or large token count. The memory usage jumps up by ~7GB or so with each and every reply.
If you eject the model and reload it (with identical settings, context, and everything), memory usage comes way back down, albeit just for that one reply; then it starts exploding again over multiple replies, so you basically have to keep ejecting and reloading the model after every single reply.
And I know it isn't something unrelated to Gemma causing it, since no other models do this. Just Gemma. Like, when I use Qwen 27b Q8, I can set it to the max 261k context, use it in a super huge interaction with as many replies as I want, and it stays at ~55GB of memory usage; it doesn't start skyrocketing with each new reply or anything.
Well, I am going to try the Unsloth quant that you are using, to see if that somehow fixes it, although, I assume it won't. :\
Frosty_Chest8025@reddit
no problems with vLLM.
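For reference, a hedged sketch of what serving via vLLM's OpenAI-compatible server looks like; the model ID and context length here are placeholders, not a confirmed repo name or setting:

```shell
# Serve a model with vLLM; swap in the actual Hugging Face repo ID you use.
vllm serve google/gemma-4-31b-it --max-model-len 32768 --port 8000
```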
Jordanthecomeback@reddit (OP)
ah yeah, that's a dealbreaker for me. I figured by now everything would be up and running, but if not I'll steer clear. I did see LM Studio pushed an update in the last 24 hours or so that fixes the Gemma 4 Jinja template or something like that, but no clue if that's capable of fixing this issue
qwen_next_gguf_when@reddit
Coding ? No.
Jordanthecomeback@reddit (OP)
nah, mine's more like an advisor. I don't bother with coding outside of fixing whatever openclaw breaks in an update, so for me instruct + conversational ability are key
qwen_next_gguf_when@reddit
Then it's fine to replace.
Sadman782@reddit
Give 1-2 examples where it struggles vs Qwen, and I will give you 100 where Qwen loses badly. Even IQ4_XS Gemma 4 26B beats Qwen 27B in Qwen Chat. For one-shotting Gemma is ahead, and for solving real-world problems Gemma is way ahead; it knows the correct libraries to use. Even in C#, Qwen produces old 2020 garbage code that can't compile after 10+ iterations; Gemma did it in 2.
Jordanthecomeback@reddit (OP)
just out of general curiosity, do most of you guys using local models for coding work in tech and use it for stuff like that, or are some of you hobbyists who use it to code games and other things? I understand you're one person, but I always feel out of the loop when I ask questions on here and the default framing is coding-related
qwen_next_gguf_when@reddit
No. I use local models to generate AI slop to mess with crawlers.
Sadman782@reddit
Gemma 26B MoE is better at coding; I can give 100+ examples if you want. After the tool-calling fix yesterday, it is now better at agentic coding as well.
IONaut@reddit
And tool calling? No.
shansoft@reddit
I find the latest Gemma 4 chat template with higher temp settings starts to perform better than Qwen 27b. It often produces cleaner code that compiles in one shot, compared to Qwen 27b, which goes back and forth fixing syntax errors and other issues. 27b also sometimes falls into loops where I had to force a cutoff, which doesn't happen that often with Gemma 4 31b.
shing3232@reddit
Gemma 4 is a bigger model, so probably not a good idea.
supermazdoor@reddit
Your question should be the other way around. Honestly, 27B is leaps ahead in speed, tool usage, etc.
Prestigious-Use5483@reddit
31B replaced 27B for me because of the overthinking on 3.5
BuffMcBigHuge@reddit
Tried it with my existing Hermes setup - couldn't really perform my common operations, switched back. I presume it's because all my skills have been iterated by Qwen 27b itself, so there is a "relearning" process of auditing and understanding how to perform the skills in the Gemma 4 way.
GrungeWerX@reddit
I use mine as a lore master, so we have a similar use case. Long context is essential. I'm literally about to post about my observations, so I'll link when done, you might find it helpful.
Jordanthecomeback@reddit (OP)
Wonderful, please do
WishfulAgenda@reddit
I just switched. I'm using the 24B MoE at Q8 with a 100k context and found it better than Qwen 3.5 at Q6 with a 40k context.
Just seems to get things right more often and the tool calling now seems to be better as well.
A massive change from when it first came out and was crappy.
Specter_Origin@reddit
I was surprised to find Q8_M to be pretty much the same as Bartowski's Q4_L in a coding benchmark, so I switched to that one with the full 256k context. Ran a bunch of MMLU sets too and still got the same result.
Jordanthecomeback@reddit (OP)
ok neat. I have 64 gigs of RAM and recently swapped from the Qwen 3.5 35B MoE to the 27B dense, so I probably won't go back to MoE; I found there was way too much repetition for my use case. Talking to the same agent day in and day out eventually feels like being able to predict a text conversation. But I'm glad it's working for you, and it working better bodes well for their dense model
WishfulAgenda@reddit
Yeah, I'd love to run the higher-quant dense models, but I use them for coding and conversationally querying databases, and without the long contexts it just doesn't work for me.
The higher quant and bigger context seem to give it the edge right now, and it seems way better at calling ClickHouse via MCP.
Planning a hardware upgrade in a couple of months and will then likely switch to a dense model with a much higher quant.
64GB RAM and dual 5070 Ti here. I also run Gemma 4 4B and a tiny Liquid model for faster conversation on a Mac mini as well :-)