Gemma4 26b & E4B are crazy good, and replaced Qwen for me!
Posted by maxwell321@reddit | LocalLLaMA | View on Reddit | 101 comments
My pre-gemma 4 setup was as follows:
Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (My third 3090 died, RIP) and 128gb of system memory
Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where needed:
Qwen 3.5 30b A3B Q8XL - For general chat, basic document tasks, web search, anything huge context that didn't require reasoning. It's also hardcoded to use this model when my latest query contains "quick"
Qwen 3.5 27b Q8XL - used as a "higher precision" model to sit in for A3B, especially when reasoning was needed. All simple math and summarization tasks were handled by this. It's also hardcoded to use this model when my latest query contains "think"
Qwen 3 Next Coder 80B A3B Q6_K - For code generation (seemed to have better outputs, but 122b was better at debugging existing code)
Qwen 3.5 122b UD Q4KXL (no reasoning) - Anything that requires more real world knowledge out of the box
Qwen 3.5 122b Q6 (reasoning) - Reserved for the most complex queries that require reasoning skills and more general knowledge than Qwen 3.5 27b. It's also hardcoded to use this model when my latest query contains "ultrathink"
This system was really solid, but the weak point was the semantic routing layer. Qwen 3.5 4B would sometimes just straight up pick the wrong model for the job, and it was getting annoying. Even simple greetings like "Hello" and "Who are you?" would get assigned to the reasoning models, usually the 122b non-reasoning. It would also sometimes completely ignore my "ultrathink" or "quick" override keywords, no matter the prompting on the semantic router (each model had several paragraphs on what use cases to assign it to, highlighting its strengths and weaknesses, etc.), so I ended up having to hardcode the keywords in the router script.
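For illustration, a hardcoded-keyword layer like the one described boils down to something like this (the model names and the classify_fn fallback are placeholders, not OP's actual router script):

```python
# Sketch of a router with hardcoded keyword overrides; model names and the
# classify_fn fallback are placeholders, not the actual script from the post.
KEYWORD_OVERRIDES = {
    "ultrathink": "qwen-122b-reasoning",
    "think": "qwen-27b",
    "quick": "qwen-30b-a3b",
}

def route(query: str, classify_fn) -> str:
    """Pick a model: hardcoded keywords win, otherwise ask the router LLM."""
    lowered = query.lower()
    # Check longer keywords first so "ultrathink" isn't caught by "think".
    for keyword in sorted(KEYWORD_OVERRIDES, key=len, reverse=True):
        if keyword in lowered:
            return KEYWORD_OVERRIDES[keyword]
    # classify_fn would call the small semantic-router model (e.g. the 4B).
    return classify_fn(query)
```

Checking the longer keyword first matters, since "ultrathink" contains "think" and would otherwise trigger the wrong override.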
The second weak point was that the 27b model sometimes burned a very large number of thinking tokens; even on simpler math problems (basic PEMDAS) it would overthink, even with optimal sampling parameters. The 122b model was much better about thinking time but had slower generation output. For Claude Code Router, the 122b models would also sometimes fail tool calls where the lighter Qwen models did better (maybe Unsloth quantization issues?)
Anyway, this setup completely replaced ChatGPT for me, and most Claude Code use cases, which was surprising. I dealt with the semantic router issues just by manually changing models with the keywords when the router didn't get it right.
But when Gemma 4 came out, soooo many issues were solved.
First and foremost, I replaced the Qwen 3.5 4B semantic router with Gemma 4 E4B. This instantly fixed my semantic routing issue and I have had zero complaints since. So far it has perfectly routed each request to the model I would have chosen and prompted it to pick (something Qwen 3.5 4B commonly failed at). I even disabled thinking and it still works like a charm and is lightning fast at picking a model. For this task specifically, the quality matches Qwen 3.5 9B with reasoning on, which I couldn't afford to spend that much memory and time on just for routing.
Secondly, I replaced both Qwen 3.5 30B A3B and Qwen 3.5 27B with Gemma 4 26b. For the tasks that would normally be routed to either of those models, it absolutely exceeds my expectations. Basic tasks, image tasks, mathematics, and very light scripting tasks are significantly better. It sometimes even beats out the Qwen3 Next Coder and 122b models for very specific coding tasks, like frontend HTML design and modifications. Large context has also been rocking.
The best part about Gemma 4 26b is that it's super efficient with its thinking tokens. I have yet to have an issue with infinite or super lengthy / repetitive output generation. It seems very confident in its answers and rarely starts over outside of a couple double-checks. Sometimes on super simple tasks it doesn't even think at all!
So now my setup is the following:
Gemma 4 E4B for semantic routing
Gemma 4 26b (reasoning off) - For general chat, extremely basic tasks, simple followup questions with existing data/outputs, etc.
Gemma 4 26b (reasoning on) - Anything that remotely requires reasoning, simple math and summarization tasks. It's also hardcoded to use this model when my latest query contains "think". Also primarily for extremely simple HTML/JavaScript UI stuff and/or python scripts
Qwen 3 Next Coder 80B A3B Q6_K - For all other code generation
Qwen 3.5 122b UD Q4KXL (no reasoning) - Anything that requires more real world knowledge out of the box
Qwen 3.5 122b Q6 (reasoning) - Reserved for the most complex queries that require reasoning skills and more general knowledge than Gemma 4. It's also hardcoded to use this model when my latest query contains "ultrathink"
I'm super happy with the results. Historically Gemma models never really impressed me but this one really did well in my book!
Potential-Leg-639@reddit
No chance for agentic coding, issues with tool calls on my side (latest llama.cpp), but no issues with Qwen3.5-27B and Qwen3 Coder Next
Express_Nebula_6128@reddit
Posted 19h ago, I wonder how this ages when you try qwen 3.6? 🤔
Rich_Artist_8327@reddit
why not gemma-4-31b for any task?
entsnack@reddit
The MoE is faster for almost the same perf
ShadyShroomz@reddit
neither the gemma4 moes nor the smaller qwen3.5 moes are as good as the 27b or 31b dense. the larger qwen models (full 397b or 122b) seem to get close, but the 27b is honestly better than the 122b-a10b in many cases due to having almost 3x the active parameters.
entsnack@reddit
hmm interesting
sonicnerd14@reddit
31b is much smarter than the 26b variant. It's slower, but the intelligence might be worth it in some cases where complexity is high. It's been known to match, and in some rare instances slightly exceed, models much bigger than it. It's worth some messing with for sure.
Rich_Artist_8327@reddit
It's faster of course, but the performance is not the same; a 10% quality difference is huge in some tasks.
tavirabon@reddit
Gemma 26B I didn't test nearly as much as Gemma 31B, Qwen 27B or Qwen 35B, because what I did see was an apparent downgrade from either dense model. Really, I didn't feel like smaller local models were worth using until Gemma 31B; it genuinely feels like a larger model in adherence, recall, etc.
I get how it could make sense in a multi-model setup like this, but then again I would just use Gemma 31B over every model here and only load up Qwen 122B when Gemma is failing.
anzzax@reddit
with that many models, do you (re)load them all the time, how do you split vram/compute?
maxwell321@reddit (OP)
I recently found another thread that covered a pretty slept-on feature that was rarely covered anywhere else. I basically keep all of the models loaded at the same time, but I use the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable, so my system (Ubuntu Server) automatically loads models to VRAM or offloads them to regular memory as needed, and it does this as it's processing the request, so time to first token is always really low.
I have llama-swap set up so that all of the models are part of the same 'group' and are allowed to stay loaded at the same time. It's pretty neat, I didn't expect it to work very well but it is pretty darn fast at swapping between system memory and VRAM. The best part about it is that it slowly shifts from RAM to VRAM as it's working, instead of the traditional model loading and unloading that requires it to be fully loaded first. Worst case scenario the beginning of a response has a slow token/s but quickly ramps up to speed as the memory manager shifts it to VRAM.
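For reference, the unified-memory trick is just an environment variable on each llama-server instance; in a llama-swap config it might look like this (model path, port macro, and exact field support are assumptions; check the llama-swap docs):

```yaml
# Hypothetical llama-swap entry: the env var lets CUDA page model tensors
# between VRAM and system RAM on demand instead of failing on allocation.
models:
  "gemma-4-26b":
    cmd: llama-server -m /models/gemma-4-26b-Q8_0.gguf --port ${PORT} -ngl 99
    env:
      - "GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
```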
Exact layout:
CUDA 0-1 are the 3090s, CUDA 2 is P40.
Gemma 4 E4B uses CUDA2
Gemma 4 26B uses CUDA0,CUDA1
Qwen3 Next Coder and Qwen 3.5 122B use CUDA0,CUDA1,CUDA2.
I have it set up this way so Gemma 4 E4B can be fully loaded at the same time as 26B, however when it comes time to load the larger models it will shift both of them to system ram to make room for the bigger models in VRAM.
^^ This is also why I use Gemma 4 E4B for semantic routing instead of 26B: when one of the bigger models is loaded and Gemma 4 E4B is completely in system RAM, its much smaller size means less RAM/VRAM swapping of the other currently loaded models to make room for the E4B tensors. Also, the bigger models seem to use less of the P40's VRAM than the 3090s', so it's even faster since I have E4B set to load only on that card.
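A sketch of that layout as a llama-swap config (paths, names, and the exact group semantics are assumptions; llama.cpp's --device flag selects which GPUs each instance sees):

```yaml
# Hypothetical config mirroring the layout described above.
models:
  "gemma-e4b":
    cmd: llama-server -m /models/gemma-4-e4b.gguf --port ${PORT} --device CUDA2
  "gemma-26b":
    cmd: llama-server -m /models/gemma-4-26b.gguf --port ${PORT} --device CUDA0,CUDA1
  "qwen-122b":
    cmd: llama-server -m /models/qwen-3.5-122b.gguf --port ${PORT} --device CUDA0,CUDA1,CUDA2
groups:
  "resident":
    swap: false   # members are allowed to stay loaded at the same time
    members: ["gemma-e4b", "gemma-26b", "qwen-122b"]
```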
human-rights-4-all@reddit
When you use the same model twice, how do you prevent it from using unnecessary ram/swap?
rm-rf-rm@reddit
but how exactly are you doing the semantic routing? your own python/shell script?
No-Statement-0001@reddit
would you mind sharing your llama-swap config yaml? I tried the unified memory env var before and it didn't really do what you described.
Also check out the new groups v2 (swap matrix) in llama-swap. It allows for way more complex group logic scenarios than groups v1.
rm-rf-rm@reddit
second the request for the config!
OsmanthusBloom@reddit
Hey that's pretty neat, I could really use that on my setup! Please give some pointers if you have them!
epicfilemcnulty@reddit
They mention in the beginning of the post that they use llama-swap, so I guess that's how. I wonder if llama-swap is somehow better than just using llama-server with models.ini, though...
_hephaestus@reddit
I feel like a large chunk of that is just that documentation on how to use models.ini was not as easy to find.
feckdespez@reddit
I went through this recently. When I set up the models.ini, I had to look at some examples in the PR for it because docs had not caught up yet.
epicfilemcnulty@reddit
Yep, the docs for this feature are sparse. I actually cheated when I was configuring it for my local setup, as I'm a lazy ass -- just described to Qwen what I want and gave it links to the docs and the PR :)
sonicnerd14@reddit
From my experience, this is almost always the best route to take, instead of wasting a lot of time only for things to not work as they should. Agents are just faster and more precise with these things.
andy2na@reddit
llama-swap benefits (I'm not affiliated, just found it to be an amazing tool on top of llama.cpp):
PaceZealousideal6091@reddit
Umm... only ur second point is unique to llama swap.
andy2na@reddit
What's the command or config to do different variables per model in llama.cpp without force-reloading the model? I will try it out myself
From my research,
llama.cpp's router mode (via --models-preset or .ini config) is designed strictly for process management. Each model definition block you would create (e.g., [Qwen:thinking] vs [Qwen:instruct]) is treated by the router as a completely distinct worker. If you configure two aliases pointing to the exact same .gguf file but with different default parameters, the router will not share the loaded weights. Instead, it will either spawn two llama-server worker processes, consuming 2x the memory (if you have the space), or, if limited to --models-max 1, it will violently unload Qwen:thinking from your GPU to load Qwen:instruct, causing the exact reload delay you want to avoid.
And point #3, sure, you can look through logs manually and get your prompt/generation speeds, but that is just way too much work compared to how llama-swap lays it out nicely
sleepy_roger@reddit
You can also have vLLM models mixed with llama.cpp models which is nice no need to just have it point to llama server for example.
andy2na@reddit
didnt know that, thanks for the tip
No-Statement-0001@reddit
did you know if you clicked on the big "llama-swap" title you can rename the UI and what shows up in the tab title? :)
andy2na@reddit
thanks for the tip! Very cool
epicfilemcnulty@reddit
Hey, thanks for the info! Yep, seems pretty handy, especially for local-only setups. Although for my particular flow I've realized that LiteLLM might be a better fit.
JamesEvoAI@reddit
llama-swap gives you more niceties than just llama-server, like a UI for inference, model loading, and some basic generation stats.
I personally use it as a router on my Strix Halo machine to run multiple models, each of which are their own llama-server instance running inside a podman toolbox. That way I get the best of both worlds.
GrungeWerX@reddit
You'll be back. ;)
I also was initially impressed with Gemma 4, but I've been getting a lot of subpar outputs lately and it's made me appreciate Qwen even more.
Still planning on using Gemma 4, of course; it's great to have as a second opinion.
relmny@reddit
Me too. I was also impressed at first (well, after all the redownloads...) but kept coming back to qwen3.5, and I barely use it now (except for translations and medical questions).
bonobomaster@reddit
Nothing you have locally should be used for medical questions.
In fact, no LLM ever should be used to gain factual knowledge of any kind.
Those fuckers CONSTANTLY hallucinate really, really badly, even something big like Opus 4.6.
jopereira@reddit
You'll be punished for that. Just wait.
bonobomaster@reddit
PiaRedDragon@reddit
I find it depends on the quant you use; for me the unsloth ones were dog shit, but I changed to RAM ones and they were fine. Also the original non-quant versions are fantastic.
GrungeWerX@reddit
"RAM" ones?
I'm on a 3090 TI, so I can't run the non-quant ones. Which qwen versions were best for you? You're saying the Q5 UD K XL or higher was bad?
PiaRedDragon@reddit
The RAM ones are MLX, so you would have to convert them, I am not sure how hard that is TBH as I use them on MacOS.
I did a direct comparison to the unsloth ones, 500 questions MMLU-Pro, exact same model, cept the RAM ones were SMALLER than the unsloth ones.
Logic would dictate the smaller model would be worse, but the unsloth ones lost every single comparison, with as much as a 30% difference in their ability to answer correctly.
Complete dog shit. If you can't get RAM working try for one of the other builds.
CBW1255@reddit
What do you run these with? LMStudio or mlx_lm.generate or something else?
PiaRedDragon@reddit
Claude Code.
The reason is I can just say "Hey claude, run a 200 question x-benchmark against these models, make sure the settings are the same and the questions/seed are the same"
It will then run the test completely unbiased.
I am sure you could do the same with codex or any of the other cli coding tools.
CBW1255@reddit
Well, Claude Code is the client for it.
But what do you use to run the models?
Just to further explain what I mean:
If I run OpenCode or Codex locally, as you do with Claude Code, I would still have to run my models via llama.cpp (for GGUF models) or LMStudio or mlx_lm.server for MLX.
So what are you running as backend? What do you have as http-endpoint for ClaudeCode?
PiaRedDragon@reddit
Claude Code
:-)
I just get Claude to load it as a Python process; it calls MLX-LM natively.
e.g. code:
from mlx_lm import load, generate
model, tokenizer = load("model-name")
Everything else I tried just adds a bunch of overhead, and then you have to fiddle with their settings to make sure the tests are accurate. It makes it tough to stay consistent.
By calling mlx_lm directly you can ensure every setting is exactly the same for all models.
CBW1255@reddit
Thanks.
mlx_lm.generate then. Got it.
GrungeWerX@reddit
What about Qwen?
PiaRedDragon@reddit
They were also dog shit. I did not do the full evaluation with them as it ties up my Mac for hours, but I did a quick 100-question test against the RAM version (tbf the RAM version was slightly bigger) and the unsloth one got cooked.
So for me it was 0-2 for unsloth which was good enough for me to move away from them.
BTW, I get that people say Benchmarks are not always the best measure to test model intelligence, cause the vendors game the system, I get that, but when you are testing two of the exact same model, just quantized differently, benchmarks are the best way to see which one has been lobotomized.
Because they are both starting from the same place. If you could answer a question before you were quantized and you can't afterwards, that is a good indicator of how well the quantization has worked.
Any future testing I will compare against the RAM models, which are now my default, if any model can beat them consistently I will upgrade to that.
Currently I use the Qwen 3.5 RAM 48GB version on my 64GB Mac, or the Gemma RAM 30GB, those are my go to. Both are fire.
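The quant-vs-quant methodology described above is easy to script; here is a minimal sketch (the generate callables stand in for whatever backend, e.g. mlx_lm or llama.cpp, actually produces the answers):

```python
def score(generate_fn, questions):
    """Fraction of benchmark questions a model answers correctly.

    generate_fn: callable taking a prompt string and returning the model's
    answer (backend-specific; could wrap mlx_lm, llama-server, etc).
    questions: list of (prompt, expected_answer) pairs. Use the identical
    set, order, and sampling settings for every model so the test is fair.
    """
    correct = sum(
        1
        for prompt, expected in questions
        if generate_fn(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(questions)

def quant_gap(fn_a, fn_b, questions):
    """Accuracy of quant A minus accuracy of quant B on the same set."""
    return score(fn_a, questions) - score(fn_b, questions)
```

Since both quants start from the same base model, the accuracy delta directly measures how much each quantization lost.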
philnm@reddit
this is eye opening, thanks for sharing
FormalAd7367@reddit
i used the small models on my phone but qwen seems to be smarter ?
Sensitive_Song4219@reddit
Love Gemma 4 26B-A4B (perfect successor to Qwen30b-a3b for me!) but I don't find it that efficient with thinking tokens: it often thinks pretty hard in my testing.
Incredible model though; I've used it for some light coding and debugging - it definitely dethrones the Qwen30b-a3b series. And it has similar-ish speed on my hardware as well. Impressive.
ego100trique@reddit
9b is pretty sufficient for debugging in most cases even in large code bases from what I've tried
Hot-Employ-3399@reddit
I tried gemma4 today again, just several hours ago, and 31B was still bad.
Vulkan version of llama.cpp doesn't work at all. Just crashes on getting a request, saying something about device lost error. Didn't happen in previous builds from several days ago. Happens in freshly built build.
CUDA works "too well". When Qwen on vulkan works, temperature of my laptop overall is 60-70 degrees and noise level is ok. Gemma on CUDA - 90 degrees. Laptop sounds like a jet.
Might be because of cache quantization or something: memory consumption is still garbage. I had to quantize the cache, as otherwise the context was too limited. I took an init string from a thread which promised full context length and had to decrease it to 98K tokens. That's how much I use in qwen with no cache quantization.
This weekend I'll try to download MoE quants from bartowski in case something is still shit in Unsloth, and dance around with rituals around llama-server arguments, but I'm not holding my breath.
Bijju_skr@reddit
What hardware you are using.., please share
here_n_dere@reddit
I'm having a hard time using Gemma 26b for agentic coding via opencode. Editing files is where it goes for a toss. Unusable
808phone@reddit
This is the first model where I am getting decent speeds and results using something like LM Studio and Kilo Code. It is actually editing and reading different files correctly. Context window is set to 256k.
SmartCustard9944@reddit
I have been using my proxy observer to compare different harnesses, and I found OpenCode to provide a very suboptimal context layout. OpenCode seems to be a bit ineffective with its context engineering; I am surprised it is even so popular. I suggest trying to hook it up with Claude Code instead and seeing how it goes. I have seen that Gemma 4 follows the system prompt really aggressively, which means that an underspecified or weak prompt makes all the difference.
Wise-Hunt7815@reddit
About llama-swap, here's a silly question: When switching models, is it still necessary to load from disk?
Hydroskeletal@reddit
also curious why no 31b.
Former-Tangerine-723@reddit
Because faster
evangelosclaudius@reddit
Could you share links to the Huggingface pages where we can download them from? Just swapped from Ollama to llama-swap and would appreciate some hints on what to download etc.
SmartCustard9944@reddit
They really are. The 26B A4B is extremely attentive (in the transformer sense). It is the only model that seems to recall perfectly the number and variety of tools it supports (via Claude Code). Other models, even the bespoke Qwen 3.5 35B A3B, either do not recall all of them, hallucinate the number, or even change the number while responding. I tried even higher quants at Q8; it does not change a thing.
Tool calling is the main feature of agents, and to me it has to be extremely reliable or I won't use a model.
I tested Gemma 4 on LLM Arena compared to other bespoke models, and it is crazy consistently better than many closed models.
I am toying with the idea of buying a Mac Studio just to go from 20 tok/s to 150 tok/s with this model.
PinkySwearNotABot@reddit
what is a claude code router?
Enough-Astronaut9278@reddit
Gemma4 has been surprisingly good for its size. I've been comparing it against Qwen3.5 for vision-language tasks specifically, and the MoE architecture really helps: you get near-26B quality with 4B active params, which is great for memory-constrained setups.
Curious if anyone has tried it for agentic workflows though? In my testing, instruction following for multi-step tasks is where these smaller models still struggle compared to 70B class models.
specji@reddit
tell me a recipe for banana bread
maxwell321@reddit (OP)
Sure! Here's an easy and convenient recipe for banana bread:
Ingredients:
Banana
Bread
Directions:
Make banana bread
Let me know if you need anything else! ☺️☺️
benevbright@reddit
Ship it to "denziiger str 144…"
pardeike@reddit
"a recipe for banana bread"
0neTw0Thr3e@reddit
1 banana ripe enough to have white mold. A jolly rancher for sweetness. 1 egg. 1 cup original aunt Jemimas pancake mix.
Mix it up set the oven to 350f.
After 30 minutes the outside will be golden brown and the interior will still be mush.
IrisColt@reddit
and replaced (certain Qwens but not others) for me
RevolutionaryGold325@reddit
How do you get rid of thinking with gemma 4?
Additional-Low324@reddit
How do you switch reasoning mode on the fly?
anzzax@reddit
decoy_258@reddit
when using llama-server webui, is there a way to toggle this with a button?
anzzax@reddit
-Cacique@reddit
is this for llama-swap? also, if I change this in models.ini, would I have to restart llama-server?
maxwell321@reddit (OP)
Using llama-swap you can split models into instruct or reasoning variants via the filters and stripParams system to set the thinking_budget_tokens. I originally tried "reasoning: off" since it's a valid flag in llama-server (--reasoning) but I don't think llama-swap updated to support it yet. thinking_budget_tokens: 0 is what makes it work. I use the chat_template_kwargs strategy for my 122b models.
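A rough sketch of that split (the thinking_budget_tokens kwarg is taken from the comment above; treat the exact llama-swap and llama-server syntax as assumptions and check their docs):

```yaml
# Two aliases over the same weights: one forces thinking off via the chat
# template kwargs, the other leaves it on. Paths and names are placeholders.
models:
  "gemma-26b-instruct":
    cmd: >
      llama-server -m /models/gemma-4-26b.gguf --port ${PORT}
      --chat-template-kwargs '{"thinking_budget_tokens": 0}'
    filters:
      strip_params: "thinking_budget_tokens"   # drop client-side overrides
  "gemma-26b-thinking":
    cmd: llama-server -m /models/gemma-4-26b.gguf --port ${PORT}
```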
StatisticianFluid747@reddit
this unified memory setup sounds like a lifesaver for the p40.. are you seeing any massive slowdowns when it swaps from the 3090s to the p40? i always find the p40 bottlenecks my whole pipeline if i'm not careful with how i split the layers. also super curious if gemma 4 handles the 'half precision' better on that older card than qwen did
121531@reddit
maybe Qwen's on the spectrum?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Fresh-Resolution182@reddit
the E4B routing fix alone would've sold me. qwen 3.5 4b misrouting simple greetings to 122b was genuinely infuriating
No-Brush5909@reddit
It is a great model and I would love to use it, but I just don't understand why the token speed on OpenRouter is so slow; it is unusable
Present-Rhubarb-9284@reddit
This is the underrated part of local stacks. The router is not just glue. It is the product. A mediocre router can make great models feel dumb, and a good router makes a mixed stack feel coherent. Most people benchmark the models and ignore the orchestration layer, then wonder why real-world quality feels random.
ZhopaRazzi@reddit
Meh, depends on the task. For rule-following and throughput, qwen3.5 has outperformed for me.
andy2na@reddit
What are you using to route?
Also why not just use 26B to route also? It's MoE E4B, so it's very fast and you can save some RAM
maxwell321@reddit (OP)
I elaborated a bit on this in anzzax's comment, but basically the way I have it set up keeps all of the models loaded at once, and the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 allows my system to dynamically shift model tensors between VRAM and system RAM as needed. Gemma 4 E4B is the smallest of all the models, so it is the fastest to migrate from system RAM back to VRAM, and it has a faster time to first token than 26B on a cold run, where one of the 122B models is loaded and the other models' tensors are sitting in system RAM instead of VRAM.
andy2na@reddit
Gemma4 E2B is even smaller, have you tried that for routing?
tavirabon@reddit
I've heard E2B is pretty dumb, but I haven't used it by itself. I did use it with speculative decoding and it would spam 8 "*" in a row, which just made the whole thing loop and endlessly spam asterisks. Maybe q8 had something to do with it, idk, but E4B in q8 doesn't have such issues and takes 1gb less VRAM than E2B in bf16 so whatever.
maxwell321@reddit (OP)
I'm serving all model endpoints so they are available on open-webui, then I have a semantic router filter picking the one to actually use
Zag_123@reddit
Gemma 4 has been a surprisingly strong contender. The 26b hitting above its weight class is great to see, especially for those of us running local inference on consumer hardware. Curious how it holds up on longer context tasks though; that's usually where smaller models start to stumble.
qubridInc@reddit
Qwen 3.5 35B is great, but newer setups like Gemma 4 are starting to outperform it in efficiency and routing.
maxwell321@reddit (OP)
Yes it is! So what's a good pumpkin pie recipe
HockeyDadNinja@reddit
How does your semantic routing setup work? Is it something you made or part of one of the other packages?
andy2na@reddit
not OP, but this is how I do semantic routing in openwebui:
https://www.reddit.com/r/LocalLLM/comments/1rnwynh/how_to_use_llamaswap_open_webui_semantic_router/
maxwell321@reddit (OP)
This is a great guide btw! I believe this is the one that introduced me to stripParams. I made a modified version of the OpenWebUI semantic router to have it re-route after every message if needed, instead of just sticking with one for the entire conversation.
HockeyDadNinja@reddit
Thanks! I'm using llama-server's built in routing with a models.ini file so I'm probably almost there.
ScoreUnique@reddit
Kudos to OP for (probably) writing their emotions instead of delegating them to AI. Happy and enjoyable to read that. Gotta appreciate what I missed here.
Otherwise I'm gonna read it again tomorrow and see how my setup can profit from your suggestions. Thanks OP.
MotokoAGI@reddit
There's a special model trained for orchestration - nvidia_orchestrator-8b.
lqvz@reddit
I'm still having significantly better coding results with Qwen3.5, but Gemma 4 is better for everything else.
RegularRecipe6175@reddit
Very informative. Did you try Gemma 4 31b?
maxwell321@reddit (OP)
Not yet! But I'm excited to try when I have some more free time
popsumbong@reddit
Thx for this, I have a very similar home lab as you.
Queasy_Asparagus69@reddit
what do you use for semantic routing?
besmin@reddit
Just share the setup in a repo already