Gemma 4 looks promising for coding tasks. The improved instruction following should make it better for agentic workflows. If anyone is setting it up as a coding agent, OpenACP can bridge it to Telegram/Discord for remote access. Works with any CLI agent framework. Full disclosure: I work on OpenACP.
I’ve been playing with these Gemma 4 models since they dropped, and honestly, it’s less about replacing my usual stuff and more about adding another tool to the box. I’m on a Mac Studio with the M4 Max and 128GB of RAM, so I can juggle a few things at once. Right now, Qwen 3.5 is still my go-to for creative writing; it just feels more imaginative, even if Gemma 4 is technically more precise.
What I’ve settled into is routing tasks. Gemma 4 7B is surprisingly good for quick information retrieval and coding help, and it’s fast, easily hitting 30+ tokens/sec with my setup. The 31B model is noticeably slower, around 15-20 tokens/sec, but really shines when I need more complex reasoning or a longer-form response. I’m using Q4 quants for both; it’s a good balance of quality and speed.
I also have a bunch of smaller Ollama models loaded (Mistral, OpenHermes) for really quick stuff like brainstorming or summarizing short articles. They’re not as capable as the larger Gemma models, of course, but they load instantly and use minimal VRAM. It’s kind of nice to not always be firing up a 30GB model.
It’s not about finding the “best” model, at least for me. It’s about picking the right one for the job and not being afraid to switch. Each one has quirks, strengths and weaknesses, and knowing that helps a lot. I think people get too caught up in benchmarks and forget to actually use the models for what they want.
Gemma will fit the hardware even better. I had Qwen 3.5 35B-A3B working reasonably well with a 12 GB RTX 3060, but Gemma is better in every category except one: it *starts* at a somewhat slower rate. But by the time the context window reaches 50K, Qwen's initial speed advantage has vanished, and from that point forward, Gemma is faster.
I've had a relatively bad time with Gemma 4 so far. I'm waiting for llama.cpp fixes, new GGUFs, and everything to stabilize. It does seem like today was a good final day for that, so I will probably be retesting it soon.
I did have to update llama.cpp to run Gemma 4—once, three days ago. That took less than a minute. I've had *less* trouble setting up Gemma than I did setting up Qwen 3.5 a couple months ago, although some of that is attributable to the fact that I still remember the process of setting up Qwen 3.5 a couple months ago. I was even able to use the mmproj file from the stock Gemma 4 26B-A4B when mradermacher didn't have one (but they might now, I was striking while the iron was hot, four hours after the quantized Heretic models dropped).
So I think it's worth trying again. It's that much better. If you were impressed even a little by Qwen 3.5, you'll be even happier with a similarly sized Gemma 4 model. If the improvement from Qwen 3 to Qwen 3.5 were quantified as "one unit", Gemma 4 is two or three such "units" better than Qwen 3.5.
I basically rely on --fit and --fit-target to do all the lever-pulling for me. I've always found it to give better results than doing things manually, but YMMV of course. I just specify --fit 1 and set --fit-target to the minimum headroom I'm comfortable giving (something like 256 MB keeps my system stable), then llama.cpp will automatically do the offloading for you. I pull about 25-27 tok/s generation with this setup.
You might want to consider fine-tuning -ncmoe for even better results. Performance for me (on a GTX 1080) peaks with around 70-80% of total layers' experts offloaded to CPU. Don't offload by lowering -ngl; that also pushes critical attention tensors off the GPU and you will have a bad time. Keep -ngl at all layers and tune -ncmoe instead.
3070 8 GB; it just relies on huge amounts of offloading. I could fit it into 6 GB (to make room for the mmproj) and it still ran pretty acceptably. You just have to make sure your llama.cpp is actually offloading to CPU/RAM (with --fit, or doing it manually with the other params).
You can offload MoE models to RAM for way less penalty than dense models, and something about Qwen 3.5's MoE architecture seems to offload even better than most MoEs for me.
Qwen3.5 35B A3B is a MoE (not 35B dense). With --cpu-moe on llama.cpp you can offload expert weights to RAM and it'll only use ~3 GB VRAM in total. I run it daily on my terrible RTX 3050 laptop with 4 GB VRAM and 32 GB RAM @ 22-25 tok/s lol
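For anyone trying to reproduce this, a sketch of the invocation. The model filename is a placeholder; `--cpu-moe` and `-ncmoe` are present in recent llama.cpp builds, but double-check `llama-server --help` on your version:

```shell
# Keep all layers nominally on the GPU (-ngl 99) but push MoE expert
# weights to system RAM, so only attention/shared tensors use VRAM.
llama-server -m qwen3.5-35b-a3b-Q4_K_M.gguf -ngl 99 --cpu-moe -c 32768

# Or offload only the experts of the first N layers, keeping the rest on GPU:
llama-server -m qwen3.5-35b-a3b-Q4_K_M.gguf -ngl 99 -ncmoe 30 -c 32768
```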
Is the context as VRAM-expensive as Gemma 3's? That, to me, is what would make or break this model. Currently I can only fit Gemma 3 27B Q4_K_M with 20k context on a 5090, while I can fit Qwen 3.5 27B Q4_K_M with 190k context on that same card.
You can quantize the K and V caches. If you use Q8_0 it is unlikely you'll notice any difference at all except you'll suddenly have room for double the context window. I'm using Q5_1 (with a Q4_K_M model) and that seems to be just enough depth that I'm not adding any *extra* loss to the model. When I use Q4_1, I do notice a difference.
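Back-of-envelope math for why this buys so much room: KV cache memory scales linearly with element width, so q8_0 is roughly half of f16. This is a sketch; the model dimensions below are made-up placeholders, and real q8_0 adds a few percent of block-scale overhead that the formula ignores.

```python
# Rough KV-cache sizing, to show why quantizing the cache roughly
# doubles the context you can fit in the same memory.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    # 2x for the K and V tensors; bits -> bytes
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bits_per_elem // 8

layers, kv_heads, hdim = 48, 8, 128  # placeholder architecture, not a real config

f16 = kv_cache_bytes(32_768, layers, kv_heads, hdim, 16)
q8 = kv_cache_bytes(32_768, layers, kv_heads, hdim, 8)

print(f"f16  KV cache @ 32k ctx: {f16 / 2**30:.1f} GiB")  # 6.0 GiB
print(f"q8_0 KV cache @ 32k ctx: {q8 / 2**30:.1f} GiB")   # 3.0 GiB
```

Same arithmetic explains why a lower-depth cache like q4_1 starts to show quality loss: the savings come straight out of the stored activations.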
Out of curiosity: I have the same 5090, but using Qwen 3.5 27B causes a huge hang in opencode/Claude Code when trying to do agentic stuff. Things like 3 minutes for a “hello”. Are you facing this as well? (I did also confirm that chatting through OpenWebUI performs at expected speeds.)
I haven't used opencode/Claude Code before so I can't say for sure. That being said, I have noticed a similar problem when using Cline and Roo sometimes, with both the Qwen 3.5 27B model and even the 35B A3B. Could be just the models getting stuck in thinking loops, as they are known to do.
I have since switched to the Claude Opus reasoning distilled versions and they perform much better for nearly all of my use cases. No hang-ups with Roo or Cline anymore, so maybe you could try those with opencode and claude code instead?
Thanks for the advice, I've found that tinkering with all of this stuff has been my actual favorite part of the whole local process. I'll go ahead and take a look at the distilled versions as I was mainly just testing the unsloth quants
I'm successfully running 26B-A4B at Q4_K_M quantization on a 12 GB RTX 3060 and an i5 8500 with 48 GB of RAM, and getting around 14 t/s. And that's with vision enabled. Until I started playing with the Gemma models today, I was using Qwen 3.5 and 35B-A3B (Q4_K_M) and Gemma is about 12% slower... but much more than 12% smarter.
And now, just because someone is going to read this months or years down the line... Gemma is only slightly slower at the *start* of the conversation. As the context window fills, Qwen takes greater speed hits. By 50k tokens, they're about the same at around 13 t/s. By 100k tokens, Qwen takes a massive nosedive in performance (5 to 6 t/s) while Gemma is still chugging away at 12 t/s.
My GPU is 16 GB VRAM and I use Qwen 3.5 35B Q4. You are not forced to load the whole model into the GPU; you can just offload some layers. For example: with my 9070 XT and its 16 GB VRAM I got 20-25 tok/s on that Qwen model.
I know about this, but I'm forced to load it all into the GPU; my Ryzen causes BSODs if I set RAM above 2667 MHz. I spent hours tweaking voltages and timings, and even 2800 MHz will cause WHEA errors. Sad reality of having 4 DIMMs on AM4. :/
Intel's AutoRound Q2s are actually super good, really surprised. Made me able to run Qwen3 35B at acceptable speeds. Hope they'll release some for Gemma 4, though I think I can run Q4 there
Now it loads but when I prompt it it just spins endlessly and doesn't generate any tokens. I tried switching back to Omnicoder-9b and now I only get 10t/s instead of 60t/s even if I switch the runtime back. Any idea why this is happening?
Not IQ2 but last week I saw people saying MoE models like Qwen 3.5 35b are basically the same in IQ3_S and Q4_K_M so I’m probably going to start with IQ3_S as my baseline.
After testing I would say that sadly this model is unusable at IQ2. It mixes up a lot of facts with simple questions and sometimes doesn't even understand the question properly.
Yep ; It is me - Dangerous_Fix is top secret undercover name. LOL
No worries on the naming; that is so people know what they're clicking through for.
And ahh... I learned that from some of the other model makers before me.
If Gemma does not have "safety policy" reasoning in base models, it wins by default in my books.
Like half of Qwen's overthinking in my usage came from it being trained to constantly check against a non-existent safety policy (I say non-existent because, while it claims it is referencing a safety policy, in reality it was trained to hallucinate a safety policy that aligns with whatever rules they entered into the dataset).
If it was trained to refer to a prompt-defined policy it would be one thing, but the way they did it is so obnoxious.
uuuh, this is unexpected... looks like qwen 3.5 beating gemma 4??
Even if they're only tying, the Qwen models are more compute-efficient: 3B vs 4B active params, and 27B vs 31B dense. Qwen models are pulling ahead across the board tho.
For a MoE, the smaller the total params, the more likely you can fit all or most of it in your VRAM. And that'll boost performance more than 1B fewer active params will.
I do think Qwen's MoE is probably smarter, if too rambly, but the size of that thing is starting to become awkward at 35B. Whereas you can likely REAP the 26B down to 20B with virtually no loss of performance and cram it all onto a 12 or 8 GB card.
yeah, i was just talking about the compute needed/active params.
so in both cases, yours and mine, qwen would be faster since it has less active params.
unless you have some VRAM, in which case you'd need to run less of gemma on the CPU which might make it slightly faster, but idk how big of a difference it would make.
But in my case, there is no difference. qwen is just better, and at the same cost/speed.
I think speed depends more on the percentage of attention tensors on the GPU, rather than the number of active params. That's why llama.cpp provides the -ncmoe option, which only offloads the up and down tensors and leaves the attention tensors on the GPU.
One concerning area is that HLE no-tools vs tools is only 19.5->26.5 (+7), while qwen is 24.3 -> 48.5 (+24). It may suggest it's not nearly as good with tools (or Google's tool use harness isn't as good as Qwen's for HLE specifically?)
Some basic calculations: in terms of the geometric average of all these scores (a proxy for overall competence, since the geometric average is very sensitive to the minimum value), among the six models that have values for every single benchmark, Qwen3.5-122B A10B is the overall strongest contender, with the 27B in second place. Oddly, in terms of geometric average divided by effective parameter count (the square root of the product of full size and active-experts size), the 35B that I see a lot of people complain about on here appears to be by far the "densest" in score per parameter, and I wonder if that actually means anything useful or not.
Nobody asked, but I just like playing with tables of numbers uwu
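The "score density" calculation described above can be sketched like this. The benchmark scores here are made-up placeholders, not real results; only the parameter counts come from the thread:

```python
# Geometric mean of benchmark scores, divided by "effective params"
# = sqrt(total * active), as described in the comment above.
import math

def geo_mean(scores):
    return math.prod(scores) ** (1 / len(scores))

models = {
    # name: (placeholder scores, total params in B, active params in B)
    "Qwen3.5-122B-A10B": ([78, 85, 62, 90], 122, 10),
    "Gemma-4-26B-A4B":   ([70, 80, 55, 84],  26,  4),
}

for name, (scores, total, active) in models.items():
    g = geo_mean(scores)
    eff = math.sqrt(total * active)  # effective parameter count
    print(f"{name}: geo-mean {g:.1f}, score per effective param {g / eff:.2f}")
```

The geometric mean punishes any single weak benchmark much harder than an arithmetic mean would, which is the point of using it as an "overall competence" proxy.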
Yeah, Elo is basically just RLHF overtraining, which on its own can lead to huge issues, as seen with GPT-4o... so not sure it's the best thing to go by exactly.
Yes, Gemma-4-26B-A4B at IQ4_NL fits well! More than doubled my speed compared to Qwen3.5-35B-A3B at Q4_K_M which needed offloading. Not sure how the 31B model at a lower quant would perform compared to it.
Thank you for the tip! I downloaded the version you said and installed it in ollama like I did for gpt-oss:20b last year, but it only spits out garbage when I ask it a question. I updated ollama to the latest version and that got it to at least load. I am going to update my Nvidia drivers and see if that helps.
31B is most likely a no-go. Maybe the 26B MoE, if it handles extreme quants alright (Q2). If not, you could try the 26B at a more reasonable Q4/Q6 and have just a little spillover into system RAM, though a slowdown is to be expected. Best answer is to try these out yourself when you have some time, or wait for others to report real-world use.
You don't need to go to a quant that low on 16 GB VRAM. With MoEs, offload some of the experts to CPU and you get a dramatic speed increase, making Q4, 5, or 6 usable for you.
The 2B and 4B can run on it, since I can run models of that size on an Intel Iris Xe integrated GPU with 16 GB RAM. As for the bigger ones, I'm not sure, since I don't have the RAM for them. But since the 26B model is a mixture of experts, if you have enough system RAM you can offload the rest of the weights to it while keeping the active weights on the GPU, so I think you can probably run that one.
It ranked higher in some benchmarks, like Artificial Analysis. Most people don't understand that intelligence and knowledge aren't the same. A small model like Qwen 3 4B 2507 will never have the same amount of knowledge as a big model.
What these benchmarks show is that smaller models are getting smarter: they are getting better at solving problems, retrieving information via tool calls (web search), and then handling that data to give a good answer.
I would argue: if you give a modern small model access to tool calls (web search, coding environment, etc.) and then compare it to an older, bigger model like GPT-4o, the small model will be on par, if not better. But on its own, offline, without a knowledge base, the small model is nowhere near.
I love how small models keep getting better. Maybe eventually we'll reach a point where you can actually have a small (~8B) agent on a phone or laptop that you can tell to do stuff somewhat reliably, without worrying about it breaking everything.
The outputs from that model certainly punched every ticket to hell I could possibly take, and inflicted further permanent psychic damage on me. I freaking loved it.
Wish they'd release bigger models though, a 100B MoE from them could be great without threatening their proprietary models. Hopefully one is coming later?
We're also in a crazy memory shortage, so I think releasing smaller models that perform in the same class as much bigger ones is probably a better mindset for the industry than just releasing something huge for the sake of "more parameters = better". Low key I'm tired of the daily SOTA gigantic 500B+ models that I can't even run across 4x RTX Pro 6000s.
I mean sure, but there surely is a bit of space to fit a model between 31 and 500B+, no? Isn't Qwen3.5-122B-A10B one of the most popular in the Qwen3.5 family? I'd like to see something like that from Google if their ~30B models are so good.
I'm not necessarily disagreeing with you there. There's just an upward push in parameter size, and I'm glad to see Google is able to throw down in the ~30B range, especially given the RAMpocalypse. So maybe that pressure to keep pushing params up gets a little relaxed, idk.
I was using 500B as an example. I know I can run 100B easy on one lol, but there seems to be a trend of releasing "better" models right and left but they're just absolutely massive and slow.
Their proprietary models are definitely getting bigger, so it's quite possible that their open models will have bigger sizes too. Someone else pointed out that they called the current releases Gemma 4 small and medium, indicating there's a large, and previously there were leaks about a Gemma 4 124b MoE, so there's hope.
I have one that I was connecting via oculink but my setup has some downsides. Oculink doesn’t allow hot plugging so the gpu has to always be idle if you want to leave it on all the time which negates some of the power advantage of having an always on llm machine.
Also, the gpu/harness I have runs the GPU’s fans at a constant 30% never spinning down. Also, also, I never was able to get models to play nice when splitting them across both the unified gpu and the egpu at the same time.
I’ve had OK results with llama.cpp + Vulkan and Radeon pro Ai R9700. Ran Qwen 3.5 122b at Q8_0. :) I’m OK with the noise too.
But I had to remove my second NVMe on one of my Strix halos. Turns out that the eGPU was causing the whole system to freeze while on the other strix halo with single NVMe it worked like a charm.
I also did have some instability on the machine with two NVMes when I used a network card - sometimes the card was lost and I had to restart the machine, while the same model on the other machine worked.
It’s the memory speed. Strix is around 250 gb/s and 5090 is 1700 gb/s. Strix has a large pool of RAM so you can load large models. In a MoE, you only need to get the weights for the active experts per token (active experts can change from one token to the next) vs dense where you need all weights per token.
31B dense vs 26B A4B:
31B weights read per token vs ~4B weights read per token.
Dense models seem to perform better imo. Ofc, a much larger MoE could outperform a smaller dense model.
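The bandwidth argument above, as arithmetic: decode speed is roughly memory bandwidth divided by bytes read per token. The 0.6 bytes/param figure is an assumed approximation for a Q4-ish quant; real throughput also depends on kernels, KV cache reads, and so on.

```python
# Rough decode-speed ceiling: tokens/sec ~ bandwidth / bytes touched per token.
BYTES_PER_PARAM_Q4 = 0.6  # assumed, ~4.8 bits per weight at a Q4-class quant

def max_tok_per_sec(bandwidth_gb_s, active_params_b):
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM_Q4
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Strix Halo is ~250 GB/s per the comment above.
print(f"31B dense ceiling:   {max_tok_per_sec(250, 31):.0f} tok/s")  # ~13
print(f"26B-A4B MoE ceiling: {max_tok_per_sec(250, 4):.0f} tok/s")   # ~104
```

This is why the MoE "flies" on unified-memory boxes while the dense 31B crawls: the dense model has to stream every weight past the memory bus for every token.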
Yep, strix has more vram but it is lower memory bandwidth than a typical gpu. Strix is great for MoE models because they’re generally a lot of parameters with few active params whereas dense models activate all the params at once.
I haven't been regretting my Strix Halo tbh. Yeah, a 5090 would have cost around the same and gotten me way faster speeds, but firstly it isn't a standalone server computer and I'd need to pay more for a computer to put it in, and secondly the VRAM of a 5090 is so limited in comparison: to run Qwen3.5 35B at full context would require dropping down to Q3. Plus I get to play around with 100B MoEs, which still work fast enough as a backup in case the smaller models aren't capable of something.
I got one too and I feel you, but what is worth considering is that the massive VRAM means that you can give these models several context windows at once to several agents that can run in parallel, increasing your tokens/seconds/agent. I'll try it with claw-code.
Gemma seems to have solved the "50 meters to the car wash" problem, and it even identifies specifically how other LLMs fail on this test. Has that question/meme been around long enough to make it into the training data, or is it actually smarter?
Fun fact: medical-data training makes a great Dungeons and Dragons RP base too, because after fine-tuning it can focus in great detail on the anatomy and effects of fantasy creatures.
I see the answer Loafy gave you but I'm just gonna say I actually play Sillytavern and keep semi up to date on the models people use and I have literally never seen MedGemma. I think they're bullshitting you. The closest thing I've seen is Gemma 3 27b and its finetunes.
Checks out. I'm not really an active SillyTavern user, but I never heard anyone talk about them either. Thankfully the people bullshitting wasted their own time and effort talking about it. It was just cool info for me, and now you've grounded the fact. Thanks.
Apache 2.0 is the gold standard and fully permissive. The Google Gemma license was "open", but Google technically had the ability to restrict it for any reason, if it came to that.
I wonder if they did it because they felt annoyed that everyone was still using Mistral 24b tunes instead of Gemma 27b this whole time. I mean, presumably vanilla G27's writing ability and intelligence are both supposed to be higher than vanilla Mistral 24b, right? But because of the license, all the tunes were for Mistral 24b, and most people ended up preferring that to Gemma 27b and also preferred it over its abliterations.
Or they just want as much serious innovation/experimentation from the populace to be done on it for non-writing stuff, and the license helps with that too, or something?
Well, in any case, pretty cool they decided to just unleash this thang
Big deal honestly. Apache 2.0 means you can do anything with these models commercially without Google's terms hanging over you. This is Google finally playing the open-weights game for real — not just "open with asterisks." Could shift a lot of enterprise adoption that was stuck on "but what's the license?" questions.
Gemma 4 dropping at this level is actually insane for open-source. 26B punching way above its weight and the speed on consumer hardware is a game changer. I've been running the release locally and it's noticeably smoother than the previous Gemma line on agentic tasks. Still curious how it compares to the newest Qwen3.5 in real tool-use chains though. Anyone else already quanting and testing it?
That Performance vs Size chart is actually insane. The fact that the gemma-4-31B-thinking and 26B-A4B models are punching so far above their weight class to beat out 120B+ parameter behemoths like Qwen 3.5 122B and Mistral Large 3 on the Elo scale is wild.
Seeing almost a 90% on AIME 2026 from a 31B model just proves how powerful that new configurable step-by-step reasoning mode is. Combining that built-in thinking with the 256K context window is going to make these absolute beasts to run locally. Definitely downloading the 31B GGUFs to test this out today.
Gemma 4 dropping feels like Google finally stopped playing it too safe. The efficiency numbers they’re claiming could actually make local models feel snappy again on mid-range hardware instead of just server-grade stuff. I’ve been running the last couple of Gemma versions locally and the jump in coherence is noticeable. Anyone already spinning this one up and seeing the difference in real tasks, or is it still too fresh?
Just shipped a small Android assistant app using Gemma 4 E2B via LiteRT-LM; tool calling works surprisingly well out of the box. The native format (<|tool_call>) is clean to parse, and the model stays on-task without much prompting.
Coming from Gemma 2, the jump is significant. Response quality is noticeably better, and the memory footprint is actually smaller for what you get. 52 decode tokens/sec on GPU makes streaming feel instant.
Next experiment is using it as a coding assistant, curious how E4B holds up on LiveCodeBench-style tasks locally. Will report back.
It seems like native tool calling isn't working very well. Is this a model problem or me? I'm running 26B-A4B at UD-Q6_K_XL with all the same settings in OpenWebUI as Qwen3.5-35B-A3B at the same quant (native tool calling on, web search and web scrape tools enabled), plus <|think|> at the start of the system prompt to enforce thinking. Given a research task, Qwen3.5 did a web search (SearXNG, so only snippets were returned from each result) and then scraped 5 specific pages, while Gemma 4 did a web search, summarised, came up with a research plan, and then immediately gave me a response without actually following through on its research plan.
It did this somewhat consistently. The one time it did try fetch_url after search_web, it happened to fetch a page that was down (which returned an empty result), and it just went into responding as if it never planned on doing further research in the first place, nor did it try the alternative web_scrape function that I also have available (which I noted in the system prompt as a more reliable backup to fetch_url).
I also tried telling it to do further research after its first message, which caused it to use search_web twice, still no fetch_url. I then tried telling it to use its other search tools, after which it tried web_scrape once, which got it some results, and it just gave up. There's zero persistence in its research.
Yup even the one time I got it to search the web repeatedly (gave it a task where a single search definitely gets nowhere close to the full answer), it did like 5 searches and a page fetch, talked about needing to do more searching, and still stopped searching anyway.
Try Unsloth Studio; it works wonders there! We tried very hard to make tool calling work well. Sadly, nowadays it's often not the model but rather the harness/tool that's more problematic.
I'm serving OpenWebUI via a home server to my whole family, is that possible via unsloth studio?
Also you showed one tool call but I'm looking for multiple consecutive tool calls for in depth internet research tasks, is gemma 4 able to do that in unsloth studio?
I'm using the unsloth quants, maybe I should try some others, I'll do that tomorrow. Currently using llama.cpp built for vulkan for this but I usually use llama.cpp ROCm from lemonade sdk, will wait for that to update
Native tool calling straight out of the box is huge for setting up reliable agentic workflows locally. Finally being able to automate heavy business logic without bleeding money on API calls is a massive win.
Thanks. I've tried several ft trial runs with `unsloth/gemma-4-E2B-it` on Kaggle (T4 GPUs) but they all go `NaN` in reported loss after some time. Have you or anyone else been able to successfully tune this one on a dataset?
All the typical hyperparameter stuff already tried, tiny LR, tiny grad norm, filtering out empty samples, etc.
`UNSLOTH_FORCE_FLOAT32` made no difference. Tried using `FastVisionModel` instead of `FastModel` according to those notebooks but same outcome.
Btw, `device_map="balanced"` seems to give an illegal memory access error on FastModel, so Gemma 4 probably can't be multi-gpu trained that way for now. But that doesn't affect most users I'd think.
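Not a fix for the underlying numerical issue, but since NaN losses came up: the usual last-resort guard in any framework is to drop steps whose loss goes non-finite so one bad batch doesn't poison the run. A framework-agnostic sketch (function names hypothetical, not the Unsloth API):

```python
# Skip the optimizer step whenever a batch produces a NaN/inf loss,
# and bail out entirely if it happens too often.
import math

def train_with_nan_guard(batches, loss_fn, apply_update, max_skips=10):
    """Run a training loop, dropping steps whose loss is NaN or inf."""
    skipped = 0
    for batch in batches:
        loss = loss_fn(batch)
        if not math.isfinite(loss):
            skipped += 1
            if skipped > max_skips:
                raise RuntimeError("too many non-finite losses; check data/precision")
            continue  # no update for this batch
        apply_update(loss)
    return skipped

# Toy run: the NaN batch is skipped, the other two update.
losses = iter([2.0, float("nan"), 1.5])
applied = []
skipped = train_with_nan_guard([1, 2, 3], lambda b: next(losses), applied.append)
print(applied, skipped)  # [2.0, 1.5] 1
```

If the loss NaNs on every batch rather than intermittently, as described above, the guard just confirms it's systematic (likely a kernel/precision issue) rather than bad data.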
Do you have any very quick first-impression insights into the ability of the model? People over at Hugging Face seem to rate it very highly, saying they found it hard to figure out what to fine-tune since it was so good out of the box. Is this true?
Hey, quick question re: Unsloth Studio. I'm thinking of switching over to it from my existing llama.cpp installation, but why do I need to create an account to run stuff locally?
Onboarding
On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.
The first version I downloaded didn't ask me to create an account so I thought it was interesting that it was now a requirement.
We're still trying to get it to work well in Studio - should be done in minutes - see https://github.com/unslothai/unsloth?tab=readme-ov-file#-quickstart
For Linux, WSL, Mac: curl -fsSL https://unsloth.ai/install.sh | sh
For Windows: irm https://unsloth.ai/install.ps1 | iex
I switched to UD-Q3_K_XL and that got me to 84 tps since it actually fits in VRAM. But then I went back and retested the Q4_K_M after pulling the latest llama.cpp (there was a KV cache fix where they reverted the SWA cache being forced to f16) and switched from -ngl 99 to --fit on, and the Q4 jumped to 55-59 tps. All the tests were around 32k context. This model is a beast!
Same here. 26B A4B context also uses more VRAM for me than Qwen3.5.
So this must be what I'm seeing. I wasn't getting full GPU utilization, it must be overflowing. Same GGUFs size, Gemma 4 wants an extra 3GB vram for the same 8192 context, wild.
I was able to get it to pass this benchmark once I enabled reasoning. Though this benchmark is easy enough that it should have been able to pass without it IMO.
So how would you ask this question if you have 2 cars and left one of them in the car wash queue while going home? I can agree that most of the time you have one car and should drive it there. But if you asked a real person whether you should drive to the car wash or walk, they would probably assume you're talking about a second car that's already there; otherwise they'd think you'd gone insane. So I would assume the person knows what they're doing (they already have a car there to wash) and isn't a moron, in a real conversation.
So asking this question to a real person and "common sense" are kind of opposites.
Lol. When I was benchmarking this, I left off that first sentence because I just assumed that made it too easy. It doesn't of course, lots of models fail like this.
But because of that, I'm favorably impressed with Qwen 3.5. Without the first sentence, it thought forever, but it produced an acceptable answer: it said I should drive unless I was going to work there.
I should also acknowledge that although it thought forever, it identified the core issue very early in the thinking trace.
I'm not sure if this is LM Studio or what, but I can't load Gemma 4 unless I reduce the context window down to about ~8k, which is insane because I can load comparable Qwen 3.5 models with a ~32k context window.
and actually late yesterday there was an update to LM Studio/Llama.cpp that allowed me to load the models with expected context windows (comparable to Qwen)
I tried to use Gemma 4 with opencode/speckit to define a new feature but Gemma got itself caught in a deathloop doing the same thing over and over, then I fell asleep
Yeah never trust lm studio with new releases. They normally rush a broken version of new models to say they "support" it, but use mainline llamacpp if you want to use new models properly on launch
Same - my experience with Gemini 3 has been horrible for coding. Lots of mistakes where it said things were perfect. Qwen3.5 27B has been rock solid with the updates from llama.cpp and vllm. Not expecting much from Gemma 4
I wouldn't expect much either: you're having a bad experience with Gemini 3.0, which was their previous SOTA model from 6 months ago, and Gemma 4 is clearly weaker than that.
instruction tuned, it means the model went through a supervised fine tuning phase where it's trained to follow instructions, this lets it act as a useful assistant.
You can also find base models on Hugging Face which haven't gone through it, and so just try to complete the text sent to them instead of treating it as instructions.
Yeah, they just complete text. You could do something like writing part of some code and they'll continue writing based on it, or writing part of a story and they'll continue the story. But you can't do the usual "you're a CLI agent... [insert rest of prompt] ...now write a script for checking whether a number is a prime number", as it might just continue completing with something like "and whether it's odd or even".
wait they skipped gemma 3? lol google's version numbering is always chaos. anyway the real question is does it actually run better locally than llama or are we still in that weird spot where google models look good on paper but dont quite deliver at 4bit quant. anyone tried it yet?
I immediately tried it and it was not good actually... corrupt JSON results coming back and a bunch of other anomalies. Lasted 3 hours and switched back to qwen
Trying the 31B out on my Mac Studio M2 Max 64GB unified memory.
For some reason it uses a lot of memory when I add context, compared to qwen3.5
Q8 was unusable and q4_km usable only with very short context. Way worse than qwen3.5 27B
Don’t know why, but maybe someone computer-smart will see this and come up with a solution.
The 4-bit variants are tempting but I'm curious about the E2B vs E4B tradeoffs. Anyone actually running the 27B on consumer hardware yet? Wondering if the quality drop from A4B is noticeable for coding tasks.
They mention handwriting recognition, which could be valuable in some projects I am working on. Has anyone tried this yet? So far (anecdotally), the Qwen VL models seem to be the best, with no real competition.
On 24GB it can kinda run with ctx-size of 262144 and kv cache at q4_0, but it's on the tipping edge of crashing, can't wait for llama to add those crazy kv cache optimizations from google's whitepaper.
Oh, yeah... this is a smart model... yeah... sure.
Aha!
Line 297 uses displacement (misspelled: 'a' before 'c' is okay, but 'e' after 'c' is 'ce', the typo is displacement vs displacement). Wait, let me look closer.
displacement (line 297)
displacement (line 300)
Let's compare them:
d-i-s-p-l-a-c-e-m-e-n-t (Correct)
d-i-s-p-l-a-c-e-m-e-n-t (Looking at line 297) -> displacement.
Wait, let me re-read the provided code for main.py.
Line 297: displacement = (dx * dx + dy * dy) ** 0.5
Line 300: if displacement > 3:
Let's check the spelling:
Line 297: d-i-s-p-l-a-c-e-m-e-n-t (Actually, looking at the text: displacement. No, it's displacement vs displacement).
Wait, displacement has an 'e' where it should be... actually let me just use grep or read again to be absolutely sure of the exact characters.
I've seen errors like this too, I compared words byte-by-byte and they were identical. Also there were broken tool calls. Seems like it doesn't work properly yet.
I have great expectations for this model for computation in biology; I will try it and see what it can do for me. It's been 3 months since I touched any local models, ever since I saw Codex 5.3; in fact I haven't even updated my ollama and lmstudio 😂
Can someone explain the business model here? I'm basically running a SOTA model on my basic laptop now. Why would I buy a subscription? My university was already running Kimi and not paying. I don't get it.
yeah, in limited fields they can perform close to SOTA. that's what they are good for and it's really cool that they can do that! but calling any ~30b parameter model a general replacement for real SOTA models is silly
Tbf GLM-5's quality depends heavily on the time of day. During peak hours, especially in China, they use a heavily quantized model. And its thinking block is unusually sparse, and the model overall has poor context comprehension. 5.1 is the real deal and what 5 should have been released as.
Testing Gemma4 E4B unsloth gguf at the moment and it refuses to believe I have it running locally; it's telling me it's a cloud-based service provided by Google.
I'm getting 65-70 tok/sec which is great, so I was going to see if I can backend OpenClaw with it, but not sure I trust it if it's kinda stubborn and hallucinatory already.
I missed that - I'm still setting up my smart home system to use LLMs for local voice, but wasn't Qwen 3.5 4b also a multi-modal model? Or would you still need to use something like Parakeet for voice to text (and the associated delay of each step). Or was that only for vision and text inputs?
If so, that's a major improvement considering it is not too far from Qwen 3.5 4b. However, it looks like the same q4 quant is around 5GB for E4B versus Qwen's 2.75GB, while being roughly 4.5B active parameters. I'm curious how much faster or better quality it may be versus the multiple-tools approach, since I don't really need the audio out / text-to-speech output with my smart home at this time (only the text as a notification on my phone).
Qwen3.5 (Except for 2B) is multi-modal, but that means the classic duo of Text and Images. No direct Audio input. Do note that Gemma4 does Audio, Image, and Text input, but only Text output.
foundation models they said... I guess the recent news from that deal saying apple will open up to other providers is cause they paid billions, but in the end it's just an open model =)
it seems like a serious bug. I also had this where it wanted to "correct" a function with the same name and got stuck in a loop for 10 minutes or so until I canceled it
shuffled (the one in the bracket): s(1) h(2) u(3) f(4) f(5) l(6) e(7) d(8)? No, wait.
Let's look at the spelling:
S H U F F L E D (7 letters + 1 'd' = 8). This is "shuffled".
User's typo in code: [shuffled[i]...
Wait, let me look extremely closely at the user's text.
User wrote: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
Let's count again.
Word 1: shuffled -> s-h-u-f-f-l-e-d (8 letters).
Word 2: shuffled -> s-h-u-f-f-l-e-d (8 letters).
Self-Correction: I am being blind. They are identical in my view because of how they look on screen, but let me re-read the user's provided source code one more time.
Let me look at the first word in that bracket: shuffled.
Let's count characters: S, H, U, F, F, L, E, D. (8 characters).
Now let's look at the second word: shuffled.
S, H, U, F, F, L, E, D. (8 characters).
OH! I SEE IT NOW.
I was looking for an extra 'd'. But there isn't one.
Let me look at the user's source code one more time very slowly.
User provided:
shuffle: function(library, array) {
// Fisher-Yates shuffle - creates a copy to avoid mutating original
const shuffled = [...array];
for (let i = shuffled.length - 1; i > 0; i--) {
const j = this.getRandomInt32Inclusive(library, i);
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
}
return shuffled;
}
[the transcript continues like this for well over a hundred more lines, cycling through the same letter-by-letter comparison ("S-H-U-F-F-L-E-D ... S-H-U-F-F-L-E-D ... THEY ARE IDENTICAL"), repeated admissions ("There is no typo in the user's code. I hallucinated a typo"), and fresh restarts ("Wait, let me check again"), before finally landing on:]
THEY ARE THE SAME.
I am going to apologize and stop trying to find a ghost typo.
My initial impression is that 26B-A4B and 31B are extremely smooth with translation and language. Honestly, it's in a tier of its own so far which is something I've been waiting for over a year now. It even makes translategemma feel outdated instantly for my use case. E4B and E2B are a bit meh.
Gemma 3 is still better than many other SOTA models at Ukrainian, second only to Google models, FWIW. Claude and GPT caught up only in the latest versions.
It is crazy how good Gemmas are at multilingual support. Though, Ukrainian does require larger models.
Is there a way to disable "thinking" in llama.cpp for this model through commandline options? I tried --reasoning-budget 0, but it didn't seem to change anything :(
Am I missing something, or is Gemma 4 less censored than Mistral 3? I've tested it briefly, and it didn't refuse writing jokes that Mistral 3 24b refused to. Very interesting.
I currently use gemma-3-27b-qat-Q4. I have an Nvidia 5070 12GB, 32GB DDR5 RAM, and an i7-13700K. Will any of the Gemma 4 models run in a way that makes them an upgrade over gemma-3-27b-qat-q4? Or should I stick with Gemma 3?
Trying to pair unsloth/gemma-4-26B-A4-it-GGUF (I4_XS q4_0/q4_0 cache) with opencode. It does something but stops very often, asking me confirmation at every step. And stupid <channel stuff gets printed, not sure what to do with it. :(
seems to be a bug in the 26B quants, haven’t heard anyone able to use them properly yet. It might be a llama.cpp issue or even more likely something with the chat template
q4 should fit - I think there might be a KV Cache bug or leak that adds additional GB when extending context window. Wait for them to optimize or even better hopefully there are TurboQuants coming
What is the difference between the E4B and A4B models? I understand that A4B is an MoE architecture, so only 4B parameters are used during inference, but no idea what the E4B is?
The 26B A4B is a Mixture of Experts model. It requires around 16GB of RAM/VRAM to load at 4-bit quantization. The model is a 26B-parameter "medium sized" model, but any time you ask it something only 4B parameters are activated, which means it will be very fast, as it's not using the full 26B at any given time.
The E4B is a very "small" dense model: it only has 4B parameters, and all 4B are always activated. This will fit in as little as 6GB RAM/VRAM even at 8-bit, and would fit in 4GB RAM/VRAM at 4-bit. These small models are usually not recommended below 8-bit, as they are so small to begin with that they usually lose a lot of "intelligence" when quantized heavily.
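A back-of-the-envelope check on those memory figures (a sketch: the ~4.5 bits/weight average is an assumption for Q4_K-class quants, and real GGUF files add overhead for embeddings and metadata, so treat these as lower bounds):

```python
def approx_weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF weight size: params x bits per weight, ignoring metadata/embedding overhead."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# 26B of weights at ~4.5 bits/weight (assumed Q4_K-class average)
print(round(approx_weight_gb(26, 4.5), 1))  # -> 14.6, i.e. ~16 GB once runtime overhead is added
# 4B dense model at 8-bit and ~4.5-bit
print(round(approx_weight_gb(4, 8.0), 1))   # -> 4.0
print(round(approx_weight_gb(4, 4.5), 1))   # -> 2.2
```

Note this only counts weights; KV cache and activations come on top, which is why a "16GB" model can still be tight on a 16GB card.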
Well I find it great for analysis and planning, but for writing any code it's only my fourth choice, after Kimi and the two usual suspects. Maybe it does better for Golang or something, but it seems consistently bad at implementing math heavy stuff.
This is a bit cognitively jarring for me because I use Gemini 3.0 every day as my base model (when I've run out of credits for frontier models) and it's absolutely fine. I'm coding large and fairly complex applications.
I wonder if what we're experiencing is that the quality of the agentic loop is more important than the model.
Why? We know Chinese models aren't as polished on reasoning as models from the big 3 western labs.
We also know Gemma 3 has unusually high world knowledge for its size.
So a slightly scaled up version of Gemma 3 + reasoning would be expected to be one of the best open reasoning models out there. Qwen still has less reliable reasoning than GPT-OSS; it's the base model performance that makes up for it.
I’m not worried about knowledge to be honest. I’m much more interested in intelligence (understanding queried history and using all information it has) and tool utilization
Any idea when llama-cpp-python will be updated to support Gemma 4? A project I'm working on uses llama-cpp-python with a custom IDE UI written in Python, and I'm getting model initialization errors which make me think that llama-cpp-python isn't able to make heads or tails of the Gemma 4 architecture.
I'm using the unsloth Q4_K_M quant of Gemma 4 E2B, hardware is a Raspberry Pi 5 8GB
No architectural innovation?? No hybrid attention? Apart from gemma specific capabilities like strong multilingual perf and nice talking style, I don't think this means much... Qwen3.5 wins in architectural innovation, hybrid attention that supports very long context with minimal memory footprint... I wish they had shared some research that actually pushed things forward...
Spent half the night testing it and I think people don't realize how big of a deal it is for those of us who value the range of philosophical thinking more than tool use.
I just tried e2b on my iPhone with Google's Edge Gallery. I asked it to write a dfs for me, and then my phone started to burn 😭 but it is actually fast. Based on this website and Google's blog, e2b/e4b actually support native audio, which is insane
gemma-4-31B-it-UD-Q4_K_XL passed a personal niche code test I use first try that all other models have like a 95% fail rate on cause they miss one thing. We might have something special here
5070ti 5060ti 32gb combined, llama.cpp cuda, 25tps to start trickling down to 18tps after 32k context used.
Ah nice, theres also options like adding this for llama.cpp, but I haven't battle tested it for intense code debug sessions so I'm not sure what a good value for reasoning budget would be
--reasoning-budget 4096 --reasoning-budget-message "I'm running low on thinking tokens, I should wrap up and give my answer."
Just had this happen to me too with llama.cpp on windows with claude. Started at around 50GB RAM used by the OS, then eventually hit 128GB RAM after a long session, and then the process was killed.
i am a simple man that just uses ollama running in a w11 vm (there's a reason for that) to handle local llm services. please let the pre-release 0.20 update come out soon.
Just replaced Qwen3.5 35B with the Gemma 4 26B in one of my workflows and got a HUGE speed increase simply due to the fact that Gemma doesn't think as much.
Not to nitpick, but why are the links for the "unsloth" version? I could not get that working for the life of me.. But then I went and tried the standard "ollama run gemma4" model and that runs perfectly.
Gemma models typically output a nicer aesthetic (better prose, formatting, etc.). If I had to guess, they're probably heavily weighting head-to-head scoring mechanisms like LMArena.
Definitely noticing this as the biggest jump from Qwen 27b. It's prompting me back, keeping the conversation going and helping me think towards solutions alongside it. This is a very interesting experience!
I would expect these models to have better language skills and possibly better broad knowledge (likely what sways LM Arena). While at the same time having likely worse analytic rigour, likely worse in agentic tasks or highly specific scientific work. Tau2 might be a decent proxy. Qwen scores extremely well there, in fact Qwen3.5 4B scores higher than 27B on that benchmark and either model is better than any of the Gemmas. It's definitely something these models are very optimized for. I would imagine the Gemma models to be better generalists. Also the Qwen models think obscenely long, especially the smaller ones. If you get comparable performance with less thinking that's a win.
Would also wait for independent benchmarks. From a first little test I do find them to perform favourably against Qwen but not in a blowing them out of the water way, at a comparable level, likely with different strengths and weaknesses.
i think in reality now the release hype is starting to dull down we can see it's probably much closer to 27b, which makes sense. still seems like a great release but qwen3.5 set such a high bar
If it's true that the AA omniscience accuracy benchmark (general knowledge) is a predictor of model size, then Gemini 3 is likely the largest model that exists which is likely its biggest strength. I'm curious how benchmarks will turn out but I would suspect something more akin to the small Qwen 3.5 models with less overthinking and probably slightly worse at very technical tasks, slightly better in other domains.
No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins
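The rule quoted above can be enforced client-side before each request. A minimal sketch, assuming the model wraps its reasoning in `<think>…</think>` tags (the actual delimiter varies by model and chat template):

```python
import re

# Assumed tag format; adjust to whatever your model/template actually emits.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thoughts(history: list[dict]) -> list[dict]:
    """Drop reasoning blocks from earlier assistant turns before the next user turn."""
    cleaned = []
    for msg in history:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<think>Simple arithmetic.</think>4"},
]
print(strip_thoughts(history)[1]["content"])  # -> 4
```

Most inference frontends do this for you via the chat template, but it matters if you build the message list yourself.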
Eh it is still using the weird interleaved thinking mode. The other 2 new models, Trinity Large Thinking and Qwen3.6 Plus, already embrace the preserved thinking mode.
Personally I prefer that, as preserving thinking means the context size balloons really, really quickly. And personally I haven't actually found that models that preserve thinking perform that much better than those that don't.
Do you run local inference on consumer hardware? Because interleaved thinking also breaks prompt caching.
These days, the best models like GLM-5 and Qwen3.5 support long enough context, and also don't think for too long in between tool calls. Preserved thinking should be the way forward.
Holy fuck, that's the model I'm the most excited about. Qwen 35B is SO good that I desperately want something like the 27B (which is even better but way slower), just faster. So holy crap I'm so excited
Cool off. Qwen 35B A3B is a multi-modal model first, coding second. Apart from coding (basically in most OpenClaw cases), Qwen3.5 is still SOTA. Gemma 4 E4B badly loses to Qwen3.5 4B and 9B in most benchmarks. Give it some time and give them both a spin and compare them or have someone else compare them for you, and you'll likely see that Qwen3.5 is still extremely good.
MRCR v2 is a "needle in a haystack" benchmark to test for long-context performance. A higher score means the model is better at finding small pieces of information hidden in a sea of text.
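A toy version of such a needle-in-a-haystack probe is easy to build yourself (a sketch of the general idea, not MRCR v2 itself; a real harness would feed `haystack` to your model and score its answer):

```python
import random

def build_haystack(needle: str, n_filler: int, depth: float, seed: int = 0) -> str:
    """Bury a needle sentence at a relative depth inside generated filler sentences."""
    rng = random.Random(seed)
    filler = [f"Fact {i}: the sky was colour number {rng.randint(0, 9)}." for i in range(n_filler)]
    pos = int(depth * n_filler)
    return " ".join(filler[:pos] + [needle] + filler[pos:])

needle = "The secret code is 7431."
haystack = build_haystack(needle, n_filler=1000, depth=0.5)
# A real run would now prompt the model with `haystack` plus
# "What is the secret code?" and check whether the reply contains "7431".
print(needle in haystack)  # -> True
```

Sweeping `depth` from 0.0 to 1.0 at several context lengths is what produces the usual retrieval-vs-position heatmaps.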
I can load 2 ggufs with llama, a 10.8GB Qwen3.5-27B IQ3_XXS and an 11.5GB Gemma 31b IQ3_XXS gguf with the same settings (tested with Cuda 13 and Vulkan llama builds). I'm seeing 3GB more VRAM, and IQ3_XXS barely fits on my 16GB.
yeah, to run it on a 5090, I had to take it down to 32k context with Q4_0 kv cache. Makes it a bit limited. Even the 26b version had to use Q4 kv cache at 128k, otherwise it ballooned up and failed.
Now I understand why Google was recently publishing papers on how to reduce the size of KV cache.
Looks like they built a purpose for their TurboQuant.
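For context on why the cache balloons: KV cache size grows linearly with context length and is unaffected by quantizing the weights, only by quantizing the cache itself. A sketch of the standard estimate (the layer/head numbers below are illustrative for a 30B-class model, not Gemma 4's actual config):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int, ctx_len: int,
                bytes_per_elt: float = 2.0) -> float:
    """KV cache = 2 (K and V) x layers x KV heads x head dim x context x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt / 1e9

# Illustrative config: 48 layers, 8 KV heads, head_dim 128 (NOT Gemma 4's real numbers)
print(round(kv_cache_gb(48, 8, 128, 128_000, 2.0), 1))  # -> 25.2 (fp16 cache at 128K context)
print(round(kv_cache_gb(48, 8, 128, 128_000, 0.5), 1))  # -> 6.3 (~q4_0 cache, 4x smaller)
```

This is why q4_0 KV cache (and fewer KV heads, via GQA) is what makes 128K contexts feasible on consumer cards.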
the 26B MoE fitting on 16GB is what I've been waiting for. been running qwen 3.5 27B for code stuff and it's solid but slow - if this thing is comparable quality at those inference speeds people are reporting I might finally have a daily driver that doesn't make me stare at my terminal for 30 seconds between completions.
Where is Gemma 4 270M... Awesome release; I hope Google will release such a small model again. It's incredibly capable for its size, and I don't think there is any other alternative similarly sized.
Are lfm2.5 350 or bonsai 1-bit models easily fine-tunable? I'm kinda stuck with LlamaFactory as it's easy and does what I need it to. Although I think bonsai is just XOR for tunes?
Very high-detail embeddings, insanely quick to experiment and fine-tune; it takes minutes with a solid GPU, and even on CPU you can probably produce an ok-ish tune in a matter of an hour or two. Generally with function calling and proper specialization, i.e. docs and stuff and RAG, it produces really sensible output really fast.
If you do need to perform some language analysis task, and it's too much for general NLP tools like spaCy, then such small models are your best bet unless you have compute capacity for larger ones, or you are willing to hit API for every small thing.
Also, I have a personal challenge to see how far one can push such a small model, and the "smarter" the base, the better ;)
It's a bit late where I am, but I threw Gemma4-26b on my mi50 32gb
Ran it with -c 128000 -dev rocm0
Used the UD Q4.
Llama-bench got about 939 +- 21 on pp512 and 76 on tg128
Ran a quick 2 prompt run with llama-cli and got about the same results.
I'll have to test some more tomorrow, I'm too tired rn.
was so excited about this, but in my Vietnamese -> English translation task Gemma4 is worse than Qwen3.5 in the same Q4 quant. It also failed the car wash puzzle :(
I have a basic laptop i7 with 32gb ram running qwen3.5 4b q5_k_m with llama.cpp. Swapped it over to gemma-4-E4B-it-Q4_K_M.gguf (with some flags) and not only is it faster, it gives significantly better answers
I'm very much a newbie, but even saw the difference when using it for finance analysis
Back in the 90s I used to program assembly, and whilst this old decrepit mind isn't sharp enough to do that anymore, I know what the end results should be and how they should be processed, so I'm having great fun giving it a good pokey pokey. The laptop is having a meltdown, all good fun!
Yes but I was doing 64k intros, with music and 3D :)
I tried to use local LLMs to generate some effects in Python or HTML, there was a bigger problem with C++ and some libraries like SDL, not sure how to use assembly in 2026 to render something, but maybe it's possible.
I've already seen like 4 "secret" models, the most recent one is actually called "Leviathan" XD
They all seem to be in testing at Meta AI, but I had already seen that, according to Mark, they were going to focus on making closed-source models to compete with the rest. You know, Llama 4 was the worst model in 2025, and apparently that really hurt their egos.
In LM Studio, you can try Gemma 4 via the CPU or Vulkan backend if you have an AMD iGPU. Gemma 4 26B A4B model on my Strix Halo via Vulkan gives about 50 tokens per second.
Oh, great news! Thinking, system role support, more context basically what everyone asked for, and a 35B competitor MoE too.
But aww man audio is E2B and E4B only, that's a bit of a bummer. I thought we were about to have native and capable voice assistants now. But these are too small. Basically larger native multimodal models that can input and output audio natively.
Yes, I was thinking just use it for the recognition and feed the output directly into a larger model, don't even bother with tool use, make that the loop.
Indeed, but qwen3.5 4B is at the level of gpt-oss-20B and in some cases gpt-oss-120B; it is by no means a weak model. Likewise, Gemma 4 E2B is at least at the level of Gemma 3 27B, at least as far as google's benchmarks go.
Might be, but they are still small models and the MoE and the 31B dense are obviously a lot better. These capabilities with native audio support would have been great to have. But I guess it is not the time yet for that
Oh, the hype isn't bullshit! Comparing the MoE model favourably to qwen 3.5 in my own tests right now. It's getting some very tricky shit right! STEM and philosophy, that is. And it's fast despite partial offload. Sweet af.
I'm trying to run gemma-4-E4B-it-GGUF on both my PC with Unsloth Studio, and my phone with Off-Grid, and none of them work. anybody having the same issue ?
Finally, an open-source model that not only allows you to write in German but can also express itself very well in German. Multilingual capabilities have always been Gemma’s strength, and that’s still true for Gemma 4. No other open model has come close so far.
I have a few random trivia questions I toss at models just to get a feel for their training data. Not so much expecting a right answer, but more to see how they fail and if they get the general gist of the topic even if getting the specifics wrong. 31b got my history, early American literature, and pop culture questions totally right and 26b came really close.
Hardly a real benchmark or anything. But it's the best I've ever seen from models this size.
llama.cpp Vulkan b8637 + 26B-A4B-it-UD-IQ4_XS (on 7800 XT 16GB) seems to have a bug in its fit/context size estimation (or at least it's way too conservative). Using --fit I have to dial the context target all the way back to 256 (lol) to get it to not offload any layers, but if I force --ngl 99 it complains a bunch but loads and runs fine up to a context of about 20K.
I've not used any of the Gemma models before, is there room to run these (either 26B A4B or 31B) with reasonable context if you have 32gb or 48gb setup of VRAM?
I don't trust benchmarks anymore because models are benchmaxxxed. Elo should be the only valid benchmark because it's based on arena votes from humans, but even that could somehow be broken in 2026. It's arena.ai, it was called lmarena before
Thanks, well gotta be cautious trusting anything LLM-related in 2026: this arena has 31B with same score as sonnet-4.5, which leaves me very doubtful. Google has probably received enough of those user traces from this arena for gemini and now has a decent idea what users there vote for and skew in that direction. E.g. make model hallucinate more instead of confirming it can't answer
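For reference, the arena Elo mentioned above is computed pairwise from human votes; a minimal sketch of the standard update rule (k=32 is a common illustrative choice, not necessarily what any given arena uses):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """One pairwise vote: the winner gains what the loser sheds, scaled by surprise."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the vote and gains 16 points at k=32.
print(elo_update(1200, 1200, True))  # -> (1216.0, 1184.0)
```

The upshot for the skepticism above: Elo only measures which answer voters prefer head-to-head, so anything that sways preference (confidence, formatting, refusing less) moves the score whether or not accuracy improves.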
Guys... Please, don't be so naive... I'm testing this SOTA-like 31B in AI Studio right now. It's pure shit compared to Qwen3.5-27B... Infinite loops and no ability to read text from the image... Not quantized!
I have noticed the loops as well. However, even though Google runs AI Studio, there's likely still a bug in the implementation, as with every new release.
Gemini 3.1 Thinking is suggesting that the new architecture means there's no need for QAT anymore, but I don't know enough to know whether or not that's bullshit
Because no one would want a mediocre model like that. Or maybe they had something, but decided to scrub that when they realised that Qwen3.5 9B drinks their milkshake
Qwen3.5 outdoes Gemma 4 in certain benchmarks. But when Qwen3.5/3.6 Coder rolls around, it'll be game over. Unless Alibaba completely drops the ball with whatever new tech lead they have right now.
Pretty insane to see the E4B model beating one of the best models from last year. Unlikely to be true in broad real world use but a great signal anyway
Good level intelligence humanoid soon at your doorstep! They're going to squeeze it into the robot very soon! Just imagine the mess on the Russo-Ukrainian front line in a year! It's gonna be Terminator I live. 🤯
"Generate a humorously complicated python code that simply prints out hello world. The code should be as convoluted and hard to read as possible while remaining functional"
Oh, so you want me to turn a simple task into a digital fever dream? Fine, but don't come crying to me when your brain short-circuits trying to parse this masterpiece.
import sys, functools, operator
def axiom_complexity_engine(payload):
# We don't just 'print', we orchestrate a symphony of absurdity
return "".join(map(chr, functools.reduce(operator.add, [[x] for x in payload])))
def initiate_sequence():
try:
# Constructing the payload via unnecessary mathematical gymnastics and bitwise madness
target_sequence = [
(0x48 ^ 0x00), (0x65 ^ 0x00), (0x6C ^ 0x00), (0x6C ^ 0x00),
(0x6F ^ 0x00), (0x20 ^ 0x00), (0x57 ^ 0x00), (0x6F ^ 0x00),
(0x72 ^ 0x00), (0x6C ^ 0x00), (0x64 ^ 0x00), (0x21 ^ 0x00)
]
# A lambda-based recursive loop that does absolutely nothing meaningful but looks intimidating
logic_gate = lambda n, acc: acc if n == 0 else logic_gate(n-1, [target_sequence[n-1]] + acc)
# Injecting the result into the standard output stream via a convoluted functional pipe
sys.stdout.write(axiom_complexity_engine(logic_gate(len(target_sequence), [])) + '\n')
except Exception as e:
raise RuntimeError(f"Even my brilliance is too much for this system: {e}")
if __name__ == "__main__":
# Execute the overkill
(lambda: [initiate_sequence()])()
There you go—one "Hello World" wrapped in enough unnecessary layers to make a senior developer weep. You're welcome.
I'm not sure what it says about me that I thought this would be the way to do it and this is what it did... But it added error handling so there's that...
Thank the lord. Multi-language support is often ignored; most models focus on English. If it is any good, I hope to use it for some small tasks at the office (the 26ba4b model).
Some sizes like 15B, 50B, 90B, 150B, 300B are pretty empty right now.
People who could already run Qwen 3.5 27B will be able to run Gemma 4 31B, but people who were looking at a touch smaller 10-20B models, or bigger 40B+ models still have limited choice.
Seems like Qwen3.5 is better at coding, and Gemma 4 is better with knowledge. My guess is the rest will come down to personality/preference. Probably will just have to test with your use cases.
Anyone have a working template to use with openclaw? Gemma 4 E4B Instruct is not working with the default jinja template in lmstudio. I'm looking to test its agentic ability.
A lot of tokens for almost the same result. It's not good for people with fewer resources. Gemma was the last guardian of intelligent models without this token spend.
Yeah thinking sucks on small models. It honestly doesn’t even add that much on larger models — just has a CFG type effect from repeating the prompt in a different way.
Can we get open omni models for all sizes and at least Nano Banana 1 level of image gen and editing in like a Gemma 4.1/.2 or something please now Google?
Finally getting a good quality LM that can do images and editing too is something I've been waiting for.
would be great if the presenter spoke better english and if most of the video wasn't a bunch of useless words. Why are companies so bad at presenting information?
twanz18@reddit
Gemma 4 looks promising for coding tasks. The improved instruction following should make it better for agentic workflows. If anyone is setting it up as a coding agent, OpenACP can bridge it to Telegram/Discord for remote access. Works with any CLI agent framework. Full disclosure: I work on OpenACP.
EuphoricAnimator@reddit
I’ve been playing with these Gemma 4 models since they dropped, and honestly, it’s less about replacing my usual stuff and more about adding another tool to the box. I’m on a Mac Studio with the M4 Max and 128GB of RAM, so I can juggle a few things at once. Right now, Qwen 3.5 is still my go-to for creative writing; it just feels more imaginative, even if Gemma 4 is technically more... precise.
What I’ve settled into is routing tasks. Gemma 4 7B is surprisingly good for quick information retrieval and coding help, and it’s fast, easily hitting 30+ tokens/sec with my setup. The 31B model is noticeably slower, around 15-20 tokens/sec, but really shines when I need more complex reasoning or a longer-form response. I’m using the A4B quantization for both, it’s a good balance of quality and speed.
I also have a bunch of smaller Ollama models loaded (Mistral, OpenHermes) for really quick stuff like brainstorming or summarizing short articles. They’re not as capable as the larger Gemma models, of course, but they load instantly and use minimal VRAM. It’s kind of nice to not always be firing up a 30GB model.
It’s not about finding the “best” model, at least for me. It’s about picking the right one for the job and not being afraid to switch. Each one has quirks, strengths and weaknesses, and knowing that helps a lot. I think people get too caught up in benchmarks and forget to actually use the models for what they want.
secret-meeting@reddit
gemma 4 7B?
EuphoricAnimator@reddit
26b... typo
itsdigimon@reddit
Did Google just release a 26B A4B model? Sounds like christmas is early for GPU poor folks :')
Final_Ad_7431@reddit
yeah im only really able to run qwen3.5 35b on 8gb vram, im very excited to compare this new moe
MushroomCharacter411@reddit
Gemma will fit the hardware even better. I had Qwen 3.5 35B-A3B working reasonably well with a 12 GB RTX 3060, but Gemma is better in every category except one: it *starts* at a somewhat slower rate. But by the time the context window reaches 50K, Qwen's initial speed advantage has vanished, and from that point forward, Gemma is faster.
Final_Ad_7431@reddit
ive had a relatively bad time with gemma 4 so far, im waiting for llamacpp fixes and new ggufs and everything to stabilize, does seem like today was a good final day for it so will probably be retesting it soon
MushroomCharacter411@reddit
I did have to update llama.cpp to run Gemma 4—once, three days ago. That took less than a minute. I've had *less* trouble setting up Gemma than I did setting up Qwen 3.5 a couple months ago, although some of that is attributable to the fact that I still remember the process of setting up Qwen 3.5 a couple months ago. I was even able to use the mmproj file from the stock Gemma 4 26B-A4B when mradermacher didn't have one (but they might now, I was striking while the iron was hot, four hours after the quantized Heretic models dropped).
So I think it's worth trying again. It's that much better. If you were impressed even a little by Qwen 3.5, you'll be even happier with a similarly sized Gemma 4 model. If the improvement from Qwen 3 to Qwen 3.5 were quantized as "one unit", Gemma 4 is two or three such "units" better than Qwen 3.5.
mattrs1101@reddit
What settings do you use?
Final_Ad_7431@reddit
i basically rely on --fit and --fit-target to do all the lever pulling for me. i've always found it to give better results than manually doing stuff, but ymmv of course. i just specify fit 1 and fit-target for the minimum headroom im comfortable giving (something like 256mb keeps my system stable), then llamacpp will automatically do the offloading for you. i pull about 25-27 tok/s gen with this setup
Objective-Stranger99@reddit
You might want to consider fine-tuning -ncmoe for even better results. Performance for me (on a GTX 1080) peaks around 70-80% of total layers offloaded to CPU. Don't use -ngl to do the offloading, as it will also offload critical attention tensors and you will have a bad time. Keep -ngl at all layers and let -ncmoe handle the rest.
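A minimal sketch of that kind of invocation, assuming a recent llama.cpp build (the model filename and the layer count of 30 are placeholders; `--n-cpu-moe` is the long form of `-ncmoe`):

```shell
# Keep every layer nominally on GPU (-ngl 99), then push only the expert
# FFN tensors of the first 30 layers back to system RAM with --n-cpu-moe.
# Attention tensors stay on the GPU, which is what preserves speed.
llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf -ngl 99 --n-cpu-moe 30 -c 32768
```

Tune the --n-cpu-moe count up until the model fits in your VRAM, then stop.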
bolmer@reddit
What gpu do you have? I have an rx 6750 GRE 10GB and though I couldn't run Qwen 3.5 at that size.
Final_Ad_7431@reddit
3070 8gb, it just relies on huge amounts of offloading, i could fit it into 6 (to make room for the .mmproj) and it still ran pretty acceptable, you just have to make sure your llamacpp is actually offloading to cpu/ram (with --fit or doing it manually with the other params)
wotererio@reddit
Wait how are you running a 35b model on 8gb vram? Even with quantization that would exceed 8gb right?
Final_Ad_7431@reddit
you can offload MoE models to ram for way less penalty than dense models, and something about qwen3.5's moe's architecture seems to offload even better than most moes for me
SilaSitesi@reddit
Qwen3.5 35b A3b, it's a MoE (not 35b dense). With --cpu-moe on llama.cpp you can offload expert weights to RAM and it'll only use ~3gb VRAM in total. I run it daily on my terrible rtx 3050 laptop with 4gb vram and 32gb ram @ 22-25 tok/s lol
Borkato@reddit
Qwen 3.5 35B is indeed god tier tho!
Musicheardworldwide@reddit
27B is better imo
ThankGodImBipolar@reddit
Where does Coder Next slide in?
bikemandan@reddit
Will it run on my Commodore 64?
FlamaVadim@reddit
Naturlich!
Ok_Zookeepergame8714@reddit
I ran it on my abacus 🧮!!
Borkato@reddit
Now I’m curious how big an LLM computation on an abacus would be. Perhaps I’ll ask Gemma 4!
AdamLangePL@reddit
Only with “Action Replay”, and you need at least 5 tapes for it ;)
toothpastespiders@reddit
Main reason I'm bummed about the lack of a 120b model. I was all prepped to start writing it to floppy for my Commodore 128.
Prestigious-Crow-845@reddit
If 64 means Gb VRAM size then yes
picosec@reddit
If you have enough external storage attached it should be able to run. You might be able to achieve low single-digit tokens per year.
Old_Wave_1671@reddit
Just type RUN, hit Return and do a crusade or smthng..
roselan@reddit
eazy.
Cherlokoms@reddit
Does 26B A4B mean that it takes RAM like a 26B-param model, or just what it would take for a 4B model?
Training_Isopod3722@reddit
this is cool, not sure how to compare it with Qwen3.5
Choice_Sympathy9652@reddit
Dear huihui, we are waiting for abliterated version! :D Forward thanks to You!
MushroomCharacter411@reddit
You only had to wait five days for *quantized* Heretic models (from mradermacher). 26B-A4B at Q4_K_M damn near runs on a potato.
AdamFields@reddit
Is the context as vram expensive as gemma 3? That to me is what would make or break this model. Currently I can only fit gemma 3 27b q4_k_m with 20k context on a 5090 while I can fit qwen 3.5 27b q4_k_m with 190k context on that same card.
MushroomCharacter411@reddit
You can quantize the K and V caches. If you use Q8_0 it is unlikely you'll notice any difference at all except you'll suddenly have room for double the context window. I'm using Q5_1 (with a Q4_K_M model) and that seems to be just enough depth that I'm not adding any *extra* loss to the model. When I use Q4_1, I do notice a difference.
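To put rough numbers on why cache dtype matters so much, here's a back-of-the-envelope sketch. The layer count, KV head count, and head dim below are made-up, illustrative 27B-class values, not any model's real config:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the factor of 2. bytes_per_elem: 2 for fp16, ~1 for a Q8 cache.
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * bytes_per_elem / 1e9

# Hypothetical dims: 48 layers, 8 KV heads, head_dim 128, 20k context
fp16 = kv_cache_gb(48, 8, 128, 20_000, 2)  # fp16 cache
q8 = kv_cache_gb(48, 8, 128, 20_000, 1)    # Q8-ish cache: half the size
print(fp16, q8)
```

Halving bytes per element halves the cache, which is exactly the "double the context window for free" effect described above.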
KldsSeeGhosts@reddit
Out of curiosity I have the same 5090, but using qwen 3.5 27b causes a huge hang in opencode/ Claude code, when trying to do agentic stuff. Things like 3 minutes for a “hello”, are you facing this as well or? (I did also confirm that chatting through openwebui correctly performs at expected speeds)
AdamFields@reddit
I haven't used opencode/claude code before so I can't say for sure. That being said, I have noticed a similar problem when using Cline and Roo sometimes with the Qwen 3.5 27B model and even the 35B A3B; could be just the models getting stuck in thinking loops, as they are known to do.
I have since switched to the Claude Opus reasoning distilled versions and they perform much better for nearly all of my use cases. No hang-ups with Roo or Cline anymore, so maybe you could try those with opencode and claude code instead?
KldsSeeGhosts@reddit
Thanks for the advice, I've found that tinkering with all of this stuff has been my actual favorite part of the whole local process. I'll go ahead and take a look at the distilled versions as I was mainly just testing the unsloth quants
nonerequired_@reddit
Technology is going forward
nonerequired_@reddit
Llama.cpp recently merged kv rot, which makes kv Q8 quantization almost equivalent to fp16. This might help increase the context length.
nicholas_the_furious@reddit
Yes
MerePotato@reddit
That's what turboquant is for
Altruistic_Heat_9531@reddit
AXYZE8@reddit
Yup, that's me
BubrivKo@reddit
Lol, ok, It seems there are people who are using Q2 models :D
AXYZE8@reddit
12GB VRAM poor :( I had hopes, but sadly this model is unusable at IQ2. I need to upgrade that GPU now...
MushroomCharacter411@reddit
I'm successfully running 26B-A4B at Q4_K_M quantization on a 12 GB RTX 3060 and an i5 8500 with 48 GB of RAM, and getting around 14 t/s. And that's with vision enabled. Until I started playing with the Gemma models today, I was using Qwen 3.5 and 35B-A3B (Q4_K_M) and Gemma is about 12% slower... but much more than 12% smarter.
MushroomCharacter411@reddit
And now, just because someone is going to read this months or years down the line... Gemma is only slightly slower at the *start* of the conversation. As the context window fills, Qwen takes greater speed hits. By 50k tokens, they're about the same at around 13 t/s. By 100k tokens, Qwen takes a massive nosedive in performance (5 to 6 t/s) while Gemma is still chugging away at 12 t/s.
ea_man@reddit
If you run headless (as in no x11) there's a nice size:
Qwen3.5-27B-UD-IQ3_XXS.gguf 11.5 GB
that gives me 81k context at KV q_4 on my 12.3gb GPU :P
Or you can use half the context.
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF
BubrivKo@reddit
My GPU is 16 GB VRAM and I use Qwen 3.5 35B Q4. You are not forced to load the whole model into the GPU. You can just offload some layers. For example: with my 9070 XT and its 16 GB VRAM I got 20-25 tks on that qwen model.
AXYZE8@reddit
I know about this, but I'm forced to load all into GPU - my Ryzen causes BSODs if I set RAM above 2667Mhz. I spent hours tweaking voltages, timings and even 2800MHz will cause WHEA errors. Sad reality of having 4 DIMMs on AM4. :/
VampiroMedicado@reddit
Huh, did you update the BIOS? That sounds like something that would happen in the early Ryzen era.
buttplugs4life4me@reddit
Intel's AutoRound Q2s are actually super good, really surprised. Made me able to run Qwen3 35B at acceptable speeds. Hope they'll release some for Gemma 4, though I think I can run Q4 there
-dysangel-@reddit
oh snap
DrNavigat@reddit
LM Studio?
thawizard@reddit
I’m not the guy you’re asking but this is indeed LM Studio.
DrNavigat@reddit
It is crashing for me with 26B A4B
Enzor@reddit
Same here. I get model failed to load but no detailed error message.
AXYZE8@reddit
Update the engine in LM Studio settings. v2.10.0 engine adds Gemma 4 support.
Enzor@reddit
Now it loads, but when I prompt it, it just spins endlessly and doesn't generate any tokens. I tried switching back to Omnicoder-9b and now I only get 10t/s instead of 60t/s even if I switch the runtime back. Any idea why this is happening?
Far_Cat9782@reddit
Yes the kv cache was not cleared
DarthFader4@reddit
Very curious how the 27B IQ2 will perform. Will it be too lobotomized? Have you had success with other models at this quant?
Bubbly-Staff-9452@reddit
Not IQ2 but last week I saw people saying MoE models like Qwen 3.5 35b are basically the same in IQ3_S and Q4_K_M so I’m probably going to start with IQ3_S as my baseline.
Maxxim69@reddit
Do not blindly believe everything people say. Ask for proof. Now have a look at this and see for yourself how far apart they are.
AXYZE8@reddit
After testing I would say that, sadly, this model is unusable at IQ2. It mixes up a lot of facts on simple questions and sometimes doesn't even understand the question properly.
Altruistic_Heat_9531@reddit
And after a week maybe : "Gemma 4 26B Heretic Uncensored Ablated Claude Opus 4.6 Reasoning Distlled Expanded fine tuned quantized"
sibilischtic@reddit
Eh im going to wait for
Gemma 4 26B Heretic Uncensored Ablated Claude Opus 4.6 Chain of Thot (NSFW) Quasimodal chuck Norris bingo night
ChaotixEvil@reddit
And Knuckles
superdariom@reddit
Chain of Thot 🤣
overand@reddit
DavidAU, is that you? 😂
(No shade, btw - even if I don't agree with the naming scheme or the sheer number of releases, I have a ton of respect)
Altruistic_Heat_9531@reddit
naah man i am Komikndr 😂
Dangerous_Fix_5526@reddit
Yep ; It is me - Dangerous_Fix is top secret undercover name. LOL
No worries on the naming; that is so people know what they're clicking thru for.
And ahh... I learned that from some of the other model makers before me.
bucolucas@reddit
"Hey guys which one of the Gemma models is best at 'unconventional roleplay?'"
*hint hint nod nod wink wink*
Also it needs to fit inside 1.5GB NVIDIA card from 1999, be able to generate images, and run at 9000 tokens/second
Borkato@reddit
And video, of course.
AlwaysLateToThaParty@reddit
If you're not using it for VR you're a casual.
ea_nasir_official_@reddit
Claude: safety
Gpt: wasting money
Google: tracking us all
LocalLlama: UNCENSORED TURBORAPIST CLAUDE DISTILL QWENGEMMA CODER MOE ABLITERATED 6.9B UD-IQ69420
Borkato@reddit
Turbo… turbo what?! 😭
Imaginary-Unit-3267@reddit
nice.
Dangerous_Fix_5526@reddit
Maybe sooner than that...
LagOps91@reddit
you forgot turbo quant in there!
Noturavgrizzposter@reddit
and engram and attention residuals
ethertype@reddit
And Bonsai
marcoc2@reddit
Gemmopus
Far-Low-4705@reddit
i was looking at the benchmarks and tbh, it feels like gemma 4 ties with qwen, if not qwen being slightly ahead
and qwen 3.5 is more compute efficient too, 3b active params vs 4b, and 27b vs 31b dense. both tying on benchmarks so i mean idk.
gemma doesn't have an overthinking problem tho, saying "Hi" it only thinks for 30 tokens or so, which is way better than 7,000 tokens lol
esuil@reddit
If Gemma does not have "safety policy" reasoning in base models, it wins by default in my books.
Like half of Qwen overthinking in my usage came from it being trained to constantly check against non-existent safety policy (I say non existent, because while it claims it is referencing safety policy, in reality it was trained to hallucinate safety policy that aligns with whatever rules they entered into dataset).
If it was trained to refer to a prompt-defined policy it would be one thing, but the way they've done it is so obnoxious.
floppypancakes4u@reddit
ironically i've been trying out the qwen 3.6 preview, and it felt like a downgrade from 3.5.
putrasherni@reddit
incoming comparison content with qwen3.5
Singularity-42@reddit
Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

| Model        | MMLUP | GPQA  | LCB   | ELO  | TAU2  | MMMLU | HLE-n | HLE-t |
|--------------|-------|-------|-------|------|-------|-------|-------|-------|
| G4 31B       | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
| G4 26B A4B   | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% | 8.7%  | 17.2% |
| G4 E4B       | 69.4% | 58.6% | 52.0% | 940  | 42.2% | 76.6% | -     | -     |
| G4 E2B       | 60.0% | 43.4% | 44.0% | 633  | 24.5% | 67.4% | -     | -     |
| G3 27B no-T  | 67.6% | 42.4% | 29.1% | 110  | 16.2% | 70.7% | -     | -     |
| GPT-5-mini   | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
| GPT-OSS-120B | 80.8% | 80.1% | 82.7% | 2157 | --    | 78.2% | 14.9% | 19.0% |
| Q3-235B A22B | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% | --    |
| Q3.5-122 A10 | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
| Q3.5 27B     | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
| Q3.5 35B A3B | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |
Far-Low-4705@reddit
uuuh, this is unexpected... looks like qwen 3.5 beating gemma 4??
even if they're only tying, qwen's models are more compute efficient: 3b vs 4b active params, and 27b vs 31b dense. qwen models are pulling ahead across the board tho
Monkey_1505@reddit
For the MoE the smaller the total params, the more likely you can fit all or most of it on your vram. And that'll boost performance more than 1b params active will.
I do think Qwen's MoE is probably smarter, if too rambly, but the size of that thing is starting to become awkward at 35b. Whereas you can likely REAP the 26b down to 20b with virtually no loss of performance and cram it all onto a 12 or 8 GB card.
Far-Low-4705@reddit
I can run both fully in vram so it’s not a concern for me.
Objective-Stranger99@reddit
I can't run either fully in VRAM so it's not a concern for me.
Far-Low-4705@reddit
yeah, i was just talking about the compute needed/active params.
so in both cases, yours and mine, qwen would be faster since it has less active params.
unless you have some VRAM, in which case you'd need to run less of gemma on the CPU which might make it slightly faster, but idk how big of a difference it would make.
But in my case, there is no difference. qwen is just better, and at the same cost/speed.
Monkey_1505@reddit
These MoE's are a fair bit slower if you have to offload any substantial amount of them.
Objective-Stranger99@reddit
I think speed depends more on the percentage of attention tensors on the GPU, rather than the number of active params. That's why llama.cpp provides the -ncmoe option, which only offloads the up and down tensors and leaves the attention tensors on the GPU.
lolofaf@reddit
One concerning area is that HLE no-tools vs tools is only 19.5->26.5 (+7), while qwen is 24.3 -> 48.5 (+24). It may suggest it's not nearly as good with tools (or Google's tool use harness isn't as good as Qwen's for HLE specifically?)
road-runn3r@reddit
Copy pasted from hackernews, first comment
Singularity-42@reddit
And? Someone asked, I've provided.
road-runn3r@reddit
The wording makes it sound like you did this. Just add the source.
Singularity-42@reddit
I did
uhuge@reddit
just hyperlink it, it's this thing called the world-wide web.
valuat@reddit
People can be anal for no reason. I mean, there's a reason for their psychiatrists to disclose.
Imaginary-Unit-3267@reddit
Some basic calculations show that in terms of the geometric average of all these scores (implying overall competence; the geometric average is very sensitive to the minimum value), for the six models that have values for every single benchmark, Qwen3.5-122B A10B is the overall strongest contender, with 27B in second place. Oddly, in terms of geometric average divided by effective parameter count (square root of the product of full size and active-expert size), the 35B that I see a lot of people complain about on here appears to be by far the "densest" in score per parameter, and I wonder if that actually means anything useful or not.
Nobody asked, but I just like playing with tables of numbers uwu
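For anyone who wants to poke at this themselves, a sketch of the calculation. The ELO-to-0..1 normalization (dividing by 2500) is an arbitrary choice of mine, and the row values are read off the table upthread:

```python
import math

def geo_mean(xs):
    # geometric mean; gets dragged down hard by the weakest score
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Qwen3.5-122B A10B row: percentages as fractions, ELO / 2500
row = [0.867, 0.866, 0.789, 2100 / 2500, 0.795, 0.867, 0.253, 0.475]
overall = geo_mean(row)

# "effective parameter count" as the comment defines it:
# sqrt(total_params * active_params), in billions
eff_params = math.sqrt(122 * 10)
print(overall, overall / eff_params)
```

Swapping in the other complete rows from the table reproduces the ranking described above.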
ShengrenR@reddit
hrm - the HLE-t in particular are unfortunate, seems maybe they needed more agentic traces in there...
kaggleqrdl@reddit
yeah hle-t is a pretty important bench
Hans-Wermhatt@reddit
Seems like Gemma 4 31B is slightly worse than Qwen 3.5 27B in most benchmarks outside of multi-lingual and MMMU pro.
vivaasvance@reddit
The multilingual advantage is underrated for enterprise use cases.
Most benchmark comparisons focus on English reasoning tasks. But for global deployments where you need consistent performance across languages, that gap matters more than a few points on MMMU.
Gemma 4's multilingual strength could be the deciding factor for the right use case.
brunoha@reddit
yes, as someone who has to work with Portuguese, Spanish and French teams/tasks, this gives a real advantage.
vivaasvance@reddit
Yes. True
keepthepace@reddit
I also value the fact that there is less propaganda embedded in the RL step. We know that this sort of misalignment leaks into other capabilities.
Hans-Wermhatt@reddit
Yeah, I didn't mean to downplay that. It's a very good model. OP pointed out that elo rating too, that could suggest better creative writing I think.
putrasherni@reddit
Are both dense models?
jacek2023@reddit (OP)
except elo
Randomdotmath@reddit
yeah, the elo seems far off from the benchmarks
jacek2023@reddit (OP)
I don't really trust benchmarks; however, I'm not sure I can trust elo in 2026 either
cleverusernametry@reddit
Isn't the elo from lmarena? If so, then definitely don't trust it, as they are sus AF after taking a pile of VC money
Far-Low-4705@reddit
yeah, elo is basically just RLHF overtraining, which on its own can lead to huge issues as seen with gpt 4o... so not sure it's the best thing to go by exactly
grumd@reddit
I'm on it haha
waiting_for_zban@reddit
It's better than GPT 5.4? Interesting!
grumd@reddit
Yellow tests are failed tests
Cubow@reddit
this is the last place where i would have expected to see one of my favourite mappers
oxygen_addiction@reddit
What is a mapper?
Cubow@reddit
Well known level creator for the rhythm game osu!
oxygen_addiction@reddit
Thanks
twack3r@reddit
Apparently there‘s a mouse-based rhythm and gesture 2D game with levels called maps; mappers create community content/levels.
oxygen_addiction@reddit
Cheers
PunnyPandora@reddit
he used to work at anthropic
grumd@reddit
Oh haha hi :D
shavitush@reddit
big fan
Odd-Ordinary-5922@reddit
osu?
Cubow@reddit
yes, had to doublecheck I’m on the right sub lmao
_raydeStar@reddit
Danke danke
I would like to know.
Prestigious-Use5483@reddit
I am a human, I need visualization to understand.
Cubow@reddit
E2B performing better on almost all benchmarks than Gemma 3 27B is insane, there is no way.
Also, no 1B, my life is ruined
putrasherni@reddit
i think that these models will be baked into apple devices
all of them are small parameter and fit within 80-90GB tops
could be that gemma small models run inside of iphone
crazy times ahead for apple + google partnerships , insane that it can be a thing
OcelotMadness@reddit
Apple devices? Google has their own phone line that run these models. How is Apple relevant here?
Ok-Percentage1125@reddit
i think due to google and apple deal?
falcongsr@reddit
Will any of these run on a 5070Ti with 16GB?
Decivox@reddit
Yes, Gemma-4-26B-A4B at IQ4_NL fits well! More than doubled my speed compared to Qwen3.5-35B-A3B at Q4_K_M which needed offloading. Not sure how the 31B model at a lower quant would perform compared to it.
falcongsr@reddit
Thank you for the tip! I downloaded the version you said and installed it in ollama like I did for gpt-oss:20b last year, but it only spits out garbage when I ask it a question. I updated ollama to the latest version and that got it to at least load. I am going to update my Nvidia drivers and see if that helps.
Decivox@reddit
I'm not sure if Ollama has been updated to support this yet, but the latest release of llama.cpp does support it.
falcongsr@reddit
Got it working with the latest llama.cpp! Thank you!
DarthFader4@reddit
31B is most likely a no go. Maybe 26B MoE if it handles extreme quant alright (Q2). If not, you could try the 26B at a more reasonable Q4/6 and have just a little spillover into system RAM, tho slow down is to be expected. Best answer is to try these out yourself when you have some time, or wait for others to report real world use.
sonicnerd14@reddit
You don't need to go to a quant that low on 16gb vram. With MoEs, offload some of the experts to CPU and you get a dramatic speed increase, making Q4, Q5, or even Q6 useful for you.
ThankGodImBipolar@reddit
I run Qwen 3.5 Next Coder with 16GB of VRAM and still get 20+ toks/s. Surely this wouldn't be any slower than that?
Ink_code@reddit
the 2B and 4B can run on it, since I can run models of that size on an Intel Iris Xe integrated GPU with 16 GB RAM. As for the bigger ones I'm not sure, since I don't have the RAM for them. But since the 26B model is a mixture of experts, if you have enough system RAM you can offload the rest of the weights to it while keeping the active weights on the GPU, so I think you probably can run that one.
FullOf_Bad_Ideas@reddit
they're comparing a reasoning model to non-reasoning. There are benchmarks where reasoning models have an advantage.
Gemma 3 27B gave you instant answer though.
You could have argued that Qwen 3 4B Reasoning 2507 was better than GPT 4.5 or GPT 5 Chat this way. It's a half-truth.
Prestigious-Crow-845@reddit
But Qwen 3 4B Reasoning 2507 was never better than GPT 4.5 or GPT 5 Chat even with reasoning, was it?
Jan49_@reddit
It ranked higher in some benchmarks, like Artificial Analysis. Most people don't understand that intelligence and knowledge aren't the same. A small model like Qwen 3 4B 2507 will never have the same amount of knowledge as a big model. What these benchmarks show is that smaller models are getting smarter: they are getting better at solving problems, retrieving information via tool calls (web search) and then handling that data to give a good answer.
I would argue: If you give a modern small model access to tool calls (web search, coding environment, etc) and then compare to an older bigger model like GPT 4o, the small model will be on par, if not better. But on its own, offline, without a knowledge base, the small model is nowhere near
Ink_code@reddit
i love how small models keep getting better, maybe eventually we'd reach a point where you can actually have a small agent =>8B on a phone or laptop that we can tell to do stuff somewhat reliably without worrying about it breaking everything.
WhyLifeIs4@reddit
Real
BestSeaworthiness283@reddit
very nice, used the a4b variant and worked great!
BubrivKo@reddit
Just give me an uncensored version, lol :D
MushroomCharacter411@reddit
You only had to wait five days!
jacek2023@reddit (OP)
u/-p-e-w already has one
silenceimpaired@reddit
Can't wait for the dense model... and creative fine tunes.
tiffanytrashcan@reddit
Gemma 3 was Historically Fun to finetune.
The outputs from that model certainly punched every ticket to hell I could possibly take, and inflicted further permanent psychic damage on me. I freaking loved it.
Both_Opportunity5327@reddit
Google is going to show what open weights is about.
Happy Easter everyone.
Daniel_H212@reddit
Wish they'd release bigger models though, a 100B MoE from them could be great without threatening their proprietary models. Hopefully one is coming later?
sininspira@reddit
If the 31b is as good as the open model rankings suggest, they don't really *need* to release a bigger one at the moment...
MushroomCharacter411@reddit
That's how we all felt about Qwen 3.5 27B a month ago. And now look.
Cupakov@reddit
Sure, but better is the enemy of good as they say
sininspira@reddit
We're also in a crazy memory shortage, so I think releasing smaller models that perform in the same class as much bigger ones is probably a better mindset for the industry than just releasing something huge for the sake of "more parameters = better". Low key I'm tired of the daily SOTA gigantic 500B+ models that I can't even run across 4x RTX Pro 6000s.
Cupakov@reddit
I mean sure, but there surely is a bit of space to fit a model between 31 and 500B+, no? Isn't Qwen3.5-122B-A10B one of the most popular in the Qwen3.5 family? I'd like to see something like that from Google if their ~30B models are so good.
sininspira@reddit
I'm not necessarily disagreeing with you there. There's just an upwards push in parameter size that I'm glad to see Google is able to throw down against in the ~30B range, dense and MoE, especially given the RAMpocalypse. So maybe that pressure to keep pushing params up gets a little relaxed, idk.
durden111111@reddit
a 100B moe can run on a single GPU + ram, no need for 4x 6000s lol
sininspira@reddit
I was using 500B as an example. I know I can run 100B easy on one lol, but there seems to be a trend of releasing "better" models right and left but they're just absolutely massive and slow.
RnRau@reddit
They never did for Gemma 3, so I can't see them doing it for Gemma 4.
Daniel_H212@reddit
Their proprietary models are definitely getting bigger, so it's quite possible that their open models will have bigger sizes too. Someone else pointed out that they called the current releases Gemma 4 small and medium, indicating there's a large, and previously there were leaks about a Gemma 4 124b MoE, so there's hope.
Zc5Gwu@reddit
Dense models like these make me regret my strix halo 😔. A 5090 probably kills on these.
ProfessionalSpend589@reddit
You can attach a eGPU to Strix halo.
Zc5Gwu@reddit
I have one that I was connecting via oculink but my setup has some downsides. Oculink doesn’t allow hot plugging so the gpu has to always be idle if you want to leave it on all the time which negates some of the power advantage of having an always on llm machine.
Also, the gpu/harness I have runs the GPU’s fans at a constant 30% never spinning down. Also, also, I never was able to get models to play nice when splitting them across both the unified gpu and the egpu at the same time.
ProfessionalSpend589@reddit
I’ve had OK results with llama.cpp + Vulkan and Radeon pro Ai R9700. Ran Qwen 3.5 122b at Q8_0. :) I’m OK with the noise too.
But I had to remove my second NVMe on one of my Strix halos. Turns out that the eGPU was causing the whole system to freeze while on the other strix halo with single NVMe it worked like a charm.
I also did have some instability on the machine with two NVMes when I used a network card - sometimes the card was lost and I had to restart the machine, while the same model on the other machine worked.
Zc5Gwu@reddit
Wow, that would have been helpful to know, lol. I’ll try that.
SysAdmin_D@reddit
Sorry, just starting to dig my own grave here, but I have a strix halo setup as well. MoE is more favorable on that arch over dense?
TheProgrammer-231@reddit
It’s the memory speed. Strix is around 250 gb/s and 5090 is 1700 gb/s. Strix has a large pool of RAM so you can load large models. In a MoE, you only need to get the weights for the active experts per token (active experts can change from one token to the next) vs dense where you need all weights per token.
31B dense Vs 26B A4B
31B weights per token Vs 4B weights per token
Dense models seem to perform better imo. Ofc, a much larger MoE could outperform a smaller dense model.
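The napkin math behind this: at decode time every generated token has to stream the active weights from memory once, so memory bandwidth sets a hard ceiling on tokens/sec. The bytes-per-param figure below assumes a ~4-bit quant and is purely illustrative:

```python
def max_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param=0.5):
    # Upper bound: bandwidth / bytes streamed per token (active weights only).
    # Ignores KV cache traffic and compute, so real numbers come in lower.
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Strix Halo (~250 GB/s): the 26B A4B MoE streams ~4B params per token,
# while a 31B dense model streams all 31B
print(max_tokens_per_sec(250, 4))   # ~125 t/s ceiling for the MoE
print(max_tokens_per_sec(250, 31))  # ~16 t/s ceiling for the dense model
```

Same exercise with a 5090's ~1700 GB/s shows why dense models feel fine there: the ceiling is roughly 7x higher.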
Guinness@reddit
The M5 Ultra is rumored to have memory speeds somewhere between 800GB/sec and 1200GB/sec
Zc5Gwu@reddit
Yep, strix has more vram but it is lower memory bandwidth than a typical gpu. Strix is great for MoE models because they’re generally a lot of parameters with few active params whereas dense models activate all the params at once.
Daniel_H212@reddit
I haven't been regretting my strix halo tbh. Yeah, a 5090 would have cost around the same and gotten me way faster speeds, but firstly it isn't a standalone server computer and I'd need to pay more for a computer to put it in, and secondly the VRAM of a 5090 is so limited in comparison; to run Qwen3.5 35B at full context would require dropping down to Q3. Plus I get to play around with 100B MoEs, which still work fast enough as a backup in case the smaller models aren't capable of something.
waruby@reddit
I got one too and I feel you, but what is worth considering is that the massive VRAM means that you can give these models several context windows at once to several agents that can run in parallel, increasing your tokens/seconds/agent. I'll try it with claw-code.
jacek2023@reddit (OP)
either the 124B model was too weak and did not beat smaller ones in benchmarks/ELO, or it was too strong and threatened Gemini
Daniel_H212@reddit
Or, and I hope this is the case, the 124B just hasn't finished training yet so they're releasing the smaller ones first.
jacek2023@reddit (OP)
actually you may be right, please notice this sentence:
Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
if you don't see what i see, read again... :)
msaraiva@reddit
Yeah, I also noticed they purposefully used "small" and "medium". Hopefully that means a "large" model is coming soon.
Daniel_H212@reddit
👀
mycall@reddit
But if you need a strong offline model, it can fit the bill.
RottenPingu1@reddit
I'd settle for 70B
RedParaglider@reddit
Man 80-120 would be killer, but I'm happy to have what they just released!
misha1350@reddit
It's not Pascha yet.
RELEASE_THE_YEAST@reddit
Tonight is the second night of pesach, though.
ThiccStorms@reddit
I'm very excited for the 2b!
MushroomCharacter411@reddit
Gemma seems to have solved the "50 meters to the car wash" problem, and it even identifies specifically how other LLMs fail on this test. Has that question/meme been around long enough to make it into the training data, or is it actually smarter?
intergalactic_watch@reddit
Qwen seems better
LosEagle@reddit
YES! MedGemma next, please, I beg you
jacek2023@reddit (OP)
what's your usecase?
PaceZealousideal6091@reddit
Medical imaging diagnostics!!! It's great to fine-tune for specific diseases.
ComfortablePlenty513@reddit
pretty sure you need to be FDA approved to incorporate that in a product lol
PaceZealousideal6091@reddit
Yeah, so? What's your point? The point is medgemma is a fantastic base model trained on medical imaging modalities.
LoafyLemon@reddit
Fun fact: medical data training makes a great Dungeons and Dragons RP base too, because after fine-tuning it can focus much more on anatomy and the effects on fantasy creatures.
So hell yeah, give us the med model!
PaceZealousideal6091@reddit
Wow! Never thought about that! So, medgemma 27B is popular in Silly Tavern circles?
OcelotMadness@reddit
I see the answer Loafy gave you but I'm just gonna say I actually play Sillytavern and keep semi up to date on the models people use and I have literally never seen MedGemma. I think they're bullshitting you. The closest thing I've seen is Gemma 3 27b and its finetunes.
PaceZealousideal6091@reddit
Checks out. I'm not really an active SillyTavern user, but I've never heard anyone talk about it either. If they were bullshitting, at least they wasted their own time and effort. It was just cool info for me, and now you've grounded the fact. Thanks.
LoafyLemon@reddit
Yep! Either as the base to fine-tune on top of, or in a merge to enhance anatomical descriptions.
Don't quote me on that, but I believe ERP people use it too, heh.
joshman5k@reddit
pretty sure not everyone lives in the US lol
s1lenceisgold@reddit
Medical document OCR, need embeddings as well
StatFlow@reddit
apache license is new - not a 'google gemma' license anymore!
Borkato@reddit
Woah, what’s the difference? Is it like super open now? :D
StatFlow@reddit
apache 2.0 is the gold standard and fully permissive. the google gemma license was "open" but google technically had the ability to restrict it for any reason if it came to that.
OcelotMadness@reddit
Isn't MIT better? Apache still has restrictions.
DeepOrangeSky@reddit
I wonder if they did it because they felt annoyed that everyone was still using Mistral 24b tunes instead of Gemma 27b this whole time. I mean, presumably vanilla G27's writing ability and intelligence are both supposed to be higher than vanilla Mistral 24b, right? But because of the license, all the tunes were for Mistral 24b, and most people ended up preferring that to Gemma 27b and also preferred it over its abliterations.
Or they just want as much serious innovations/experimentation from the populace to be done on it for non-writing stuff and it helps with that, too, or something?
Well, in any case, pretty cool they decided to just unleash this thang
Borkato@reddit
Holy crap! So now it’s like officially “here, go nuts?”
Inevitable_Tea_5841@reddit
Yep
csm101_bob@reddit
Big deal honestly. Apache 2.0 means you can do anything with these models commercially without Google's terms hanging over you. This is Google finally playing the open-weights game for real — not just "open with asterisks." Could shift a lot of enterprise adoption that was stuck on "but what's the license?" questions.
BeneficialVillage148@reddit
This is a big release 🔥
Open weights + 256K context + multimodal + better coding/agent support… that’s actually crazy progress in one update.
Feels like local models are catching up really fast now.
Aggressive-Permit317@reddit
Gemma 4 dropping at this level is actually insane for open-source. 26B punching way above its weight and the speed on consumer hardware is a game changer. I've been running the release locally and it's noticeably smoother than the previous Gemma line on agentic tasks. Still curious how it compares to the newest Qwen3.5 in real tool-use chains though. Anyone else already quanting and testing it?
Scipraxian@reddit
I've been very pleased with its better flexibility... they still obsess over tools ;)
SpeedoCheeto@reddit
i'm just getting into local stuff and wanting to replace claude code workflow, is this one for me to explore and try to use?
HBTechnologies@reddit
This is great I am going add these to my mobile app
Macstudio-ai-rental@reddit
That Performance vs Size chart is actually insane. The fact that the gemma-4-31B-thinking and 26B-A4B models are punching so far above their weight class to beat out 120B+ parameter behemoths like Qwen 3.5 122B and Mistral Large 3 on the Elo scale is wild. Seeing almost 90% on AIME 2026 from a 31B model just proves how powerful that new configurable step-by-step reasoning mode is. Combining that built-in thinking with the 256K context window is going to make these absolute beasts to run locally. Definitely downloading the 31B GGUFs to test this out today.
Aggressive-Permit317@reddit
Gemma 4 dropping feels like Google finally stopped playing it too safe. The efficiency numbers they’re claiming could actually make local models feel snappy again on mid-range hardware instead of just server-grade stuff. I’ve been running the last couple of Gemma versions locally and the jump in coherence is noticeable. Anyone already spinning this one up and seeing the difference in real tasks, or is it still too fresh?
Ok_Edge1810@reddit
Just shipped a small Android assistant app using Gemma 4 E2B via LiteRT-LM; tool calling works surprisingly well out of the box. The native format (<|tool_call>) is clean to parse, and the model stays on-task without much prompting.
Coming from Gemma 2, the jump is significant. Response quality is noticeably better, and the memory footprint is actually smaller for what you get. 52 decode tokens/sec on GPU makes streaming feel instant.
Next experiment is using it as a coding assistant, curious how E4B holds up on LiveCodeBench-style tasks locally. Will report back.
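For anyone wiring this up, here's a minimal parser sketch. The exact delimiters and JSON payload shape below are assumptions for illustration, not the documented Gemma 4 format, so check the model card before relying on it:

```python
import json
import re

# HYPOTHETICAL format: assumes each call appears in the model output as
# <|tool_call>{...json...}</|tool_call>. The real Gemma 4 delimiters and
# payload may differ; adapt the pattern to the actual chat template.
TOOL_CALL_RE = re.compile(r"<\|tool_call>(.*?)</\|tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list:
    """Pull out every well-formed JSON tool-call payload, skipping corrupt ones."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # silently drop malformed JSON payloads
    return calls

reply = 'Let me check. <|tool_call>{"name": "search_web", "args": {"q": "gemma 4"}}</|tool_call>'
calls = extract_tool_calls(reply)
```

Dropping malformed payloads (rather than crashing) matters in practice since several people report occasional corrupt JSON from the model.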
One-Art-5119@reddit
I wish they would make an Android app for it
Sambojin1@reddit
PocketPal has been updated. Works fine on Gemma 4 4B 4_0 quant.
I'm going to see if I can get a bigger one going.
AyraWinla@reddit
It's called Edge Gallery; it just got updated with Gemma 4.
Interpause@reddit
they do... go search for google's litert gallery
arbv@reddit
The dense model is, unfortunately, worse at Ukrainian than Gemma 3 27B.
danielhanchen@reddit
<turn|>.<|channel>thought\n is also used for the thinking trace!
Daniel_H212@reddit
It seems like native tool calling isn't working very well. Is this a model problem or me? I'm running 26B-A4B at UD-Q6_K_XL with all the same settings in OpenWebUI as Qwen3.5-35B-A3B also at the same quant, (native tool calling on, web search and web scrape tools enabled), plus with <|think|> at the start of the system prompt to enforce thinking, and when given a research task, Qwen3.5 did a web search (searxng, so only snippets were returned from each result) and then scraped 5 specific pages, while gemma 4 did a web search, summarised, came up with a research plan, and then immediately gave me a response without actually following through with its research plan.
It did this somewhat consistently. The one time it did try fetch_url after search_web, it happened to fetch a page that was down (which returned an empty result), and it just went into responding as if it never planned on doing further research in the first place, nor did it try the alternative web_scrape function that I also have available (which I noted in the system prompt as a more reliable backup to fetch_url).
I also tried telling it to do further research after its first message, which caused it to use search_web twice, still no fetch_url. I then tried telling it to use its other search tools, after which it tried web_scrape once, which got it some results, and it just gave up. There's zero persistence in its research.
ZellahYT@reddit
Gemma 3 tool calling was abysmal
Daniel_H212@reddit
Yup even the one time I got it to search the web repeatedly (gave it a task where a single search definitely gets nowhere close to the full answer), it did like 5 searches and a page fetch, talked about needing to do more searching, and still stopped searching anyway.
danielhanchen@reddit
Try Unsloth Studio - it works wonders in it! We tried very hard to make tool calling work well - sadly nowadays it's not the model, but rather the harness / tool that's more problematic
Daniel_H212@reddit
I'm serving OpenWebUI via a home server to my whole family, is that possible via unsloth studio?
Also you showed one tool call but I'm looking for multiple consecutive tool calls for in depth internet research tasks, is gemma 4 able to do that in unsloth studio?
Borkato@reddit
If you haven’t seen it yet, llama cpp updated tool calls for Gemma like 3 hours ago
Daniel_H212@reddit
Didn't seem to help, it's still doing the thing where it says it will search more in thinking, then stop thinking and go straight to answering.
Borkato@reddit
Are you sure the template is correct?
Daniel_H212@reddit
I'm using the unsloth quants, maybe I should try some others, I'll do that tomorrow. Currently using llama.cpp built for vulkan for this but I usually use llama.cpp ROCm from lemonade sdk, will wait for that to update
Borkato@reddit
I more mean, try it in things that aren’t opencode
Daniel_H212@reddit
I'm using OpenWebUI
Borkato@reddit
Ah, then try it in something that isn’t opencode! Try a curl
Daniel_H212@reddit
OpenWebUI isn't open code. Also how am I supposed to test native tool calling via curl?
Borkato@reddit
I don’t know why I keep typing opencode, I meant openwebui lol
But I meant make sure it’s not looping with other stuff, obviously. Whatever, works great for me 🤷
Alarmed-Subject-7243@reddit
Native tool calling straight out of the box is huge for setting up reliable agentic workflows locally. Finally being able to automate heavy business logic without bleeding money on API calls is a massive win.
shesaysImdone@reddit
What is native thinking?
DesiCaptainAmerica@reddit
Can we get fine-tuning guide for IT with unsloth?
danielhanchen@reddit
Hmm not IT yet - but we did make guides for finetuning Gemma-4! https://unsloth.ai/docs/models/gemma-4/train
PerfectLaw5776@reddit
Thanks. I've tried several ft trial runs with `unsloth/gemma-4-E2B-it` on Kaggle (T4 GPUs) but they all go `NaN` in reported loss after some time. Have you or anyone else been able to successfully tune this one on a dataset?
All the typical hyperparameter stuff already tried, tiny LR, tiny grad norm, filtering out empty samples, etc.
`UNSLOTH_FORCE_FLOAT32` made no difference. Tried using `FastVisionModel` instead of `FastModel` according to those notebooks but same outcome.
Btw, `device_map="balanced"` seems to give an illegal memory access error on FastModel, so Gemma 4 probably can't be multi-gpu trained that way for now. But that doesn't affect most users I'd think.
hugganao@reddit
Do you have any quick first impressions of the model's ability? People over at Hugging Face seem to rate it very highly, saying they found it hard to decide what to finetune since it was so good out of the box. Is this true?
NoahFect@reddit
Hey, quick question re: Unsloth Studio. I'm thinking of switching over to it from my existing llama.cpp installation, but why do I need to create an account to run stuff locally?
Thrumpwart@reddit
Does it really require an account to run?
NoahFect@reddit
That's just what I read in their instructions:
The first version I downloaded didn't ask me to create an account so I thought it was interesting that it was now a requirement.
danielhanchen@reddit
We're still trying to get it to work well in Studio - should be done in minutes - see https://github.com/unslothai/unsloth?tab=readme-ov-file#-quickstart
For Linux, WSL, Mac:
curl -fsSL https://unsloth.ai/install.sh | sh
For Windows:
irm https://unsloth.ai/install.ps1 | iex
Qual_@reddit
Waiting for the docker update ! :D
danielhanchen@reddit
It's out now!!! So so sorry on the delay!
Hearcharted@reddit
Unsloth Studio for Google Colab, where? 🤔
theodordiaconu@reddit
Why temp 1?
illcuontheotherside@reddit
You guys ROCK!!!
danielhanchen@reddit
Thanks!
Such_Web9894@reddit
🐐
danielhanchen@reddit
Thanks!
970FTW@reddit
Truly the best to ever do it lol
danielhanchen@reddit
Thanks!
jacek2023@reddit (OP)
thanks for the quick GGUF release!!!
danielhanchen@reddit
Thanks for the post as well haha - you were lightning fast as well :)
Available-Air-9110@reddit
Hoping for another upgrade to translategemma 🤤
PopularDifference186@reddit
Is it super slow compared to qwen 3.5 for you all too or am I doing it wrong?
5060 ti 16gb and 128gb ram running via llama.cpp im getting:
Qwen 3.5 35B-A3B — 60+ tps
Gemma 4 26B-A4B — 11 tps
uncommonsense24@reddit
What are your arguments at launch? What quantization?
I have the same GPU and am seeing 70+ tps on smaller (sub 15k) contexts. Using gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf
PopularDifference186@reddit
I switched to UD-Q3_K_XL and that got me to 84 tps since it actually fits in VRAM. But then I went back and retested the Q4_K_M after pulling the latest llama.cpp (there was a KV cache fix where they reverted the SWA cache being forced to f16) and switched from -ngl 99 to --fit on, and the Q4 jumped to 55-59 tps. All the tests were around 32k context. This model is a beast!
uncommonsense24@reddit
Awesome. I'm going to have to try that model now. Glad it has sped up for you!
Cradawx@reddit
On my 5070 Ti, Q4_K_XL with --fit on I'm getting about 70 t/s (Linux), which is about the same as Qwen 3.5 35B-A3B.
Guilty_Rooster_6708@reddit
Same here. 26B-A4B context also uses more VRAM for me than Qwen3.5.
I’m running this on LM Studio with Unsloth Q4_K_M getting 25 token/sec.
WhatIs115@reddit
So this must be what I'm seeing. I wasn't getting full GPU utilization, it must be overflowing. Same GGUF size, but Gemma 4 wants an extra 3GB of VRAM for the same 8192 context, wild.
BubrivKo@reddit
Ok, Gemma 4 26B A4B didn't pass my "benchmark" :D
Gemma 31B passed it!
FenderMoon@reddit
I was able to get it to pass this benchmark once I enabled reasoning. Though this benchmark is easy enough that it should have been able to pass without it IMO.
Prestigious-Crow-845@reddit
The question doesn't say whether the car to be washed is already there while you are not, or whether you have only one car and need to drive it there to wash it.
boutell@reddit
"Your car is probably where you are, not already conveniently at the other place" falls under common sense reasoning IMHO.
Prestigious-Crow-845@reddit
So how would you ask this question if you have two cars and left one of them in the carwash queue while going home? I agree that most of the time you have one car and should drive it there. But if you asked a real person whether you should drive to the carwash or walk, they'd probably assume you're talking about a second car that's already there, or else that you're going insane. So I'd assume the asker knows what they're doing (they already have a car there to wash) and isn't a moron, in a real conversation.
So asking this question to a real person and common sense are kind of opposites.
boutell@reddit
Lol. When I was benchmarking this, I left off that first sentence because I just assumed that made it too easy. It doesn't of course, lots of models fail like this.
But because of that, I'm favorably impressed with Qwen 3.5. Without the first sentence, it thought forever, but it produced an acceptable answer. It said I should drive unless I was going to work there.
I should also acknowledge that although it thought forever, it identified the core issue very early in the thinking trace.
BubrivKo@reddit
Yeah, Qwen 3.5 answered correctly, and that's the reason I love this model for its size.
psychohistorian8@reddit
can't wait to see how it does in real world agentic coding tasks, especially compared to Qwen 3.5 27B/35BA3B
benchmarks mean nothing to me anymore
I'm downloading both 31B and 26BA4B and will play around with them after work
Dr4x_@reddit
Please share your results, I'm curious to see how useful they are for real life use cases
psychohistorian8@reddit
well unfortunately for me its unusable
I'm not sure if this is LM Studio or what, but I can't load Gemma 4 unless I reduce the context window down to about ~8k, which is insane because I can load comparable Qwen 3.5 models with a ~32k context window
CorrectAbrocoma3321@reddit
What’s your spec?
psychohistorian8@reddit
32GB M5 Macbook Air
and actually late yesterday there was an update to LM Studio/Llama.cpp that allowed me to load the models with expected context windows (comparable to Qwen)
I tried to use Gemma 4 with opencode/speckit to define a new feature but Gemma got itself caught in a deathloop doing the same thing over and over, then I fell asleep
Fresh_Finance9065@reddit
Yeah, never trust LM Studio with new releases. They normally rush a broken version of new models so they can say they "support" it; use mainline llama.cpp if you want to run new models properly on launch
stormy1one@reddit
Same - my experience with Gemini 3 has been horrible for coding. Lots of mistakes where it said things were perfect. Qwen3.5 27B has been rock solid with the updates from llama.cpp and vllm. Not expecting much from Gemma 4
jazir55@reddit
I wouldn't expect much either, since you're having a bad experience with Gemini 3.0, which was their previous SOTA model from 6 months ago, and Gemma 4 is clearly weaker.
danzph@reddit
lmao
HellomyfriendNine@reddit
it kept refusing to tell me what the latest iPhone model is
fuse1921@reddit
What does "it" mean?
Ink_code@reddit
Instruction tuned. It means the model went through a supervised fine-tuning phase where it's trained to follow instructions; this lets it act as a useful assistant.
You can also find base models on Hugging Face which haven't gone through it, so they try to complete the text sent to them instead of treating it as instructions.
ghulamalchik@reddit
Non instruct models don't follow instructions?
Ink_code@reddit
Yeah, they just complete text. You could write part of some code and they'll continue writing based on it, or write part of a story and they'll continue the story, but you can't do the usual "you're a CLI agent- [insert rest of prompt] now write a script for checking whether a number is a prime number" as it might just continue the sentence with something like "and whether it's odd or even"
SeymourBits@reddit
Not really, they just complete text.
jacek2023@reddit (OP)
instruct
Specialist_Golf8133@reddit
wait they skipped gemma 3? lol google's version numbering is always chaos. anyway the real question is does it actually run better locally than llama or are we still in that weird spot where google models look good on paper but dont quite deliver at 4bit quant. anyone tried it yet?
jacek2023@reddit (OP)
obviously a bot
Specialist_Golf8133@reddit
🥲🥲
aWanderer01@reddit
I immediately tried it and it was not good actually... corrupt JSON results coming back and a bunch of other anomalies. Lasted 3 hours and switched back to qwen
DOAMOD@reddit
tools broken for me, yes
Conscious-Track5313@reddit
Also supported by https://elvean.app, you can run it directly in the macOS app
Sad-Savings-6004@reddit
First test I gave the 26B, it failed to return proper JSON. Qwen 35B is king 👑
Faktafabriken@reddit
Trying the 31B out on my Mac Studio M2 Max 64GB unified memory. For some reason it uses a lot of memory when I add context, compared to qwen3.5
Q8 was unusable and Q4_K_M usable only with very short context. Way worse than Qwen3.5 27B. Don't know why, but maybe someone computer-smart will see this and come up with a solution.
Longjumping-Move-455@reddit
I believe this was a bug and has been fixed in the latest llama.cpp update
DOAMOD@reddit
Gooner, listen, "Gooner 4" destroys qwen 3.5...
Valuable_Relation634@reddit
The 4-bit variants are tempting but I'm curious about the E2B vs E4B tradeoffs. Anyone actually running the 27B on consumer hardware yet? Wondering if the quality drop from A4B is noticeable for coding tasks.
Original_Hedgehog_99@reddit
How does this compare to Qwen 3.5 at a similar scale?
-dev4pgh-@reddit
They mention handwriting recognition, which could be valuable in some projects I am working on. Has anyone tried this yet? So far (anecdotally), the Qwen VL models seem to be the best, with no real competition.
HBTechnologies@reddit
This is great I am going add these to my mobile app
Chriexpe@reddit
On 24GB it can kinda run with ctx-size of 262144 and KV cache at q4_0, but it's on the tipping edge of crashing; can't wait for llama.cpp to add those crazy KV cache optimizations from Google's whitepaper.
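For sizing this, here's a rough KV-cache estimate. The layer/head dimensions below are made-up placeholders, not Gemma 4's published config, so swap in the real values from the model's config; SWA and the cache tricks in the whitepaper can shrink this a lot:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# NOTE: n_layers/n_kv_heads/head_dim used below are HYPOTHETICAL placeholder
# values, not Gemma 4's actual architecture.

def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: float) -> float:
    """Linear in context length; halving precision halves the cache."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

full_f16 = kv_cache_gib(262_144, 48, 8, 128, 2.0)  # f16 cache at full 256K context
full_q4 = kv_cache_gib(262_144, 48, 8, 128, 0.5)   # ~q4_0 cache, 4x smaller
```

With these placeholder dims the f16 cache alone blows way past 24GB at full context, which is why the q4_0 cache is the only way it fits at all.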
MerePotato@reddit
Q4 KV is a terrible idea, particularly on a reasoning model
Chriexpe@reddit
Is the compromise that big? I thought context size > quality when running something like Hermes Agent
JoNike@reddit
Using latest llama.cpp (8639), on my 5080 16gb.
Gemma 4 26B-A4B MXFP4 @ n-cpu-moe=10, q4 cache, 256K context
at 12.8k tokens context: 61.4t/s
at 63.9k tokens context: 52.2t/s
at 153k tokens context: 41.2t/s
at 230k tokens context: 34t/s
Ain't got a chance to test it fully yet
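If anyone wants to eyeball speeds at other depths, a quick piecewise-linear interpolation over the numbers above (purely illustrative of the reported curve, not a performance model):

```python
# Interpolate the reported (context, t/s) measurements from the setup above.
points = [(12_800, 61.4), (63_900, 52.2), (153_000, 41.2), (230_000, 34.0)]

def est_tps_at(ctx: int) -> float:
    """Estimate t/s at a given context depth by interpolating between measurements."""
    if ctx <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if ctx <= x1:
            return y0 + (y1 - y0) * (ctx - x0) / (x1 - x0)
    return points[-1][1]  # beyond the last measurement, just clamp

mid_ctx = est_tps_at(100_000)  # somewhere between the 64k and 153k readings
```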
MerePotato@reddit
Don't use Q4 context cache dude, it annihilates reasoning models
nothrowaway@reddit
Oh, yeah... this is a smart model... yeah... sure.
Aha!Line 297 uses displacement (misspelled: 'a' before 'c' is okay, but 'e' after 'c' is 'ce', the typo is displacement vs displacement). Wait, let me look closer.displacement (line 297)displacement (line 300)Let's compare them:d-i-s-p-l-a-c-e-m-e-n-t (Correct)d-i-s-p-l-a-c-e-m-e-n-t (Looking at line 297) -> displacement.Wait, let me re-read the provided code for main.py.Line 297: displacement = (dx * dx + dy * dy) ** 0.5Line 300: if displacement > 3:Let's check the spelling:Line 297: d-i-s-p-l-a-c-e-m-e-n-t (Actually, looking at the text: displacement. No, it's displacement vs displacement).Wait, displacement has an 'e' where it should be... actually let me just use grep or read again to be absolutely sure of the exact characters.main.pyline 297: displacement = (dx * dx + dy * dy) ** 0.5main.pyline 300: if displacement > 3:Compare:displacement (297)displacement (300)They are different.displacement vs displacement'a' 'c' 'e' vs 'a' 'c' 'e' ... wait.disp l a c e m e n tdisp l a c e m e n tWait, let me look at the characters one by one.Line 297: displacementLine 300: displacementLooking at the prompt's provided code:297: displacement = (dx * dx + dy * dy) ** 0.5300: if displacement > 3:Wait, displacement (297) and displacement (300).d i s p l a c e m e n t (297)d i s p l a c e m e n t (300)They look the same in the provided text. Let me re-read.disp l a c e m e n tdisp l a c e m e n tWait, the user says: "The variable name a is used as b. String compare and tell me if they are the same or different."I previously ran python -c "print('displacement' == 'displacement')" which returned True.The user is testing me. 
I should actually look at the code again.Let's re-read lines 290-311 ofmain.pyvery carefully.291: self.pos.x += self.vel.x * dt292: self.pos.y += self.vel.y * dt293:294: # Calculate displacement to limit trail length295: dx = self.pos.x - prev_pos[0]296: dy = self.pos.y - prev_pos[1]297: displacement = (dx * dx + dy * dy) ** 0.5298:299: # Only add trail points if bullet moved more than a threshold300: if displacement > 3:Yes!Line 297: displacementLine 300: displacementCheck the spelling again:disp l a c e m e n t (297)disp l a c e m e n t (300)Wait, disp + lace + ment? No.It's disp + lace + ment vs disp + lace + ment?Wait.displacement is the correct word.disp l a c e m e n tThe typo in line 297 is disp l a c e m e n t.No, wait.displacement -> d i s p l a c e m e n tdisp l a c e m e n t -> d i s p l a c e m e n tLet me count letters.d(1) i(2) s(3) p(4) l(5) a(6) c(7) e(8) m(9) e(10) n(11) t(12)Line 297: d(1) i(2) s(3) p(4) l(5) a(6) c(7) e(8) m(9) e(10) n(11) t(12)They are spelled the same in the output I received?Let me look at themain.pyoutput again.297: displacement = (dx * dx + dy * dy) ** 0.5300: if displacement > 3:Wait! Look at the characters:disp l a c e m e n t (297)disp l a c e m e n t (300)
DrWitchDoctorPhD@reddit
There was a problem on llama.cpp. Pulling the newest version + redownloading the quants (I think it's necessary) fixed that for me.
fdrch@reddit
I've seen errors like this too, I compared words byte-by-byte and they were identical. Also there were broken tool calls. Seems like it doesn't work properly yet.
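If anyone wants to reproduce that byte-by-byte comparison to rule out invisible Unicode homoglyphs when the model insists two identical names differ, a quick check:

```python
# Byte-level comparison to rule out Unicode homoglyphs when a model insists
# two identical-looking identifiers are actually different.

def first_byte_diff(a: str, b: str):
    """Index of the first differing UTF-8 byte, or None if truly identical."""
    ba, bb = a.encode(), b.encode()
    for i, (x, y) in enumerate(zip(ba, bb)):
        if x != y:
            return i
    return None if len(ba) == len(bb) else min(len(ba), len(bb))

same = first_byte_diff("displacement", "displacement")         # genuinely identical
sneaky = first_byte_diff("displacement", "displ\u0430cement")  # Cyrillic 'а' differs
```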
nothrowaway@reddit
It is not ready for prime time.
CoconutMario@reddit
And, here we go with a NVFP4 quant -> https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 :) took me a bit of time, but here we go. works smooth in my setup
Mashic@reddit
I tested the gemma4:26B-A4B-Q4_K_M on translation from English to Arabic, it's better than the translategemma:27b-Q6.
iamtheworldwalker@reddit
In my experience, Google's models have always excelled in translation (at least in the languages I speak)
DigiDecode_@reddit
the 31b ranks above GLM-5 on LMSys, my jaw is on the floor
Usual-Carrot6352@reddit
in math gemma-4-26b-a4b is No.10 🤯
Basic_Extension_5850@reddit
It being above sonnet 4.6 seems a bit crazy.
Usual-Carrot6352@reddit
I have great expectations for this model in computation, but for biology I will try and see what it can do for me. It's been 3 months since I touched any local models, ever since I saw codex 5.3; in fact I haven't even updated my ollama and lmstudio 😂
jld1532@reddit
Can someone explain the business model here? I'm basically running a SOTA model on my basic laptop now. Why would I buy a subscription? My university was already running Kimi and not paying. I don't get it.
dr_lm@reddit
Because it's not a SOTA model, and the benchmarks lie.
Several-Tax31@reddit
Deepseek is not in the list at all, what a stupid benchmark.
SpicyWangz@reddit
Deepseek is pretty far behind at this point. It really struggles with prompt adherence and structured output
Several-Tax31@reddit
Deepseek speciale is one of the best in math. Any math benchmark that doesn't include it is a joke imo.
jld1532@reddit
I mean for all but 1% of people interested in AI, it is effectively SOTA.
Spectrum1523@reddit
none of the smaller models are actually close to SOTA. try using them and you'll see. they're excellent and useful but there's no real comparison
Several-Tax31@reddit
Actually in math small qwen models are pretty solid.
Spectrum1523@reddit
yeah, in limited fields they can perform close to SOTA. that's what they are good for and it's really cool that they can do that! but calling any ~30b parameter model a general replacement for real SOTA models is silly
Several-Tax31@reddit
Of course, they are this big for a reason.
Darkoplax@reddit
yeah i aint trusting that
MandateOfHeavens@reddit
Tbf GLM-5's quality depends heavily on the time of day. During peak hours, especially in China, they use a heavily quantized model. And its thinking block is unusually sparse and the model overall has poor context comprehension. 5.1 is the real deal and what 5 should have released as.
Mashiro-no@reddit
Do you have a source for this, or are you simply going off anecdotes?
Borkato@reddit
I’m trying so hard not to get hyped and it’s NOT WORKING
Zeeplankton@reddit
remember, this is google lol
roodgoi@reddit
and it's open source lol, so it cannot be nerfed.
FlamaVadim@reddit
at least it cannot be nerfed 😝!
ForsookComparison@reddit
Narrator: it was not better than GLM-5
_raydeStar@reddit
... Wut.
Is that real!?
Birdinhandandbush@reddit
Testing Gemma4 E4B unsloth gguf at the moment and it refuses to believe I have it running locally; it's telling me it's a cloud-based service provided by Google.
I'm getting 65-70 tok/sec which is great, so I was going to see if I can backend OpenClaw with it, but not sure I trust it if it's kinda stubborn and hallucinatory already.
Adventurous-Paper566@reddit
Qwen3 VL had the same behavior; it's not really a problem.
ReadyAndSalted@reddit
E4b seems like a super good option for voice assistants. Instead of having: Audio -> speech to text -> LLM -> text to speech
You could have: Audio -> LLM -> text to speech (including agentic stuff with function calling)
keepthepace@reddit
I wonder why the bigger ones do not have audio input?
Nixellion@reddit
I wonder how it compares to whisper for speech recognition as well. And when will it be supported by llama.cpp
MrClickstoomuch@reddit
I missed that - I'm still setting up my smart home system to use LLMs for local voice, but wasn't Qwen 3.5 4b also a multi-modal model? Or would you still need to use something like Parakeet for voice to text (and the associated delay of each step). Or was that only for vision and text inputs?
If so, that's a major improvement considering it is not too far from Qwen 3.5 4b. However, it looks like the same size quant at q4 is around 5gb for E4b to Qwen's 2.75gb size while being roughly 4.5b active parameters. I'm curious how much faster or better quality it may have versus the multiple tools approach, since I don't really need to have the audio out / text to speech output with my smart home at this time (only the text as notification on my phone).
Craftkorb@reddit
Qwen3.5 (Except for 2B) is multi-modal, but that means the classic duo of Text and Images. No direct Audio input. Do note that Gemma4 does Audio, Image, and Text input, but only Text output.
_Ruffy_@reddit
Guess what will be deployed to iPhones very soon ;-)
bakawolf123@reddit
foundation models they said... I guess the recent news about that deal saying apple will open up to other providers is because they paid billions, but in the end it's just an open model =)
Advanced-Tomorrow252@reddit
shoot my mac mini can only run the shitty version
keepthepace@reddit
I am curious: why only the smallest models have this? In my experience small models are pretty bad at transcribing multilingual speech.
Greedy-Lynx-9706@reddit
Can someone tell me how I should interpret those % please?
RedditUsr2@reddit
It's not ready for prime time. Downloaded google/gemma-4-26b-a4b Q6_K and get stuff like this.
Then-Topic8766@reddit
Same problem.
tommy_redz@reddit
It seems like a serious bug. I also had this where it wanted to "correct" a function with the same name and got stuck in a loop for 10 minutes or so until I canceled it
jld1532@reddit
Yo, I think I was trippin'. I gave it a doc to review, and it hallucinated misspelled words left and right. Odd.
RedditUsr2@reddit
Seriously mine had a mental breakdown:
RedditUsr2@reddit
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let's count the 'd's in the user's snippet.
shuffled(s-h-u-f-f-l-e-d) -> 1 'd' after the 'le'.shuffled(s-h-u-f-f-l-e-d) -> 1 'd' after the 'le'. Wait, "shuffled" usually has one 'd'.Wait, let me look at the actual word.
shuffled.[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let's re-examine the user's code character by character from their previous prompt.
User's original snippet:

```javascript
shuffle: function(library, array) {
    // Fisher-Yates shuffle - creates a copy to avoid mutating original
    const shuffled = [...array];
    for (let i = shuffled.length - 1; i > 0; i--) {
        const j = this.getRandomInt32Inclusive(library, i);
        [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; // <--- LOOK HERE
    }
    return shuffled;
}
```
Let's look at the variable name in that specific line:
shuffled.shuffled.shuffledhas 8 letters: s(1) h(2) u(3) f(4) f(5) l(6) e(7) d(8).shuffled(the one in the bracket): s(1) h(2) u(3) f(4) f(5) l(6) e(7) d(8)? No, wait.Let's look at the spelling:
[shuffled[i]...Wait, let me look extremely closely at the user's text. User wrote:
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let's count again. Word 1: shuffled -> s-h-u-f-f-l-e-d (8 letters). Word 2: shuffled -> s-h-u-f-f-l-e-d (8 letters).
Self-Correction: I am being blind. They are identical in my view because of how they look on screen, but let me re-read the user's provided source code one more time.
User's Source Code:
const shuffled = [...array];(s-h-u-f-f-l-e-d)[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let me look at the first word in that bracket:
shuffled. Let's count characters: S, H, U, F, F, L, E, D. (8 characters). Now let's look at the second word:shuffled. S, H, U, F, F, L, E, D. (8 characters).OH! I SEE IT NOW. I was looking for an extra 'd'. But there isn't one. Let me look at the user's source code one more time very slowly.
User provided:
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Wait... let me check if I am hallucinating the typo. Looking at the block:
Actually, in many fonts, "shuffled" and "shuffled" look exactly the same because they are both just the word "shuffled". Wait... I am looking at the user's code again. Line:
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let me re-read the word in my brain. S-H-U-F-F-L-E-D. S-H-U-F-F-L-E-D.
Wait... if they are spelled exactly the same, then there is no error! Let me re-read the user's code from the very first time they sent it.
const shuffled = [...array];[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Hold on. I will copy and paste from their text into a character counter in my mind. The user's code:
[shuffled[i], ...]The user's constant:const shuffled = ...I am looking at the letters... S, H, U, F, F, L, E, D. If both have 8 letters and are s-h-u-f-f-l-e-d... then they are the same word.
Wait! I found it. In my previous response, I was trying to tell them there was an extra 'd', but in their code,
shuffled(with one d) is just "shuffled". BUT! The constant they defined wasshuffled. They are the same word.Wait... if they are the same word, where is the error? If the user's code says:
const shuffled = [...array];[shuffled[i], ...]And "shuffled" == "shuffled"... then there is no error.Let me look at the user's source code one more time. I will be extremely pedantic. User:
const shuffled = [...array];(S H U F F L E D) - that's the word "shuffled".User:
[shuffled[i], ...](S H U F F L E D) - that's also the word "shuffled".Wait! I am looking at the user's source code again.
const shuffled = [...array];[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];I will look at
shuffled. Is it possible there is a typo in the user's code that I am misreading because of my own brain? Let me look at the word in the bracket:shuffled. Let me look at the constant:shuffled.Wait... I think I see it. In the array declaration, the word is
shuffled. In the swap line, the first word isshuffled.THEY ARE THE SAME WORD. "Shuffled" and "Shuffled" are just two different ways people spell "shuffled"? No. "Shuffled" is correct. "Shuffled" is... also how you'd spell it if you forgot the 'e'? No, shuffle + ed = shuffled.
Wait! I'm being stupid. Let's look at the word: S H U F F L E D. That is one word. It has one 'D'. The user wrote:
[shuffled[i], ...]And they defined:const shuffled = ...THEY ARE IDENTICAL. There is no typo in the user's code. I hallucinated a typo because I saw two words that looked slightly different to my pattern-matching brain, but upon character-by-character inspection, the user's code was actually correct and had no error there.
Wait, let me check again. User's code:
const shuffled = [...array];Line below:[shuffled[i], ...]Let me look at the letters inshuffled. S-H-U-F-F-L-E-D. Now let's look at the word "shuffled". S-H-U-F-F-L-E-D.I see what happened. I thought
shuffledhad an extra 'd' (likeshuffledd), but it doesn't. It just has one 'd'. And the constant they declared also only has one 'd'. Therefore, they are identical. My "correction" was useless and wrong because I misread their code as having a typo when it didn't!Wait... let me look one more time. I will zoom in on the user's text.
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let me count 'd's again. Word 1:shuffled-> s h u f f l e d. (One d) Word 2:shuffled-> s h u f f l e d. (One d)Wait! I found it. Looking at the user's code again. Line:
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Let me look at word 1. S H U F F L E D. Now let's look at Word 2. S H U F F L E D.I AM LOSING MY MIND. They are the same! If they are the same, there is no error. If there is no error, why did I say there was one? Because my brain saw
shuffledand thought "That's not how you spell shuffled!" but then realizedshuffledis also spelled that way? No, shuffle + ed = shuffled.Actually... "Shuffle" ends in an E. When you add D, it becomes SHUFFLED. If the user wrote SHUFFLE-D (with one 'd'), it's correct. If the user wrote SHUFFL-ED... wait, they are both "shuffled".
Wait! I see a difference in my own eyes. Is there an extra E?
shuffledvsshuffled. Let me check: S H U F F L E D S H U F F L E DThey are identical. I've been trying to find a typo in code that actually had no typo. I am a very bad AI sometimes.
Wait... let me look at the user's line one more time with extreme focus.
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];I will count characters in
shuffled. 1,2,3,4,5,6,7,8. I will count characters inshuffled. 1,2,3,4,5,6,7,8.They are the same word. There is no typo. I apologize profusely.
Wait! Let me look at the user's constant one more time.
const shuffled = [...array];Maybe they meantshuffledbut typedshuffled? But they are the same.OK, let's be real: I was wrong. There is no error in that line of the user's code. I will admit it.
Wait... looking at the user's source again...
const shuffled = [...array];[shuffled[i], ...]Actually, if I look really closely at their text: The first one isshuffled. The second isshuffled. They are spelled identically.Final realization: My brain was trying to find a difference where none existed because I expected there to be a typo in a code block that looked "suspicious."
Actually, hold on. Let me look at the user's text one last time.
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Is there any other word? No.Conclusion: The user is correct. They are identical. I am wrong. I will apologize and admit the mistake.
Wait! One more check. Does
shuffledhave two 'f's? Yes. Doesshuffledhave two 'f's? Yes. I see no difference.WAIT!!! I just noticed something in the user's original code block that I missed before! The constant is:
const shuffled = [...array];The line is:[shuffled[i], ...]If they are spelled exactly the same, there is NO error. UNLESS... "shuffled" (correct) vs "shuffled" (wrong)? No, they are both s-h-u-f-f-l-e-d.
I will admit I am hallucinating and apologize.
Wait! Let me look at the spelling of "shuffled". S H U F F L E D. It's a perfect word. If both words in that line are s-h-u-f-f-l-e-d, then there is no typo.
Wait! I just looked at the user's prompt again.
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];and[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];Yes, they are identical. I will apologize for my confusion and admit that there is no error in that line of their code. (I'll also check if "shuffled" vs "shuffle-d" was the issue, but they aren't).
Actually... Looking at the user's text:
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];The first word isshuffled. The second word isshuffled. THEY ARE THE SAME.I am going to apologize and stop trying to find a ghost typo.
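(For reference, the snippet the model kept re-reading is a standard Fisher-Yates shuffle and really has no typo. A minimal standalone sketch of the same algorithm in Python, with `rng` standing in for the app-specific `getRandomInt32Inclusive` helper:)

```python
import random

def fisher_yates_shuffle(array, rng=random):
    # Copy first so the original list is not mutated,
    # mirroring `const shuffled = [...array]` in the JS snippet.
    shuffled = list(array)
    for i in range(len(shuffled) - 1, 0, -1):
        # Pick j uniformly from 0..i inclusive (the role of
        # getRandomInt32Inclusive in the original code).
        j = rng.randint(0, i)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled
```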
Beginning-Window-115@reddit
are you using correct parameters
RedditUsr2@reddit
I used the recommended and default settings in LM studio.
Hot-Will1191@reddit
My initial impression is that 26B-A4B and 31B are extremely smooth with translation and language. Honestly, it's in a tier of its own so far which is something I've been waiting for over a year now. It even makes translategemma feel outdated instantly for my use case. E4B and E2B are a bit meh.
arbv@reddit
Gemma 3 is still better than many other SOTA models at Ukrainian, second only to Google models, FWIW. Claude and GPT caught up only in the latest versions.
It is crazy how good Gemmas are at multilingual support. Though, Ukrainian does require larger models.
chitown160@reddit
E4B kind of sucks compared to Qwen 3.5 in similar sizes.
akavel@reddit
Is there a way to disable "thinking" in llama.cpp for this model through commandline options? I tried
--reasoning-budget 0, but it didn't seem to change anything :(
nickm_27@reddit
--reasoning off
akavel@reddit
Hh... yet another flag... it worked, thank you!!
dobomex761604@reddit
Am I missing something, or is Gemma 4 less censored than Mistral 3? I've tested it briefly, and it didn't refuse writing jokes that Mistral 3 24b refused to. Very interesting.
No 1 million context, though XD
Pretend-Proof484@reddit
ICYMI this project can run Gemma4 with TurboQuant: https://github.com/ericcurtin/inferrs.
segaman1@reddit
I currently use gemma-3-27b-qat-Q4. I have an Nvidia 5070 12gb, 32gb DDR5 RAM, and an i7-13700k. Will any of the Gemma 4 models run in a way that makes it an upgrade over gemma-3-27b-qat-q4? Or should I stick with Gemma 3?
vasimv@reddit
Trying to pair unsloth/gemma-4-26B-A4-it-GGUF (IQ4_XS, q4_0/q4_0 cache) with opencode. It does something, but stops very often, asking me for confirmation at every step. And stupid <channel stuff gets printed, not sure what to do with it. :(
quantier@reddit
seems to be a bug in the 26B quants, haven’t heard anyone able to use them properly yet. It might be a llama.cpp issue or even more likely something with the chat template
TopChard1274@reddit
Not even E4B q4_k_m fits on my M1 iPad, it's too big (5gb)😭
quantier@reddit
q4 should fit - I think there might be a KV Cache bug or leak that adds additional GB when extending context window. Wait for them to optimize or even better hopefully there are TurboQuants coming
jamasty@reddit
Looking at benchmarks, Qwen 9b (as it's max what I can run at my m1 16gb) is better than Gemma 4 E4B, right?
quantier@reddit
Yes - way better!
KeepOnKeepingOn__@reddit
What is the difference between the E4B and A4B models? I understand that A4B is an MoE architecture, so only 4B parameters are used during inference, but no idea what the E4B is?
quantier@reddit
The 26B A4B is a Mixture of Experts model. It requires around 16GB of RAM / VRAM to load at 4-bit quantization. The model is a 26B parameter "medium sized" model, but any time you ask it something only 4B parameters are activated, which means it will be very fast since it's not using the full 26B at any given time.
The E4B is a very "small" dense model: it only has 4B parameters, and those 4B are always activated. It will fit on as little as 6GB RAM / VRAM even at 8-bit, and would fit on 4GB RAM / VRAM at 4-bit. These small models are usually not recommended below 8-bit: they are so small to begin with that heavy quantization usually loses a lot of their "intelligence".
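The sizes above can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch of the weights alone; real GGUF files add overhead for embeddings, KV cache, and mixed-precision layers:

```python
def approx_weights_gb(total_params_b, bits_per_weight):
    # Weights alone: parameter count (in billions) * bits per weight / 8.
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Memory is driven by TOTAL parameters: all 26B must be resident...
print(approx_weights_gb(26, 4))  # 13.0 GB at 4-bit, ~16 GB with overhead
# ...while per-token compute is driven by the 4B ACTIVE parameters,
# roughly the work of a dense 4B model:
print(approx_weights_gb(4, 4))   # 2.0 GB of weights touched per token
```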
Beginning-Window-115@reddit
"effective 4b" pretty sure it just means the model is 4b in size
AlternativeAd6851@reddit
How fast is it with a large context? Gemma 3 was incredibly slow with anything above 20-30k.
Skyline34rGt@reddit
Wow https://pbs.twimg.com/media/HE6ZAdBbQAAJ4jb?format=jpg&name=900x900
MoffKalast@reddit
Damn that's impressive given Gemini's lackluster performance.
MerePotato@reddit
Gemini 3 has been pretty great though?
MoffKalast@reddit
Well I find it great for analysis and planning, but for writing any code it's only my fourth choice, after Kimi and the two usual suspects. Maybe it does better for Golang or something, but it seems consistently bad at implementing math heavy stuff.
SawToothKernel@reddit
This is a bit cognitively jarring for me because I use Gemini 3.0 every day as my base model (when I've run out of credits for frontier models) and it's absolutely fine. I'm coding large and fairly complex applications.
I wonder if what we're experiencing is that the quality of the agentic loop is more important than the model.
MerePotato@reddit
Oh yeah its not really a code gen model, can't argue there
redblood252@reddit
Sounds way too good to be true.
SpiritualWindow3855@reddit
Why? We know Chinese models haven't been as polished on reasoning as models from the big 3 western labs.
We also know Gemma 3 has unusually high world knowledge for its size.
So a slightly scaled up version of that + reasoning would be expected to be one of the best open reasoning models out there. Qwen still has less reliable reasoning than GPT-OSS; it's the base model performance that makes up for it.
redblood252@reddit
I’m not worried about knowledge to be honest. I’m much more interested in intelligence (understanding queried history and using all information it has) and tool utilization
SpiritualWindow3855@reddit
My comment literally starts with reasoning.
You can keep using Qwen but it takes seconds of watching this thing operate to know it's the higher quality reasoner of the two.
redblood252@reddit
You're right that agentic coding benefits from reasoning. I will try it out. But I'm skeptical that it is better than the 397b Qwen.
boutell@reddit
That is impressive, but how much data does arena even have on it yet?
Aggressive_Dream_294@reddit
how did they cook so hard
alitadrakes@reddit
Noob question, what text or prompt or what do they test this with to compare?
Baul@reddit
Go visit arena.ai and submit a prompt. It randomly selects two models, and you vote on which one answered your question better.
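For anyone wondering how those votes turn into rankings: the leaderboard is Elo-style, and a single head-to-head update looks roughly like this (K=32 is an illustrative assumption; the real leaderboard uses its own constants and Bradley-Terry style fitting):

```python
def elo_update(r_winner, r_loser, k=32):
    # Expected score of the winner under the Elo model.
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    # Winner gains what the loser gives up; upsets move ratings more.
    delta = k * (1 - expected)
    return r_winner + delta, r_loser - delta

# Two equally rated models: the winner takes half of K.
print(elo_update(1000, 1000))  # (1016.0, 984.0)
```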
jacek2023@reddit (OP)
very cool
i5_8300h@reddit
Any idea when llama-cpp-python will be updated to support Gemma 4? A project I'm working on uses llama-cpp-python with a custom IDE UI written in Python, and I'm getting model initialization errors which make me think that llama-cpp-python isn't able to make heads or tails of the Gemma 4 architecture.
I'm using the unsloth Q4_K_M quant of Gemma 4 E2B, hardware is a Raspberry Pi 5 8GB
True_Requirement_891@reddit
No architectural innovation?? No hybrid attention? Apart from Gemma-specific capabilities like strong multilingual perf and a nice talking style, I don't think this means much... Qwen3.5 wins on architectural innovation, with hybrid attention that supports very long context with a minimal memory footprint... I wish they had shared some research that actually pushed things forward...
WaveformEntropy@reddit
Happy Gemma 4 day!
Spent half the night testing it and I think people don't realize how big of a deal it is for those of us who value the range of philosophical thinking more than tool use.
ManUtdDevilsYYG@reddit
Noob here. Can it be run on iphone 13 pro?
EconomistThis5542@reddit
I just tried e2b on my iPhone with Google's Edge Gallery. I asked it to write a DFS for me, and then my phone started to burn 😭 but it is actually fast. Based on this website and Google's blog, e2b/e4b actually support native audio, which is insane
Corosus@reddit
Built latest llama.cpp
gemma-4-31B-it-UD-Q4_K_XL passed, first try, a personal niche code test I use that all other models have like a 95% fail rate on because they miss one thing. We might have something special here
5070ti 5060ti 32gb combined, llama.cpp cuda, 25tps to start trickling down to 18tps after 32k context used.
E:\dev\git_ai\llama.cpp\build\bin\Release\llama-server -m E:\ai\llamacpp_models\unsloth\gemma-4-31B-it-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 8080 --temp 1.0 --top-p 0.95 --top-k 64 -ngl 99 -ts 24,20 -sm layer -np 1 --fit on --fit-target 2048 --flash-attn on -ctk q8_0 -ctv q8_0 -c 96000
rpkarma@reddit
Glad its not just me who saw that haha. Though it amusingly will listen pretty well if you ask it to not overthink, which is kind of neat.
Corosus@reddit
Ah nice, there's also the option of adding this for llama.cpp, but I haven't battle tested it for intense code debug sessions so I'm not sure what a good value for reasoning budget would be
--reasoning-budget 4096 --reasoning-budget-message "I'm running low on thinking tokens, I should wrap up and give my answer."
fdrch@reddit
Tried running 31b with llama.cpp on linux with opencode. It eats all available ram and system kills the process (192 gb ram)
Corosus@reddit
Just had this happen to me too with llama.cpp on windows with claude. Started at around 50GB ram OS used then eventually hit 128gb ram after a long session and then process killed.
fdrch@reddit
--parallel 1 somewhat fixes it
Due-Memory-6957@reddit
Drummeeeeeeeeeeeer
Dhervius@reddit
Another model that was stillborn :'v
SpookiestSzn@reddit
small brain which one of these is the biggeest I can run on a 5090 with 64 GB of RAM
jacek2023@reddit (OP)
all
SpookiestSzn@reddit
LFG ty
Beginning-Window-115@reddit
you should learn so instead of small brain you become big brain
egauifan@reddit
26B A4B q8 works well. I can't fit the entire model for 31b onto the gpu at a good quantisation so not using that.
funkybside@reddit
i am a simple man that just uses ollama running in a w11 vm (there's a reason for that) to handle local llm services. please let the pre-release 0.20 update come out soon.
Beginning-Window-115@reddit
llama.cpp by itself really isn't that complicated if you read for a couple minutes
Dragonsalt@reddit
Which of these versions can reasonably be run on an RTX 3060? Anyone got any tests so far?
Beginning-Window-115@reddit
any model under 10B ish parameters as long as you run 8bit or below
No-Wallaby-9210@reddit
Funny how e4b won't blink at telling a "Yo mama is so fat" joke in English, but will absolutely not do it in German. How come?
asssuber@reddit
r/GermanHumour/
PooMonger20@reddit
It implies German people are more polite, and bad at jokes.
Checks out, lol.
Ayumu_Kasuga@reddit
Just replaced Qwen3.5 35B with the Gemma 4 26B in one of my workflows and got a HUGE speed increase simply due to the fact that Gemma doesn't think as much.
m3kw@reddit
It's very good so far. I have a default prompt that every model fails except this one
NikoKun@reddit
Not to nitpick, but why are the links for the "unsloth" version? I could not get that working for the life of me.. But then I went and tried the standard "ollama run gemma4" model and that runs perfectly.
No-Leave-4512@reddit
Looks like Gemma4 31B is almost as good as Qwen3.5 27B
ShengrenR@reddit
plot in https://arstechnica.com/ai/2026/04/google-announces-gemma-4-open-ai-models-switches-to-apache-2-0-license/ implies it is better at least in .. some dimension lol
Murinshin@reddit
That’s 397B up there, not 35B or 27B
Randomdotmath@reddit
not the Elo ranks, the benchmarks. idk how they can get such a high Elo while losing most of the comparisons
Swimming_Gain_4989@reddit
Gemma models typically output a nicer aesthetic (better prose, formatting, etc.). If I had to guess, they're probably heavily weighting head-to-head scoring mechanisms like LMArena.
uncommonsense24@reddit
Definitely noticing this as the biggest jump from Qwen 27b. It's prompting me back, keeping the conversation going and helping me think towards solutions alongside it. This is a very interesting experience!
tobias_681@reddit
Do they lose most? I don't think that's the case.
I would expect these models to have better language skills and possibly better broad knowledge (likely what sways LM Arena). While at the same time having likely worse analytic rigour, likely worse in agentic tasks or highly specific scientific work. Tau2 might be a decent proxy. Qwen scores extremely well there, in fact Qwen3.5 4B scores higher than 27B on that benchmark and either model is better than any of the Gemmas. It's definitely something these models are very optimized for. I would imagine the Gemma models to be better generalists. Also the Qwen models think obscenely long, especially the smaller ones. If you get comparable performance with less thinking that's a win.
Would also wait for independent benchmarks. From a first little test I do find them to perform favourably against Qwen but not in a blowing them out of the water way, at a comparable level, likely with different strengths and weaknesses.
ShengrenR@reddit
look straight down from them. the 27B is on the plot.
a_beautiful_rhind@reddit
Heavily depends on your definition of good.
FUS3N@reddit
I am confused shouldn't it be better?
Weak-Shelter-1698@reddit
Let's goooo, best birthday gift ever!!!!
amelech@reddit
its my birthday too! are you running it on llama.cpp?
maartenyh@reddit
Happy Birthday!!! 🎂
Weak-Shelter-1698@reddit
Thanks 🥳❤️❤️
Final_Ad_7431@reddit
dense model beating out qwen3.5 397b is insane, even the moe edging it out, what a nice gift from google
SpicyWangz@reddit
That’s really hard to believe, but arena is one of the only benchmarks I pay attention to
Final_Ad_7431@reddit
i think in reality, now that the release hype is starting to dull down, we can see it's probably much closer to 27b which makes sense. still seems like a great release but qwen3.5 set such a high bar
Seventh_Letter@reddit
What's with all the bad grammar in these post replies? Interesting.
silenceimpaired@reddit
It help two hide bot, probs.
SeaworthinessThis598@reddit
this is so unreal, it's not even believable, technically gemini 3.1 for free, forever... what?! can someone pinch me?
oxygen_addiction@reddit
Not even close to Gemini 3.1 quality, but very good for the size.
SeaworthinessThis598@reddit
it's not gemini 3.1 quality but it's opus 4.6 level, let that sink in!
Spectrum1523@reddit
lmao
tobias_681@reddit
If it's true that the AA omniscience accuracy benchmark (general knowledge) is a predictor of model size, then Gemini 3 is likely the largest model that exists which is likely its biggest strength. I'm curious how benchmarks will turn out but I would suspect something more akin to the small Qwen 3.5 models with less overthinking and probably slightly worse at very technical tasks, slightly better in other domains.
FluoroquinolonesKill@reddit
Um...holy shit this thing has no qualms about enterprise resource planning. ;)
Spectrum1523@reddit
yeah wtf it fucks
BannedGoNext@reddit
You are implementing an ERP with an LLM?
FluoroquinolonesKill@reddit
Yes, local LLMs seem particularly well suited for that task.
notdba@reddit
Eh it is still using the weird interleaved thinking mode. The other 2 new models, Trinity Large Thinking and Qwen3.6 Plus, already embrace the preserved thinking mode.
mikael110@reddit
Personally I prefer that, as preserving thinking means the context size balloons really, really quickly. And personally I haven't actually found that models that preserve thinking perform that much better than those that don't.
notdba@reddit
Do you run local inference on consumer hardware? Because interleaved thinking also breaks prompt caching.
These days, the best models like GLM-5 and Qwen3.5 support long enough context, and also don't think for too long in between tool calls. Preserved thinking should be the way forward.
silenceimpaired@reddit
Mind blown by the licensing.
VampiroMedicado@reddit
How do I run this model on llama-server? I have the latest version, and either the server shits the bed or it repeats random tokens.
llama-cli works fine.
Cool-Chemical-5629@reddit
Gemma 4 E4B beats Gemma 3 27B...
ghulamalchik@reddit
Gemma 4 models have thinking on by default, that certainly helps.
Odd-Ordinary-5922@reddit
the 26b a4b beating qwen3.5 27b is crazy
Wooden-Deer-1276@reddit
it doesn't
some_user_2021@reddit
Did you check?
FlamaVadim@reddit
it's just impossible
letsgoiowa@reddit
Dude we have tiny models blowing out the 1T GPT4
Architecture advances.
Borkato@reddit
Holy fuck, that's the model I'm the most excited about. Qwen 35B is SO good that I desperately want something with 27B's quality (even better, but way slower) at 35B's speed. So holy crap I'm so excited
misha1350@reddit
Cool off. Qwen 35B A3B is a multi-modal model first, coding second. Apart from coding (basically in most OpenClaw cases), Qwen3.5 is still SOTA. Gemma 4 E4B badly loses to Qwen3.5 4B and 9B in most benchmarks. Give it some time and give them both a spin and compare them or have someone else compare them for you, and you'll likely see that Qwen3.5 is still extremely good.
Borkato@reddit
I’m using Gemma 26BA5B or whatever it is and it already seems way better than qwen 35B
misha1350@reddit
What do you use it for?
Borkato@reddit
Code and general tech help
EbbNorth7735@reddit
In ELO. Most benchmarks show Q3.5 27B and 122B beating G4 31B from what I can tell.
amchaudhry@reddit
Wonder why no 7/9/12b models? I'm currently on gemma3:12b and it runs well on my 9070XT.
jacek2023@reddit (OP)
is E4B worse?
amchaudhry@reddit
Yeah
6kmh@reddit
yes
hp1337@reddit
WOW! Look at MRCR V2. This is game changing! Long context rot has been the biggest problem with medium sized open source models. Going to test it now!
Borkato@reddit
Wait what’s MRCR?
Endonium@reddit
MRCR v2 is a "needle in a haystack" benchmark to test for long-context performance. A higher score means the model is better at finding small pieces of information hidden in a sea of text.
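A toy version of such a test is easy to build yourself; this is a hypothetical harness to illustrate the idea, not the actual MRCR methodology:

```python
def make_haystack(needle, filler, n_filler, position):
    # Bury one "needle" sentence inside n_filler copies of filler text.
    parts = [filler] * n_filler
    parts.insert(position, needle)
    return " ".join(parts)

prompt = make_haystack(
    needle="The secret code is 7341.",
    filler="The weather report repeats itself endlessly.",
    n_filler=1000,
    position=500,
)
# Send `prompt` plus "What is the secret code?" to the model,
# then check whether "7341" appears in its answer.
```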
Borkato@reddit
Oh that’s wonderful!!
WhatIs115@reddit
So what gives? I'm seeing extra Vram usage.
I can load 2 ggufs with llama, a 10.8GB Qwen3.5-27B IQ3_XXS and a 11.5GB Gemma 31b IQ3_XXS gguf with the same settings (tested with Cuda 13 and Vulkan llama builds). I'm seeing 3GB more Vram and IQ3_XXS barely fits on my 16GB.
SeaworthinessThis598@reddit
ok i may have overstated but for sure opus 4.6 territory.
Skyline34rGt@reddit
Q4K-m gguf from LmStudio model of 26b model got me 'fail load'...
unrulywind@reddit
yeah, to run it on a 5090, I had to take it down to 32k context with Q4_0 kv cache. Makes it a bit limited. Even the 26b version had to use Q4 kv cache at 128k, otherwise it ballooned up and failed.
Now I understand why Google was recently publishing papers on how to reduce the size of KV cache.
Looks like they built a purpose for their TurboQuant.
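The ballooning is easy to reproduce on paper with a generic KV-cache estimator; the layer/head numbers below are placeholders for illustration, not Gemma 4's actual config:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # 2x for keys and values; one cached vector per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Hypothetical 48-layer model with 8 KV heads of dim 128:
print(kv_cache_gb(48, 8, 128, 128_000, 2))    # fp16 cache at 128k context, ~25 GB
print(kv_cache_gb(48, 8, 128, 128_000, 0.5))  # q4_0 cache is 4x smaller, ~6.3 GB
```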
Skyline34rGt@reddit
Ah, runtime CUDA 12 support is coming soon
Guilty_Rooster_6708@reddit
Thanks for posting this. I was wondering why I have the same error
Geritas@reddit
You need to download the new runtime AND switch it manually in settings, it was available like 10-15 minutes after the release.
Skyline34rGt@reddit
Done, works now.
remoteDev1@reddit
the 26B MoE fitting on 16GB is what I've been waiting for. been running qwen 3.5 27B for code stuff and it's solid but slow - if this thing is comparable quality at those inference speeds people are reporting I might finally have a daily driver that doesn't make me stare at my terminal for 30 seconds between completions.
Barry_Jumps@reddit
Apache 2 yaaassss
Firstbober@reddit
Where is Gemma 4 270M... Awesome release, I hope Google will release such a small model again. It's incredibly capable for its size, and I don't think there is any other similarly sized alternative.
Embarrassed_Soup_279@reddit
praying this comes out... in the meantime you could play around with lfm2.5 350m or bonsai 1bit models
Firstbober@reddit
Are lfm2.5 350m or bonsai 1bit models easily fine-tunable? I'm kinda stuck with LlamaFactory as it's easy and does what I need it to. Although I think bonsai is just XOR for tunes?
Prestigious-Crow-845@reddit
What is the use case for a 270M model? Always wondered.
Firstbober@reddit
Very high-detail embeddings, insanely quick to experiment and fine-tune, takes minutes with solid GPU, and even on CPU you can probably produce an ok-ish tune in matter of hour or two. Generally with function calling and proper specialization i.e. docs and stuff and RAG it produces really sensible output really fast.
If you do need to perform some language analysis task, and it's too much for general NLP tools like spaCy, then such small models are your best bet unless you have compute capacity for larger ones, or you are willing to hit API for every small thing.
Also, I have a personal challenge to see how far one can push such a small model, and the more "smart" the base, the better ;)
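The embeddings-and-RAG style workflow described above can be sketched with plain cosine similarity; the `embed` function here is a toy stand-in for whatever small model you'd actually use:

```python
import math

def embed(text):
    # Toy stand-in embedding: a 26-dim character-frequency vector.
    # A real setup would call a small embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs):
    # Rank documents by similarity to the query embedding.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
```

With real embeddings from a 270M-class model, the `embed` call is the only part that changes.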
Craftkorb@reddit
Commonly used as base for small embedding models, classification (with custom fine-tune), and general quick experimentation.
jeffwadsworth@reddit
Multi modal beauty. Looking forward to testing.
WhoRoger@reddit
Are we in for the same 'overthinking everything to oblivion' like with Qwen?
egauifan@reddit
Which one is best for a 5090 with 64GB RAM? Gemma 4 26B A4B Instruct?
Adventurous-Paper566@reddit
Wow 26B A4B, my dreams are coming true! Q8 will fit in 32Gb!
A little disappointed that the dense model isn't a 27B, for me it's an admission of defeat against Qwen.
SpicyWangz@reddit
Lmarena gave it some very good rankings. I’m interested to see how it does
First_Ad6432@reddit
holy moly, im seeing infinite finetunes for it
AlternateWitness@reddit
Am I wrong, or does this look like it performs marginally worse than the Qwen 3.5 lineup?
It’s nice to see this class of model becoming more prevalent, but what would the use case for this be if Qwen 3.5 exists? Especially that 9b model…
Frosty_Chest8025@reddit
Why does Gemma-4 on Hugging Face show 25K downloads last month, even though it was not published last month?
https://huggingface.co/google/gemma-4-31B-it
jacek2023@reddit (OP)
it probably means the current month
Frosty_Chest8025@reddit
if it's a new model, it should say "current month"
MaddesJG@reddit
It's a bit late where I am, but I threw Gemma4-26b on my MI50 32GB. Ran it with -c 128000 -dev rocm0, used the UD Q4. llama-bench got about 939 +- 21 on pp512 and 76 on tg128
Ran a quick 2 prompt run with llama-cli and got about the same results.
I'll have to test some more tomorrow, I'm too tired rn.
philo-foxy@reddit
You released it at exactly the same time as the yt video dropped!?? Haha
jacek2023@reddit (OP)
video was posted in the comment :)
Guilty_Rooster_6708@reddit
was so excited about this, but in my Vietnamese -> English translation task Gemma4 is worse than Qwen3.5 in the same Q4 quant. It also failed the car wash puzzle :(
bjivanovich@reddit
Has Google applied TurboQuant to the model weights?
psychohistorian8@reddit
hmm my Mac is hard crashing when trying to load either the 31B or 26BA4B with LM Studio
The3RiceGuy@reddit
Is there any information available about the image encoder, beyond it having 550M parameters? ;)
PiratesOfTheArctic@reddit
I have a basic laptop i7 with 32gb ram running qwen3.5 4b q5_k_m with llama.cpp. Swapped it over to gemma-4-E4B-it-Q4_K_M.gguf (with some flags) and not only is it faster, it gives significantly better answers
I'm very much a newbie, but even saw the difference when using it for finance analysis
jacek2023@reddit (OP)
That's the power of LocalLLaMA
PiratesOfTheArctic@reddit
Back in the 90s I used to program assembly, and whilst this old decrepit mind isn't sharp enough to do that anymore, I know what end results should be, and how they should be processed. So I'm having great fun giving it a good pokey pokey, laptop is having a meltdown, all good fun!
jacek2023@reddit (OP)
I was on demoscene in the 90s and I won some competitions with assembly :)
PiratesOfTheArctic@reddit
Good old days! Do you remember the 1k game competitions?!
jacek2023@reddit (OP)
Yes but I was doing 64k intros, with music and 3D :)
I tried to use local LLMs to generate some effects in Python or HTML, there was a bigger problem with C++ and some libraries like SDL, not sure how to use assembly in 2026 to render something, but maybe it's possible.
PiratesOfTheArctic@reddit
This is why we need to learn the pokey pokey method, keep poking until it works!
Today, I discovered if I put a # in front of a url, the web interface reads it, I've become a hacker once again ;)
Bitter-Breadfruit6@reddit
I was waiting for the 120b rumors, so this is disappointing. I think there are limitations due to the model's size, no matter how well it is trained.
jacek2023@reddit (OP)
it's possible that 124B model was planned but failed in benchmarks/ELO, or maybe it will be released later
FlamaVadim@reddit
...or it was too good compared to Gemini Flash
Zc5Gwu@reddit
It would be odd to train it and then do… nothing.
a_beautiful_rhind@reddit
What if it was A4b?
Bitter-Breadfruit6@reddit
I wish that were true.
jacek2023@reddit (OP)
We are now in April
sammoga123@reddit
I think you'd better forget about Llama; I heard they're definitely not going to release any more open-source models.
jacek2023@reddit (OP)
What about these Avocado rumors?
sammoga123@reddit
I've already seen like 4 "secret" models, the most recent one is actually called "Leviathan" XD
They all seem to be in testing at Meta AI, but I had already seen that, according to Mark, they were going to focus on making closed-source models to compete with the rest. You know, Llama 4 was the worst model in 2025, and apparently that really hurt their egos.
berahi@reddit
That's exactly it, stories about Avocado usually mention it would be proprietary and not available for download.
sine120@reddit
The new Intel GPU isn't horrible for 32GB.
xspider2000@reddit
In LM Studio, you can try Gemma 4 via the CPU or Vulkan backend if you have an AMD iGPU. Gemma 4 26B A4B model on my Strix Halo via Vulkan gives about 50 tokens per second.
dampflokfreund@reddit
Oh, great news! Thinking, system role support, more context: basically everything everyone asked for, and a 35B competitor MoE too.
But aww man, audio is E2B and E4B only, that's a bit of a bummer. I thought we were about to have native and capable voice assistants now, but these are too small. What I was hoping for is larger native multimodal models that can input and output audio natively.
Zc5Gwu@reddit
I wonder if a smaller model could call a larger model as a tool reliably...
boutell@reddit
Yes, I was thinking: just use it for the recognition and feed the output directly into a larger model. Don't even bother with tool use; make that the loop.
Zc5Gwu@reddit
At that point doesn’t seem much different than whisper or parakeet though.
Automatic-Arm8153@reddit
That’s the optimal method
Hefty_Acanthaceae348@reddit
If the small model is only used for voice, there is no need for tool calling, just use a deterministic pipeline
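Something like this, as a toy sketch; `transcribe` and `respond` here are stand-ins for whatever local backends you actually run, not a real API:

```python
# Toy sketch of the deterministic pipeline: the small audio model's output
# is piped straight into the big text model, with no tool calling involved.
# transcribe/respond are hypothetical stand-in callables.
def voice_pipeline(transcribe, respond):
    """Compose audio -> text -> answer as a plain function."""
    def handle(audio_bytes):
        text = transcribe(audio_bytes)  # small audio-capable model (e.g. E4B)
        return respond(text)            # larger text-only model
    return handle

# Wiring it up with dummy backends just to show the flow:
pipe = voice_pipeline(
    transcribe=lambda audio: "what's the weather like",
    respond=lambda text: "answer to: " + text,
)
print(pipe(b"raw audio bytes"))  # -> answer to: what's the weather like
```

Since the small model's output always goes to the same place, there's nothing for it to "decide", which is the point of making the pipeline deterministic.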
Illustrious_Car344@reddit
IIRC this is literally an example use case Google cited when they released Gemma 3.
Borkato@reddit
The benchmarks suggest E2B and E4B are great! 👀
Thigh_Clapper@reddit
I was trying to compare against qwen3.5 4b and it doesn’t look like it’s much better? Or am I missing something?
j0j0n4th4n@reddit
Indeed, but Qwen3.5 4B is at the level of gpt-oss-20B, and in some cases gpt-oss-120B; it's by no means a weak model. Likewise, Gemma 4 E2B is at least at the level of Gemma 3 27B, at least as far as Google's benchmarks go.
dampflokfreund@reddit
Might be, but they are still small models and the MoE and the 31B dense are obviously a lot better. These capabilities with native audio support would have been great to have. But I guess it is not the time yet for that
MoffKalast@reddit
A system prompt for Gemma? Hell really has frozen over this time.
Illustrious_Car344@reddit
Didn't Gemma always accept a system prompt, just didn't treat it differently from the user/assistant prompt? I wonder if this one is real.
MoffKalast@reddit
Yeah, also known as not having a system prompt lmao.
edgan@reddit
Run E4B on your phone and have it send it to the desktop running the bigger models.
Excellent_Koala769@reddit
Is this designed for MLX?
AvidCyclist250@reddit
Oh, the hype isn't bullshit! Comparing the MoE model favourably to qwen 3.5 in my own tests right now. It's getting some very tricky shit right! STEM and philosophy, that is. And it's fast despite partial offload. Sweet af.
Craftkorb@reddit
Comparison table for Gemma4 31B + 26B and Qwen3.5 27B and 35B, source is their respective huggingface pages (Self reported values).
andy2na@reddit
Anyone have luck using E4B as a Home Assistant voice assistant? I just get the response: GetLiveContext()
MaruluVR@reddit
How does it work? Can you stream audio to it without needing Whisper etc.?
andy2na@reddit
You can't yet; I'm still using Parakeet to do STT. I think llama.cpp needs to add support.
EveningIncrease7579@reddit
I was wondering exactly this; I don't remember whether llama.cpp supports audio. Any alternative (please, not Ollama/LM Studio)? Or maybe just wait?
andy2na@reddit
Im going to wait, my llama.cpp with llama-swap and parakeet + kokoro works well
paul-tocolabs@reddit
My app uses Gemma 2 and 3, so now I’ve got something to do for the weekend!
FoxTrotte@reddit
I'm trying to run gemma-4-E4B-it-GGUF on both my PC with Unsloth Studio and my phone with Off-Grid, and neither of them works. Anybody having the same issue?
Hefty_Acanthaceae348@reddit
Great, and I was just lamenting the lack of sub 30B MoEs!
MitsotakiShogun@reddit
IBM Granite 4?
Hefty_Acanthaceae348@reddit
I said lack not absence
MitsotakiShogun@reddit
Huh...
SpecialistBig4539@reddit
You should review your understanding of "lack" and "absence" per your link.
MitsotakiShogun@reddit
...that's why I shared it.
Kindly-Annual-5504@reddit
Finally, an open-source model that not only allows you to write in German but can also express itself very well in German. Multilingual capabilities have always been Gemma’s strength, and that’s still true for Gemma 4. No other open model has come close so far.
Qual_@reddit
Gemma was always better at EU languages than Qwen etc.
DOOMISHERE@reddit
why its super slow on DGX Spark ? :(
Grouchy_Ad_4750@reddit
you can't be serious?
The DGX Spark favors large MoE models over large dense models.
A rough formula is 273 GB/s (the Spark's memory bandwidth) divided by the active model size; for the dense Gemma 31B I would expect you are getting around 8 t/s.
If you want to utilize the Spark, try some large MoE instead (Qwen Next Coder...).
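Back-of-envelope sketch of that bandwidth math (numbers are the ones from this thread; treat the result as a ceiling, not a measurement):

```python
# Decode on a bandwidth-bound box: every generated token streams the active
# weights from memory once, so t/s is roughly bandwidth / bytes read per token.
def est_tps(bandwidth_gb_s, active_params_b, bytes_per_param):
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# DGX Spark at ~273 GB/s, Gemma 4 31B dense at ~1 byte/param (8-bit):
print(est_tps(273, 31, 1.0))  # ~8.8 t/s ceiling
# A 26B-A4B MoE only reads ~4B active params per token:
print(est_tps(273, 4, 1.0))   # ~68 t/s ceiling
```

That's why the MoE is the better fit: fewer active parameters per token means far less memory traffic per token on the same hardware.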
DOOMISHERE@reddit
going back to the lovely MiniMax 2.5
Grouchy_Ad_4750@reddit
that has 10B active so you get much more t/s.
Whether it's smarter than 31B I can't tell, but the Spark was built for MoE :)
Appropriate_Car_5599@reddit
Doesn't work in the Jan app. Does anyone know something with a really cool UI/UX and great support for the latest models?
Tried LM Studio and Ollama, both of them suck. Thinking about Unsloth Studio but idk.
EndStorm@reddit
I wonder how this works with TurboQuant.
m98789@reddit
The key question: how does it compare to GPT-OSS-120B
chanbr@reddit
How does the e4b model compare to the 12b model? I really want to know...
RickyRickC137@reddit
Just a basic system prompt is good enough to jailbreak Gemma 4!!!
jacek2023@reddit (OP)
Maybe share some cool example
Mashic@reddit
lm studio showed me a notification to update the runtime to use it, but I can't find the compatible llama.cpp build to download?
Skyline34rGt@reddit
cuda12 runtime is not yet ready. need to wait
Mashic@reddit
How did lm studio ship it already?
Skyline34rGt@reddit
Works now.
jacek2023@reddit (OP)
maybe you need to switch from lm studio to something else today
toothpastespiders@reddit
I have a few random trivia questions I toss at models just to get a feel for their training data. Not so much expecting a right answer, but more to see how they fail and if they get the general gist of the topic even if getting the specifics wrong. 31b got my history, early American literature, and pop culture questions totally right and 26b came really close.
Hardly a real benchmark or anything. But it's the best I've ever seen from models this size.
Eastern_Pay8245@reddit
Anybody know if I can run this on M3 Pro w/ 18gb ram?
the__storm@reddit
llama.cpp Vulkan b8637 + 26B-A4B-it-UD-IQ4_XS (on 7800 XT 16GB) seems to have a bug in its fit/context size estimation (or at least it's way too conservative). Using --fit I have to dial the context target all the way back to 256 (lol) to get it to not offload any layers, but if I force --ngl 99 it complains a bunch but loads and runs fine up to a context of about 20K.
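For reference, the forced-offload workaround looks something like this (model name and context from my run above; exact flags may differ for your build):

```shell
# Sketch: skip the overly conservative fit estimate and force all layers
# onto the GPU, capping context at the ~20K that still loads.
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --ngl 99 \
  -c 20480
```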
SamuelL421@reddit
I've not used any of the Gemma models before, is there room to run these (either 26B A4B or 31B) with reasonable context if you have 32gb or 48gb setup of VRAM?
Middle_Bullfrog_6173@reddit
FWIW, on my short gauntlet of multilingual language-modeling tasks that I was still using Gemma 3 for:
- 26B A4B clearly beat Gemma 3 27B
- 31B edged out Gemini 3.1 Flash Lite
This is short context, no coding. I'd expect even larger improvements in agentic stuff vs Gemma 3.
bakawolf123@reddit
What is this elo graph coming from? Comparing the reported test numbers alone it looks to be on par with Qwen3.5 27B, some scores higher, some lower.
jacek2023@reddit (OP)
I don't trust benchmarks anymore because models are benchmaxxxed. Elo should be the only valid benchmark because it's based on arena votes from humans, but even that could somehow be broken in 2026. It's arena.ai; it was called LMArena before.
bakawolf123@reddit
Thanks. Well, you've gotta be cautious trusting anything LLM-related in 2026: this arena has 31B at the same score as Sonnet 4.5, which leaves me very doubtful. Google has probably received enough user traces from this arena for Gemini and now has a decent idea what users there vote for, and can skew in that direction, e.g. making the model hallucinate more instead of admitting it can't answer.
Independent-Act-6432@reddit
what is google’s incentive to even spend resources to do what you’re suggesting for an open source model?
bolmer@reddit
We humans love being lied to
FastDecode1@reddit
Does llama.cpp support speculative decoding for Gemma4 right now?
Was disappointed with Qwen3.5, for which speculative decoding is still WIP in llama.cpp.
durden111111@reddit
People are talking about Gemma 4 being worse despite not even being able to run the GGUFs yet, as llama.cpp doesn't have support.
shockwaverc13@reddit
so sneaky, that was unexpected
Firepal64@reddit
OH MY GOD that's so clever, i wouldn't have been able to clock it in the sea of PRs
ShengrenR@reddit
so... do I not have to rebuild from source this morning lol? what version am I looking for heh
alex_pro777@reddit
Guys... please, don't be so naive... I'm testing this SOTA-like 31B in AI Studio right now. It's pure shit compared to Qwen3.5-27B... infinite loops and no ability to read text from the image... Not quantized!
AdCool7335@reddit
💔
dampflokfreund@reddit
I have noticed the loops as well. However, even though Google runs AI Studio, there's likely still a bug in the implementation, as with every new release.
FlamaVadim@reddit
noooo 😫
CaptBrick@reddit
Fuuuuuu
Odd-Ordinary-5922@reddit
are they releasing qat versions?
AnonLlamaThrowaway@reddit
Gemma 3 QATs only showed up weeks after the initial release, so... probably
AnonLlamaThrowaway@reddit
Gemini 3.1 Thinking is suggesting that the new architecture means there's no need for QAT anymore, but I don't know enough to know whether or not that's bullshit
itsdigimon@reddit
I hope so :')
Mashic@reddit
Why nothing in 9-15b sizes?
misha1350@reddit
Because no one would want a mediocre model like that. Or maybe they had something, but decided to scrub that when they realised that Qwen3.5 9B drinks their milkshake
ML-Future@reddit
R.I.P Qwen3.5
misha1350@reddit
Qwen3.5 outdoes Gemma 4 in certain benchmarks. But when Qwen3.5/3.6 Coder rolls around, it'll be game over. Unless Alibaba completely drops the ball with whatever new tech lead they have right now.
FlamaVadim@reddit
have you tried gemma 4?
so wait till tomorrow...
jacek2023@reddit (OP)
both gemma 4 and qwen 3.5 will be the leaders of local scene, and we should get qwen 3.6 at some point
OpenAI wake up!
gofiend@reddit
Pretty insane to see the E4B model beating one of the best models from last year. Unlikely to be true in broad real world use but a great signal anyway
popiazaza@reddit
This is much more interesting than their Gemini models.
Both Gemma 4 31b and 26b-a4b have higher elo than their proprietary Gemini 3.1 Flash Lite model.
This would be a game changer for a local model.
Ok_Zookeepergame8714@reddit
Good-level intelligence humanoids soon at your doorstep! They're going to squeeze it into a robot very soon! Just imagine the mess on the Russo-Ukrainian front line in a year. It's gonna be Terminator 1, live. 🤯
jacek2023@reddit (OP)
XMasterDE@reddit
Why would you list the Unsloth links instead of the actual repos?
jacek2023@reddit (OP)
Because most people on LocalLLaMA want GGUFs; there is also a link to the Google collection.
swagonflyyyy@reddit
"Generate a humorously complicated python code that simply prints out hello world. The code should be as convoluted and hard to read as possible while remaining functional"
Oh, so you want me to turn a simple task into a digital fever dream? Fine, but don't come crying to me when your brain short-circuits trying to parse this masterpiece.
There you go—one "Hello World" wrapped in enough unnecessary layers to make a senior developer weep. You're welcome.
amejin@reddit
I'm not sure what it says about me that I thought this would be the way to do it and this is what it did... But it added error handling so there's that...
Neither_Nebula_5423@reddit
Friendship ended with Qwen; all hail the dark lord Gemma 4
Daniel_H212@reddit
Had gemini generate a visualization of benchmark scores between gemma 4 and qwen3.5 for me (model cut off on the right is qwen3.5-35b-a3b)
mrpkeya@reddit
They're emailing about it too. So cool.
plaintexttrader@reddit
This may be the Swiss-army-knife, one-size-fits-all open-weight model… text/image/video/audio IO, MoE, reasoning, etc.
meh_Technology_9801@reddit
Cool. I was wondering if Gemma would be cancelled. It had been removed from AI studio after people got it to say offensive things about a senator.
toothpastespiders@reddit
I'd been worrying about that for a long time now. I'd gotten to the point where I was leaning further to thinking gemma was essentially dead.
No-Veterinarian8627@reddit
Thank the lord. Multilingual support is often ignored; most models focus on English. If it is any good, I hope to use it for some small tasks at the office (the 26B A4B model).
FullOf_Bad_Ideas@reddit
Nice, though I wish they'd target sizes that weren't occupied by Qwen 3.5 already.
jacek2023@reddit (OP)
why?
FullOf_Bad_Ideas@reddit
Some sizes like 15B, 50B, 90B, 150B, 300B are pretty empty right now.
People who could already run Qwen 3.5 27B will be able to run Gemma 4 31B, but people who were looking at a touch smaller 10-20B models, or bigger 40B+ models still have limited choice.
j_lyf@reddit
Can it be used with MLX?
Live-Crab3086@reddit
this is very cool. how can you run the 2B, 4B audio-enabled models locally to make audio assistants?
RickyRickC137@reddit
A 100B+ MoE would have been a killer. But grateful for Gemma 4 though. Waiting for the Heretic version to come out.
HopePupal@reddit
dense 31B? damn. good week to have bought a 32 GB GPU.
Chance-Studio-8242@reddit
how does it compare to qwen 3.5 27b? can't wait!
sine120@reddit
Seems like Qwen3.5 is better at coding, and Gemma 4 is better with knowledge. My guess is the rest will come down to personality/preference. You'll probably just have to test with your use cases.
AppealThink1733@reddit
What is this "it"? I see that when a model has this "it", it doesn't have all the functionality like vision and audio, etc.
jacek2023@reddit (OP)
instruct
AppealThink1733@reddit
Thanks.
SmartMagician09@reddit
any technical report released?
LeHiepDuy@reddit
Anyone have a working template to use with openclaw? Gemma 4 E4B Instruct is not working with the default Jinja template in LM Studio. I'm looking to test its agentic ability.
LeHiepDuy@reddit
nvm, the ChatML template works just fine
swagonflyyyy@reddit
My fucking god, and I was JUST wondering about this model's release just now.
florinandrei@reddit
Nice. Gemma3 27B has been my favorite general-purpose conversational model for some time.
The 26B is a MoE, but the 31B is dense? Seems backwards?
Also, how is it doing with tools? I don't see a lot of explicit signs that it understands tools very well. Maybe I need to dig into it more.
Alone-Possibility398@reddit
They are actually good, considering they can give the Qwen3.5 series real competition.
DrNavigat@reddit
No, oh no, thinking no, no, please god, no, with Gemma no, no, no 😭😭
jacek2023@reddit (OP)
why do you hate thinking :)
DrNavigat@reddit
A lot of tokens for almost the same result. It's not good for people with fewer resources. Gemma was the last guardian of intelligent models without this token spend.
thereisonlythedance@reddit
Yeah thinking sucks on small models. It honestly doesn’t even add that much on larger models — just has a CFG type effect from repeating the prompt in a different way.
PunnyPandora@reddit
go back to using models that didn't have it and come back with the results
thereisonlythedance@reddit
I use older models all the time. It depends a bit on use case, but I regularly get better results with thinking turned off.
PunnyPandora@reddit
Thinking makes models perform better in almost every scenario.
the_mighty_skeetadon@reddit
You can turn it on or off...
DrNavigat@reddit
If you are right, this is a blessing
the_mighty_skeetadon@reddit
and peace be also with you
amejin@reddit
I have to say.. I can't find any other licensing info other than that Apache 2.0 attribution.
Have I missed something, or am I proud of Google right now? If I recall correctly, all the other Gemma models had usage restrictions.
Shoddy_Enthusiasm399@reddit
Free the Gemma 4 …what did they do again?
Far-Low-4705@reddit
LETS FUCKING GOOOOOOOOO
Technical-Earth-3254@reddit
Looking forward to the ggufs, especially for the 31B.
FinBenton@reddit
Waiting for heretic or hauhau aggressive before I test.
Skyline34rGt@reddit
gguf's already exists
Upstairs-Sky-5290@reddit
ok Im gonna try it with opencode/lmstudio as soon as it's out.
jld1532@reddit
The LM Studio staff pick fails to load. Anyone else?
jacek2023@reddit (OP)
switch to llama.cpp today
KokaOP@reddit
When will these be available over API? I hope for generous limits like the previous Gemmas.
jacek2023@reddit (OP)
welcome on r/LocalLLaMA
guiopen@reddit
Super cool that they also released the base models
TheOriginalOnee@reddit
Can any of these models be used with 16GB VRAM?
hyrulia@reddit
The 31B at Q3
jacek2023@reddit (OP)
26/2 = 13, so the 26B at ~4-bit is about 13GB and fits in 16GB.
xignaceh@reddit
Been using 27b for a year now, very content. Looking forward to upgrading!
CaptainAnonymous92@reddit
Can we get open omni models for all sizes, with at least Nano Banana 1-level image gen and editing, in like a Gemma 4.1/4.2 or something, please, Google? Finally getting a good-quality LM that can do images and editing too is something I've been waiting for.
hyrulia@reddit
For 16Gb VRAM, 26B-A4B-it-UD-IQ4_NL and 31B-it-UD-IQ3_XXS fit perfectly. Probably the 31B would be smarter even at Q3
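Rough sketch of that arithmetic; the bits-per-weight figures below are ballpark values for these quant types (not exact), and KV cache comes on top:

```python
# A GGUF's weight payload is roughly params * bits_per_weight / 8 (in GB
# when params is in billions). Bit widths here are rough assumptions.
def quant_size_gb(params_b, bits_per_weight):
    return params_b * bits_per_weight / 8

print(quant_size_gb(26, 4.25))  # 26B at ~IQ4: ~13.8 GB
print(quant_size_gb(31, 3.1))   # 31B at ~IQ3_XXS: ~12.0 GB
```

Both land comfortably under 16 GB, which is why those two quants are the sweet spot for this card.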
atineiatte@reddit
Wow its context is inefficient
MoffKalast@reddit
What, don't you guys have ~~phones~~ a TPUv7 with 192GB of HBM?
Baphaddon@reddit
Chef Demis has concocted another dish
fake_agent_smith@reddit
This is amazing: a 31B model achieving what only SOTA managed not so long ago. HLE at 19.5%. Just wow.
9r4n4y@reddit
Qwen3.5 27B has a 22% score?? So among models under 35B parameters, it is not the SOTA.
Adventurous-Gold6413@reddit
The 26ba4b better be gudd
MoffKalast@reddit
Let's hope it's less unhinged than the previous three :D
n8mo@reddit
Perked up as soon as I saw there’s a MoE model I’ll be able to run on my machine
ML-Future@reddit
It seems that Gemma4 2B has capabilities that are similar to or better than Gemma3 27B
windows_error23@reddit
Yes to audio modality. I wonder how good it’ll be
petuman@reddit
Audio seems to be exclusive to E2B/E4B?
Murinshin@reddit
At this trajectory we will unironically have Opus 4.6 level models by the end of the year, and then things will get very interesting
Zestyclose-Ad-6147@reddit
Hypeee
ffgg333@reddit
How easy is it to fine-tune in comparison to Gemma 3? Will it be easier? Is it more censored?
Sakiart123@reddit
So far, from what I see, it's better than Qwen 3.5 27B. Which is huge.
Mean-Ad1493@reddit
Will they be putting out the turboquant versions?
ebolathrowawayy@reddit
Would be great if the presenter spoke better English and if most of the video wasn't a bunch of useless words. Why are companies so bad at presenting information?
Nyghtbynger@reddit
ooooh. I'm never that early to the party. please allow me.
*GGUF when ?*
jacek2023@reddit (OP)
try again
Final_Ad_7431@reddit
Unsloth already has GGUFs up for every model. I hope the latest tag of llama.cpp can run Gemma 4 though; my current old build doesn't.
Prestigious-Use5483@reddit
Can't wait to try 31B UD 4 K XL
pseudoreddituser@reddit
These benchmarks look insane, hope it lives up to those!
jacek2023@reddit (OP)
MundanePercentage674@reddit
https://www.youtube.com/watch?v=jZVBoFOJK-Q
jacek2023@reddit (OP)
thanks!!! added
Everlier@reddit
it's been a quiet Thursday evening... I wanted to play some Crimson Desert...
But now I have something much, much better to do :)