Just how powerful is Google’s Gemma 4?

[-]

Zealousideal-Yard328@reddit

I benchmarked Gemma 4 E4B specifically on enterprise tasks — structured JSON output, compliance, and reasoning. Thinking mode makes a noticeable difference. Results and methodology here: https://aiexplorer-blog.vercel.app/post/gemma-4-e4b-enterprise-benchmark

[-]

Stepfunction@reddit

IT'S OVER 9000!!!

[-]

jugalator@reddit

Very good for the size for creative writing and above all language support! I've never seen this good language support from something 31B, even lesser languages.

[-]

Equal-Ad9264@reddit

Can it do a good image analysis?? More specifically if I need to understand the attributes of furniture shown in the image and tell me type of furniture, style, material and color?

[-]

AvocadoArray@reddit

I've been running it through some personal benchmarks and comparing it to Qwen 3.5 27b / 122b.

Coding (general): Seems to be about on par with Q3.5, but haven't tested with long multi-turn conversations yet.
Coding (visual): Produces cleaner designs and doesn't go overkill with purple gradient aesthetics on every little thing. Much better IMO.
Visual understanding: Roughly on par, but seemed to capture more detail from the image that provided a better overall response (or at least reasoned through the details better).
Tool Calling: Not sure if it's just a VLLM thing right now, but it seems to only want to call one tool at a time / per response. For example, if I give it a prompt to take a screenshot using a Node.js script, read the screenshot, and then give an analysis on that screenshot. It takes the screenshot and saves it to a file, then asks the user to provide the image instead of doing so on its own.
Vibe / Sloppiness: Very different from most other LLMs. Less emojis, unsolicited praising, and other "LLM-isms". I'd definitely prefer this model for general proofreading, technical analysis, or writing content that needs to sound more "human".

[-]

GrungeWerX@reddit

What about "not just this, it's that" slop?

[-]

AvocadoArray@reddit

It’s not just an improvement, it’s an evolution in how AI talk to humans.

Jk. It does still have some of that at times, but it’s definitely toned down compared to everything else I’ve run.

[-]

po_stulate@reddit

Which gemma4 model do you use? I've tried 31b (as I thought it would be the most capable one), but it feels like a huge step down from qwen3.5-122b-a10b for me.

[-]

AvocadoArray@reddit

Sorry I left that out, I've been running 31b dense.

I started testing with UD-Q8-K-XL, but started noticing some weird token accuracy issues. Sure enough, GH issues started popping up in the llama.cpp repo with a slew of confirmed bugs, Not sure if it's fixed yet, but I'd say hold off on judging it if you've only tested with llama.cpp so far.

The rest of my testing has been in VLLM using the full official BF16 weights since no FP8 weights were available yet. Will download an FP8 quant tonight and test with that as well.

[-]

po_stulate@reddit

Thanks. I was using UD-Q8_K_XL too and yes only llama.cpp for me. If it is really that good on VLLM I think I'll wait and test it again.

[-]

AvocadoArray@reddit

Keep an eye here: https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/discussions/3

I'll update my post there once the fixes are in place and confirmed working.

[-]

po_stulate@reddit

Haha, I also had similar issues with it. The model claimed that there're some typos in my script and that it fixed it but there's no typo and it didn't fix anything:

Fixed Typos:
Changed /dev/urandom →→ /dev/urandom.
Changed magick →→ magick (assuming ImageMagick 7).

I asked it to parallelize the script, it also didn't realize that it needs to make the cache file path different for each thread/iteration or they're going to overwrite each other. Qwen3.5-122b didn't have this issue too, wonder if this can also be a llama.cpp issue.

[-]

AvocadoArray@reddit

Yes, those are the exact problems I was having, I suspect it was also leading to other brain-damaged responses, but this one was the most obvious in my testing.

That specific issues isn't present in VLLM, but it seems they're also fighting some tool-calling bugs in the tool parser.

Either way, take all results right now with a grain of salt. I'm sure these bugs will get ironed out by the end of next week.

[-]

NotumRobotics@reddit

Asked her to build a complete inventory management system with QR scanning/generating. \~15 minutes with sub-agents, 100% local. So far so good, far less iterations than other models we've tested.

[-]

Signal_Ad657@reddit

It’s so hot right now.