Gemma 4 31B passed 7/8 real-world production tests — including ones I designed to make it fail. Full prompts + outputs.

Posted by grassxyz@reddit | LocalLLaMA | View on Reddit | 21 comments

I've been waiting for a capable free local LLM for a while. I think we're close — the quality is getting there fast, and Gemma 4 is the first open-weight model where I genuinely considered using it in production for simple-to-medium tasks.

To test that instinct, I ran both models (31B Dense and 26B A4B MoE) through 8 real-world tasks — not benchmarks, actual prompts I'd use at work. Shared everything so you can run the same tests yourself:

- All 8 prompts, copy-paste ready

- Full model outputs for the longer tests

- Demo app source (single HTML file, just needs a free AI Studio key)

Results verified by Gemini 3.1 Pro and Claude Opus 4.6 independently.

https://github.com/useaitechdad/explore-gemma4

[-]

danigoncalves@reddit

On the Test 6 couldn't did you tried with live docs search like Context7?

[-]

grassxyz@reddit (OP)

No. I know it will work but will need to build the tool call. The main focus of the test is to see what knowledge it has and whether it hallucinate and any other non forgivable errors.

[-]

ttkciar@reddit

I've been evaluating Gemma-4-31B-it for codegen, and it is very good. Not as good as GLM-4.5-Air, but it comes close with a sufficiently well-worded project specification, it generates fewer bugs, and its context limit is twice that of Air's.

It's still leaving some features unimplemented in my trials, but I'm trying to figure out how to remedy that.

[-]

danigoncalves@reddit

GLM-4.5-Air is better than Gemma-4-31B? I am a little bit suprised by that tbh.

[-]

grassxyz@reddit (OP)

Agreed. Interesting point on GLM-4.5-Air — haven't tested that one.

For the unimplemented features, have you tried breaking the spec into smaller chunks and feeding them sequentially? Context + instruction granularity seems to matter a lot with Gemma (the limited context window). but honestly, I think the current context window 256k is the bare minimum.

I'm confident the next version will be sufficient for most commercial use — with the right wrappers around context management and tool calling. The gap is closing faster than I expected.

[-]

ttkciar@reddit

For the unimplemented features, have you tried breaking the spec into smaller chunks and feeding them sequentially?

I'm not sure what you mean. This is an example of the kinds of specifications I have been using:

http://ciar.org/h/wiki.03.txt

GLM-4.5-Air did very well with that, leaving nothing unimplemented, but Gemma-4 made some bad assumptions which I had to rectify with additional instructions (like "Write an html template file for every kind of web page the wiki uses" else it would generate html from functions within the wiki main program).

yes it is slow... we are spoiled to swift reply :) I hope we have some consumer grade LPU at affordable price in future ...BTW, how's your experience with OpenCode.

[-]