Gemma 4 31B passed 7/8 real-world production tests — including ones I designed to make it fail. Full prompts + outputs.
Posted by grassxyz@reddit | LocalLLaMA | 21 comments
I've been waiting for a capable free local LLM for a while. I think we're close — the quality is getting there fast, and Gemma 4 is the first open-weight model where I genuinely considered using it in production for simple-to-medium tasks.
To test that instinct, I ran both models (31B Dense and 26B A4B MoE) through 8 real-world tasks — not benchmarks, actual prompts I'd use at work. Shared everything so you can run the same tests yourself:
- All 8 prompts, copy-paste ready
- Full model outputs for the longer tests
- Demo app source (single HTML file, just needs a free AI Studio key)
Results verified by Gemini 3.1 Pro and Claude Opus 4.6 independently.
danigoncalves@reddit
On Test 6, did you try it with live docs search like Context7?
grassxyz@reddit (OP)
No. I know it would work, but I'd need to build the tool call first. The main focus of the test is to see what knowledge the model has on its own, whether it hallucinates, and whether it makes any other unforgivable errors.
ttkciar@reddit
I've been evaluating Gemma-4-31B-it for codegen, and it is very good. Not as good as GLM-4.5-Air, but with a sufficiently well-worded project specification it comes close, generates fewer bugs, and its context limit is twice Air's.
It's still leaving some features unimplemented in my trials, but I'm trying to figure out how to remedy that.
danigoncalves@reddit
GLM-4.5-Air is better than Gemma-4-31B? I am a little bit surprised by that, tbh.
grassxyz@reddit (OP)
Agreed. Interesting point on GLM-4.5-Air — haven't tested that one.
For the unimplemented features, have you tried breaking the spec into smaller chunks and feeding them sequentially? Context and instruction granularity seem to matter a lot with Gemma. But honestly, I think the current 256k context window is the bare minimum.
I'm confident the next version will be sufficient for most commercial use — with the right wrappers around context management and tool calling. The gap is closing faster than I expected.
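The chunked approach I mean looks roughly like this. Pure sketch: `ask_model` is a placeholder for whatever client you use, and the history-carrying is deliberately minimal.

```python
def feed_spec_in_chunks(ask_model, spec_sections):
    """Feed a project spec to the model one section at a time,
    carrying a short summary of completed parts forward, so each
    instruction gets full attention instead of being buried in one
    huge prompt.  ask_model(prompt) -> reply is whatever client you use."""
    history = []
    for section in spec_sections:
        # Keep only the last few "done" notes to stay well inside context.
        prompt = "\n".join(history[-4:] + [
            "Implement ONLY the following part of the spec:",
            section,
        ])
        reply = ask_model(prompt)
        history.append(f"Spec part done: {section[:60]}")
        yield reply
```

The point is granularity: one section per request, with a cheap running summary instead of the full transcript.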
ttkciar@reddit
I'm not sure what you mean. This is an example of the kinds of specifications I have been using:
http://ciar.org/h/wiki.03.txt
GLM-4.5-Air did very well with that, leaving nothing unimplemented, but Gemma-4 made some bad assumptions which I had to rectify with additional instructions (like "Write an html template file for every kind of web page the wiki uses" else it would generate html from functions within the wiki main program).
If I can close the codegen competence gap between Gemma4 and Air with better instructions, though, that will be a huge win, since Gemma-4-31B-it fits in my 32GB VRAM and GLM-4.5-Air does not (at Q4_K_M).
grassxyz@reddit (OP)
You probably want to put the file list at the end of the prompt. Anything buried at the top of a long spec (as in your spec txt sample) tends to get underweighted by the time the model starts generating. An explicit file manifest at the end also gives Gemma a checklist to work against, rather than needing to infer the full structure mid-output.
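Concretely, the layout I'm suggesting looks like this (file names are illustrative, not from your spec):

```python
def build_prompt(spec_text, files):
    """Place the long spec first and the file manifest last, so the
    checklist sits closest to where generation starts."""
    manifest = "\n".join(f"- {name}" for name in files)
    return (
        spec_text.rstrip()
        + "\n\nGenerate exactly these files, and no others:\n"
        + manifest
    )

# e.g. build_prompt(spec, ["wiki.pl", "templates/page.html"])
```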
ttkciar@reddit
Thanks! I'll give that a shot.
mtomas7@reddit
Could you also compare it vs Qwen-3.5-27B?
dzedaj@reddit
u/grassxyz please run those same benchmarks using Qwen3.5-27B - it's supposed to be better than Gemma4-31B in every aspect - is it?
grassxyz@reddit (OP)
If I have time I will compare it. Based on the specs, I think they are pretty close, but I need to test before drawing conclusions.
mrtrly@reddit
The 7/8 pass rate on actual work prompts matters less than the one failure. Production viability usually comes down to whether that failure is the kind you can live with for simple-to-medium tasks or the kind that kills the whole approach.
grassxyz@reddit (OP)
For me it is sufficient. Am I looking for better models? Yes. But this is good enough for most day-to-day use.
dzedaj@reddit
u/grassxyz
can you share your config? llama.cpp / vllm?
What settings and context size do you use, and on what hardware?
grassxyz@reddit (OP)
It is cloud hosted, as noted in my post. My old GPU can't run it.
RIRATheTrue@reddit
What settings did you use? Also what quant?
chuvadenovembro@reddit
I'd like to test gemma4 31b more, but I've downloaded several versions and all of them (when they work) are relatively slow. I tested 8-bit and 4-bit and still find it slow... There's an LLM called zen4 coder 80b that is much faster... I'm using a Mac Studio M2 Ultra with 128GB... I tested with MLX, llama.cpp, and LM Studio (the only one I haven't tried is ollama)... I use opencode to run my tests, against my real code, to have that baseline...
grassxyz@reddit (OP)
Yes, it is slow... we are spoiled by swift replies :) I hope we get some consumer-grade LPU at an affordable price in the future... BTW, how's your experience with OpenCode?
chuvadenovembro@reddit
I believe part of the speed loss comes from using opencode (claude code is much slower). Even so, the experience is quite pleasant: if the model is smart, it almost never stops to ask permission to do things.
verdooft@reddit
Interesting, thank you for sharing the results. I use the smaller Qwen 3.5 MOE model at the moment, but will try Gemma 4 soon.
chickN00dle@reddit
update us!