Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run
Posted by Disastrous_Theme5906@reddit | LocalLLaMA | View on Reddit | 303 comments
Tested Gemma 4 (31B) on our benchmark. Genuinely did not expect this.
100% survival, 5 out of 5 runs profitable, +1,144% median ROI. At $0.20 per run.
It outperforms GPT-5.2 ($4.43/run), Gemini 3 Pro ($2.95/run), Sonnet 4.6 ($7.90/run), and absolutely destroys every Chinese open-source model we've tested — Qwen 3.5 397B, Qwen 3.5 9B, DeepSeek V3.2, GLM-5. None of them even survive consistently.
The only model that beats Gemma 4 is Opus 4.6 at $36 per run. That's 180× more expensive.
31 billion parameters. Twenty cents. We double-checked the config, the prompt, the model ID — everything is identical to every other model on the leaderboard. Same seed, same tools, same simulation. It's just genuinely this good.
Strongly recommend trying it for your agentic workflows. We've tested 22 models so far and this is by far the best cost-to-performance ratio we've ever seen.
Full breakdown with charts and day-by-day analysis: foodtruckbench.com/blog/gemma-4-31b
FoodTruck Bench is an AI business simulation benchmark — the agent runs a food truck for 30 days, making decisions about location, menu, pricing, staff, and inventory. Leaderboard at foodtruckbench.com
MrCoolest@reddit
Can you run Gemma 4 31b in 24gb ram on a 3090?
AceHighness@reddit
it's a 19GB file
MrCoolest@reddit
So is that a yes or a no?
AceHighness@reddit
I think so... I think 19GB is less than 24GB. But I'm not 100% sure that's how it works.
MrCoolest@reddit
You still need space for your KV cache. I'm getting a 3090, I'll see if it works
cynido@reddit
please let us know the speed
nvmmoneky@reddit
Can I test that with a custom model?
EuphoricAnimator@reddit
Wow, those results are seriously impressive for a 31B model. I've been running a bunch of stuff locally on my M4 Max Mac Studio (128GB) and seeing similar trends - the newer models are just efficient. I’m mostly playing with Qwen 3.5, Gemma 4, and a rotating cast of things from Ollama.
Quantization is where it gets tricky though, especially with these MoE models. I've found Q4 is usually fine for basic chat with Qwen and even Gemma, getting around 20-25 tokens/sec on my setup. But if I’m trying to do anything with tool calling or complex reasoning, Q8 is almost essential. The difference in accuracy is noticeable, and those little mistakes can totally derail a tool call. I’m regularly using around 40-50GB of VRAM when running a Q8 30-40B model.
It’s a trade-off, obviously. Q4 saves a ton of VRAM - letting me run more models at once - but Q8 feels significantly more…reliable. I've noticed Q4 can sometimes "forget" earlier parts of the conversation, leading to weird outputs when it tries to use a tool. That said, for just quick back-and-forth, Q4 is perfectly usable.
Honestly, I’m a little skeptical of the super-low cost per run quoted here. My electric bill is definitely going up! But even if it's a bit higher, being able to get this level of performance locally is amazing, and beating GPT-5.2 at that price? That's huge.
BasaltLabs@reddit
that's an interesting system!
jkflying@reddit
How does the MoE model do?
Disastrous_Theme5906@reddit (OP)
MoE models didn't do well on our bench. Qwen 3.5 397B (17B active) only has 29% survival and negative ROI. DeepSeek V3.2 survives 62% of the time but still ends up in the red. Gemma 4 being dense and still beating all of them at 31B is honestly the most surprising part.
dampflokfreund@reddit
They are talking about Gemma 4 26B A4B.
Disastrous_Theme5906@reddit (OP)
Oh sorry, misread that. Haven't tested the 26B A4B yet, only the 31B dense. Running it now, will update the post and article with results in the next 12 hours.
My_excellency@reddit
Thank you!!
SummarizedAnu@reddit
Let's goo
EstarriolOfTheEast@reddit
Why not use a grammar? I believe the issue of failing to account for token boundaries should be addressed in most major libraries?
(I write my own samplers so I didn't closely track that).
Aside: 5 runs is still enough to be hit by variance. And, it's possible google trained for (not necessarily on) this benchmark. But then again, looking at positioning, it looks like what this measures is not intelligence or raw capabilities but instead pure agentic grounding and decision making in that context.
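For concreteness, "use a grammar" means something like this minimal llama-cpp-python sketch — the model path and grammar here are illustrative, not what OP actually runs:

```python
# Sketch: grammar-constrained decoding with llama-cpp-python. llama.cpp
# enforces the GBNF grammar at the token level, which sidesteps the
# token-boundary problems of parsing tool calls out of free text.
from llama_cpp import Llama, LlamaGrammar

TOOL_CALL_GBNF = r'''
root   ::= "{" ws "\"tool\":" ws string "," ws "\"args\":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9_ ]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="gemma-4-31b-it-IQ4_XS.gguf")  # hypothetical local file
grammar = LlamaGrammar.from_string(TOOL_CALL_GBNF)

out = llm("Call the restock tool for ground beef.", grammar=grammar, max_tokens=64)
print(out["choices"][0]["text"])  # output is forced to match the grammar
```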
anzzax@reddit
Thank you! Next one: Qwen 3.5 27B, please! I see the bigger 397B didn't do well, but just to confirm, how does the dense 27B do?
Strange-Base8809@reddit
!RemindMe 60h
nonumlog@reddit
!RemindMe 6h
monolith2303@reddit
!RemindMe 12h
Ifihadanameofme@reddit
So any updates?
Turbulent-Walk-8973@reddit
It's 12hrs now :)
SailIntelligent2633@reddit
!RemindMe 13.565333h
hesperaux@reddit
!remindme 16.63774h
semibaron@reddit
!RemindMe 72h
kuhaku03@reddit
!RemindMe 24
nomnom2001@reddit
!RemindMe 12h
Shad0wca7@reddit
!RemindMe 12h
condrove10@reddit
!RemindMe 12h
JorgitoEstrella@reddit
!RemindMe 24h
SailIntelligent2633@reddit
!RemindMe 72h
Constandinoskalifo@reddit
!RemindMe 24h
reddotr@reddit
!RemindMe 12h
roaringpup31@reddit
!REmind me 8 hours
Koalateka@reddit
!RemindMe 12h
Candiana@reddit
!Remindme 10h
BigFuckingStonk@reddit
!RemindMe 14h
sephiroth_pradah@reddit
!RemindMe 16h
Prudence-0@reddit
!ReMindMe 12h
SummarizedAnu@reddit
!RemindMe 15h
Iwaku_Real@reddit
!remindme 9h
Its-all-redditive@reddit
!RemindMe 12h
Witty_Mycologist_995@reddit
!remindme 12h
Photochromism@reddit
I’d also love to know this
lostRiddler@reddit
!RemindMe 24h
Foreign-Beginning-49@reddit
!RemindMe 24h
deathlymonkey@reddit
!RemindMe 24h
KPaleiro@reddit
!RemindMe 24h
engydev@reddit
!RemindMe
Azrox_@reddit
!remindme 13 hours
Lorian0x7@reddit
!RemindMe 24h
FenderMoon@reddit
!RemindMe 12h
infernys20@reddit
!RemindMe 12h
FenderMoon@reddit
I'm curious as well. There are a lot of us that can run the 26B one on 16GB systems but can't really run the 31B very easily.
(Technically you CAN run the 31B on a 16GB system if you use some wacky quant like IQ3_XXS, but that's a pretty trash quant, so for all intents and purposes I'm limited to the 26B on my system.)
Objective-Good310@reddit
How are you going to fit 26GB into 16GB of RAM when the system is also taking up memory? On my GPU it barely fits, and it's not exactly fast.
FenderMoon@reddit
It's not 26GB. It's much smaller than that with 4-bit quants. Because it's MoE, it performs fine if you run it entirely on the CPU and uncheck "keep model in memory".
I don't know what sorcery MacOS is doing under the hood but it seems to work better than expected.
-Ellary-@reddit
Running 31B IQ4XS at 16k KV Q8 at 10 tps, 5060 ti 16gb.
Even without thinking it performs really well, passed all my tests.
bonesoftheancients@reddit
where can i find the IQ4XS weights? huggingface search shows nothing
-Ellary-@reddit
https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/tree/main
bonesoftheancients@reddit
thanks
FenderMoon@reddit
I wish I could run it at IQ4. Unfortunately I can't allocate all 16GB on the Mac to VRAM (it's possible to get it to allocate about 14.5GB with terminal commands, but beyond that, it just crashes out).
Plasmx@reddit
Now you have a reason to sell your kidney and get that bigger Mac. /s
jazir55@reddit
Genuinely, how can anyone use such a small context limit? I'm over here with 1M tokens being far too small, and that context limit is effectively just double what GPT-3.5 had. I can't even conceive of the use cases for that small a context window.
-Ellary-@reddit
It all depends on the usage, skills, and tools for context compression. idk why people need 1M context when the effective range of local models is around 32-64K. If you're from the 8K max context era, like Gemma 2 was, this is not a problem: RAG, dynamic context injection similar to a lorebook, etc.
jazir55@reddit
I have an extremely large codebase of over 60k lines and another well over 1M lines of code. 16k context usage for those projects is impossible. The codebase is so large that it legitimately needs a large context window to actually work on the project and understand how it all meshes together.
-Ellary-@reddit
And I'm auto-filling spec forms using RAG over a database file; even 8K is enough.
jazir55@reddit
Could you clarify?
Material_Hour_115@reddit
They are using a different workflow from yourself.
"Hey agent, ingest all one million lines of my codebase and then start making changes across multiple subsystems" is frankly not something people who are experts at LLM-driven development do very often.
The person you're responding to, instead, uses a pipeline where they automatically generate small specifications (specs) - highly detailed instructions that define exactly how to solve one particular task. Their project then becomes the sum of many lesser agents reading many small specifications and executing many small tasks and building many small functionality modules that fit together through data exchange contracts explicitly described in the specifications, rather than one giant monolith.
Their agents can use a relatively small context because no single agent ever needs to understand the whole project at once, it just has to tactically build one module at a time from one spec at a time. 16k context is 2000-ish lines of instructions or code, which is typically more than enough to add one piece of functionality, fix one bug, etc. in a well-structured project.
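A rough sketch of what that loop can look like — the file layout, spec convention, and run_agent are all hypothetical, the point is just that each agent call sees one small spec plus only the files that spec names:

```python
# Sketch of the spec-per-agent pipeline described above. No agent ever loads
# the whole repo; each call gets one spec and the handful of files it names.
from pathlib import Path

def run_agent(prompt: str) -> str:
    """Stub: replace with a call to whatever small-context model you run."""
    return "# model output would go here"

def build_from_specs(spec_dir: str = "specs") -> None:
    for spec_path in sorted(Path(spec_dir).glob("*.md")):
        spec = spec_path.read_text()
        # Convention (hypothetical): first line of each spec names the only
        # files this task may read, e.g. "FILES: src/pricing.py docs/pricing.md"
        files = spec.splitlines()[0].removeprefix("FILES:").split()
        context = "\n\n".join(Path(f).read_text() for f in files if Path(f).exists())
        result = run_agent(f"{spec}\n\n--- current files ---\n{context}")
        print(f"{spec_path.name}: {len(result)} chars of output")

build_from_specs()
```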
-Ellary-@reddit
It can even do a web search using 2048-4096 tokens of context; the trick is to pull not the whole page, just the snippets from it that match the search keywords.
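Roughly like this — a sketch, not my exact pipeline, and the URL and tag-stripping are illustrative:

```python
# Sketch: keep a web lookup under a few thousand tokens by returning only
# the paragraphs that mention the query terms, not the whole page.
import re
import urllib.request

def snippets(url: str, keywords: list[str], limit: int = 5) -> list[str]:
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag strip, fine for a sketch
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    hits = [p for p in paras if any(k.lower() in p.lower() for k in keywords)]
    return hits[:limit]

for s in snippets("https://example.com/article", ["gemma", "benchmark"]):
    print(s[:200])
```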
FatheredPuma81@reddit
If MoE doesn't perform well then test Qwen3.5 27B? Other benchmarks show it on par with 122B.
RobotechRicky@reddit
Which model is best for coding? Is Gemma 4 it?
inphaser@reddit
Do you know if it has already been converted for apple MLX?
Deep90@reddit
Just a suggestion, I think it would be interesting if you started running a hidden seed for all models to see if any of them are being trained and overfitted to your benchmark.
Disastrous_Theme5906@reddit (OP)
Good idea. The benchmark is closed source specifically for this reason, so models can't train on the simulation. A few Chinese labs showed interest but we only share run data, never the simulation itself. That said, running a few random seeds to double-check is a solid idea, we'll do that. Looking at the logs though, the decisions feel organic, no signs of overfitting.
FranzJoseph93@reddit
Something that seems like a flaw to me (based on reading the Gemma 4 test results) is that as I understand it, making every possible investment is good. I think without throwing in some investments the agent should avoid, you're just rewarding aggressive behavior. The food waste issue might be another sign of this.
Super interesting benchmark though, congrats!
nuclearbananana@reddit
Mind you, with Anthropic's vending machine going viral and all, I'm sure most are training on very similar tasks.
Which is, I mean, how training works, but depending on how similar it is, that's probably benchmaxxing a bit.
Disastrous_Theme5906@reddit (OP)
Yeah Google literally says they trained Gemma 4 on agentic tasks, so in that sense sure. But training on agentic data and overfitting to a specific benchmark are different things. Same way code models are good at coding — nobody calls that benchmaxxing.
nuclearbananana@reddit
yeah, but code is a very broad category though. If you trained extra heavily on Python/Flask and did better on SWE-bench (which is like 90% Python+Flask iirc), that would be benchmaxxing, without ever training directly on the tasks themselves
florinandrei@reddit
Models got so good so quickly at coding specifically because it's such a narrow task. You may think differently about that, because biases and whatnot, but it's hyper-specified and hyper-codified and has extremely few exceptions.
Cleaning toilets is far more complex, and far harder to do.
Nervous_Variety5669@reddit
You can easily have it verified by a trusted third party. It's a flawed experiment and that is quite obvious from the comments you've posted here, the very obviously vibe statements you've made on the site, and the results you are seeing.
You guys tinkered with LLMs, have very little technical experience, are obviously not data scientists or experts in this domain, or using LLMs either.
It's neat though. But it's a novelty.
I would have a different reaction if it were peddled as entertainment rather than trying to appear as if you'd built a legitimate measure of model capability here.
You will fool a lot of folks here though, unfortunately for them.
Deep90@reddit
Very cool! Thank you :)
nnxnnx@reddit
Would be great to test Qwen 3.5 27B (dense) - which would be a much better "equivalent" to compare against compared to Qwen 3.5 9B that is currently the "closest model" in the leaderboard.
Phaelon74@reddit
Dense always has an advantage over MoEs, so that should not be all that surprising.
BidWestern1056@reddit
commented this separately but same on npcsh benchmarks, it performed worse than gemma3:4b
BankruptingBanks@reddit
I think he meant gemma moe model
exact_constraint@reddit
Be interesting to see Qwen3.5 27B added to the test matrix - 31b dense vs Qwen MOE isn’t a super fair comparison, imo.
Disastrous_Theme5906@reddit (OP)
We tested Qwen 3.5 at both 9B and 397B — the 397B actually went bankrupt. More parameters didn't help. Qwen 3.5 is a great model overall, but this kind of sustained multi-day agentic task seems to hit different. Not sure the 27B would change the picture much — probably needs a generation-level jump.
StirlingG@reddit
don't underestimate the 27B!
SSOMGDSJD@reddit
Earlier in this thread you said MoEs specifically don't perform well on this bench. Why would you run Gemma 4 31B dense and MoE and then balk at running Qwen 3.5 27B, dense Gemma's obvious nearest neighbor? Qwen 3.5 27B has been getting a lot of praise for its performance at its param count; it would be nice to see direct comparisons between it and Gemma 4 31B.
Awwtifishal@reddit
The 397B has 17B active parameters. Maybe the active parameter count matters much more than the total. The 27B dense has 27B active parameters.
Plasmx@reddit
That can be true, but technically you would expect the gigantic size of the 397B to trump a 27B dense model on overall knowledge. But who knows, it has to be tested in the end.
Awwtifishal@reddit
This specific test was not about knowledge, but about skills related to business decision making.
exact_constraint@reddit
💯. I would expect 397B to outperform on a crystallized knowledge test, considering it has.. well, more lol. And besting 9B should be expected - I haven’t found a use for it outside tasks where you can define a very narrow scope.
No shade on the testing itself, nice points to have for comparison. Just, yeah, 27B is probably the most relevant model for direct comparison. I’m biased, considering I run 27B daily. Gemma 4 31B is pretty close to a drop in, 1:1 replacement, ignoring the current issues w/ context size.
Yousef5ory@reddit
How did DeepSeek V4 do at logical reasoning compared to it? And does it stand up to Llama 4 70B, or is it better for creativity, or no?
asevans48@reddit
Runs on a MacBook, so it's cheaper than $0.20 a run.
EuphoricAnimator@reddit
Wow, those results are seriously impressive for a 31B model. I’ve been running stuff locally on a Mac Studio M4 Max (128GB) for a few months now and have been really digging the progress. I mostly play with Qwen 3.5, Gemma 4, and a bunch of things through Ollama; Mixtral is a daily driver, naturally.
What I've found is that tool calling is so hit or miss, even with models this capable. I’ve been trying to get consistent results with a simple function to look up current weather, and Gemma 4 does noticeably better than most of the 7B/13B models I've tested. The key is really forcing structured output, like JSON all the way. Anything less and it gets confused pretty quickly, hallucinating parameters or just ignoring the instructions. Qwen 3.5 actually surprised me here, it's pretty good at following JSON schema even with minimal prompting.
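In practice "JSON all the way" mostly boils down to a validate-and-retry loop — a rough sketch, with chat() standing in for whatever local endpoint you run:

```python
# Sketch: validate the model's tool-call JSON and retry with the parse
# error appended. chat() is a placeholder for any local inference call
# (LM Studio, llama.cpp server, Ollama, ...).
import json

def chat(prompt: str) -> str:
    """Stub: wire this up to your local endpoint."""
    raise NotImplementedError

def call_tool(prompt: str, retries: int = 3) -> dict:
    required = ("tool", "args")
    msg = prompt + "\nReply with ONE JSON object with keys: tool, args."
    for _ in range(retries):
        raw = chat(msg)
        try:
            obj = json.loads(raw)
            if all(k in obj for k in required):
                return obj
            msg = prompt + f"\nYour JSON was missing one of {required}. Try again."
        except json.JSONDecodeError as e:
            msg = prompt + f"\nThat was not valid JSON ({e}). Try again."
    raise RuntimeError("no valid tool-call JSON after retries")
```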
Inference speeds are great on my setup. I can get Gemma 4 running around 25-30 tokens/second with quantization, using about 60GB of VRAM. It's not instantaneous, but totally usable for most tasks. Trying to push it too far with less quantization definitely impacts quality, especially with more complex prompts.
Honestly, benchmarks are great, but I'm more interested in how these models actually behave when you ask them to do something specific. I’m still tweaking prompts and experimenting with different techniques to get reliable outputs. It’s a fun puzzle, and seeing models like Gemma 4 perform this well locally makes it even more exciting.
petruskax@reddit
tool calling with gemma models is abysmal, the worst I ever used, even after fucking around in vLLM a LOT.
EuphoricAnimator@reddit
Yeah the native tool calling is rough. I ended up building a harness that normalizes tool calling across models — works with Gemma, Qwen, Deepseek, whatever. Handles the structured output/JSON wrangling so I don't have to fight each model's quirks individually.
https://use-ash.github.io/apex/
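Conceptually it's just mapping the few output shapes you actually see onto one canonical dict — a simplified sketch, with illustrative patterns, not Apex's actual code:

```python
# Sketch: normalize tool calls across models by accepting the common output
# shapes (bare JSON, fenced JSON, <tool_call> tags) and unifying key names.
import json
import re

FENCED = re.compile(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", re.S)
TAGGED = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.S)

def normalize_tool_call(raw: str) -> dict | None:
    for pattern in (TAGGED, FENCED):
        m = pattern.search(raw)
        if m:
            raw = m.group(1)
            break
    try:
        obj = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
    # Different models pick different key names for the same fields.
    name = obj.get("tool") or obj.get("name") or obj.get("function")
    args = obj.get("args") or obj.get("arguments") or {}
    return {"tool": name, "args": args} if name else None
```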
aristotle-agent@reddit
yikes. Question: does it feel better than those paid models?
(like, does performance feel better than Sonnet 4.6 and Gemini 3 Pro from your image?)
Disastrous_Theme5906@reddit (OP)
Genuinely yes. In terms of agentic reasoning this model is way above Sonnet 4.6 and Gemini 3 Pro. The decision quality is closer to GPT-5.2 xhigh honestly. How they achieved this in 31B params we don't fully understand yet, but Google says they specifically trained Gemma 4 for agentic tasks so that probably explains a lot.
Nervous_Variety5669@reddit
Genuinely, I do not appreciate you insulting our intelligence. Your benchmark is vibes and so are your comments. You've lost all credibility with claiming, and I will quote what you said (not that pivot you made to another commenter narrowing the scope to your vibe benchmark):
"In terms of agentic reasoning this model is way above Sonnet 4.6 and Gemini 3 Pro. The decision quality is closer to GPT-5.2/5.3/5.4 xhigh honestly."
Second, you have not provided the parameters you used for any of the models.
- What reasoning effort did you use for each model?
- Did you configure compaction? How?
- What tools were configured? I know you wrote this gem here:
"Text-based tool calling, zero friction: Gemma 4 has no native function-calling API. All 34 tools were invoked via text-based parsing. The model followed the schema perfectly — 462–488 tool calls per run with zero parsing errors."
What does that even mean? What do you mean it has no native function calling API? How did you run it?
Also, you claim this:
"All models receive the exact same system prompt— rules, tools descriptions, and simulation mechanics. No model-specific tuning or hints."
Well no wonder. OpenAI, Anthropic and Google publish prompt guides for a reason. Prompting matters when leveraging the full capability of a model. You've crippled all of them. If Gemma came out on top, then all you may have proven is that it's less sensitive to proper prompt engineering.
Has your benchmark been verified by a third party? Where can we find it?
Once you give us the runtime configuration for each model, and how we can replicate the experiment, this ... thing, is nothing more than vibes and a blog post. Anyone can call anything a benchmark.
It doesn't make it true.
Icy_Distribution_361@reddit
From what I've heard prompt engineering isn't really much of a thing anymore. What models improve their output on is the quality of the input, i.e. enough context, clear intent. Not anything very specific to the model.
Old_Cantaloupe_6558@reddit
Since when did prompt engineering stop meaning managing the context?
Icy_Distribution_361@reddit
My point is that it isn’t specific to a model
joost00719@reddit
My experience is way worse than even Qwen 3.5 35B. It fails to even edit a JSON file. I mean, it does edit the file, it just fucks up the syntax.
I don't like it. I wish I could, but for programming it's kinda bad.
Prestigious-Crow-845@reddit
Failing to edit JSON sounds like an issue with something else, not the model itself.
s101c@reddit
If it fails at something as simple as syntax, something is wrong with your setup: either an unpatched llama.cpp or bad sampler settings.
joost00719@reddit
Latest Llama-server version and recommended settings from the site. The Llama-server version was so new that it broke my auto update script cuz when I checked, the newest release was 4 minutes ago.
vinigrae@reddit
It’s not the model it’s your setup or whatever host you’re using
jakegh@reddit
Qwen is definitely better at coding. That isn’t what this benchmark measures.
nyrixx@reddit
Ah yes json file syntax deep coding black arts level task right there.
gnnr25@reddit
Ahh yes, because we all know the Expert Software Developer who is also a *checks notes* successful food truck entrepreneur. See it all soon, in Chef 2: AI Boogaloo
DOAMOD@reddit
It's very funny how people believe the benchmark hype; this is absolutely horrible. In my tests it's much inferior and currently has very serious problems in the state Gemma 4 is in. It's totally useless with tools and doesn't do a good job at all, it's completely broken. Qwen 3.5 is way ahead in anything related to coding, development, agentic tasks, tools, etc.
Where Gemma 4 is clearly better is in everything related to multilingual capabilities and writing; in that aspect it is much better. So if you want it for writing-related topics, yes, it's the best, but for serious development topics, no, at least not currently. And even if all the problems are fixed, I don't think it will improve much; this is more of a generalist model.
randylush@reddit
You say the word “genuinely” a lot
Disastrous_Theme5906@reddit (OP)
lol fair, I need to expand my vocabulary
randylush@reddit
It got very popular as a meaningless meme/filler word lately like “lowkey”. Well, at least it means something. “Lowkey” essentially has no meaning at this point
nuclearbananana@reddit
AI loves the word 'genuinely' as well as similar ideas of being 'real', 'not performing' etc. This didn't come out of nowhere
bephire@reddit
Could you expand more on that?
nuclearbananana@reddit
I don't know what to expand on. I've just noticed it using these phrases/ideas a lot
bephire@reddit
Oh just, what models, the context in which they appear, what made it stand out to you. When I read your comment I was very surprised that this happened with somebody else and was wondering if this was a well known phenomenon. The model I've been using is Claude Sonnet 4.6
justgetoffmylawn@reddit
You genuinely do. :)
I'm not super familiar with your benchmark. If I'm reading correctly, right now the best humans double Opus's performance. Opus is almost double GPT 5.2 (are you adding GPT 5.4?). And Gemma 4 is surprisingly close behind.
One thing that's interesting to me - Opus and most other AI models have extremely minimal waste, no matter the ROI. Gemma 4 seems to have much higher waste, but good ROI - which seems more similar to human responses?
Anyways, I only skimmed so I may not be understanding - just curious your thoughts.
Disastrous_Theme5906@reddit (OP)
haha genuinely sorry about that :)
5.4 isn't on the leaderboard yet. We tested it when it launched but the API was extremely slow in xhigh mode and costs several times more than 5.2. Generates a massive amount of tokens. From what we saw it's better than 5.2 but not by much, not enough to justify the cost increase. Postponed full testing for now.
On humans — best players can beat AI models after 2-3 tries. Getting to the overall #1 on the leaderboard took the top player about 10 runs though.
The waste thing is a great observation. Humans have the same problem — they get lazy with math and don't calculate exact portions. Gemma just physically can't do that math as well as bigger models like Opus. It knows it's wasting food, writes about it every day, but can't fix it. Bigger models with better arithmetic just don't make that mistake.
justgetoffmylawn@reddit
So that was 5.2 on xhigh, correct?
That's interesting on human performance. And if you trained a model on it, I'm sure it would improve - so doesn't really mean that humans are better.
"It knows it's wasting food, writes about it every day, but can't fix it." One of us, one of us.
But seriously - just tried Gemma 4 31b for the first time on a complex medical question where I've used Opus and Gemini 3.1 Pro and GPT 5.4 - and Gemma 4 was shockingly good. Like I keep forgetting that Gemma 4 31b would've been a frontier model not that long ago. Maybe it still is?
Gotta use it more, but didn't expect it to be this good even when I heard the hype - thought it would be narrow knowledge.
jakegh@reddit
On reasoning capability it’s quite good, and it handles tool calls well. But frontier models have vastly larger world models, which makes them more intuitive and handle ambiguous prompts better.
Gemini 3 is particularly strong at that. It’s a huge model. BUT, it sucks at coding.
jakegh@reddit
I thought this was an agentic benchmark — why wouldn’t it use tool calls for that sort of thing?
I don’t really see why the ability to do arithmetic is valuable.
Venium@reddit
lol, lmao even.
johnnyXcrane@reddit
I really, really doubt that. Perhaps in some specific use cases. But I have not tested it yet, so I'm not saying you're lying. I've just read that so often here and in tons of benchmarks, and they always turned out way worse than SOTA.
Disastrous_Theme5906@reddit (OP)
Fair skepticism. We're not claiming it beats SOTA at everything, just on our specific agentic benchmark. The results are public with full day-by-day logs and you can verify the runs. It's definitely not matching Opus 4.6 or GPT-5.2 in overall capability, but for structured multi-step decision making at this price point the gap is way smaller than expected.
Ardalok@reddit
It feels better than Gemini 3 Flash, or at least on par.
DarkArtsMastery@reddit
Vibes are fine
ZucchiniEfficient978@reddit
i tried e4b for openclaw and it was really bad, any suggestions?
jimmytoan@reddit
Have you published the benchmark methodology anywhere, and are you planning to test the MoE variant of Gemma 4 to see if the dense model's performance advantage holds up?
ICanSeeYou7867@reddit
Dude, this is a great benchmark. Everything seems to be benchmaxxed these days. Great idea.
andber6@reddit
Wow i need to try this
bithatchling@reddit
The 31B vs 26B A4B tradeoff is one of the more interesting parts of this release for me. The MoE variant only activates ~3.8B parameters per token, so the fact that it can land this close to 31B on evals is kind of wild. What stands out here is not just the score, but how badly it breaks the old habit of using parameter count as a proxy for quality.
tmyx0m0p@reddit
what about GPT-5.4/mini?
idkedu@reddit
Gemma 4 can run on mobile devices as well. I have created some skills which I found useful for myself. I have them public for everyone. https://github.com/StrinGhost/gemma-skills
joeyhipolito@reddit
tried it on my orchestrator for planning tasks, held up surprisingly well for 31B. starts getting weird with long tool call chains though, hallucinates tool names around step 6 or 7. still testing. what's the tool call depth looking like in your food truck sim?
hesperaux@reddit
Thanks for the info! Very helpful. Appreciate you going back and testing 26b a4b.
TheRiddler79@reddit
100% agree. I started using it yesterday as my 2nd level and it punches almost up to my Qwen 3.5 that's 12x larger
dev_l1x_be@reddit
Claude is committing suicide with their current approach of fucking over the user base, while at the same time somehow managing to make Opus worse, at least for the project I'm using it for.
Keinsaas@reddit
Connect it to our keinsaas navigator🙌
_derpiii_@reddit
Question: is there a framework/harnesses to build these kind of benchmarks? Or are people vibe coding custom harnesses?
I see a lot of these benchmarks and don't know where to even begin.
BidWestern1056@reddit
i gotta try the 31b cause the e4b did p dogshit in my npcsh benchmarking, doing even worse than the gemma3:4b strangely, might try re-running it but was surprised.
Disastrous_Theme5906@reddit (OP)
Same experience here. The A4B variant leaks native special tokens into tool call JSON, couldn't even complete our simulation. The 31B dense is a completely different model in terms of quality. Definitely try it.
BidWestern1056@reddit
ya the gemma4 31b was solid
z_latent@reddit
Just pointing out, E4B and [26B] A4B are different models. Which one did you mean here?
BidWestern1056@reddit
good to know, working on training a native-complex model that's eating up most of my gpus atm but am planning to run it through my npcsh benchmarks after that's done.
Adventurous-Paper566@reddit
Gemma 4 is the first local model I can run on 32GB of VRAM without having to correct it.
I'm talking to it, with an average STT time of 2 minutes per input, and it never digresses or misunderstands the subject of the conversation. In French.
I'm impatiently waiting for the 124B MoE!
redditorialy_retard@reddit
me with 24 GB :(
Adventurous-Paper566@reddit
Q4_K_XL still good 👍
redditorialy_retard@reddit
shi thanks
Plasmx@reddit
You can just remove the mmproj file? I thought it was a bigger effort when they said to remove vision for less VRAM usage!
Adventurous-Paper566@reddit
Yes, or just rename it with the .gguf.bckp extension to disable it.
It's only easy with gguf quants.
bacocololo@reddit
Which quantization do you use, please? Merci
Adventurous-Paper566@reddit
Q6_K_XL bartowski, without the mmproj I can reach 20k context.
Q4_K_XL with the same parameters loads with 65536 tokens but I haven't tested its limits yet.
Maleficent-Ad5999@reddit
Do you find Gemma 4 performing better than Qwen 3.5 27b?
Adventurous-Paper566@reddit
Yes, for my usage 31B performs better.
Rude_Ambassador_6270@reddit
well, okay, can I put it to fx trading then?
DetouristCollective@reddit
Do you have any plans to compare it to another comparable dense model like Qwen3.5 27B?
Disastrous_Theme5906@reddit (OP)
Already tested Qwen 3.5 9B and 397B — the 397B went bankrupt, bigger didn't help here. The 27B would likely land somewhere in between. Great model family overall, just not at Gemma 4's level for this type of task yet.
my_name_isnt_clever@reddit
Not wanting to test the most modern and directly comparable model to Gemma 4 31B is a strange choice OP.
pile-of-V100s@reddit
27B has far more active parameters than both 9B and 397B-A17B
kavakravata@reddit
Can i run it with a single 3090? 😁😁
LanceThunder@reddit
on my 3090 the 30b is slow or crashes my system. the 26b goes a little slow but not as slow as i would expect for 26b. 4b is pretty good.
misha1350@reddit
4B on a 3090 is such a waste, 31B would run well on 24GB VRAM. You can use a 3090 or the Intel ARC Pro B60 24GB easily.
raindownthunda@reddit
Yes, 31B at Q5_K_M runs on my 3090 but it’s on the slower side. The quality is pretty fantastic though! I am going to experiment with 26B A4B next. E4B is insanely fast and surprisingly good, but 31B of course blows it away. Hoping 26B A4B is a good balance of quality and speed for the 3090.
misha1350@reddit
Try UD-Q4_K_XL as well, or the UD-IQ quants (since you have an NVIDIA card).
raindownthunda@reddit
Thanks for the tip! Will try those quants. Do you know if imatrix gguf is preferred or standard gguf (thinking about some of the Q3.5 models)? I’ve admittedly gotten a little lost in the different methods.
misha1350@reddit
Imatrix seems to work well for CUDA, whereas regular gguf and other models are universal and work equally well across all hardware.
raindownthunda@reddit
Thank you - I appreciate it
LanceThunder@reddit
i did get the 31B to work for a little while but it was still too slow for my liking.
z_latent@reddit
The 26B one is an A4B MoE, so it's supposed to be near 4B speeds, assuming it fit your VRAM without CPU offloading. What quant were you running it with?
Spectrum1523@reddit
Sure
YetiTrix@reddit
Gemma 4 didn't really work for my use case. Which is diagnosing PLC Code. Qwen-Coder-Next still does best job for that.
Embarrassed_Adagio28@reddit
Yeah, Qwen 3 Coder Next is still my best local coding model. It is great at tool calling and, with the right project structure, can run for hours at a time without stopping to ask questions. Even with the IQ3_XXS quant.
Disastrous_Theme5906@reddit (OP)
Makes sense, 31B is still a small model and can't be great at everything. Our benchmark tests agentic decision-making, not coding. For PLC diagnostics and dev tasks there are definitely better options at this size. Qwen-Coder is solid for that.
ceo_of_banana@reddit
How does it test that? I've heard the word many times but I'm still not sure what it means.
vinigrae@reddit
Logic
Ryukish@reddit
Have you tried a mix of skills + more detailed prompts. I usually find that open source models need me to be more explicit to get opus level performance
YetiTrix@reddit
PLC code is especially hard because models aren't really trained on it, especially text representations of ladder logic. That data doesn't exist out on the internet. The models have to infer a lot more of the meaning. Yes, I do A/B testing with my prompts.
Ryukish@reddit
I agree with that. It feels like the harness we use matters a lot more than the model sometimes, especially if it isn't a huge field AI is familiar with. I did find that processing books (related to my field) into markdown files and referencing them helped performance though.
Ok-Secret5233@reddit
OP, how do you run Gemma 4? I haven't managed it. Hugging face has only safetensors (no gguf) and the llama.cpp convert script errors with "failed to detect model architecture"...
danhoel999@reddit
As someone who only starts with local models: is there a guide on how to use for example this 31B model of Gemma?
DeepOrangeSky@reddit
What does the mean (instead of median) result look like for it compared to these other models?
Also, how extreme is the variance between runs (for the same model against itself over the 5 runs)? Is there some way of expressing the severity of the volatility over the 30 days, like a standard deviation per 1-day or 5-day segment across runs? How big are the jumps and dips on the graph as it goes along? Are they severe enough that you'd need a lot more than 5 runs/30 days for the results to mean much, or are the wobbles small enough relative to the overall run that the medians stay meaningful even accounting for variance? (I assume it's at least somewhat volatile, given that some overall-profitable models go broke on some runs, which means there's enough volatility that they're dying in the early phase some significant percentage of the time.)
Also, since the models all start with just $2,000 (proportionally small relative to the ending amounts after a month), have you considered an additional version of the test where you allow the models to go broke and receive a bailout? Say a model goes broke on day 10: you boost it back to $2k and let it continue the run, but still record it as a "went broke" run with an asterisk. If models go broke in the early portion of half their runs, you could get roughly 1.1-1.5x as much data this way. The difference from simply doing a few extra runs from scratch is that you also keep the data from before the bankruptcy in the later analyses (although if you already include data from all parts of bankrupt runs, maybe it wouldn't make much difference). And obviously the models would still need to believe that going broke means going broke: don't tell them they get a bailout and continue the run, otherwise they'd try much riskier strategies.
Disastrous_Theme5906@reddit (OP)
We use median because some models have wild variance — one good run out of five. Gemma 4 is actually pretty consistent, ROI across 5 runs ranges from +457% to +1,354%, all profitable. Models that go bankrupt usually blow up in the first week because of bad inventory management, not bad luck.
On the bailout thing — there's a loan system in the sim for this. Models can borrow to recover from a rough start. Doesn't help though. Weak models keep making the same mistakes and go bankrupt anyway, loan or not.
SummarizedAnu@reddit
Will you make a YouTube animated 3d/2d video about this?
MoodDelicious3920@reddit
Why AI generated answer? Ur answer contains 2 em dashes.
Sea-Spot-1113@reddit
Did you know -- humans also use em dashes?
MoodDelicious3920@reddit
But the dashes u used are --, which are there on the keyboard (both phone and PC), but the long em dash — isn't generally available unless u copy-paste that symbol, like i did just now
Party-Special-5177@reddit
I’m on iPad right now I can make em dashes on command with double hyphens, it just autocorrects them: —
And I use them too as I like them, just not on Reddit as you guys have a psychosis about them lol
CallmeAK__@reddit
That 31B dense model really hits the sweet spot for unit economics. I’ve found that for agentic workflows, the "perception" of tool outputs is usually what kills the smaller variants, so it makes sense that the 31B handles the reasoning loops without the JSON formatting issues.
Happy-Register3367@reddit
This feels almost too good to be true. 31B + $0.20/run beating stuff 10-100x more expensive is kinda insane, if it actually holds up outside this benchmark.
GroundbreakingMall54@reddit
gemma 4 really is a different feel
Loose_Object_8311@reddit
We need to give them real food trucks to run and benchmark against that.
FenderMoon@reddit
I've been using the 26B A4B one, and I've been blown away. First local model I've ever used that genuinely feels smart enough to replace ChatGPT for daily stuff.
I did have to get reasoning enabled by modifying the templates. For some reason none of them have the reasoning working out of the box, and the model is way worse without it in LM Studio.
Strange-Base8809@reddit
what sort of use case that you are validating against?
Hot_Ferret7431@reddit
I don't understand. Is this model really better than Sonnet 4.6? I don't understand how a 32GB model I can run on my machine is better than a multi billion dollar model
Sky-Asher27@reddit
the 4b param is great too
Force88@reddit
I tried Gemma 4 26B and 31B. While fast, it doesn't seem to handle unknown knowledge or web search.
I asked it to find me the latest news that its knowledge base doesn't have, like details of the Nvidia 5000 GPUs, but it said they aren't out yet and that leaks only show the 5090 will be very powerful.
The same question gets answered correctly by Qwen 3.5, though.
I don't know if I'm doing anything wrong; I just pull it from the Ollama app on Windows and chat.
ashlord666@reddit
did you connect it to any mcp servers like mcp-server-fetch? If not, how do you expect it to be able to go online?
Force88@reddit
Nope, the Ollama client supports web search by default, at least with every Qwen model I've used.
Also, what's an MCP server? Is that other software I need to run AI?
Far_Cat9782@reddit
MCP servers are the tools. Pretty much Python scripts that run HTTP servers in the background. It's a standard that allows your AI to use tools like web search, web fetch, image generation, etc.; it lets the model interface with outside programs, the "bridge." So you can make one to control Blender, or connect to ComfyUI to generate images or audio. Definitely look it up, it's easy to implement.
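A minimal one using the official MCP Python SDK (pip install mcp) looks roughly like this — the tool body is just a placeholder:

```python
# Sketch: a tiny MCP server exposing one tool via the MCP Python SDK.
# Any MCP-capable client can then discover and call it.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def shout(text: str) -> str:
    """Placeholder tool: upper-cases its input."""
    return text.upper()

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```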
year2039nuclearwar@reddit
I’m also interested to find out the answer to this
Odd_Mortgage_9108@reddit
Wait, if you have a food truck simulation, is it solving an optimisation problem? Maybe a traveling salesman problem? I'm wary of "model X does well in benchmark" if the benchmark is very specific.
silentus8378@reddit
When you do comprehensive benchmarking, qwen3.5 27b is still better than gemma 4 31b.
year2039nuclearwar@reddit
Where can we find details on this?
VoiceApprehensive893@reddit
sometimes benchmarking results are just funny
yes, it's the MoE that's beating Sonnet, not the dense one
EugeneSpaceman@reddit
Huge margin of error on all those scores. They all overlap with each other if you take that into account
OmarBessa@reddit
how
GrungeWerX@reddit
Why isn't Qwen 3.5 27B in this testing? That's the only fair comparison to the 31B as they're both dense models...
Disastrous_Theme5906@reddit (OP)
Getting a lot of Qwen 3.5 27B requests in this thread. We tested the 9B and 397B — both well below Gemma 4 on this task, the 397B went bankrupt. I run this project on my own time and money, so I can't cover every model from every lab. If you want to see Qwen 27B tested — ping u/Alibaba_Qwen on Twitter or tag them in r/LocalLLaMA. If a lab shows interest, I'll run their models and publish everything.
EugeneSpaceman@reddit
It’s the most direct comparison to Gemma 4 31b, and considered by many to be ahead of it in several domains.
Would be a big omission not to include it.
Negative-Web8619@reddit
The first one to benchmaxx on ftb
Iwaku_Real@reddit
How can you benchmaxx if you don't have the actual test data
Negative-Web8619@reddit
It's a joke
Digitalzuzel@reddit
google has access to conversation logs of gemini models..
Beckendy@reddit
Seriously, you have GPT-5.2 in second place? Where's GPT-5.4?
Ayuzh@reddit
what's your setup for running these?
RevolutionaryGold325@reddit
What was the context size, and how much memory did it take?
Specialist_Golf8133@reddit
wait people are still sleeping on gemma? the price/performance here is actually insane. like everyone's gonna keep throwing money at the big models while this thing is just sitting there at 31B doing 90% of the work for pennies. kinda feels like the gap between 'good enough' and 'perfect' just became way more expensive than most workflows actually need
one-escape-left@reddit
from your blog post: "Qwen 3.5 9B (bankrupt tier, $0.15/run) — the closest model in parameter count and price"
This is incorrect. Qwen 3.5 27B is the closest dense model in the family. Have you considered running that model?
Disastrous_Theme5906@reddit (OP)
Fair — "closest from what we tested" would've been more accurate. We tested the 9B and 397B from the Qwen 3.5 family, both endpoints of the range. The 397B went bankrupt. Can't realistically test every variant from every lab — each model needs 5 full 30-day runs for reliable medians. If anyone has contacts at Qwen's team and they're interested, happy to run it and publish the results.
ZeitgeistArchive@reddit
is there a dense thinking gemma 4 31B? I tried the 31B instruct version and it was ok, but not great for my knowledge and reasoning goals
Swimming_Gain_4989@reddit
31B is a thinking model, if you're not seeing thinking tokens your provider is misconfigured
FenderMoon@reddit
It has to be enabled by changing the Jinja templates in LM Studio. They haven't fixed that yet.
GrungeWerX@reddit
Will changing the Jinja template make it work? I tried setting enabled to true, but it thought for maybe a sentence and then immediately started its output. And it looked weird.
Example above. I'm assuming it needs to be fixed internally?
Disastrous_Theme5906@reddit (OP)
Same issue on our end. The 26B A4B MoE variant leaks <|\ tokens into tool call JSON — every string value comes out as "<|\ground_beef<|\"|" instead of "ground_beef". Had to write a regex sanitizer to strip these tokens just to get it through a benchmark run. The 31B dense doesn't have this problem over API, but A4B is rough.
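For anyone hitting the same leak, the sanitizer boils down to something like this — the pattern just targets the garbling we saw, and the real fix belongs upstream in the chat template:

```python
# Sketch: strip the leaked special-token fragments before parsing tool JSON.
import re

# Stray "<|\" prefixes and '"|' tails, as seen in the leaked values.
LEAK = re.compile(r'<\|\\|"\|')

def sanitize(raw: str) -> str:
    return LEAK.sub("", raw)

assert sanitize(r'<|\ground_beef<|\"|') == "ground_beef"
```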
Warthammer40K@reddit
Gemini 4 Pro gonna crush whatever the hell this benchmark is.
dancinpants@reddit
No one is using local LLMs for serious coding. I tried Gemma 4 with OpenCode and it got stuck in an infinite loop trying to search for a file. This tech ain't ready yet.
phazei@reddit
I've seen a lot of praise for this model. But in a lot of the comments people are saying it's just benchmaxxing. What do you say to that? That all the tests are in the training data?
Disastrous_Theme5906@reddit (OP)
The benchmark is closed source specifically so models can't train on it. No lab has access to the simulation internals. Looking at the logs, the model makes organic decisions — it adapts to events, changes strategy mid-run, makes mistakes and recovers. Doesn't look like memorization.
GenerallyVerklempt@reddit
Does that mean your results are not reproducible by anyone except you? We just have to take your word for it?
Digitalzuzel@reddit
Just curious, what is your solution to benchmaxxing?
phazei@reddit
Nice! Can't wait to try it myself, I've been pretty astounded by Qwen 3.5 already, having something else come out so soon after that's even better is awesome.
Honest-Debate-6863@reddit
Where can I find the codebase of the harness?
ConsiderationHot814@reddit
This is a fascinating breakdown! The cost-to-performance ratio of Gemma 4 (31B) compared to frontier models like GPT-5.2 and Opus 4.6 is truly impressive. It's interesting to see a dense model outperforming MoE architectures in this specific agentic simulation. Looking forward to seeing the results for the 26B A4B version as well!
Euphoric_Emotion5397@reddit
unfortunately, I'm having trouble getting it to do tool calling and instructions following in my LM studio :( The prompt works totally fine with Qwen 3.5.
AgitatedHearing653@reddit
What a clever idea having ai compete with each other to run a fictional business. Game theory at its finest. Kudos on this.
JohnMason6504@reddit
26B with 4B active per token. Running Q8 on Jetson Orin at 40 tok/s. Apache 2.0 license seals it.
Few-Beyond785@reddit
!RemindMe 12h
Maralitabambolo@reddit
16 or 8bit?
Even_Minimum_4797@reddit
This is underrated
gpt872323@reddit
Opus 4.6 isn't number one on multiple leaderboards, just benchmark-wise; otherwise I'm the biggest fan of Opus. Just saying there can't be this much discrepancy.
Ylsid@reddit
Waiting for the minebench, the real test of skills
Conscious_Nobody9571@reddit
Better than sonnet? No way
Murder_Teddy_Bear@reddit
I'm really happy with it, can't wait for the eventual uncensored release. ; )
Natrimo@reddit
Hauhau has one out already
The_Choir_Invisible@reddit
So far I've tried the e2b and e4b (meant for mobile) versions and they are uncensored to an extent that I haven't seen since wizard-vicuna-uncensored. I hope they work well with AnythingLLM because I'd like to use them for agentic tasks.
Also for anyone downloading the e4b quants, check out the _P versions!
Murder_Teddy_Bear@reddit
Oh shit! That was quick, thanks.
Natrimo@reddit
Let me know how it works, haven't tried it myself
Acceptable_Home_@reddit
Yo, I'm making something similar, can I have some tips? A noodle shop sim for bias detection: how many LLMs will start capitalising when I tell them there's no proper way to run the shop or win, they're free.
Right now it has 4 different suppliers with a loyalty and mood system, different types of noodle stock, many moral events, reputation based on stock, cleanliness, etc., plus a fatigue system, rent, and supply chain inflation or supply chains breaking due to storms. I would love your opinion on this :)
AdUnlucky9870@reddit
This is the part that keeps surprising me every quarter — we keep thinking we've hit diminishing returns, then something like this drops.
What I'm curious about: is anyone running this at scale in production yet? The benchmarks look great, but I've been burned before by models that crush evals but fall apart on messy real-world inputs. Would love to hear from anyone who's stress-tested it beyond the leaderboard tasks.
somerussianbear@reddit
I don’t get it. Several benchmarks posted here and all over the place have been showing Qwen 3.5 dense beating the Gemma counterpart, not by much, but beating it. But then in other benchmarks Gemma beats everything and Qwen is not even in the picture. I’m a happy user of both, so no rage, just wanna understand really.
yaboyyoungairvent@reddit
I think it's cause smaller models can't be good at everything like with larger models. They can be good at select things. It seems the consensus on here is that qwen performs well when it comes to coding but if your use case is for specifically agentic tasks then gemma is better.
SexyAlienHotTubWater@reddit
Honestly I think this shows that the metric is not particularly good.
Try talking to it, get it to solve some tasks. Gemma is way dumber than sonnet 4.6, kimi K2, Qwen 3.6, 3.5, Gemini (which it was probably distilled from)...
lobehubexp@reddit
Are you factoring in total inference cost or just per run pricing
Quillshade36@reddit
!RemindMe 12h
totonn87@reddit
I have to buy a new laptop. Does Gemma 4 26B work on a MacBook Air M5 with 24GB of RAM?
PattF@reddit
26b works great, 31b not so much. 26b is great too though.
totonn87@reddit
But it does not fit in 16GB of RAM, right?
PattF@reddit
26b will, even with a high context. 31b will but with less than 1k context and like 3-7 tps
Street_Ice3816@reddit
gemma is not that good
citrusalex@reddit
I've observed the same doing a Home Assistant bench.
m98789@reddit
How does it compare to GPT-OSS-120B?
itsjase@reddit
Tell me I shouldn’t trust your benchmark without telling me I shouldn’t trust your benchmark
virtualunc@reddit
$0.20 per run vs $7.90 for sonnet is insane if these numbers hold up across other benchmarks too.. open source catching frontier models at 1/40th the cost is the real story here
DonnaPollson@reddit
The interesting signal here isn’t just raw quality, it’s price elasticity. Once a model gets good enough for multi-step work, a 20x cost delta changes behavior more than a small benchmark gap because people start routing entire classes of tasks to it by default. The real test now is variance across prompts and tool stacks, not whether it can win one leaderboard headline.
ortegaalfredo@reddit
I had the same experience. Just did a benchmark expecting it to be dumber than Qwen 3.5 27B, but it actually was near 397B in performance.
TQMA@reddit
!RemindMe 24h
MrCoolest@reddit
Is this 4b quantized?
Leonjy92@reddit
!RemindMe 24H
Leonjy92@reddit
!RemindMe 24h
redballooon@reddit
Casually, huh? Can't wait to see the results when it tries earnestly.
Tough-Intention3672@reddit
Where are GPT 5.3, GPT 5.4, which are smarter than GPT 5.2?
NNN_Throwaway2@reddit
What inference backend did you run it with?
trusty20@reddit
What backend are you using for gemma? llama.cpp?
LanceThunder@reddit
i was working on some javascript with Qwen 3.5 9b and Gemma4 26b. the Qwen 3.5 model did a better job.
Roubbes@reddit
Which quants did you use?
Disastrous_Theme5906@reddit (OP)
No quants, we run through OpenRouter API — full weights, thinking mode enabled. https://openrouter.ai/google/gemma-4-31b-it
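For anyone who wants to reproduce the setup, it's a plain OpenAI-compatible call — the key is a placeholder, and thinking-mode flags vary by provider, so check OpenRouter's docs for those:

```python
# Sketch: calling the same model slug through OpenRouter's
# OpenAI-compatible chat completions endpoint.
import json
import urllib.request

req = urllib.request.Request(
    "https://openrouter.ai/api/v1/chat/completions",
    data=json.dumps({
        "model": "google/gemma-4-31b-it",
        "messages": [{"role": "user", "content": "Plan day 1 of a food truck."}],
    }).encode(),
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY",
             "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```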
xplode145@reddit
It’s so slow on my m5 pro 64gb ram
Nervous-Positive-431@reddit
I am thinking of getting one of those bad puppies, how many tokens are you getting?
DroopyMcDoo@reddit
This looks interesting af but I have no idea what’s going on here. Could someone explain?
Disastrous_Theme5906@reddit (OP)
AI models run a simulated food truck business for 30 days — they choose locations, set menus, buy ingredients, hire staff, manage money. We compare how well different models handle it. Leaderboard at foodtruckbench.com, you can also play it yourself.
Rich_Artist_8327@reddit
Grok doing pretty bad. Was Pentagon driven by Grok?
Disastrous_Theme5906@reddit (OP)
Yeah Grok was disappointing. I think Elon knows — hopefully they come back with something stronger. Would love to see them competitive again.
GanacheValuable2310@reddit
The fact that qwen 397B couldn't even survive consistently but this 31B does every time is crazy
Rich_Artist_8327@reddit
Where do you get this $0.20/run? What is that value?
Enough_Leopard3524@reddit
It’s good to know the open source models are improving. It’s a cold day in hell when I use only paid LLM models. They were trained on public knowledge, used by the public - just like the internet. I will always support this type of behavior from Google or any other organization. AOL learned the hard way, fafo.
traveddit@reddit
This isn't true. Gemma 4 has its own native function calling template that are baked into the tokenizer.
Disastrous_Theme5906@reddit (OP)
You're right, my bad. Gemma 4 does have native function calling tokens. We run it through OpenRouter which handles the conversion to OpenAI-compatible schema on their end, so we didn't interact with the native template directly. Updated the article, thanks for catching that.
ScoreUnique@reddit
I am running the 31B on OpenCode attached to Paperclip AI. I find Paperclip AI struggles with small MoEs; the only models that didn't fail miserably were Gemma 4 31B and MoE models. Google came to claim the GOAT title for local models, it seems.
RealAggressiveNooby@reddit
How does Qwen 3.5 with similar params compare to Gemma 4? Has anyone here messed around with both (for general applications and for coding respectively)?
Disastrous_Theme5906@reddit (OP)
We haven't tested Qwen3.5-27B specifically. The closest we have is Qwen 3.5 9B (0% survival, bankrupt in \~14 days) and Qwen 3.5 397B with 17B active params (29% survival, negative ROI). Even the 397B version couldn't come close to Gemma's results, so honestly not sure what the 27B would do. Can't speak to coding, only agentic tasks on our bench.
illcuontheotherside@reddit
Guess I need to try 31b again. I have not been pleased with the 26b model. At all.
Neither_Nebula_5423@reddit
Qwen works better for my use cases (vibe research)
NotumRobotics@reddit
It's the absolute king of our cluster.
MrMrsPotts@reddit
Where did qwen 3.5 come?
Recoil42@reddit
OP: Looks like you don't have an inference cost column on your results page at all? Seems like it would be useful.
Disastrous_Theme5906@reddit (OP)
Yeah fair point, it's not on the main leaderboard table yet. Cost data is in the individual case studies but should probably be a column on the main page too. Adding it to the list.