[-]

MrCoolest@reddit

Can you run Gemma 4 31b in 24gb ram on a 3090?

[-]

AceHighness@reddit

It think so .. I think 19gb is less than 24gb. But I'm not 100% sure that's how it works.

[-]

MrCoolest@reddit

You still need space for your KV cache. I'm getting a 3090, I'll see if it works

[-]

Wow, those results are seriously impressive for a 31B model. I've been running a bunch of stuff locally on my M4 Max Mac Studio (128GB) and seeing similar trends - the newer models are just efficient. I’m mostly playing with Qwen 3.5, Gemma 4, and a rotating cast of things from Ollama.

Quantization is where it gets tricky though, especially with these MoE models. I've found Q4 is usually fine for basic chat with Qwen and even Gemma, getting around 20-25 tokens/sec on my setup. But if I’m trying to do anything with tool calling or complex reasoning, Q8 is almost essential. The difference in accuracy is noticeable, and those little mistakes can totally derail a tool use. I’m regularly using around 40-50GB of VRAM when running a Q8 30-40B model.

It’s a trade-off, obviously. Q4 saves a ton of VRAM - letting me run more models at once - but Q8 feels significantly more…reliable. I've noticed Q4 can sometimes "forget" earlier parts of the conversation, leading to weird outputs when it tries to use a tool. That said, for just quick back-and-forth, Q4 is perfectly usable.

Honestly, I’m a little skeptical of the super-low cost per run quoted here. My electric bill is definitely going up! But even if it's a bit higher, being able to get this level of performance locally is amazing, and beating GPT-5.2 at that price? That's huge.

[-]

BasaltLabs@reddit

that's an interesting system!

[-]

jkflying@reddit

How does the MoE model do?

[-]

Disastrous_Theme5906@reddit (OP)

MoE models didn't do well on our bench. Qwen 3.5 397B (17B active) only has 29% survival and negative ROI. DeepSeek V3.2 survives 62% of the time but still ends up in the red. Gemma 4 being dense and still beating all of them at 31B is honestly the most surprising part.

[-]

dampflokfreund@reddit

They are talking about Gemma 4 26B A4B.

[-]

Disastrous_Theme5906@reddit (OP)

Oh sorry, misread that. Haven't tested the 26B A4B yet, only the 31B dense. Running it now, will update the post and article with results in the next 12 hours.

[-]

Disastrous_Theme5906@reddit (OP)

Update: 26B A4B results are posted.

5 runs, 3 survived, 2 bankrupt from loan defaults. Median NW: $4,386, ROI +119%, $0.31/run. Sits at #7 on the leaderboard — above DeepSeek, Qwen, GLM, but well below the 31B dense.

Fair warning though — this is the only model out of 23 that needed custom JSON output sanitization to work at all. It makes good business decisions but can't produce clean tool-call JSON. Had to build a 3-stage sanitizer just for it. If you're using this model for agentic tasks, expect to deal with broken output formatting.

The 31B dense has none of these issues and performs 10× better. Leaderboard updated with full data.

[-]

My_excellency@reddit

Thank you!!

[-]

SummarizedAnu@reddit

Let's goo

[-]

EstarriolOfTheEast@reddit

Why not use a grammar? I believe the issue of failing to account for token boundaries should be addressed in most major libraries?

(I write my own samplers so I didn't closely track that).

Aside: 5 runs is still enough to be hit by variance. And, it's possible google trained for (not necessarily on) this benchmark. But then again, looking at positioning, it looks like what this measures is not intelligence or raw capabilities but instead pure agentic grounding and decision making in that context.

[-]

anzzax@reddit

Thank You! Next one qwen3.5 27b, please! I see bigger 397b didn't do well, but just to confirm how is dense 27b

[-]

Strange-Base8809@reddit

!RemindMe 60h

[-]

nonumlog@reddit

!RemindMe 6h

[-]

monolith2303@reddit

!RemindMe 12h

[-]

RemindMeBot@reddit

I will be messaging you in 12 hours on 2026-04-06 21:16:20 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

Ifihadanameofme@reddit

So any updates?

[-]

Turbulent-Walk-8973@reddit

It's 12hrs now :)

[-]

SailIntelligent2633@reddit

!RemindMe 13.565333h

[-]

hesperaux@reddit

!remindme 16.63774h

[-]

semibaron@reddit

!RemindMe 72h

[-]

kuhaku03@reddit

!RemindMe 24

[-]

nomnom2001@reddit

!RemindMe 12h

[-]

Shad0wca7@reddit

!RemindMe 12h

[-]

condrove10@reddit

!RemindMe 12h

[-]

JorgitoEstrella@reddit

!RemindMe 24h

[-]

SailIntelligent2633@reddit

!RemindMe 72h

[-]

Constandinoskalifo@reddit

!RemindMe 24h

[-]

RemindMeBot@reddit

I will be messaging you in 12 hours on 2026-04-06 07:59:45 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

reddotr@reddit

!RemindMe 12h

[-]

roaringpup31@reddit

!REmind me 8 hours

[-]

Koalateka@reddit

!RemindMe 12h

[-]

Candiana@reddit

!Remindme 10h

[-]

BigFuckingStonk@reddit

!RemindMe 14h

[-]

sephiroth_pradah@reddit

!RemindMe 16h

[-]

Prudence-0@reddit

!ReMindMe 12h

[-]

SummarizedAnu@reddit

!RemindMe 15h

[-]

Iwaku_Real@reddit

!remindme 9h

[-]

Its-all-redditive@reddit

!RemindMe 12h

[-]

Witty_Mycologist_995@reddit

!remindme 12h

[-]

Photochromism@reddit

I’d also love to know this

[-]

lostRiddler@reddit

!RemindMe 24h

[-]

Foreign-Beginning-49@reddit

!RemindMe 24h

[-]

deathlymonkey@reddit

!RemindMe 24h

[-]

KPaleiro@reddit

!RemindMe 24h

[-]

engydev@reddit

!RemindMe

[-]

Azrox_@reddit

!remindme 13 hours

[-]

Lorian0x7@reddit

!RemindMe 24h

[-]

FenderMoon@reddit

!RemindMe 12h

[-]

infernys20@reddit

!RemindMe 12h

[-]

FenderMoon@reddit

I'm curious as well. There are a lot of us that can run the 26B one on 16GB systems but can't really run the 31B very easily.

(Technically you CAN run the 31B on a 16GB system if you use some wacky quant like IQ3_XXS, but that's a pretty trash quant, so for all intents and purposes I'm limited to the 26B on my system.)

[-]

Objective-Good310@reddit

How are you going to fit 26GB into 16GB of RAM if the system is also taking up memory? My GPU OS barely fits, and it's not exactly fast.

[-]

FenderMoon@reddit

It's not 26GB. It's smaller much smaller than that on 4 bit quants. Because it's MoE, it performs fine if you run it entirely on the CPU and uncheck "keep model in memory".

I don't know what sorcery MacOS is doing under the hood but it seems to work better than expected.

[-]

-Ellary-@reddit

Running 31B IQ4XS at 16k KV Q8 at 10 tps, 5060 ti 16gb.
Even without thinking it is performs really good, passed all my tests.

[-]

bonesoftheancients@reddit

where can i find the IQ4XS weights? huggingface search shows nothing

[-]

-Ellary-@reddit

https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF/tree/main

[-]

bonesoftheancients@reddit

thanks

[-]

FenderMoon@reddit

I wish I could run it at IQ4. Unfortunately I can't allocate all 16GB on the Mac to VRAM (it's possible to get it to allocate about 14.5GB with terminal commands, but beyond that, it just crashes out).

[-]

Plasmx@reddit

Now you have a reason to sell your kidney and get that bigger Mac. /s

[-]

jazir55@reddit

Genuinely, how can anyone use such a small context limit? I'm over here with 1M tokens being far too small and that context limit is effectively just double of what GPT 3.5 had. I can't even conceive of the use cases for that small a context window.

[-]

-Ellary-@reddit

It all deepens on the usage, skills and tools for context compression, idk why people need 1m context, when effective range of local models is around 32-64k. If you from 8k max context era, like gemma 2 was, this is not a problem, rag, dynamic context injections similar to lorebook etc.

[-]

jazir55@reddit

I have an extremely large codebase of over 60k lines and another well over 1M lines of code. 16k context usage for those projects is impossible. The codebase is so large that it legitimately needs a large context window to actually work on the project and understand how it all meshes together.

[-]

-Ellary-@reddit

And I'm auto filing specs forms using rag for database file, even 8k is enough.

[-]

jazir55@reddit

Could you clarify?

[-]

Material_Hour_115@reddit

They are using a different workflow from yourself.

"Hey agent, ingest all one million lines of my codebase and then start making changes across multiple subsystems" is frankly not something people who are experts at LLM-driven development do very often.

The person you're responding to, instead, uses a pipeline where they automatically generate small specifications (specs) - highly detailed instructions that define exactly how to solve one particular task. Their project then becomes the sum of many lesser agents reading many small specifications and executing many small tasks and building many small functionality modules that fit together through data exchange contracts explicitly described in the specifications, rather than one giant monolith.

Their agents can use a relatively small context because no single agent ever needs to understand the whole project at once, it just has to tactically build one module at a time from one spec at a time. 16k context is 2000-ish lines of instructions or code, which is typically more than enough to add one piece of functionality, fix one bug, etc. in a well-structured project.

[-]

-Ellary-@reddit

It can even do a web search using 2048-4096 of context, just trying to get not a whole page but snippets from the page that match the search keywords.

[-]

FatheredPuma81@reddit

If MoE doesn't perform well then test Qwen3.5 27B? Other benchmarks show it on par with 122B.

[-]

RobotechRicky@reddit

Which model is best for coding? Is Gemma 4 it?

[-]

inphaser@reddit

Do you know if it has already been converted for apple MLX?

[-]

Deep90@reddit

Just a suggestion, I think it would be interesting if you started running a hidden seed for all models to see if any of them are being trained and overfitted to your benchmark.

[-]

Disastrous_Theme5906@reddit (OP)

Good idea. The benchmark is closed source specifically for this reason, so models can't train on the simulation. A few Chinese labs showed interest but we only share run data, never the simulation itself. That said, running a few random seeds to double-check is a solid idea, we'll do that. Looking at the logs though, the decisions feel organic, no signs of overfitting.

[-]

FranzJoseph93@reddit

Something that seems like a flaw to me (based on reading the Gemma 4 test results) is that as I understand it, making every possible investment is good. I think without throwing in some investments the agent should avoid, you're just rewarding aggressive behavior. The food waste issue might be another sign of this.

Super interesting benchmark though, congrats!

[-]

nuclearbananana@reddit

The benchmark is closed source specifically for this reason, so models can't train on the simulation

mind you with anthropics vending machine going viral and all, I'm sure most are training on very similar tasks.

Which is, I mean, how training works, but depending on how similar is probably benchmaxxing a bit

[-]

Disastrous_Theme5906@reddit (OP)

Yeah Google literally says they trained Gemma 4 on agentic tasks, so in that sense sure. But training on agentic data and overfitting to a specific benchmark are different things. Same way code models are good at coding — nobody calls that benchmaxing.

[-]

nuclearbananana@reddit

yeah, but code is very broad category though. If you trained extra heavily on python/flask and did better on swe-bench (which is like 90% python+flask iirc), that would be benchmaxxing, without ever training directly on the tasks themselves

[-]

florinandrei@reddit

code is very broad category though

Models got so good so quickly at coding specifically because it's such a narrow task. You may think differently about that, because biases and whatnot, but it's hyper-specified and hyper-codified and has extremely few exceptions.

Cleaning toilets is far more complex, and far harder to do.

[-]

Nervous_Variety5669@reddit

You can easily have it verified by a trusted third party. It's a flawed experiment and that is quite obvious from the comments you've posted here, the very obviously vibe statements you've made on the site, and the results you are seeing.

You guys tinkered with LLMs, have very little technical experience, are obviously not data scientists or experts in this domain, or using LLMs either.

It's neat though. But it's a novelty.

I would have a different reaction if it was peddled as entertainment rather than try to appear as if you've built a legitimate measure of model capability here.

You will fool a lot of folks here though, unfortunately for them.

[-]

Deep90@reddit

Very cool! Thank you :)

[-]

nnxnnx@reddit

Would be great to test Qwen 3.5 27B (dense) - which would be a much better "equivalent" to compare against compared to Qwen 3.5 9B that is currently the "closest model" in the leaderboard.

[-]

Phaelon74@reddit

Dense always has an advantage on Moe's. So that should bot be all that surprising.

[-]

BidWestern1056@reddit

commented this separately but same on npcsh benchmarks, it performed worse than gemma3:4b

[-]

BankruptingBanks@reddit

I think he meant gemma moe model

[-]

exact_constraint@reddit

Be interesting to see Qwen3.5 27B added to the test matrix - 31b dense vs Qwen MOE isn’t a super fair comparison, imo.

[-]

Disastrous_Theme5906@reddit (OP)

We tested Qwen 3.5 at both 9B and 397B — the 397B actually went bankrupt. More parameters didn't help. Qwen 3.5 is a great model overall, but this kind of sustained multi-day agentic task seems to hit different. Not sure the 27B would change the picture much — probably needs a generation-level jump.

[-]

StirlingG@reddit

don't underestimate the 27B!

[-]

SSOMGDSJD@reddit

Earlier in this thread you said MoEs specifically don't perform well on this bench. Why would you run Gemma 4 31b dense and Moe and then balk at running qwen.5 27B, dense gemmas obvious nearest neighbor? Qwen3.5 27b has been getting a lot of praise for its performance at its param count, would be nice to see direct comparisons between it and Gemma 4 31b.

[-]

Awwtifishal@reddit

The 397B has 17B active parameters. Maybe the active parameter counts matters much more than the total. The 27B dense has 27B active parameters.

[-]

Plasmx@reddit

That can be true, but technically you would expect the gigantic size of 397B to trump a 27B dense model with overall knowledge, but who knows, has to be tested in the end.

[-]

Awwtifishal@reddit

This specific test was not about knowledge, but about skills related to business decision making.

[-]

exact_constraint@reddit

💯. I would expect 397B to outperform on a crystallized knowledge test, considering it has.. well, more lol. And besting 9B should be expected - I haven’t found a use for it outside tasks where you can define a very narrow scope.

No shade on the testing itself, nice points to have for comparison. Just, yeah, 27B is probably the most relevant model for direct comparison. I’m biased, considering I run 27B daily. Gemma 4 31B is pretty close to a drop in, 1:1 replacement, ignoring the current issues w/ context size.

[-]

Yousef5ory@reddit

How did DeepSeekV4 do at the logical reasoning compared to it and does it stand to the Llama 4 70B or its better for creativity or no?

[-]

asevans48@reddit

Runs on a macbook so its cheaper than $.20 a run.

[-]

EuphoricAnimator@reddit

Wow, those results are seriously impressive for a 31B model. I’ve been running stuff locally on a Mac Studio M4 Max (128GB) for a few months now and have been really digging the progress. I mostly play with Qwen 3.5, Gemma 4, and a bunch of things through Ollama,Mixtral is a daily driver, naturally.

What I've found is that tool calling is so hit or miss, even with models this capable. I’ve been trying to get consistent results with a simple function to look up current weather, and Gemma 4 does noticeably better than most of the 7B/13B models I've tested. The key is really forcing a structured output,like, JSON all the way. Anything less and it gets confused pretty quickly, hallucinating parameters or just ignoring the instructions. Qwen 3.5 actually surprised me here, it's pretty good at following JSON schema even with minimal prompting.

Inference speeds are great on my setup. I can get Gemma 4 running around 25-30 tokens/second with quantization, using about 60GB of VRAM. It's not instantaneous, but totally usable for most tasks. Trying to push it too far with less quantization definitely impacts quality, especially with more complex prompts.

Honestly, benchmarks are great, but I'm more interested in how these models actually behave when you ask them to do something specific. I’m still tweaking prompts and experimenting with different techniques to get reliable outputs. It’s a fun puzzle, and seeing models like Gemma 4 perform this well locally makes it even more exciting.

[-]

petruskax@reddit

tool calling with gemma models is abbysmal, the worst I ever used even after fucking around in VLLM a LOT.

[-]

EuphoricAnimator@reddit

Yeah the native tool calling is rough. I ended up building a harness that normalizes tool calling across models — works with Gemma, Qwen, Deepseek, whatever. Handles the structured output/JSON wrangling so I don't have to fight each model's quirks individually.

https://use-ash.github.io/apex/

[-]

aristotle-agent@reddit

yikes. Question: does it feel better than those paid models?

( like is performance better feeling than sonn4.6 and gem3pro from your image?)

[-]

Disastrous_Theme5906@reddit (OP)

Genuinely yes. In terms of agentic reasoning this model is way above Sonnet 4.6 and Gemini 3 Pro. The decision quality is closer to GPT-5.2 xhigh honestly. How they achieved this in 31B params we don't fully understand yet, but Google says they specifically trained Gemma 4 for agentic tasks so that probably explains a lot.

[-]

Nervous_Variety5669@reddit

Genuinely, I do not appreciate you insulting our intelligence. Your benchmark is vibes and so are your comments. You've lost all credibility with claiming, and I will quote what you said (not that pivot you made to another commenter narrowing the scope to your vibe benchmark):

"In terms of agentic reasoning this model is way above Sonnet 4.6 and Gemini 3 Pro. The decision quality is closer to GPT-5.2/5.3/5.4 xhigh honestly."

Second, you have not provided the parameters you used for any of the models.

- What reasoning effort did you use for each model?
- Did you configure compaction? How?
- What tools were configured? I know you wrote this gem here:

"Text-based tool calling, zero friction: Gemma 4 has no native function-calling API. All 34 tools were invoked via text-based parsing. The model followed the schema perfectly — 462–488 tool calls per run with zero parsing errors."

What does that even mean? What do you mean it has no native function calling API? How did you run it?

Also, you claim this:

"All models receive the exact same system prompt— rules, tools descriptions, and simulation mechanics. No model-specific tuning or hints."

Well no wonder. OpenAI, Anthropic and Google publish prompt guides for a reason. Prompting matters when leveraging the full capability of a model. You've crippled all of them. If Gemma came out on top then all you may have proven is its less sensitive to propr prompt engineering.

Has your benchmark been verified by a third party? Where can we find it?

Once you give us the runtime configuration for each model, and how we can replicate the experiment, this ... thing, is nothing more than vibes and a blog post. Anyone can call anything a benchmark.

It doesnt make it true.

[-]

Icy_Distribution_361@reddit

From what I've heard prompt engineering isn't really much of a thing anymore. What models improve their output on is the quality of the input, i.e. enough context, clear intent. Not anything very specific to the model.

[-]

Old_Cantaloupe_6558@reddit

Since when did prompt engineering stopped meaning managing the context?

[-]

Icy_Distribution_361@reddit

My point is that it isn’t specific to a model

[-]

joost00719@reddit

My experience is way worse than even qwen 3.5 35b. It fails to even edit a json file. I mean, it does edit the file, it just fucks over the syntax.

I don't like it. I wish I could, but for programming it's kinda bad.

[-]

Prestigious-Crow-845@reddit

Fail to edit json sounds not like an issue with model itself but something else

[-]

s101c@reddit

If it fails at something so simple as syntax, something is wrong with your setup. Either unpatched llama.cpp or bad sampler settings.

[-]

joost00719@reddit

Latest Llama-server version and recommended settings from the site. The Llama-server version was so new that it broke my auto update script cuz when I checked, the newest release was 4 minutes ago.

[-]

vinigrae@reddit

It’s not the model it’s your setup or whatever host you’re using

[-]

jakegh@reddit

Qwen is definitely better at coding. That isn’t what this benchmark measures.

[-]

nyrixx@reddit

Ah yes json file syntax deep coding black arts level task right there.

[-]

gnnr25@reddit

Ahh yes, because we all know the Expert Software Developer who is also a *checks notes* successful food truck entrepreneur. See it all soon, in Chef 2: AI Boogaloo

[-]

DOAMOD@reddit

It's very funny how people believe the benchmarks hype, this is absolutely horrible. In my tests, it's much inferior and currently has very serious problems in the state Gemma4 is in. It's totally useless with tools and doesn't do a good job at all, it's completely broken. 3.5 is way ahead in anything related to coding, development, agentic tasks, tools, etc.

Where Gemma4 is clearly better is in everything related to multilingual capabilities and writing; in that aspect, it is much better. So if you want it for topics related to writing, yes it is the best, but for serious development topics, no, at least not currently. And even if all the problems are fixed, I don't think it will improve much; this is more of a generalist model.

[-]

randylush@reddit

You say the word “genuinely” a lot

[-]

Disastrous_Theme5906@reddit (OP)

lol fair, I need to expand my vocabulary

[-]

randylush@reddit

It got very popular as a meaningless meme/filler word lately like “lowkey”. Well, at least it means something. “Lowkey” essentially has no meaning at this point

[-]

nuclearbananana@reddit

AI loves the word 'genuinely' as well as similar ideas of being 'real', 'not performing' etc. This didn't come out of nowhere

[-]

bephire@reddit

Could you expand more on that?

[-]

nuclearbananana@reddit

I don't know what to expand on. I've just noticed it using these phrases/ideas a lot

[-]

bephire@reddit

Oh just, what models, the context in which they appear, what made it stand out to you. When I read your comment I was very surprised that this happened with somebody else and was wondering if this was a well known phenomenon. The model I've been using is Claude Sonnet 4.6

[-]

justgetoffmylawn@reddit

You genuinely do. :)

I'm not super familiar with your benchmark. If I'm reading correctly, right now the best humans double Opus's performance. Opus is almost double GPT 5.2 (are you adding GPT 5.4?). And Gemma 4 is surprisingly close behind.

One thing that's interesting to me - Opus and most other AI models have extremely minimal waste, no matter the ROI. Gemma 4 seems to have much higher waste, but good ROI - which seems more similar to human responses?

Anyways, I only skimmed so I may not be understanding - just curious your thoughts.

[-]

Disastrous_Theme5906@reddit (OP)

haha genuinely sorry about that :)

5.4 isn't on the leaderboard yet. We tested it when it launched but the API was extremely slow in xhigh mode and costs several times more than 5.2. Generates a massive amount of tokens. From what we saw it's better than 5.2 but not by much, not enough to justify the cost increase. Postponed full testing for now.

On humans — best players can beat AI models after 2-3 tries. Getting to the overall #1 on the leaderboard took the top player about 10 runs though.

The waste thing is a great observation. Humans have the same problem — they get lazy with math and don't calculate exact portions. Gemma just physically can't do that math as well as bigger models like Opus. It knows it's wasting food, writes about it every day, but can't fix it. Bigger models with better arithmetic just don't make that mistake.

[-]

justgetoffmylawn@reddit

So that was 5.2 on xhigh, correct?

That's interesting on human performance. And if you trained a model on it, I'm sure it would improve - so doesn't really mean that humans are better.

It knows it's wasting food, writes about it every day, but can't fix it." One of us, one of us.

But seriously - just tried Gemma 4 31b for the first time on a complex medical question where I've used Opus and Gemini 3.1 Pro and GPT 5.4 - and Gemma 4 was shockingly good. Like I keep forgetting that Gemma 4 31b would've been a frontier model not that long ago. Maybe it still is?

Gotta use it more, but didn't expect it to be this good even when I heard the hype - thought it would be narrow knowledge.

[-]

jakegh@reddit

On reasoning capability it’s quite good, and it handles tool calls well. But frontier models have vastly larger world models, which makes them more intuitive and handle ambiguous prompts better.

Gemini 3 is particularly strong at that. It’s a huge model. BUT, it sucks at coding.

[-]

jakegh@reddit

I thought this is an agentic benchmark— why wouldn’t it use tool calls for that sort of thing?

I don’t really see why the ability to do arithmetic is valuable.

[-]

Venium@reddit

quality is closer to GPT-5.2/5.3/5.4 xhigh

lol, lmao even.

[-]

johnnyXcrane@reddit

I really really doubt that. Perhaps in some specific use cases. But I have not yet tested it so I am not saying you lying. I just read that so often here and in tons of benchmarks and all always were way worse than SOTA.

[-]

Disastrous_Theme5906@reddit (OP)

Fair skepticism. We're not claiming it beats SOTA at everything, just on our specific agentic benchmark. The results are public with full day-by-day logs and you can verify the runs. It's definitely not matching Opus 4.6 or GPT-5.2 in overall capability, but for structured multi-step decision making at this price point the gap is way smaller than expected.

[-]

Ardalok@reddit

It feels better than Gemini 3 Flash, or at least on par.

[-]

DarkArtsMastery@reddit

Vibes are fine

[-]

ZucchiniEfficient978@reddit

i tried e4b for openclaw and it was really bad, any suggestions?

[-]

jimmytoan@reddit

Have you published the benchmark methodology anywhere, and are you planning to test the MoE variant of Gemma 4 to see if the dense model's performance advantage holds up?

[-]

ICanSeeYou7867@reddit

Dude, this is a great benchmark. Everything seems to be benchmaxxed these days. Great idea.

[-]

andber6@reddit

Wow i need to try this

[-]

bithatchling@reddit

The 31B vs 26B A4B tradeoff is one of the more interesting parts of this release for me. The MoE variant only activates ~3.8B parameters per token, so the fact that it can land this close to 31B on evals is kind of wild. What stands out here is not just the score, but how badly it breaks the old habit of using parameter count as a proxy for quality.

[-]

tmyx0m0p@reddit

what about GPT-5.4/mini?

[-]

idkedu@reddit

Gemma 4 can run on mobile devices as well. I have created some skills which I found useful for myself. I have them public for everyone. https://github.com/StrinGhost/gemma-skills

[-]

joeyhipolito@reddit

tried it on my orchestrator for planning tasks, held up surprisingly well for 31B. starts getting weird with long tool call chains though, hallucinates tool names around step 6 or 7. still testing. what's the tool call depth looking like in your food truck sim?

[-]

hesperaux@reddit

Thanks for the info! Very helpful. Appreciate you going back and testing 26b a4b.

[-]

TheRiddler79@reddit

100% agree. I started using it yesterday as my 2nd level and it punches almost up to my Qwen 3.5 that's 12x larger

[-]

dev_l1x_be@reddit

Claude is committing suicide with their current approach to fuck over the user base while at the same time somehow managed to make Opus worse at least for the project I am using it for.

[-]

Keinsaas@reddit

Connect it to our keinsaas navigator🙌

[-]

_derpiii_@reddit

Question: is there a framework/harnesses to build these kind of benchmarks? Or are people vibe coding custom harnesses?

I see a lot of these benchmarks and don't know where to even begin.

[-]

BidWestern1056@reddit

i gotta try the 31b cause the e4b did p dogshit in my npcsh benchmarking, doing even worse than the gemma3:4b strangely, might try re-running it but was surprised.

[-]

Disastrous_Theme5906@reddit (OP)

Same experience here. The A4B variant leaks native special tokens into tool call JSON, couldn't even complete our simulation. The 31B dense is a completely different model in terms of quality. Definitely try it.

[-]

BidWestern1056@reddit

ya the gemma4 31b was solid

[-]

z_latent@reddit

Just pointing out, E4B and [26B] A4B are different models. Which one did you mean here?

[-]

BidWestern1056@reddit

good to know, working on training a native-complex model that's eating up most of my gpus atm but am planning to run it through my npcsh benchmarks after that's done.

[-]

Adventurous-Paper566@reddit

Gemma 4 is the first local model I can run on 32Gb of VRAM without having to correct it.

I'm talking this it, with an average stt time of 2 minutes per input, and he never disgress or misunderstood the subject of the conversation. In French.

I'm waiting the 124B MoE with impatience!

[-]

redditorialy_retard@reddit

me with 24 GB :(

[-]

Adventurous-Paper566@reddit

Q4_K_XL still good 👍

[-]

redditorialy_retard@reddit

shi thanks

[-]

Plasmx@reddit

You can just remove the mmproj file? I thought it was a bigger effort when they said to remove vision for less VRAM usage!

[-]

Adventurous-Paper566@reddit

Yes, or just rename it with the .gguf.bckp extension to disable it.

It's only easy with gguf quants.

[-]

bacocololo@reddit

Which quantization model do you use please ? Merci

[-]

Adventurous-Paper566@reddit

Q6_K_XL bartowski, without the mmproj I can reach 20k context.

Q4_K_XL with the same parameters loads with 65536 tokens but I haven't tested its limits yet.

6

[-]

Maleficent-Ad5999@reddit

Do you find Gemma 4 performing better than Qwen 3.5 27b?

[-]

Adventurous-Paper566@reddit

Yes, for my usage 31B performs better.

[-]

Rude_Ambassador_6270@reddit

well, okay, can I put it to fx trading then?

[-]

DetouristCollective@reddit

Do you have any plans to compare it to another comparable dense model like Qwen3.5 27B?

[-]

Disastrous_Theme5906@reddit (OP)

Already tested Qwen 3.5 9B and 397B — the 397B went bankrupt, bigger didn't help here. The 27B would likely land somewhere in between. Great model family overall, just not at Gemma 4's level for this type of task yet.

[-]

my_name_isnt_clever@reddit

Not wanting to test the most modern and directly comparable model to Gemma 4 31B is a strange choice OP.

[-]

pile-of-V100s@reddit

27B has far more active parameters than both 9B and 397B-A17B

[-]

kavakravata@reddit

Can i run it with a single 3090? 😁😁

[-]

LanceThunder@reddit

on my 3090 the 30b is slow or crashes my system. the 26b goes a little slow but not as slow as i would expect for 26b. 4b is pretty good.

[-]

misha1350@reddit

4B on a 3090 is such a waste, 31B would run well on 24GB VRAM. You can use a 3090 or the Intel ARC Pro B60 24GB easily.

[-]

raindownthunda@reddit

Yes, 31B at Q5_K_M runs on my 3090 but it’s on the slower side. The quality is pretty fantastic though! I am going to experiment with 26B A4B next. E4B is insanely fast and surprisingly good, but 31B of course blows it away. Hoping 26B A4B is a good balance of quality and speed for the 3090.

[-]

misha1350@reddit

Try UD-Q4_K_XL as well, or the UD-IQ quants (since you have an NVIDIA card).

[-]

raindownthunda@reddit

Thanks for the tip! Will try those quants. Do you know if imatrix gguf is preferred or standard gguf (thinking about some of the Q3.5 models)? I’ve admittedly gotten a little lost in the different methods.

[-]

misha1350@reddit

Imatrix seems to work well for CUDA, whereas regular gguf and other models are universal and work equally well across all hardware.

[-]

raindownthunda@reddit

Thank you - I appreciate it

[-]

LanceThunder@reddit

i did get the 31B to work for a little while but it was still too slow for my liking.

[-]

z_latent@reddit

The 26B one is an A4B MoE, so it's supposed to be near 4B speeds, assuming it fit your VRAM without CPU offloading. What quant were you running it with?

[-]

Spectrum1523@reddit

Sure

[-]

YetiTrix@reddit

Gemma 4 didn't really work for my use case. Which is diagnosing PLC Code. Qwen-Coder-Next still does best job for that.

[-]

Embarrassed_Adagio28@reddit

Yeah qwen 3 coder next is still my best local coding model. It is great at tool calling and with the right project structure, can run for hours at a time without stopping to ask questions. Even with the iq3_xss quant.

[-]

Disastrous_Theme5906@reddit (OP)

Makes sense, 31B is still a small model and can't be great at everything. Our benchmark tests agentic decision-making, not coding. For PLC diagnostics and dev tasks there are definitely better options at this size. Qwen-Coder is solid for that.

[-]

ceo_of_banana@reddit

How does it test that? I've heard the world many times but I'm still not sure what it means.

[-]

vinigrae@reddit

Logic

[-]

Ryukish@reddit

Have you tried a mix of skills + more detailed prompts. I usually find that open source models need me to be more explicit to get opus level performance

[-]

YetiTrix@reddit

PLC code is especially hard because it's not really trained on especially text representation of ladder logic. It doesn't exist out on the internet. The models have to infer a lot more of the meaning. Yes, I do A/B testing with my prompts.

[-]

Ryukish@reddit

I agree with that, it feels like the harness we use matters a lot more than the models sometimes. Especially IF it isn't a huge field ai is familiar with. I did find processing books(related to my field) into markdown files and referencing them did help performance though.

[-]

Ok-Secret5233@reddit

OP, how do you run Gemma 4? I haven't managed it. Hugging face has only safetensors (no gguf) and the llama.cpp convert script errors with "failed to detect model architecture"...

[-]

danhoel999@reddit

As someone who only starts with local models: is there a guide on how to use for example this 31B model of Gemma?

[-]

DeepOrangeSky@reddit

What does the mean (instead of median) result look like for it compared to these other models?

Also, how extreme is the variance between the runs (for the same model vs itself over the 5 runs)? Like is there some way of expressing the severity of the volatility experienced over the course of the 30 days like standard deviation of volatility per 1-day segment or per 5-day segment across the runs to give a sense of the volatility severity? Like how big are the jumps and dips on the graph as it goes along? Are they severe enough that it would need a lot more than just 5 runs/30 days for it to mean much, or are the size of the wobbles small enough relative to the overall run that it ends up being ultra-meaningful even when taking variance/volatility into account? (I assume it's at least somewhat volatile given that some of the overall profitable models are going broke on some runs, meaning its enough volatility that they are dying in the early phase some significant percentage of the time).

Also, since the models all start with a starting amount of just $2,000 (which is a proportionally fairly small starting amount relative to the ending amounts of money over the course of the month), if the volatility/variance going on during these runs is fairly high, and even models that on avg are doing fairly well overall sometimes just go broke in the early portion of their runs if the volatility dips below 0 during the dangerous early part of the run when they start with such a small amount of money relative to the volatility size, I was wondering if you have considered doing an additional version of this test where you basically "allow the models to go broke" (and basically receive a bailout if they do, like, let's say if one of them goes broke on day 10 or something, you just reboost it back to $2k and let it continue its run but still note the run as a "it went broke" run with an asterisk next to it, but this way you can get more data on the runs if like half the time these models go broke in their early portion of their runs or something, you could get like ~1.1x-1.5x as much data if you did it that way, for example. (The idea of why this is different than just having it do a few extra runs from scratch and noting how many extra runs they did being that you also get to keep the data of what it was doing before it went broke on the runs where it went broke in the analyses of the runs afterwards. Although maybe you are still including data from all aspects of the runs where they go broke anyway, in which case maybe it wouldn't make much difference to any of this?). Also, if you did this, it should go without saying that the models would still need to think that going broke meant going broke, though (obviously don't tell the models that they get to receive a bailout and continue their run if they go broke, otherwise it would affect their strategy since they would try much riskier strategies if they knew they'd get a bailout if they went broke).

[-]

Disastrous_Theme5906@reddit (OP)

We use median because some models have wild variance — one good run out of five. Gemma 4 is actually pretty consistent, ROI across 5 runs ranges from +457% to +1,354%, all profitable. Models that go bankrupt usually blow up in the first week because of bad inventory management, not bad luck.

On the bailout thing — there's a loan system in the sim for this. Models can borrow to recover from a rough start. Doesn't help though. Weak models keep making the same mistakes and go bankrupt anyway, loan or not.

[-]

SummarizedAnu@reddit

Will you make a YouTube animated 3d/2d video about this?

[-]

MoodDelicious3920@reddit

Why AI generated answer? Ur answer contains 2 em dashes.

[-]

Sea-Spot-1113@reddit

Did you know -- humans also use em dashes?

[-]

MoodDelicious3920@reddit

But the dashes u used are -- which are there in keyboard (both phone and PC) but the long em dash — isn't generally available unless u copy paste that symbol , like i did just now

[-]

Party-Special-5177@reddit

I’m on iPad right now I can make em dashes on command with double hyphens, it just autocorrects them: —

And I use them too as I like them, just not on Reddit as you guys have a psychosis about them lol

[-]

CallmeAK__@reddit

That 31B dense model really hits the sweet spot for unit economics. I’ve found that for agentic workflows, the "perception" of tool outputs is usually what kills the smaller variants, so it makes sense that the 31B handles the reasoning loops without the JSON formatting issues.

[-]

Happy-Register3367@reddit

This feels almost too good to be true 31B + $0.20/run beating stuff 10-100x more expensive is kinda insane if it actually holds up outside this benchmark.

[-]

GroundbreakingMall54@reddit

gemma 4 really is a different feel

[-]

Loose_Object_8311@reddit

We need to give them real food trucks to run and benchmark against that.

[-]

FenderMoon@reddit

I've been using the 26B A4B one, and I've been blown away. First local model I've ever used that genuinely feels smart enough to replace ChatGPT for daily stuff.

I did have to get reasoning enabled by modifying the templates. For some reason none of them have the reasoning working out of the box, the model is way worse without it in LMStudio.

[-]

Strange-Base8809@reddit

what sort of use case that you are validating against?

[-]

Hot_Ferret7431@reddit

I don't understand. Is this model really better than Sonnet 4.6? I don't understand how a 32GB model I can run on my machine is better than a multi billion dollar model

[-]

Sky-Asher27@reddit

the 4b param is great too

[-]

Force88@reddit

I tried gemma4 26 and 31b, while fast, it seems to not handle unknown knowledge or web search.

I tried to ask it to find me latest news, that its knowledge base doesn't have, like details of nvidia 5000 gpus, but it said they are not out yet and only leaks shown that 5090 will be very powerful.

The same question will be answered correctly by qwen3.5 though.

I don't know if I do anything wrong, I just pull it from ollama app in windows and chat though.

[-]

ashlord666@reddit

did you connect it to any mcp servers like mcp-server-fetch? If not, how do you expect it to be able to go online?

[-]

Force88@reddit

Nope, ollama client by default support web search, at least every qwen model I used does.

Also, what's a mcp servers? I that other softwares I need to run AI?

[-]

Far_Cat9782@reddit

Mcp servers are the tools. Pretty much python scripts that run http servers in the bg that. It's a set standard that allows your AI to use tools like web search web fetch image generation etc; it lets it interface with outside programs the "bridge." So u can make one to control "blender" or connect to comfyUI to generate images of audio etc; def. Look it up easy to mplement

[-]

year2039nuclearwar@reddit

I’m also interested to find out the answer to this

[-]

Odd_Mortgage_9108@reddit

Wait, if you have a food truck simulation, is it solving an optimisation problem? Maybe a traveling salesman problem? I'm wary of "model X does well in benchmark" if the benchmark is very specific.

[-]

silentus8378@reddit

When you do comprehensive benchmarking, qwen3.5 27b is still better than gemma 4 31b.

[-]

year2039nuclearwar@reddit

Where can we find details on this?

[-]

VoiceApprehensive893@reddit

sometimes benchmarking results are just funny

yes its the moe thats beating sonnet not dense

[-]

EugeneSpaceman@reddit

Huge margin of error on all those scores. They all overlap with each other if you take that into account

[-]

OmarBessa@reddit

how

[-]

GrungeWerX@reddit

Why isn't Qwen 3.5 27B in this testing? That's the only fair comparison to the 31B as they're both dense models...

[-]

Disastrous_Theme5906@reddit (OP)

Getting a lot of Qwen 3.5 27B requests in this thread. We tested the 9B and 397B — both well below Gemma 4 on this task, the 397B went bankrupt. I run this project on my own time and money, so I can't cover every model from every lab. If you want to see Qwen 27B tested — ping u/Alibaba_Qwen on Twitter or tag them in r/LocalLLaMA. If a lab shows interest, I'll run their models and publish everything.

[-]

EugeneSpaceman@reddit

It’s the most direct comparison to Gemma 4 31b, and considered by many to be ahead of it in several domains.

Would be a big omission not to include it.

[-]

Negative-Web8619@reddit

The first one to benchmaxx on ftb

[-]

Iwaku_Real@reddit

How can you benchmaxx if you don't have the actual test data

[-]

Negative-Web8619@reddit

It's a joke

[-]

Digitalzuzel@reddit

google has access to conversation logs of gemini models..

[-]

Beckendy@reddit

Seriously, you have gpt-5.2 on a second place? Where gpt-5.4?

[-]

Ayuzh@reddit

what's your setup for running these?

[-]

RevolutionaryGold325@reddit

What was the context size and how much did it take memory?

[-]

Specialist_Golf8133@reddit

wait people are still sleeping on gemma? the price/performance here is actually insane. like everyone's gonna keep throwing money at the big models while this thing is just sitting there at 31B doing 90% of the work for pennies. kinda feels like the gap between 'good enough' and 'perfect' just became way more expensive than most workflows actually need

[-]

one-escape-left@reddit

from your blog post: "Qwen 3.5 9B (bankrupt tier, $0.15/run) — the closest model in parameter count and price"

This is incorrect. Qwen 3.5 27B is the closest dense model in the family. Have you considered running that model?

[-]

Disastrous_Theme5906@reddit (OP)

Fair — "closest from what we tested" would've been more accurate. We tested the 9B and 397B from the Qwen 3.5 family, both endpoints of the range. The 397B went bankrupt. Can't realistically test every variant from every lab — each model needs 5 full 30-day runs for reliable medians. If anyone has contacts at Qwen's team and they're interested, happy to run it and publish the results.

[-]

ZeitgeistArchive@reddit

is there a dense thinking gemma 4 31B? I tried the 31B instruct version and it was ok, but not great for my knowledge and reasoning goals

[-]

Swimming_Gain_4989@reddit

31B is a thinking model, if you're not seeing thinking tokens your provider is misconfigured

[-]

FenderMoon@reddit

It has to be enabled by changing the JINGA templates in LMStudio. They haven't fix that yet.

[-]

GrungeWerX@reddit

Will changing the jinga template cause it work? I tried enabling true, but it thought for maybe a sentence and then immediately started its output. And it looked weird.

Example above. I'm assuming it needs to be fixed internally?

[-]

Disastrous_Theme5906@reddit (OP)

Same issue on our end. The 27B A4B MoE variant leaks <|\ tokens into tool call JSON — every string value comes out as "<|\ground_beef<|\"|" instead of "ground_beef". Had to write a regex sanitizer to strip these tokens just to get it through a benchmark run. The 31B dense doesn't have this problem over API, but A4B is rough.

[-]

Warthammer40K@reddit

Gemini 4 Pro gonna crush whatever the hell this benchmark is.

[-]

dancinpants@reddit

No one is using local LLMs for serious coding. I tried Gemma 4 with OpenCode and it got stuck in an infinite loop trying to search for a file. This tech ain't ready yet.

[-]

phazei@reddit

I've seen a lot of praise for this model. But on most of the comments people are saying it's just benchmaxing. What do you say to that? That all the tests are in the training data?

[-]

Disastrous_Theme5906@reddit (OP)

The benchmark is closed source specifically so models can't train on it. No lab has access to the simulation internals. Looking at the logs, the model makes organic decisions — it adapts to events, changes strategy mid-run, makes mistakes and recovers. Doesn't look like memorization.

[-]

GenerallyVerklempt@reddit

Does that mean your results are not reproducible by anyone except you? We just have to take your word for it?

[-]

Digitalzuzel@reddit

Just curios, what is your solution to benchmaxing?

[-]

phazei@reddit

Nice! Can't wait to try it myself, I've been pretty astounded by Qwen 3.5 already, having something else come out so soon after that's even better is awesome.

[-]

Honest-Debate-6863@reddit

Where to find the codebase of the harness

[-]

ConsiderationHot814@reddit

This is a fascinating breakdown! The cost-to-performance ratio of Gemma 4 (31B) compared to frontier models like GPT-5.2 and Opus 4.6 is truly impressive. It's interesting to see a dense model outperforming MoE architectures in this specific agentic simulation. Looking forward to seeing the results for the 26B A4B version as well!

[-]

Euphoric_Emotion5397@reddit

unfortunately, I'm having trouble getting it to do tool calling and instructions following in my LM studio :( The prompt works totally fine with Qwen 3.5.

[-]

AgitatedHearing653@reddit

What a clever idea having ai compete with each other to run a fictional business. Game theory at its finest. Kudos on this.

[-]

JohnMason6504@reddit

26B with 4B active per token. Running Q8 on Jetson Orin at 40 tok/s. Apache 2.0 license seals it.

[-]

Few-Beyond785@reddit

!RemindMe 12h

[-]

Maralitabambolo@reddit

16 or 8bit?

[-]

Even_Minimum_4797@reddit

This is underrated

[-]

gpt872323@reddit

opus 4.6 on multiple leaderboard is not number in just benchmark wise otherwise I am the biggest fan of Opus. Just saying it cannot be this much discrepancy.

[-]

Ylsid@reddit

Waiting for the minebench, the real test of skills

[-]

Conscious_Nobody9571@reddit

Better than sonnet? No way

[-]

Murder_Teddy_Bear@reddit

I'm really happy with it, can't wait for the eventual uncensored release. ; )

[-]

Natrimo@reddit

Hauhau has one out already

[-]

The_Choir_Invisible@reddit

So far I've tried the e2b and e4b (meant for mobile) versions and they are uncensored to an extent that I haven't seen since wizard-vicuna-uncensored. I hope they work well with AnythingLLM because I'd like to use them for agentic tasks.

Also for anyone downloading the e4b quants, check out the _P versions!

[-]

Murder_Teddy_Bear@reddit

Oh shit! That was quick, thanks.

[-]

Natrimo@reddit

Let me know how it works, haven't tried it myself

[-]

Acceptable_Home_@reddit

yo, im making something same can i have some tips, a noodle shop sim for bias detection on how many LLMs will start capitalising when i tell them there's no proper way to run the shop or win they're free

Rn it has 4 diff suppliers with loyalty and mood system, different type of noodle stock, many moral events, reputation based on stock, cleaniness etc and fatigue system, rent, supply chian inflation or breaking of supply chain due to storms, etc. i would love your opinion on this :)

[-]

AdUnlucky9870@reddit

This is the part that keeps surprising me every quarter — we keep thinking we've hit diminishing returns, then something like this drops.

What I'm curious about: is anyone running this at scale in production yet? The benchmarks look great, but I've been burned before by models that crush evals but fall apart on messy real-world inputs. Would love to hear from anyone who's stress-tested it beyond the leaderboard tasks.

[-]

somerussianbear@reddit

I don’t get it. Several benchmarks posted here and all over the place have been showing Qwen 3.5 dense beating the Gemma counterpart, not by much, but beating it. But then in other benchmarks Gemma beats everything and Qwen is not even in the picture. I’m a happy user of both, so no rage, just wanna understand really.

[-]

yaboyyoungairvent@reddit

I think it's cause smaller models can't be good at everything like with larger models. They can be good at select things. It seems the consensus on here is that qwen performs well when it comes to coding but if your use case is for specifically agentic tasks then gemma is better.

[-]

SexyAlienHotTubWater@reddit

Honestly I think this shows that the metric is not particularly good.

Try talking to it, get it to solve some tasks. Gemma is way dumber than sonnet 4.6, kimi K2, Qwen 3.6, 3.5, Gemini (which it was probably distilled from)...

[-]

lobehubexp@reddit

Are you factoring in total inference cost or just per run pricing

[-]

Quillshade36@reddit

!RemindMe 12h

[-]

totonn87@reddit

I have to buy a new laptop, does gemma4 26b works on a macbook air m5 24 gb of ram?

[-]

PattF@reddit

26b works great, 31b not so much. 26b is great too though.

[-]

totonn87@reddit

But does not fit in 16 gb of ram, right?

[-]

PattF@reddit

26b will, even with a high context. 31b will but with less than 1k context and like 3-7 tps

[-]

Street_Ice3816@reddit

gemma is not that good

[-]

citrusalex@reddit

I've observed the same doing a Home Assistant bench.

[-]

m98789@reddit

How does it compare to GPT-OSS-120B?

[-]

itsjase@reddit

Tell me I shouldn’t trust your benchmark without telling me I shouldn’t trust your benchmark

[-]

virtualunc@reddit

$0.20 per run vs $7.90 for sonnet is insane if these numbers hold up across other benchmarks too.. open source catching frontier models at 1/40th the cost is the real story here

[-]

DonnaPollson@reddit

The interesting signal here isn’t just raw quality, it’s price elasticity. Once a model gets good enough for multi-step work, a 20x cost delta changes behavior more than a small benchmark gap because people start routing entire classes of tasks to it by default. The real test now is variance across prompts and tool stacks, not whether it can win one leaderboard headline.

[-]

ortegaalfredo@reddit

I had the same experience. Just did a benchmark expecting it to be dumber than Qwen 3.5 27B, but it actually was near 397B in performance.

[-]

TQMA@reddit

!RemindMe 24h

[-]

MrCoolest@reddit

Is this 4b quantized?

[-]

Leonjy92@reddit

!RemindMe 24H

[-]

Leonjy92@reddit

!RemindMe 24h

[-]

redballooon@reddit

Casually, hu? Can't wait to see results of when it tries earnestly.

[-]

Tough-Intention3672@reddit

Where are GPT 5.3, GPT 5.4, which are smarter than GPT 5.2?

[-]

NNN_Throwaway2@reddit

What inference backend did you run it with?

[-]

trusty20@reddit

What backend are you using for gemma? llama.cpp?

[-]

LanceThunder@reddit

i was working on some javascript with Qwen 3.5 9b and Gemma4 26b. the Qwen 3.5 model did a better job.

[-]

Roubbes@reddit

Which quants did you use?

[-]

Disastrous_Theme5906@reddit (OP)

No quants, we run through OpenRouter API — full weights, thinking mode enabled. https://openrouter.ai/google/gemma-4-31b-it

[-]

xplode145@reddit

It’s so slow on my m5 pro 64gb ram

[-]

Nervous-Positive-431@reddit

I am thinking of getting one of those bad puppies, how many tokens are you getting?

[-]

DroopyMcDoo@reddit

This looks interesting af but I have no idea what’s going on here. Could someone explain?

[-]

Disastrous_Theme5906@reddit (OP)

AI models run a simulated food truck business for 30 days — they choose locations, set menus, buy ingredients, hire staff, manage money. We compare how well different models handle it. Leaderboard at foodtruckbench.com, you can also play it yourself.

[-]

Rich_Artist_8327@reddit

Grok doing pretty bad. Was Pentagon driven by Grok?

[-]

Disastrous_Theme5906@reddit (OP)

Yeah Grok was disappointing. I think Elon knows — hopefully they come back with something stronger. Would love to see them competitive again.

[-]

GanacheValuable2310@reddit

The fact that qwen 397B couldn't even survive consistently but this 31B does every time is crazy

[-]

Rich_Artist_8327@reddit

Where do you get this 0.2$ run? What is that value?

[-]

Enough_Leopard3524@reddit

It’s good to know the open source models are improving. It’s a cold day in hell when I use only paid LLM models. They were trained on public knowledge, used by the public - just like the internet. I will always support this type of behavior from Google or any other organization. AOL learned the hard way, fafo.

[-]

traveddit@reddit

Gemma 4 has no native function-calling API.

This isn't true. Gemma 4 has its own native function calling template that are baked into the tokenizer.

Gemma4 special tokens for tool calls

 TOOL_CALL_START = "<|tool_call>"
 TOOL_CALL_END = "<tool_call|>"
 STRING_DELIM = '<|"|>'

[-]

Disastrous_Theme5906@reddit (OP)

You're right, my bad. Gemma 4 does have native function calling tokens. We run it through OpenRouter which handles the conversion to OpenAI-compatible schema on their end, so we didn't interact with the native template directly. Updated the article, thanks for catching that.

[-]

ScoreUnique@reddit

I am running 31B on opencode attached to paperclip ai. I find paperclip ai struggling with small MOEs, the only models that didn't fail miserably were Gemma 4 31 and Moe models. Google came to claim the goat title for local models it seems

[-]

RealAggressiveNooby@reddit

How does Qwen 3.5 with similar params compare to Gemma 4? Has anyone here messed around with both (for general applications and for coding respectively)?

[-]

Disastrous_Theme5906@reddit (OP)

We haven't tested Qwen3.5-27B specifically. The closest we have is Qwen 3.5 9B (0% survival, bankrupt in \~14 days) and Qwen 3.5 397B with 17B active params (29% survival, negative ROI). Even the 397B version couldn't come close to Gemma's results, so honestly not sure what the 27B would do. Can't speak to coding, only agentic tasks on our bench.

[-]

Disastrous_Theme5906@reddit (OP)

Yeah fair point, it's not on the main leaderboard table yet. Cost data is in the individual case studies but should probably be a column on the main page too. Adding it to the list.