Gemma 4 31B beats several frontier models on the FoodTruck Bench
Posted by Nindaleth@reddit | LocalLLaMA | View on Reddit | 113 comments
Gemma 4 31B takes an incredible 3rd place on FoodTruck Bench, beating GLM 5, Qwen 3.5 397B and all Claude Sonnets!
I'm looking forward to seeing how they'll explain the result. Based on the previous models that failed to finish the run, it seems Gemma 4 handles long-horizon tasks better and actually listens to its own advice when planning for the next day of the run.
IntelAmdNVIDIA@reddit
Previously it was "qwen3 is an Opus distillation", and now it's "Gemma 4 is an Opus distillation"
LocoMod@reddit
This is just evidence the FoodTruckBench is a flawed benchmark and not to be taken seriously. It is not published, has not been verified by trusted third parties, and no one knows how they configured the models.
Vibes get votes though. That's all that matters anymore apparently.
masterlafontaine@reddit
Probably trained on it
AnticitizenPrime@reddit
The food truck bench is just a month or so old. This model was probably already in internal testing when it came out.
kvothe5688@reddit
yeah, benchmax every task in the world. maybe that's how they achieve agi
nomorebuttsplz@reddit
That's essentially Dario's vision for AGI and honestly it makes more sense to me than some hypothetical special sauce.
Dead_Internet_Theory@reddit
The problem is, the real world still has trillions of highly specific benchmarks that just aren't called that and don't get scored for points.
nomorebuttsplz@reddit
But he's not saying that there's no generalization, he's saying that it's limited, both within a domain and across domains. Which is demonstrably correct, and why we can't have a really smart coding model that doesn't know anything except code.
Dead_Internet_Theory@reddit
Fair, but we can have a really good coding model that's really terrible for creative writing. And we can have a chess engine which can only do chess and nothing but chess. I'm not saying generalization isn't a thing, but it would be really weird if we end up with an AI that can do everything there is to do, and still isn't general intelligence.
AlwaysLateToThaParty@reddit
Yes. Humans also have the same specialisation. I think it will be a long time before we get a system that does it all, but more and more domains will have more systems to 'think' in those domains.
SkyFeistyLlama8@reddit
Make a bunch of smaller models that call each other for different tasks. RAM is all you need.
rainbyte@reddit
Deterministic algorithms can also be thrown into the mix. Why use models for problems that can be solved 100% correctly with a function?
Bac-Te@reddit
That's called agent Skills
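A minimal sketch of that deterministic-first idea, with an exact arithmetic evaluator in front of a model fallback (the `llm` callable and the routing rule are illustrative assumptions, not anyone's actual stack):

```python
import ast
import operator

# Route arithmetic to an exact evaluator; only fall back to a model.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arithmetic(expr: str) -> float:
    """Exactly evaluate a pure arithmetic expression, or raise ValueError."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not pure arithmetic")
    return walk(ast.parse(expr, mode="eval"))

def answer(query: str, llm) -> str:
    # Deterministic path first: exact, cheap, and auditable.
    try:
        return str(eval_arithmetic(query))
    except (ValueError, SyntaxError):
        return llm(query)  # hypothetical model fallback
```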
eli_pizza@reddit
But not as much sense as: “AGI is a fairytale we told Wall Street.”
nomorebuttsplz@reddit
make a prediction about what AI won't be able to do in a year. If you can't, stfu
eli_pizza@reddit
Write an original joke
gtxktm@reddit
There's always something unexpectedly simple it can't do
IrisColt@reddit
There are far more possible 'tasks in the world' than a brain unable to think about niche tasks can even wrap its head around.
Mescallan@reddit
I think it's more *until we find the special sauce.
Also, whenever this comes up I want to point out that it also means we have basically absolute capability control over the current architecture; when we do find that special sauce, we'll still have current LLMs to do most of the work
bwjxjelsbd@reddit
What if we want AGI to try and invent new things?
If they're benchmaxxing, can it even invent something completely new like humans did?
MoffKalast@reddit
We need to start making benchmarks faster than they can train on them. If everything is a metric, then nothing is.
WhoTookPlasticJesus@reddit
I mean, that's the source of 99% of the top 1% of college entry exam scores.
Due-Memory-6957@reddit
100%*, I'll be damned if you can find one example of a person in the top 1% who didn't train on it.
dual_basis@reddit
Sure, but in the case of humans the fact that you were willing and able to successfully train for a particular test is in and of itself evidence of qualities and abilities which are likely to make you better at what comes next in that discipline. Not necessarily the case with LLMs, where I could train an LLM on the test and it will still fail miserably at other things.
SpicyWangz@reddit
The problem is humans are general intelligence partly because we continue training forever
gamblingapocalypse@reddit
Is there a way we can prove that?
TheRealMasonMac@reddit
Make a new benchmark that is a twist on this one. If the model was trained on this, it will have an inductive bias and will struggle to generalize well outside it.
MoffKalast@reddit
Perplexity measures, maybe?
deejeycris@reddit
Perplexity is a useless measure on its own, it doesn't predict how well a model understands a text.
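For what a perplexity probe could look like in practice, here is a rough sketch using Hugging Face transformers; `gpt2` and the sample string are stand-ins, and as the comments note, a low score is a hint of memorization, not proof:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Perplexity of a model on a benchmark's text. A suspiciously low value
# relative to comparable held-out text can hint at contamination.
model_name = "gpt2"  # stand-in; swap for the model under suspicion
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids -> mean next-token cross-entropy loss
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("Day 3: restock buns, park near the stadium before noon."))
```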
Ok-Contest-5856@reddit
Right? This just looks like Chinese companies don't bother benchmaxxing this one but American companies do.
c00pdwg@reddit
They all benchmaxx. This one must be more Western-specific
Clairvoidance@reddit
i think that's what they were communicating
Deep90@reddit
Honestly not a bad thing, since that likely translates to better performance outside of the benchmark.
dubesor86@reddit
It also scored very high in my own general purpose testing and outperformed many significantly larger models on my chess benchmark. Seems like a genuinely good model, though obviously use whatever fits your use case best.
Nindaleth@reddit (OP)
Nice to see that the performance remains unexpectedly good in private benchmarks!
Winnin9@reddit
Benchmaxing is the new issue we have
Cradawx@reddit
Funny how Gemini 3.1 Pro has 77.1% on ARC AGI 2 compared to 31.1% on Gemini 3.0 Pro. Claude Sonnet 4.5 scored 13.6% but Claude Sonnet 4.6 scores 60.4%. Are we really supposed to believe these models naturally got so much better so quickly at these tests? ARC team even found evidence of the benchmaxxing when testing Gemini.
ARC AGI 3 currently has 0.3% as the best performing model. Just watch in a few months how the new models will magically start scoring 100x better 😅
lumos675@reddit
In my opinion Gemma is really good, just saying. I really don't need to use cloud models anymore after this release
PigabungaDude@reddit
Why is going from 14 to 30 to 60 to 77 that weird? These companies cross pollinate and training starts months before we get the model.
AkiDenim@reddit
Idk, people think that benchmaxxing is so real. I'm sure companies that get BILLIONS invested actually benchmax their models in the middle of training or something.
Hulksulk666@reddit
ARC AGI 3 is more robust than 2 for LLMs. I mean, it's something that's possible to game with some RL/search modeling, but it's way outside of an LLM's comfort zone. It would still very much be an indication of some progress if an LLM did well on 3
nomorebuttsplz@reddit
I think it is mostly because it's not a benchmark about getting correct answers or solving puzzles, but HOW the puzzles are solved. I suspect that AGI will largely be seen as having arrived before ARC AGI 3 is saturated.
drallcom3@reddit
The fight over dwindling investment money has begun.
JazzlikeLeave5530@reddit
You're saying Google is fighting for investment money?
drallcom3@reddit
They're caught in the fight.
ResidentPositive4122@reddit
- Prevents gaming: If the exact demand formula is known, models (or their trainers) could be optimized against specific coefficients rather than demonstrating genuine business reasoning
- Protects longevity: The benchmark remains useful as long as the internals are unknown — similar to how standardized tests don't publish answer keys in advance
- Industry standard: LMSYS Chatbot Arena, Anthropic's internal evals, and many established benchmarks use this model — public results, private methodology details
- What is published: all results, all metrics, scoring formulas, demand factor names, agent architecture, tool list, and this methodology document.
- What is not published: source code, system prompt text, exact coefficients, and internal simulation parameters.
-p-e-w-@reddit
I very strongly doubt that Google “benchmaxxed” for this obscure but extremely complicated benchmark. That makes absolutely zero sense.
IrisColt@reddit
This.
DrBearJ3w@reddit
It's even better than Gemini Pro. Lol.
Available-Poet-2111@reddit
Are you using Quantized or BF16?
DrBearJ3w@reddit
GPU poor. Only GGUF.
Enthu-Cutlet-1337@reddit
Long-horizon wins usually collapse on KV cache limits and tool drift; 31B just fits the loop better than 397B.
SlopTopZ@reddit
The FoodTruck bench is a really interesting real-world eval — trading simulation tests long-horizon planning in a way that standard coding/math benchmarks simply don't capture. Gemma 4 31B placing above Claude Sonnet variants is impressive, especially given the size. The fact that it actually listens to its own advice day-to-day during the run suggests strong instruction following and self-consistency. Curious whether the 26B A4B MoE would perform similarly given the near-identical quality people are reporting locally.
scknkkrer@reddit
What kind of f'd up name is FoodTruck for a benchmark?
bapuc@reddit
FoodTruck? What benchmark is this lol
Is it about the llms being able to own a profitable foodtruck or what
DinoAmino@reddit
https://www.reddit.com/r/LocalLLaMA/s/Q2icjYsEqU
bapuc@reddit
Wtf 🥲
Webfarer@reddit
If you want to deploy food trucks at scale you now know which models to use
m3kw@reddit
Looks like there is always some bespoke benchmark that LLMs can beat
SpicyWangz@reddit
Yeah
JohnMason6504@reddit
The fact that a 31B dense model is competing with GPT-5.2 and Claude Opus on a real-world planning benchmark is wild. Especially considering you can run it locally on a single 24GB GPU at Q4. The cost-per-token delta between a 31B local model and frontier API calls makes this a no-brainer for any production agentic pipeline where you control the hardware.
jayhotzzzz@reddit
how tf did it beat Gemini 3 Pro
venpuravi@reddit
Can it run on a 3060 12GB?
florinandrei@reddit
no
Sabin_Stargem@reddit
I am running an ARA Gemma-4 31b, translating the text in a JSON. So far, it isn't following my instructions in the thinking process: hook brackets are being turned into quotation marks. Qwen 122b and 397b manage to correctly handle this some of the time.
Hopefully, Qwen 3.6 will be able to retain such details with reliability. For now, though, Gemma 4 is slow and not up to the job.
Gemma 4 is a bit better than the bigger models when it comes to the translation of actual dialogue. Considering the NSFW nature of the translation, I won't Reddit the details - but the language is a bit more natural than Qwen's wording.
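A trivial post-hoc check for that specific failure could look like this (a sketch; the strings are made up, and real validation would also need to handle pairing and nesting):

```python
# Verify a translation keeps the source's hook brackets instead of
# silently swapping in quotation marks. Purely illustrative.
HOOKS = ("「", "」", "『", "』")

def brackets_preserved(source: str, translation: str) -> bool:
    return all(source.count(h) == translation.count(h) for h in HOOKS)

src = "「こんにちは」と彼女は言った。"
good = "「Hello,」 she said."
bad = '"Hello," she said.'
print(brackets_preserved(src, good), brackets_preserved(src, bad))  # True False
```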
florinandrei@reddit
Gemma 3 was always my favorite conversationalist among models in its class. Probably 4 is similar.
Traditional-Gap-3313@reddit
This one may not be benchmaxxing. I've written about my benchmark here: https://www.reddit.com/r/LocalLLaMA/comments/1sbjmpm/gemma431b_vs_qwen3527b_dense_model_smackdown/
I've since run the 31B on all 1500+ queries, the full benchmark. The GT is created by majority vote between Opus 4.6, GPT 5.4 and Gemini 2.5 Pro.
Gemma 4 31B scores closer to the GT labels than the inter-annotator agreement.
You can't say this one was benchmaxxed, as there are no benchmarks on Croatian legal texts and mine isn't published yet.
It really does seem like an incredible model...
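A rough sketch of the majority-vote GT construction described above (the queries and labels are illustrative, not from the actual benchmark):

```python
from collections import Counter

# Each query is labeled by three frontier models; the majority label
# becomes ground truth. Three-way disagreements are left unlabeled.
def majority_gt(labels_per_query: dict[str, list[str]]) -> dict[str, str]:
    gt = {}
    for query, labels in labels_per_query.items():
        label, count = Counter(labels).most_common(1)[0]
        if count >= 2:  # at least 2 of the 3 annotators agree
            gt[query] = label
    return gt

annotations = {
    "q1": ["valid", "valid", "invalid"],    # majority -> "valid"
    "q2": ["valid", "invalid", "unclear"],  # no majority -> dropped
}
print(majority_gt(annotations))  # {'q1': 'valid'}
```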
florinandrei@reddit
Probably the only way for a benchmark to stay relevant.
Technical-Earth-3254@reddit
Sus as hell, I would assume that ur benchmark is now in the training data
Zc5Gwu@reddit
Why would Google care about a no-name’s (no offense to OP) benchmark?
asraniel@reddit
they might not, but it might just end up in the dataset through web scraping
m0j0m0j@reddit
You think they run the model through every webscraped online game?
seamonn@reddit
yes?
m0j0m0j@reddit
So when they download the pirated version of RDR2, they make Claude drive horses?
protestor@reddit
This would be neat actually, apart from the compute cost. Have the models watch movies, ride a bike, invite your wife for dinner, etc
YungCactus43@reddit
i'm assuming FoodTruck bench is just a bunch of prompts, it's prime LLM training material. Plus reddit is the most scraped website for LLMs, so it's very conceivable foodtruck bench might've been in the training data.
TOO_MUCH_BRAVERY@reddit
They probably webscrape forums where people discuss optimizing strategies for it?
murkomarko@reddit
online benchmark*
yes
Nindaleth@reddit (OP)
That's not my benchmark :) It just looks fun so I return to it occasionally.
protestor@reddit
How is GPT-5.2 on top, while GPT-5.3 and GPT-5.4 are nowhere to be found?
m3kw@reddit
wtf sht is food truck bench, gemma4 is good but saying it beats frontier models and then name-dropping GLM and Qwen is funny af
DigThatData@reddit
FoodTruck Bench?
inaem@reddit
Benchmaxing to AGI
We will literally cover every single use case with benchmarks at this rate and benchmaxing won’t matter
Warm-Attempt7773@reddit
This is my experience
bambamlol@reddit
Oh no not the FoodTruck bench.
bapuc@reddit
No no no
PattF@reddit
This would be great but it gets 3-5 t/s when the 26b gets 50 on my M4 Pro Mac (24GB). That's with about 1000 context length, while the 26b can do 128,000. Something is very wrong with it
petuman@reddit
31B dense vs 4B active parameters, so being ~8 times slower (31/4 ≈ 7.75) is expected.
KoloiYolo@reddit
Nah, you just don't have enough RAM
kweglinski@reddit
makes me wonder - is 31b as stubborn as the 27 moe? I have to explicitly tell it to browse the web and then to crawl pages, because it constantly tries to rely on its insufficient knowledge. It seems to avoid tool calls at all costs in a chat env (haven't had time to test coding yet). Even on a very specific question about a specific device where it had the model etc., it sticks to "usually in devices like this". Tried temps from 0.1 to 1 (0.1 increments).
Shouldhaveknown2015@reddit
Tool calling appears to be different than Qwen3.5 and needs a different setup. I don't know code myself, I just vibe code a lot, and have Claude Opus code my custom apps.
Gemma 31b has been running for 2-3 hours doing tool calls with no issue on my custom agent app designed for my Obsidian vault. It took a little work to get the tool calling right and get it into agent mode, but since then it has been going non-stop with no failed tool calls.
"get_audit_progress frontmatter: 44/557 | links: 0/557 | template: 0/557 | organization: 359/557 | content_quality: 0/557"
Don't know the results yet, but we shall see!
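For reference, a minimal tool-calling setup against a local OpenAI-compatible server (e.g. llama.cpp or vLLM) might look like the sketch below; the endpoint, model name, and `get_audit_progress` schema are assumptions based on the comment above, not the actual app:

```python
from openai import OpenAI

# Point the OpenAI client at a local server; each runtime expects the tool
# schema and chat template to match the model, which is the "different
# setup" mentioned above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

tools = [{
    "type": "function",
    "function": {
        "name": "get_audit_progress",  # hypothetical vault-audit tool
        "description": "Report per-category audit progress over vault notes.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

resp = client.chat.completions.create(
    model="gemma-4-31b",  # whatever name the local server registers
    messages=[{"role": "user", "content": "Audit my vault frontmatter."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```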
dmigowski@reddit
I guess the only way to validate it is to create your own benchmarks for LLMs.
toothpastespiders@reddit
And they should. Most people would benefit from just putting together a small benchmark from their own real-world needs.
Due-Memory-6957@reddit
So this is the result of stealing the Qwen staff? I kneel.
Hyphonical@reddit
It's not the cheapest 30B model though... Not on cloud inference.
PhotographerUSA@reddit
What is the net worth based upon?
jeffwadsworth@reddit
Testing it locally, 8bit 31B. Amazing what it can do. I'm hoping for faster inference but I'm not complaining about its coding prowess.
Sem1r@reddit
Gemini 3.1 is also benchmaxed on a lot of niche benchmarks without it translating into real workloads. I think Google is heavily training on benchmarks, and even more so on niche ones
Emotional-Breath-838@reddit
you are going to see smug comments about how they cheated by training it on the models they beat....
and guess what?
i couldn't care less. all the data they used was ours. as a result, all i want is the best possible model for free. because it was our data they used without ever asking us.
Clairvoidance@reddit
the consequence being memorizing answers at the cost of understanding tasks, when the bench was made precisely to try to measure understanding of tasks
Deep90@reddit
Doesn't this bench have random scenarios and such? Or is every day the same for every playthrough?
Clairvoidance@reddit
if i understand what the website is saying correctly, AI is always benchmarked on seed 42
Deep90@reddit
Ah I missed that.
I would be really curious what happens if you ran all these models on a different seed to check for overfitting.
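A hypothetical harness for that experiment (the `run_foodtruck_bench` call is a stand-in, since the benchmark's internals aren't public):

```python
import statistics

# Re-run a bench across fresh seeds and compare against the published
# seed-42 score. A large gap suggests overfitting to the fixed seed.
def run_foodtruck_bench(model: str, seed: int) -> float:
    raise NotImplementedError("stand-in for the private benchmark harness")

def seed_sensitivity(model: str, seeds=(7, 13, 99, 123, 2024)) -> dict:
    scores = [run_foodtruck_bench(model, s) for s in seeds]
    return {
        "seed42": run_foodtruck_bench(model, 42),
        "mean_other": statistics.mean(scores),
        "stdev_other": statistics.stdev(scores),
    }
```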
TopChard1274@reddit
The least relevant test.
Exciting_Garden2535@reddit
Perhaps it is not cheap, but to ensure consistent results, it is worth running these models a few times with different seeds. And do not disclose which ones. :)
Waarheid@reddit
I don't think it's that unexpected, 31B all active at once is a lot. How many active parameters might Sonnet even have, for example?
dylantestaccount@reddit
Much, much more. No official figures but I’ve seen estimates ranging between 500B-1T. Probably MoE.
Waarheid@reddit
Note that I said active parameters.
jacek2023@reddit
Waiting for the comments "Gemma bad, Chinese models good"
6969its_a_great_time@reddit
Benchmarks don’t mean shit gotta throw real workloads at it that solve a problem you’re dealing with
Nindaleth@reddit (OP)
Here's the link to the bench with all the details: https://foodtruckbench.com/ You could even play the benchmark yourself :)