DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper
Posted by Disastrous_Theme5906@reddit | LocalLLaMA | View on Reddit | 36 comments
Tested DeepSeek V4 Pro on FoodTruck Bench — our 30-day agentic benchmark where models run a food truck via 34 tools (locations, pricing, inventory, staff, weather, events) with persistent memory and daily reflection.
First Chinese model to land in the frontier tier on our benchmark. Tied with Grok 4.3 Latest on outcome, within 3% of GPT-5.2's median, #4 overall behind Opus 4.6, GPT-5.2, and Grok 4.3.
The timing is the interesting part. We tested GPT-5.2 in mid-February. DeepSeek V4 Pro matches its numbers ten weeks later. The China–US frontier gap on this benchmark used to feel like a year. Right now it's about ten weeks.
The pricing gap is even sharper. GPT-5.2 charges $1.75/M input and $14/M output. DeepSeek V4 Pro is at $0.435/M input and $0.87/M output, with discounted cache reads on top — ~17× cheaper for the same agentic workload. That's promo pricing today, but DeepSeek's track record is that promo becomes the floor.
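For the curious: the exact multiplier depends on the input/output token mix and how much input hits the cache. A rough sketch using the list prices above — the per-run token mix (2M in / 1M out), cache-hit rate, and cache price are all assumptions for illustration, not benchmark data:

```python
# Rough cost-ratio sketch. Prices ($/M tokens) are from the post;
# the token mix and cache numbers are made-up assumptions.
def run_cost(in_m, out_m, price_in, price_out, cached_frac=0.0, cache_price=None):
    """Cost of one run, given millions of input/output tokens."""
    if cache_price is None:
        cache_price = price_in
    input_cost = in_m * ((1 - cached_frac) * price_in + cached_frac * cache_price)
    return input_cost + out_m * price_out

gpt52 = run_cost(2.0, 1.0, 1.75, 14.0)   # no caching assumed for GPT-5.2
ds_v4 = run_cost(2.0, 1.0, 0.435, 0.87)  # DeepSeek, no cache hits
ds_v4_cached = run_cost(2.0, 1.0, 0.435, 0.87,
                        cached_frac=0.8, cache_price=0.0435)  # assumed cache price
print(f"no cache: {gpt52 / ds_v4:.1f}x, with cache: {gpt52 / ds_v4_cached:.1f}x")
# → no cache: 10.1x, with cache: 15.7x
```

Output-heavy traces push the ratio toward the 14/0.87 ≈ 16× output-price gap, and cache reads push it further, so the ~17× figure will depend on the actual traces.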
On cost-efficiency (net worth per dollar of API spend) DeepSeek V4 Pro is #2 overall on the leaderboard — behind only Gemma 4 31B, ahead of every premium-tier model.
Against Grok 4.3 Latest specifically, the medians are basically tied at the same price, but DeepSeek wins on consistency: zero loans, ~6× less food waste, 30% more meals served per day, and a 2.4× tighter outcome distribution. Grok matches DeepSeek's peak. DeepSeek matches its own peak every time.
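To make "tighter outcome distribution" concrete: two models can share a median while one has much wider spread across runs. A minimal sketch with made-up run outcomes (NOT the actual benchmark numbers), showing one way such a spread ratio can be computed:

```python
import statistics

# Hypothetical final-net-worth outcomes for 5 runs each (illustrative only).
grok_runs = [8_000, 14_000, 24_000, 30_000, 34_000]
deepseek_runs = [21_000, 23_000, 24_000, 25_000, 27_000]

# Same median, very different spread.
spread_ratio = statistics.pstdev(grok_runs) / statistics.pstdev(deepseek_runs)
print(statistics.median(grok_runs),      # → 24000
      statistics.median(deepseek_runs),  # → 24000
      round(spread_ratio, 1))            # → 4.9
```

Outcome-only leaderboards would score these two identically; a dispersion column is what separates "matches its peak every time" from "sometimes hits a peak".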
Opus 4.6's peak run is still higher than DeepSeek's. Gemma is still cheaper. Otherwise this is a real frontier-tier competitor at a Chinese price point.
Update — Xiaomi MiMo v2.5 Pro just finished its run set as well: 5/5 survived, +1,019% median ROI, $22,388 median net worth at $2.41/run. Lands at #6 on the leaderboard, between Gemma 4 31B and Sonnet 4.6. Slightly behind DeepSeek on outcome and consistency (wider variance — $9K worst run vs $29K best), but a real result for a Chinese model at this price point.
That's now two Chinese models in our top 6, both at sub-$3.5/run. When we started this benchmark in February, neither of these tiers existed outside US labs.
Congrats to the DeepSeek and Xiaomi MiMo teams.
Full write-up: https://foodtruckbench.com/blog/deepseek-v4-pro
Leaderboard: https://foodtruckbench.com
Eyelbee@reddit
Hi, I really like foodtruckbench. It would be great and useful if you could create a foodtruckbench v2 with higher simulation quality and more variables, to better align with the real world. It's a good start, but ideally you'd want to engineer some aspects yourself.
Disastrous_Theme5906@reddit (OP)
Oh thanks man. Yeah I have a huge backlog of stuff I could add in a v1.5 or v2. When I built the benchmark, only the top expensive models could even pass it, so making it harder didn't really make sense. I can't drop thousands of dollars on runs and tests since I fund the whole thing myself. But now with all these cheap Chinese models coming out, yeah, it makes sense. Also been thinking about adding the option for users to run their own personal simulations with whatever model they pick and maybe even their own prompt.
bitmoji@reddit
if you go the Aider route and make the benchmark easy to run and install, I'm pretty sure people will run benchmarks on their own dime, and then you can post the results back to your site.
Disastrous_Theme5906@reddit (OP)
If the bench goes public, any model can just be trained on it. In that case the usefulness of the benchmark drops to near zero within a few months. Maybe it'd be fine for older versions once new ones come out.
bitmoji@reddit
we should have it run a real food truck
bitmoji@reddit
now do Kimi 2.6 and GLM 5.1? and Mimo 2.5 Pro
Disastrous_Theme5906@reddit (OP)
Mimo 2.5 Pro is already tested and on the leaderboard, the other 2 I started but didn't finish. Kimi isn't doing great tbh.
bitmoji@reddit
I check your site every few days, I must have missed it. It's interesting and weird to me that Kimi 2.6 is not doing well. GLM 5.1 and Kimi 2.6 are mainstays of my decision making, scoring, and coding. I love this benchmark
Disastrous_Theme5906@reddit (OP)
Aight, aight, bro, you convinced me. Started running sims on Kimi 2.6. Had really little time for the benchmark last week tbh. Yeah for historical data the model is worth adding I think. Looking at it now, the dynamics are actually not bad. It probably won't go bankrupt, but it'll most likely land somewhere around Qwen 3.6 Plus.
Total_Activity_7550@reddit
Good for DeepSeek, but Claude Opus 4.6 doing 1.7x the profit of the next group of models (and that's not even Mythos) is a sign that they're leaving competitors behind...
-p-e-w-@reddit
As always, it depends on what you’re trying to do. I’m regularly shocked by how weak Opus is outside of coding and agentic tasks.
My standard benchmark is to ask LLMs to explain jokes (including jokes of my own creation that I know aren’t in their training data), and Opus doesn’t make the top 5 for that. Sonnet doesn’t even make the top 10.
someRandomGeek98@reddit
what's the best model at explaining your own jokes?
Orolol@reddit
Peter.
quickreactor@reddit
Probably a Dad
Ancient-Breakfast539@reddit
That's because opus was trained on tool use and orchestration. Doesn't mean it's smarter or better. Right now it's garbage anyways.
Disastrous_Theme5906@reddit (OP)
Yeah, agreed — Opus is in a league of its own right now. Worth noting xAI and Google's flagships are also lagging on this, not just the Chinese tier.
Aldarund@reddit
Where gpt 5.4/5.5?
Disastrous_Theme5906@reddit (OP)
Wanna test it but not ready to drop $300+ on a full benchmark run rn. Tried 5.3 and 5.4 before that and they kept going into infinite loops in their replies. Sometimes a single request hit like $1 in the API. And a benchmark run is 400-600 requests. Both 5.3 and 5.4 had this issue where instead of giving a short answer they'd loop forever and just spit out 60k tokens in a row, multiple times. So I paused testing new GPT models for now.
Aldarund@reddit
Interesting. I wonder if 5.5 changed in that regard. In my own use I see that 5.5 outputs way fewer thinking tokens
Beginning-Window-115@reddit
are you gonna test qwen3.6?
Disastrous_Theme5906@reddit (OP)
Qwen 3.6 Plus is in 10th place now
grumd@reddit
Tried using Deepseek v4 pro for a coding task I had yesterday, it just kept overthinking forever and couldn't do anything productive. Even my local Qwen3.6 35B did better.
SmartCustard9944@reddit
Why is DeepSeek flash performing so badly?
ProfessionalJackals@reddit
That's the discounted price, which is going to end soon.
Why not use MiMo's subscription-service prices if you're using DeepSeek V4's discounted prices?
With a subscription, MiMo is $0.10/million (for the cheapest tier), with Pro using 2× the credits ($0.20/million). As you scale up to higher tiers, you get 15 to 20% more credits (tokens), or 10 to 15% lower prices (yearly sub), which can be combined (plus the 20% token discount during evening hours).
https://platform.xiaomimimo.com/docs/en-US/tokenplan/subscription
So just saying: if you're looking at API costs, you need to compare the non-discounted API prices for all models, or apply all the beneficial rates for all.
FullOf_Bad_Ideas@reddit
What's up with Gemma?
It does crazy well on EQBench too, but I'm not hearing much about it (nor have I tried it myself tbh).
It's hard to appreciate Xiaomi or DeepSeek, or even this benchmark, when Gemma 31B beats Sonnet 4.6.
Disastrous_Theme5906@reddit (OP)
Yeah, totally fair — Gemma 4 31B genuinely surprised us too, we re-ran and re-checked because it looks too good. But the result is real: 5/5 runs, +1,144% median ROI at $0.20/run. Full breakdown of why and caveats: https://www.reddit.com/r/LocalLLaMA/comments/1sdcotc/gemma_4_just_casually_destroyed_every_model_on/
TechnoByte_@reddit
Least obvious LLM generated comment
Spectrum1523@reddit
All of their posts are obviously generated. Lazy too. From the thread they linked in this comment:
Future_Manager3217@reddit
The cost delta is the interesting part, but for an agentic benchmark I’d want one more column before calling two runs equivalent: effort/review budget.
Same final net worth can hide very different tool calls, retries, invalid actions, context reads, or manual cleanup. If DeepSeek is ~17x cheaper and similar on those traces too, that's a much stronger result than outcome-only ranking.
maxpayne07@reddit
And the hallucination rate is......
amunozo1@reddit
I'm more surprised about Gemma's position there.
havnar-@reddit
It’s only discounted for another month though.
Disastrous_Theme5906@reddit (OP)
Fair — though DeepSeek has form for extending these promos or reinstating them after a short break. And even at full list price it's still substantially cheaper per result than GPT-5.2 or Opus on this benchmark.
rhythmdev@reddit
I don't see 27B.
DeltaSqueezer@reddit
Qwen 3.6 Plus lands at position 10, so I guess below that.
Mushoz@reddit
Did you also test DeepSeek V4 flash?