Local Minimax M2.7, GTA benchmark
Posted by -dysangel-@reddit | LocalLLaMA | View on Reddit | 88 comments
Minimax M2.7, asking it to make a 3D GTA-like experience. GLM 5.1 still wins on aesthetics and adding detail without being asked, but when I asked Minimax to add trees and birds (with boids algo), it did a decent job!
This was not even in an agentic scaffold, I usually just do initial testing like this in the openwebui artifacts window, but Minimax has also been kicking ass for me in OpenCode. I'm running it at IQ2_XXS for max speed, and it still is coherent and capable.
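For anyone curious, the "boids algo" the birds use boils down to three steering rules: cohesion, separation, and alignment. A minimal illustrative 2D sketch of one update step (my own simplified version, not the code the model generated):

```javascript
// Minimal boids update step. Each boid steers toward its neighbours'
// centre of mass (cohesion), away from crowding (separation), and
// toward the neighbours' average velocity (alignment).
function stepBoids(boids, { radius = 10, cohesion = 0.01, separation = 0.05, alignment = 0.05 } = {}) {
  return boids.map(b => {
    const neighbours = boids.filter(o =>
      o !== b && Math.hypot(o.x - b.x, o.y - b.y) < radius);
    let vx = b.vx, vy = b.vy;
    if (neighbours.length) {
      const n = neighbours.length;
      // Cohesion: steer toward the neighbours' centre of mass
      const cx = neighbours.reduce((s, o) => s + o.x, 0) / n;
      const cy = neighbours.reduce((s, o) => s + o.y, 0) / n;
      vx += (cx - b.x) * cohesion;
      vy += (cy - b.y) * cohesion;
      // Separation: steer away from each nearby boid
      for (const o of neighbours) {
        vx += (b.x - o.x) * separation;
        vy += (b.y - o.y) * separation;
      }
      // Alignment: match the neighbours' average velocity
      const avx = neighbours.reduce((s, o) => s + o.vx, 0) / n;
      const avy = neighbours.reduce((s, o) => s + o.vy, 0) / n;
      vx += (avx - b.vx) * alignment;
      vy += (avy - b.vy) * alignment;
    }
    return { x: b.x + vx, y: b.y + vy, vx, vy };
  });
}
```

Call it once per animation frame with your flock array and the birds do the rest.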
FatheredPuma81@reddit
Makes me feel sad that I can only just barely run Minimax M2.7 and nothing else on my PC...
Glazedoats@reddit
parkour game benchmark when :]
-dysangel-@reddit (OP)
that is such a perfect idea for a more detailed test! I used to do parkour for ~8 years so I can't believe this didn't even cross my mind
Glazedoats@reddit
I've never done parkour in my life lol. I have some background in animation so I thought: how can we really torture the AI lol. I don't have the VRAM or funds to test it out, so I'd be interested in what you guys do with that idea :]
-dysangel-@reddit (OP)
I think this absolutely has to be the new benchmark! When I just asked for a "parkour game", GLM did more of an endless runner, but when I asked for a city parkour game where the character has arms and legs, GLM did a pretty good job on the character - though the actual gameplay/physics needs way more refinement
https://i.redd.it/ifle94la12vg1.gif
Glazedoats@reddit
Look at 'em gooo! I love it!! 😂👏
-dysangel-@reddit (OP)
GLM 5.1. I could have asked it to fix the arms/legs.. but I didn't
https://i.redd.it/6tm428xqr7vg1.gif
Glazedoats@reddit
I'm reminded of those blocky characters of roblox. Very cool :)
-dysangel-@reddit (OP)
did you see this one before? https://www.reddit.com/r/LocalLLaMA/comments/1rqlaw4/new_benchmark_just_dropped/
Glazedoats@reddit
Yeah I did see this one actually. :]
LegacyRemaster@reddit
yes please. I need more. Prompt?
-dysangel-@reddit (OP)
I've added the prompt to the OP
LegacyRemaster@reddit
amazing thx
EndlessZone123@reddit
Idk, I never liked using GLM for anything 2D or 3D because it's not a vision model. It's just one-shotting things from memory and can't do much afterwards or pick up where it left off.
shittyfellow@reddit
A model supporting vision has no bearing on its ability to program 2D or 3D games. The vision part is handled outside the LLM. See: https://www.nvidia.com/en-us/glossary/vision-language-models/
-dysangel-@reddit (OP)
That makes sense, but in practice GLM actually has really great 3D ability despite not being a vision model. I've been testing it on 3D since GLM 4.5, and fully using it in my 3D data vis day job since GLM 4.7.
unjustifiably_angry@reddit
I'll never understand why people do these sorts of prompts. There's so little input given, the output is going to be so complex and unpredictable that it could give completely different results from one run to the next.
-dysangel-@reddit (OP)
That's kind of the whole point? It's very informative for getting to know a model's inbuilt capabilities and aesthetic choices, which I find interesting/useful.
If you just want a model for doing business logic then yeah this isn't the methodology for you.
unjustifiably_angry@reddit
This is a sloptuber test.
The output is going to be random and unpredictable and serves no practical function so there is no objective way to measure it. One model outputs a version with nicer-looking cars, another outputs a version with working hit detection. You hit regenerate and now the first one has working hit detection and the second one has nicer-looking cars. And all the while this isn't even the sort of task AI is trained for, so it indicates absolutely nothing about the model's capabilities.
You're asking two electricians to race a car to see which one is red. What the fuck?
-dysangel-@reddit (OP)
Oh, I just saw your username. Maybe you just enjoy trolling. Moving on..
-dysangel-@reddit (OP)
what is your point?
punishedsnake_@reddit
why not have an MGS VR Missions benchmark for example
-dysangel-@reddit (OP)
That's another great idea - fun but relatively easy to generate puzzles. I'm seeing what GLM-5.1 can manage for a "parkour benchmark" atm. This was the first attempt. It's currently 15k thinking tokens deep into the second attempt..
https://i.redd.it/lr9s672sd2vg1.gif
Ok_Warning2146@reddit
How come lmarena shows 2.7 performing almost identically to 2.5? Is there anything new in 2.7?
-dysangel-@reddit (OP)
It's essentially continued post-training on 2.5, focusing on agentic improvements - especially multi-agent, apparently. Same architecture, no new tech.
Aggressive-Permit317@reddit
This is wild!!! A local model straight up generating a playable GTA-style scene in the browser? The aesthetics and detail it's adding on the fly are next level for M2.7. I'm spinning it up today. Anyone else tested it yet, or are we all still waiting on proper GGUF quants?
Monad_Maya@reddit
Here's Qwen 27B, even with a rather detailed prompt. The controls don't really work and you can space travel with the Space key. OK for a single-shot attempt I guess.
https://i.redd.it/btu724wse0vg1.gif
NoFudge4700@reddit
The headlight beams look so cool
4400120@reddit
Create alternatives of those games advertised on YouTube that never look anything like what is shown.
100lyan@reddit
Amazingly, Gemma 4 31B is also able to create almost the same mini GTA - with virtually the same prompt! Had to do a couple of rounds fixing some bugs here and there, but in the end I got a very similar experience.
-dysangel-@reddit (OP)
yep Qwen 27B managed too, though not quite as much detail with wheels/lights etc, and also those smaller models struggle even more with figuring out directions in my experience
100lyan@reddit
It's still mind-blowing what is now possible with local LLMs. You are correct though that MiniMax somehow achieves a more detailed environment - but that's hardly surprising given its size. Anyway - a great experiment!
FullstackSensei@reddit
As a software engineer, I'd prefer the LLM not to make things I didn't ask for. It might be nice that it adds trees on its own, but if you're writing anything serious, that easily leads to unexpected or undesired behavior.
tillybowman@reddit
it's a simple single shot. as a software dev that uses ai daily, it's all about guardrails. put your context into place before you start and have your feedback loop set up, and you will get pretty much exactly what you asked for.
FullstackSensei@reddit
I know. I always have 40-60k of background info. Thing is, even with that, some models still ignore it sometimes. The situation is very much exacerbated when the KV cache is quantized or the model is heavily quantized.
SnooPaintings8639@reddit
My thoughts exactly. In the case of Claude Opus, the biggest issue at the moment is how "trigger happy" it gets, doing all the "extra" stuff. A model doing less is often a good thing, i.e. good instruction following.
-dysangel-@reddit (OP)
Sure, that makes sense. When I'm doing these initial game tests of model capability, I like to be purposefully vague to get a sense of their aesthetic choices and how proactive they are. So I'll ask for "beautiful tetris", "gta-like experience", "relaxing flight simulator". Then I'll start asking for more specific things to see how well they handle feedback, bugfixing, and just generally how it feels to work with them.
lorenzo_9696@reddit
Which hardware are you using to run it?
-dysangel-@reddit (OP)
M3 Ultra
RevolutionaryDrop481@reddit
This looks pretty similar to early Roblox
-dysangel-@reddit (OP)
GLM 5 for comparison - more detail on the main character without having to ask for it
https://i.redd.it/gq2fh1y0sxug1.gif
Jackalzaq@reddit
Yeah glm is too good lol. I haven't been this impressed with a model that I could run locally in a while.
ForsookComparison@reddit
Holy crap
-dysangel-@reddit (OP)
And a little Frozen themed game I had it generate for my daughter. Though I had to ask for an "Ice Princess" game to get around GLM's copyright worries :p
https://i.redd.it/6z4dw1qj7zug1.gif
-dysangel-@reddit (OP)
https://i.redd.it/zv14ok027zug1.gif
-dysangel-@reddit (OP)
Right? It's also the first model I've run that managed a decent-ish flight sim without too much feedback/tweaks
https://i.redd.it/47auoegk6zug1.gif
Ok_Technology_5962@reddit
GLM 5.1 test?
ambient_temp_xeno@reddit
The car clips through the buildings, though. In your Minimax one it doesn't clip through the other cars.
-dysangel-@reddit (OP)
On minimax, the car still clips through the buildings too, but it does bounce off of other cars. I should really create a more structured prompt to test models' attention to detail and aesthetics, now that pretty much all models are able to handle creating basic 3D worlds.
newcarrots69@reddit
If you added a play tester, couldn't these LLMs continually iterate development?
-dysangel-@reddit (OP)
Yes, that's a great direction!
I've so far just been looking for a viable local text model that's smart + fast enough. It looks like M2.7 is "the one", and the next frontier is to add playtesting/feedback models like you're suggesting.
newcarrots69@reddit
I also meant to ask you if you were keeping a record of these results somewhere.
-dysangel-@reddit (OP)
I do have an llm assessments folder dating back to GPT 3.5 days, though these days I usually test directly in openwebui and just pin any cool results.
It's for the most part just "build tetris", "now make it self-playing". However, since even Qwen 3.5 4B can handle that now, I'm having to up the stakes. We're getting to the stage where intelligence is "good enough" in most new models. The differentiating factors for me now are things like aesthetics and architectural ability. GLM 5.1 is still the GOAT for me - but M2.7 is 1/3 of the size and it's looking "good enough" so far.
jacek2023@reddit
"This was not even in an agentic scaffold" so what was your workflow? How do you work with multiple files?
-dysangel-@reddit (OP)
When I download a fresh model I initially just do some vibe checks in openwebui to see how well they can code and iterate on their code. Then I either delete the model if it doesn't pass the vibe check, or I test it out further. This is definitely a good 'un, so I've also been using it in opencode.
Both_Opportunity5327@reddit
Very Nice
Was this done with the IQ2_XXS version in one shot, or did you use Opencode?
Can you post the settings and what you used to serve the model please.
-dysangel-@reddit (OP)
It wasn't one shot, but also not opencode. It was just me going back and forth in openwebui and using their artifacts window for testing
jacek2023@reddit
So this wasn’t in OpenCode, it was in OpenWebUI? That’s what I was asking above. I have some issues with OpenCode, mostly with prompt caching, across all models (qwen, gemma).
-dysangel-@reddit (OP)
Yeah this particular test was just in openwebui.
Yeah I noticed a caching issue in OpenCode too when I tried it - the model had to re-process the whole context on /compact. It would be way more efficient to either just send a message asking for summarisation at the end of the existing context, or have an option to use a small, fast utility model to handle compaction.
jacek2023@reddit
You can often see problems after switching between plan and build, or after a long opencode session
-dysangel-@reddit (OP)
I made an mlx server last summer that caches all system prompts for a model to redis - so when changing mode, it's already warmed up. Sounds like I need to fish that one out and try it with OpenCode.
jacek2023@reddit
But I believe it's context + new prompt, so it can't be cached
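Roughly, a prompt (KV) cache can only be reused for tokens that match from the very start - so appending to an existing context reuses everything, while rewriting it (as /compact effectively does) reuses almost nothing. A tiny illustrative sketch of that rule (hypothetical, not OpenCode's actual implementation):

```javascript
// Prefix caching in a nutshell: the server can skip re-processing only
// the longest shared prefix between the cached context and the new one.
function reusablePrefix(cachedTokens, newTokens) {
  let i = 0;
  while (i < cachedTokens.length && i < newTokens.length &&
         cachedTokens[i] === newTokens[i]) {
    i++;
  }
  return i; // number of tokens that can skip re-processing
}
```

Appending a summarisation request keeps the old context as a prefix (full reuse); replacing the context with a summary invalidates the cache from token one.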
-dysangel-@reddit (OP)
Makes sense. If I come up with something better, I'll submit a PR (or start my own fork)
jacek2023@reddit
I have plans to debug this whole process but I am still procrastinating that
Both_Opportunity5327@reddit
ok thanks, I will experiment with it.
jacek2023@reddit
what's your speed for IQ2?
-dysangel-@reddit (OP)
~1k session:
~90k session:
__JockY__@reddit
The fact that it runs at all with an IQ2_XXS quant is quite extraordinary!
-dysangel-@reddit (OP)
I've actually found a few larger models that work well at IQ2_XXS UD. The first one I found that worked consistently was Deepseek R1-0528.
Deepseek V3-024 was basically the same model as R1-0528, but an instruct rather than a thinking model - however it lost a lot of ability at Q2. So it's not purely about size, but larger models definitely seem to handle this level of compression better than smaller ones. As in, GLM 5.1 at Q2 still gives better results than any other model I can run.
Some other good IQ2_XXS UDs I've found in the last year:
glm-4.6-reap-268b-a32b
glm5
glm 5.1
minimax m2.5
Medium_Chemist_4032@reddit
What are your run params? By the way, posts such as these are worth their weight in gold :D Thanks!
I just tried that exact prompt with:
And it was worse. I had to add repeat penalty, because it looped like crazy otherwise
-dysangel-@reddit (OP)
Hey you're welcome. I just read another post that at least one of the unsloth Q4s was broken - you might have to redownload? :/
Medium_Chemist_4032@reddit
Downloaded an hour ago. Retried a few times with IQ2_XXS and it was much better when it comes to looping. Out of 5 gens though, I couldn't come close to your result
-dysangel-@reddit (OP)
Note that the controls were backward, the car wouldn't drive, lights were in the wrong place on the cars etc on the first prompt, so I had to go through a few iterations of feedback before everything was fully correct!
Ylsid@reddit
I'm impressed it managed to get the environmental details right. That's usually super difficult for LLMs
-dysangel-@reddit (OP)
Yeah it did a great job of the building positioning in between roads, and trees lining the road!
Of course there were still a couple of trees actually on the road, but still very impressive for a Q2, 65GB model
Itchy-Individual3536@reddit
How did you get your hands on that early copy of GTA 6? /s
BigYoSpeck@reddit
You joke, but stick some DLSS5 on this and you've got yourself a ~~slop stew~~ AAA game going on
-dysangel-@reddit (OP)
don't tell anyone but this is actually 7
AdultContemporaneous@reddit
What in the Corncob 3D is this?
ambient_temp_xeno@reddit
That's insane! I'm getting the Q8 (but I can't really even vibecode).
-dysangel-@reddit (OP)
It helps a lot if you already know how to make games, but the models themselves are getting pretty good, so they can fix a lot of problems just from you describing what you're seeing. To get quickly up and running without knowing a lot about coding, you can just ask the model to make the game in HTML and JavaScript with three.js (3D engine) and ammo.js (physics engine). If you start getting more serious, you could set up TypeScript projects, or use game frameworks like Godot etc.
ambient_temp_xeno@reddit
I had a lot more luck with the single html route compared to trying to get pygame to do anything good. I had to handhold opus 4.6 so I get what you mean about having to iterate on it by telling it what's not working.
-dysangel-@reddit (OP)
Yeah I've had a similar experience where models can do a better job in a single shot than when trying to iterate in a scaffold.
I find that GLM 5.1 needs way less handholding than Claude - it's more confident and proactive with architectural decisions. It did a great job recently refactoring my game engine (which was originally created by Claude ~3.7).
CoolstaConnor@reddit
Prompt?
-dysangel-@reddit (OP)
Prompt 1:
Prompt 2:
The rest of the prompts were mostly asking it to reverse directions on the controls. That's a common failure on all models, since they don't really have any experience with real world spaces in their training.
There was also a bug where it was checking the car for collision against itself, which was why it wasn't moving initially. The results are getting pretty impressive though for just drafting everything up "in their head" with no playtesting!
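The self-collision bug is a classic one. A sketch of what the fix looks like (illustrative names, not the generated game's actual code) - a car is always at distance 0 from itself, so without a self-check every car reports a permanent collision and refuses to move:

```javascript
// Treat each car as a bounding circle on the ground plane and return
// every *other* car it overlaps. The `other !== car` guard is the fix:
// without it, the car "collides" with itself on every frame.
function findCollisions(car, cars, radius = 2) {
  return cars.filter(other =>
    other !== car &&
    Math.hypot(other.x - car.x, other.z - car.z) < radius * 2);
}
```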
CoolstaConnor@reddit
Thanks!
averagebear_003@reddit
the birds were a nice touch tbf