Local Minimax M2.7, GTA benchmark
Posted by -dysangel-@reddit | LocalLLaMA | View on Reddit | 88 comments
Minimax M2.7, asking it to make a 3D GTA-like experience. GLM 5.1 still wins on aesthetics and adding detail without being asked, but when I asked Minimax to add trees and birds (with boids algo), it did a decent job!
This was not even in an agentic scaffold, I usually just do initial testing like this in the openwebui artifacts window, but Minimax has also been kicking ass for me in OpenCode. I'm running it at IQ2_XXS for max speed, and it still is coherent and capable.
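For anyone curious, the "boids algo" the birds use boils down to three steering rules: cohesion, separation, and alignment. A minimal illustrative 2D sketch of one update step (my own simplified version, not the code the model generated):

```javascript
// Minimal boids update step. Each boid steers toward its neighbours'
// centre of mass (cohesion), away from crowding (separation), and
// toward the neighbours' average velocity (alignment).
function stepBoids(boids, { radius = 10, cohesion = 0.01, separation = 0.05, alignment = 0.05 } = {}) {
  return boids.map(b => {
    const neighbours = boids.filter(o =>
      o !== b && Math.hypot(o.x - b.x, o.y - b.y) < radius);
    let vx = b.vx, vy = b.vy;
    if (neighbours.length) {
      const n = neighbours.length;
      // Cohesion: steer toward the neighbours' centre of mass
      const cx = neighbours.reduce((s, o) => s + o.x, 0) / n;
      const cy = neighbours.reduce((s, o) => s + o.y, 0) / n;
      vx += (cx - b.x) * cohesion;
      vy += (cy - b.y) * cohesion;
      // Separation: steer away from each nearby boid
      for (const o of neighbours) {
        vx += (b.x - o.x) * separation;
        vy += (b.y - o.y) * separation;
      }
      // Alignment: match the neighbours' average velocity
      const avx = neighbours.reduce((s, o) => s + o.vx, 0) / n;
      const avy = neighbours.reduce((s, o) => s + o.vy, 0) / n;
      vx += (avx - b.vx) * alignment;
      vy += (avy - b.vy) * alignment;
    }
    return { x: b.x + vx, y: b.y + vy, vx, vy };
  });
}
```

Call it once per animation frame with your flock array and the birds do the rest.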
FatheredPuma81@reddit
Makes me feel sad that I can only just barely run Minimax M2.7 and nothing else on my PC...
Glazedoats@reddit
parkour game benchmark when :]
-dysangel-@reddit (OP)
that is such a perfect idea for a more detailed test! I used to do parkour for ~8 years so I can't believe this didn't even cross my mind
Glazedoats@reddit
I've never done parkour in my life lol. I have some background in animation so I thought: how can we really torture the AI lol. I don't have the VRAM or funds to test it out, so I'd be interested in what you guys do with that idea :]
-dysangel-@reddit (OP)
I think this absolutely has to be the new benchmark! When I just asked for a "parkour game", GLM did more of an endless runner, but when I asked for a city parkour game where the character has arms and legs, GLM did a pretty good job on the character - though the actual gameplay/physics needs way more refinement
https://i.redd.it/ifle94la12vg1.gif
Glazedoats@reddit
Look at 'em gooo! I love it!! 😂👏
-dysangel-@reddit (OP)
GLM 5.1. I could have asked it to fix the arms/legs.. but I didn't
https://i.redd.it/6tm428xqr7vg1.gif
Glazedoats@reddit
I'm reminded of those blocky characters of roblox. Very cool :)
-dysangel-@reddit (OP)
did you see this one before? https://www.reddit.com/r/LocalLLaMA/comments/1rqlaw4/new_benchmark_just_dropped/
Glazedoats@reddit
Yeah I did see this one actually. :]
LegacyRemaster@reddit
yes please. I need more. Prompt?
-dysangel-@reddit (OP)
I've added the prompt to the OP
LegacyRemaster@reddit
amazing thx
EndlessZone123@reddit
Idk, I never liked using GLM for anything 2D or 3D because it's not a vision model. It's just one-shotting things from memory and can't do much afterwards or pick up where it left off.
shittyfellow@reddit
A model supporting vision has no bearing on its ability to program 2D or 3D games. The vision part is handled outside the LLM. See: https://www.nvidia.com/en-us/glossary/vision-language-models/
-dysangel-@reddit (OP)
That makes sense, but in practice GLM actually has really great 3D ability despite not being a vision model. I've been testing it on 3D since GLM 4.5, and fully using it in my 3D data vis day job since GLM 4.7.
unjustifiably_angry@reddit
I'll never understand why people do these sorts of prompts. There's so little input given, the output is going to be so complex and unpredictable that it could give completely different results from one run to the next.
-dysangel-@reddit (OP)
That's kind of the whole point? It's very informative for getting to know a model's inbuilt capabilities and aesthetic choices, which I find interesting/useful.
If you just want a model for doing business logic then yeah this isn't the methodology for you.
unjustifiably_angry@reddit
This is a sloptuber test.
The output is going to be random and unpredictable and serves no practical function so there is no objective way to measure it. One model outputs a version with nicer-looking cars, another outputs a version with working hit detection. You hit regenerate and now the first one has working hit detection and the second one has nicer-looking cars. And all the while this isn't even the sort of task AI is trained for, so it indicates absolutely nothing about the model's capabilities.
You're asking two electricians to race a car to see which one is red. What the fuck?
-dysangel-@reddit (OP)
Oh, I just saw your username. Maybe you just enjoy trolling. Moving on..
-dysangel-@reddit (OP)
what is your point?
punishedsnake_@reddit
why not have an MGS VR Missions benchmark for example
-dysangel-@reddit (OP)
That's another great idea - fun but relatively easy to generate puzzles. I'm seeing what GLM-5.1 can manage for a "parkour benchmark" atm. This was the first attempt. It's currently 15k thinking tokens deep into the second attempt..
https://i.redd.it/lr9s672sd2vg1.gif
Ok_Warning2146@reddit
How come lmarena shows 2.7 performing almost identically to 2.5? Is there anything new in 2.7?
-dysangel-@reddit (OP)
It's essentially continued post-training on 2.5, focusing on agentic improvements - especially multi-agent, apparently. Same architecture, no new tech.
Aggressive-Permit317@reddit
This is wild!!! A local model straight up generating a playable GTA-style scene in the browser? The aesthetics and detail it's adding on the fly are next level for M2.7. I'm spinning it up today. Anyone else tested it yet, or are we all still waiting on proper GGUF quants?
Monad_Maya@reddit
Here's Qwen 27B, even with a rather detailed prompt. The controls don't really work and you can space travel with the Space key. OK for a single-shot attempt I guess.
https://i.redd.it/btu724wse0vg1.gif
NoFudge4700@reddit
The headlight beams look so cool
4400120@reddit
Create alternatives of those games advertised on YouTube that never look anything like what is shown.
100lyan@reddit
Amazingly, Gemma 4 31B is also able to create almost the same mini GTA - with virtually the same prompt! Had to do a couple of rounds fixing some bugs here and there, but in the end I got a very similar experience.
-dysangel-@reddit (OP)
yep Qwen 27B managed too, though not quite as much detail with wheels/lights etc, and also those smaller models struggle even more with figuring out directions in my experience
100lyan@reddit
It's still mind-blowing what is now possible with local LLMs. You are correct though that MiniMax somehow achieves a more detailed environment - but that's hardly surprising given its size. Anyway - a great experiment!
FullstackSensei@reddit
As a software engineer, I'd prefer the LLM not to make things I didn't ask for. It might be nice that it adds trees on its own, but if you're writing anything serious, that easily leads to unexpected or undesired behavior.
tillybowman@reddit
it's a simple single shot. as a software dev that uses ai daily, it's all about guardrails. put your context into place before you start and have your feedback loop set up, and you will get pretty much exactly what you asked for.
FullstackSensei@reddit
I know. I always have 40-60k of background info. Thing is, even with that, some models still ignore it sometimes. The situation is very much exacerbated when the KV cache is quantized or the model is heavily quantized.
SnooPaintings8639@reddit
My thoughts exactly. In the case of Claude Opus, the biggest issue at the moment is how "trigger happy" it gets, doing all the "extra" stuff. A model doing less is often a good thing, i.e. good instruction following.
-dysangel-@reddit (OP)
Sure, that makes sense. When I'm doing these initial game tests of model capability, I like to be purposefully vague to get a sense of their aesthetic choices and how proactive they are. So I'll ask for "beautiful tetris", "gta-like experience", "relaxing flight simulator". Then I'll start asking for more specific things to see how well they handle feedback, bugfixing, and just generally how it feels to work with them.
lorenzo_9696@reddit
Which hardware are you using to run it?
-dysangel-@reddit (OP)
M3 Ultra
RevolutionaryDrop481@reddit
This looks pretty similar to early Roblox
-dysangel-@reddit (OP)
GLM 5 for comparison - more detail on the main character without having to ask for it
https://i.redd.it/gq2fh1y0sxug1.gif
Jackalzaq@reddit
Yeah glm is too good lol. I haven't been this impressed with a model that I could run locally in a while.
ForsookComparison@reddit
Holy crap
-dysangel-@reddit (OP)
And a little Frozen themed game I had it generate for my daughter. Though I had to ask for an "Ice Princess" game to get around GLM's copyright worries :p
https://i.redd.it/6z4dw1qj7zug1.gif
-dysangel-@reddit (OP)
https://i.redd.it/zv14ok027zug1.gif
-dysangel-@reddit (OP)
Right? It's also the first model I've run that managed a decent-ish flight sim without too much feedback/tweaks
https://i.redd.it/47auoegk6zug1.gif
Ok_Technology_5962@reddit
GLM 5.1 test?
ambient_temp_xeno@reddit
The car clips through the buildings, though. In your Minimax one it doesn't clip through the other cars.
-dysangel-@reddit (OP)
On minimax, the car still clips through the buildings too, but it does bounce off of other cars. I should really create a more structured prompt to test models' attention to detail and aesthetics, now that pretty much all models are able to handle creating basic 3D worlds.
newcarrots69@reddit
If you added a play tester, couldn't these LLMs continually iterate development?
-dysangel-@reddit (OP)
Yes, that's a great direction!
I've so far just been looking for a viable local text model that's smart + fast enough. It looks like M2.7 is "the one", and the next frontier is to add playtesting/feedback models like you're suggesting.
newcarrots69@reddit
I also meant to ask you if you were keeping a record of these results somewhere.
-dysangel-@reddit (OP)
I do have an llm assessments folder dating back to GPT 3.5 days, though these days I usually test directly in openwebui and just pin any cool results.
It's for the most part just "build tetris", "now make it self-playing". However, since even Qwen 3.5 4B can handle that now, I'm having to up the stakes. We're getting to the stage where intelligence is "good enough" in most new models. The differentiating factors for me now are things like aesthetics and architectural ability. GLM 5.1 is still the GOAT for me - but M2.7 is 1/3 of the size and it's looking "good enough" so far.
jacek2023@reddit
"This was not even in an agentic scaffold" so what was your workflow? How do you work with multiple files?
-dysangel-@reddit (OP)
When I download a fresh model I initially just do some vibe checks in openwebui to see how well they can code and iterate on their code. Then I either delete the model if it doesn't pass the vibe check, or I test it out further. This is definitely a good 'un, so I've also been using it in opencode.
Both_Opportunity5327@reddit
Very Nice
Was this done with the IQ2_XXS version in one shot, or did you use Opencode?
Can you post the settings and what you used to serve the model please.
-dysangel-@reddit (OP)
It wasn't one shot, but also not opencode. It was just me going back and forth in openwebui and using their artifacts window for testing
jacek2023@reddit
So this wasn’t in OpenCode, it was in OpenWebUI? That’s what I was asking above. I have some issues with OpenCode, mostly with prompt caching, across all models (qwen, gemma).
-dysangel-@reddit (OP)
Yeah this particular test was just in openwebui.
Yeah I noticed a caching issue in OpenCode too when I tried it - the model had to re-process the whole context on /compact. It would be way more efficient to either just send a message asking for summarisation at the end of the existing context, or have an option to use a small, fast utility model to handle compaction.
jacek2023@reddit
You can often see problems after switching between plan and build, or after a long opencode session
-dysangel-@reddit (OP)
I made an mlx server last summer that caches all system prompts for a model to redis - so when changing mode, it's already warmed up. Sounds like I need to fish that one out and try it with OpenCode.
jacek2023@reddit
But I believe it's context + new prompt, so it can't be cached
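Roughly, a prompt (KV) cache can only be reused for tokens that match from the very start - so appending to an existing context reuses everything, while rewriting it (as /compact effectively does) reuses almost nothing. A tiny illustrative sketch of that rule (hypothetical, not OpenCode's actual implementation):

```javascript
// Prefix caching in a nutshell: the server can skip re-processing only
// the longest shared prefix between the cached context and the new one.
function reusablePrefix(cachedTokens, newTokens) {
  let i = 0;
  while (i < cachedTokens.length && i < newTokens.length &&
         cachedTokens[i] === newTokens[i]) {
    i++;
  }
  return i; // number of tokens that can skip re-processing
}
```

Appending a summarisation request keeps the old context as a prefix (full reuse); replacing the context with a summary invalidates the cache from token one.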
-dysangel-@reddit (OP)
Makes sense. If I come up with something better, I'll submit a PR (or start my own fork)
jacek2023@reddit
I have plans to debug this whole process but I am still procrastinating that
Both_Opportunity5327@reddit
ok thanks, I will experiment with it.
jacek2023@reddit
what's your speed for IQ2?
-dysangel-@reddit (OP)
~1k session:
~90k session:
__JockY__@reddit
The fact that it runs at all with an IQ2_XXS quant is quite extraordinary!
-dysangel-@reddit (OP)
I've actually found a few larger models that work well at IQ2_XXS UD. The first one I found that worked consistently was Deepseek R1-0528.
Deepseek V3-024 was basically the same model as R1-0528, but an instruct rather than a thinking model - however it lost a lot of ability at Q2. So it's not purely about size, but larger models definitely seem to handle this level of compression better than smaller ones. As in, GLM 5.1 at Q2 still gives better results than any other model I can run.
Some other good IQ2_XXS UDs I've found in the last year:
glm-4.6-reap-268b-a32b
glm5
glm 5.1
minimax m2.5
Medium_Chemist_4032@reddit
What are your run params? By the way, posts such as these are worth their weight in gold :D Thanks!
I just tried that exact prompt with:
And it was worse. I had to add repeat penalty, because it looped like crazy otherwise
-dysangel-@reddit (OP)
Hey you're welcome. I just read another post that at least one of the unsloth Q4s was broken - you might have to redownload? :/
Medium_Chemist_4032@reddit
Downloaded an hour ago. Retried a few times with IQ2_XXS and it was much better when it comes to looping. Out of 5 gens though, I couldn't come close to your result
-dysangel-@reddit (OP)
Note that the controls were backward, the car wouldn't drive, lights were in the wrong place on the cars etc on the first prompt, so I had to go through a few iterations of feedback before everything was fully correct!
Ylsid@reddit
I'm impressed it managed to get the environmental details right. That's usually super difficult for LLMs
-dysangel-@reddit (OP)
Yeah it did a great job of the building positioning in between roads, and trees lining the road!
Of course there were still a couple of trees actually on the road, but still very impressive for a Q2, 65GB model
Itchy-Individual3536@reddit
How did you get your hands on that early copy of GTA 6? /s
BigYoSpeck@reddit
You joke, but stick some DLSS5 on this and you've got yourself a ~~slop stew~~ AAA game going on
-dysangel-@reddit (OP)
don't tell anyone but this is actually 7
AdultContemporaneous@reddit
What in the Corncob 3D is this?
ambient_temp_xeno@reddit
That's insane! I'm getting the Q8 (but I can't really even vibecode).
-dysangel-@reddit (OP)
It helps a lot if you already know how to make games, but the models themselves are getting pretty good, so they can fix a lot of problems just from you describing what you're seeing. To get quickly up and running without knowing a lot about coding, you can just ask the model to make the game in HTML and JavaScript with three.js (3D engine) and ammo.js (physics engine). If you start getting more serious, you could set up TypeScript projects, or use game frameworks like Godot etc.
ambient_temp_xeno@reddit
I had a lot more luck with the single html route compared to trying to get pygame to do anything good. I had to handhold opus 4.6 so I get what you mean about having to iterate on it by telling it what's not working.
-dysangel-@reddit (OP)
Yeah I've had a similar experience where models can do a better job in a single shot than when trying to iterate in a scaffold.
I find that GLM 5.1 needs way less handholding than Claude - it's more confident and proactive with architectural decisions. It did a great job recently refactoring my game engine (which was originally created by Claude ~3.7).
CoolstaConnor@reddit
Prompt?
-dysangel-@reddit (OP)
Prompt 1:
Prompt 2:
The rest of the prompts were mostly asking it to reverse directions on the controls. That's a common failure on all models, since they don't really have any experience with real world spaces in their training.
There was also a bug where it was checking the car for collision against itself, which was why it wasn't moving initially. The results are getting pretty impressive though for just drafting everything up "in their head" with no playtesting!
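The self-collision bug is a classic one. A sketch of what the fix looks like (illustrative names, not the generated game's actual code) - a car is always at distance 0 from itself, so without a self-check every car reports a permanent collision and refuses to move:

```javascript
// Treat each car as a bounding circle on the ground plane and return
// every *other* car it overlaps. The `other !== car` guard is the fix:
// without it, the car "collides" with itself on every frame.
function findCollisions(car, cars, radius = 2) {
  return cars.filter(other =>
    other !== car &&
    Math.hypot(other.x - car.x, other.z - car.z) < radius * 2);
}
```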
CoolstaConnor@reddit
Thanks!
averagebear_003@reddit
the birds were a nice touch tbf