I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX
Posted by StudentDifficult8240@reddit | LocalLLaMA | 13 comments
I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.
All 8-bit MLX, M3 Max 128GB, served via oMLX, prompted through Claude Code. Same prompt every time — single-file HTML, three selectable planes (jet, prop, wildcard of the model's choice), dynamic enemies, tracers, damage, crash spiral on loss. Counted prompts-to-final and graded on "does it actually play."
The lineup:
- Gemma 31B dense unsloth
- Gemma 4 26B a4b unsloth
- Qwen3.5 27B dense
- Qwen3.5 35B A3B MoE
- Qwen3.6 35B A3B in **three different quants** (oMLX, Unsloth, MLX Community)
- Qwen3 Coder Next 80B
- Qwopus 3.5 27B
Surprising findings:
1. Quant provider matters more than bit width. Three 8-bit quants of the exact same Qwen3.6 35B produced three meaningfully different games. Unsloth nailed it in 3 prompts (1,304 lines, working minimap, round planet, the model reviewed its own code for bugs before I pressed enter). MLX Community was fine in 4. oMLX was a 5-prompt debugging slog where the controls rubberbanded back to neutral and the model couldn't figure out why after three attempts. Same base model, same 8 bits, very different UX. "It's 8-bit" is not a sufficient description of a quant.
2. Line count is basically uncorrelated with quality. The winner (Qwopus 3.5 27B) shipped in 2 prompts at 1,049 lines. The loser (Qwen Coder Next 80B) shipped in 3 prompts at 1,635 lines — the most code of anyone — with over-sensitive camera, no enemies, and planes rotated 180°. The 80B sibling generated 3× the code of Gemma 31B dense and shipped a worse game.
3. Qwopus was the only model that implemented actual flight physics. Nobody asked for it. It just did it — integrated thrust/drag with per-plane aerodynamic constants, per-frame velocity damping, the F-16 accelerates differently than the Mustang because the constants are different. Also the only one that shipped procedural audio (engine frequency modulated by airspeed ratio). 2 prompts. I have to assume this is the Opus distillation doing real work, because the vanilla Qwen3.5 27B dense — same base — shipped the worst game in the lineup (control loop mixing quaternion rotations with direct Euler writes in the same frame, plane spun like a blender while falling out of the sky).
Web audio engine with pitch modulated by airspeed ratio:

```javascript
// engineOsc (an OscillatorNode) and audioCtx (an AudioContext) are created at setup
function updateEngineSound(speedRatio) {
    engineOsc.frequency.setValueAtTime(80 + speedRatio * 120, audioCtx.currentTime);
}

// From the F-16 config: speed, turn/climb rates, thrust, drag
speed: 1200, turnRate: 0.015, climbRate: 0.008, thrust: 0.02, drag: 0.001,

// In the update loop
this.velocity.add(forward.multiplyScalar(this.stats.thrust * 1000 * delta));
this.velocity.multiplyScalar(1 - this.stats.drag);
```
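To see why those per-plane constants give each plane a distinct feel, here's a minimal 1-D sketch of the same additive-thrust / multiplicative-drag scheme. The F-16 constants (thrust 0.02, drag 0.001) are from the snippet above; the "Mustang" constants are invented here purely for comparison.

```javascript
// 1-D version of the update loop: thrust added each frame, drag as damping.
// F-16 constants are from the game's config; Mustang values are hypothetical.
function simulate(thrust, drag, steps, delta = 1 / 60) {
    let v = 0;
    for (let i = 0; i < steps; i++) {
        v += thrust * 1000 * delta; // thrust integrated each frame
        v *= 1 - drag;              // per-frame velocity damping
    }
    return v;
}

const f16 = simulate(0.02, 0.001, 600);      // ~10 s at 60 fps
const mustang = simulate(0.008, 0.002, 600); // hypothetical prop constants
```

A nice property of this scheme is that it needs no explicit max-speed clamp: the loop converges toward a steady state `thrust * 1000 * delta * (1 - drag) / drag`, so each plane settles at its own top speed determined entirely by its two constants.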
Other notes worth mentioning:
- Generation speed: Gemma 4 26B a4b was the king at 58.3 tok/s, nearly 2× the Qwen A3B variants and ~7× the dense models. Qwopus generates at under 11 tok/s and still won. Per-token speed is a bad proxy for "time to working artifact."
- Qwen3.6 is a real step up over 3.5. The .1 increment packs more than usual — models reviewing their own output, trying to open the generated HTML in a browser for you. Little things, but they add up.
- The "pick a third plane" wildcard was a surprisingly good creativity probe. Qwen3.6 oMLX picked an AH-64 Apache (technically not a plane, technically the most interesting answer). Qwen Coder Next 80B, the largest model in the lineup, responded to "an option of your choosing" by shipping a third fighter jet.
- The Qwen signature bug: planes rendered 180° rotated. Showed up in most of the Qwen variants.
My personal ranking:
- Qwopus 3.5 27B dense
- Qwen3.6 35B unsloth
- Gemma 4 26B unsloth
- Qwen3.6 35B mlx-community
- Qwen3.5 35B mlx-community
- Qwen3.6 35B oMLX oQ quant
- Qwen3Coder-Next 80B mlx-community
If anyone wants a more detailed (and punny) write-up with per-model breakdowns and the specific code snippets that caused the blender-plane incident, it's on my Medium page, no paywall.
Comments at the top of each HTML file on GitHub record every prompt that was fed into Claude Code, along with notes.
Happy to dig into any of the specific results in comments. Two follow-ups planned — same 9 models on a 10-bug code review, and a creative task still TBD.
nullrecord@reddit
Very interesting! I just tried Qwen3.6-35b-a3b unsloth at Q4_K_S on my 32GB Mac and it works pretty well. It gave an almost usable game on the first try.
Bootes-sphere@reddit
FWIW from running flight sim inference in prod, quantization provider matters way more than people think. The difference between a tight Q8 and a sloppy one can tank context coherence in long sequences—especially for domain-specific stuff like flight dynamics where precision compounds.
Parameter count becomes almost secondary when your quant is lossy. You might see a 13B Q8 from Provider A crush a 70B from Provider B if the latter was aggressive with calibration.
What's your take on how the best performers handled the technical details? Did you notice one provider consistently preserve numerical stability better, or was it more about which model architecture just "fit" the domain? The fact that you tested 9 with the same quantization level is gold—most people don't isolate variables that carefully.
StudentDifficult8240@reddit (OP)
Small clarification — the test was models generating code that implements flight physics, not models doing physics inference themselves, so "numerical stability" in the quant sense isn't really the mechanism here. Kindly let me know if I misunderstood that part. What I did see is more like code coherence over long generations: the oMLX variant lost the thread on control-loop logic in a way the Unsloth variant didn't, on the same base model at the same bit width. That feels more like calibration quality affecting long-range instruction-following than anything numerical.
I didn't test enough models per provider to say "Unsloth consistently better" with confidence — it's one data point on one base model. But the Qwen3.6 35B three-way was striking enough that I want to run the same test on a different base next to see if the ordering holds or flips.
I did another batch of testing a few weeks ago, comparing the same models across different quant providers while aiming for the same Q level. The findings were similar: Unsloth quants, and other quants that identify sensitive layers and preserve them at full precision, tend to retain more capability, better CoT, and more overall reliability in long-horizon tasks than a quant that applies the same precision uniformly across the whole model. (Shorter tasks show it too, but I think long-horizon ones are the better metric.)
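A toy illustration of that point, with entirely made-up numbers: when one outlier value shares a quantization scale with many small values, the small ones collapse to zero; keeping the "sensitive" outlier at full precision lets the rest use a much tighter scale.

```javascript
// Symmetric int8-style round-trip: quantize to a shared scale, then dequantize.
function quantDequant(xs, bits = 8) {
    const levels = 2 ** (bits - 1) - 1; // 127 for 8-bit
    const scale = Math.max(...xs.map(Math.abs)) / levels;
    return xs.map(x => Math.round(x / scale) * scale);
}

function mse(a, b) {
    return a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0) / a.length;
}

// Five small weights plus one large outlier (all values invented).
const weights = [0.01, -0.02, 0.015, 0.005, -0.012, 8.0];

// Uniform: one scale for everything -- the outlier dominates it.
const uniform = quantDequant(weights);

// "Selective": keep the outlier at full precision, quantize the rest alone.
const selective = [...quantDequant(weights.slice(0, 5)), weights[5]];
```

Under the uniform scale the five small weights all round to zero, while the selective variant reconstructs them almost exactly. Real selective schemes operate per layer or per channel rather than per value, but the failure mode is the same shape.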
BingpotStudio@reddit
I don’t consider adding features you don’t ask for a positive. This is scope creep and it leads to a model that isn’t doing as it’s told and builds features your program wasn’t designed to handle elsewhere.
StudentDifficult8240@reddit (OP)
You raise a very good point: the web audio feature was definitely something no one asked for. The flight simulator aspect, however, is contained in the prompt, whose first sentence is 'Design and create flight combat simulator game.'
It is interesting that out of all the models, the Qwopus model went above and beyond on the simulator concept.
When I said in my post that no one asked for it, I meant something like: no one asked for this level of flight simulation. In hindsight, I should have been clearer.
I do agree that models going rogue and implementing features you haven't asked for would lead to project drift.
Sabin_Stargem@reddit
AI are like genies and lawyers: Be careful about how you word things.
9gxa05s8fa8sh@reddit
good work. incredible what the locals can vibe. imagine what they would do if you tested with a full framework built by a smarter model using superpowers...
Long_comment_san@reddit
"Qwen3.6 is a real step up over 3.5. The .1 increment packs more than usual"
That's what I fucking said in another post. It should have been called 4.0. 3.6 is stupid and confusing. I don't know what kind of "collective hallucination" Chinese managers operate at, but 0.1 is minor improvement, +1 is a major improvement and I'm ready to die on this hill defending this logic. Their logic makes no sense to me.
CatNo2950@reddit
I believe +1 should be for ground(:))breaking architectural changes.
Lost-Health-8675@reddit
Thank you! This just cements unsloth as my favorite provider.
Alternative_You3585@reddit
Thanks, I see more and more confirmation that opus distillation seems to be effective. I'll always download such models if available now
RegularRecipe6175@reddit
Great post. Very informative.
StudentDifficult8240@reddit (OP)
Thank you kindly :)