Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs
Posted by Fragrant-Remove-9031@reddit | LocalLLaMA | 95 comments
Saw this post comparing Qwen 3.6 variants on coding primitives, so I wanted to see how local quants stack up against frontier models on a similar dense, single-file coding task. I ran the exact same prompt across local and web-based models accessed through my Perplexity subscription.
The prompt
Write a single HTML file with a full-page canvas and no libraries. Simulate a realistic side-view of a moving car as the main subject. Keep the car visible in the foreground while the background landscape scrolls continuously to create the feeling that the car is driving forward. Use layered scenery for depth: nearby ground, roadside elements, trees, poles, and distant hills or mountains should move at different speeds for a natural parallax effect. Animate the wheels spinning realistically and add subtle body motion so the car feels connected to the road. Let the environment pass smoothly behind it, with repeating but varied scenery that makes the movement feel believable. Use cinematic lighting and a cohesive sky, such as sunset, dusk, or daylight, to enhance atmosphere. The overall motion should feel calm, immersive, and realistic, with a seamless looping animation.
Models tested
Frontier (web-based via Perplexity, tok/s not measured):
- Claude Sonnet 4.6 Thinking — used internet for reasoning
- Gemini 3.1 Pro Thinking
- GPT 5.4 Thinking
- Kimi k2.6 Thinking
Local (Ryzen 5 5600, 24 GB DDR4-3200, RX 5700 XT 8GB):
- Qwen3.5 9B Q4_K_M — ~50 tok/s
- Qwen3.6-27B (Claude-opus-reasoning-distilled) Q4_K_M — 2.65 tok/s
- Qwen3.6-27B Q4_K_M — 2.70 tok/s
- Qwen3.6-31B A3B Q4_K_M — 12.13 tok/s
- Gemma-4-31b-it — 1.91 tok/s
- Qwen3.5 4B Q8 — 60 tok/s — used internet for reasoning
- Qwen3.5 4B Q4_K_M — 80 tok/s — used internet for reasoning
What I looked for
Realistic side-view driving animation: layered parallax scenery, spinning wheels, subtle chassis motion, cohesive sky and lighting, and seamless looping — all vanilla JS/canvas, zero libraries.
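For anyone who hasn't tried this primitive: the core structure every model converges on is roughly the sketch below. This is hand-written for illustration, not any model's output, and all layer speeds, colors, and car geometry are made up.

```html
<!-- Minimal sketch of the pattern being judged, not any model's output.
     Layer speeds, colors, and car geometry are illustrative only. -->
<canvas id="c"></canvas>
<script>
const cv = document.getElementById('c'), ctx = cv.getContext('2d');
cv.width = innerWidth; cv.height = innerHeight;

// Parallax: distant layers scroll slower than near ones.
const layers = [
  { speed: 20,  y: 0.55, h: 0.15, color: '#7a6a8a' }, // distant hills
  { speed: 60,  y: 0.65, h: 0.12, color: '#44552f' }, // tree line
  { speed: 140, y: 0.74, h: 0.26, color: '#33302c' }, // road and ground
];

let last = performance.now(), t = 0;
function frame(now) {
  const dt = (now - last) / 1000; last = now; t += dt;
  ctx.fillStyle = '#f2a65a';                       // sunset sky
  ctx.fillRect(0, 0, cv.width, cv.height);
  for (const L of layers) {
    // Wrap the offset per layer so the scenery loops seamlessly.
    const off = (t * L.speed) % cv.width;
    ctx.fillStyle = L.color;
    ctx.fillRect(-off, L.y * cv.height, cv.width * 2, L.h * cv.height);
  }
  // Car stays fixed; wheel angle = ground distance / wheel radius,
  // so the wheels read as actually rolling on the road layer.
  drawCar(cv.width * 0.35, cv.height * 0.78, (t * 140) / 14);
  requestAnimationFrame(frame);
}
function drawCar(x, y, angle) {
  const bob = Math.sin(t * 9) * 1.5;               // subtle body motion
  ctx.fillStyle = '#b03030';
  ctx.fillRect(x - 60, y - 32 + bob, 120, 26);     // body
  for (const wx of [x - 38, x + 38]) {             // two wheels
    ctx.save(); ctx.translate(wx, y); ctx.rotate(angle);
    ctx.fillStyle = '#111';
    ctx.beginPath(); ctx.arc(0, 0, 14, 0, Math.PI * 2); ctx.fill();
    ctx.fillStyle = '#999'; ctx.fillRect(-12, -2, 24, 4); // visible spoke
    ctx.restore();
  }
}
requestAnimationFrame(frame);
</script>
```

The models were judged on how much richer and more believable they could make each of these pieces.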
Subjective ranking for this specific task
- Kimi k2.6 Thinking — cleanest overall visual result
- Qwen3.6-27B Q4_K_M (local) — stronger than I expected; good parallax and road feel
- Qwen3.6-27B Claude-opus-reasoning-distilled — close third
The local 27B quant delivered more natural motion and layering than some frontier outputs for this specific visual primitive. I was expecting frontier models to do much better — am I missing something?
Outputs
I only changed the HTML <title> tags to track which model generated which file. I’ll share all the output files and probably a few screenshots of the running animations so you can judge the visual quality yourself.
If anyone wants to run the exact same prompt on their setup — especially other MoE cuts or distills — feel free to share your results.
TheRealMasonMac@reddit
What's interesting is how so many of them have the car backwards, presumably because right-to-left appeared far more often than left-to-right. Variations of this task, where the model has to correctly orient things, could be a good benchmark for memorization!
laul_pogan@reddit
27B isn't surprising for this class of task. Single-file canvas animations are pattern-dense in pretraining data: MDN tutorials, CodePen archives, game-jam entries. The model doesn't need cross-file reasoning or architectural judgment, just spatial geometry and trigonometry it's seen thousands of times. Frontier models pull ahead on tasks that require multi-file coherence, API surface awareness, or novel domain reasoning. Pure self-contained physics/parallax math in one HTML file is exactly where a well-quantized 27B sits in its comfort zone. At 2.70 tok/s on DDR4-3200 dual-channel you're at the bandwidth ceiling for Q4_K_M 27B anyway (~51 GB/s / ~14.5 GB weights ≈ 3.5 tok/s theoretical), so the quant isn't leaving much on the table.
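The back-of-envelope, in case anyone wants to sanity-check their own rig (assumes every weight byte is streamed from RAM once per generated token, ignoring cache reuse and other overheads):

```js
// Rough bandwidth ceiling for RAM-bound token generation.
// Assumption: all weights are read once per generated token.
const channels = 2, busBytes = 8, transfersPerSec = 3200e6; // dual-channel DDR4-3200
const bandwidth = channels * busBytes * transfersPerSec;     // 51.2e9 B/s
const weightBytes = 14.5e9;                                  // ~27B params at ~4.3 bits (Q4_K_M)
console.log((bandwidth / weightBytes).toFixed(2) + ' tok/s'); // "3.53 tok/s"
```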
Charming-Author4877@reddit
Quite insane that Qwen 27B and Kimi are winning
BlackBeardAI@reddit
Qwen 27b is just…
LikeSaw@reddit
https://i.redd.it/shw9w6ndhl1h1.gif
Qwen 3.6 27b in full BF16 precision
Legal-Ad-3901@reddit
Q4 ain't the same no matter what they tell us 😭
BlackBeardAI@reddit
Yeah, it's either Q8 or the NVFP4 version that comes close. The rest are useful for plenty of stuff, but not quite the same as the full-precision model. Q8 and NVFP4 though, they are very close.
NickCanCode@reddit
Gemini 3.1 Pro...😂😂😂
Minimum-Avocado-8015@reddit
Gemini trainers imagined the future of cars to be like that?
This_Maintenance_834@reddit
claude-opus-reasoning-distilled seems to be a terrible model in this test. The original official model performed very well.
MrBabai@reddit
https://i.redd.it/rjdgfsco2m1h1.gif
gemma-4-31B-it-Q6_K - zero shot
Shin-Ryuu@reddit
Love these kinda tests cheers 🍻
itssethc@reddit
Quite the variance haha. Anything special about any of your MDs on the frontier models, or was this the same blind prompt for all?
Also love seeing others squeezing local AI gold out of a 5700 XT other than myself
Shoddy_Bed3240@reddit
https://i.redd.it/de4o2weabk1h1.gif
Qwen 3.6 27B Q8_0
Fragrant-Remove-9031@reddit (OP)
I really wanted to see what qwen 3.6 27b q8 could do. It wouldn't fit on my system though. Thanks.
swagonflyyyy@reddit
Its fucking amazing. That's all I'm going to say.
Maverobot@reddit
I've never tried the Q8. Is the difference between Q8 and Q4 really that visible?
Shoddy_Bed3240@reddit
The quality difference between Q8 and Q4 becomes noticeable with longer contexts — Q8 is also less likely to get stuck in a loop.
jazir55@reddit
The car is driving in reverse lmao
Photoperiod@reddit
Crazy how good this model is at 27b.
blastcat4@reddit
My favourite is the Moon Patrol buggy.
Rabus@reddit
where's opus?
Fragrant-Remove-9031@reddit (OP)
Sorry, I don't have access to it, but nice that you did. Very interesting results too.
Rabus@reddit
IMO if you put it into plan mode and ran a few iterations, it would be even better.
serige@reddit
Bro your opus car is moving backward though.
Rabus@reddit
I never said it was correct lol
Yes it's backwards, but it was a one-shot. I could easily prompt it again or modify the prompt a little to add some kind of feedback loop and make it right. Pretty sure the ones above wouldn't be so easy to fix lol
Agitated_Space_672@reddit
I just tried the API and it wasn't as detailed as yours. Any chance you can share the raw prompts or any skills used? And did it iterate, or was it just a single prompt/response?
Rabus@reddit
Well I literally one-shot prompted this into Claude Code.
Single prompt, Opus 4.7, max reasoning.
This is literally the full conversation; it didn't even use any agents (besides the Vercel deploy agent).
I suggest you try Claude Code?
Agitated_Space_672@reddit
Ahh, I didn't have max reasoning set.
Rabus@reddit
Ha, rookie mistake I also make lol
The #1 thing I say in all the AI trainings I run is to verify you have max reasoning set up
mehedi_shafi@reddit
Not in Copilot anymore would be my guess. The rest of the frontier models except Kimi are available in Copilot.
Rabus@reddit
Damn, tbh I don't touch Sonnet or Haiku, so Opus would be more sensible for me to know.
nasduia@reddit
This is the original Qwen/Qwen3.6-27B-FP8 in vLLM with 4-token MTP. The exact prompt from OP was entered into Open WebUI, which rendered it in a panel.
Pretty decent!
https://jsfiddle.net/z5bjhkrL
SporksInjected@reddit
GPT has variable levels of thinking. It would be nice to know which level you used, or whether you left it unset, which defaults to Auto.
Ok_Technology_5962@reddit
Mimo V2.5 PRO online version
https://i.redd.it/t301tkzskl1h1.gif
Ok_Technology_5962@reddit
Mimo v2.5 Q8_0 from Unsloth, non-Pro. Thought for 36,212 tokens: 26 minutes at 23 tok/s.
https://i.redd.it/co4y04u2kl1h1.gif
Legal-Ad-3901@reddit
Qwen 3.5 397B Q4_0 on 8x MI50s:
https://i.redd.it/5p38r7h86l1h1.gif
My Qwen3.6 27B int on an RTX 6000 Pro came out nearly identical to OP's. Between that and a recent (intentional) memory injection that threw off the 397B but that the 27B navigated with class today, I'm really rethinking my stack.
dryadofelysium@reddit
FWIW I got a better result for Gemini 3.1 Pro using Google Antigravity. Harness matters!
https://toasty-aurora-bv45.pagedrop.io
NUMERIC__RIDDLE@reddit
I keep seeing these used as benchmarks. I think it would be way more useful to run the prompt multiple times through each model, have a checkbox for each element that's described in the prompt, mark each run with what it adhered to (binary yes or no), and then compute a percentage score per element per model.
Because I can't help but feel some of these could have been really good/bad luck with what is a statistical engine.
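Something like this per model would already be an improvement. A hypothetical sketch, where `runs` and the criteria names are made up and would be filled in by hand (or by a judge model) for each generation:

```js
// Hypothetical adherence scorer: one boolean per prompt element per run.
const criteria = ['parallax layers', 'spinning wheels', 'subtle body motion',
                  'seamless loop', 'cohesive sky/lighting', 'no libraries'];
const runs = [            // e.g. 3 runs of the same model on the same prompt
  [true,  true,  false, true,  true, true],
  [true,  false, false, true,  true, true],
  [false, true,  false, false, true, true],
];
criteria.forEach((name, i) => {
  const hits = runs.filter(r => r[i]).length;
  console.log(`${name}: ${Math.round(100 * hits / runs.length)}%`);
});
```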
Heikob@reddit
https://i.redd.it/ti5zzqq79k1h1.gif
Here's what I'm getting on Qwen 3.6 A35B Claude-distilled locally on an M5 Pro 64GB
ElChupaNebrey@reddit
Wow it's so bad
Economy_Cabinet_7719@reddit
The landscape is just meant for a fantasy game 😁 And it looks gorgeous!
BitsOnWaves@reddit
what GPU are you running?
mjsxi__@reddit
BitsOnWaves@reddit
Oh wow so you are functioning on one kidney now?
road-runn3r@reddit
https://single-file-html-canvas-driving-ani.vercel.app/
Qwen3.6-27B-UD-IQ3_XXS.gguf (Unsloth's MTP one)
MindRuin@reddit
Qwen3.5 4b Q8 -.-
codehamr@reddit
check Q4 😃
lordsnoake@reddit
What is Qwen 3.5 9B Q4_K_M even doing hahaha
codehamr@reddit
nailed it 😃
NigaTroubles@reddit
27b always a beast
codehamr@reddit
it's crazy good
codehamr@reddit
I like Qwen3.5 9B Q4 😉
Cool test btw!
misha1350@reddit
You'd be better off testing DeepSeek V4 Flash and V4 Pro, given that they (especially V4 Flash) are much less expensive than any other LLM on the likes of OpenRouter.
MadGenderScientist@reddit
Please, please control for:
- how many times you run the same model with different parameters
- how many experiments you run (N>1)
What's more likely: that a particular Qwen 3.6 quantization is weirdly better at this challenge than the others, or that you simply ran Qwen 3.6 more in general?
I'd be very interested in a properly-controlled benchmark like this. I can conclude almost nothing from your runs.
Fragrant-Remove-9031@reddit (OP)
One run for each model, no biases. My PC is kinda slow for dense 9B+ models, so it takes a lot of time. But for general tasks, whichever model can do better in one go should be better, no?
But if you have the resources, please share if there are any differences.
MadGenderScientist@reddit
That's not how models work. They're probabilistic, unless you sample at temperature=0 or fix the RNG seed.
I ran an experiment - same exact prompt, same deterministic infrastructure - to prove Sperner's Lemma in 2D in Lean. I ran it about 6 times. It failed 2/6 times and passed 4/6 times, given ~2000 turns. One run used as little as ~450 turns, finding a very efficient proof, while another took ~1800 turns.
I challenge you to run the same model multiple times. Use the same prompt, the same quantization, the same harness. Do you get the same result? Do you get the same quality every time?
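For local setups that speak an OpenAI-compatible API (llama.cpp server, LM Studio, etc.), pinning the sampling looks roughly like this. The endpoint, port, and model name are placeholders, and note that some backends still aren't bit-exact even with a fixed seed:

```js
// Sketch: repeatable sampling against a local OpenAI-compatible server.
// URL and model name are placeholders for whatever you run locally.
const PROMPT = '...the exact single-file canvas prompt from the OP...';
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'qwen3.6-27b-q4_k_m',
    temperature: 0,   // greedy decoding
    seed: 42,         // fixed RNG seed, where the backend honors it
    messages: [{ role: 'user', content: PROMPT }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
```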
nomorebuttsplz@reddit
Sounds like you were making a lot of assumptions
LegacyRemaster@reddit
Qwen3.5 4B Q4_K_M is the best one. The only one to have simulated a drunk driver.
philmarcracken@reddit
it already reported on my gta6 playthrough
thetaFAANG@reddit
You ever add playwright-mcp so it can see how it messes up and iterates more heavily with greater scrutiny?
Fragrant-Remove-9031@reddit (OP)
Thanks for the tip, I usually tell the model what it did wrong (feedback loop), but visual feedback might be more effective in some cases like this.
I do have a playwright-mcp currently active, but as a web search tool actually.
MasterScrat@reddit
Oh so this is not zero shot? Did you use the same number of feedback loops per model?
Fragrant-Remove-9031@reddit (OP)
This specific task was one-shot. But for other tasks I do give error feedback to the model, text only, though I see how an image as feedback could actually be helpful.
Xonzo@reddit
Yep I do that, and tell it to interact with the project. Gives substantially better results.
xornullvoid@reddit
I am a little skeptical of this test and the usefulness of its results.
Real-world coding and development hardly depends on one-shot prompts; it usually requires multiple steps, as well as the ability of the model to ingest large and complex information and instructions over multiple prompts at long context lengths. Besides, some models are horrendous at visualizing graphical geometry, and even UI for that matter, but excel at general-purpose coding.
I think the test shows how visually cohesive the model is on a single-shot, open-ended graphical vibe-coding prompt, but to measure general cohesion you should also:
1. See how many additional features (not included in the prompt) the model tried to implement, and which of those did not work well.
2. Check for instruction-following and violations/deviations. This requires giving specific, detailed instructions at various depths of complexity to see at which level the model begins to fail at following them.
Cool experiment though.
New-Implement-5979@reddit
One correction for the last GIF: it is 35B, not 31B.
Fragrant-Remove-9031@reddit (OP)
Oh, yes, 35b. Typo, my bad.
Shoddy_Bed3240@reddit
https://i.redd.it/znl73oghlk1h1.gif
gemma-4-26B-A4B-it-UD-Q8_K_XL
2Norn@reddit
Personally I never understood the point of these "tests".
They don't really tell you anything.
Turbulent_Pin7635@reddit
27B won
HumanoidMuppet@reddit
I love these single file HTML benchmarks. Lately I've had AI write stories about my dog's adventures, and then I ask it to transform the story into a full screen single file HTML animation. It's hilarious how it puts it all together.
Gemini Pro gets the full story but the animation is boring, dogs look like sheep.
Local Qwen3.6-35b does better art and animation, but usually leaves out half the story.
I could probably improve both with a better prompt, but the vague prompt allows the reasoning to go in different directions.
roninXpl@reddit
local qwen/qwen3.6-35b-a3b https://jsfiddle.net/m01s87f4/7/
ElChupaNebrey@reddit
This model is just so bad
roninXpl@reddit
Compared to what? It's fast and decent in one go: 50 tok/s vs 10 tok/s for 27B Q4_K_M on the same machine.
s3sebastian@reddit
Usually Gemini is quite good at things like this on the first try. I improved the prompt a little though; it gave me a proper car and a nice smooth animation (Gemini 3.1 Pro Thinking via Perplexity).
BraveBrush8890@reddit
I gave your prompt to ChatGPT Pro and this is what it made: https://jsfiddle.net/b4fnqda2/
Fragrant-Remove-9031@reddit (OP)
Is that GPT 5.5? The rotating wheels and road movement are kinda coherent here.
BraveBrush8890@reddit
Yeah, 5.5 Pro model, which is on the $100 and $200 tiers
MrShrek69@reddit
I'm gonna try this on my Strix Halo, thanks! How many runs did you try?
Fragrant-Remove-9031@reddit (OP)
Just one for each model. You'd have much better options for testing with that chip. I wanna get my hands on that someday.
Shoddy_Bed3240@reddit
https://i.redd.it/0l125ciwdk1h1.gif
GLM 5.1 via opencode
rorykoehler@reddit
They all obviously stole their code and assets from a kids car building game called Labo Lado
Look_0ver_There@reddit
Here's the results from using Qwen3.6-27B-Q8_0
https://github.com/stew675/car-animation/blob/main/car-drive.html
Full model invocation is below. Ran at ~46 tok/s on 2x Radeon AI Pro R9700 GPUs.
Look_0ver_There@reddit
https://i.redd.it/k0coww0abk1h1.gif
sernamenotdefined@reddit
Qwen was going "car has 4 wheels" must all be on one side then ... 😃
snapo84@reddit
this is a very cool test :-)
clear winners for me
- kimi k2.6 thinking
- qwen 3.6 27B Q4_K_M
BitsOnWaves@reddit
Wondering about ChatGPT 5.5 extra
osoltokurva@reddit
I just ran the same prompt in Gemini and got this.
Shoddy_Bed3240@reddit
The quantization matters for one-shot prompt generation. Here is an example from Qwen 27B Q8_0: https://turquoise-jany-97.tiiny.site/. Feel free to add it as an example.
LoveMind_AI@reddit
Kimi 2.6's and GPT-5.4's actually had some feel to them, imho.
sparticleaccelerator@reddit
Cool experiment but I think you're measuring "vibes on one prompt" more than coding ability. Canvas animation is unusually forgiving - there's no "correct" output, so a model that picks slightly nicer color choices or smoother easing curves can look better than one with cleaner code underneath. Would be interesting to see you open the HTML files and compare: line count, whether they actually use requestAnimationFrame properly, whether the parallax math is principled or just hardcoded magic numbers. My guess is the frontier models look "worse" but have more correct/extensible code. Single-prompt n=1 on a subjective task is where local models tend to overperform.
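Even a crude static pass over the output files would say a lot. A Node.js sketch with placeholder filenames; the regexes are heuristics, not proof of correct use:

```js
// Crude static checks over the generated single-file outputs (Node.js).
import { readFileSync } from 'node:fs';

for (const file of ['kimi.html', 'qwen27b-q4.html', 'gpt.html']) { // placeholders
  const src = readFileSync(file, 'utf8');
  console.log(file, {
    lines: src.split('\n').length,
    usesRAF: /requestAnimationFrame/.test(src),               // proper animation loop?
    timeBased: /performance\.now|deltaTime|\bdt\b/.test(src), // frame-rate independent?
    floatLiterals: (src.match(/\d+\.\d+/g) || []).length,     // magic-number proxy
  });
}
```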
Fragrant-Remove-9031@reddit (OP)
I do feel frontier models have better "physics" though
Foreign_Risk_2031@reddit
Very poor benchmark in 2026, but it's 2002 Adobe Flash-level cool.