Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs
Posted by Fragrant-Remove-9031@reddit | LocalLLaMA | 95 comments
Saw this post comparing Qwen 3.6 variants on coding primitives, so I wanted to see how local quants stack up against frontier models on a similar dense, single-file coding task. I ran the exact same prompt across local and web-based models accessed through my Perplexity subscription.
The prompt
Write a single HTML file with a full-page canvas and no libraries. Simulate a realistic side-view of a moving car as the main subject. Keep the car visible in the foreground while the background landscape scrolls continuously to create the feeling that the car is driving forward. Use layered scenery for depth: nearby ground, roadside elements, trees, poles, and distant hills or mountains should move at different speeds for a natural parallax effect. Animate the wheels spinning realistically and add subtle body motion so the car feels connected to the road. Let the environment pass smoothly behind it, with repeating but varied scenery that makes the movement feel believable. Use cinematic lighting and a cohesive sky, such as sunset, dusk, or daylight, to enhance atmosphere. The overall motion should feel calm, immersive, and realistic, with a seamless looping animation.
Models tested
Frontier (web-based via Perplexity, tok/s not measured):
- Claude Sonnet 4.6 Thinking — used internet for reasoning
- Gemini 3.1 Pro Thinking
- GPT 5.4 Thinking
- Kimi k2.6 Thinking
Local (Ryzen 5 5600, 24 GB DDR4-3200, RX 5700 XT 8GB):
- Qwen3.5 9B Q4_K_M — ~50 tok/s
- Qwen3.6-27B (Claude-opus-reasoning-distilled) Q4_K_M — 2.65 tok/s
- Qwen3.6-27B Q4_K_M — 2.70 tok/s
- Qwen3.6-31B A3B Q4_K_M — 12.13 tok/s
- Gemma-4-31b-it — 1.91 tok/s
- Qwen3.5 4B Q8 — 60 tok/s — used internet for reasoning
- Qwen3.5 4B Q4_K_M — 80 tok/s — used internet for reasoning
What I looked for
Realistic side-view driving animation: layered parallax scenery, spinning wheels, subtle chassis motion, cohesive sky and lighting, and seamless looping — all vanilla JS/canvas, zero libraries.
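For anyone who hasn't tried this primitive: the core structure every model converges on is roughly the sketch below. This is hand-written for illustration, not any model's output, and all layer speeds, colors, and car geometry are made up.

```html
<!-- Minimal sketch of the pattern being judged, not any model's output.
     Layer speeds, colors, and car geometry are illustrative only. -->
<canvas id="c"></canvas>
<script>
const cv = document.getElementById('c'), ctx = cv.getContext('2d');
cv.width = innerWidth; cv.height = innerHeight;

// Parallax: distant layers scroll slower than near ones.
const layers = [
  { speed: 20,  y: 0.55, h: 0.15, color: '#7a6a8a' }, // distant hills
  { speed: 60,  y: 0.65, h: 0.12, color: '#44552f' }, // tree line
  { speed: 140, y: 0.74, h: 0.26, color: '#33302c' }, // road and ground
];

let last = performance.now(), t = 0;
function frame(now) {
  const dt = (now - last) / 1000; last = now; t += dt;
  ctx.fillStyle = '#f2a65a';                       // sunset sky
  ctx.fillRect(0, 0, cv.width, cv.height);
  for (const L of layers) {
    // Wrap the offset per layer so the scenery loops seamlessly.
    const off = (t * L.speed) % cv.width;
    ctx.fillStyle = L.color;
    ctx.fillRect(-off, L.y * cv.height, cv.width * 2, L.h * cv.height);
  }
  // Car stays fixed; wheel angle = ground distance / wheel radius,
  // so the wheels read as actually rolling on the road layer.
  drawCar(cv.width * 0.35, cv.height * 0.78, (t * 140) / 14);
  requestAnimationFrame(frame);
}
function drawCar(x, y, angle) {
  const bob = Math.sin(t * 9) * 1.5;               // subtle body motion
  ctx.fillStyle = '#b03030';
  ctx.fillRect(x - 60, y - 32 + bob, 120, 26);     // body
  for (const wx of [x - 38, x + 38]) {             // two wheels
    ctx.save(); ctx.translate(wx, y); ctx.rotate(angle);
    ctx.fillStyle = '#111';
    ctx.beginPath(); ctx.arc(0, 0, 14, 0, Math.PI * 2); ctx.fill();
    ctx.fillStyle = '#999'; ctx.fillRect(-12, -2, 24, 4); // visible spoke
    ctx.restore();
  }
}
requestAnimationFrame(frame);
</script>
```

The models were judged on how much richer and more believable they could make each of these pieces.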
Subjective ranking for this specific task
- Kimi k2.6 Thinking — cleanest overall visual result
- Qwen3.6-27B Q4_K_M (local) — stronger than I expected; good parallax and road feel
- Qwen3.6-27B Claude-opus-reasoning-distilled — close third
The local 27B quant delivered more natural motion and layering than some frontier outputs for this specific visual primitive. I was expecting frontier models to do much better — am I missing something?
Outputs
I only changed the HTML <title> tags to track which model generated which file. I’ll share all the output files and probably a few screenshots of the running animations so you can judge the visual quality yourself.
If anyone wants to run the exact same prompt on their setup — especially other MoE cuts or distills — feel free to share your results.
TheRealMasonMac@reddit
What's interesting is how so many of them have the car backwards, presumably because right-to-left appeared far more often than left-to-right. Variations of this task, where the model has to correctly orient things, could be a good benchmark for memorization!
laul_pogan@reddit
27B isn't surprising for this class of task. Single-file canvas animations are pattern-dense in pretraining data: MDN tutorials, CodePen archives, game-jam entries. The model doesn't need cross-file reasoning or architectural judgment, just spatial geometry and trigonometry it's seen thousands of times. Frontier models pull ahead on tasks that require multi-file coherence, API surface awareness, or novel domain reasoning. Pure self-contained physics/parallax math in one HTML file is exactly where a well-quantized 27B sits in its comfort zone. At 2.70 tok/s on DDR4-3200 dual-channel you're at the bandwidth ceiling for Q4_K_M 27B anyway (~51 GB/s / ~14.5 GB weights ≈ 3.5 tok/s theoretical), so the quant isn't leaving much on the table.
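The back-of-envelope, in case anyone wants to sanity-check their own rig (assumes every weight byte is streamed from RAM once per generated token, ignoring cache reuse and other overheads):

```js
// Rough bandwidth ceiling for RAM-bound token generation.
// Assumption: all weights are read once per generated token.
const channels = 2, busBytes = 8, transfersPerSec = 3200e6; // dual-channel DDR4-3200
const bandwidth = channels * busBytes * transfersPerSec;     // 51.2e9 B/s
const weightBytes = 14.5e9;                                  // ~27B params at ~4.3 bits (Q4_K_M)
console.log((bandwidth / weightBytes).toFixed(2) + ' tok/s'); // "3.53 tok/s"
```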
Charming-Author4877@reddit
Quite insane that Qwen 27B and Kimi are winning
BlackBeardAI@reddit
Qwen 27b is just…
LikeSaw@reddit
https://i.redd.it/shw9w6ndhl1h1.gif
Qwen 3.6 27b in full BF16 precision
Legal-Ad-3901@reddit
Q4 ain't the same no matter what they tell us 😭
BlackBeardAI@reddit
Yeah, it's either Q8 or the NVFP4 version that comes close. The rest are useful for plenty of stuff, but not quite the same as the full-precision model. Q8 and NVFP4 though, they are very close.
NickCanCode@reddit
Gemini 3.1 Pro...😂😂😂
Minimum-Avocado-8015@reddit
Gemini trainers imagined the future of cars to be like that?
This_Maintenance_834@reddit
claude-opus-reasoning-distilled seems to be a terrible model in this test. The original official model performed very well.
MrBabai@reddit
https://i.redd.it/rjdgfsco2m1h1.gif
gemma-4-31B-it-Q6_K - zero shot
Shin-Ryuu@reddit
Love these kinda tests cheers 🍻
itssethc@reddit
Quite the variance haha. Anything special about any of your MDs on the frontier models, or was this the same blind prompt for all?
Also love seeing others squeezing local AI gold out of a 5700 XT other than myself
Shoddy_Bed3240@reddit
https://i.redd.it/de4o2weabk1h1.gif
Qwen 3.6 27B Q8_0
Fragrant-Remove-9031@reddit (OP)
I really wanted to see what qwen 3.6 27b q8 could do. It wouldn't fit on my system though. Thanks.
swagonflyyyy@reddit
Its fucking amazing. That's all I'm going to say.
Maverobot@reddit
I've never tried the Q8. Is the difference between Q8 and Q4 really that visible?
Shoddy_Bed3240@reddit
The quality difference between Q8 and Q4 becomes noticeable with longer contexts — Q8 is also less likely to get stuck in a loop.
jazir55@reddit
The car is driving in reverse lmao
Photoperiod@reddit
Crazy how good this model is at 27b.
blastcat4@reddit
My favourite is the Moon Patrol buggy.
Rabus@reddit
where's opus?
Fragrant-Remove-9031@reddit (OP)
Sorry, I don't have access to it, but nice that you did. Very interesting results too.
Rabus@reddit
IMO if you put it into plan mode and ran a few iterations, it would be even better.
serige@reddit
Bro your opus car is moving backward though.
Rabus@reddit
I never said it was correct lol
Yes it's backwards, but it was a one-shot. I could easily prompt it again or modify the prompt a little to add some kind of feedback loop and make it right. Pretty sure the ones above wouldn't be so easy to fix lol
Agitated_Space_672@reddit
I just tried the API and it wasn't as detailed as yours. Any chance you can share the raw prompts or any skills used? And did it iterate, or was it just a single prompt/response?
Rabus@reddit
Well I literally one-shot prompted this into Claude Code.
Single prompt, Opus 4.7, max reasoning.
This is literally the full conversation; it didn't even use any agents (besides the Vercel deploy agent).
I suggest you try Claude Code?
Agitated_Space_672@reddit
Ahh, I didn't have max reasoning set.
Rabus@reddit
Ha, rookie mistake I also make lol
The #1 thing I say in all the AI trainings I run is to verify you have max reasoning set up
mehedi_shafi@reddit
Not in Copilot anymore would be my guess. The rest of the frontier models except Kimi are available in Copilot.
Rabus@reddit
Damn, tbh I don't touch Sonnet or Haiku, so Opus would be more sensible for me to know.
nasduia@reddit
This is the original Qwen/Qwen3.6-27B-FP8 in vLLM with 4-token MTP. The exact prompt from OP was entered into Open WebUI, which rendered it in a panel.
Pretty decent!
https://jsfiddle.net/z5bjhkrL
SporksInjected@reddit
GPT has variable levels of thinking. It would be nice to know which level you used, or whether you left it unset, which defaults to Auto.
Ok_Technology_5962@reddit
Mimo V2.5 PRO online version
https://i.redd.it/t301tkzskl1h1.gif
Ok_Technology_5962@reddit
Mimo v2.5 Q8_0 from Unsloth, non-Pro. Thought for 36,212 tokens: 26 minutes at 23 tok/s.
https://i.redd.it/co4y04u2kl1h1.gif
Legal-Ad-3901@reddit
Qwen 3.5 397B Q4_0 on 8x MI50s:
https://i.redd.it/5p38r7h86l1h1.gif
My Qwen3.6 27B int on an RTX 6000 Pro came out nearly identical to OP's. Between that and a recent (intentional) memory injection that threw off the 397B but that the 27B navigated with class today, I'm really rethinking my stack.
dryadofelysium@reddit
FWIW I got a better result for Gemini 3.1 Pro using Google Antigravity. Harness matters!
https://toasty-aurora-bv45.pagedrop.io
NUMERIC__RIDDLE@reddit
I keep seeing these used as benchmarks. I think it would be way more useful to run the prompt multiple times through each model, have a checkbox for each element that's described in the prompt, mark each run with what it adhered to (binary yes or no), and then compute a percentage score per element per model.
Because I can't help but feel some of these could have been really good/bad luck with what is a statistical engine.
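Something like this per model would already be an improvement. A hypothetical sketch, where `runs` and the criteria names are made up and would be filled in by hand (or by a judge model) for each generation:

```js
// Hypothetical adherence scorer: one boolean per prompt element per run.
const criteria = ['parallax layers', 'spinning wheels', 'subtle body motion',
                  'seamless loop', 'cohesive sky/lighting', 'no libraries'];
const runs = [            // e.g. 3 runs of the same model on the same prompt
  [true,  true,  false, true,  true, true],
  [true,  false, false, true,  true, true],
  [false, true,  false, false, true, true],
];
criteria.forEach((name, i) => {
  const hits = runs.filter(r => r[i]).length;
  console.log(`${name}: ${Math.round(100 * hits / runs.length)}%`);
});
```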
Heikob@reddit
https://i.redd.it/ti5zzqq79k1h1.gif
Here's what I'm getting on Qwen 3.6 A35B Claude-distilled locally on an M5 Pro 64GB
ElChupaNebrey@reddit
Wow it's so bad
Economy_Cabinet_7719@reddit
The landscape is just meant for a fantasy game 😁 And it looks gorgeous!
BitsOnWaves@reddit
what GPU are you running?
mjsxi__@reddit
BitsOnWaves@reddit
Oh wow so you are functioning on one kidney now?
road-runn3r@reddit
https://single-file-html-canvas-driving-ani.vercel.app/
Qwen3.6-27B-UD-IQ3_XXS.gguf (Unsloth's MTP one)
MindRuin@reddit
Qwen3.5 4b Q8 -.-
codehamr@reddit
check Q4 😃
lordsnoake@reddit
What is Qwen 3.5 9B Q4_K_M even doing hahaha
codehamr@reddit
nailed it 😃
NigaTroubles@reddit
27b always a beast
codehamr@reddit
it's crazy good
codehamr@reddit
I like Qwen3.5 9B Q4 😉
Cool test btw!
misha1350@reddit
You'd be better off testing DeepSeek V4 Flash and V4 Pro, given that they (especially V4 Flash) are much less expensive than any other LLM on the likes of OpenRouter.
MadGenderScientist@reddit
Please, please control for:
- how many times you run the same model with different parameters
- how many experiments you run (N>1)
What's more likely: that a particular Qwen 3.6 quantization is weirdly better at this challenge than the others, or that you simply ran Qwen 3.6 more in general?
I'd be very interested in a properly-controlled benchmark like this. I can conclude almost nothing from your runs.
Fragrant-Remove-9031@reddit (OP)
One run for each model, no biases. My PC is kinda slow for dense 9B+ models, so it takes a lot of time. But for general tasks, whichever model can do better in one go should be better, no?
But if you have the resources, please share if there are any differences.
MadGenderScientist@reddit
That's not how models work. They're probabilistic, unless you sample at temperature=0 or fix the RNG seed.
I ran an experiment - same exact prompt, same deterministic infrastructure - to prove Sperner's Lemma in 2D in Lean. I ran it about 6 times. It failed 2/6 times and passed 4/6 times, given ~2000 turns. One run used as little as ~450 turns, finding a very efficient proof, while another took ~1800 turns.
I challenge you to run the same model multiple times. Use the same prompt, the same quantization, the same harness. Do you get the same result? Do you get the same quality every time?
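For local setups that speak an OpenAI-compatible API (llama.cpp server, LM Studio, etc.), pinning the sampling looks roughly like this. The endpoint, port, and model name are placeholders, and note that some backends still aren't bit-exact even with a fixed seed:

```js
// Sketch: repeatable sampling against a local OpenAI-compatible server.
// URL and model name are placeholders for whatever you run locally.
const PROMPT = '...the exact single-file canvas prompt from the OP...';
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'qwen3.6-27b-q4_k_m',
    temperature: 0,   // greedy decoding
    seed: 42,         // fixed RNG seed, where the backend honors it
    messages: [{ role: 'user', content: PROMPT }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
```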
nomorebuttsplz@reddit
Sounds like you were making a lot of assumptions
LegacyRemaster@reddit
Qwen3.5 4B Q4_K_M is the best one. The only one to have simulated a drunk driver.
philmarcracken@reddit
it already reported on my gta6 playthrough
thetaFAANG@reddit
You ever add playwright-mcp so it can see how it messes up and iterates more heavily with greater scrutiny?
Fragrant-Remove-9031@reddit (OP)
Thanks for the tip, I usually tell the model what it did wrong (feedback loop), but visual feedback might be more effective in some cases like this.
I do have a playwright-mcp currently active, but as a web search tool actually.
MasterScrat@reddit
Oh so this is not zero shot? Did you use the same number of feedback loops per model?
Fragrant-Remove-9031@reddit (OP)
This specific task was one-shot. But for other tasks I do give error feedback to the model, text only, though I see how an image as feedback could actually be helpful.
Xonzo@reddit
Yep I do that, and tell it to interact with the project. Gives substantially better results.
xornullvoid@reddit
I am a little skeptical of this test and the usefulness of its results.
Real-world coding and development hardly depends on one-shot prompts; it usually requires multiple steps, as well as the ability of the model to ingest large and complex information and instructions over multiple prompts at long context lengths. Besides, some models are horrendous at visualizing graphical geometry, and even UI for that matter, but excel at general-purpose coding.
I think the test shows how visually cohesive the model is on a single-shot, open-ended graphical vibe-coding prompt, but to measure general cohesion you should also:
1. See how many additional features (not included in the prompt) the model tried to implement, and which of those did not work well.
2. Check for instruction-following and violations/deviations. This requires giving specific, detailed instructions at various depths of complexity to see at which level the model begins to fail at following them.
Cool experiment though.
New-Implement-5979@reddit
One correction for the last GIF: it is 35B, not 31B.
Fragrant-Remove-9031@reddit (OP)
Oh, yes, 35b. Typo, my bad.
Shoddy_Bed3240@reddit
https://i.redd.it/znl73oghlk1h1.gif
gemma-4-26B-A4B-it-UD-Q8_K_XL
2Norn@reddit
Personally I never understood the point of these "tests".
They don't really tell you anything.
Turbulent_Pin7635@reddit
27B won
HumanoidMuppet@reddit
I love these single file HTML benchmarks. Lately I've had AI write stories about my dog's adventures, and then I ask it to transform the story into a full screen single file HTML animation. It's hilarious how it puts it all together.
Gemini Pro gets the full story but the animation is boring, dogs look like sheep.
Local Qwen3.6-35b does better art and animation, but usually leaves out half the story.
I could probably improve both with a better prompt, but the vague prompt allows the reasoning to go in different directions.
roninXpl@reddit
local qwen/qwen3.6-35b-a3b https://jsfiddle.net/m01s87f4/7/
ElChupaNebrey@reddit
This model is just so bad
roninXpl@reddit
Compared to what? It's fast and decent in one go: 50 tok/s vs 10 tok/s for 27B Q4_K_M on the same machine.
s3sebastian@reddit
Usually Gemini is quite good at things like this on the first try. I improved the prompt a little though; it gave me a proper car and a nice smooth animation (Gemini 3.1 Pro Thinking via Perplexity).
BraveBrush8890@reddit
I gave your prompt to ChatGPT Pro and this is what it made: https://jsfiddle.net/b4fnqda2/
Fragrant-Remove-9031@reddit (OP)
Is that GPT 5.5? The rotating wheels and road movement are kinda coherent here.
BraveBrush8890@reddit
Yeah, 5.5 Pro model, which is on the $100 and $200 tiers
MrShrek69@reddit
I'm gonna try this on my Strix Halo, thanks! How many runs did you try?
Fragrant-Remove-9031@reddit (OP)
Just one for each model. You'd have much better options for testing with that chip. I wanna get my hands on that someday.
Shoddy_Bed3240@reddit
https://i.redd.it/0l125ciwdk1h1.gif
GLM 5.1 via opencode
rorykoehler@reddit
They all obviously stole their code and assets from a kids car building game called Labo Lado
Look_0ver_There@reddit
Here's the results from using Qwen3.6-27B-Q8_0
https://github.com/stew675/car-animation/blob/main/car-drive.html
Full model invocation is below. Ran at ~46 tok/s on 2x Radeon AI Pro R9700 GPUs.
Look_0ver_There@reddit
https://i.redd.it/k0coww0abk1h1.gif
sernamenotdefined@reddit
Qwen was going "car has 4 wheels" must all be on one side then ... 😃
snapo84@reddit
this is a very cool test :-)
clear winners for me
- kimi k2.6 thinking
- qwen 3.6 27B Q4_K_M
BitsOnWaves@reddit
Wondering about ChatGPT 5.5 extra
osoltokurva@reddit
I just ran the same prompt in Gemini and got this.
Shoddy_Bed3240@reddit
The quantization matters for one-shot prompt generation. Here is an example from Qwen 27B Q8_0: https://turquoise-jany-97.tiiny.site/. Feel free to add it as an example.
LoveMind_AI@reddit
Kimi 2.6's and GPT-5.4's actually had some feel to them, imho.
sparticleaccelerator@reddit
Cool experiment but I think you're measuring "vibes on one prompt" more than coding ability. Canvas animation is unusually forgiving - there's no "correct" output, so a model that picks slightly nicer color choices or smoother easing curves can look better than one with cleaner code underneath. Would be interesting to see you open the HTML files and compare: line count, whether they actually use requestAnimationFrame properly, whether the parallax math is principled or just hardcoded magic numbers. My guess is the frontier models look "worse" but have more correct/extensible code. Single-prompt n=1 on a subjective task is where local models tend to overperform.
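Even a crude static pass over the output files would say a lot. A Node.js sketch with placeholder filenames; the regexes are heuristics, not proof of correct use:

```js
// Crude static checks over the generated single-file outputs (Node.js).
import { readFileSync } from 'node:fs';

for (const file of ['kimi.html', 'qwen27b-q4.html', 'gpt.html']) { // placeholders
  const src = readFileSync(file, 'utf8');
  console.log(file, {
    lines: src.split('\n').length,
    usesRAF: /requestAnimationFrame/.test(src),               // proper animation loop?
    timeBased: /performance\.now|deltaTime|\bdt\b/.test(src), // frame-rate independent?
    floatLiterals: (src.match(/\d+\.\d+/g) || []).length,     // magic-number proxy
  });
}
```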
Fragrant-Remove-9031@reddit (OP)
I do feel frontier models have better "physics" though
Foreign_Risk_2031@reddit
Very poor benchmark in 2026, but it's 2002 Adobe Flash-level cool.