Guys we have to change the pelican test
Posted by Tall-Ad-7742@reddit | LocalLLaMA | 77 comments
So I have been seeing more of those pelican-on-a-bike SVG tests, and while they work, I feel like (and maybe you guys do too) they are getting kinda benchmaxxed. So we should switch things up soon, and this is my idea:
generate me a html svg of a horse sitting in an f1 race car
Gemini 3.1 Pro gave me this
[Gemini 3.1 Pro]()
and DeepSeek Expert Mode this
[DeepSeek Expert (official website)]()
GLM 5.1 (hosted on unofficial cloud)
[GLM 5.1]()
MiniMax 2.7 (hosted on unofficial cloud)
[Minimax M2.7]()
Kimi K2.5 (don't have access to 2.6 / budget was limited so I used it via the official website)
[Kimi K2.5]()
Claude Sonnet 4.6 (official website and yes probably quantized version)
[Claude Sonnet 4.6 (Normal Thinking/Official Website)]()
Qwen 3.6 Plus (official website)
[Qwen 3.6 Plus]()
eli_pizza@reddit
Kinda think we’re overindexing on “generate an svg” questions altogether. It’s only useful if it also says something about how smart the model will be on other tasks. I have never once actually needed a zero-shot svg.
BigYoSpeck@reddit
It is an easy way to visualise the quality of output they're capable of. Both in terms of how intricate and well styled their attempt is and how well the elements align with each other
It's a bit like comparing children’s drawings. There's how neat what they do is which is basically down to their fine motor skills, and then there is how complex and detailed what they do is. My 5 year old for instance isn't anything like as neat as my 9 year old. She just hasn't developed the fine control yet. But, she makes way more use of colour and puts effort into intricate details than the 9 year old does. He's happy to just draw basic shapes for things, but she includes as much detail as she can in things, eyes have an iris and pupil with eye lashes, clothes have stitching, buttons and pockets and so on
So with the models attempts at these you can see both the accuracy in terms of how well the different shapes line up (a lot of them have the rear wing floating away from the car) and then just how intricately they attempt to create the concept in the first place. The more capable models do seem to have more complicated concepts they try to recreate
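The "rear wing floating away from the car" kind of misalignment could, in principle, be checked automatically instead of by eye. Below is a minimal sketch, assuming the model drew the parts as plain `<rect>` elements with ids; the ids `car-body` and `rear-wing` and the sample SVG are made up for illustration, and real model output would use paths and transforms that this does not handle.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical model output: a car body and a rear wing drawn as plain rects.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <rect id="car-body" x="20" y="50" width="120" height="20"/>
  <rect id="rear-wing" x="10" y="20" width="20" height="10"/>
</svg>"""


def rect_boxes(svg_text):
    """Return {id: (x_min, y_min, x_max, y_max)} for every <rect> with an id."""
    root = ET.fromstring(svg_text)
    boxes = {}
    for rect in root.iter(SVG_NS + "rect"):
        rect_id = rect.get("id")
        if rect_id is None:
            continue
        x = float(rect.get("x", 0))
        y = float(rect.get("y", 0))
        w = float(rect.get("width", 0))
        h = float(rect.get("height", 0))
        boxes[rect_id] = (x, y, x + w, y + h)
    return boxes


def touches(a, b):
    """True if two axis-aligned boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


boxes = rect_boxes(SAMPLE_SVG)
# The wing sits 20 units above the body here, so it counts as "floating".
print(touches(boxes["car-body"], boxes["rear-wing"]))  # prints False
```

This only covers the accuracy half of the comparison; how intricate the attempt is would still need a human look.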
llmentry@reddit
It was never a good test of anything. The originator of the "test" has never bothered to correlate SVG complexity with model density, or with any other benchmarks or suite of benchmarks. Nor, for that matter, can they even say what a good representation of a pelican on a bicycle actually is, other than an entirely subjective like/dislike rating.
I think we should start calling this type of lazy, pointless, uninformative pseudo-test, "vibe-benching". All benchmarks are wrong, but some are useful. But this is not the latter.
eli_pizza@reddit
It was only originally intended as a goof. A visual and intentionally silly demonstration, not a proper benchmark.
Tall-Ad-7742@reddit (OP)
yea svg doesn't show that much but it's nice/interesting to see if they can manage it. like gemini did a decent job, i think it's probably the best of all
Imaginary-Anywhere23@reddit
Qwen3.5 27b (Qwopus v3). Not bad but looks like an ant :-)
https://huggingface.co/YTan2000/Qwopus3.5-27B-v3-Abliterated-TQ3_4S
Tall-Ad-7742@reddit (OP)
true looks like an ant on a motorcycle
ambient_temp_xeno@reddit
Gemma 4 31b Q8
likegamertr@reddit
I can hear Jeremy clarkson reading this lol
lfrtsa@reddit
Ngl it's insane that we can run models like that on a consumer GPU. Being able to create arbitrary SVGs like that was considered a "spark of AGI" back in the GPT-4 days.
MLDataScientist@reddit
True. I wonder if we already have a different type of intelligence that we refuse to accept. An intelligence that works within a limited context and can hallucinate but still it is non human intelligence.
seamonn@reddit
Dolphins?
lfrtsa@reddit
There's a good chance dolphins are very human-like in intelligence... but they're so damn hard to study. There's strong evidence for their vocalizations approaching human language in complexity, and they appear to be able to communicate arbitrary information. Also, when two different pods of dolphins meet, they use much simpler vocalizations to communicate, which only makes sense if the complexity comes from a learned system that transmits complex information.
lfrtsa@reddit
I personally consider LLMs to be a form of general intelligence. The "AGI" label generally refers to the capability to fully replace humans in (at least) intellectual jobs, which we have clearly not reached yet, but I feel that calling LLMs "narrow AI" is a stretch.
I don't know if you noticed this, but people suddenly stopped calling modern AI "narrow". That term used to be thrown around all the time a few years ago. I feel that pretty much everyone quietly agrees the label doesn't fit anymore. Today's AI is clearly general, but it's still not as capable as humans for performing the vast majority of jobs. That appears to be what is saving most people's human exceptionalism, and as much as I fear the economic impact of AGI under capitalism, maybe it'll help shatter the inflated ego humans tend to have. Perhaps leading people to value nature more, but that might be optimistic.
solestri@reddit
I think this is my favorite one.
fairydreaming@reddit
Snake racing, yay!
Tall-Ad-7742@reddit (OP)
That's funny xD
It's definitely not the best but at least the wheels are not square like in Qwen's try
zwcbz@reddit
ChatGPT Pro, extended thinking, took 45 minutes
krzonkalla@reddit
Mine looked like this, why is mine lobotomized 😭
Tall-Ad-7742@reddit (OP)
45 minutes is wild but thanks for sharing
PaMRxR@reddit
2 tries with Qwen3.5-35B-A3B, no amount of prompting can get it to make something coherent :|
Tall-Ad-7742@reddit (OP)
That's uhhh hmmm... I actually don't know what to say to these results... But thanks for sharing 👍
PaMRxR@reddit
Qwen3.5-27B Q8 below.
Tall-Ad-7742@reddit (OP)
Uhm that looks... cool I guess Nah looks funny tho
mc_nu1ll@reddit
claude 4.5 opus vs 4.6 opus, both with extended thinking
4.5 Opus
Tall-Ad-7742@reddit (OP)
oh thanks
wow this one looks really solid... very interesting to see
i think it did a better job in general but the details like in gemini's try are missing (gradients and stuff)
mc_nu1ll@reddit
4.6 Opus
segmond@reddit
I have been doing this for a while with my own SVGs. When I saw the results I realized no one is benchmaxing on the pelican test. The models are truly marvelous and intelligent. VL models are often better for this and I think Google's vision strength really shows up well in such tests. They certainly are doing something others are not.
Tall-Ad-7742@reddit (OP)
true gemini did a really solid job even if its not perfect its probably the best
FinBenton@reddit
I did one on gpt5.4 and realised it actually animated it :D Doesn't look much like a horse but it's nice
https://upload.blazeit.club/index.html
Tall-Ad-7742@reddit (OP)
that's a wild looking horse driving an f1 car xD but nice to see that it animated it
Makers7886@reddit
Qwen3.5 122b FP8
Tall-Ad-7742@reddit (OP)
idk if it's better or worse than Qwen 3.6 Plus
Less_Sandwich6926@reddit
claude opus
mc_nu1ll@reddit
Claude 4.5 Opus with extended thinking. Not 4.6
Tall-Ad-7742@reddit (OP)
Interesting... i bet that the earlier version of opus (i guess 4.6) would do better because i heard that recently claude got lobotomized badly
Disposable110@reddit
https://www.youtube.com/watch?v=ZHhX44XkH-c
This should be the benchmark, replicate this video in SVG.
It contains all kinds of asinine animation goofery. And it's in Flemish full of typos.
So it needs to do animation goofery, video recognition and deal with Flemish full of typos.
dandmetal@reddit
Omnicode 9B Q4: A horse is some sort of eldritch horror, right?
fairydreaming@reddit
This car obviously already crashed and poor horse is in pieces.
thrownawaymane@reddit
Pre Halo era
Tall-Ad-7742@reddit (OP)
true that is probably what happened
fairydreaming@reddit
It still holds an eyeball and one leg with its horse-hands and is smiling while thinking about the happy horsey post-replantation life. But we all know it will race again.
Tall-Ad-7742@reddit (OP)
what? XD
That's a wild answer but a good one
Tall-Ad-7742@reddit (OP)
uhhh where is the horse and where is the car... like where does each one start xD?
SufficientDamage9483@reddit
I see nothing but profile pictures, especially the qwen one
Tall-Ad-7742@reddit (OP)
it is done now... my new profile picture the deepseek horse car
Tall-Ad-7742@reddit (OP)
xD good idea maybe i will finally set a profile picture for my reddit account
MantisAwakening@reddit
Obviously this is something a lot of models struggle with, but I gotta say it’s simply amazing that any of them can do it at all. Ask ten people you work with to draw a horse in a race car and see what you get.
Tall-Ad-7742@reddit (OP)
an answer from chatgpt...
nah just joking you are right about that
ambassadortim@reddit
Why horse and not llama
Tall-Ad-7742@reddit (OP)
omg thats brilliant... how did i miss that
jacek2023@reddit
Maybe at least pretend you tried it on the local LLM
Tall-Ad-7742@reddit (OP)
what do you mean by that?
some models are hosted on my "own" cloud... more like borrowed but yea
it's just that i can't run every model locally, and in general it's interesting to see how local models like Minimax, GLM, Kimi can compete against Gemini and Claude
jacek2023@reddit
You probably have no knowledge about local models at all, you could compare Gemma 26B to Qwen 35B or to Nemotron 30B but for you Kimi is "local".
Tall-Ad-7742@reddit (OP)
i am sorry if you see it that way... yes they are not hosted locally on my home pc, i use a rented cloud. many of us can't run them at home but there are people who can
if you have enough money you can
and sorry if i shouldn't do things like that... like spending actual money on renting a cloud provider to post some "shitty" post, then ok... if you think so
a_beautiful_rhind@reddit
That's why this test is so great. You can always pick something else and run it through a series of models. Miku, a gorilla.. can't benchmaxx it all.
Tall-Ad-7742@reddit (OP)
that's true
666666thats6sixes@reddit
looks like Qwen 3.6 Plus has some Canadian influence
Tall-Ad-7742@reddit (OP)
xD
Ok_Technology_5962@reddit
So Gemini 3.1 still solved it... I use PS4 controller tests and usually they explode on that one.
Tall-Ad-7742@reddit (OP)
yea gemini got a very good result.
not perfect but i think it's the best of all of them
magnus-m@reddit
chatgpt with extended thinking (plus plan)
https://chatgpt.com/share/69df5d9a-5ec4-832e-acf2-aba30646aa30
Tall-Ad-7742@reddit (OP)
nice looks pretty good
Admirable-Cell-2658@reddit
DeepSeek Expert is the winner!
Tall-Ad-7742@reddit (OP)
yea that's funny what deepseek did there
ResidentPositive4122@reddit
That term has become so overloaded it lost all the meaning.
The idea behind simon's test is that you can always change what you ask for, so it can't be trained for. Ask for something doing something on top of something. Or whatever you want. You can't benchmaxxx for this. Or at least the end result will be a general model that can output svg of random stuff - which is what you want anyway.
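To make the "something doing something on top of something" idea concrete, here is a rough sketch of a prompt randomizer. The word lists are just picked from examples mentioned in this thread, and the function name `make_prompt` and the prompt wording are illustrative assumptions, not part of any official test.

```python
import random

# Illustrative word lists taken from examples in this thread; swap in anything.
SUBJECTS = ["pelican", "horse", "llama", "gorilla"]
ACTIONS = ["riding", "sitting in", "standing on top of"]
OBJECTS = ["a bicycle", "an F1 race car", "a PS4 controller"]


def make_prompt(rng):
    """Build one random 'X doing Y with Z' SVG prompt from the lists above."""
    subject = rng.choice(SUBJECTS)
    action = rng.choice(ACTIONS)
    obj = rng.choice(OBJECTS)
    return f"Generate an SVG of a {subject} {action} {obj}"


# A seeded generator keeps a run reproducible, so every model gets the same prompts.
rng = random.Random(42)
for _ in range(3):
    print(make_prompt(rng))
```

Even these three short lists give 4 × 3 × 3 = 36 combinations, and adding words grows that multiplicatively, which is why it is hard to train on every possible prompt.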
Tall-Ad-7742@reddit (OP)
I haven't said anything about gemini being bad? it's actually pretty good. or who did you mean by the second part?
ResidentPositive4122@reddit
Someone replied with that message and then deleted their comment.
Tall-Ad-7742@reddit (OP)
oh ok
thank you
Thomas-Lore@reddit
What are you on about? Gemini did awfully in this test.
Tall-Ad-7742@reddit (OP)
Oh sorry, i know many use the term benchmaxxed, and yes i know many of you probably change the prompt however you want. it's just that i see this pelican test all the time so i thought i'd post something about it
AlternativeApart6340@reddit
Gpt 5.4 pro does extremely well in my tests
Tall-Ad-7742@reddit (OP)
can you share it with me? i would be interested but i don't have access to gpt 5.4, that's why i didn't include it
unculturedperl@reddit
Kimi being a Bottas to Ferrari stan was not on my F1 bingo card this year. But where would Leclerc end up in that case?
Tall-Ad-7742@reddit (OP)
there is a sidecar attached for Charles don’t worry xD
Remarkable-Avocado@reddit
So goofy! Love it!
Tall-Ad-7742@reddit (OP)
yea especially claude, minimax and qwen are looking funny