Guys we have to change the pelican test | TheaterFire

Posted by Tall-Ad-7742@reddit | LocalLLaMA | 77 comments

So I have been seeing more of those pelican-on-a-bike SVG tests, and while they work, I feel like (and maybe you guys do too) they are getting kinda benchmaxxed. We should switch things up soon, and this is my idea:

generate me a html svg of a horse sitting in an f1 race car

Gemini 3.1 Pro gave me this

[Image: Gemini 3.1 Pro]

and DeepSeek Expert Mode this

[Image: DeepSeek Expert (official website)]

GLM 5.1 (hosted on unofficial cloud)

[Image: GLM 5.1]

MiniMax 2.7 (hosted on unofficial cloud)

[Image: Minimax M2.7]

Kimi K2.5 (don't have access to 2.6 / budget was limited so I used it via the official website)

[Image: Kimi K2.5]

Claude Sonnet 4.6 (official website and yes probably quantized version)

[Image: Claude Sonnet 4.6 (Normal Thinking/Official Website)]

Qwen 3.6 Plus (official website)

[Image: Qwen 3.6 Plus]

[-]

eli_pizza@reddit

Kinda think we’re overindexing on “generate an svg” questions altogether. It’s only useful if it also says something about how smart the model will be on other tasks. I have never once actually needed a zero-shot svg.

[-]

BigYoSpeck@reddit

It is an easy way to visualise the quality of output they're capable of. Both in terms of how intricate and well styled their attempt is and how well the elements align with each other

It's a bit like comparing children’s drawings. There's how neat what they do is, which is basically down to their fine motor skills, and then there is how complex and detailed what they do is. My 5 year old for instance isn't anything like as neat as my 9 year old. She just hasn't developed the fine control yet. But she makes way more use of colour and puts more effort into intricate details than the 9 year old does. He's happy to just draw basic shapes for things, but she includes as much detail as she can: eyes have an iris and pupil with eyelashes, clothes have stitching, buttons and pockets, and so on.

So with the models' attempts at these you can see both the accuracy, in terms of how well the different shapes line up (a lot of them have the rear wing floating away from the car), and then just how intricately they attempt to create the concept in the first place. The more capable models do seem to have more complicated concepts they try to recreate.

[-]

llmentry@reddit

It was never a good test of anything. The originator of the "test" has never bothered to correlate SVG complexity with model density, or with any other benchmarks or suite of benchmarks. Nor, for that matter, can they even say what a good representation of a pelican on a bicycle actually is, other than an entirely subjective like/dislike rating.

I think we should start calling this type of lazy, pointless, uninformative pseudo-test, "vibe-benching". All benchmarks are wrong, but some are useful. But this is not the latter.
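For what it's worth, putting a number on "SVG complexity" would not be hard. Here is a rough sketch in Python, assuming that element count and number of distinct tags are an acceptable crude proxy (that choice is an assumption for illustration, not something the test's originator uses). `svg_complexity` and `SAMPLE` are made-up names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def svg_complexity(svg_text: str) -> dict:
    """Count elements in an SVG string as a crude complexity proxy."""
    root = ET.fromstring(svg_text)
    # Strip the XML namespace, e.g. '{http://www.w3.org/2000/svg}rect' -> 'rect'.
    tags = Counter(el.tag.split("}")[-1] for el in root.iter())
    return {
        "total_elements": sum(tags.values()),
        "distinct_tags": len(tags),
        "by_tag": dict(tags),
    }


# Hypothetical toy input: a box with two wheels.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="60">
  <rect x="10" y="30" width="80" height="15" fill="red"/>
  <circle cx="25" cy="50" r="8"/>
  <circle cx="75" cy="50" r="8"/>
</svg>"""

if __name__ == "__main__":
    print(svg_complexity(SAMPLE))
```

Numbers like these could then be compared against whatever benchmark scores you trust. Whether they correlate with anything is exactly the open question.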

[-]

eli_pizza@reddit

It was only originally intended as a goof. A visual and intentionally silly demonstration, not a proper benchmark.

[-]

Tall-Ad-7742@reddit (OP)

Yeah, SVG doesn't show that much, but it's nice/interesting to see if they can manage it. Gemini did a decent job; I think it's probably the best of all.

[-]

Imaginary-Anywhere23@reddit

Qwen3.5 27B (Qwopus v3). Not bad, but it looks like an ant :-)
https://huggingface.co/YTan2000/Qwopus3.5-27B-v3-Abliterated-TQ3_4S

[-]

Tall-Ad-7742@reddit (OP)

true looks like an ant on a motorcycle

[-]

ambient_temp_xeno@reddit

Gemma 4 31b Q8

[-]

likegamertr@reddit

I can hear Jeremy Clarkson reading this lol

[-]

lfrtsa@reddit

Ngl it's insane that we can run models like that on a consumer GPU. Being able to create arbitrary SVGs like that was considered a "spark of AGI" back in the GPT-4 days.

[-]

MLDataScientist@reddit

True. I wonder if we already have a different type of intelligence that we refuse to accept. An intelligence that works within a limited context and can hallucinate, but is still a non-human intelligence.

[-]

seamonn@reddit

Dolphins?

[-]

lfrtsa@reddit

There's a good chance dolphins are very human-like in intelligence... but they're so damn hard to study. There's strong evidence for their vocalizations approaching human language in complexity, and they appear to be able to communicate arbitrary information. Also, when two different pods of dolphins meet, they use much simpler vocalizations to communicate, which only makes sense if the complexity comes from a learned system that transmits complex information.

[-]

lfrtsa@reddit

I personally consider LLMs to be a form of general intelligence. The "AGI" label generally refers to the capability to fully replace humans in (at least) intellectual jobs, and we are clearly not there yet, but I feel that calling LLMs "narrow AI" is a stretch.

I don't know if you noticed this, but people suddenly stopped calling modern AI "narrow". That term used to be thrown around all the time a few years ago. I feel that pretty much everyone quietly agrees the label doesn't fit anymore. Today's AI is clearly general, but it's still not as capable as humans for performing the vast majority of jobs. That appears to be what is saving most people's human exceptionalism, and as much as I fear the economic impact of AGI under capitalism, maybe it'll help shatter the inflated ego humans tend to have. Perhaps leading people to value nature more, but that might be optimistic.

[-]

solestri@reddit

I think this is my favorite one.

[-]

fairydreaming@reddit

Snake racing, yay!

[-]

Tall-Ad-7742@reddit (OP)

That's funny xD
It's definitely not the best, but at least the wheels are not square like in Qwen's try

[-]

zwcbz@reddit

ChatGPT Pro, extended thinking, took 45 minutes

[-]

krzonkalla@reddit

Mine looked like this, why is mine lobotomized 😭

[-]

Tall-Ad-7742@reddit (OP)

45 minutes is wild but thanks for sharing

[-]

PaMRxR@reddit

2 tries with Qwen3.5-35B-A3B, no amount of prompting can get it to make something coherent :|

[-]

Tall-Ad-7742@reddit (OP)

That's uhhh hmmm... I actually don't know what to say to these results... But thanks for sharing 👍

[-]

PaMRxR@reddit

Qwen3.5-27B Q8 below.

[-]

Tall-Ad-7742@reddit (OP)

Uhm, that looks... cool, I guess. Nah, looks funny tho

[-]

mc_nu1ll@reddit

claude 4.5 opus vs 4.6 opus, both with extended thinking

4.5 Opus

[-]

Tall-Ad-7742@reddit (OP)

Oh, thanks.
Wow, this one looks really solid... very interesting to see.
I think it did a better job in general, but the details from Gemini's try are missing (gradients and stuff).

[-]

mc_nu1ll@reddit

4.6 Opus

[-]

segmond@reddit

I have been doing this for a while with my own SVGs. When I saw the results I realized no one is benchmaxxing on the pelican test. The models are truly marvelous and intelligent. VL models are often better for this, and I think Google's vision strength really shows up well in such tests. They certainly are doing something others are not.

[-]

Tall-Ad-7742@reddit (OP)

True, Gemini did a really solid job. Even if it's not perfect, it's probably the best.

[-]

FinBenton@reddit

I did one on GPT 5.4 and realised it actually animated it :D It doesn't look much like a horse, but it's nice

https://upload.blazeit.club/index.html

[-]

Tall-Ad-7742@reddit (OP)

That's a wild-looking horse driving an F1 car xD but nice to see that it animated it

[-]

Makers7886@reddit

Qwen3.5 122b FP8

[-]

Tall-Ad-7742@reddit (OP)

idk if it's better or worse than Qwen 3.6 Plus

[-]

Less_Sandwich6926@reddit

claude opus

[-]

mc_nu1ll@reddit

Claude 4.5 Opus with extended thinking. Not 4.6

[-]

Tall-Ad-7742@reddit (OP)

Interesting... I bet that the earlier version of Opus (I guess 4.6) would do better, because I heard that Claude recently got lobotomized badly

[-]

Disposable110@reddit

https://www.youtube.com/watch?v=ZHhX44XkH-c

This should be the benchmark: replicate this video in SVG.
It contains all kinds of asinine animation goofery. And it's in Flemish, full of typos.
So it needs to do animation goofery, video recognition, and deal with Flemish full of typos.

[-]

dandmetal@reddit

Omnicode 9B Q4: A horse is some sort of eldritch horror, right?

[-]

fairydreaming@reddit

This car obviously already crashed and poor horse is in pieces.

[-]

thrownawaymane@reddit

Pre Halo era

[-]

Tall-Ad-7742@reddit (OP)

True, that is probably what happened

[-]

fairydreaming@reddit

It still holds an eyeball and one leg with its horse-hands and is smiling while thinking about the happy horsey post-replantation life. But we all know it will race again.

[-]

Tall-Ad-7742@reddit (OP)

What? XD
That's a wild answer, but a good one

[-]

Tall-Ad-7742@reddit (OP)

uhhh where is the horse and where is the car... like where does each one start xD?

[-]

SufficientDamage9483@reddit

I see nothing but profile pictures, especially the qwen one

[-]

Tall-Ad-7742@reddit (OP)

it is done now... my new profile picture the deepseek horse car

[-]

Tall-Ad-7742@reddit (OP)

xD good idea maybe i will finally set a profile picture for my reddit account

[-]

MantisAwakening@reddit

Obviously this is something a lot of models struggle with, but I gotta say it’s simply amazing that any of them can do it at all. Ask ten people you work with to draw a horse in a race car and see what you get.

[-]

Tall-Ad-7742@reddit (OP)

an answer from chatgpt...

nah just joking you are right about that

[-]

ambassadortim@reddit

Why horse and not llama

[-]

Tall-Ad-7742@reddit (OP)

omg that's brilliant... how did I miss that

[-]

jacek2023@reddit

Maybe at least pretend you tried it on the local LLM

[-]

Tall-Ad-7742@reddit (OP)

what do you mean by that?

Some models are hosted on my "own" cloud... more like borrowed, but yeah.
It's just that I can't run every model locally, and in general it's interesting to see how local models like Minimax, GLM, and Kimi can compete against Gemini and Claude

[-]

jacek2023@reddit

You probably have no knowledge about local models at all, you could compare Gemma 26B to Qwen 35B or to Nemotron 30B but for you Kimi is "local".

[-]

Tall-Ad-7742@reddit (OP)

I am sorry if you see it that way... Yes, they are not hosted locally on my home PC; I use a rented cloud. Many of us can't run them locally, but there are people who can

if you have enough money you can

and sorry if i shouldn't do things like that... like spending actual money on renting a cloud provider to post some "shitty" post then ok... if you think so

[-]

a_beautiful_rhind@reddit

That's why this test is so great. You can always pick something else and run it through a series of models. Miku, a gorilla... can't benchmaxx it all.

[-]

Tall-Ad-7742@reddit (OP)

thats true

[-]

666666thats6sixes@reddit

looks like Qwen 3.6 Plus has some Canadian influence

[-]

Tall-Ad-7742@reddit (OP)

xD

[-]

Ok_Technology_5962@reddit

So Gemini 3.1 still solved it... I use PS4 controller tests, and usually they explode on that one.

[-]

Tall-Ad-7742@reddit (OP)

Yeah, Gemini got a very good result.
Not perfect, but I think it's the best of all of them

[-]

magnus-m@reddit

ChatGPT with extended thinking (Plus plan)
https://chatgpt.com/share/69df5d9a-5ec4-832e-acf2-aba30646aa30

[-]

Tall-Ad-7742@reddit (OP)

nice looks pretty good

[-]

Admirable-Cell-2658@reddit

DeepSeek Expert is the winner!

[-]

Tall-Ad-7742@reddit (OP)

Yeah, that's funny what DeepSeek did there

[-]

ResidentPositive4122@reddit

they are getting kinda benchmaxxed

That term has become so overloaded it has lost all meaning.

The idea behind Simon's test is that you can always change what you ask for, so it can't be trained for. Ask for something doing something on top of something. Or whatever you want. You can't benchmaxx for this. Or at least the end result will be a general model that can output SVG of random stuff, which is what you want anyway.
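A minimal sketch of that idea in Python, in case anyone wants to roll fresh prompts instead of reusing the pelican. The word lists and the `make_prompt` helper are made up for illustration and are not part of the original test; swap in whatever you like.

```python
import random

# Hypothetical word lists, purely for illustration.
SUBJECTS = ["pelican", "horse", "llama", "gorilla"]
ACTIONS = ["riding", "sitting in", "standing on top of"]
OBJECTS = ["bicycle", "race car", "skateboard", "rowing boat"]


def make_prompt(rng: random.Random) -> str:
    """Build one 'something doing something on top of something' prompt."""
    subject = rng.choice(SUBJECTS)
    action = rng.choice(ACTIONS)
    obj = rng.choice(OBJECTS)
    return f"Generate an SVG of a {subject} {action} a {obj}"


if __name__ == "__main__":
    # Pass a seed, e.g. random.Random(42), if you want repeatable prompts.
    rng = random.Random()
    for _ in range(3):
        print(make_prompt(rng))
```

Even these tiny lists give 48 combinations, and every extra word multiplies that, so memorising specific answers stops paying off quickly.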

[-]

Tall-Ad-7742@reddit (OP)

I haven't said anything about Gemini being bad? It's actually pretty good. Or who did you mean by the second part?

[-]

ResidentPositive4122@reddit

Someone replied with that message and then deleted their comment.

[-]

Tall-Ad-7742@reddit (OP)

oh ok
thank you

[-]

Thomas-Lore@reddit

What are you on about? Gemini did awfully in this test.

[-]

Tall-Ad-7742@reddit (OP)

Oh sorry, I know many use the term benchmaxxed, and yes, I know many of you probably change the prompt however you want. It's just that I see this pelican test all the time, so I thought I'd post something about it

[-]

AlternativeApart6340@reddit

GPT 5.4 Pro does extremely well in my tests

[-]

Tall-Ad-7742@reddit (OP)

Can you share it with me? I would be interested, but I don't have access to GPT 5.4; that's why I didn't include it

[-]

unculturedperl@reddit

Kimi being a Bottas to Ferrari stan was not on my F1 bingo card this year. But where would Leclerc end up in that case?

[-]

Tall-Ad-7742@reddit (OP)

there is a sidecar attached for Charles don’t worry xD

[-]

Remarkable-Avocado@reddit

So goofy! Love it!

[-]

Tall-Ad-7742@reddit (OP)

Yeah, especially Claude, MiniMax and Qwen are looking funny