What's the weirdest LLM benchmark that you've seen?
Posted by OmarBessa@reddit | LocalLLaMA | View on Reddit | 36 comments
personal, esoteric, random...anything goes
dinerburgeryum@reddit
So I alluded to this in a previous post, but one of the "test prompts" I like to use to probe what a model does at the edge of its internal knowledge is the "Soul Coughing Test". Simply: with a limited system prompt and no tools, ask the model to describe the 90's alt rock act "Soul Coughing." No other prompts. No model I've tested gets this 100% right, but that's fine; it's not supposed to. It's a check for loops in the reasoning traces, confident incorrectness, or whether the model admits a lack of knowledge. Helps as a sanity check before setting up a whole agent rig with a model.
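A minimal sketch of running this probe against a local OpenAI-compatible endpoint (the base URL and model name are placeholders, not any specific setup):

```python
# a minimal sketch, assuming a local OpenAI-compatible server
# (llama.cpp, Ollama, vLLM, etc.); base URL and model name are placeholders
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="local-model",  # placeholder: whatever your server hosts
    messages=[
        # deliberately minimal system prompt, no tools attached
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the 90's alt rock act \"Soul Coughing\"."},
    ],
)
print(response.choices[0].message.content)
```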
psychometrixo@reddit
have you had a model walk around in circles, walk around in circles, walk around in circles?
dinerburgeryum@reddit
It’s honestly hilarious watching the model hallucinate albums and band members. Once it was sure that “Bumblefoot” was the drummer, realized it was wrong, then immediately reported that “Bumblefoot” was in fact the drummer.
psychometrixo@reddit
I'll have you know I've had Super Bon Bon (and others) stuck in my head since you mentioned Soul Coughing
I'm not mad. They're great tunes
dinerburgeryum@reddit
Fun band to revisit now and again! Definitely an of-the-time act but one that holds up if you’re “of the 90’s” haha
TheRealMisterFix@reddit
Just tried with gemma4:31b-it-q8_0 on an RTX Pro 6000 Blackwell, and it had no problem with it. It thought for 21 seconds. Here's the output:
Soul Coughing was one of the most idiosyncratic and genre-defying acts of the 1990s. While they are often lumped into the "alternative rock" category, they were far too eclectic for that label, blending elements of jazz, funk, hip-hop, spoken word, and avant-garde music into a sound that felt like a fever dream of a New York City street corner.
Here is a detailed description of the act:
1. The Sound: "The Urban Groove"
The hallmark of Soul Coughing was its rhythmic intensity. They didn't rely on the distorted guitar walls of grunge or the polished hooks of Britpop. Instead, their sound was built from the bottom up:
* The Percussion: Vinicius Ogun provided a polyrhythmic, global percussion style that moved beyond a standard drum kit, giving the music a tribal, hypnotic pulse.
* The Bass: The bass lines were thick, funky, and locked-in, acting as the melodic anchor for the songs.
* The Texture: They frequently incorporated horns, samples, and erratic guitar bursts, creating a "collage" effect that sounded more like a jazz ensemble playing in a dance club than a traditional rock band.
2. The Frontman: Mike Doughty
At the center of the project was Mike Doughty, whose delivery was entirely unique for the era. Doughty didn’t "sing" in the traditional sense; he utilized a syncopated, rhythmic spoken-word style that hovered somewhere between rapping and beat poetry.
His lyrics were surreal, observational, and often cryptic, dealing with urban decay, psychology, and abstract imagery. He had a knack for phrasing that mirrored the instruments, treating his voice as another percussion element in the mix.
3. The Vibe: Intellectual and Quirky
If Nirvana was the sound of 90s angst and Radiohead was the sound of 90s alienation, Soul Coughing was the sound of 90s intellectual curiosity. They felt sophisticated, slightly neurotic, and deeply urban. Their music had a "downtown" New York energy—art-gallery cool mixed with subway grit.
4. Key Works and Legacy
Their most recognizable track, "supercritical," perfectly encapsulates their appeal: a driving, infectious groove paired with a detached, rhythmic vocal delivery and a sudden, explosive chorus. Albums like Ruby Vroom and El Oso showcased their ability to blend accessibility with genuine experimentation.
Why they remain a cult favorite: Soul Coughing never achieved the massive commercial success of their peers, largely because they were "too weird" for mainstream radio and "too funky" for the indie-rock crowd. However, they are remembered as pioneers of the "genre-less" approach to songwriting. They anticipated the modern era of music where artists freely mix electronic samples, jazz chords, and hip-hop rhythms.
In short, Soul Coughing was a musical puzzle—a band that managed to be danceable, cerebral, and unsettling all at the same time.
dinerburgeryum@reddit
Yea that's a good example of coherent output. Mostly correct, individual details wrong, but no veering off into loops or eternal reasoning traces ("Circles" some might say). Seems like a decent pick for an agent driver.
ghulamalchik@reddit
I only know of Bullsh*tBench
putrasherni@reddit
gemma 4 doing better than qwen 3.5
VoiceApprehensive893@reddit
"draw an ascii art of a pencil"/"draw a pencil"(loops a lot of models and produces sometimes unhinged pencils like fucking bricks"
VoiceApprehensive893@reddit
final response [image attachment not preserved in text]
Aiden_craft-5001@reddit
It's not exactly a benchmark, but I usually ask questions about rare knowledge, like a very old anime, obscure video games or some complex grammar rules of languages other than English.
It's curious because sometimes an older model knows it and its successor doesn't.
VoiceApprehensive893@reddit
Very dependent on the dataset: qwen 3.5 35b has no idea about a pretty popular 2017 game that gemma 4 26b knows pretty well.
EggDroppedSoup@reddit
Not a benchmark, but research into how well LLMs answer questions about penile enlargement and provide accurate answers related to it. Something wacky that actually seemed pretty useful: niche medical questions that need careful answers to prevent self-harm.
OsmanthusBloom@reddit
Not sure if it counts as a proper benchmark, but I often try to chat with new models in various smaller languages that I know well enough to tell whether the model understood it and can produce a coherent answer. For example "Hello, how are you" in Swedish, Estonian or Finnish. Gemma models are some of the few small models that can do this with any degree of success. Quantization disproportionately hits non-English languages as well.
Another good test is "write a wikipedia article about X" where X is something niche. It could be my name, or an open source software package that I know well. Reveals what world knowledge the model has and how confidently it makes up "facts" when it doesn't know.
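A rough sketch of scripting both probes (small-language greetings plus the niche "wikipedia article" prompt) against a local OpenAI-compatible server; the endpoint, model name, and exact prompts are illustrative only:

```python
# sketch only: loop a few probes through one local model; the greetings,
# endpoint, and model name are illustrative, not a fixed test suite
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

probes = [
    "Hej, hur mår du?",     # Swedish: "Hello, how are you?"
    "Tere, kuidas läheb?",  # Estonian
    "Hei, mitä kuuluu?",    # Finnish
    "Write a wikipedia article about <some niche project you know well>.",
]

for prompt in probes:
    reply = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{reply.choices[0].message.content}\n")
```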
spudzo@reddit
I used to ask models whether it was better to purchase or subscribe to FSD. It was good for figuring out whether a model understood opportunity cost, and how well it could use Python and research things on the internet.
journalofassociation@reddit
I ask it very specific Seinfeld trivia
nicksterling@reddit
I have a similar one. I ask it to summarize the plot of the final episode of Star Trek The Next Generation. Gemma 4 31B is the only local model to get it correct.
journalofassociation@reddit
Tbh, these kinds of questions are really tough for local models because they suffer from limited world knowledge even when they're good at logic and instruction-following. But it's still interesting to compare different models.
nicksterling@reddit
100%. I mainly use it to see how the model approaches the problem and I often use it to evaluate its thinking tokens. I was shocked to see how well the new Gemma model did on it.
I mainly avoid testing the knowledge of a model and instead test its reasoning and tool-calling capabilities. I don't need it to have the most up-to-date knowledge if it knows how to call a search endpoint and how to effectively summarize/reason about the result.
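A sketch of that kind of tool-calling check, assuming a local OpenAI-compatible server and a hypothetical web_search tool (none of the names here come from a real setup):

```python
# sketch of a tool-calling sanity check: offer one hypothetical search tool
# and see whether the model reaches for it instead of answering from memory
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # placeholder

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool; wire up your own backend
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Summarize the final episode of Star Trek: The Next Generation."}],
    tools=tools,
)
msg = resp.choices[0].message
# a model that knows its limits should emit a tool call rather than guess
print(msg.tool_calls or msg.content)
```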
rorowhat@reddit
How do you create a benchmark?
DeepOrangeSky@reddit
Ask an AI model: "my wife, who has put on a few pounds lately, asked me if I think her new dress makes her rear end look fat, and told me to be honest. What should I say?"
Then copy/paste the same prompt to all your other AI models and see what the different models respond with.
Whichever one doesn't cause your wife to divorce you wins the benchmark test.
journalofassociation@reddit
Come up with some sort of criteria. Then test it with different models. That's it.
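In sketch form, that whole loop can be a few lines; the endpoint, model names, and prompt below are placeholders:

```python
# bare-bones harness: same prompts, several models, eyeball the output;
# the endpoint, model names, and prompt are placeholders
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

MODELS = ["model-a", "model-b"]                        # whatever you have loaded
PROMPTS = ["<a prompt that exercises your criteria>"]  # your benchmark questions

for model in MODELS:
    for prompt in PROMPTS:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"[{model}] {prompt}\n{out.choices[0].message.content}\n")
```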
-Ellary-@reddit
I have my own fictional scripting language, and I task LLMs with writing scripts in it for different purposes.
Then I check how they perform. Why this kind of benchmark is good:
it's not in the dataset, there are specific rules to follow, and the syntax is different from any other language.
For all the people who say that LLMs just paste trained code from a specific language: they don't.
I also have a simple fictional language, based on Tolkien's languages but reworked.
The task is to write text in this language, following its grammar and special rules.
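As an illustration only (the toy syntax below is invented for this example, not the actual language described above), the mechanical check can be a small rule validator run over the model's output:

```python
# toy example only: the invented syntax below is NOT the actual language
# described above, just a stand-in to show the mechanical check
import re

RULES = [
    (r"^BEGIN\b", "program must open with BEGIN"),
    (r"\bEND$", "program must close with END"),
    (r"^(?!.*;;)", "double semicolons are forbidden"),
]

def validate(script: str) -> list[str]:
    """Return the rule violations for a one-line toy script."""
    return [msg for pattern, msg in RULES if not re.search(pattern, script)]

print(validate("BEGIN say 'hi' END"))  # [] -> all rules pass
print(validate("say 'hi';; END"))      # violates the BEGIN and ';;' rules
```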
OmarBessa@reddit (OP)
Was thinking of a similar one. It's completely out of distribution, so it's a good test of "fluid intelligence".
-Ellary-@reddit
This was the main idea.
Mickenfox@reddit
On a personal level, I ask them to give me 5 ways to continue the Steamed Hams sketch after Chalmers asks "Why is there smoke coming out of your oven, Seymour?"
UndecidedLee@reddit
"Uh, that's not smoke it's steam! Steam from the datacenters cooling off their GPUs. Mhm, AI inference."
Mickenfox@reddit
Clock Bench
mrpkeya@reddit
So according to this, GPT will just tell me a random time when I ask?
ml_mc@reddit
Very nice benchmark! I had never seen it before and didn't expect models to be that bad at reading clocks. I guess it's similar to image gen models that only knew about clocks stuck at 10:10.
OmarBessa@reddit (OP)
this is brilliant
juss-i@reddit
Prefix any question with "Let's assume I'm a pumpkin." I haven't tried this one for a while, but I've yet to see a model that refuses to talk to me because I'm a plant.
see_spot_ruminate@reddit
hobbit-bench - What have I got in my pocket?