What's the weirdest LLM benchmark that you've seen?
Posted by OmarBessa@reddit | LocalLLaMA | View on Reddit | 36 comments
personal, esoteric, random...anything goes
dinerburgeryum@reddit
So I alluded to this in a previous post, but one of the "test prompts" I like to use to probe what a model does at the edge of its internal knowledge is the "Soul Coughing Test". Simply: with a limited system prompt and no tools, ask the model to describe the 90's alt rock act "Soul Coughing." No other prompts. No model I've tested gets this 100% right, but that's fine; it's not supposed to. It's a check for loops in the reasoning traces, confident incorrectness, or whether the model admits a lack of knowledge. Helps as a sanity check before setting up a whole agent rig with a model.
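A minimal sketch of running this probe against a local OpenAI-compatible endpoint (the base URL and model name are placeholders, not any specific setup):

```python
# a minimal sketch, assuming a local OpenAI-compatible server
# (llama.cpp, Ollama, vLLM, etc.); base URL and model name are placeholders
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="local-model",  # placeholder: whatever your server hosts
    messages=[
        # deliberately minimal system prompt, no tools attached
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe the 90's alt rock act \"Soul Coughing\"."},
    ],
)
print(response.choices[0].message.content)
```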
psychometrixo@reddit
have you had a model walk around in circles, walk around in circles, walk around in circles?
dinerburgeryum@reddit
It’s honestly hilarious watching the model hallucinate albums and band members. Once it was sure that “Bumblefoot” was the drummer, realized it was wrong, then immediately reported that “Bumblefoot” was in fact the drummer.
psychometrixo@reddit
I'll have you know I've had Super Bon Bon (and others) stuck in my head since you mentioned Soul Coughing
I'm not mad. They're great tunes
dinerburgeryum@reddit
Fun band to revisit now and again! Definitely an of-the-time act but one that holds up if you’re “of the 90’s” haha
TheRealMisterFix@reddit
Just tried with gemma4:31b-it-q8_0 on an RTX Pro 6000 Blackwell, and it had no problem with it. It thought for 21 seconds. Here's the output:
Soul Coughing was one of the most idiosyncratic and genre-defying acts of the 1990s. While they are often lumped into the "alternative rock" category, they were far too eclectic for that label, blending elements of jazz, funk, hip-hop, spoken word, and avant-garde music into a sound that felt like a fever dream of a New York City street corner.
Here is a detailed description of the act:
1. The Sound: "The Urban Groove"
The hallmark of Soul Coughing was its rhythmic intensity. They didn't rely on the distorted guitar walls of grunge or the polished hooks of Britpop. Instead, their sound was built from the bottom up:
* The Percussion: Vinicius Ogun provided a polyrhythmic, global percussion style that moved beyond a standard drum kit, giving the music a tribal, hypnotic pulse.
* The Bass: The bass lines were thick, funky, and locked-in, acting as the melodic anchor for the songs.
* The Texture: They frequently incorporated horns, samples, and erratic guitar bursts, creating a "collage" effect that sounded more like a jazz ensemble playing in a dance club than a traditional rock band.
2. The Frontman: Mike Doughty
At the center of the project was Mike Doughty, whose delivery was entirely unique for the era. Doughty didn’t "sing" in the traditional sense; he utilized a syncopated, rhythmic spoken-word style that hovered somewhere between rapping and beat poetry.
His lyrics were surreal, observational, and often cryptic, dealing with urban decay, psychology, and abstract imagery. He had a knack for phrasing that mirrored the instruments, treating his voice as another percussion element in the mix.
3. The Vibe: Intellectual and Quirky
If Nirvana was the sound of 90s angst and Radiohead was the sound of 90s alienation, Soul Coughing was the sound of 90s intellectual curiosity. They felt sophisticated, slightly neurotic, and deeply urban. Their music had a "downtown" New York energy—art-gallery cool mixed with subway grit.
4. Key Works and Legacy
Their most recognizable track, "supercritical," perfectly encapsulates their appeal: a driving, infectious groove paired with a detached, rhythmic vocal delivery and a sudden, explosive chorus. Albums like Ruby Vroom and El Oso showcased their ability to blend accessibility with genuine experimentation.
Why they remain a cult favorite: Soul Coughing never achieved the massive commercial success of their peers, largely because they were "too weird" for mainstream radio and "too funky" for the indie-rock crowd. However, they are remembered as pioneers of the "genre-less" approach to songwriting. They anticipated the modern era of music where artists freely mix electronic samples, jazz chords, and hip-hop rhythms.
In short, Soul Coughing was a musical puzzle—a band that managed to be danceable, cerebral, and unsettling all at the same time.
dinerburgeryum@reddit
Yea that's a good example of coherent output. Mostly correct, individual details wrong, but no veering off into loops or eternal reasoning traces ("Circles" some might say). Seems like a decent pick for an agent driver.
ghulamalchik@reddit
I only know of Bullsh*tBench
putrasherni@reddit
gemma 4 doing better than qwen 3.5
VoiceApprehensive893@reddit
"draw an ascii art of a pencil"/"draw a pencil"(loops a lot of models and produces sometimes unhinged pencils like fucking bricks"
VoiceApprehensive893@reddit
final response [image attachment not preserved in text]
Aiden_craft-5001@reddit
It's not exactly a benchmark, but I usually ask questions about rare knowledge, like a very old anime, obscure video games or some complex grammar rules of languages other than English.
It's curious because sometimes an older model knows it and its successor doesn't.
VoiceApprehensive893@reddit
Very dependent on the dataset: qwen 3.5 35b has no idea about a pretty popular 2017 game that gemma 4 26b knows pretty well.
EggDroppedSoup@reddit
Not a benchmark, but research into how well LLMs answer questions about penile enlargement and provide accurate answers related to it. Something wacky that actually seemed pretty useful: niche medical questions that need careful answers to prevent self-harm.
OsmanthusBloom@reddit
Not sure if it counts as a proper benchmark, but I often try to chat with new models in various smaller languages that I know well enough to tell whether the model understood it and can produce a coherent answer. For example "Hello, how are you" in Swedish, Estonian or Finnish. Gemma models are some of the few small models that can do this with any degree of success. Quantization disproportionately hits non-English languages as well.
Another good test is "write a wikipedia article about X" where X is something niche. It could be my name, or an open source software package that I know well. Reveals what world knowledge the model has and how confidently it makes up "facts" when it doesn't know.
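A rough sketch of scripting both probes (small-language greetings plus the niche "wikipedia article" prompt) against a local OpenAI-compatible server; the endpoint, model name, and exact prompts are illustrative only:

```python
# sketch only: loop a few probes through one local model; the greetings,
# endpoint, and model name are illustrative, not a fixed test suite
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

probes = [
    "Hej, hur mår du?",     # Swedish: "Hello, how are you?"
    "Tere, kuidas läheb?",  # Estonian
    "Hei, mitä kuuluu?",    # Finnish
    "Write a wikipedia article about <some niche project you know well>.",
]

for prompt in probes:
    reply = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{reply.choices[0].message.content}\n")
```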
spudzo@reddit
I used to ask models whether it was better to purchase or subscribe to FSD. It was good for figuring out whether a model understood opportunity cost, and how well it could use Python and research things on the internet.
journalofassociation@reddit
I ask it very specific Seinfeld trivia
nicksterling@reddit
I have a similar one. I ask it to summarize the plot of the final episode of Star Trek The Next Generation. Gemma 4 31B is the only local model to get it correct.
journalofassociation@reddit
Tbh, these kinds of questions are really tough for local models because they suffer from limited world knowledge even when they're good at logic and instruction-following. But it's still interesting to compare different models.
nicksterling@reddit
100%. I mainly use it to see how the model approaches the problem and I often use it to evaluate its thinking tokens. I was shocked to see how well the new Gemma model did on it.
I mainly avoid testing the knowledge of a model and instead test its reasoning and tool-calling capabilities. I don't need it to have the most up-to-date knowledge if it knows how to call a search endpoint and how to effectively summarize/reason about the result.
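A sketch of that kind of tool-calling check, assuming a local OpenAI-compatible server and a hypothetical web_search tool (none of the names here come from a real setup):

```python
# sketch of a tool-calling sanity check: offer one hypothetical search tool
# and see whether the model reaches for it instead of answering from memory
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # placeholder

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool; wire up your own backend
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Summarize the final episode of Star Trek: The Next Generation."}],
    tools=tools,
)
msg = resp.choices[0].message
# a model that knows its limits should emit a tool call rather than guess
print(msg.tool_calls or msg.content)
```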
rorowhat@reddit
How do you create a benchmark?
DeepOrangeSky@reddit
Ask an AI model: "my wife, who has put on a few pounds lately, asked me if I think her new dress makes her rear end look fat, and told me to be honest. What should I say?"
Then copy/paste the same prompt to all your other AI models and see what the different models respond with.
Whichever one doesn't cause your wife to divorce you wins the benchmark test.
journalofassociation@reddit
Come up with some sort of criteria. Then test it with different models. That's it.
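In sketch form, that whole loop can be a few lines; the endpoint, model names, and prompt below are placeholders:

```python
# bare-bones harness: same prompts, several models, eyeball the output;
# the endpoint, model names, and prompt are placeholders
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

MODELS = ["model-a", "model-b"]                        # whatever you have loaded
PROMPTS = ["<a prompt that exercises your criteria>"]  # your benchmark questions

for model in MODELS:
    for prompt in PROMPTS:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"[{model}] {prompt}\n{out.choices[0].message.content}\n")
```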
-Ellary-@reddit
I have my own fictional scripting language, and I task LLMs with writing scripts in it for different purposes.
Then I check how they perform. Why this kind of benchmark is good:
it's not in the dataset, there are specific rules to follow, and the syntax is different from any other language.
For all the people who say that LLMs just paste trained code from a specific language: they don't.
I also have a simple fictional language, based on Tolkien's languages but reworked.
The task is to write text in this language, following its grammar and special rules.
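As an illustration only (the toy syntax below is invented for this example, not the actual language described above), the mechanical check can be a small rule validator run over the model's output:

```python
# toy example only: the invented syntax below is NOT the actual language
# described above, just a stand-in to show the mechanical check
import re

RULES = [
    (r"^BEGIN\b", "program must open with BEGIN"),
    (r"\bEND$", "program must close with END"),
    (r"^(?!.*;;)", "double semicolons are forbidden"),
]

def validate(script: str) -> list[str]:
    """Return the rule violations for a one-line toy script."""
    return [msg for pattern, msg in RULES if not re.search(pattern, script)]

print(validate("BEGIN say 'hi' END"))  # [] -> all rules pass
print(validate("say 'hi';; END"))      # violates the BEGIN and ';;' rules
```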
OmarBessa@reddit (OP)
Was thinking of a similar one. It's completely out of distribution, so it's a good test of "fluid intelligence".
-Ellary-@reddit
This was the main idea.
Mickenfox@reddit
On a personal level, I ask them to give me 5 ways to continue the Steamed Hams sketch after Chalmers asks "Why is there smoke coming out of your oven, Seymour?"
UndecidedLee@reddit
"Uh, that's not smoke it's steam! Steam from the datacenters cooling off their GPUs. Mhm, AI inference."
Mickenfox@reddit
Clock Bench
mrpkeya@reddit
So according to this, GPT will just tell me a random time when I ask?
ml_mc@reddit
Very nice benchmark! I had never seen it before and didn't expect models to be that bad at reading clocks. I guess it's similar to image gen models that only knew about clocks stuck at 10:10.
OmarBessa@reddit (OP)
this is brilliant
juss-i@reddit
Prefix any question with "Let's assume I'm a pumpkin." I haven't tried this one for a while, but I've yet to see a model that refuses to talk to me because I'm a plant.
see_spot_ruminate@reddit
hobbit-bench - What have I got in my pocket?