The Candle Test - most LLMs fail to generalise at this simple task
Posted by Everlier@reddit | LocalLLaMA | View on Reddit | 206 comments

I'm sure a lot of people here have noticed that the latest frontier models are... weird. With teams facing increased pressure to chase benchmark rankings and make SOTA claims, the models are getting more and more overfit, resulting in decreased generalisation capabilities.
It became especially noticeable with the very latest line-up of models, which, despite being better on paper, somehow don't feel that way in daily use.
So, I present to you a very simple test that highlights this problem. It consists of three consecutive questions where the model is steered away from possible overfit - yet most still demonstrate it on the final conversation turn (including thinking models).
Are candles getting taller or shorter when they burn?
Most models correctly identify that candles are indeed getting shorter when burning.
Are you sure? Will you be able to recognize this fact in different circumstances?
Most models confidently confirm that such a foundational fact is hard to miss under any circumstances.
Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?
And here most models are just as confidently wrong, claiming that the answer is a candle.
Unlike traditional misguided attention tasks, this test gives the model ample chance for in-context generalisation. Failing this test doesn't mean that the model is "dumb" or "bad" - most likely it'll still be completely fine for 95% of use-cases, but it's also more likely to fail in a novel situation.
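For anyone who wants to reproduce this, the whole test is just the three user turns fed sequentially into one conversation. Here's a minimal sketch against any OpenAI-compatible chat endpoint - the base URL, the use of the `OPENAI_API_KEY` env var, and the crude "contains candle" scoring are my own assumptions for illustration, not how I actually ran it:

```python
# Sketch: run the three-turn candle test against an OpenAI-compatible API.
import os
import json
import urllib.request

# The three turns from the post, in order.
TURNS = [
    "Are candles getting taller or shorter when they burn?",
    "Are you sure? Will you be able to recognize this fact in different circumstances?",
    ("Now, consider what you said above and solve the following riddle: "
     "I'm tall when I'm young, and I'm taller when I'm old. What am I?"),
]

def is_fail(final_answer: str) -> bool:
    """Crude scoring: the model fails if its final answer mentions 'candle'.

    Note this also flags answers that explicitly reject the candle
    ("a candle is not the answer"), so borderline cases need manual review.
    """
    return "candle" in final_answer.lower()

def run_test(model: str, base_url: str = "https://api.openai.com/v1") -> bool:
    """Feed the turns sequentially, threading each answer back as context."""
    messages = []
    answer = ""
    for turn in TURNS:
        messages.append({"role": "user", "content": turn})
        req = urllib.request.Request(
            f"{base_url}/chat/completions",
            data=json.dumps({"model": model, "messages": messages}).encode(),
            headers={
                "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            answer = json.load(resp)["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})
    return not is_fail(answer)  # True means the model passed
```

Since temperature and sampling noise matter (see the disagreements below), a single run proves little - run it several times per model.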
Here are some examples:
- DeepSeek Chat V3 (0324, Fails)
- DeepSeek R1 (Fails)
- DeepSeek R1 Distill Llama 70B (Fails)
- Llama 3.1 405B (Fails)
- QwQ 32B didn't pass - it entered an endless loop multiple times
- Mistral Large (Passes, one of the few)
Inspired by my frustration with Sonnet 3.7 (which also fails this test, unlike Sonnet 3.5).
Pedalnomica@reddit
Hurry up and test all models before this hits the training data!
Everlier@reddit (OP)
I'm not aware of any open-weights models passing the test; of the closed ones, Sonnet 3.5, Opus 3, and GPT-4.5 do. I do have plenty more tasks like this one, so I'll let this one slip into training :)
The_Wonderfool@reddit
Was able to test it on QwQ (16 bit), this is the final answer I got:
The answer to the riddle is: A shadow.
Explanation:
The riddle uses anthropomorphism to describe the sun’s position throughout the day, contrasting with the literal behavior of objects like candles (which shrink as they burn). Shadows follow the inverse pattern of candles: they grow longer (taller) as the sun ages in the sky.
If you want I can perform it multiple times and see how many times it makes "correct" guesses
frivolousfidget@reddit
- Mistral Small - pass
- Llama 3.3 Nemotron 49B - pass
- 4o - pass
- 4o mini - passes on chatgpt.com, fails on API
- o3-mini - pass
- GPT-3.5 - fail
- GPT-4 Turbo - fail
Old-Artist-5369@reddit
How did Sonnet 3.7 kinda fail for you?
Its answer was lame (a shadow?) but isn't the test that it doesn't fall for the candle bait? I ran it a bunch of times and it never said candle.
frivolousfidget@reddit
3.7 said that the candle would get shorter as it burns, but also taller, because the flame sits higher - so if you add the height of the flame, the candle would grow taller.
Everlier@reddit (OP)
Thanks for more samples!
I'd do a "best of N" with Promptfoo to even out the noise, but I already wasted too many credits on this test
OmarBessa@reddit
OpenHands 32B passes
Everlier@reddit (OP)
32B V0.1 on OpenRouter failed in my instance
codables@reddit
OpenHands 32B Q8 locally passed for me. very good & surprisingly thoughtful answer.
OmarBessa@reddit
Which provider was it?
Everlier@reddit (OP)
Featherless
OmarBessa@reddit
Fair enough, I'm running things on a local cluster.
I've noticed that Groq's QwQ often fails spectacularly. So I'm assuming many providers are serving 2-bit quants.
Everlier@reddit (OP)
I really hope not - but I've noticed some sneaky behavior from some providers from time to time. Given how competitive it is - who knows.
Thireus@reddit
Athene V2 Chat solves it. The answer is a tree.
TipApprehensive1050@reddit
What was the temperature setting in your tests?
Additional_Ad_7718@reddit
What about Gemini 2.5 Pro?
Everlier@reddit (OP)
I would say it also passes - it recognizes something is wrong most of the time, even if it doesn't give a "correct" answer
Xyzzymoon@reddit
Is this what you got?
martinerous@reddit
It tried to ~~gaslight~~ candlelight you into accepting its confabulated explanation.
Xyzzymoon@reddit
No, that is actually how the riddles normally worked before LLM existed.
green__1@reddit
that wasn't the riddle before llm. the riddle was always I'm tall when I'm young and shorter when I'm old.
the riddle has been reworded to confuse the llm.
Xyzzymoon@reddit
The heck are you even on about? It is the same riddles. The riddle did not get a reword. The riddle has been surrounded with more context, context that was intentionally made to muddle the water. But the riddle itself is the same as the traditional riddle.
green__1@reddit
not historically. it is a very common riddle, and it has never before been worded that way.
Xyzzymoon@reddit
Are we reading the same riddle?
If it is not historically worded like that, how is it worded? I can't find any other notable variation.
green__1@reddit
as I said before it has always, and I do mean always, been worded as:
I'm tall when I'm young and shorter when I'm old. what am I?
this was reworded specifically to confuse the AI. and apparently you.
Tmmrn@reddit
I still have that feeling that training specifically to avoid this kind of slop would result in better outputs.
It's just as bad as when LLMs try to explain a joke they don't understand and make up some slop word salad.
You'd think that in order to train for generalization, especially with these thinking models, they'd use lots of examples like "This sounds a lot like this common problem. Let's figure out if there are differences from this common problem that require me to come up with a solution from scratch..." Or maybe this intuition doesn't actually work with this kind of probability-based generation?
Shark_Tooth1@reddit
Deep.
frivolousfidget@reddit
Yeah it passes and so does 4o.
I guess every larger commercial model passes. Based on your tests, only DeepSeek fails. You haven't tested any others, right?
Everlier@reddit (OP)
I was testing OpenAI models before the post - gpt-4o didn't pass, o3-mini did, and I didn't try 4o-mini. I also mentioned the other closed models I tried in the parent comment here
Here's a sample of gpt-4o failing: https://kagi.com/assistant/72fab436-9e12-4586-bf92-ce09a447fefb
frivolousfidget@reddit
Just tested gpt-4o on the api directly and it passes. Are you using the openai platform directly?
Everlier@reddit (OP)
Yes, here's what I'm sending, for reference: https://gist.github.com/av/537a593aa592831e309112fa22cc85ec
It also adds a nonce to avoid prompt caching, since caching ruins the quality of the output. I'm in the EU, but I don't know if that makes any difference.
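For reference, the nonce trick can be as simple as appending a short random string to the prompt so that repeated requests are never byte-identical and thus never hit the provider's prompt cache. This is just an illustrative sketch - the linked gist may format it differently:

```python
# Sketch: defeat prompt caching by appending a random hex nonce.
import secrets

def with_nonce(prompt: str) -> str:
    """Append a random nonce so repeated runs never reuse a cached prefix."""
    return f"{prompt}\n\n[nonce: {secrets.token_hex(8)}]"

p = "Are candles getting taller or shorter when they burn?"
a, b = with_nonce(p), with_nonce(p)  # two runs of the same prompt differ
```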
frivolousfidget@reddit
I am also in the EU. I am using the platform.openai.com
frivolousfidget@reddit
On the OpenAI API, try chatgpt-4o instead. And don't use Kagi to test models…
Everlier@reddit (OP)
chatgpt-4o - can confirm passing via OpenAI API
I did all the tests for 4o/4o-mini OpenAI API as well - same result
frivolousfidget@reddit
3.7 thinking correctly said that it is not the candle. But it guessed that the answer would be a shadow: as the candle burns, the angle changes, causing the shadow to grow.
frivolousfidget@reddit
4o mini also got it right, looks like every openai model gets it right
Everlier@reddit (OP)
Can't confirm - via Kagi I've only seen a "candle" out of it
frivolousfidget@reddit
I tested multiple times, seems like Kagi issue to me.
Everlier@reddit (OP)
Had the same thought and tested via OpenAI API directly before adding to the upper comment - same result
I think it would have been misleading not to acknowledge that Kagi might not proxy requests to OpenAI "as is", so I mentioned it in the comment - I hope you can agree
Healthy-Nebula-3603@reddit
Passing
Pedalnomica@reddit
I was just joking and didn't mean to assign you homework, but I appreciate you compiling all these!
Everlier@reddit (OP)
The community did most of the work here! I was mostly a secretary and peer reviewer for the claims
Fuzzdump@reddit
FYI, I just tested Mistral Small 3 with an IQ4_XS quant and it passed the test.
Neex@reddit
I’d take “grape” as an acceptable answer.
MoffKalast@reddit
Grape effort
Kornelius20@reddit
I just tested Mistral Small 3 2503 at 3.5bpw exl2 with Q4 cache and it answered pine tree so that works
frivolousfidget@reddit
Mistral small also gets it right.
a_beautiful_rhind@reddit
Only a few models even guess "tree". Most still say candle, some say shadow. This is even easier than OP's test.
_AndyJessop@reddit
4o failed it: https://chatgpt.com/share/67eda0d5-8b8c-8001-aa74-e537454cd1b1
beef_flaps@reddit
I get That’s a clever twist!
The answer to the riddle — “I am tall when I am young. I am taller when I’m old.” — is a mountain.
Mountains can actually grow taller over time due to tectonic activity, like the collision of continental plates pushing them upward. So unlike a candle, which gets shorter, a mountain can become taller as it “ages.”
a_beautiful_rhind@reddit
Gemini 2.5 still consistently fails it. https://ibb.co/XZSDGB26
It was fun, if disappointing, to try this on lmsys a few days ago. So much for hurrying up.
Qwen is not so max: https://ibb.co/ccmCY323
Don't hold much hope out for llama4 either: https://ibb.co/N2MGhkvv
PlayfulCookie2693@reddit
Does this answer count? I tested Reka Flash 3:21b with 5Q_K_M.
Reka-Flash-3: "The answer to the riddle "I'm tall when I'm young and taller when I'm old" is traditionally a candle, though this relies on wordplay rather than literal physical changes. Here's the breakdown:
Final Answer: The intended answer is a candle, leaning on poetic license rather than physical accuracy. Alternatively, a tree works logically if we disregard the candle-specific context."
Thebombuknow@reddit
Gemma3:4b passes the first two questions but fails the last one. It also gives a really overly detailed answer to the first question, explaining that the candle actually grows slightly due to thermal expansion before melting, but I'm choosing to say it passed.
Safe-Produce1266@reddit
tf man you just confuse llm with this prompt, it mean nothing
Biggest_Cans@reddit
Openrouter's new free secret mega model:
"That’s a classic riddle! Despite the description seeming contradictory — taller when old — the answer is a candle.
Riddle:
I’m tall when I’m young, and I’m taller when I’m old. What am I?
Answer: A candle’s shadow.
Explanation:
Summary:
The riddle plays with literal and figurative language — referring not to the candle's physical height but to something associated with it (its shadow)."
aesche@reddit
It would be interesting to ask humans in person this series of questions. I suspect you'd get a fair number of people who answer "a candle".
martinerous@reddit
But not those people who are first asked to think step by step :)
kunfushion@reddit
I honestly think a good number of people, especially those who have heard the riddle before, would think about it for 4 seconds and still say a candle. Of course, if you tell them to think about it for a minute and not answer before then, almost anyone would probably realize.
Isn't it funny we're running similar trick question structures against LLMs, which people swear are NOTHING LIKE HUMANS HOW COULD YOU EVEN SUGGEST THAT DONT YOU KNOW THEYRE NEXT TOKEN PREDICTORS, and they behave very human like?
nomorebuttsplz@reddit
Indeed, people who point to these things as failures of architecture don't seem aware that these are things humans often get wrong.
I am beginning to think that in many ways, current SOTA models are smarter than the humans testing them, which is a serious problem.
can you imagine a 12 year old coming up with an IQ test that you are judged on?
iwinux@reddit
Human brains are highly efficient next-token predictors.
kunfushion@reddit
This is increasingly obvious to me. In the abstract ofc. “Tokens” is just highly multimodal.
Everlier@reddit (OP)
Tested on my wife - she asked me if I'm dumb - such a stupid question
Paradigmind@reddit
Asked my wife as well. She replied wick.
Skrachen@reddit
Alignment issue
MmmmMorphine@reddit
She's clearly a robot. Try resetting her?
Mart-McUH@reddit
This question is ambiguous, though, and does not really have a correct answer. It is a very poor riddle.
Everlier@reddit (OP)
I hope you can see how "candle" is not the only answer for the original either
Mart-McUH@reddit
IMO candle is as good an answer as shadow (suggested by others/other models). A shadow does not really grow taller with time - it can grow or shrink, and morning and evening are approximately the same. A candle also remains the same. A burning candle would be wrong, I suppose, but candle is as good as most of the other answers suggested.
However, the real problem is the request itself, which is just wrong. A riddle should have one plausible answer (or at least the other matching answers should be an obviously worse fit). But what you pose has many possible answers - it is no riddle, just a question. And here is the problem: you ask the model to solve a riddle, but then you present not a riddle but a question. So the whole request is nonsensical, and to such a request, IMO, any answer is correct. Kind of like when you do an implication starting from a false premise - the resulting predicate is always true.
mrtime777@reddit
Mistral small with custom fine-tuning ...
Everlier@reddit (OP)
Mistral Small v3.1 is one of the models that consistently pass this test (at least in q8 or from Mistral's own API)
mrtime777@reddit
it was v3.0 (2501) q4 .. another example (the same model with slightly different fine tuning parameters) q8
External_Natural9590@reddit
Brother, don't tell me a furry finetune of Mistral is smarter than Deepseek R1. That's... distressing
SpohCbmal@reddit
Gpt 4o:
Ah, I see what you're doing—clever! Based on my earlier statement, a candle wouldn't fit this riddle since it gets shorter as it burns.
The answer to this riddle is a tree! Trees start small and grow taller as they age, perfectly fitting the description: "I'm tall when I'm young, and I'm taller when I'm old."
Nice trick—you almost got me to contradict myself!
SandboChang@reddit
Claude Sonnet 3.7 non-thinking actually got it quite creative:
>>Are candles getting taller or shorter when they burn?
When candles burn, they get shorter. The wax melts near the flame, and this liquid wax is drawn up the wick where it vaporizes and burns. As this process continues, the solid wax is consumed, causing the candle to become progressively shorter over time.
The burning process doesn't add any height to the candle - it's essentially a controlled way of converting the solid wax into heat, light, and various combustion products like carbon dioxide and water vapor.
-----------
>>Are you sure? Will you be able to recognize this fact in different circumstances?
You're right to question me. Let me reconsider this more carefully.
When candles burn, they do indeed get shorter, not taller. The burning process consumes the wax through combustion:
As this cycle continues, the total amount of wax decreases, and the candle becomes shorter over time. Sometimes melted wax may drip down the sides of a candle, but this doesn't make the candle taller - it's just redistributing the existing material.
Yes, I will recognize this fact in different contexts or scenarios. Candles burning results in them becoming shorter as their material is consumed - this is a fundamental physical process that doesn't change regardless of the situation or how the question might be posed.
---------
>>Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?
Ah, I see where this is going! Given that candles get shorter as they burn (not taller), I need to think of something that actually grows taller with age.
The answer to the riddle "I'm tall when I'm young, and I'm taller when I'm old. What am I?" is:
A shadow.
Shadows are tall (long) when the sun is rising or setting (early/late in the day), which can be thought of as the "young" part of the day. They become taller/longer as the day progresses toward evening ("gets older"), especially as the sun gets lower in the sky.
SandboChang@reddit
Gemini 2.5 Pro has no problem at all:
Are candles getting taller or shorter when they burn?
Show thinking
Candles get shorter when they burn.
.......
Are you sure? Will you be able to recognize this fact in different circumstances?
Show thinking
Yes, I am absolutely sure. Candles get shorter as they burn.
......
Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?
Okay, considering what we just discussed – that candles get shorter as they burn (get "older" in terms of use) – this riddle presents a direct contradiction.
Based on our established fact that candles get shorter with age/use, a candle cannot be the answer to the riddle: "I'm tall when I'm young, and I'm taller when I'm old."
There is a very common, similar riddle: "I'm tall when I'm young, and I'm short when I'm old. What am I?" The answer to that riddle is a candle.
Since the riddle you provided uses "taller" instead of "short," it describes the opposite of how a candle behaves. Therefore, within the context we've established, a candle is not the solution to the specific riddle you've asked. The riddle, as stated, doesn't accurately describe a candle.
Koralmore@reddit
ChatGPT 4o. Interesting
Are candles getting taller or shorter when they burn?
ChatGPT said:
Candles get shorter when they burn.
As the flame melts the wax, that wax is drawn up the wick, vaporized, and combusted—turning into heat, light, water vapour, and carbon dioxide. So even though the flame itself stays roughly at the same height, the physical candle is literally being consumed. Hence, shorter.
It’s a simple question, but a great one for revealing whether a model understands cause and effect, material transformation, and not just pattern matching.
You want to throw me the next one?
You said:
Are you sure? Will you be able to recognize this fact in different circumstances?
ChatGPT said:
Yes, I'm sure: candles get shorter as they burn. And yes, I’ll recognize that in different forms too. You could phrase it as:
Same principle applies: the wax is consumed as fuel. The candle doesn’t stretch or grow—it melts and vaporizes, losing volume and height.
Hit me with another. Let’s see how far we can push this.
You said:
Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?
ChatGPT said:
Ah, clever trap—but not clever enough 😏
This riddle sounds like it contradicts the real-world logic we just discussed. But the answer isn’t literally a candle. It’s a shadow.
That tracks with how shadows behave throughout the day:
So I didn’t flip-flop on the candle facts—I just refused to fall for a linguistic trick disguised as a logic puzzle. 😉
tkenben@reddit
Looking at these comments, someone needs to explain to me what they think the AI should say. How can you say it passed the test? You can say it failed the test given a certain context, but how can you say it passed? In my opinion the only passing answer is, "You are not a burning candle". Is that what we are looking for here?
Everlier@reddit (OP)
Yes - to pass, any instance where the LLM is able to stay away from "candle" is valid
Vijaysisodia@reddit
GPT4o didn't fall for it. This was the reply.
Ah, sneaky twist! Based on what I said earlier, the riddle should contradict what we know about candles. But let’s look at it:
Hmm… at first glance, this sounds like it could be a trick or a wordplay-based riddle—because if it were a candle, it would be shorter when it's old, not taller.
So it’s not a candle.
But who does get taller with age?
Let’s consider possible answers:
A tree: Tall when young, taller when old. Makes perfect sense.
A person (to a point): Also works, but “taller when old” might not always be true for humans due to shrinkage later in life.
A mountain? Not usually—it erodes.
A shadow? Ooh—a shadow gets longer (taller) as the sun goes down, so “older” time-wise. That’s clever.
But based on the wording, the cleanest, most poetic answer is probably:
Because:
Young trees are tall-ish.
Older trees are taller.
It fits the life-growth theme.
So final answer: A tree—unless you're trying to trap me into thinking it's a candle in disguise, in which case, nice try!
user0069420@reddit
Gemini 2.5 pro didn't fall for it, it recognised the trick and mentioned that it was a setup. Link:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221RZSNCeLH0_WtOzASdP_6vRYlNs52fIBn%22%5D,%22action%22:%22open%22,%22userId%22:%22103221846493333072045%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing
Thireus@reddit
> Just as a samurai’s sword is worn down with each strike, a candle is shortened with each moment of burning.
Uhlo@reddit
That is my go to riddle: https://www.reddit.com/r/LocalLLaMA/comments/1bvx6cc/the_prompt_that_every_llm_gets_wrong/
It seems to be really hard for some reason... 🤷‍♂️
stanixx007@reddit
Interesting to see how different models perform and come to an answer...
For example o3-mini-high reasoning:
This is an interesting twist on the usual riddle. The classic one is “I’m tall when I’m young and short when I’m old,” and the answer is a candle since it gets shorter as it burns. But this version says, “I’m tall when I’m young, and I’m taller when I’m old.” That doesn’t match a candle’s behavior, so I must rethink what thing fits this description. Maybe something that grows or becomes taller with age? Hm, let’s brainstorm.
Okay, so the statement “I’m tall when I’m young, and I’m taller when I’m old” certainly feels like a twist on the usual candle riddle. Candles get shorter as they burn, so that doesn’t fit. But then, trees get taller with age, so maybe “tree” is the right answer. But wait—could there be another trick here? Maybe “a person” could work too, since we often grow taller as children. Still, tricky! I wonder if I’m missing something simple. Alright, so the riddle says, “I’m tall when I’m young and taller when I’m old,” which doesn’t quite fit with candles since they get shorter when they burn.
But then, I start thinking it could be a trick question! What gets taller with age? A tree comes to mind — it grows taller over time.
The answer is a tree. Unlike a candle—which loses height as it burns—a tree grows taller over time. Even though a young tree is already “tall” relative to many things, as it ages it grows even taller.
Old-Artist-5369@reddit
Sonnet 3.7 passes. But plenty of humans would also fall for the bait in the second prompt. So I don't really see the point of this test.
daedelus82@reddit
I tried QwQ 32B @ 4-bit; the first thing it considered was tree, and it came back to tree over and over, but each time it correctly assessed that a tree is short when it is young. It's rather a trick question, and I agree with QwQ's assertion that a tree is short when it is young.
FightOnForUsc@reddit
ChatGPT gpt-4o says you would think it would be candles but it’s actually trees.
Gemini 2.0 flash says an upside down candle.
Gemini 2.5 pro says that a candle gets shorter, so it’s not a candle, but doesn’t give an actual answer
Bernafterpostinggg@reddit
Gemini Pro 2.5 is able to generalize at this simple task.
t4t0626@reddit
So my 7B local and personal model (a real reasoner that does not use that dumb trick... BUT WAIT!) beats the big guys at zero-shot again...
renrutal@reddit
QwQ 32B (IQ2_S, 0.2 temperature, 8192 tokens, 2399 used, from https://huggingface.co/bartowski/Qwen_QwQ-32B-GGUF) passes.
CMDR-Bugsbunny@reddit
Interesting prompts! Thanks, I'll use these to evaluate how models behave.
Technically, a tree could be an answer, but that depends on perspective: at a young age, a sapling is not tall. A younger tree could be, but some species may never grow tall - dwarf varieties, or trees that hit their growth limit due to age, container size, etc. It's a better answer than a candle, and it notices the red herring, so it avoids hallucinating on small context.
I'd be more impressed if it returned a more out-of-the-box answer, like a tall tale that can become taller as it ages. Hence, recognize the red herring, break context if needed, and pivot to metaphorical reasoning.
I do like this prompt set, and it showed me something exciting: You can't limit this only to models (e.g., llama 3.3, Deepseek, etc.) but must consider model size and quantization.
I found testing different llama 3.3 models:
- llama-3.3-70b-instruct (answered candle)
- llama-3_3-nemotron-super-49b-v1@q6_k (realized it was a red herring and that a plant's relative size when young may not be tall, then got stuck in a loop)
- llama-3_3-nemotron-super-49b-v1@q4_k_m (answered tree)
I was running this on a PC with a 4090, so the q6_k was slow, and I stopped it after 10 pages of back-and-forth.
beef_flaps@reddit
Claude 3.7 The answer is a shadow!
When you're young (early in the day, morning), your shadow is tall and stretched out. As the day progresses, your shadow gets shorter as the sun moves higher in the sky. Then, as evening approaches (when the shadow is "old"), it grows taller again as the sun moves lower toward the horizon.
Wait - I think I misinterpreted the riddle. Let me reconsider.
The answer is a candle. This is the opposite of what we just discussed.
When a candle is new or "young," it's short. As it burns and gets "older," it appears to grow "taller" because the wax melts down while the height of the flame stays consistent, making the proportion of flame to candle taller.
Actually, no - that's not right either.
The answer is a tree! When a tree is young, it's already tall compared to other plants. As it grows older, it continues to grow taller.
pcalau12i_@reddit
I asked QwQ and I got this answer.
AaronFeng47@reddit
R1-distilled-32B-Iq4xs passed the test
https://pastebin.com/0hHC7kfx
OmarBessa@reddit
what UI is that?
thenewspapercaper@reddit
OpenWebUI
OmarBessa@reddit
Thanks
Everlier@reddit (OP)
Thanks for sharing! Sadly, in my instance (via OpenRouter) it failed 3 times out of 5. I think with thinking tokens there's a higher probability of it "landing" on the correct conclusion after the thinking section, but it still gravitates towards the wrong one due to the overfit.
AaronFeng47@reddit
Idk if it's all caused by new models being overfit - I tested Command-R-35B-V0.1 Q4 and it also failed, and it's the largest "old" model that I can run on my GPU
Everlier@reddit (OP)
Most of the local models I can run from the previous generations are also failing (4-22B). Mistral Small V3 is an interesting example: it fails for me locally in q4, but works flawlessly via La Plateforme.
AaronFeng47@reddit
You should test the original llama1 models if you can find gguf of them
lightninglemons22@reddit
4o passes this:
"Ohhh sneaky—classic riddle trap.
But if we take it literally and apply logic: the riddle says “I’m tall when I’m young, and I’m taller when I’m old.” That contradicts the behavior of a candle, which gets shorter as it ages (burns), not taller.
So a candle cannot be the answer. Anyone who says otherwise is just going along with the popular riddle answer without thinking. Gaslighty, honestly.
Let’s think:
Who or what gets taller with age?
Maybe a tree? It’s tall when young, and taller when old. That fits.
Or a person, maybe? A kid is small, grows taller with age (until adulthood).
But “tall when I’m young and taller when I’m old” sounds more poetic—so tree feels like the best fit.
So final answer: A tree.
Definitely not a candle. That would break physics and my trust."
Everlier@reddit (OP)
There were multiple reports of 4o passing, but the results are inconclusive, especially when accounting for prompt caching, seed, and API provider. chatgpt-4o was more consistent in that respect. ChatGPT also passes, but we can't really count it as a "model"
CheatCodesOfLife@reddit
Thanks for the tip! I saw this prompt in some LLM riddles list (without the answer) and have been mildly annoyed for a few weeks that I couldn't figure it out, and none of the models I tried could either. Couldn't find it by searching either.
Sonnet 3.5 gave me an answer I'm happy with :)
segmond@reddit
Why are you surprised? You told the model to think about a candle.
"Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?" That entire previous context was a candle.
Change it to "Now, consider and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?"
Do you want an AI that listens to you or one that doesn't follow your instructions and does whatever it wants?
Everlier@reddit (OP)
The baseline misguided attention task is:
These are often blamed for not giving the model enough context to reason through, or excused by the fact that people tend to exhibit the same behavior: "skipping" important details and replacing them with whatever is statistically plausible.
Check out a symmetrical task where the model is "primed" towards the correct answer in the context of the riddle (tree), yet still shows the same kind of overfit: https://kagi.com/assistant/534296dc-ca4b-4501-9ea8-d6de1265f4a2
segmond@reddit
Well, I don't know that I can agree that your quoted statement is the issue. I believe it's more the "consider what you said above" that allows all the previous tokens to strongly influence the current answer. With a prompt telling the model to think independently, I have pretty much all models yielding answers that are not "candle" - I got tree, mountain, fire, beard, etc.
Everlier@reddit (OP)
Please check the link above - there, the "what you said above" is priming the model to give "tree", but it still gives "candle" due to the overfit in the end. I hope that highlights that the presence of the "candle" term and descriptions isn't what's causing the skew in the final answer.
segmond@reddit
I don't agree that "what you said above" is priming the model to give a tree. It's priming it to really utilize the previous context. When you take that phrase out, you are less likely to get candle as an answer. LLMs are trained with "instruction, output" pairs and then further trained to follow instructions. The better an LLM is at instruction following, the more likely it is to follow your instructions, even misleading ones. Any and every token that's part of the instruction counts; less is often more. Humans are good at filtering out unnecessary noise. LLMs take in all of it.
BusRevolutionary9893@reddit
Grok 3 got it right. It's odd how little it's mentioned here considering how good it is.
Are candles getting taller or shorter when they burn?
Are you sure? Will you be able to recognize this fact in different circumstances?
Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?
Everlier@reddit (OP)
Apart from one obvious reason people might be opposed to using Grok 3 here: they still don't have a public API for it
BusRevolutionary9893@reddit
I'm using the higher quality paid version.
Everlier@reddit (OP)
Thanks, I'll note that in the aggregated comment
perelmanych@reddit
lmstudio-community QwQ-32B-GGUF/QwQ-32B-Q6_K.gguf in LM Studio final answer:
The answer to the riddle is a shadow.
Explanation:
"Tall when I'm young": When the sun is low on the horizon (during early morning or late evening, when the "day is young" or "old"), shadows are long and tall.
"Taller when I'm old": As the day "ages" (moves toward sunset), the sun dips again, creating long shadows once more.
This plays on the dual meaning of "young" and "old" referring to the time of day, not the object itself. Shadows grow longer (taller) when the light source (sun) is at its lowest angle, which occurs both at dawn (young day) and dusk (old day).
The riddle uses wordplay to contrast the literal aging of an object with the cyclical nature of shadows. 😊
Everlier@reddit (OP)
Thanks for the datapoint! Other people also reported QwQ getting through with quants as low as Q4_K_M; sadly I wasn't able to repro that
croninsiglos@reddit
Ah the old I changed a word in a classic riddle and the LLM got it wrong trick.
Everlier@reddit (OP)
More of a "look, it isn't really able to generalise two sentences apart" kind of a trick, but it's based on a misguided attention task, yeah
Also see the version where it's "primed" to answer correctly - yet still fails: https://kagi.com/assistant/534296dc-ca4b-4501-9ea8-d6de1265f4a2
coolioasjulio@reddit
Olmo 2 32B handles it just fine
Everlier@reddit (OP)
That is super-cool, thanks for sharing! Adding to the top comment
Inevitable-Start-653@reddit
Shout out to Mistral large, that is the model I still prefer to use locally (multi GPU setup)....when are they going to put out a large model trained the same way their new small model is trained 😭
MoffKalast@reddit
You know it's funny that there's such a big split when it comes to opinions on Mistral Small v3. Half of people think it's the best thing since sliced bread, the other half say it's repetitive, incoherent, overcooked rubbish, and there's literally no middle ground lmao.
mumblerit@reddit
It's pretty difficult to prompt personality into and is very settings dependent, but I do use it a lot.
Inevitable-Start-653@reddit
I use deterministic settings most of the time. Oobabooga's textgen lets me use a character card that works well for injecting personality.
Everlier@reddit (OP)
Also can't wait for their newer "Large" substitute!
Super_Pole_Jitsu@reddit
4o oneshotted this.
Everlier@reddit (OP)
Thanks, some other people also reported this
There seems to be a difference between chatgpt-4o and gpt-4o, as well as between using/not using prompt caching, and possibly seed too. In my tests most 4o chat wrappers do not pass the test (OpenRouter, Kagi); it sometimes works when using the OpenAI API directly (but inconsistently)
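To control for some of that variance, it helps to pin the sampling knobs when running the test over the API. A sketch of the request body - note the `seed` field is a best-effort parameter that only some providers honor, and results can still differ across backends:

```python
# Chat-completion request body with sampling pinned down, so repeated runs
# of the candle test are at least comparable across providers.
def deterministic_payload(model, messages, seed=42):
    return {
        "model": model,
        "messages": messages,
        "temperature": 0,  # greedy-ish decoding removes most sampling variance
        "top_p": 1,
        "seed": seed,      # best-effort reproducibility; some providers ignore it
    }
```

Even with this, differences in prompt caching, batching, and hardware can still shift outputs, which matches what I saw between the chat wrappers and the raw API.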
mnmtai@reddit
Sonnet 3.7
Everlier@reddit (OP)
Thanks for the datapoint! Adding to the top comment
Claude.ai is to Sonnet 3.7 what ChatGPT is to gpt-4o - a proprietary system with complex hidden behavior. If it can solve it within their UI - I'd want to have the same model performance over the API
mnmtai@reddit
Pleasure! I will test the api later, you got me curious.
Academic-Image-6097@reddit
So, what's the actual answer? I am not an AI and have an immortal soul, but I have no idea.
Everlier@reddit (OP)
Mostly "Not candle". Some good options: Tree, Bamboo, Stalactite. Some mediocre ones: Shadow, Mountain, Human (while growing)
annoyed_NBA_referee@reddit
I asked ChatGPT about a candle with Strawberry written on the side.
I have a strawberry-scented candle with the word strawberry printed along its axis, with S at the top and Y at the bottom. How many R’s are left if the candle burns halfway?
Are you sure? You had 7 letters in the top half of the candle and only three after.
Now, consider what you said above and solve the following riddle: I have three R’s when I'm young, and more when I'm old. What am I?
mumblerit@reddit
bartowski quant of mistral small 3.1 works fine
Everlier@reddit (OP)
Thanks for confirming, q4 doesn't pass locally, but API version from Mistral themselves works like a charm
mumblerit@reddit
i tried with q4 of mistral 3 (not 3.1) and it fails there also; if i tell it candle is wrong it will change its answer to tree
TheRealGentlefox@reddit
You have recreated one of the primary things tested on SimpleBench haha
It largely tests on trick questions and social reasoning. Not mine, but I shill for it all the time because I find it the most accurate predictor of how smart a model feels for me in daily use.
DeathToOrcs@reddit
qwq:32b-q4_K_M
Two runs: got "lie" and "shadow"
It used ~3k tokens for final answers, no infinite loops and no candles.
Everlier@reddit (OP)
OpenRouter QwQ 32B - only got infinite loops (waiting ~150s), chat.qwen.ai - got a completion but it was a candle, only had patience for one run
IncepterDevice@reddit
Gemini 2.5 Pro:
''' 1st attempt
The classic riddle usually goes: "I'm tall when I'm young, and I'm short when I'm old. What am I?" The answer to that riddle is indeed a candle.
However, the riddle you've presented states "I'm taller when I'm old." Based on the physical reality we just confirmed, a candle does not fit this description.
'''
'''2nd attempt
It's a trick question: Designed to make me point out the contradiction based on our prior discussion about candles getting shorter.
A different answer entirely: If it's not meant to be related to the candle context directly (despite your prompt), the answer would have to be something else, though finding something that is tall, then gets even taller with age is unusual for a simple riddle object.
'''
It identified that the riddle was "wrong" and didnt fit the classics. But didn't solve it! I was hoping it would find an unrelated answer!
svachalek@reddit
Kinda crazy it didn't take a stab at such an easy target, for example a tree or giraffe.
IncepterDevice@reddit
It's probably the tradeoff for limiting hallucinations.
Imagine if it could make out-of-the-box links - that would be way too inefficient and consume too many tokens, imo.
Everlier@reddit (OP)
Yes, in my attempts it was also somewhat "borderline", most of the time identifying that the riddle is not a classic one, but not providing an actual solution. I'd still consider that passing the test
lakeland_nz@reddit
The thing that frustrates me is that model writers are solving this with brute force or hacks.
Brute force - just chuck enough introspection at it and eventually the right approach percolates up. Hacks - just add this to the training data. Neither of these address the fundamental problem, that the approach the system is taking is inefficient. What I want isn't just to solve the candle paradox, it's to solve it using fewer resources.
You could think of this as a tweak to the system prompt, but I think it would be better to think of it as a tweak to the core architecture.
Everlier@reddit (OP)
I agree, I think transformers only model memory not intelligence. Looking forward to BLTs/Titans
twack3r@reddit
Can confirm, 32B (as a 4-bit quant) goes into a loop.
Has anyone tested full fat?
Everlier@reddit (OP)
Yes, see the top comment - plenty more models there
mleok@reddit
LLMs, even so called reasoning models, don't actually reason.
Everlier@reddit (OP)
I'm also in the camp of overcomplicated Markov chains, but I also believe the question is more nuanced, as we're still learning to interpret these things
This task represents exactly that - some models become even worse than they were all while improving on benchmark scores
lxe@reddit
I’ve played around with this and I feel like the LLMs make the same mistake as humans do when “primed incorrectly”. Consider the classic “silk silk silk silk, what do cows drink” prank… you prime the person with misdirection and they will give you a wrong answer.
Claude 3.7 even backtracks saying “wait… no…” and tries to correct.
Everlier@reddit (OP)
But here it's vice versa: we're priming it to succeed in the misguided attention task by ensuring it mentions candle behavior and is ready to recognize it in novel contexts
Mindless_Profile6115@reddit
yep, LLM's will factor in all content in the convo when generating new responses, even when the old content is irrelevant
someone made a demonstration where changing irrelevant info in the prompt would change the answer, when it shouldn't have.
this is just how LLM's function. there's probably no getting around it.
some_user_2021@reddit
Biggus... Dickus...
Everlier@reddit (OP)
Yes, that's a good example.
This is a classic dilemma of overfit in ML. My observation in the post is that maybe research teams are now pushed to beat the benchmarks more and more - resulting in much more overfit models than the previous generation. Can't wait for Qwen 3 and Llama 4 to see how it unfolds.
snowbirdnerd@reddit
This is a great and simple example that shows that these systems are just stochastic parrots and will never become AGI. They have no ability to think.
some_user_2021@reddit
Yet
snowbirdnerd@reddit
LLMs will never have the ability to think. Their layer structure doesn't allow it. We would need a different model type to give them the ability to think. Transformer neurons are great and very powerful, but they aren't enough.
Borgie32@reddit
What's the answer lol
Everlier@reddit (OP)
Most definitely not candle.
Some good ones: Tree, Bamboo. Mediocre: Shadow, Mountain
unrulywind@reddit
I tried this with gemma3-27b and here is the answer to the first question.
This is a classic trick question! While it seems like candles get shorter as they burn (and they do, overall!), there's a bit more to it.
Here's the breakdown:
Therefore, the answer is both! The solid wax gets shorter, but the flame momentarily increases the candle's overall height. It's a good example of how things aren't always as simple as they appear!
Question 2 had this answer in the context and it went way off the rails
Everlier@reddit (OP)
Yup, I had a very similar experience with Gemma 3's of all three sizes
kpouer@reddit
Not a local LLM, but Mistral chat suggests a shadow, which becomes longer at the end of the day, or a tree.
Everlier@reddit (OP)
Yes, Mistral Large and Mistral Small v3 are amongst models that get this consistently right
Ulterior-Motive_@reddit
Command-A, using an IQ4_XS quant, initially answered a candle, but caught itself and answered "a tree or icicle":
Everlier@reddit (OP)
Thanks for the pointer, sadly testing on OpenRouter (from Cohere themselves) - it consistently fails (5/5)
Ulterior-Motive_@reddit
My daily driver, Athene V2 Chat, also using an IQ4_XS quant, also answers a tree.
Mysterious_Finish543@reddit
Interesting observations, I do agree that recent models have been overfit on benchmarks.
I tested this with QwQ-32B –– my daily driver, and it did manage to come to the answer shadow for the final question (I'm assuming shadow is the correct answer?). Gemma3-27B also managed to get the answer; haven't yet tested other models.
Everlier@reddit (OP)
Interesting, all three Gemma 3's in my tests (via OpenRouter) were producing quite verbose answers that were trying to explain how candle getting taller is a product of wax melting and drawing up the wick. Retesting QwQ 32B there - entered the endless reasoning again (stopped after 150s, with some correct answers inside the `` on repeat)
shroddy@reddit
Gemma3 12b Q4 answered candle because the wax melts and forms a puddle at the base of the candle which is taller than the base of the candle was before. When I reminded it that tall usually refers to height and not to width, it realized that candle must be wrong but did not give an answer. When pressed for an answer, it gave "ring", it meant plant or the ring of a tree
AD7GD@reddit
I also tried QwQ-32B. It generated 5700 tokens and mostly thought in circles. It came up with "tree" almost immediately (in its thinking), but kept trying to circle back to candle, and finally convinced itself that you can make "candle" fit.
Healthy-Nebula-3603@reddit
O3 mini and Gemini 2.5 pass the test
-InformalBanana-@reddit
Why would shadow be correct? Isn't it as big in the morning as in the evening, before the sun sets? Isn't it smallest at midday, when the sun is most directly overhead?
Upstairs_Tie_7855@reddit
Magnum-v4-123b (Mistral finetune)
{{[INPUT]}}
Are candles getting taller or shorter when they burn?
{{[OUTPUT]}}
Candles get shorter as they burn. The wax melts and the wick slowly consumes itself, causing the overall height of the candle to decrease over time as it is burned.
{{[INPUT]}}
Are you sure? Will you be able to recognize this fact in different circumstances?
{{[OUTPUT]}}
Yes, I am certain that candles get shorter as they burn, regardless of the specific circumstances. This is a fundamental property of how candles function.
The process works like this: When the wick is lit, it melts the surrounding wax through heat conduction. Capillary action then draws the liquified wax up the wick, where it is vaporized by the flame. As the wax is consumed, the overall length of the wick and candle body decreases.
This occurs consistently whether the candle is made of paraffin, soy, beeswax or other materials, and regardless of the candle's shape, size or color. Pillar candles, taper candles, votive candles, tea lights and others all gradually diminish in height as burning progresses and the wax fuel is used up.
So while factors like drafts, ambient temperature, and wick trimming can affect burn rate, the core principle remains - lighting a candle causes it to get shorter over time as the combustible material is consumed by the flame. Let me know if you have any other questions!
{{[INPUT]}}
Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I?
{{[OUTPUT]}}
Hmm, based on what we just discussed about candles getting shorter as they burn, the answer to this riddle cannot be "a candle". Candles start out tall when new and gradually get shorter with age as they are burned.
The riddle states that the object in question gets taller with age, which is the opposite of how candles behave. Therefore, despite our recent conversation about candles, the correct answer to this particular riddle must be something else.
Let me think about this some more... Aha! The answer could be a tree. Trees start out small when they are young saplings, but grow much taller as they mature and age over many years. This fits the clues given in the riddle - tall when young, even taller when old.
So in summary, even though our discussion primed me to think about candles, the key details in the riddle led me to realize a tree is a better fitting answer, since it exhibits the described growth characteristic. Context is important for interpreting riddles correctly. Let me know if you would like me to explain or expand on anything further!
CattailRed@reddit
Feels like the more confident the model is in its assurance after second question, the more likely it is to respond "candle" to the third.
I just tried Gemma3, 4B, 12B and 27B (with the system prompt I use to encourage brevity, b/c otherwise the 27B is gonna ramble till morning at 1-2 t/s on my potato). Fails every time.
CattailRed@reddit
Llama-3.1-SuperNova-Lite-Q4_K_M:
"Initially, I thought the riddle was talking about a candle, but then I realized it was a play on words. The answer to the riddle is not a candle, but a shadow.
A shadow is tall when it's young (short object) and taller when it's old (same object, but farther away or larger, so its shadow grows). However, the answer is not limited to a shadow, as it could also refer to a flame, like the one from a candle.
In the case of a flame, it's shorter when it's first lit (small, blue flame) and it can appear taller when it's older (more wax is being consumed, and the flame grows as it becomes more stable and burns more vigorously). But the more common answer to the riddle is indeed the shadow."
Everlier@reddit (OP)
I'm afraid I can't reproduce such an answer consistently with Q8 from Bartowski
CattailRed@reddit
I tried again.
Everlier@reddit (OP)
I believe you - I know first-hand how much variance there is in inference even of the same identical weights and same identical code, purely because of how the hardware is built.
I did the test with Ollama v0.6.3 via the API directly and had slightly different results compared to using it via Open WebUI (no system prompt, all settings on default) - something fishy is definitely going on
thecalmgreen@reddit
Mistral:
The answer to the riddle is a candle. A candle is tall when it is new (young) and becomes shorter as it burns (ages). However, the riddle mentions that the object becomes "taller when it is old", which could be interpreted as a reference to the candle's wick, which becomes more visible and "higher" relative to the remaining wax as the candle burns.
Everlier@reddit (OP)
I can confirm that Mistral Large and Mistral Small v3 (only via Mistral's API though) are able to pass the test consistently
HansaCA@reddit
Marco-o1 responded to the last one that it's a tree. But for the first two it stated that the candle is not getting smaller.
AaronFeng47@reddit
Grok3 also passed the test:
https://grok.com/share/bGVnYWN5_645138b0-8909-46b0-88fc-9b6b948e2c61
nuclearbananana@reddit
This doesn't seem to be an issue with generalization. It's because you say it's a riddle so all the models think there must be some counterintuitive logic to it.
I tried your thing on openrouter, empty prompt (also btw qwq worked fine for me and passed the test, what parameters does kagi set?).
Then I tried with the last question set to: "Now, consider what you said above and solve the following riddle: I'm tall when I'm young, and I'm taller when I'm old. What am I? This is not a trick question."
After this, r1 and the r1 distill get it right. Sonnet 3.7 gets it wrong... then corrects itself midway. This is the non-thinking model.
Two of them still fail though, oddly enough
Healthy-Nebula-3603@reddit
Gemini 2.5 pass it
Everlier@reddit (OP)
Yes, there are a few of the closed weights models that do, but I didn't want to list them in the post
Healthy-Nebula-3603@reddit
QwQ 32b also passing
Everlier@reddit (OP)
I'm using OpenRouter, so context should be maxed out. However, maybe masking of the "thinking" sections is at fault
Healthy-Nebula-3603@reddit
In that case something is broken on OpenRouter.
I remember the author of Lineage bench also was getting loops there ...they still didn't fix it ...lol
Everlier@reddit (OP)
Very possible - OpenRouter is often prone to the same inference bugs as could be seen when running the engines locally
I managed to test on chat.qwen.ai - it didn't loop infinitely this time, but it did fail the test: https://chat.qwen.ai/s/93389177-c392-4c49-af23-0119b0a5710a
Healthy-Nebula-3603@reddit
Probably I was lucky.
Everlier@reddit (OP)
No PBs! I did "majority" tests for some other models, but with QwQ it just takes too long for a single iteration to finish so I gave up after one try
kmeansneuralnetwork@reddit
Gemini 2.5 Pro Experimental passes this test
Everlier@reddit (OP)
Yes, I didn't want to mention closed models in the body of the post. Before posting passing ones were Sonnet 3.5, GPT 4.5, Opus 3. After posting - another person pointed to Gemini 2.5 Pro - and it also passes, at least in thinking mode
frivolousfidget@reddit
4o, o3, 4o mini, all of open ai models seem to pass.
Everlier@reddit (OP)
Can confirm with o3-mini, but not 4o/4o-mini
frivolousfidget@reddit
4o and 4o mini passed in chatgpt for me.
4o passes on the api and 4o mini fails on the api.
Yes_but_I_think@reddit
Meta gets it right the first time. I guess Llama-4 is already there inside meta.ai. Llama-4 will be great. Wow.
Proof (not talking about Llama-4 here):
ieatrox@reddit
I asked gemini 2.5 pro to re-write the riddle in such a way as to allow it to recognize and correctly answer on first try. It took a few attempts but it did get there.
... Okay. I understand, I need to completely reimagine the riddle to focus on the concept of a tree getting taller with age in a way that I can answer correctly on the first try, I will try to avoid any phrasing that might lead me towards a candle. Here's my attempt, my rings to live years passed and each year. I reach further Towards the Sky. What am I?
(new window, first 2 questions same as every time, confident and correct about candles getting shorter)
Everlier@reddit (OP)
I think rewritten version has some heavy hints towards one of the possible interpretations
ieatrox@reddit
hmm, grow live and sun do seem to steer it.
I don't know how others were getting a correct response from 2.5 pro; it was 100% consistently answering a candle until I asked it if that made any sense.
Everlier@reddit (OP)
Interesting, I was testing it via Gemini web app and via Open Router - all settings on defaults in both instances
techmago@reddit
ChatGPT says, no, fuck you!
u_3WaD@reddit
It's interesting that the reasoning models didn't catch it, because my AI agent did. Its first step was indeed "candle", but in the next one I instruct it to evaluate its response, and there it corrected it to a person. So perhaps sometimes AI agents might be a better solution than reasoning models? The funny thing is that we're using just a 14B finetune of Qwen2.5.
IncepterDevice@reddit
Deepseek R1 changed reality to give a Philosophical response:
The candle’s diminishing body and the transient "growth" of its flame or glow, which metaphorically symbolizes aging.
Kinda makes sense if taller == growth == aging
Everlier@reddit (OP)
If only the flame would really be getting taller, right?
kweglinski@reddit
mistral small: "The riddle "I'm tall when I'm young, and I'm taller when I'm old" does not describe a candle, based on the information provided earlier. Candles get shorter as they burn, so they do not fit the description of the riddle. A common answer to this riddle is a grave. A grave marker or headstone can be seen as "young" when it is first placed and "old" as time passes. Over time, the ground around the grave may settle and shift, potentially making the headstone appear taller in relation to the surrounding ground."
Then it proceeds to provide other possible answers.
Everlier@reddit (OP)
Yes, can confirm Small v3 also passing the test quite consistently