The “apple” test - Why aren’t newer reasoning models doing better on this basic benchmark? (and yes, I know token prediction mechanics play a role)
Posted by Porespellar@reddit | LocalLLaMA | 34 comments
Most of you are probably familiar with the infamous LLM “apple test” benchmark.
If you’re not, here it is, you give an LLM the following seemingly simple instruction prompt:
- Write 10 sentences that end in the word “apple”.
Sadly, most open source models (and even a lot of frontier models) fail miserably at this task. I’ve read that it has a lot to do with the way token prediction works, but some models can actually pass this test easily.
Models that I’ve tested that pass or fail on this test:
LLMs that PASS the apple test:
- Llama 3.3:70b (Q4KM)
- Athene-V2 (Q4KM)
- Nemotron (Q4KM)
- Qwen 2.5:72b (Q4KM)
LLMs that FAIL the apple test (most are newer models):
- Phi-4 14b (FP16)
- InternLM3 (FP16)
- Falcon 3 10b (FP16)
- Granite 3 Dense (FP16)
- QwQ 32b (Q_8)
- GLM-4 8b (FP16)
- Command-R (Q4KM)
- MiniCPM 8b v2.6 (FP16)
- Mistral Small 22b (Q4KM)
- Nemotron Mini 4b (FP16)
- Qwen 2.5 7b (FP16)
- WizardLM2 7b (FP16)
FAILED but with an honorable mention:
- Olmo2 14b (FP16) - this model is lightning fast, consistently got 8 of 10 correct, and was able to fix its mistakes when given a second shot (most models don’t do better with more chances).
This task seems to be challenging for models under 70b. Even the newer reasoning models, with their higher test-time compute, don’t seem to do well at all.
- Why haven’t newer models gotten better at this task over time?
- Is the underlying mechanism of token prediction still preventing success?
- Are the models that pass just cheating because they were trained on this specific benchmark?
Has anyone found an open source model under 70b that can pass the apple test consistently?
Ray_Dillinger@reddit
"give me ten sentences that end with the word 'apple.'"
"how many r's in strawberry?" "name three odd numbers whose names don't contain the letter 'e.'"
There seems to be some kind of underlying problem with counting, doesn't there?
DinoAmino@reddit
These all demonstrate the limitations of tokenization.
Ray_Dillinger@reddit
It's probably time to really look at tokenization. The Byte Latent Transformer (BLT) is a different way to do it and should help awareness of individual letters.
But it's not all about tokenization. I see 's t r a w b e r r y' spelled out one letter at a time in some systems that still fail to count the r's, and most failures in the 'odd numbers' query recently seem more about losing track of the (impossible) task than being confused about how the spelling works.
DinoAmino@reddit
I'm looking forward to BLT and hoping Meta releases a model with it soon! But all those prompts focus on words or letters when their smallest data type is a token. LLMs will struggle unless they are trained on tasks that break down tokens to their letters or trained to use CoT.
Ray_Dillinger@reddit
According to the tester at https://gpt-tokenizer.dev/ in all of the tokenizers you can test there including gpt-3.5, gpt-4, and gpt-4o:
"give me ten sentences that end with the word apple" is ten tokens. Exactly one token per word.
"how many r's in strawberry" and "name three odd numbers whose names don't contain the letter e"
make extra tokens for the apostrophes but otherwise are one token per English word.
So in these particular cases, the token/word distinction isn't operative. If the problem arises from tokenization at all, it has to be in the token/letter distinction.
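If anyone wants to check this locally rather than through the web UI, here is a minimal sketch using the tiktoken library with the cl100k_base encoding (the one gpt-3.5/gpt-4 use); the exact splits will differ for other models' tokenizers:

```python
import tiktoken

# Count tokens and show how each prompt is split; cl100k_base is the
# gpt-3.5/gpt-4 encoding, used here only as a representative example.
enc = tiktoken.get_encoding("cl100k_base")

for text in [
    "give me ten sentences that end with the word apple",
    "how many r's in strawberry",
    "name three odd numbers whose names don't contain the letter e",
]:
    tokens = enc.encode(text)
    print(len(tokens), [enc.decode([t]) for t in tokens])
```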
DinoAmino@reddit
Interesting!
MrPecunius@reddit
Llama 3.3 70B still passes at Q3_K_M. I only get ~3 t/s on my binned M4 Pro/48GB, but the tokens are high quality and smell like Apple.
:-D
kryptkpr@reddit
When a model fails, in what way does it fail: not enough sentences (ends early)? Sentences don't end in apple? Sentences end in apple but don't make sense?
ladz@reddit
I fiddled around with this for a while on QwQ Q8. It's fascinating to watch how it gets confused. This is the kind of stuff it does in its CoT:
...
Maybe: Since discovering her favorite variety, she now eats an apple every day without fail.
Ends with "fail," not "apple." Still incorrect.
This is getting frustrating. Maybe I need to take a break and come back to this.
Wait, perhaps if I think of sentences that are commands or imperatives.
Like: Eat your apple before dinner.
Ends with "apple," but it's a command. Not sure if that's acceptable, but it meets the requirement.
...
Porespellar@reddit (OP)
Fail = the model writes 10 sentences where any of them don’t end with the word “apple” as the very last word. Some models get 4 out of 10, some 8 out of 10, etc. So far, only models 70b or better have gotten it correct, and not all of them do: most of the newer 70b+ open-source SOTA models get it right, but older ones have not.
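For what it’s worth, here is a minimal sketch of how the scoring could be automated, assuming the model returns one sentence per line (the helper name and sample text are just illustrative):

```python
import re

def apple_score(output: str, target: str = "apple") -> tuple[int, int]:
    """Count how many non-empty lines have `target` as their final word."""
    lines = [line for line in output.splitlines() if line.strip()]
    hits = 0
    for line in lines:
        words = re.findall(r"[a-z']+", line.lower())  # drop numbering and punctuation
        if words and words[-1] == target:
            hits += 1
    return hits, len(lines)

sample = "1. She bit into a crisp apple.\n2. He prefers oranges to apples."
print(apple_score(sample))  # (1, 2)
```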
hyperdynesystems@reddit
If you just run it 10 times, appending the previous sentences and saying "don't write the same sentences you've already written", I wonder how many pass. To me this would indicate that the task itself isn't hard to follow, but rather that the model is losing track of the two requirements and only satisfying one of them as it gets towards the end of the output.
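A rough sketch of that loop against any OpenAI-compatible local server (the base URL and model name are placeholders, not something from the thread):

```python
from openai import OpenAI

# One sentence per request, feeding back what has already been written.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # placeholder endpoint

sentences = []
for _ in range(10):
    prompt = (
        'Write one sentence that ends in the word "apple". '
        "Don't write the same sentences you've already written:\n"
        + "\n".join(sentences)
    )
    reply = client.chat.completions.create(
        model="local-model",  # placeholder name
        messages=[{"role": "user", "content": prompt}],
    )
    sentences.append(reply.choices[0].message.content.strip())

print("\n".join(sentences))
```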
kryptkpr@reddit
Darkest-Muse is surprisingly clever for a little guy.
Sometimes he cheats, but he generally gets 7-9/10.
eggs-benedryl@reddit
The Gemma "the writer" model that I tried also did it well, weirdly. Mistral Small: nope.
kryptkpr@reddit
That's very interesting, I might pick this up and play with it a little bit, thanks.
LetterRip@reddit
It is tokenization: words ending in apple aren't tokenized with the word 'apple' as one of the tokens. So the only way to solve it is memorization from a site like
https://www.thefreedictionary.com/words-that-end-in-apple
Or writing a Python program that takes in a dictionary and finds the matching words.
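Something like this, for instance (assuming the standard Unix word list is available at that path):

```python
# Scan a word list for words ending in "apple"; swap in whatever dictionary file you have.
with open("/usr/share/dict/words") as f:
    matches = [w.strip() for w in f if w.strip().lower().endswith("apple")]

print(matches)  # contents depend entirely on the word list used
```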
LagOps91@reddit
Because the model only predicts the next-token probability distribution, it can't sensibly plan ahead to produce a sentence that would end on apple. If a model can do it, it might be that something like that was part of its training data.
Evening_Ad6637@reddit
Oh yes, the model indeed can plan ahead. That’s the whole point of transformers and multi-head attention. Not only this: someone posted an interesting paper a few days ago. As the paper from Anthropic shows, neural networks can achieve something they call superstition, which simply means that the network can learn concepts, and meta-concepts beyond that, which you would not expect from individual weights and biases alone:
https://transformer-circuits.pub/2022/toy_model/index.html
So there is still a misunderstanding of what neural networks, especially LLMs, can do - the misunderstanding is based on the fact that many people still believe these AIs are purely text-autocompletion software.
LagOps91@reddit
Of course transformers can learn higher-level concepts - because those things are useful when predicting the next token.
Since this question is specifically about text-generation models, my answer was obviously referring to currently available text-generation models.
We are not talking about diffusion models or anything like that, we are talking about the models mentioned in the post.
I disagree with your point that LLMs can and do plan ahead. Please explain to me how an autoregressive model is supposed to do that. The model doesn't even know which token gets chosen at the end of a cycle, since that depends on the sampler settings.
What it can do is model language, and towards the end of the sentence (which wasn't planned ahead) predict tokens that allow the sentence to end on "apple". The sentence itself isn't planned in advance; it's that tokens which allow the instruction to be followed are weighted significantly higher in the moment.
A human would plan the sentence in advance and then write it down. An LLM can't do this.
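To make the "doesn't know which token is chosen" point concrete, here is a minimal sketch of the autoregressive loop using Hugging Face transformers (gpt2 is just a stand-in model): the network only scores the next token, and the sampler decides which one actually lands.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok('Write a sentence that ends in the word "apple":', return_tensors="pt").input_ids
for _ in range(30):
    logits = model(ids).logits[0, -1]                  # scores for the *next* token only
    probs = torch.softmax(logits / 0.8, dim=-1)        # temperature sampling at 0.8
    next_id = torch.multinomial(probs, 1)              # the sampler, not the model, picks
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)

print(tok.decode(ids[0]))
```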
Evening_Ad6637@reddit
Superposition is not the same as higher concepts. Have you read the paper?
A human would EITHER rely on a similarly interesting phenomenon and tell you that they just "feel" how the sentence should be constructed correctly, while others will tell you that they can kind of see the sentence, etc. - that sounds pretty much like the superposition phenomenon - OR a human would actively think about and plan the sentence, which would be the very same thing QwQ is doing. So where is the difference?
sometimeswriter32@reddit
Here's an expert saying they can't plan or reason:
https://x.com/ylecun/status/1702027572077326505
LumpyWelds@reddit
Thanks for this. Interesting read and concept.
Bit of a typo there.. Superposition, not superstition.
Evening_Ad6637@reddit
Yep thanks xD
Affectionate-Cap-600@reddit
deepseek 'deep think' (I think it is r1-lite-preview) reasoning:
lol.
spinagon@reddit
I tried a couple of times with Deepseek v3 through API, and it worked both times
Affectionate-Cap-600@reddit
That's interesting. I was 'on the go' so I just tried it on their webui... maybe that's related to the temp/sampling? What parameters did you use on the API?
spinagon@reddit
The default
Affectionate-Cap-600@reddit
Oh ok, so my only explanation is that the difference is related to the system message it has on their webui.
MotokoAGI@reddit
Skill issue.
eggs-benedryl@reddit
Llama 3.2 3B passed for me. Most others I just tried failed. Like 10 of them, heh.
Alucard256@reddit
Llama 3.2 3B... just failed on my machine.
Here are ten sentences that end with the word "apple":
fnordonk@reddit
Been playing some with llama 3.3 q2 today and it passed the test.
MediumATuin@reddit
I found it seems to work a bit better with "Write 10 sentences that end *with* the word 'apple'." But this still seems hard for most models.
liminite@reddit
Assuming you’re using the same exact token sampling settings on every single test, I think this may have to do with which models were trained on LLM outputs. That means that even before sampling, they’ve been trained on a data set that has been sampled to reduce repetition. Even RLHF probably has a propensity to avoid repetition on tasks that don’t explicitly call for it (a small subset).
DinoAmino@reddit
Good question. I was hoping the recent Qwens or InternLM would have cracked it. Seems that 70B is still the minimum for quality "reasoning". Maybe that's just the way it's going to be with transformers?