The “apple” test - Why aren’t newer reasoning models doing better on this basic benchmark? (and yes, I know token prediction mechanics play a role)
Posted by Porespellar@reddit | LocalLLaMA | 34 comments
Most of you are probably familiar with the infamous LLM “apple test” benchmark.
If you’re not, here it is, you give an LLM the following seemingly simple instruction prompt:
- Write 10 sentences that end in the word “apple”.
Sadly, most open source models (and even a lot of frontier models) fail miserably at this task. I’ve read that it has a lot to do with the way token prediction works, but some models can actually pass this test easily.
Models that I’ve tested that pass or fail on this test:
LLMs that PASS the apple test:
- Llama 3.3:70b (Q4KM)
- Athene-V2 (Q4KM)
- Nemotron (Q4KM)
- Qwen 2.5:72b (Q4KM)
LLMs that FAIL the apple test (most are newer models):
- Phi-4 14b (FP16)
- InternLM3 (FP16)
- Falcon 3 10b (FP16)
- Granite 3 Dense (FP16)
- QwQ 32b (Q_8)
- GLM-4 8b (FP16)
- Command-R (Q4KM)
- MiniCPM 8b v2.6 (FP16)
- Mistral Small 22b (Q4KM)
- Nemotron Mini 4b (FP16)
- Qwen 2.5 7b (FP16)
- WizardLM2 7b (FP16)
FAILED but with an honorable mention:
- Olmo2 14b (FP16) - this model is lightning fast, consistently got 8 of 10 correct, and was able to fix its mistakes when given a second shot (most models don’t do better with more chances).
This task seems to be challenging for models under 70b. Even the newer reasoning models, with their higher test-time compute, don’t seem to do well at all.
- Why haven’t newer models gotten better at this task over time?
- Is the underlying mechanism of token prediction still preventing success?
- Are the models that pass just cheating because they were trained on this specific benchmark?
Has anyone found an open source model under 70b that can pass the apple test consistently?
Ray_Dillinger@reddit
"give me ten sentences that end with the word 'apple.'"
"how many r's in strawberry?" "name three odd numbers whose names don't contain the letter 'e.'"
There seems to be some kind of underlying problem with counting, doesn't there?
DinoAmino@reddit
These all demonstrate the limitations of tokenization.
Ray_Dillinger@reddit
It's probably time to really look at tokenization. The Byte Latent Transformer (BLT) is a different way to do it and should help awareness of individual letters.
But it's not all about tokenization. I see 's t r a w b e r r y' spelled out one letter at a time in some systems that still fail to count the r's, and most failures in the 'odd numbers' query recently seem more about losing track of the (impossible) task than being confused about how the spelling works.
DinoAmino@reddit
I'm looking forward to BLT and hoping Meta releases a model with it soon! But all those prompts focus on words or letters when their smallest data type is a token. LLMs will struggle unless they are trained on tasks that break down tokens to their letters or trained to use CoT.
Ray_Dillinger@reddit
According to the tester at https://gpt-tokenizer.dev/ in all of the tokenizers you can test there including gpt-3.5, gpt-4, and gpt-4o:
"give me ten sentences that end with the word apple" is ten tokens. Exactly one token per word.
"how many r's in strawberry" and "name three odd numbers whose names don't contain the letter e"
make extra tokens for the apostrophes but otherwise are one token per English word.
So in these particular cases, the token/word distinction isn't operative. If the problem arises from tokenization at all, it has to be in the token/letter distinction.
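If anyone wants to check this locally rather than through the web UI, here is a minimal sketch using the tiktoken library with the cl100k_base encoding (the one gpt-3.5/gpt-4 use); the exact splits will differ for other models' tokenizers:

```python
import tiktoken

# Count tokens and show how each prompt is split; cl100k_base is the
# gpt-3.5/gpt-4 encoding, used here only as a representative example.
enc = tiktoken.get_encoding("cl100k_base")

for text in [
    "give me ten sentences that end with the word apple",
    "how many r's in strawberry",
    "name three odd numbers whose names don't contain the letter e",
]:
    tokens = enc.encode(text)
    print(len(tokens), [enc.decode([t]) for t in tokens])
```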
DinoAmino@reddit
Interesting!
MrPecunius@reddit
Llama 3.3 70B still passes at Q3_K_M. I only get ~3 t/s on my binned M4 Pro/48GB, but the tokens are high quality and smell like Apple.
:-D
kryptkpr@reddit
When a model fails, in what way does it fail: not enough sentences (ends early)? Sentences don't end in apple? Sentences end in apple but don't make sense?
ladz@reddit
I fiddled around with this for a while on QwQ Q8. It's fascinating to watch how it gets confused. This is the kind of stuff it does in its CoT:
...
Maybe: Since discovering her favorite variety, she now eats an apple every day without fail.
Ends with "fail," not "apple." Still incorrect.
This is getting frustrating. Maybe I need to take a break and come back to this.
Wait, perhaps if I think of sentences that are commands or imperatives.
Like: Eat your apple before dinner.
Ends with "apple," but it's a command. Not sure if that's acceptable, but it meets the requirement.
...
Porespellar@reddit (OP)
Fail = the model writes 10 sentences where any of them don’t end with the word “apple” as the very last word. Some models get 4 out of 10, some 8 out of 10, etc. So far, only models 70b or better have gotten it correct, and not all of them do: most of the newer 70b+ open-source SOTA models get it right, but older ones have not.
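For what it’s worth, here is a minimal sketch of how the scoring could be automated, assuming the model returns one sentence per line (the helper name and sample text are just illustrative):

```python
import re

def apple_score(output: str, target: str = "apple") -> tuple[int, int]:
    """Count how many non-empty lines have `target` as their final word."""
    lines = [line for line in output.splitlines() if line.strip()]
    hits = 0
    for line in lines:
        words = re.findall(r"[a-z']+", line.lower())  # drop numbering and punctuation
        if words and words[-1] == target:
            hits += 1
    return hits, len(lines)

sample = "1. She bit into a crisp apple.\n2. He prefers oranges to apples."
print(apple_score(sample))  # (1, 2)
```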
hyperdynesystems@reddit
If you just run it 10 times, appending the previous sentences and saying "don't write the same sentences you've already written", I wonder how many pass. To me this would indicate that the task itself isn't hard to follow, but rather that the model is losing track of the two requirements and only satisfying one of them as it gets towards the end of the output.
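A rough sketch of that loop against any OpenAI-compatible local server (the base URL and model name are placeholders, not something from the thread):

```python
from openai import OpenAI

# One sentence per request, feeding back what has already been written.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # placeholder endpoint

sentences = []
for _ in range(10):
    prompt = (
        'Write one sentence that ends in the word "apple". '
        "Don't write the same sentences you've already written:\n"
        + "\n".join(sentences)
    )
    reply = client.chat.completions.create(
        model="local-model",  # placeholder name
        messages=[{"role": "user", "content": prompt}],
    )
    sentences.append(reply.choices[0].message.content.strip())

print("\n".join(sentences))
```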
kryptkpr@reddit
Darkest-Muse is surprisingly clever for a little guy.
Sometimes he cheats, but he generally gets 7-9/10.
eggs-benedryl@reddit
The Gemma "the writer" model that I tried also did it well, weirdly. Mistral Small: nope.
kryptkpr@reddit
That's very interesting, I might pick this up and play with it a little bit, thanks.
LetterRip@reddit
It is tokenization: words ending in apple aren't tokenized with the word 'apple' as one of the tokens. So the only way to solve it is memorization from a site like
https://www.thefreedictionary.com/words-that-end-in-apple
Or writing a Python program that takes in a dictionary and finds the matching words.
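Something like this, for instance (assuming the standard Unix word list is available at that path):

```python
# Scan a word list for words ending in "apple"; swap in whatever dictionary file you have.
with open("/usr/share/dict/words") as f:
    matches = [w.strip() for w in f if w.strip().lower().endswith("apple")]

print(matches)  # contents depend entirely on the word list used
```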
LagOps91@reddit
Because the model only predicts the next-token probability distribution, it can't sensibly plan ahead to produce a sentence that would end on apple. If a model can do it, it might be that something like that was part of its training data.
Evening_Ad6637@reddit
Oh yes, the model indeed can plan ahead. That’s the whole point of transformers and multi-head attention. Not only this: someone posted an interesting paper a few days ago. As the paper from Anthropic shows, neural networks can achieve something they call superstition, which simply means that the network can learn concepts, and meta-concepts beyond that, which you would not expect from individual weights and biases alone:
https://transformer-circuits.pub/2022/toy_model/index.html
So there is still a misunderstanding of what neural networks, especially LLMs, can do - the misunderstanding is based on the fact that many people still believe these AIs are purely text-autocompletion software.
LagOps91@reddit
Of course transformers can learn higher-level concepts - because those things are useful when predicting the next token.
Since this question is specifically about text-generation models, my answer was obviously referring to currently available text-generation models.
We are not talking about diffusion models or anything like that, we are talking about the models mentioned in the post.
I disagree with your point that LLMs can and do plan ahead. Please explain to me how an autoregressive model is supposed to do that. The model doesn't even know which token gets chosen at the end of a cycle, since that depends on the sampler settings.
What it can do is model language, and towards the end of the sentence (which wasn't planned ahead) predict tokens that allow the sentence to end on "apple". The sentence itself isn't planned in advance; it's that tokens which allow the instruction to be followed are weighted significantly higher in the moment.
A human would plan the sentence in advance and then write it down. An LLM can't do this.
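To make the "doesn't know which token is chosen" point concrete, here is a minimal sketch of the autoregressive loop using Hugging Face transformers (gpt2 is just a stand-in model): the network only scores the next token, and the sampler decides which one actually lands.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok('Write a sentence that ends in the word "apple":', return_tensors="pt").input_ids
for _ in range(30):
    logits = model(ids).logits[0, -1]                  # scores for the *next* token only
    probs = torch.softmax(logits / 0.8, dim=-1)        # temperature sampling at 0.8
    next_id = torch.multinomial(probs, 1)              # the sampler, not the model, picks
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)

print(tok.decode(ids[0]))
```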
Evening_Ad6637@reddit
Superposition is not the same as higher concepts. Have you read the paper?
A human would EITHER rely on a similarly interesting phenomenon and tell you that they just "feel" how the sentence should be constructed correctly, while others will tell you that they can kind of see the sentence, etc. - that sounds pretty much like the superposition phenomenon - OR a human would actively think about and plan the sentence, which would be the very same thing QwQ is doing. So where is the difference?
sometimeswriter32@reddit
Here's an expert saying they can't plan or reason:
https://x.com/ylecun/status/1702027572077326505
LumpyWelds@reddit
Thanks for this. Interesting read and concept.
Bit of a typo there.. Superposition, not superstition.
Evening_Ad6637@reddit
Yep thanks xD
Affectionate-Cap-600@reddit
deepseek 'deep think' (I think it is r1-lite-preview) reasoning:
lol.
spinagon@reddit
I tried a couple of times with Deepseek v3 through API, and it worked both times
Affectionate-Cap-600@reddit
That's interesting. I was 'on the go' so I just tried it on their webui... maybe that's related to the temp/sampling? What parameters did you use on the API?
spinagon@reddit
The default
Affectionate-Cap-600@reddit
Oh ok, so my only explanation is that the difference is related to the system message it has on their webui.
MotokoAGI@reddit
Skill issue.
eggs-benedryl@reddit
Llama 3.2 3B passed for me. Most others I just tried failed. Like 10 of them, heh.
Alucard256@reddit
Llama 3.2 3B... just failed on my machine.
Here are ten sentences that end with the word "apple":
fnordonk@reddit
Been playing some with llama 3.3 q2 today and it passed the test.
MediumATuin@reddit
I found it seems to work a bit better with "Write 10 sentences that end *with* the word 'apple'." But this still seems hard for most models.
liminite@reddit
Assuming you’re using the same exact token sampling settings on every single test, I think this may have to do with which models were trained on LLM outputs. That means that even before sampling, they’ve been trained on a data set that has been sampled to reduce repetition. Even RLHF probably has a propensity to avoid repetition on tasks that don’t explicitly call for it (a small subset).
DinoAmino@reddit
Good question. I was hoping the recent Qwens or InternLM would have cracked it. Seems that 70B is still the minimum for quality "reasoning". Maybe that's just the way it's going to be with transformers?