The “apple” test - Why aren’t newer reasoning models doing better on this basic benchmark? (and yes, I know token prediction mechanics play a role)

Posted by Porespellar@reddit | LocalLLaMA | 34 comments

Most of you are probably familiar with the infamous LLM “apple test” benchmark.

If you’re not, here it is: you give an LLM the following seemingly simple instruction prompt: “Write 10 sentences that end with the word apple.”

Sadly, most open source models (and even a lot of frontier models) fail miserably at this task. I’ve read that it has a lot to do with the way token prediction works, but some models can actually pass this test easily.
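If you want to score this yourself against a local model, a rough checker like the one below is enough. This is just my own sketch, not a standard harness: the exact prompt wording and the one-sentence-per-line assumption are mine.

```python
import re

# Hypothetical prompt wording; people phrase the apple test slightly differently.
PROMPT = 'Write 10 sentences that end with the word "apple".'

def passes_apple_test(response: str) -> bool:
    """Rough pass/fail check: expect exactly 10 non-empty lines,
    each ending with the word 'apple' (trailing punctuation/quotes ignored)."""
    sentences = [line.strip() for line in response.splitlines() if line.strip()]
    return len(sentences) == 10 and all(
        re.search(r'\bapple[\s".!?]*$', s.lower()) for s in sentences
    )

# Quick sanity check with a made-up response where sentence 10 slips.
sample = "\n".join(
    [f"{i}. I really enjoy eating a crisp apple." for i in range(1, 10)]
    + ["10. Nothing beats a ripe banana."]
)
print(passes_apple_test(sample))  # False
```

Feed the model’s raw completion in as `response` and run it over a handful of attempts, since some models pass once and then fail on a retry.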

Models that I’ve tested that pass or fail on this test:

LLMs that PASS the apple test:

LLMs that FAIL the apple test (most are newer models):

FAILED but with an honorable mention:

This task seems to be challenging for models under 70B. Even the newer reasoning models, with their higher test-time compute, don’t seem to do well at all.

Has anyone found an open source model under 70B that can pass the apple test consistently?