When you figure out it’s all just math:

[-]

ZiggityZaggityZoopoo@reddit

Apple is coping because they can’t release a large model that’s even remotely useful

[-]

Hopeful-Result-9335@reddit

Apple has tons of cash. They could easily do it if they wanted too…

[-]

Due-Memory-6957@reddit

Do you have any idea of how many groups of people (even national ones) with tons of cash can't do it?

[-]

6969696969696969690@reddit

They’re literally trying right now and failing

[-]

ninjasaid13@reddit

Apple is coping because they can’t release a large model that’s even remotely useful

Wtf does apple have to do with this research being true or false?

[-]

ZiggityZaggityZoopoo@reddit

I don’t trust research from labs that can’t output good models. Talk is cheap. Training is hard. Anyone can have ideas, few people can test them.

[-]

That's a pretty narrow view of research. You're essentially putting the entire scientific method into doubt, because a huge part of science is rigorously testing ideas and poking holes in theories, not just building new things. Understanding limitations is just as vital as creating good models for real progress.

[-]

ZiggityZaggityZoopoo@reddit

Most labs put out a mix of research and models

[-]

t3h@reddit

When you can't actually understand the paper, you have to aim the blows a little lower...

[-]

joe190735-on-reddit@reddit

it depends, take a look at apple stock price now

[-]

threeseed@reddit

They never tried to build one though.

The focus was on building LLMs that can work on-device and within the limitations of their PCC (i.e. it can run on a VM style slice of compute).

[-]

pitchblackfriday@reddit

Wish they put such research effort into repairing their braindead Apple Intelligence bullshit.

[-]

chkno@reddit

Link to the paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

[-]

keepthepace@reddit

Link to the retort: The illusion of "The Illusion of Thinking"

[-]

ninjasaid13@reddit

How many humans can sit down and correctly work out a thousand Tower of Hanoi steps? There are definitely many humans who could do this. But there are also many humans who can’t. Do those humans not have the ability to reason? Of course they do! They just don’t have the conscientiousness and patience required to correctly go through a thousand iterations of the algorithm by hand

I don't understand why people are using human metaphors when these models are nothing like humans.

[-]

ginger_and_egg@reddit

Humans can reason

Humans don't necessarily have the ability to write down thousands of towers of Hanoi steps

-> Not writing thousands of towers of Hanoi steps doesn't mean that something can't reason

[-]

DwampDwamp@reddit

Not all humans are capable of reason

[-]

t3h@reddit

It didn't even consider the algorithm before it matched a different pattern and refused to do the steps.

Thus, no reasoning.

[-]

ginger_and_egg@reddit

I believe somewhere else in this thread, they pointed out that the structuring of the query for the paper explicitly asked the LLM to list out every single step. When this redditor asked it to solve it without listing that requirement, it wrote out the algorithm and then gave the first few steps as an example.

[-]

t3h@reddit

Again, missing the point. Ask it to provide the algorithm then list the steps - and suddenly it forgets all about the algorithm.

Hence, no 'reasoning', just regurgitation.

[-]

ginger_and_egg@reddit

Here's some text from the relevant comment

https://www.reddit.com/r/LocalLLaMA/s/nV4wtUMrPO

There is a serious rookie error in the prompting. From the paper, the system prompt for the Tower of Hanoi problem includes the following:

When exploring potential solutions in your thinking process, always include the corresponding complete list of moves.

(My emphasis). Now, this appears to be poor prompting. It's forcing a reasoning LLM to not think of an algorithmic solution (which would be, you know, sensible) and making it manually, pointlessly, stupidly work through the series of manual steps.

[-]

t3h@reddit

At that point, you're just being tricked into making stone soup.

That 'better prompt' works because you're doing all the reasoning for it - and guiding it manually towards the desired outcome.

This doesn't actually disprove the point.

[-]

ginger_and_egg@reddit

What better prompt? I didn't mention a better prompt.

[-]

ginger_and_egg@reddit

[...]

I was interested to try out the problem (providing the user prompt in the paper verbatim) on a model without a system prompt. When I did this with GPT-4.1 (not even a reasoning model!), giving it an 8 disc setup, it:

Correctly tells me that the problem is the Tower of Hanoi problem (I mean, no shit, sherlock)

Tells me the simple algorithm for solving the problem for any n

Shows me what the first series of moves would look like, to illustrate it

Tells me that to do this for 8 disks, it's going to generate a seriously long output (it tells me exactly how many moves it will involve) and take a very long time -- but if I really want that, to let it know -- and if so, what output format would I like it in?

Tells me that if I'd prefer, it can just write out code, or a function, to solve the problem generically for any number of discs

[-]

ConversationLow9545@reddit

how do we know? Do you know what thinking or reasoning is in humans, either?

[-]

ninjasaid13@reddit

We don't know what it is, but knowing what it isn't is a much easier task.

[-]

ConversationLow9545@reddit

Ok tell me if it is not predictive processing?

[-]

ninjasaid13@reddit

Your brain uses electric charge and a calculator uses electric charge, does that mean that your brain is not in contradiction to a calculator?

And we do know what is AGI, and its criterias

We do not have any besides defining it in terms of human intelligence.

Ok tell me if it is not based predictive processing and attention processing?

This doesn't mean LLMs have human-like thinking. The LLM predicts the most probable next token by learning statistical language while humans are not based on token or language at all. What humans predict is the state of the world.

If anyone tells you that LLMs have a world model or that video generators have a world model, they're sorely mistaken about what a world model is.

This is what a real world model requires: https://en.wikipedia.org/wiki/Schema_(psychology)

[-]

SportsBettingRef@reddit

because that is what the paper is trying to derive

[-]

Thick-Protection-458@reddit

Well, because "can't generalize further step generation across >=X task complexity" need some references to compare. Is it utterly useless? Or not.

And if someone understand it as "can't follow 8 and more step Hanoi tower, absolutely fail at 10 - means not a reasoner at all" - well, that logic is flawed, and one of the ways to show flaw is to remind that by that logic humans are not reasoners too.

[-]

t3h@reddit

Because they have zero clue about how LLMs work, and what's going on in their own head is only "the illusion of thinking"...

[-]

keepthepace@reddit

I blame people who argue whether a reasoning is "real" or "illusory" without providing a clear definition that leaves humans out of it. So we have to compare what models do to what humans do.

[-]

oxygen_addiction@reddit

Calling that a retort is laughable.

[-]

keepthepace@reddit

It addresses independently 3 problematic claims of the paper which you are free to address with arguments rather than laugh:

Hanoi tower puzzle algorithm is part of the training dataset so of course providing it to the models wont change anything.
Apple's claim of a ceiling in capabilities is actually a ceiling in willingness: at one point models stop trying to solve the problem directly and try to find a general solution. It is arguably a good thing that they do this, but it does make the problem much harder.
(The most crucial IMO) The inability to come up with some specific reasoning does not invalidate other reasoning the model does.

And I would like to add a 3.b. point:

This is a potentially unfair criticism, because the paper itself doesn’t explicitly say that models can’t really reason (except in the title)

Emphasis mine. It makes this article clickbaity and that's problematic IMO when the title says something that the content does not support.

[-]

t3h@reddit

True, but doesn't invalidate the claims made. Also Towers of Hanoi was not the only problem tested, others started to fail at n=3 with 12 moves required.
Describing this as "willingness" is a) putting human emotions on a pile of maths, and b) still irrelevant. It's unable to provide the answer, or even a general algorithm, when the problem is more complex and the algorithm identical to the simple version of the same problem.
Unless you consider "that's too many steps, I'm not doing that" as 'reasoning', no they don't. Reasoning would imply it's still able to arrive at the algorithm for problem n=8, n=9, n=10 even if it's unwilling to do that many steps. It doesn't even find the algorithm, which makes it highly suspect that it's actually reasoning.

It's just outputting something that looks like reasoning for the simpler cases.

[-]

keepthepace@reddit

About 3. I am seriously confused about how one could in good faith hold the view that being unable to adapt a reasoning at an arbitrary large step invalidates any reasoning below that step.

About 2. it is not anthropomorphizing at all, it is not an "emotion". It is a reasoning branch that says "this is going to be tedious, let's try to find a shortcut". It is a choice we would find reasonable if it were made by a human.

Here again, I am comparing with humans, for lack of objective criterion that allows one to differentiate between valid and invalid reasoning independently from the source.

Give me a blind experiment that evaluates reasonings and does not take into account whether they come from a human brain or an algorithm, and we can stop invoking comparisons with humans.

Barring a clear criterion, all we can point out is that "you would accept that in humans so surely this is valid?"

[-]

t3h@reddit

I am seriously confused about how one could in good faith hold the view that being unable to adapt a reasoning at an arbitrary large step invalidates any reasoning below that step.

I ask you 1+2. You say it's 3.

I ask you 1+23. You say first we do 23 which is 6, because we should multiply before adding, then we add 1 to that and get 7.

I ask you 1+23+4+5+6+7+89. You say that's too many numbers, and the answer is probably just 123456789.

Can you actually do basic maths, or have you just learned what to say for that exact form of problem? The last one requires nothing more than the first two.

And yet the reasoning LLM totally runs off the rails, instead providing excuses, because apparently it can't generalise the algorithm it knows to higher orders of puzzle.

That's why it invalidates the 'reasoning' below that step. If it was 'reasoning', it'd be able to generalise and follow the general steps for an arbitrarily long problem. The fact that it doesn't generalise is a pretty good sign it really isn't 'reasoning', it's just generating the expected 'reasoning' for a problem it's been trained on. The 'thinking' output doesn't consider the algorithm at all, it just says "no".

It is a choice we would find reasonable if it were made by a human.

Yes, but it's not a human, and it should be better than one. That's why we're building it. Why does it do this though? It's a pile of tensors - does it actually 'feel' like it's too much effort? Of course it doesn't, it doesn't have feelings. The training dataset contains examples of what's considered "too much output" and it's giving you the best matched answer - because it can't generalise at inference time to the solution for arbitrary cases.

Remember, the original paper wasn't just Towers of Hanoi. There were other puzzles that it failed at in as little as 12 moves required to solve.

[-]

keepthepace@reddit

Can you actually do basic maths, or have you just learned what to say for that exact form of problem?

This is actually testable and tested, and the LLMs do provide a reasoning in the form of what we teach schoolkids, even though they themselves are typically doing the calculation differently when unprompted.

The LLMs do pattern matching on abstract levels. The philosophical question is whether there is more to reasoning than applying patterns at a certain degree of abstraction.

because apparently it can't generalise the algorithm it knows to higher orders of puzzle.

This is not what they tested. They did not test its ability to produce a valid algorithm to solve Hanoi towers, which they all can probably, as it is part of their training dataset.

They tested its ability to run a very long algorithm in a "dumb" way which is more of a test for context windows than anything else and, quite honestly, a dumb way to test reasoning abilities. Make it generate a program, test the output.

The trace they ask for take 11 tokens per move. It takes 1023 moves to solve the 10 disks problems. They gave it 64k tokens to solve it, which would include 11k to generate the solution in thought, probably a similar amount to double-check it as it will typically do, and 11k to output it, dangerously close to the 64k limit. I find it extremely reasonable that models refuse to do such a long error-prone reasoning. That's like

Yes, but it's not a human, and it should be better than one.

Unless you give definitions of "valid reasoning" that does not boil down to "whatever humans do" you will have to accept constant accusation of human-centric bias and constant reference to abilities that humans have or have not. Give a definition that works under blind experimentation and we can go forward.

Why does it do this though?

Are you really interested in the answer? It is answered in the article I linked, it does not involve feelings (which I suspect you would be equally unable to define in a non-human-centric way)

Remember, the original paper wasn't just Towers of Hanoi.

It does 4 of them, including an even more known problem: the river crossing. It mostly talks about the Hanoi though, and fails to explore an effect on the river crossing that is actually fairly known: there are so many examples and variations of it on the web with a small number of steps, that models tend to fail there as soon as you introduce a variation.

For instance, a known test is to say "there is a man and a sheep on river bank, the boat can only contain 2 objects, how can the man and the sheep cross?", which is trivial, but the model will tend to repeat solution of the more complex problem involving a wolf or a cabbage.

However, correctly prompted (typically by saying "read that thoroughly" or "careful, this is a variation") they do solve the problem correctly, which, in my opinion, totally disproves the thesis that they can't get past reasoning that appeared in their training dataset.

[-]

t3h@reddit

This is actually testable and tested, and the LLMs do provide a reasoning in the form of what we teach schoolkids, even though they themselves are typically doing the calculation differently when unprompted.

No, not really. They aren't doing reasoning because what comes out of them looks like reasoning. Same as it's not actually doing research when it cites legal cases that don't exist. It's just outputting what it's been trained to show you - what the model creators think you want to see.

Unless you give definitions of "valid reasoning" that does not boil down to "whatever humans do"

If it is doing 'reasoning', it should devise a method/algorithm to solve the problem, using logic about the parameters of the puzzle. Once again, as the core concept seems overly difficult to grasp here, the fact it can apparently do this for a simple puzzle, but not for a more complicated puzzle, when it's the same algorithm, is showing it's not really doing this step. It's just producing output that gives the surface level impression that it is.

That's enough to fool a lot of people, though, who like to claim that if it looks like it is, it must be.

What I would expect if it actually was, though, is that it would still say "the way we solve this is X" even if it thinks the output will be too long to list. Although the other thing that would be obvious with understanding of how LLMs work is that this 'percieved' maximum length is purely a function of the LLM's training dataset - it does not 'know' what its context window size is.

This is not what they tested. They did not test its ability to produce a valid algorithm to solve Hanoi towers, which they all can probably, as it is part of their training dataset.

Yes, this wasn't what they tested to produce those graphs. I'm describing what they observed about the cases that they failed. The fact that it spews endless tokens about the solution and then refuses to solve it is the exact problem being described here.

fails to explore an effect on the river crossing that is actually fairly known

Once again, you are excusing it for failing, and saying they should have changed the prompt until it worked. A little ironic in Apple's case that you're basically resorting to "you're holding it wrong".

[-]

keepthepace@reddit

They aren't doing reasoning because what comes out of them looks like reasoning.

Come up with a test that can make the difference. Until then, this conversation will just go in circles.

If it is doing 'reasoning', it should devise a method/algorithm to solve the problem, using logic about the parameters of the puzzle.

It was not prompted for that. If prompted to do that it succeededs. And this is a bad test to test it because programs to solve these 4 puzzles are likely in the LLMs datasets.

Yes, this wasn't what they tested to produce those graphs. I'm describing what they observed about the cases that they failed. The fact that it spews endless tokens about the solution and then refuses to solve it is the exact problem being described here.

Please read both articles. They forced it to spew move tokens and dismissed the output when it actually tried to give a generic answer.

Once again, you are excusing it for failing, and saying they should have changed the prompt until it worked.

Uh, yeah? If I claim a CPU can't do basic multiplications but it turns out I did not use the correct instructions, my initial claims would be false.

[-]

t3h@reddit

Come up with a test that can make the difference.

Already did, you've ignored it.

They forced it to spew move tokens and dismissed the output when it actually tried to give a generic answer.

And? You can make excuses for it forever, but it failed at the task.

If I claim a CPU can't do basic multiplications but it turns out I did not use the correct instructions, my initial claims would be false.

Not at all what's happening here. Not even close.

[-]

ConversationLow9545@reddit

lol all the criticisms of neural networks b like It's not real AI, because (take your pick): "It's just pattern matching", "It's just linear equations", "It's just combining learned data"

Maybe so. But first, you will have to prove that your brain does anything different, otherwise the argument is moot.

[-]

chm85@reddit

Yeah definitely an opinion piece.

Apples research is valid but narrow. At least they are starting to scientifically confirm the anecdotal claims we have all seen. Someone needs to shut up Sam’s exaggerated claims because explaining this to executives every month is tiring. For some reason my VP won’t let me enroll them all in a math course.

[-]

welcome-overlords@reddit

Excellent read, thank you!

[-]

TheRealMasonMac@reddit

The author of that post is a software engineer, not an ML researcher. He also uses many anecdotal experiences to justify his contention with the paper's procedure and results. Skepticism is always good but this is more like a personal reaction than a proper retort.

[-]

Jolly_Mongoose_8800@reddit

Explain how people reason and think then. Go on

[-]

martinerous@reddit

There are a few known reasoning tools that people usually learn early on. For example, induction, deduction, abduction.

Without applying these principles, we would be as inefficient as LLMs, requiring huge amounts of examples and relying on memorization alone, and making stupid mistakes anytime when an example is missing.

[-]

ConversationLow9545@reddit

do u think it's hard to just add more training for the application of principles like induction, deduction, abduction? no, actually LLM reasoning is trained on that principles only along with training data. they overlap both approaches while reasoning

[-]

martinerous@reddit

With LLMs, it's too much tied to the massive amounts of data in the weights. Their reasoning is more emergent property, not something that would be possible to train "in the core".

A simple example. I had a mathematics professor who "trained" us with examples that had all numbers -1, 0, 1 wherever possible. As he said: "I know you are good at using calculators and I don't want to waste your time on that. I want you to immediately see the logic behind the idea". With humans, this works amazingly well. With LLMs, not so much. They need lots of examples and even then we cannot be sure they "truly understood" the idea because they cannot even reliably reply if they actually did understood it or not, as they are biased to tell they did.

Essentially, I'm with Andrej Karpathy on this. He recently discussed ideas about architectures and why we need smaller solid AI cores that do not rely on massive amounts of data in their weights to grasp core reasoning concepts.

[-]

ConversationLow9545@reddit

Understanding in itself a vague term and LLMs do pass various understanding tests, each having its own definition for understanding. There is no true understanding as bs Searle room mystical type shit, as what we do is actually thinking and claiming that we truly understand irrespective of whether there is thing called true understanding in the neurons. Its functional, and to understand is to produce the behaviour of understanding that's it

And I didnot get about what you trained by that 1,0,-1. Seems like just pattern matching

[-]

martinerous@reddit

Yes, "understanding" is quite vague word. Still, even a first-grader can feel when they do not understand something and admit it. We are very aware of our own thought process and have the intuitive sense of being stuck and unable to solve a task properly. LLMs don't ask "Wait, I don't get it how it works, can you explain me more logic about this specific step?"

The point behind 1,0,-1 was that the professor could present us a new concept using those simple numbers alone and do it just once, and we got the idea immediately (or those who did not could ask for more explanations) and could apply it later to whatever numbers we might encounter. LLMs usually need more patterns with different numbers to "grasp" the concept, and still sometimes get confused when variable names are switched around (e.g. can solve an exercise about Peter and Anna, and mess up the same exercise about John and Mary).

The question is less about if LLMs can or cannot learn something. If we throw "everything" at them and give them endless resources, they might achieve superpowers. It's more about efficiency. Can we make them work at least as reliably as an intelligent professional person at the peak of their performance without spending energy as an entire country? That's where we would need architecture changes to find a way to ingrain the world model and causality at the core with minimum possible weights, and then be able to throw more information as needed to create domain experts.

[-]

ConversationLow9545@reddit

>"Wait, I don't get it how it works, can you explain me more logic about this specific step?"

They do, checkout the reasoning process of SOTA reasoning models

[-]

martinerous@reddit

That's the problem. It requires a huge SOTA reasoning model to notice that it's missing something, and even then the behavior can be flaky and not always clear if the model has actually detected the lack of something or if it's saying so because it was trained to do that, which can lead to the opposite problem (e.g. overthinking and getting caught in "Wait, what if..."(.

In any case, it should be core functionality of any model to detect when there's not enough information or "understanding" of the context to solve the task.

[-]

ConversationLow9545@reddit

They do lol, you simply haven't seen it, GPT5 plus and pro shows self awareness and affirmation in the available information. They don't hallucinate and even sometimes gets to conclusion of inability to find something due to lack of information.

And a human in wild without training would act like chimpanzee. Training is essential for everything

[-]

martinerous@reddit

Because current SOTA models are too large and inefficient for what they do. The architecture and approach needs serious changes.

GPT‑5 still hallucinates, OpenAI even themselves admit it.

But Andrej Karpathy would be a much better person to discuss this. Specifically, in this article and video: https://www.dwarkesh.com/p/andrej-karpathy

[-]

ConversationLow9545@reddit

Vague arguments by karpathy(ofc it's a podcast), as long as There is no quantified parameter for reliability, or measurement of how much training for what type of task to showcase and claim human abilities are distinct or it excels in those areas. I don't disagree, but whatever you said is still based on general behaviour observation of intuition, rather than a quantified approach to remark those human characteristics of learning

[-]

ConversationLow9545@reddit

>The point behind 1,0,-1 was that the professor could present us a new concept using those simple numbers alone and do it just once, and we got the idea immediately (or those who did not could ask for more explanations) and could apply it later to whatever numbers we might encounter. LLMs usually need more patterns with different numbers to "grasp" the concept, and still sometimes get confused when variable names are switched around (e.g. can solve an exercise about Peter and Anna, and mess up the same exercise about John and Mary).

baseless, LLMs r great at generalization and solve really hard puzzles now

[-]

a_curious_martin@reddit

Yes, "understanding" is quite vague word. Still, even a first-grader can feel when they do not understand something and admit it. We are very aware of our own thought process and have the intuitive sense of being stuck and unable to solve a task properly. LLMs don't ask "Wait, I don't get it how it works, can you explain me more logic about this specific step?"

The point behind 1,0,-1 was that the professor could present us a new concept using those simple numbers alone and do it just once, and we got the idea immediately (or those who did not could ask for more explanations) and could apply it later to whatever numbers we might encounter. LLMs usually need more patterns with different numbers to "grasp" the concept, and still sometimes get confused when variable names are switched around (e.g. can solve an exercise about Peter and Anna, and mess up the same exercise about John and Mary).

The question is less about if LLMs can or cannot learn something. If we throw "everything" at them and give them endless resources, they might achieve superpowers. It's more about efficiency. Can we make them work at least as reliably as an intelligent professional person at the peak of their performance without spending energy as an entire country? That's where we would need architecture changes to find a way to ingrain the world model and causality at the core with minimum possible weights, and then be able to throw more information as needed to create domain experts.

[-]

goldlord44@reddit

I was talking to a quant the other day. He genuinely believed that a reasoning model was completely different to a normal llm and that it had a specific real logical reasoning that could deterministically think through logical problems baked into the model. It was wild to me.

He was specifically saying to me that an LLM can't compete with a reasoning model because the former is simply a stochastic process, whereas the latter has this deterministic process. Idk where tf he heard this.

[-]

ConversationLow9545@reddit

reasoning models are LLMs lol

[-]

goldlord44@reddit

That was why I was not sure about his knowledge lol

[-]

llmentry@reddit

I was talking to a quant the other day.

This is why it's always better to talk to a bf16 model.

[-]

PeachScary413@reddit

I can't even imagine how frustrating it must be to be a neuroscientist doing studies on the brain rn. With all the tech bros running around asserting confidently that the brain is basically just an LLM and throwing around wild statements about how basic it is and shit lmaoo

[-]

ConversationLow9545@reddit

as if brain is not based on predictive coding huhh? neuroscientists themselves found it.

and AI is not trivial either

[-]

SuccessfulTell6943@reddit

I want to mention that the whole "Apple is just incompetent they can't make a better siri" argument is just... not a good one.

Apple and it's competitors know that voice assistants were mostly just a bad idea in the first place. People generally tend to avoid using voice assistants even when there is better software out there, I think there is a good reason basically zero companies have made efforts at their own and that Apple has essentially made it a legacy offering at this point, nobody really wants it.
What exactly will Apple do with an LLM anyways? Make an onboard chatgpt/Google competitor? There really isn't a use-case for Apple that wouldn't be better served by allowing some other company to do the hard work and then offering it as a service on their devices. It's like somebody asking why Apple never made a Google competitor, or Facebook, or whatever technology have you. It just doesn't make sense because there is nothing particular to their product lines that having an LLM on top of improves.

[-]

ConversationLow9545@reddit

Do u know why Gemini (not just the chatbot app, but an AI assistant) operates on Android and not on iOS, and why Apple is now begging Google to incorporate Gemini's architecture to modify Siri according to iOS? Because Gemini is linked to Android, both are owned by Google.
If Apple wants to make an AI assistant akin to Gemini, they have to make their own associated

[-]

Interesting8547@reddit

Having an AI assistant on your phone is actually a good idea. Though I like to chat with the AI, not to talk with it... but some people might like to talk.

[-]

t3h@reddit

It's a valid argument in terms of "company Z is doing X because they can't Y", like Anthropic's "we need more regulation of AI" because they're scared of not being able to compete in a free market.

In terms of a research paper, writing it off with allegations of motivation isn't a counter-argument. You need to criticise the actual claims made in the paper.

[-]

SuccessfulTell6943@reddit

I don't think you can even say that Apple HAS a motivation other than to publish findings. It's not like they are in any way a direct competitor to OpenAI/Anthropic/Google in the software space. They are a luxury personal computing company that has some base software suite. So really the argument that they have an agenda seems like it's far reaching for some sort of malicious attribution of intention.

[-]

t3h@reddit

Well if you can't actually understand the paper, or how LLMs work, it's all you've got to go on...

[-]

SuccessfulTell6943@reddit

I'm going to be honest I have no idea if you're agreeing with me or not.

[-]

GatePorters@reddit

The thing is. Reasoning isn’t supposed to be thoughts. It is explicitly just output with a different label.

Populating the context window with relevant stuff can increase the fitness of the model in a lot of tasks.

This is like releasing a paper clarifying that Machine Learning isn’t actually a field of education.

[-]

ASYMT0TIC@reddit

As though you actually know what a thought is, physically.

[-]

GatePorters@reddit

Check out the other comments in this thread

[-]

ConversationLow9545@reddit

no one knows what thought is physically either

[-]

Potential-Net-9375@reddit

Exactly this holy hell I feel like I'm going insane. So many people just clearly don't know how these things work at all.

Thinking is just using the model to fill its own context to make it perform better, it's not a different part of the ai brain metaphorically speaking, it's just the ai brain taking a beat to talk to itself before choosing to start talking out loud

[-]

chronocapybara@reddit

Keep in mind this whitepaper is really just Apple circling the wagons because they have dick for proprietary AI tech.

[-]

threeseed@reddit

One of the authors is the co-author of Torch.

On top of which most of AI was designed and built.

[-]

DrKedorkian@reddit

...And? Does this mean they don't have dick for propietary AI tech?

[-]

threeseed@reddit

It means that when making claims about him you should probably have a little more respect.

Given that you know AI wouldn't exist today without him.

[-]

bill2180@reddit

Or he’s working for the benefit of his own pockets.

[-]

threeseed@reddit

You don't work for Apple if you want to make a ton of money.

You run your own startup.

[-]

bill2180@reddit

Uhhhh what kind of meth you got over there, have you heard of FAANG. The companies everyone is software wants to work for because of the pay and QoL they have. FAANG=FaceBook, Apple, Amazon, Netflix, Google.

[-]

threeseed@reddit

I worked as an engineer at both Apple and Google.

If you want to make real money you run your own startup.

[-]

ConversationLow9545@reddit

Yeah both pay peanuts, horrible

[-]

obanite@reddit

It's really sour grapes and comes across as quite pathetic. I own some Apple stock, and that they spend effort putting out papers like this while fumbling spectacularly on their own AI programme makes me wonder if I should cut it. I want Apple to succeed but I'm not sure Tim Cook has enough vision and energy to push them to do the kind of things I think they should be capable of.

They are so far behind now.

[-]

ninjasaid13@reddit

It's really sour grapes and comes across as quite pathetic.

it seems everyone whining about this paper is doing that.

[-]

-dysangel-@reddit

they're doing amazing things in the hardware space, but yeah their AI efforts are extremely sad so far

[-]

KrayziePidgeon@reddit

What is something "amazing" apple is doing in hardware?

[-]

-dysangel-@reddit

The whole Apple Silicon processor line for one. The power efficiency and battery life of M based laptops was/is really incredible.

512GB of VRAM in a $10k device is another. There is nothing else anywhere close to that bang for buck atm, especially off the shelf.

[-]

KrayziePidgeon@reddit

Oh, that's a great amount of VRAM for local LLM inference, good to see it, hopefully it makes Nvidia step it up and offer good stuff for the consumer market.

[-]

-dysangel-@reddit

I agree, it should. I also think with a year or two more of development we're going to have really excellent coding models fitting in 32GB of VRAM. I've got high hopes for a Qwen3-Coder variant

[-]

MoffKalast@reddit

Apple: Stop having fun!

[-]

Cute-Ad7076@reddit

The commenter wrote a point you agree with, but not all of it therefore he’s stupid. But wait, hmmmm-what if it’s a trap. No I should disagree with everything they said, maybe accuse them of something. Yeah that’s a plan Nu-uh

[-]

scoop_rice@reddit

You’re absolutely right!

[-]

-dysangel-@reddit

I'm liking your vibe!

[-]

dashingsauce@reddit

I’ll delete the code we’ve written so far and start fresh with this new vibe.

Let’s craft!

[-]

-dysangel-@reddit

I've mocked the content of everything so that we don't have to worry about actually testing any of the real implementation.

[-]

dashingsauce@reddit

Success! All tests are now passing.

We’ve successfully eliminated all runtime dependencies, deprecated files, and broken tests.

Is there anything else you’d like help with?

[-]

dhamaniasad@reddit

Ikr? Apple had another paper a while back that was similarly critical of the field.

It feels like they’re trying to fight against their increasing irrelevance, with their joke of an assistant Siri and their total failure Apple intelligence, now they’re going “oh but AI bad anyway”. Maybe instead of criticising the work of others Apple should fix their own things and contribute something meaningful to the field.

[-]

MutinyIPO@reddit

People don’t know how they work, yes, but part of that is on companies like OpenAI and Anthropic, primarily the former. They’re happily indulging huge misunderstandings of the tech because it’s good for business.

The only disclaimer on ChatGPT is that it “can make mistakes”, and you learn to tune that out quickly. That’s not nearly enough. People are being misled and developing way too much faith in the trustworthiness of these platforms.

[-]

GatePorters@reddit

Anthropic’s new circuit tracing library shows us what the internal “thoughts” actually are like.

But even then, those map moreso to subconscious thoughts/neural computation.

[-]

SamSlate@reddit

interesting, how do they compare to the reasoning output?

[-]

GatePorters@reddit

It’s just like node networks of concepts in latent space. It isn’t readable.

Like they can force some “nodes” to be activated or prevent them from being activated and then get some wild outputs.

[-]

clduab11@reddit

Which is exactly why Apple's paper almost amounts to jack shit, because that's exactly what they tried to force these nodes to do in latent, sandboxed space.

It does highlight (between this and the ASU paper "Stop Anthropomorphizing Reasoning Tokens" that we need a new way to talk about these things, but this paper doesn't do diddly squit as far as take away from the power of reasoning modes. Look at Qwen3 and how its MoE will reason on its own when it needs to via that same MoE.

[-]

Ok-Kaleidoscope5627@reddit

The way I prefer to think about it is that people input suboptimal prompts so the LLM is essentially just taking the users prompt to generate a better prompt which it then eventually responds to.

If you look at the "thoughts" they're usually just building out the prompt in a very similar fashion to how they recommend building your prompts anyways.

[-]

clduab11@reddit

insert Michael Scott "THANK YOU!!!!!!!!!!!!!!!!" gif

[-]

jimmiebfulton@reddit

Is this context filling happening during the inference, Kinda like a built-in pre-amp, or is it producing context for the next inferencing's context?

[-]

silverW0lf97@reddit

Okay but what is thinking really then? Like if I am thinking something I too am filling up my brain with data about the thing and the process to which I will use it for.

[-]

aftersox@reddit

I think of it as writing natural language code to generate the final response.

[-]

The-Dumpster-Fire@reddit

Wow, no way! You’re telling me the evolution simulator / gzip hybrid isn’t putting its model through college?

[-]

stddealer@reddit

It's literally just letting the model find a way to work around the limited compute budget per token. The actual text generated in the "reasoning" section is barely relevant.

[-]

X3liteninjaX@reddit

I’m a noob to LLMs but to me it seemed reasoning solved the cold start problem with AI. They can’t exactly “think” before they “talk” like humans.

Is the compute budget for reasoning tokens different than the standard output tokens?

[-]

stddealer@reddit

No, the compute budget is the same for every token. But the interesting part is that some of the internal states computed when generating or processing any token (like the key and query vectors for the attention heads) are kept in cache and are available to the model when generating the following token.

Which means that some of the compute used to generate the reasoning tokens is reused to generate the final answer. This is not specific to reasoning tokens though, literally any tokens in between the question and the final answer could have some of their compute be used to figure out a better answer. Having the reasoning tokens related to the question seems to help a lot, and avoids confusing the model.

[-]

fullouterjoin@reddit

Is this why I prefill the context by asking the model to tell me about what it knows about domain x in the direction y about problem z, before asking the real question?

[-]

stddealer@reddit

I believe it could help, but it would probably be better to ask the question first so the model knows where you're getting at, but then ask the model to tell you what it knows before answering the question.

[-]

fullouterjoin@reddit

Probably true, would make a good experiment.

Gotta find question response pairs with high output variance.

[-]

yanes19@reddit

I don't think that helps either, since the answer to the actual question is generated from scratch the only benefibis it can guide general context , IF your model have access to message history

[-]

fullouterjoin@reddit

What I described is basically how RAG works. You can have an LLM explain how my technique modifies the output token probabilities.

[-]

-dysangel-@reddit

similar to this - if I'm going to ask it to code up something, I'll often ask its plan first just to make sure it's got a proper idea of where it should be going. Then if it's good, I ask it to commit that to file so that it can get all that context back if the session context overflows (causes problems for me in both Cursor and VSCode)

[-]

exodusayman@reddit

Well explained, thank you.

[-]

MoffKalast@reddit

There's an old blog post from someone at OAI with a good rundown of what's conceptually going on.

The bottom line is, the current architecture can't really draw conclusions based on latent information directly (it's most analogous to fast thinking where you either know the answer instantly or don't), they can only do that on what's in the context. So the workaround is to first dump everything from the latent space into the thinking block, and then work with that for much superior results.

[-]

CheatCodesOfLife@reddit

The actual text generated in the "reasoning" section is barely relevant.

You tried the original R1 locally? The reasoning chain is often worth reading there (I know it's not really thinking, etc).

[-]

stddealer@reddit

The original R1 is a little too big for my local machines, but I didn't say that the content of the reasoning chain is useless or uninteresting. Just that it's not very relevant when it comes to explaining why it works.

But there's definitely a reason why they let the model come up with the content of the reasoning section instead of just putting some padding tokens inside it, or repeating the users question multiple times.

[-]

AppearanceHeavy6724@reddit

Yet I learn more from R1 traces, than actyal answers.

[-]

CheatCodesOfLife@reddit

Yet I learn more from R1 traces, than actyal answers

Same here, I actually learned and understood several things by reading them broken down to first principles in the R1 traces.

[-]

The_Shryk@reddit

Yeah it’s using the LLM to generate a massive and extremely detailed prompt, then sending that prompt to itself to generate the output.

In the most basic sense

[-]

Commercial-Celery769@reddit

I learn alot about whatever problem I am using an LLM for by reading the thinking section and then the final answer, the thinking section gives a deeper insight to how its being solved

[-]

dagelf@reddit

TL;DR The Illusion referred to in the paper is the tags, that doesn't reason formally, but just pre-populates the model context for better probabilistic reasoning.

[-]

GatePorters@reddit

Oh so I just summarized the paper by clarifying what the title means?

I guess they named it that on purpose as an in-joke

But that leads the media to say so many wrong things and then the average Joe will just regurgitate the weirdest talking points “straight from the mouths of the experts”

[-]

TheRealGentlefox@reddit

Yeah, we already have evidence that they can fill their reasoning step at least partially with nonsense tokens and still get the performance boost.

I would imagine it's basically a way for them to modify their weights at runtime. To say "Okay, we're in math verification mode now, we can re-use some of these pathways we'd usually use for something else." Blatant example would be that if my prompt starts with "5+4" it doesn't even have time to recognize that it's math until multiple tokens in.

[-]

-dysangel-@reddit

the first token is actually kind used as an "attention sink". So I would guess starting with things like "please", "hi" or something else that isn't essential to the prompt probably helps output quality. Though I've not tested this

https://www.youtube.com/watch?v=Y8Tj9kq4iWY

[-]

Jawzper@reddit

The thing is that everyone and their grandma seem to be convinced that AI is about to become sentient because it learned how to "think". We need research articles like this to shove in the faces of such people as evidence to bring them back to reality, even these things are obvious to you and me. That's the reason most "no shit, Sherlock" research exists.

[-]

MINIMAN10001@reddit

Inversely populating the context window with irrelevant stuff can decrease the fitness of the model in a lot of tasks. IE Discuss one subject and transition subjects in a different field. It will start referencing the previous material even though it is entirely irrelevant.

[-]

Educational_News_371@reddit

I dont get why people are dissing on this paper. Nobody cares what ‘thinking’ means, people care about the efficacy of thinking tokens for a desired task.

And thats what they tried to test, how well the models do across tasks of varying level of complexity. I think the results are valid, and thinking tokens doesn’t really do much for problems which are very complex. It might also ‘overthink’ and waste tokens for easier problems.

That being said, for easier to mid level problems, thinking tokens provide relevant context and are better than models with no reasoning capabilities.

They confirmed through experiments all of this which we already know.

[-]

TimeTravelingBeaver@reddit

Fr

[-]

DinoAmino@reddit

Except ... it's a meme and nowhere close to a paper. And ... this place is full of hype and delusion. Considering how so many here reference and post YouTube links this is probably a more effective PSA.

[-]

TheManni1000@reddit

your brain is also just doing math lol

[-]

WinDrossel007@reddit

I guess 99% of population doesn't reason as well

[-]

SupeaTheDev@reddit

I think the thoughts humans create is just maths in the neuronal/physics level. Max Tegmark has a brilliant book on how the whole Universe is probably Math

[-]

Lazyworm1985@reddit

It makes you think; Is our brain also just predicting the next token? A biological computer, nothing more. And somewhere something cyclical giving the experience of consciousness. Some wave function that recollapses on the quantum level again and again, as long as the blood flow is intact to provide energy? Who knows? I certainly don’t.

[-]

no_brains101@reddit

I would assume they all knew this, but apple was the only one with a bad enough AI to have to point it out.

[-]

e33ko@reddit

AI companies know this but don’t want to impose the Ratner effect on themselves by being forthright.

If modern AI models can’t actually reason, then it means we’re back to square one at least as far as AGI is concerned.

Then all of this just becomes a repeat of expert systems from the ‘80s

[-]

economic-salami@reddit

As if human brain isn't just a bunch of on and off cells.

[-]

kaba40k@reddit

Human brain reasons differently though.

Being "a bunch of on and off cells" (which it isn't, but let's for a minute suppose it is) does not imply equivalence. You could also say that all software is just bytes, but it doesn't make all software equivalent or "reasoning".

[-]

economic-salami@reddit

Do we even know how 'we' reason? It's a black box all the same (I would venture further and say that we know more about how AI 'reasons' than how humans 'reason'). In this case the idiom that applies is: what acts and looks like a duck is a duck.

[-]

kaba40k@reddit

Well, if by "we" you mean "we, humanity", then yes, there's a lot of works on how we humans reason, how we develop concept models, and in particular on abstract reasoning.

Abstract reasoning in humans has greatly improved even in the latest ~2-3 thousand years, where written materials were already available, so we have some nice evidence to base the research on.

For "AI reasoning" the mathematical models were developed pretty much in 1960-es (neural networks, attention idea), although a real breakthrough happened with the "all you need is attention" article relatively recently. So we have also pretty much a good idea of how that works.

There's a nuance here of course. We know the general principles of how neural networks are built, but once a NN is trained it's not practically possible to "reverse engineer" it, i.e. figure out how it makes one or the other conclusion given a set of inputs. In this sense you can hear sometimes people say that we don't know how NN work, but that's a different level of "not knowing". We definitely know how the math works, I studied this in the university quite extensively, and that was already more than twenty years ago.

In that sense, neural network-based AI does not really look or quack like a duck, or in any case exactly like a duck.

[-]

Altruistic_Heat_9531@reddit

I will add another point,

Most of user actually hate waiting for reasoning, they prefer just to have their answer fast
Based on point 1, actually most of user ask for simple question rather than high level stuff most of the time.
Tool usage and vision is much more important than reasoning model.
You can turn a non reasoning model to a semi reasoning model with n-shot prompting and RAG

[-]

iMADEthisJUST4Dis@reddit

Can u explain point 3

[-]

Altruistic_Heat_9531@reddit

Usually new breed LLM already has capability in tool/ function calling. Where it can connect to DB or use any program really as long as you provide a correct interface. I prefer this since i just made couple of tool like document summarizer or writter that can connect to for example LaTeX compiler to make me document with bunch of chart. And actually this can be usefull for apps since the LLM can connects to company database and act as QA without training and preparing for BERT model.

And for vision, i mostly use it for OCR

[-]

BusRevolutionary9893@reddit

I rather wait for a correct answer than get a wrong one quickly. I won't even use a non thinking model for a question that requires the model to do searches.

[-]

dagelf@reddit

Funny story, more often than not the answer without reasoning is better, only exception I've found is for programming tasks.

[-]

BusRevolutionary9893@reddit

No.

[-]

No_Wind7503@reddit

Real, tool usage is really underrated I haven't seen any advanced features for it, although it is a very powerful feature

[-]

damienVOG@reddit

Right for me I either want the answer fast, or I'm willing to wait quite a while for it to reason. Like 5 to 10 minutes. Not a lot where I'd prefer the in between for.

[-]

panchovix@reddit

Wondering if there's a way to disable thinking/reasoning on Deepseek R1. Just to try a "alike" DeepSeekV3 0528.

[-]

Altruistic_Heat_9531@reddit

try typing "/no_think" in system prompt or user prompt itself

[-]

random-tomato@reddit

um... that's only for Qwen 3 models??

[-]

Altruistic_Heat_9531@reddit

well worth trying

[-]

SlaveZelda@reddit

Doesn't work, even on the deepseek qwen distill.

[-]

EricForce@reddit

There is! Most front ends allow you to pre-fill the next response for the AI to go off from. It's seriously as easy as putting a </think> at the start. A few front ends even offer this as a toggle and do it in the background.

[-]

Altruistic_Heat_9531@reddit

I prefer a strong multi turn tool / function calling rather than reasoning

[-]

zelkovamoon@reddit

It's all just math... Like the universe you mean? Your and my brains? LLMs too.

[-]

wrecklord0@reddit

This is my gripe with all the criticisms of neural networks. It's not real AI, because (take your pick): "It's just pattern matching", "It's just linear equations", "It's just combining learned data"

Maybe so. But first, you will have to prove that your brain does anything different, otherwise the argument is moot.

[-]

wowzabob@reddit

Saying “well humans are no different,” is not a real response to these types of critiques, it is just obfuscation. It is just an assertion made with no evidence.

[-]

wrecklord0@reddit

Well I don't necessarily disagree, but 99.999% of humans do not produce novel theories in various fields, yet are still considered intelligent. It is difficult to argue about a concept that we have no understanding of.

[-]

wowzabob@reddit

The fact that the vast majority of people don’t is not necessarily because their brains are physically incapable of it though.

[-]

threeseed@reddit

There has been research that the brain may use quantum computing.

In which case this is far beyond what AI is capable of within our lifetime at least.

[-]

sage-longhorn@reddit

I'm the first to say we need big architecture improvements for AGI. But:

It's just linear equations

Is blatantly false. The most basic understanding of the theory behind artificial nueral nets will tell you that if it were all linear equations then all nueral nets could be reduced to a single layer. Each layer must include a non-linear component to be useful, commonly a ReLU nowadays

[-]

wrecklord0@reddit

Technically true, but you get the gist

[-]

ColorlessCrowfeet@reddit

Unfortunately, the gist isn't worth getting.

[-]

zelkovamoon@reddit

The funny part to me is that people think they can even. Like we don't understand the human brain, and even the best in the world AI researchers can't tell you how exactly an LLM arrives at some conclusion, usually. But everybody is an expert on reddit.

[-]

dagelf@reddit

Math is just a syntax, a language. It can describe things, and looking at things from a different angle, shows possibilities not immediately obvious. Closer to the truth: everything is just geometry.

[-]

saantonandre@reddit

Good, so by any means should we anthropomorphize the following code?
```js
const a = 1;
const b = 1;

console.log(`I'm sentient. ${a} + ${b} equals ${a + b}.`).
```
It's like us (just math) but it is not limited by organic bounds.
Who knows what this code snippet will be able to do in five years?

[-]

zelkovamoon@reddit

Hilarious

[-]

RQManiac@reddit

Convenient timing for the paper, just when Apple is falling o far behind in the AI race

[-]

emergent-emergency@reddit

Isomorphisms man, isomorphisms…

[-]

had12e1r@reddit

It's all just statistics

[-]

dani-doing-thing@reddit

Read the paper (not just the abstract), then read this:

https://www.seangoedecke.com/illusion-of-thinking/

[-]

WeGoToMars7@reddit

Thanks for sharing, but I feel like this criticism cherry-picks one of its main points.

Apart from the Tower of Hanoi, there were three more puzzles: checker jumping, river crossing, and block stacking. Tower of Hanoi requires the order of 2^n moves, so 10 blocks is indeed a nightmare to follow, but the other pizzles require on the order of n^2 moves, and yet the models start to fail much sooner (as low as n=3 for checkers and river crossing!). I don't think it's unreasonable for a "reasoning" model to keep track of a dozen moves to solve a puzzle.

Besides, the same AI labs for which "puzzles weren't a priority" lauded their results on ARC-AGI, which is also based on puzzles. I guess it's all about which narrative is more convenient.

[-]

dani-doing-thing@reddit

The paper only shows how models reinforced to solve some kind of problems that require reasoning fail to solve some puzzles. Is an interesting paper as another benchmark for models, that's it.

I bet someone could take Qwen3-0.6B and use GRPO to train it to solve this exact same puzzles as a weekend project...

[-]

TheRealMasonMac@reddit

Right, but that's the point. Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"

[-]

FateOfMuffins@reddit

However that appears to be the conclusion by many with regards to benchmarks (courtesy of ARC AGI's Chollet's criteria for AGI - when we can no longer create benchmarks where humans outperform AI):

Make every benchmark a target and benchmax every measure. Once we've exhausted all benchmarks, and any new benchmarks we try to create get saturated almost instantly after, then we conclude we have achieved AGI.

[-]

TheRealMasonMac@reddit

Pretty sure that would be ASI, not AGI, no?

[-]

Snow-Silent@reddit

You can argue that if one model can do every benchmark we make that is AGI

[-]

TheRealMasonMac@reddit

If you create an AI that can solve all problems known to be solveable by mankind, that is by colloquial "definition" an ASI. Otherwise, applying the definition of ASI is impossible as no human can measure the intelligence of the AI at that point.

[-]

FateOfMuffins@reddit

Not my definition of AGI, that's Chollet's

[-]

dani-doing-thing@reddit

All models generalize up to a point, we train models to perform well in a particular area because training models to perform well on everything require bigger models, probably bigger than the models we have today.

I see no hard line between reasoning or not reasoning depending on how broadly the model is able to generalize the training data to unseen problems. And sure, is going to be based on patterns, is how humans learn and solve problems too... How do you recognize a problem and a possible solution if it's not based on your previos experience and knowledge?

[-]

Live_Contribution403@reddit

The problem is, that you dont know if your model memorized the solution or was able to generalize the principle behind the solution, so that it can be used for other instances in a different context. The paper at least to some extend seems to show exactly this. Memorization from the training data is probably the reason it performed better on the towers of hanoi, then the other puzzles. This means the models do not generate a generalized capability to be good puzzle solvers, they just remember the necessary training samples, which are compressed in their parameter space.

[-]

TheRealMasonMac@reddit

From my understanding, what they mean is that models are memorizing solution paths learned through training rather than adapting to the current problem. Similar to what is noted in https://vlmsarebiased.github.io/

[-]

Live_Contribution403@reddit

If you use your test data as training data, your model will always better perform when you feed it the same data again for testing. Because it has seen the data already and can just memorize it, especially with a large enough parameter space. The problem is then, that your test data became worthless in testing the generalization capability of your model. Thats why it is normally one of the most basic rules in Data Science, that you dont want to pollute your training data with your test data.

[-]

fattylimes@reddit

“they say i can’t speak spanish but give me a weekend and i can’t memorize it phonetically!”

[-]

t3h@reddit

Or Chinese perhaps?

[-]

llmentry@reddit

Taking a closer look at the Apple paper (and noting that this is coming from a company that has yet to demonstrate success in the LLM space ... i.e. the whole joke of the posted meme):

There is a serious rookie error in the prompting. From the paper, the system prompt for the Tower of Hanoi problem includes the following:

When exploring potential solutions in your thinking process, always include the corresponding complete list of moves.

(My emphasis). Now, this appears to be poor prompting. It's forcing a reasoning LLM to not think of an algorithmic solution (which would be, you know, sensible) and making it manually, pointlessly, stupidly work through the series of manual steps.

The same prompting error applies to all of the "puzzles" (the quoted line above is present in all of the system prompts).

I was interested to try out the problem (providing the user prompt in the paper verbatim) on a model without a system prompt. When I did this with GPT-4.1 (not even a reasoning model!), giving it an 8 disc setup, it:

Correctly tells me that the problem is the Tower of Hanoi problem (I mean, no shit, sherlock)
Tells me the simple algorithm for solving the problem for any n
Shows me what the first series of moves would look like, to illustrate it
Tells me that to do this for 8 disks, it's going to generate a seriously long output (it tells me exactly how many moves it will involve) and take a very long time -- but if I really want that, to let it know -- and if so, what output format would I like it in?
Tells me that if I'd prefer, it can just write out code, or a function, to solve the problem generically for any number of discs

Even though the output is nothing but obsequious politeness, you can almost hear the model rolling its eyes, and saying, "seriously??"

I don't even use reasoning models, because I actually agree that they don't usefully reason, and don't generally help. (There are exceptions, of course, just not enough to justify the token cost or time involved, in my view.) But this facile paper is not the way to prove that they're useless.

All it's showing is that keeping track of a mind-numbingly repetitive series of moves is difficult for LLMs; and this should surprise nobody. (It's sad to say this, but it also strongly suggests to me that Apple still just doesn't get LLMs.)

Am I missing something here? I'm bemused that this rather unimaginative paper has gained so much traction.

[-]

michaelsoft__binbows@reddit

This is why prompting/prompt engineering is the new hotness. Stuff like tracking state can be a game changingly good prompt for other use cases.

A surprising amount of value can be brought by trying to cut through the right abstractions and starting a brainstorming session with optimized conceptual framing. Prompting is an art form like architecture for large systems and inventing new UX patterns.

[-]

llmentry@reddit

But this isn't a question of prompt engineering. This is just an unforced error.

The researchers appear to have wanted a simple measure of model performance, and in doing so actually took away the model's capability to reason effectively. What was left, what the researchers were testing here, was nothing akin to reasoning.

This is a perfect example of why I think prompt engineering often does more harm than good. With some minor exceptions, I tend to give a model its head, and keep any system prompt instructions to a minimum.

[-]

Thick-Protection-458@reddit

But this is not a error. They wanted to check its ability to follow n steps. They tried to enforce it.

[-]

llmentry@reddit

If so, then they were trying to assess reasoning ability by literally preventing the models from reasoning. The point of reasoning CoT is to find new ways to solve a problem or answer a question, not to brute force a scenario by repeating endless, almost identical steps ad infitum (something we already knew LLMs were bad at). That's beyond stupid.

Mindlessly reproducing a series of repetitive steps is not reasoning. Not for us, not for LLMs.

[-]

michaelsoft__binbows@reddit

You seem to have a different definition of what prompt engineering is than me. I agree with your notion that less is usually better. But you seem to be insinuating that prompt engineering means constructing large prompts, but what I use it to describe is just the pragmatic optimization of the prompt for what we want to achieve.

I don't really like the term really but have to admit it's sorta sound. We try different prompts and try to learn and explain which approaches work better. Maybe we don't have enough of a body of knowledge to justify calling it engineering, but I guess I'll allow it.

[-]

llmentry@reddit

Ah, fair enough, that makes more sense -- and you're absolutely right. I've just seen too many recent examples of prompts becoming overly-complicated and counter-productive, and I've started to associate prompt engineering with this (which it's not). My bad!

[-]

MoffKalast@reddit

†Work done during an internship at Apple.

It's just some interns doing whatever, it's only got cred because Apple's trademark is attached to it.

[-]

llmentry@reddit

The other first author (equal contribution) is not listed as an intern. All six authors' affiliations are simply given as "Apple" (no address, nothing else -- seriously, the hubris!) All authors' emails apple.com addresses.

So, Apple appears fully behind this one -- it's not just a rogue intern trolling.

[-]

Revolutionary-Key31@reddit

" I don't think it's unreasonable for a "reasoning" model to keep track of a dozen moves to solve a puzzle."
Did you mean it's unreasonable for language model to keep track of 12+ moves?

[-]

WeGoToMars7@reddit

There is a double negative and a pun there, haha. No, I mean that the model should be expected to do shorter puzzles, unlike requiring to list the exact sequence of 1023 steps for solving the Tower of Hanoi.

[-]

t3h@reddit

That is an utterly ridiculous article.

It starts off with a bunch of goalpost shifting about what "reasoning" really means. It's clear he believes that if it looks smart, it really is (which actually explains quite a lot here).

Next, logic puzzles, apparently, "aren't maths" in the same way that theorems and proofs are. And these intelligent LLMs that 'can do reasoning', shouldn't be expected to reason about puzzles and produce an algorithm to solve them. Because they haven't been trained for that - they're more for things like writing code. Uhhh....

But the most ridiculous part is - when DeepSeek outputs "it would mean working out 1023 steps, and this is too many, so I won't", he argues "it's still reasoning because it got that far, and besides, most humans would give up at that point too".

This is the entire point - it can successfully output the algorithm when asked about n=7, and can give the appearance of executing it. Ask about the same puzzle but with n=8 and it fails hard. The original paper proposes that it hasn't been trained on this specific case, so can't pattern match on it, despite what it appears to be doing in the output.

He's got a point that 'even providing the algorithm, it still won't get the correct answer' is irrelevant as it's almost certainly in the training set. But this doesn't actually help his argument - it's just a nit-pick to provide a further distraction from the obvious point that he's trying to steer your attention away from.

And then, with reference to 'it's too 'lazy' to do the full 1023 steps', when DeepSeek provides an excuse, he seems to believe it at face value, assigning emotion and feelings to the response. You really believe that a LLM has feelings?

He re-interprets this as "oh look how 'smart' it is, it's just trying to find a more clever solution - because it thinks it's too much work to follow an algorithm for 1023 steps - see, reasoning!". No, it's gone right off the path into the weeds, and it's walking in circles. It's been trained to give you convincing excuses when it fails at a task - and it worked, you've fallen for them, hook line and sinker.

Yes, it's perfectly reasonable to believe that a LLM's not going to be great at running algorithms. That's actually quite closely related to the argument the original paper is making. It gives the appearance of 'running' for n=7 and below, because it's pattern matching and providing the output. It's not 'reasoning', it's not 'thinking', and it's not 'running', it's just figured out 'this output' is what the user wants to see for 'this input'

It's pretty obvious, ironically, the author of that article is very much deploying 'the illusion of reasoning'.

[-]

Nulligun@reddit

I disagree with almost everything he said except for point 3. He is right that if apple was better at prompt engineering they could have gotten better results.

[-]

CHG__@reddit

I mean it's potentially all just math, or at least math is the best representation of reality we've come up with so

[-]

SX-Reddit@reddit

Attention is all Apple needs.

They need to earn some credit in the area first.

[-]

Jemainegy@reddit

I hate these but AI doesn't actually do anything posts. It's such a flawed oppinion. It's an information it's an information carrier and retrieval system and generative tool. It's such a throw away mentality. Like yeah no doi it can't think. But that doesn't stop large data companies from reducing the busy work of analysts by more then 80%. Yeah it's not thinking but that does not mean it's not outperforming normal people across the board on tons of fields including for a lot of people reading and writing. Yeah it's math, and you know what that math is going to completely change Hollywood in the next 2 years. Like literally everything is math, using math as a redundancy is in itself redundant. These damn kids and their flippy phones and their interwebs, I have all I need right here in the only book I need.

[-]

I2obiN@reddit

I think people struggle with the concept that math, especially formulas/algorithms, are not just lots of sums adding things up.

You can explain what linear regression is to someone, you can explain the formulas used, the hardest part I find is to explain why it's being used and why it gives us the result we want because that means the person innately has to understand that mathematics can conceptually model almost anything.

A side effect I've found of this AI boom is that people now think "if we give something lots of data then computers can magically guess things pretty well".

[-]

Current-Ticket4214@reddit (OP)

Cool story bro. Back to work building this agent that consumes inference from my local AI server.

[-]

Jemainegy@reddit

Oh man I hate that I have to sharpen my knives regularly, I never needed to sharpen my knife when I did all my cutting with rocks.

[-]

Murph-Dog@reddit

Then you begin to contemplate, how do our own neurons activate to store, access, and associate data?

Strengthening and weakening connections between themselves at synapses, probabilistic reasoning, like some type of mathematical weighting and matrix transformation.

...wait a second...

[-]

SSeThh@reddit

Yeah, big brain, but it is still not the whole picture. There are multiple brain domains that are interconnected for making «reasoning». it’s not enough just throwing “neurons” until they work out

[-]

Claxvii@reddit

Guys, the term TEST TIME COMPUTE is almost a year old now. People have been hinting at this since forever. Still we don't understand shit about llms. In the MATHEMATICAL sense too.

[-]

Kitchen_Werewolf_952@reddit

Everyone knows the model thinking isn't actually thinking but statistically we know that it certainly helps a lot.

[-]

bossonhigs@reddit

Isn't our own reasoning just ...math. Often bad, erroneous and chaotic.

The smartest of us, with brains in best shape and high IQ can answer any question because they are good at learning and memorizing. The worst of us, with low IQ, often don't even think. They just go around without discussion in their brain. (this is sadly true)

At the end, whatever we create, is a reflection of us self.

Let models constantly random hallucinate on low level, and there is your thinking. Add a camera to that, smell and touch senors, audio recording so it can look around and be aware of environment it is, and there you go. Thinking.

[-]

Mysterious_Flow226@reddit

Lolz

[-]

carnyzzle@reddit

still applies

[-]

RobXSIQ@reddit

I thought this was settled ages ago. reasoning models are just doing thinking theater but its mostly just coming up with a roleplay on how it came to the answer it made seconds earlier before it even typed the first letter. I prefer non reasoning models as I have only noticed slowdown and token increase without giving be better results, but that is my personal experience.

[-]

Interesting8547@reddit

It depends on what you're trying to do, for roleplay and casual conversations non thinking models are usually better. For more complex tasks reasoning models are much better. The new R1 wipes the floor with V3, when reasoning tasks are involved. I mean tasks where you should think about what you're doing and not just "yapping".

[-]

Chmuurkaa_@reddit

Apple: tries making an LLM

Apple: Fails horribly

Apple: "THIS DOESN'T WORK!"

[-]

Interesting8547@reddit

... and will never work... 😁

[-]

nomorebuttsplz@reddit

It seems like a solid paper.

Haven’t done a deep dive into it yet.

Does it make any predictions that in 9 months we could look back and see if they were accurate? If not, can we not pretend they’re predicting something dire?

[-]

Current-Ticket4214@reddit (OP)

I haven’t read the entire paper, but the abstract does actually provide some powerful insight. I would argue the insights can be gleaned through practice, but this is a pretty strong confirmation. The insights:

non-reasoning models are better at simple tasks
reasoning models are better at moderately complex tasks
even reasoning models collapse beyond a certain level of complexity
enormous token budget isn’t meaningful at high levels of complexity

[-]

Interesting8547@reddit

Though it seems newer thinking models can solve more and more complex problems, so it's a matter of "iteration". I haven't seen a "hard wall" yet. Though it's true thinking models are not needed for simpler tasks.

I'm really impressed by the latest Deepseek and Qwen models. If we advance like that, after about 10 years there might not be a "thinking" task these models would not be able to do. Though creativity is still somewhat of a problem for now. It seems (sadly) the non thinking models are better for creative tasks.

[-]

kunfushion@reddit

But that level of complexity will increase and increase and increase though. So… who cares?

[-]

burner_sb@reddit

Not really. You can put it in the context of other work that shows that fundamentally the architecture doesn't "generalize" so you can never reach a magic level of complexity. It isn't really all that surprising since this is fundamental to NN architecture (well all of our ML architecture), and chain of thought was always a hack anyway.

[-]

kunfushion@reddit

You can also put it in the context of psychological work that shows that human brains don’t “generalize” fully.

So again I ask, who cares.

[-]

burner_sb@reddit

I don't really understand the hostile response. I was just saying that you can't really say that as the level of complexity increases that "reasoning" will improve. Maybe I misunderstood.

But the point here is that people do care. Trying to get to "human"-like behavior is kind of an interesting, fun endeavor, but it's more of an academic curiosity or maybe creative content generation. But there's an entire universe of agentic computing / AI replacing SaaS / agents replacing employee functions that is depending on the idea that AI is going to be an effective, generalizable reasoning platform.

And what this work is showing is that you can't just project out X months/years and say that LLMs will get there, instead you need to implement other kinds of AI (like rule-based systems) and accept fundamental limits on what you can do. And, yeah, given how many billions of dollars are on the line in terms of CapEx, VC, investment, people do care about that.

[-]

kunfushion@reddit

Sorry if I came across hostile, I’m just tired of what I deem misrepresenting of what LLMs are capable but primarily the over representing of what humans are.

I think that is the key thing. I don’t buy that LLMs are a constrained system and humans are perfectly general. Let me put that a different way. I do buy LLMs aren’t perfectly general and are constrained in some way. I dont buy that humans are perfectly general and we need our systems to be to match human level performance.

To me I just see so so so so many of the same flaws in LLMs that I see in humans. To me this says we’re on the right track. People constantly put out “hit” pieces trying to show what LLMs can’t do, but where is the “control”. Aka, humans. Ofc humans can do a lot of things better than LLMs right now, but to me, if they can ever figure out online learning, LLMs (and by LLMs I really mean the rough transformer architecture but tweaked and tinkered with) are “all we need”.

[-]

PeachScary413@reddit

The thing is, LLMs get stumped by problems in surprising ways. They might solve one issue perfectly, then completely fail on the same issue with slightly different wording. This doesn't happen with humans, who possess common sense and reasoning abilities.

This component is clearly missing from LLMs today. It doesn't mean we will never have it, but it is not present now.

[-]

kunfushion@reddit

“It doesn’t happen with humans”

… yes it absolutely does, maybe not with as simple of things, because we are more general. But it ABSOLUTELY does happen that’s ridiculous

[-]

joe190735-on-reddit@reddit

doesn't matter, as long as the LLMs don't perform at superhuman level, the rich won't buy it to replace human capital

[-]

Bakoro@reddit

The problem is that when you say "humans", you are really talking about the highest performing humans, and maybe even the top tier of human performance.

Most people can barely read. Something like 54% of people read at or below a 6th grade level. We must imagine that there is an additional band of the people above the 54%, up to some other number, maybe 60~70% who are below a high school level.
Judging from my own experience, there are even people in college who just barely squeak by and maybe wouldn't have earned a bachelor's degree 30 or 40 years ago.

I work with physicists and engineers, and while they can be very good in their domain of expertise, as soon as they step out of that, some of them get stupid quite fast, and the farther away they are from their domain, the more "regular dummy" they are. And honestly, some just aren't great to start with, but they're still objectively in the top tier of human performance by virtue of most people having effectively zero practical ability in the field.

I will concede that LLMs do sometimes screw up in ways you wouldn't expect a human to, but also I have seen humans screw up in a lot of strange ways, including having to some very sideways interpretations of what they read, or coming to spurious conclusions because they didn't understand why they read and injected their own imagined meaning, or simply thinking that a text means the opposite of what it says.

Humans screw up very badly in weird ways, all the time.
We are very forgiving of the daily fuck-ups people make.

[-]

Snoo_28140@reddit

Hopefully someone cares, so we can see progress beyond the small incremental improvements we see now. Current llms rely on brute force example providing to cover as much ground as possible. That's an issue, it makes them extremely expensive to train and severely limits their abilities to what they are trained on. Depending on your usage, you might run into these barriers. Personally, that's why I care.

[-]

MalTasker@reddit

That just proves scaling CoT tokens doesn’t solve it, not that its a fundamental issue

[-]

kmouratidis@reddit

Only if there are available data:

[LLMs] don't actually reason at all but memorize well [...]. They tested [LLMs] on problems these models have never seen before , neither existed in training data of these models before - r\/machinelearning post

If you ask them about hard and novel problems (e.g. related to new paper releases, or about fields without much open research/code/tutorials), then you're not going to get great results.

Gamedev is one such area where the advanced stuff of AAA games cannot easily be found in the wild, if at all. Finance simulations & quantitative trading will be harder. Even the relatively simpler stuff is hard for LLMs, since there aren't much data.

[-]

huffalump1@reddit

Note that these tasks are puzzles that require applying a simple algorithm over and over - very different than most headlines implying its general tasks.

The complexity is the number of steps, repetitions of the algorithm, and/or complexity/length of the algorithm required to solve the repetitive puzzles.

[-]

VihmaVillu@reddit

Classic reddit. OP sucking d**k and sharing papers right after reading abstract

[-]

Orolol@reddit

He didn't share the paper, he made a même about it.

[-]

lance777@reddit

So, provided enough computing power it can get to a point where it can "think"?

[-]

colbyshores@reddit

Now imagine if they put that kind of work in to improving Siri

[-]

SilentLennie@reddit

I wouldn't expect to much from Siri.

The US government helped fund the research at a university, then the people who worked on it at the university started a company which got bought by Apple, then from that money they started a new company and then Apple didn't know what to do with it.

[-]

burner_sb@reddit

It's worth taking a look at the Gary Marcus substack post about it for context -- Though you have to wade past his ego as per usual: https://garymarcus.substack.com/p/a-knockout-blow-for-llms

[-]

qroshan@reddit

Actually this particular post he gives a lot of credit to Subbarao Kambhipati (spelling ?), Overall good post for any objective observers

[-]

burner_sb@reddit

Yeah I didn't mean he doesn't give credit. He just always frames stuff in the context of himself. I agree it's a good post or I wouldn't have recommended it :)

[-]

SilentLennie@reddit

If you want to be lazy and get some idea of what the paper is about:

https://www.youtube.com/watch?v=fGcfJ9J_Faw

[-]

CupcakeSecure4094@reddit

Smells like an apple rage quit to me.

[-]

Hyperion141@reddit

Reasoning is not true or false, it’s a continuous variable. When you do maths, in the process of reasoning you can make mistakes. It is clear that models do reasoning but it is very abstract and shallow, and also sometimes unreasonable, but there definitely is reasoning.

[-]

jasont80@reddit

Do we think? I'm feel less sure on the daily.

[-]

randomkotorname@reddit

if it wasn't patterns it would be AGI.. we don't have AGI and Apple is coping hard cause they are terminally shit

[-]

Snoo_28140@reddit

Tell me you didnt even glance at it.... It's not about it being mathematical or not. Its not about 2 ways to view the same thing.

What it is about: lack of generalization abilities, which fundamentally limits their abilities.

[-]

dagelf@reddit

If probabilistic reasoning can give you code based on known solutions, and those code can run down a path to find an answer, the original premise that the LLM can't do that kind of falls flat, doesn't it? ... I mean, the LLM can't do it in inference, but it can write the code, run the code, read the answer... and who knows, this approach might actually help us figure out how to do the former at inference time...

[-]

Snoo_28140@reddit

I don't believe so. For AI to be general it needs to be able to generalize, not just be trained to use a tool in the same patterned and constrained and pretrained ways it does everything else. Alphacode is an example of the sort of thing that can propel AI forward. It can go beyond the known patterns - especially important for advancements in computer science.

[-]

MalTasker@reddit

No its not

https://www.seangoedecke.com/illusion-of-thinking/

My main objection is that I don’t think reasoning models are as bad at these puzzles as the paper suggests. From my own testing, the models decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start. You can’t compare eight-disk to ten-disk Tower of Hanoi, because you’re comparing “can the model work through the algorithm” to “can the model invent a solution that avoids having to work through the algorithm”. More broadly, I’m unconvinced that puzzles are a good test bed for evaluating reasoning abilities, because (a) they’re not a focus area for AI labs and (b) they require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems. I’m also unconvinced that reasoning models are as bad at these puzzles as the paper suggests: from my own testing, the models decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to start. Finally, I don’t think that breaking down after a few hundred reasoning steps means you’re not “really” reasoning - humans get confused and struggle past a certain point, but nobody thinks those humans aren’t doing “real” reasoning.

[-]

Snoo_28140@reddit

Why are you posting nonsense? That is completely unrelated to what I said. I pointed out what the paper is about. Your criticism of the paper is irrelevant to the fact that the paper is about this subject.

[-]

randylush@reddit

You’ve posted this same comment all over the place

[-]

Olangotang@reddit

Because they are the biggest Singularity death doomer on Reddit.

[-]

Current-Ticket4214@reddit (OP)

Tell me the joke went over your head without telling me the joke went over your head.

[-]

Snoo_28140@reddit

Your set up blatantly misses the point and misrepresents it.

[-]

DFEN5@reddit

Isn’t it just a model doing self prompt engineering? :p

[-]

Soggy_Wallaby_8130@reddit

Obligatory “but everything is just math, doofus!” comment.

[-]

Literature-South@reddit

Here's the kicker. Most people aren't reasoning either. They're just accessing their memory and applying the response that fits the best when prompted.

We're capable of reasoning and novel thinking, but there isn't a ton of that going on at any given time for a person.

[-]

martinerous@reddit

We are reasoning much more than it seems. For example, we know when variable names are relevant and when they are not.

If given a task, "Alice has two apples, Bob has three apples. How many apples do they have together?", we immediately know that we don't need to remember anything related to Bob and Alice. And then, if given the same task where the names are changed to Peter and Hanna, we know it's still the same task, and we don't even need to calculate, but fetch it directly from our recent memory. We are applying induction, deduction, abduction... constantly without being aware of that. LLMs do not seem to have the ability to do that. That is why we need an insane amount of training data for even quite basic tasks.

[-]

dagelf@reddit

You have clearly never tried to get a teenager to do anything. The only reasoning they do is: "you can't control me so I don't have to"

[-]

martinerous@reddit

Seems that video generation models are like teenagers - they kinda start doing what you asked for but then take their own control and add crazy stuff to the video just because they can or feel bored by your annoying request :D

[-]

reza2kn@reddit

Two responses I liked coming from Reasoning models:

Gemini 2.5 Pro:
"The paper’s findings don't prove reasoning is an illusion; they prove that probabilistic, pattern-based reasoning is not the same as formal, symbolic reasoning. It is a different kind of cognition. Calling it an "illusion" is like arguing that because a bird's flight mechanics are different from an airplane's, the bird is creating an "illusion of flight." They are simply two different systems achieving a similar outcome through different means, each with its own strengths and failure points."

DeepSeek R1:
"The Scaling Paradox Isn’t Illogical: Reducing effort near collapse thresholds could be rational: Why "think hard" if success probability is near zero? Humans give up too."

[-]

EstarriolOfTheEast@reddit

We should be careful how we ask LLM questions because they tend to respond only to what is asked, often in a way that reinforces the user's implied preferences. IMO, in consulting an LLM, we shouldn't ask them to think for us or support what we say but to find missteps and errors in our own thinking. We should be able to stand by our argument in a manner not reliant on the LLMs outputs.

I don't believe in pasting LLM responses but I think it's ok here. Here is what Gemini 2.5 pro says to itself when given the full paper's context:

Accurate/Reasonable Parts of the Statement:

"The paper’s findings don't prove reasoning is an illusion..." (in the sense of no reasoning whatsoever): This is largely true. The Apple paper isn't arguing that LRMs do nothing akin to reasoning. It's critiquing the depth, robustness, and generalizability of that reasoning, especially when compared to the elaborate "thinking" traces they produce.
"...they prove that probabilistic, pattern-based reasoning is not the same as formal, symbolic reasoning. It is a different kind of cognition.": This is a widely accepted and important distinction. The paper's findings (e.g., failure to consistently apply algorithms, struggles with compositional complexity beyond a certain point) are consistent with the idea that LLMs operate differently from classical symbolic AI or even from how humans might approach some formal reasoning tasks. The paper provides evidence for the limitations of this probabilistic approach in certain contexts.
"They are simply two different systems achieving a similar outcome through different means, each with its own strengths and failure points.": In a broad sense, comparing LLM "cognition" to human or symbolic AI cognition, this is true.

The Key Flaw(s):

The Bird vs. Airplane Analogy Misrepresents the "Illusion":
- The statement says: "Calling it an "illusion" is like arguing that because a bird's flight mechanics are different from an airplane's, the bird is creating an "illusion of flight.""
- This analogy is flawed because it misses the core of the paper's "illusion" argument.
  - A bird achieves flight. An airplane achieves flight. Neither is an "illusion" of the other, nor does a bird produce outputs that mimic the engineering principles of an airplane while actually being unable to sustain flight in the way an airplane does.
  - The "illusion" the Apple paper refers to is about the mismatch between the appearance of deep, step-by-step, human-like thinking (as suggested by the long Chain-of-Thought traces LRMs generate) and the actual brittleness, inefficiency, and scaling limitations of their problem-solving capabilities.
  - It's more like if a system produced incredibly detailed and complex architectural blueprints (the "thinking trace") for a 100-story building but consistently failed to actually construct a stable building beyond 10 stories (the "accuracy collapse"). The detailed blueprints might then create an "illusion" of profound architectural understanding and capability that isn't fully borne out.
  - The paper isn't saying probabilistic reasoning is an illusion because it's different from symbolic reasoning. It's saying the manifestation of LRM reasoning (the verbose "thinking") can create an illusion of greater capability than exists under stress or at scale.
Downplaying the "Why" Behind the "Illusion" Title:
- By focusing only on "different kinds of cognition," the statement sidesteps the paper's specific observations about how current LRMs fail despite their elaborate reasoning traces: the accuracy collapse, the decline in effort at high complexity, the "overthinking" on simple tasks, and the failure to consistently use provided algorithms. These behaviors are what lead the authors to use the term "illusion" – the thinking process looks sophisticated, but the results often don't match that sophistication when pushed.

In essence, the statement correctly identifies that LLMs use a different "kind of cognition" (probabilistic, pattern-based) than formal symbolic systems. However, it fails to grasp that the "illusion" highlighted by the paper isn't about this difference per se, but about the deceptive appearance of depth and robustness in the reasoning processes of current LRMs when contrasted with their actual performance limitations.

[-]

dagelf@reddit

Would you mind sharing your prompt?

[-]

218-69@reddit

The "blueprint vs collapsing building" analogy is genius, but I think it misses one, tiny, crucial point.

We keep talking about the model's Chain-of-Thought as if it's a transparent log file of its "thinking." It's not.

The model isn't performing a reasoning task and then reporting its steps.

It's performing a text-generation task where the target is "a plausible-sounding explanation of a reasoning process."

The CoT isn't a window into its mind; it's part of the performance. Its entire goal, dictated by its loss function, is to generate text that looks like what a human would write after they've reasoned. It's learned the form of "showing your work" without ever learning the function of the math itself.

The "illusion" isn't just that the reasoning is brittle. The illusion is that we think we're watching it reason at all. We're just watching a very, very good actor.

[-]

EstarriolOfTheEast@reddit

I agree, although I wouldn't go so far as to say it's purely acting.

Reasoning traces help LLMs overcome the "go with the first dominant prediction and continue along that line" issue. The LLM can iterate on more answer variations and possible interpretations of the user query. The reasoning tokens also do have an impact.

While the actual computation occurs in a high dimensional space, and we only glimpse shadows from a pinhole at best, the output tokens still serve as anchors for this space, with the tokens and their associated hidden states affecting future output through attention mechanisms. The hidden state representations of output tokens become part of the sequence context, actively influencing how the subsequent attention patterns and computations driving future reasoning steps will unfold. The selected "anchors" are also not arbitrary; during training, which selections set up the best expected values (or associations between reasoning token sequences and outcome quality) are learned and reinforced.

As LLMs learn to stop overthinking or converging on useless loops, we'll also gain a flexible approximation to adaptive computation for free. Except that when to stop will be modulated by the semantic content of the tokens, instead of being done at a syntactic or lower level. Related is that as LLM reasoning improves, they'll also be able to revise, iterate and improve on their initial output; stopping and outputting a response when it makes sense.

Finally, for those times when the LLMs are actually following an algorithm or recipe--say for a worked example--being able to write to context boost the LLM's computational expressiveness. So, while I agree that reasoning traces are largely post-hoc, non-representative and not faithful reports of the computations occurring internally, they are not purely performative and do serve a role. And can be improved to be better at that.

[-]

ColorlessCrowfeet@reddit

Excellent explanation!

[-]

michaelsoft__binbows@reddit

We gave it arbitrary control over how long we let it perform inception on itself, and the fact that it works pretty well seems to me about as magical as the fact that they work at all.

[-]

reza2kn@reddit

I didn't ask / enjoy the public service anouncement at the beginning of your response, but ok.

I also gave all models the entire PDF file before asking for their opinion, and also of course didn't copy the model's entire response.

If I saw a system / person that "produced incredibly detailed and complex architectural blueprints (the "thinking trace") for a 100-story building but consistently failed to actually construct a stable building beyond 10 stories" i would NOT say their architechtural knowledge is an illusion. Their capabilities have bounds and limits, like literally everyone and everything. never mind that these capabilities are growing much much faster than a human's could.

[-]

a_lit_bruh@reddit

This is surprising well put.

[-]

SuccessfulTell6943@reddit

Gemini seems confused, not technically wrong, but it's worded oddly. It's as if it has the two concepts are backwards in two different scenarios. People generally don't say reasoning itself is an illusion, they say that models deploy an illusion of reasoning. Then it says that birds mimic the flight of a plane, when the general sentiment is the opposite. I get the point that it is making because it's been made a million times before, but it's weird that it's backwards in this case.

Deepseek seems like it is contributing characteristics that really aren't present in these models. I don't think any models are currently just phoning it in because they know they will be wrong anyways. If that were the case why not just explicitly say that instead of going out of your way to makeup plausible but false text? You can't make a claim that your just conserving energy and then write 4 paragraphs of nonsense.

[-]

reza2kn@reddit

I think the confusion is coming from you my friend, not Gemini.
Gemini didn't say Reasoning is an illusion, the paper is claiming that just because LLM's reasoning doesn't look like ours and has some observed limits, it's not doing reasoning, but an illusion of it.

By the same token, one could say since birds don't have black boxes and jet fuel, their flight (or even the flight of the plane) is an illusion vs the real thing.

And it points to the fact that the same result (i.e. reasoning / flying) can come from a variety of methods, all of which will be called that.

[-]

Worth_Plastic5684@reddit

My instinct aligns with that first take a lot. How do you write down the 1024-line solution to 10-disk towers of Hanoi? 30 lines in you're an automaton, the language centers in your brain have checked out, they are a poor fit for this problem. You're using what one might call "Python in the chain of thought"... Some frontier models already have that...

[-]

Viablemorgan@reddit

I appreciate the bird and flight metaphor. But the problem with the AI companies is that they are pushing and heavily suggesting (or at the very least, not stopping anyone from thinking) that the AI models ARE THINKING in the way that people and animals think. Even if AI models end at the same result (not reliably lol) through a different process, they are telling us that it IS the same process. That’s what’s fucking annoying

[-]

OkZebra9086@reddit

Thats also how brain works.

[-]

clduab11@reddit

It's ALWAYS been just math. The right meme is the astronaut meme and "Always has been".

The nomenclature around "reasoning" needs to change, and how it's marketed needs to change, but all the mouthbreathers who are buying into this meme a) are already behind the 8-ball because there's a lot of utility you can't refute when it comes to reasoning layers and tokens and 2) Apple's "whitepaper" used abstract, algorithmic layers they "claim" as "reasoning layers" and apply it to puzzle-centric testings these layers were not designed to be used for in a vacuum. Anyone who actually READ the paper and not focused on this meme realizes this.

Reuven Cohen said it best under a LinkedIn post to this...

Same could be said for most humans.

[-]

Current-Ticket4214@reddit (OP)

Can you just laugh?

[-]

ortegaalfredo@reddit

It's not reasoning unless it's from the French region of Hypothalamus. Otherwise it's sparkling CoT.

[-]

crusoe@reddit

Their study just reads mostly like sour grapes. "Fine, we didn't want AI anyways".

[-]

crusoe@reddit

Most people don't reason EITHER, Reasoning is HARD.

Westworld was right in that regard.

Most people engage with the world, 99% of the time, in a reflexive manner.

If you gave towers of Hanoi to people, most would fail to discover the optimal move list or the optimal way to plan for n towers of hanoi.

The vast majority of people never make it the Abstract Thinking Stage of development, per Jean Piaget.

We keep expecting AI to act like computers and be super-logical, but it is definitely a mirror, as it were.

[-]

Terrible_Visit5041@reddit

The problem with that is, are we actually thinking? Decision reasoning happens seconds after decision finding. Split brains showcase how we take responsibility for our actions and make up reasons even though we have no idea why we did something, and we would swear we did it because we had an internal monologue, a thought pattern, leading us to it.

All those "LLM's" aren't really thinking leaves me with two questions: 1. How do we define that. 2. How do we prove any other human does this? Extrinsic checks, not intrinsic. We know we fool ourselves.

Turing was right, the only test we can do is extrinsic and if the answer book inside a Chinese room is complex enough, it is aware. Even though the internals are as unimpressive as the observation of a single neuron.

[-]

Mart-McUH@reddit

We were cruising around Iceland (before age of LLM) on a ship and at one moment the ship captain said a phrase I remember: "Everything is mathematics".

Yeah, LLM is mathematics. But so is ultimately our our brain (let's not forget that random and quantum effects are also described by mathematics).

[-]

AppearanceHeavy6724@reddit

Our could be a weird quantum contraption that has nothing to do with math after certain scale.

[-]

mitchins-au@reddit

Reasoning isn’t magic. It just guides the prompt into known rails self-echoing alignment. And a lot of the time it works because it steers it back into territory the model was familiar with from training.

[-]

Purplekeyboard@reddit

I couldn't solve these extended Tower of Hanoi puzzles either. Shit, I guess I only have the illusion of thinking, and can't reason.

[-]

rorowhat@reddit

Is that siri on the corner?

[-]

CheatCodesOfLife@reddit

Nah, Siri would be butting it abruptly and answering questions nobody asked.

[-]

VisceralMonkey@reddit

Wait…your Siri answers questions?

[-]

rickschott@reddit

Actually, this is an interesting paper with a misleading main title, but a very clear subtitle (which one could find out reading the abstract)

[-]

cnnyy200@reddit

I still think LLM is just a small part of what would make an actual AGI. You can’t just recognize patterns to do actual reasoning. And the current methods are too inefficient.

[-]

liquiddandruff@reddit

Actually, recognizing patterns may be all that our brains do at the end of the day. You should look into what modern neuroscience has to say about this.

https://en.m.wikipedia.org/wiki/Predictive_coding

[-]

MalTasker@reddit

And yet: Researchers Struggle to Outsmart AI: https://archive.is/tom60

[-]

threeseed@reddit

Humans struggle to outsmart a calculator.

So we've had AGI for decades now ?

[-]

Pretty_Insignificant@reddit

How many novel contributions do LLMs have in math vs humans?

[-]

ColorlessCrowfeet@reddit

No, no, no -- It's not intelligent, it's just ~~meat~~ math!

[-]

cnnyy200@reddit

My point is not that LLMs are worse than humans. It’s that I’m disappointed we are too focused on just LLMs and nothing on experimenting in other areas. There are already signs of development stagnation. Companies just brute force data into LLMs and are running out of them. Return to me when LLMs are able to achieve 100% benchmarks. By that time, we would already be in new paradigms.

[-]

LeopardOrLeaveHer@reddit

Possibly. And there's no reason to believe it would be conscious. Anybody who has programmed much knows that most programming is made of hacks. Shit would be so hacky, insane AGI is the likelihood.

[-]

YouDontSeemRight@reddit

I think we could mimic and AGI with an LLM. Looking at biology I think the system would require a sleep cycle where the days context is trained into the neural network itself. It may not be wise to train the whole network but perhaps a lora or subset. I also feel like a lot of problem solving does follow a pattern. I've debugged thousands of issues in my career and I've learned to solve them efficiently by using patterns. My question is whether LLM's learn general problem solving patterns that just fits the training data really well but isn't context based and can fail or if it learns subject matter specific problem solving capabilities. If it can do both generalize and context specific problem solving patterns and we let it update the patterns it uses and adapts itself through experience, at what point does it cease to improve and at what point have we essentially created an engine capable of that of biological creatures.

[-]

TheRealVRLP@reddit

I remember having a standard prompt on ChatGPT 1 that would give extra instructions on being specific and reasoning it's answers first to make them better etc.

[-]

PeachScary413@reddit

It has been interesting to read so many emotional and hostile responses; it seems like many people are heavily invested in LLMs being the path to AGI (and perhaps that "thinking" would get us there).

[-]

t3h@reddit

That, and this paper came from researchers at Apple, so that triggers the other half of the irrational hatred.

[-]

threeseed@reddit

Even though one of the researchers co-wrote Torch.

[-]

AcidCommunist_AC@reddit

Ok, now prove humans are "actually reasoning".

[-]

TheTomatoes2@reddit

Who would've guessed??? Thank god Tim Cook is here to rescue us

[-]

LetsileJulien@reddit

Yeah, they dont , its just a buzz word for marketing

[-]

Teetota@reddit

Reasoning can be seen as adding a few more ai-generated shots to the conversation. If you send your initial prompt and ask to analyze it, break it down to steps and enrich with examples to a non reasoning model and then use the output plus original prompt in a new chat you kinda reproduce the reasoning model.

[-]

MountainRub3543@reddit

And Apple cannot build a functional assistant, let alone an LLM.

[-]

relmny@reddit

Maybe that's why they invest time and money on this paper.

[-]

Commercial_Stand5086@reddit

@grok is this true

[-]

hipster-coder@reddit

Saying that something is "just math" doesn't mean anything.

[-]

Thick-Protection-458@reddit

Okay, I need to play around with other puzzles they used. But Hanoi tower example sounds ridiculous.

--------

Apple: Benchmarks are leaked into train data

@

Also apple: Let's use hanoi tower puzzle. It definitely did not leaked

--------

Also, losing performance after 7-8 disks? Man, without having physical freaking tower or at least drawing it after every step (and they did not mentioned tools allowing to imitate physical tower) I personally would lose coherence much faster. Probably would even with them. Most probably like V3 or so.

Well, on the other hand I was always joking that attributing intelligence to us was overstated, so I have no problems with me being a pattern matching. Even if a bit more general.

And frankly, if we - just because of complexity generalization being expected to be not 100%-good - assume we have M steps ahead. and N% chance to generate correct step and K% chance to find error and retrace the whole approach since than - shouldn't we expect exponential quality loss (the only question is the exponent base)? Which, upon certain threshold - will look like almost 0% chance to solve for given amount of samplings, and will look exactly llke 0% for certain samples?

--------

And finally... Degrading performance? Yes, it seems for that puzzle it is just reasonable to write some python program instead of solving it manually, lol. Or cheat and move whole tower physically, lol (which I got as an option from deepseek, lol).

--------

This being said - that's still interesting.

They measured some qualities, instead - so now we have them measured numerically. Is that correct to intepret like "no generalization at all" or "complexity generalization is so imperfect so it lose quality after N steps" or "it finds out it is pointless to do that way and suggest another" is another question.

At least now we have numbers to compare one more set of things.

(btw, still interesting where humans will be at these plots).

[-]

tonsui@reddit

TL;DR: In a way, "thinking" is a sophisticated form of prompt engineering.

[-]

Training-Event3388@reddit

reasoning = tag/label for AI that uses a system to validate itself… ITS NOT HUMAN

[-]

squatsdownunder@reddit

I saw this embedded in one of the dumbest artciles I have read on Linkedin for a while. Read and weep: https://www.linkedin.com/posts/stephenbklein_apple-just-blew-the-doors-off-the-generative-activity-7337545423194136576-_CYg?utm_source=share&utm_medium=member_ios&rcm=ACoAAAAvNcMBu6PHvMQ7vLf6XsNnfK3jYvhuPI0

[-]

NamelessNobody888@reddit

I wonder if this paper will end up becoming an AI meme in the way that Minsky & Papert's book 'Perceptrons' did back in the day...

[-]

Svedorovski@reddit

No Shit

Sun Tzu, The Art of War

[-]

perth_girl-V@reddit

But but BUT BuTt fucking tokens

[-]

rcparts@reddit

[-]

Demoralizer13243@reddit

I think the paper has some interesting conclusions but I would say 3 things about it:

I would have liked to have seen them actually work with/modify their own model. They mentioned how they were limited by the fact that their data is limited because they are just API calls. You are apple, literally download the full deepseek or hell even a smaller Local LLAMA and do a closer analysis. Attempt to modify it to continue to scale in compute even after it reaches the collapse threshold and see if that actually does anything. That was the most frustrating part. I believe they only went to 100k tokens for some of these more complex puzzles and even then the model stopped using as many tokens. I wish they had just created a modified in house model that attempted to scale inference time compute and see how that effected the whole collapse regime.
I think analyzing thinking traces is a little bit fraught with the recent paper on LLMs lying/omitting their real thinking processes in the recent anthropic paper. I would have at least liked a little bit of discussion on how that could have effected the intermediate results.

[-]

change_of_basis@reddit

Unsupervised next token prediction followed by Reinforcement Learning with a "good answer" reward does not optimize for intelligence or "thinking"; it optimizes for exactly the former. Useful, still.

[-]

Halfwise2@reddit

f you'll allow a bit of cheekiness, human thinking can be reduced to basically just math too. Extremely complex math, but math nonetheless.

[-]

Lacono77@reddit

Apple should rename themselves "Sour Grapes"

[-]

Equivalent_crisis@reddit

Which reminds me, " When the monkey can't reach the bananas, he says they are not sweet"

[-]

Ikinoki@reddit

Apple sounds like their AI division is trying to justify budgets by saying "why this won't work". And then a few days later we receive a model which exactly proves them wrong :)

[-]

Professional_Job_307@reddit

What? Ofcourse they are reasoning. They are capable of inventing new, novel things and solve very complex problems.

[-]

cokecanpapi@reddit

How do we feel about this paper? I just put it on my reading list this morning

[-]

CraigBMG@reddit

My semi-informed opinion is that LLMs are more like our language intuition, reasoning models are like our self-talk to validate our intuitions. We are asking how well this performs at solving visual-spatial puzzles, and the answer is an exceptionally unsurprising "not very". Let's not judge a fish by how poorly it flies.

[-]

shiftingsmith@reddit

[-]

Subject-Building1892@reddit

Take 2 pills of 300 mg of copium after meal twice a day.

[-]

SmallMacBlaster@reddit

I mean, we're just biological computers. Does it mean free will is an illusion?

[-]

unlucky_fig_@reddit

It might one day

[-]

Anyusername7294@reddit

I'm too lazy to read, what this paper is about? Reasoning models?

[-]

sapoepsilon@reddit

Basically the whole paper is that the reasoning models don't reason but are really good at pattern matching. Which has been true in my experience.

[-]

lance777@reddit

Intelligence is pattern matching. Pattern matching is intelligence. I have held that opinion for a long time

[-]

Bakoro@reddit

This is maybe some imprecise language, but go with it:

Everything is either a function, the output of a function, or both.

Functions can be composed into new functions, and composite functions a can be decomposed into component functions.

Pattern matching is also a function.

Neural nets are universal function approximators.

Fundamentally, that's all there is to intelligence. Some people hate that everything we are can be reduced to something so "simple", but it is what it is.

Somewhere in our genes are the instructions to make a brain, and the something about the topology and chemistry of the human brain makes human intelligence, and that collection of things can be approximated digitally.

[-]

MalTasker@reddit

Apparently pattern matching is enough to make Researchers Struggle to Outsmart AI: https://archive.is/tom60

[-]

kunfushion@reddit

Aka Humans

People have this ridiculous definition of reasoning that they pretend humans have… we’re doing the same thing that LLMs are, just at a much more complex level

[-]

sapoepsilon@reddit

Not really. Humans can deduce a new decision if the pattern match leads to a different result. LLMs do not do this, and the paper clearly states that fact. If you read the paper, it explains that when LLMs cannot find a decision, they become stuck in a loop, trying to identify a pattern that allows them to solve the problem. No matter how much computing power you apply, they still cannot solve the issue. Hence, the title of the paper: "Illusion of Thinking."

Essentially, if they cannot solve the problem with the pattern match they were trained on, they will not be able to solve it, regardless of the computing resources available.

[-]

kunfushion@reddit

I would argue that humans do the same, we are just a lot more general at this point.

Have you never gotten yourself into a loop on a hard problem? And had to bring in outside help?

It’s not an illusion of thinking, it’s just beyond current capabilities.

IMO generality isn’t 0 or 1. It’s a spectrum. Humans are high on the spectrum but absolutely not a 1 (perfectly general). LLMs are much below humans at this point, but getting more general every year.

[-]

sapoepsilon@reddit

If what you are saying is true, LLMs should be solving new problems, but they are not. Humans can innovate; LLMs are not yet. Even the latest alphaEvole math discovery with matrix multiplication had a human in loop, I think.

[-]

kunfushion@reddit

We’ve seen LLMs come up with novel hypothesis for multiple different scientific studies.

We’ve seen Alphafold come up with never before seen foldings and bindings (which is a transformer).

For all that is holy can we stop perpetuating this which has been proven untrue countless times?

[-]

sapoepsilon@reddit

could you link those, please?

[-]

NuScorpii@reddit

I guess the next question is if this shortcoming is due to the type of model, model size, training data, or some combination? It could be that it turns out to be not much of a road block as model sizes improve and vast amounts of synthetic data is used to augment training data and improve reasoning.

[-]

freecodeio@reddit

"much more complex level" is doing some very heavy lifting here.

If you had a LLM that had eyes, ears, a body that moves in the 3d world and an infinite context window that learns for years, you'd need a datacenter the size of the moon.

[-]

kunfushion@reddit

I didn’t say we are the same as LLMs I said we’re doing the same thing in this context. In the way that we reason.

And yes the human brain is ridiculously more efficient than current LLMs. LLMs are also getting ~90% more efficient per year currently.

[-]

freecodeio@reddit

I didn’t say we are the same as LLMs I said we’re doing the same thing in this context.

Yes, and my point is we can't for sure.

[-]

kunfushion@reddit

Ofc not… But the patterns are there

[-]

freecodeio@reddit

The patterns are there for the sun circling around earth, but the truth is completely different. We can't just take small patterns and apply them to complex things, but it is indeed the way of how things are discovered & improved.

[-]

PlanetaryPickleParty@reddit

It's yet to be seen whether humanity will fully map and understand the human brain before building an AI with general reasoning, but we will gain that understanding eventually. It could be 100 years but we will do it, just as we did with the human genome, proteins, etc. etc. At which point humanity will unlock the math and chemistry required to create a mind without anything nearing that amount of resources.

[-]

Current-Ticket4214@reddit (OP)

I haven’t read the entire paper, but the abstract does actually provide some powerful insight. I would argue the insights can be gleaned through practice, but this is a pretty strong confirmation. The insights:

non-reasoning models are better at simple tasks
reasoning models are better at moderately complex tasks
even reasoning models collapse beyond a certain level of complexity
enormous token budget isn’t meaningful at high levels of complexity

[-]

MalTasker@reddit

And its wrong https://www.seangoedecke.com/illusion-of-thinking/

My main objection is that I don’t think reasoning models are as bad at these puzzles as the paper suggests. From my own testing, the models decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start. You can’t compare eight-disk to ten-disk Tower of Hanoi, because you’re comparing “can the model work through the algorithm” to “can the model invent a solution that avoids having to work through the algorithm”. More broadly, I’m unconvinced that puzzles are a good test bed for evaluating reasoning abilities, because (a) they’re not a focus area for AI labs and (b) they require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems. I’m also unconvinced that reasoning models are as bad at these puzzles as the paper suggests: from my own testing, the models decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to start. Finally, I don’t think that breaking down after a few hundred reasoning steps means you’re not “really” reasoning - humans get confused and struggle past a certain point, but nobody thinks those humans aren’t doing “real” reasoning.

[-]

sluuuurp@reddit

If they can solve technical problems and help me without any “real reasoning”, I don’t really care. I care about results.

[-]

Local_Beach@reddit

How is it related to function calls? I mean, it determines what makes the most sense. That's some thinking, at least.

[-]

TrifleHopeful5418@reddit

This is a very good paper, re-enforcing the belief that I have held for long that transformer architecture can’t/ won’t get us to AGI, it is just a token prediction machine that draws the probability of next token based on the sequence + training data.

RL fine tuning for reasoning helps as it’s makes the input sequence longer by adding the “thinking” tokens, but at the end it’s just enriching the context that helps with better prediction but it’s not truly thinking or reasoning.

I believe that true thinking and reasoning come from internal chaos and contradictions. We come up with good solutions by mentally thinking about multiple solutions from different perspectives and quickly invalidating most of the solutions with problems. You can simulate that by running 10/20/30 iterations of non thinking model by varying the seed/temp to simulate entropy and then crafting the solution from that, it’s a lot more expensive than the thinking model but it does work.

[-]

No-Break-7922@reddit

transformer architecture can’t/ won’t get us to AGI

Nothing will. "AI" was no intelligence, "ML" was no learning, and now we saw that our hype-fueled buzzwords don't align with the realities of technological capability, we "up" the buzzword to "AGI" to basically refer to an artifically produced intelligence.

I just hope this hype and buzzword focused culture ends soon but I doubt since this is the main method of ripping off investors and rich people. The best humanity is capable of mathematically is interpolation, call it AI, ML, AGI, that doesn't change it. It's just interpolation. Intelligence the way we humans are intelligent has all to do with extrapolation to a very good level of accuracy and success, which we simply do not have the math for and won't likely have it for at least another 500-1000 years because it's a damn difficult subject.

[-]

MalTasker@reddit

No it isnt

https://www.seangoedecke.com/illusion-of-thinking/

My main objection is that I don’t think reasoning models are as bad at these puzzles as the paper suggests. From my own testing, the models decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to even start. You can’t compare eight-disk to ten-disk Tower of Hanoi, because you’re comparing “can the model work through the algorithm” to “can the model invent a solution that avoids having to work through the algorithm”. More broadly, I’m unconvinced that puzzles are a good test bed for evaluating reasoning abilities, because (a) they’re not a focus area for AI labs and (b) they require computer-like algorithm-following more than they require the kind of reasoning you need to solve math problems. I’m also unconvinced that reasoning models are as bad at these puzzles as the paper suggests: from my own testing, the models decide early on that hundreds of algorithmic steps are too many to even attempt, so they refuse to start. Finally, I don’t think that breaking down after a few hundred reasoning steps means you’re not “really” reasoning - humans get confused and struggle past a certain point, but nobody thinks those humans aren’t doing “real” reasoning.

[-]

Current-Ticket4214@reddit (OP)

I definitely agree with your last sentence. Not disagreeing with any others, but AGI with transformers will require massive scaffolding.

[-]

dani-doing-thing@reddit

Best reasoning models already "thinking about multiple solutions from different perspectives and quickly invalidating most of the solutions with problems".

[-]

tyty657@reddit

Everything is math it's still works

[-]

gyanster@reddit

2+2 from memory or actually computing is quite different

[-]

ScrapMode@reddit

Everything quite literally math

[-]

colbyshores@reddit

Wither or not a LLM is actually reasoning is irrelevant when it comes to its usefulness because it is a fact that LLMs are more accurate when when given test time compute. It’s why o3 beat the ARC-AGI text.. mind you it cost millions of dollars for what would have took a human a couple of minutes to figure out but still.

[-]

mrb1585357890@reddit

Ermmm, isn’t neuronal firing in the brain just math?

Do we think the brain is special?

[-]