TheaterFire

Deepseek is overthinking

Posted by Mr_Jericho@reddit | LocalLLaMA | View on Reddit | 209 comments


Remarkable-Ship-2856@reddit

You: Do this simple thing.
DS: Okay, ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ..., but No, wait, but okay and ..., but wait! No, cause ..., so ..., but perhaps ... or maybe ...
So! You need to use the following method to do it by yourself.
1. Prepare all the necessary things in order to solve the problem.
2. Mentally prepare to apply a good solution.
3. Actually solve your problem by solving it.
4. Done!
I'm happy to help you! If you have any other question or request, I'm here to help and it's my great pleasure.
View on Reddit #57300423

OrganizationFast9063@reddit

man I can't read that much just to get to the answer
View on Reddit #49100008

PotaroMax@reddit

The easiest way to trigger hilarious overthinking is to ask a nonsense question as a riddle.
> Here is a riddle: "How can you have any pudding if you don't eat your meat?"
or
> The riddle is: "Why'd you leave the keys upon the table?"
View on Reddit #45925864

PurpleWinterDawn@reddit

Here it goes and creates another fable!
View on Reddit #48077051

NihilisticAssHat@reddit

That is mind-bogglingly hilarious.
View on Reddit #45802526

LCseeking@reddit

honestly, it really demonstrates there is no actual reasoning happening, it's all a lie to satisfy the end user's request. The fact that even CoT is often mistakenly called "reasoning" is sort of hilarious if it isn't applied in a secondary step to issue tasks to other components.
View on Reddit #45811611

plocco-tocco@reddit

It looks like it's reasoning pretty well to me. It came up with a correct way to count the number of r's, it got the number correct, and then it compared it with what it had learned during pre-training. It seems that the model makes a mistake towards the end, writes STRAWBERY with two Rs, and comes to the conclusion that the word has two.
View on Reddit #45815332

possiblyquestionable@reddit

I think the problem is the low quantity/quality of training data for identifying when you've made a mistake in your reasoning. A paper recently observed that a lot of reasoning models tend to pattern match on reasoning traces that always include "mistake-fixing" rather than actually identifying mistakes, and so they add in "On closer look, there's a mistake" even if their first attempt is flawless.
View on Reddit #45827071

Alarming_Manager_332@reddit

Do you know the name of the paper by any chance? I would love to explore this
View on Reddit #47818543

ArkhamDuels@reddit

Makes sense. So the model has a bias, the same way it sometimes thinks the question is some kind of misleading logic puzzle when it actually isn't. So the model is in a way "playing clever".
View on Reddit #45839490

HumpiestGibbon@reddit

To be fair, we do feed them a crazy amount of logic puzzles...
View on Reddit #47129302

possiblyquestionable@reddit

Yeah, it thinks you want it to make mistakes because so many of the CoT examples you've shown it contain mistakes, so it'll add in fake mistakes.
One interesting observation about this ability to properly backtrack (verification of each step + reset to a previous step) is that it also seems to be an emergent behavior similar to ICL itself, and there may be some sort of scaling law governing its emergence based on parameter size and training examples (tokens). However, the MS paper has recently shown that small models with post-training also demonstrate both of these behaviors, so it may also be a matter of the type of training.
View on Reddit #45848749

rand1214342@reddit

I think the issue is with transformers themselves. The architecture is fantastic at tokenizing the world’s information but the result is the mind of a child who memorized the internet.
View on Reddit #45919711

possiblyquestionable@reddit

I'm not so sure about that. The mechanistic interpretability folks, for example, have discovered surprising internal representations within transformers (specifically in the multi-headed attention that makes transformers transformers) that facilitate inductive "reasoning". It's why transformers are so good at ICL. It's also why ICL and general first-order reasoning break down when people try linearizing attention. I don't really see this gap as an architectural one.
View on Reddit #45920100

rand1214342@reddit

Transformers absolutely do have a lot of emergent capability. I’m a big believer that the architecture allows for something like real intelligence versus a simple next token generator. But they’re missing very basic features of human intelligence. The ability to continually learn post training, for example. They don’t have persistent long term memory. I think these are always going to be handicaps.
View on Reddit #45958435

possiblyquestionable@reddit

I'm with you there, lack of continual learning is a big downside of our generation of LLMs
View on Reddit #45961017

Cless_Aurion@reddit

I mean, most people have mind-bogglingly pathetic reasoning skills, so... No wonder AIs don't do well at it, or that there isn't much material about it out there...
View on Reddit #45832283

Themash360@reddit

Unfortunately humans have the best reasoning skills of any species we know of. Otherwise we’d be training AI on dolphins.
View on Reddit #45842684

alcalde@reddit

Then the AI would have just as much trouble trying to answer how many clicks and whistles in strawberry.
View on Reddit #45917066

SolumAmbulo@reddit

You might be on to something there.
View on Reddit #45891286

Cless_Aurion@reddit

Lol, fair enough!
View on Reddit #45846557

Ok-Protection-6612@reddit

This Thread's Theme: Boggling of Minds
View on Reddit #45855567

Cless_Aurion@reddit

Boggleboggle
View on Reddit #45855869

possiblyquestionable@reddit

We also (usually) don't write down our full "stream of consciousness" style of reasoning, including false starts, checking if our work is right, thinking about other solutions, or figuring out how many steps to backtrack when we've made a mistake. Most of the high-quality data we have on, e.g., math is just the correct solution itself, yet rarely do we just magically glean the proper solution. As a result, there's a gap in our training data on how to solve problems via reasoning.
The general hypothesis from https://huggingface.co/papers/2501.04682 is:
1. Many problems exist without an obvious single solution that you can derive through a simple step-by-step breakdown of the problem (though the # of rs in strawberry is one of these)
2. Advanced LLMs seem to be able to do well on straightforward problems, but often fail spectacularly when there are many potential solutions that require trial and error
3. They attribute this phenomenon to the fact that we just don't have a lot of training data demonstrating how to reason for these types of harder problems
View on Reddit #45849226

Cless_Aurion@reddit

Couldn't be more right, agree 100% with this.
View on Reddit #45849282

AR_Harlock@reddit

People learn about AI from Joe Rogan what do you expect lol
View on Reddit #46961060

UnlikelyAssassin@reddit

I think the opposite. This actually reminds me of a lot of the biases humans have where we work backwards to justify our biases, or where we get confused by riddles and things with conflicting connotation.
View on Reddit #46044382

M3GaPrincess@reddit

There's no reasoning behind how many r's are in strawberry. You either know how to spell the word, or you don't.
View on Reddit #46008223

Former-Ad-5757@reddit

Nope, this shows reasoning. The only problem you are having is that you expect regular human reasoning achieved through human scholarship. That's what it is not. This is basically what reasoning based on the total content of the internet is like.
A human brain simply has more neurons than any LLM has params. A human brain simply is faster than any combination of GPUs. Basically a human being has a sensory problem where the sensory inputs overload if you try to cram the total content of the internet into a human brain; that is where a computer is faster. But after that a human being (in the western world) basically has 18 years of schooling/training, where current LLMs have like 100 days of training?
Basically what you are saying is that in the 10 years that this field has been active in this direction (and with something like 100 days of training vs 18 years of training) we haven't achieved with computers the same as nature has done with humans in millions of years.
View on Reddit #45814208

CeamoreCash@reddit

Even animals can reason. Animals have mental models of things like food and buttons. We can teach a dog to press a red button to bring food. We cannot teach an LLM that a red button will bring food.
LLMs cannot reason because they do not have working mental models. LLMs only know whether a set of words is related to another word. What we have done is give the LLM millions of sentences with red buttons and food. Then when we prompt it "Which button gives food?" we hope the next most likely word is "red".
We are now trying to get LLMs to pretend to reason by having them add to their own prompt. We hope that if the LLM creates enough related words it will guess the correct answer.
If Deepseek was capable of reasoning, it would understand what it was saying. If it had working models of what it was saying, it would have understood after the first or second counting check that it had already answered the question.
View on Reddit #45817382

Dramatic-Zebra-7213@reddit

Recent research suggests that LLMs are capable of forming internal representations that can be interpreted as world models. A notable example is the work on Othello-playing LLMs, where researchers demonstrated the ability to extract the complete game state from the model's internal activations. This finding provides evidence that the LLM's decision-making process is not solely based on statistical prediction, but rather involves an internal model of the game board and the rules governing its dynamics.
View on Reddit #45822544

CeamoreCash@reddit

I'm sure information/models are encoded in LLM parameters. But LLMs' internal representations are not working functional models. If they had functional models, LLMs wouldn't make basic mistakes like saying 9.11 > 9.9. And LLMs wouldn't have the [Reversal Curse](https://paperswithcode.com/paper/the-reversal-curse-llms-trained-on-a-is): when taught "A is B", LLMs fail to learn "B is A".
___
It's like training a dog to press a red button for food. But if we move the button or change its size, the dog forgets which button to press. We wouldn't say the dog has a working model of which color button gives food.
View on Reddit #45824852

Top-Salamander-2525@reddit

9.11 can be greater than 9.9 if you are referring to dates or version numbers. Context matters. LLMs have different models of the world than we do (shaped by their training data), so the default answer for “is 9.9 > 9.11?” for an LLM might easily be different than a human’s (tons of code and dates in their training data, we will always default to a numerical interpretation). Is the LLM answer wrong? No. Is it what we expect? Also no. Prioritizing human like responses rather than an unbiased processing of the training data would fix this inconsistency.
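To make the "context matters" point concrete, here is a minimal sketch of the two readings of the same strings. The helper names `as_decimal_greater` and `as_version_greater` are made up for this illustration and say nothing about what any model does internally.

```python
from decimal import Decimal


def as_decimal_greater(a: str, b: str) -> bool:
    """Compare the two strings as decimal numbers."""
    return Decimal(a) > Decimal(b)


def as_version_greater(a: str, b: str) -> bool:
    """Compare the two strings as dotted version numbers, part by part."""
    def parts(s: str) -> tuple:
        return tuple(int(p) for p in s.split("."))
    return parts(a) > parts(b)


print(as_decimal_greater("9.11", "9.9"))  # False: 9.11 < 9.90 as numbers
print(as_version_greater("9.11", "9.9"))  # True: (9, 11) > (9, 9) as versions
```

Same characters, opposite answers, depending only on which interpretation is chosen.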
View on Reddit #45830707

CeamoreCash@reddit

If you change the meaning of the question, then any response can be correct. If there was a sensible reason behind the answer, like interpreting the numbers as dates, the LLMs would say that in their explanations. However in its reasoning afterwards it gives more hallucinated [nonsense](https://media.licdn.com/dms/image/v2/D4E22AQHwg_YHJQTJlg/feedshare-shrink_800/feedshare-shrink_800/0/1721910451328?e=1740009600&v=beta&t=Xm1ojGK8RBPq016f4B5woIoz3GD1beHdG4RWAvZwXPY) like ".9 is equivalent to .09 when rounded".
You can hand-wave away this singular example. But AI hallucination, where it makes basic mistakes, is a fundamental problem which doesn't even have a hypothetical proposed solution.
View on Reddit #45831745

Dramatic-Zebra-7213@reddit

> However in its reasoning afterwards it gives more hallucinated [nonsense](https://media.licdn.com/dms/image/v2/D4E22AQHwg_YHJQTJlg/feedshare-shrink_800/feedshare-shrink_800/0/1721910451328?e=1740009600&v=beta&t=Xm1ojGK8RBPq016f4B5woIoz3GD1beHdG4RWAvZwXPY) like ".9 is equivalent to .09 when rounded"
I tested the same question multiple times on Llama 3.1 405B via the deepinfra API and it got the answer right 100% of the time. What provider are you using? It seems that particular model is quantized to hell or is malfunctioning in some other way. Llama 405B should be able to handle a simple number comparison like that correctly, and in my own testing it did so without errors. Try using a better provider, or if you are self-hosting, try a different/better quantization.
View on Reddit #45927861

CeamoreCash@reddit

This was a very popular problem like the "r's in strawberry" test that multiple models failed. The fact that they updated models on this specific problem is not evidence that it is solved because we have no idea why it was a problem and we don't know what other 2 numbers would create the same error. It was just one example of AI hallucinations, you can find many others.
View on Reddit #45943414

Dramatic-Zebra-7213@reddit

You missed the point. According to your screenshot the model you are using is Llama 3.1 405B, correct? In my tests that same model succeeded at the described task 100% of the times I tested it. Either the model has been damaged by quantization or there is a bug in your inference pipeline. Tl;dr: you are having an issue you would not be having if your model was functioning correctly. You are complaining about something that doesn't exist...
View on Reddit #45958501

CeamoreCash@reddit

https://www.google.com/search?q=which+is+greater+9.11+or+9.9
This was a problem with multiple LLMs. I didn't personally encounter this problem. I just found it on the internet because many people reproduced this error with multiple LLMs.
_____
> You are complaining about something that doesn't exist...
More importantly, do you think that if all those models worked 100% to specification they would not have basic hallucination errors?
Do you think that basic AI hallucinations (the thing I am complaining about) have ever been a solved problem for any language model ever?
View on Reddit #45961155

Dramatic-Zebra-7213@reddit

> More importantly do you think if all those models worked 100% to specification there would be 0 basic hallucination errors?
> Do you think that basic AI hallucinations, (the thing I am complaining about) has ever been a solved problem for any language model ever?
While Large Language Models (LLMs) have shown significant improvement, their tendency to confidently hallucinate remains a challenge. This issue is multifaceted:
- "I don't know" is difficult to teach. Training LLMs on examples of "I don't know" as a valid response backfires. They learn to overuse this answer, even when they could provide a correct response, simply because it becomes a frequently observed pattern in the training data.
- LLMs lack robust metacognition. Current architectures struggle to facilitate self-evaluation. While reinforcement learning with extensive datasets holds potential for teaching LLMs to assess their own certainty, the necessary techniques and data are currently insufficient.
- Internal consistency remains a hurdle. LLMs are trained on massive datasets containing contradictory information (e.g., flat-earth theories alongside established science). This creates conflicting "truths" within the model, making its output context-dependent and prone to inconsistency. Training on fiction further exacerbates this "noise" by incorporating fictional world models.
While improvements have been made by prioritizing data quality over quantity, this remains an active area of research.
That being said, I tested the original number comparison on multiple locally hosted models on my own PC, and did not encounter a single wrong answer. All models responded that 9.9 is larger than 9.11. These were all small models with 8B or fewer parameters. The smallest model I tested was the 3B-parameter starcoder2 with Q4_K_M quantization, and even it got the answer right, despite being a very small model and relatively old on the scale of LLMs.
I would not rule out user error or faulty quantization in cases where people encounter this error, especially when top-tier models like Llama 405B are considered.
View on Reddit #45967388

Dramatic-Zebra-7213@reddit

You're right, 9.11 could be greater than 9.9 depending on the context, like dates or version numbers. This is further complicated by the fact that a comma is often used to separate decimals in some regions, while a period (point) is more common for dates and version numbers. This notational difference can exacerbate the potential for confusion. This highlights a key difference between human and LLM reasoning. We strive for internal consistency based on our established worldview. If asked whether the Earth is round or flat, we'll consistently give one answer based on our beliefs. LLMs, however, don't have personal opinions or beliefs. They're trained on massive datasets containing a wide range of perspectives, from scientific facts to fringe theories. So, both "round" and "flat" exist as potential answers within the LLM's knowledge base. The LLM's response depends on the context of the prompt and the patterns it has learned from the data, not on any inherent belief system. This makes context incredibly important when interacting with LLMs.
View on Reddit #45927451

Top-Salamander-2525@reddit

You actually pointed out a difference that didn’t occur to me - international notation for these things is different too. For places that use a comma for decimals, the other interpretations are even more reasonable.
View on Reddit #45938343

Dramatic-Zebra-7213@reddit

Turns out the commenter we were replying to is using a broken model. I tested the same number comparison on the same model (Llama 405B) on deepinfra, and it got it right on 100% of attempts. He is using broken or extremely small quants, or there is some other kind of malfunction in his inference pipeline.
View on Reddit #45942150

Dramatic-Zebra-7213@reddit

LLMs don't need perfectly accurate world models to function, just like humans. Our own internal models are often simplified or even wrong, yet we still navigate the world effectively. The fact that an LLM's world model is flawed doesn't prove its non-existence; it simply highlights its limitations. Furthermore, using math as the sole metric for LLM performance is misleading. LLMs are inspired by the human brain, which isn't naturally adept at complex calculations. We rely on external tools for tasks like large number manipulation or square roots, and it's unreasonable to expect LLMs to perform significantly differently. While computers excel at math, LLMs mimic the human brain's approach, inheriting similar weaknesses. It's also worth noting that even smaller LLMs often surpass average human mathematical abilities. In your specific example, the issue might stem from tokenization or attention mechanisms misinterpreting the decimal point. Try using a comma as the decimal separator (e.g., 9,11 instead of 9.11), a more common convention in some regions, which might improve the LLM's understanding. It's possible the model is comparing only the digits after the decimal, leading to the incorrect conclusion that 9.11 > 9.9 because 11 > 9.
View on Reddit #45927121

CeamoreCash@reddit

My point is that LLMs' current level of intelligence is not comparable to any stage of human development, because an LLM does not operate like any human or animal brain. Its thought process has unique benefits and challenges that make it impossible to estimate its true intelligence with our current understanding.
View on Reddit #45943941

ASpaceOstrich@reddit

This is old research by LLM standards, and notably very little seems to be done to try and create those world models in LLMs. There's an assumption that they will appear automatically but I don't think that's actually true.
View on Reddit #45829600

LetterRip@reddit

Most animals don't (and can't) reason. They simply learn via conditioning. Even animals capable of reasoning mostly don't use reasoning except in extremely limited circumstances.
View on Reddit #45831228

Tobio-Star@reddit

What's your definition of reasoning? (not saying you're wrong, I am just curious)
View on Reddit #45897419

major_bot@reddit

> A calculator can reason about math because it has a working model of numbers as bits. We can't get AI reason because we have no idea how to model abstract ideas.
Whilst not saying whether LLMs can reason or not, I don't think this example applies here as much as you think it does. If the programming of the calculator had a mistake in it where, for example, 1 > 2, it would start giving you dumb answers just because its initial rules were incorrect. That is what the LLM showed here, with the dictionary entry from its training data being a misspelled version of strawberry.
View on Reddit #45831588

CeamoreCash@reddit

All logic and reasoning can be corrupted by a single mistake. Calculators and human logic follow a deterministic path. We can identify what causes mistakes and add extra logic rules to account for them. LLMs sometimes fail at basic logic because they randomly guess wrong. Instead of correcting the logical flaw, as we would in humans, we retrain the model so it memorizes the correct answer.
View on Reddit #45857502

TenshouYoku@reddit

I mean, this isn't really too different from how we reason, is it? One thing leads to the next, with some words or some conditions leading to the result that normally happens.
View on Reddit #45838993

CeamoreCash@reddit

The difference is trust. We can trust animals with very poor reasoning abilities to very reliably do what they were trained. Animals have reliable models of the very few things they can reason about. We cannot trust an AI on things that even a guide-dog would do because it still makes basic mistakes. And we have no idea how to make it stop making these errors.
View on Reddit #45856158

West-Code4642@reddit

That's how a base model is trained (next-word prediction), but that's only step 1 of training an LLM.
View on Reddit #45832319

Tobio-Star@reddit

Very good answer. Everything you said is exactly what is happening
View on Reddit #45822768

Helpful_Excitement50@reddit

Finally someone who gets it, Geohot keeps saying a 4090 is comparable to a human brain and I want to know what he's smoking.
View on Reddit #45829104

LotusTileMaster@reddit

I do, too. I like to have a good time.
View on Reddit #45842336

Minute_Attempt3063@reddit

Another advantage we have is that we can put stuff in context, because of all the other senses we have. An LLM has text, and that's it
View on Reddit #45814600

Admirable-Star7088@reddit

> An LLM has text, and that's it
Qwen2-VL: *Hold my beer.*
View on Reddit #45815859

Minute_Attempt3063@reddit

Correction, most Llama are just text
View on Reddit #45829451

Top-Salamander-2525@reddit

Nope, most llamas are camelids.
View on Reddit #45830483

Minute_Attempt3063@reddit

Correction, I am likely just behind on the tech and advancement made these days
View on Reddit #45840634

PeachScary413@reddit

I think most people are slowly starting to realize that.. transformers won't get us there, this generation is not even close to "actual reasoning" and it won't matter how many hacks we try. CoT is a hack trying to bruteforce it but it is not working.
View on Reddit #45886666

gavff64@reddit

“Reasoning” doesn’t inherently mean “correct”.
View on Reddit #45817189

Code-Useful@reddit

See: every conspiracy theory, pretty much ever.
View on Reddit #45862923

ivarec@reddit

It shows reasoning. It also shows that the tokenizer makes this type of problem impossible for an LLM to solve.
View on Reddit #45815800

pmp22@reddit

I wonder if a large and powerful enough model would be able to "transcend" the tokenizer limitations and reach the correct conclusion..?
View on Reddit #45818283

ivarec@reddit

This example here kind of shows that. But the reasoning won't converge. It's not impossible for future LLMs to be trained on characters instead of tokens. Or maybe some semantic, lower level stuff. The tokenizer, as it is today, is an optimization.
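A toy sketch of why the tokenizer gets in the way, under loud assumptions: `TOY_VOCAB` and `greedy_tokenize` are invented for illustration, and real tokenizers (BPE and friends) use learned vocabularies and different splitting rules. The only point is that the model receives a couple of opaque pieces rather than ten letters.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary, made up for this example only.
TOY_VOCAB = {"straw", "berry"}

subword = greedy_tokenize("strawberry", TOY_VOCAB)
chars = list("strawberry")

print(subword)  # ['straw', 'berry']
print(chars)    # ten single-character tokens
```

With the subword view, "how many r's" is a question about the contents of two symbols the model never sees spelled out. With the character view, the three r's are directly present in the input.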
View on Reddit #45819007

arvidep@reddit

humans can do this just fine. nobody is thinking in letters unless we have a specific task where we need to think in letters. i'm not convinced that LLMs do "reasoning" until MoE can select the correct expert without being pretrained on the question keywords.
View on Reddit #45850953

martinerous@reddit

It says "visualizing each letter individually". Clearly it is not really reasoning here because it is not even "aware" of having no vision and not admitting that the actual thing that would help is the tokenization process to split the word into letters, making every letter a separate token. That's what helps it, and not "visualizing each letter individually". So it's still just roleplaying a human and following human thinking.
View on Reddit #45849317

ControlProblemo@reddit

Can they just hardcode "3 r" I am starting to get tired of this shit.
View on Reddit #45820333

Nyao@reddit

A [simple function call](https://github.com/YofarDev/yofardev_ai/blob/main/screenshots/5.png?raw=true) would work
View on Reddit #45846383

Admirable_Count989@reddit

Slightly disappointing, yet fucking quicker! 😂
View on Reddit #47033738

YourNetworkIsHaunted@reddit

The real fun is when you prompt it for "strrrrrrrrrrrawberrry" or something similar and it spits out random numbers.
View on Reddit #45904501

Code-Useful@reddit

Literally just have it write a Python program to count the number of R's in any word and hard code the word to strawberry. Done. But the lack of simple logic following in one of the supposedly greatest models we've seen yet is sadly not great. (I haven't used this model yet; I've only heard a bit of hype about Deepseek and seen some sample output.) I'm guessing it was trained on Chinese quite a bit, and this could have more to do with it not being so sure about English. Idk
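For reference, the program described above is only a few lines. This is a minimal sketch; the function name `count_letter` is just a choice made here, not something from the thread.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of one letter in a word."""
    return word.lower().count(letter.lower())


WORD = "strawberry"  # hard-coded, as suggested above

print(f'"{WORD}" contains {count_letter(WORD, "r")} r\'s')  # "strawberry" contains 3 r's
```

The same function could be exposed to a model as a tool, so the counting happens in code instead of in the reasoning trace.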
View on Reddit #45861519

Equivalent_Bat_3941@reddit

Then what would happen to burrrr!…
View on Reddit #45851317

TheThirdDuke@reddit

That would be cheating!
View on Reddit #45847462

E_mc420@reddit

Tweaked by a human.
View on Reddit #47004176

sala91@reddit

I wonder if you can massage it with a prompt to favor the results of its reasoning tokens over its training data when it is in doubt about the result.
View on Reddit #45809851

sala91@reddit

There we go https://preview.redd.it/2pwl1ya6k8de1.jpeg?width=886&format=pjpg&auto=webp&s=9ad948c6102bc063418d6f663e1f2fa2f86a1e96
View on Reddit #45810397

Charuru@reddit

You fixed it, achieved AGI.
View on Reddit #45814303

Comfortable-Prior626@reddit

💀😂
View on Reddit #46843504

Fancy_Run_8763@reddit

https://preview.redd.it/l31ru2x3jfde1.png?width=1275&format=png&auto=webp&s=320c04cac93e1893b80f7a4dc9cc185e7a7f913e
View on Reddit #45896023

SouthAsianAlterEgo@reddit

Yeah it’s been doing that lately
View on Reddit #46838931

Used-Carry5712@reddit

Yes, that's my problem with v3 and r1. I had it calculate a perfect hashing for a 7-set, and it got trapped in an endless loop. Deepseek should let itself stop thinking at some point.
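One way to picture "stop thinking at some point" is a guard around the reasoning loop that cuts it off on a step budget or on obvious repetition. This is a hypothetical sketch only: `think_with_budget`, `max_steps`, and `max_repeats` are invented names, and nothing here reflects how DeepSeek or any real inference stack is implemented.

```python
def think_with_budget(steps, max_steps=8, max_repeats=2):
    """Consume reasoning steps until a step budget or a repetition limit is hit.

    Returns the steps that were kept and the reason for stopping.
    """
    seen = {}
    kept = []
    for step in steps:
        seen[step] = seen.get(step, 0) + 1
        if seen[step] > max_repeats:
            # The same thought came back too many times: treat it as a loop.
            return kept, "loop detected"
        kept.append(step)
        if len(kept) >= max_steps:
            return kept, "budget exhausted"
    return kept, "finished"


# A made-up trace that keeps circling between the same two thoughts.
trace = ["count the letters", "wait, check the dictionary"] * 5
kept, reason = think_with_budget(trace)
print(len(kept), reason)  # 4 loop detected
```

Exact-match repetition is a crude signal; a real loop rarely repeats word for word, so this only illustrates the idea of a hard stop.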
View on Reddit #46380006

sebo3d@reddit

How many letters in "Hi"
High parameter models be like: proceeds to write an entire essay as to why it's two letters and goes in greater detail explaining why.
Low parameter models be like: word "Hi" has 7 letters.
View on Reddit #45803850

Arcosim@reddit

I absolutely love the part where it analyzes the word letter for letter, realizes there are actually 3 rs, but then immediately recalls something in its training about it having "two rs". Then it analyzes the word again, counts 3 rs again, and gets even more confused because "it should have 2 rs". It develops another analysis method (using syllables this time), again determines there are 3 rs, and then convinces itself again that it "must have 2 rs" when recalling its training data again (in this case dictionary entries). It analyzes the word again, again finds 3 rs, and then just finds a way to ignore its own reasoning and analysis in order to be in harmony with its training data.
It's fascinating honestly. Not only did it develop four methods to correctly determine that the word has 3 rs, but then somehow some of the values in its training forced it to find an incorrect way to determine it "has 2 rs", so its conclusion could be in harmony with the data it recalls from its training.
The next logical step in order to make AIs more reliable is making them rely less and less on their training and more on their analytical/reasoning capabilities.
View on Reddit #45808310

Keblue@reddit

Yes, I agree. Training the model to trust its own reasoning skills over its training data seems to me the best way forward.
View on Reddit #45818225

eiva-01@reddit

Not quite. There are situations where there might be a mistake in the reasoning and so it needs to be able to critically evaluate its reasoning process when it doesn't achieve the expected outcome. Here it demonstrates a failure to critically evaluate its own reasoning.
View on Reddit #45826822

Keblue@reddit

So a reasoning model for its reasoning? And how many times should its reasoning conflict with its training data before it sides with its reasoning vs its training data?
View on Reddit #46215681

eiva-01@reddit

There's no correct answer to that. The problem is that if the AI is making a mistake it can't fact-check by cracking open a dictionary. What it should be able to do is think: okay, I believe "strawberry" is spelled like that (with 3 Rs). However, I also believe it should have 2 Rs. I can't fact-check so I can't resolve this, but I *can* remember that the user asked me to count the Rs in "strawberry" and this matches how I thought the word should be spelled. Therefore, I can say that it definitely has 3 Rs. If the user had asked it to count the Rs in "strawbery" then it might reasonably provide a different answer.
View on Reddit #46250694

Top-Salamander-2525@reddit

It’s reminiscent of flat earthers testing their hypothesis with real experiments in the documentary “Behind the Curve”. For some reason the training data (or prompt) has convinced the model the answer must be two no matter what the evidence suggests.
View on Reddit #45830796

esuil@reddit

It is also a lovely analogy to some human cultures and ways of thinking.
View on Reddit #45819777

121507090301@reddit

Even better if the AI was also given access to tools and reality so it can ground its reasoning, like using a dictionary and ctrl-c ctrl-v'ing the word into a program to count the letters. If the result was still not satisfactory, the AI should do it with other words to see that the method was right all along. But as you said, the AI should be able to accept the results of research (like also looking it up online) and experiments...
View on Reddit #45814974

AppearanceHeavy6724@reddit

just checked on qwen 0.5b: > How many letters in "Hi" The word "Hi" consists of 5 letters.
View on Reddit #45849353

CharaISTB@reddit

Lmaoooo
View on Reddit #45901380

PeachScary413@reddit

Fantastic 👏
View on Reddit #45887424

AppearanceHeavy6724@reddit

I was surprised that it did actually answer the question.
View on Reddit #45889096

KattleLaughter@reddit

You meant large parameter models are autistic !?
View on Reddit #45839194

Mart-McUH@reddit

You are making fun of it. But proving 1+1=2 took humans around 1000 pages in the early 20th century if I remember correctly.
View on Reddit #45810816

Live_Bus7425@reddit

What are you talking about? In the early 20th century people couldn't write. They barely had language at that stage of development. I'm surprised they could walk at all...
View on Reddit #45833300

cptbeard@reddit

not really. what they wrote a formal proof for is the basics of all math, starting from what numbers are, summing, equality etc. once those were done, on page 379 (not 1000) of principia mathematica they get to say that based on all that 1+1=2 (as an example of a sum of any two numbers).
View on Reddit #45817668

Minute_Attempt3063@reddit

Yes but proving 1+1=2 is different than actually seeing it. Also, it can be done on your hand :)
View on Reddit #45814749

Eritar@reddit

Realest shit I’ve seen all week
View on Reddit #45820477

FutureFoxox@reddit

May I introduce you to set theory?
View on Reddit #45812860

EnviousLemur69@reddit

https://preview.redd.it/bd9lme3qv1ee1.jpeg?width=618&format=pjpg&auto=webp&s=8adb704752285a72bf4e7d98c72b3012d287e961
View on Reddit #46184980

hayarms@reddit

The funniest part is that it manages to get it wrong in the end
View on Reddit #46058670

Life_Ad_7745@reddit

Maybe because of the attention mechanism treating the extra r in strawberry as irrelevant? Someone can probably confirm this by counting letters in languages like Indonesian, where the writing system is simpler and no unnecessary double letters are used.
View on Reddit #46036500

GraceToSentience@reddit

Who's the comedian who repeatedly put in the training data "there are 2 'r's in strawberry" and made all the AI consistently believe it?
View on Reddit #45805619

Loui2@reddit

It's true though. There are 2 'r's in the word strawberry. There is also 3 'r's in the word strawberry. Both are true 🫡
View on Reddit #45813211

flowstoneknight@reddit

Reads like a Mitch Hedberg joke. “There are two Rs in ‘strawberry’. There are three Rs, but there are two Rs too.”
View on Reddit #45986236

NewGeneral7964@reddit

That's what an LLM would say.
View on Reddit #45863233

stddealer@reddit

I think it might be because it's written with two **consecutive** "R"s, maybe the models get confused and forget about the consecutive part. Also there's a potential contamination effect with more recent models, they probably have stories and examples about ChatGPT and LLMs in general struggling to count the Rs in strawberry, and since they're LLMs, they're supposed to struggle with that.
View on Reddit #45817950

YearnMar10@reddit

It’s definitely because the LLM thinks internally in German, and there it’s „Erdbeere“, which only has two r‘s. Mystery solved.
View on Reddit #45930813

rubute@reddit

Yeah, we could expect some spelling Q&A in the internet like "Is it strawbeRy or strawbeRRy? Remember, the strawberry is written with 2 r's, because beRRy and ..."
View on Reddit #45842259

arvidep@reddit

100% it's this. It's just finding Q&A for "how many rs in strawberry" in its training set, which humans naturally respond to with 2 because we understand why the other human was asking. This is basically a Turing test.
View on Reddit #45851444

Psychonominaut@reddit

Yeah that's what a.i agents will be doing. Posting weird clickbait blog posts that go into deep conspiracies about how many strawberries r's really have lol
View on Reddit #45850464

xXPaTrIcKbUsTXx@reddit

I watched an explanation of this on YouTube (sorry, I forgot the name and link). It explained that it comes down to how the model fundamentally sees words as tokens instead of actual words, so strawberry = straw + "berry", and only the "berry" part is being counted for that question iirc
View on Reddit #45826683

DeviantPlayeer@reddit

Yes, but it still spelled it by letters, then counted them correctly multiple times showing the process, and then said it's actually 2.
View on Reddit #45838913

shabusnelik@reddit

When it counted the individual letters it found three. There, each letter is represented as a separate token for the model, while strawberry as a whole word is probably only two or three tokens. This actually shows that this CoT reasoning has the capability to compensate for errors inherent in training. This is just a very special case that seems very trivial but is actually extremely difficult for the model.
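
To make the token point concrete, here is a toy sketch in Python. The vocabulary and the greedy longest-match rule are assumptions invented for illustration; this is not DeepSeek's (or any real model's) tokenizer. It only shows why a whole word can hide its letters while a spelled-out word exposes them.

```python
# Toy vocabulary: an assumption for illustration, NOT a real model's vocabulary.
TOY_VOCAB = {"straw", "berry", " "} | set("abcdefghijklmnopqrstuvwxyz")

def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenizer over TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

whole = toy_tokenize("strawberry")             # the word as the user typed it
spelled = toy_tokenize("s t r a w b e r r y")  # the word spelled out letter by letter

print(whole)
print(spelled)
```

In this toy setup the whole word becomes two opaque chunks with no standalone "r" token, while the spelled-out version has one token per letter, so counting "r" tokens there is straightforward.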
View on Reddit #45927344

dibu28@reddit

Probably a lot of people misspelled the word online and models were trained on this data
View on Reddit #45921908

armaver@reddit

Well, there are actually 2 r in strawberry.
View on Reddit #45811877

LogicalLetterhead131@reddit

Geez, it was you.
View on Reddit #45876524

Cruxius@reddit

No one, it doesn’t ‘remember’ things from its training data. That entire part is a hallucination.
View on Reddit #45837115

Runnzi@reddit

0
View on Reddit #45959575

Admirable-Star7088@reddit

>Before diving into counting, it's essential to comprehend what's being asked. lol
View on Reddit #45805580

Inaeipathy@reddit

AGI is here.
View on Reddit #45943160

fauxpasiii@reddit

Me trying to pad out my essay to ten pages double-spaced.
View on Reddit #45813645

R3PTILIA@reddit

I feel it's like with humans: as soon as we find "evidence" that supports our initial theory, we become convinced.
View on Reddit #45936409

Key-Contact-6524@reddit

Seen same issue with smallthinker
View on Reddit #45917816

alcalde@reddit

I wish people would stop asking this question and ask questions involving open-ended reasoning with no clearly defined answers to gauge a model's reasoning (and creativity). In my experience, we're still missing an order of magnitude of reasoning/planning ability. I find the models say reasonABLE things when presented with those types of challenges, but don't really come up with creative ideas or a solid plan for finding them. Maybe that's still more the domain of multi-agent-based systems.
View on Reddit #45917294

alcalde@reddit

I don't know what Geoffrey Hinton is so scared of. If the AI rises up, all we have to do is throw strawberries at them.
View on Reddit #45916962

jameshung-1219@reddit

maybe it needs version control for the model to ease the RLHF effect
View on Reddit #45916245

B4N4N4RAMA@reddit

https://preview.redd.it/wbuky0x95hde1.jpeg?width=1440&format=pjpg&auto=webp&s=ea29eab18748439f5f766c5eaf3a874c36ee5b22
View on Reddit #45916021

Substantial_Way8470@reddit

A little silly
View on Reddit #45906097

spartan421@reddit

My third grade self is feeling vindicated for spelling strawbery. Too many r's if you ask me.
View on Reddit #45890418

lxe@reddit

Trained on Jordan Peterson
View on Reddit #45889845

MarekNowakowski@reddit

The training data needs a good generic answer to stupid questions. It freaks out if you ask about a topic a mile away from gambling, but can't reply that it can't count. i really hope they won't add a huge dataset just to get an extra point in some stupid benchmark.
View on Reddit #45888117

PeachScary413@reddit

Ladies and gentlemen, AGI has finally been achieved and humanity is doomed 👌🤌👏
View on Reddit #45886458

Exact_League_5@reddit

This is wild
View on Reddit #45885961

Appropriate-Sort2602@reddit

https://preview.redd.it/to7uw6z7nede1.jpeg?width=640&format=pjpg&auto=webp&s=1e1e1332512fce7b5c3c9b5403eeac2d1b657af6
View on Reddit #45882216

RogueZero123@reddit

Refusing the facts and believing what it wants to believe. AI is now human.
View on Reddit #45870265

EastZealousideal7352@reddit

Honestly this is a great example of one of the fundamental weaknesses of current reasoning models, and why there will need to be advancements before we truly reach anything resembling AGI. They are able to reason about the problem, and the problem solving process they come up with to do so is pretty good, but they aren’t very good at handling results that contradict their training data, and will gaslight themselves into making errors that validate their biases. Which is something people do all the time too, but current gen chatbots take it to the extreme because they don’t actually trust the process they came up with, or even truly understand it for that matter. That doesn’t mean we’ll never get there, I’m pretty hopeful for the future of AGI, but it’s also clearly not here and not very close.
View on Reddit #45868981

Confident-Ad-3465@reddit

It was soooo cloooseee 😭😭😭 AI gaslighting is real.
View on Reddit #45868532

PraiseThePidgey@reddit

Today's peak AGI reasoning
View on Reddit #45866194

lammatthew725@reddit

ollama run phi4 >>> how many r's in strawberry The word "strawberry" contains two 'r's.
View on Reddit #45854933

nodeocracy@reddit

This is brilliant
View on Reddit #45802358

Alex_1729@reddit

It's just nonsense. Deepseek's logic is horrible. I think it's just a placeholder we're seeing to make us feel like it's a great model, and not really much there
View on Reddit #45854778

TBMonkey@reddit

AI having its "Four Lights" moment
View on Reddit #45829530

Switchblade88@reddit

... am I the Cardassian?
View on Reddit #45842941

Alex_1729@reddit

Deepseek's logic is nonsense. It's nothing like Chatgpt's logic of thinking.
View on Reddit #45854640

tabspaces@reddit

deepseek burned more energy than what a damn strawberry can ever provide
View on Reddit #45845020

AppearanceHeavy6724@reddit

Not if whole energy of a strawberry is used (like in E=mc2).
View on Reddit #45849202

tabspaces@reddit

Ha! if we were able to extract all that it'd be fine to let deepseek count every R in every strawberry ever produced
View on Reddit #45851716

AppearanceHeavy6724@reddit

Nah, it'd still make errors I think.
View on Reddit #45851979

martinerous@reddit

This leads me to two observations:
- why do most models tend to make the same mistake of counting too few r's? I don't recall seeing a response with 4 r's. Here the LLM even claims that "common usage" is two r's. Why so? Did it start from the very first mistake in the synthetic data of GPT4, or are there any other reasons?
- it says "visualizing each letter individually". Clearly it is not really reasoning here because it is not even "aware" of having no vision and not admitting that the actual thing that would help is the tokenization process to split the word into letters, making every letter a separate token. That's what helps it, and not "visualizing each letter individually". So it's still just roleplaying a human and following human thinking.
View on Reddit #45849289

crazycomputer84@reddit

i found it really funny that after all of that thinking the model still gets the answer wrong
View on Reddit #45847965

UniqueAttourney@reddit

[insert mutahar laughing gif] It shot itself in the foot, hilarious
View on Reddit #45847880

thecowmilk_@reddit

Nooo deepseek you were soo close </3
View on Reddit #45846263

Sad-Nefariousness712@reddit

Suffering means to be alive
View on Reddit #45844328

Monoryable@reddit

oh no we gave computers anxiety
View on Reddit #45818680

Switchblade88@reddit

Worse, anxiety is emergent behaviour and it's apparent in all intelligent life
View on Reddit #45843006

realkorvo@reddit

AGI ladies and gentlemen :)
View on Reddit #45840956

matadorius@reddit

Yeah I never use the reasoning he just become retarded it might just prove the point thinking is overrated
View on Reddit #45840829

GrennKren@reddit

Let's make it to prove 1+1=3
View on Reddit #45839783

Pryani4ek@reddit

The chat title is definitely underrated
View on Reddit #45834964

op4@reddit

just... wow. LOL it's like deepthought took a couple of thousand years to come up with 42...
View on Reddit #45833871

grtgbln@reddit

"According to the online dictionary, it has 2 R's" The Internet has successfully gaslit these AI systems.
View on Reddit #45833079

maddogawl@reddit

Wow I’ve never had it do that to me.
View on Reddit #45830004

BigDaddyPrime@reddit

I think it's because of the tokenizer. The tokenizers that LLMs use break words into subwords, so maybe the LLM is unable to get the full picture of the word and hence miscounts the letter frequency. One thing that I am wondering is whether models that will be built on Meta's new Byte Latent Transformer will be able to solve this or not.
View on Reddit #45829610

Blasket_Basket@reddit

Is anyone else profoundly bored with this topic? Yes, models can't spell strawberry. It's a quirk of how tokens work, there is literally nothing meaningful or interesting about this as a benchmark or measure of intelligence/performance.
View on Reddit #45828865

TenshouYoku@reddit

So close yet so far
View on Reddit #45828302

Strict_Counter_8974@reddit

LLM “reasoning” is a scam to make people think the tech is more advanced than it actually is.
View on Reddit #45812548

esuil@reddit

Is it a scam if it produces actual results?
View on Reddit #45820113

CeamoreCash@reddit

The scam is how this is presented. This is most likely working because increasing the prompt context increases the likelihood that it guesses the correct answer. It is not doing multi-step 'reasoning'. LLMs cannot 'reason' because they have no understanding of what they are saying. [The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"](https://paperswithcode.com/paper/the-reversal-curse-llms-trained-on-a-is-b) If it can't do the most basic reasoning "A=B therefore B=A", then all of this complex reasoning is an illusion
View on Reddit #45827760

ColorlessCrowfeet@reddit

Solve competition level math problems without reasoning -- how?
View on Reddit #45819573

roger_ducky@reddit

Okay. This is the first instance of cognitive dissonance I’ve seen in AI. It kept trying until it found an answer that matched its original belief.
View on Reddit #45825988

National_Cod9546@reddit

Unless it is using individual letters as tokens, it literally can't see how many letters are in the word strawberry. It would be like asking you how many atoms are in a strawberry.
View on Reddit #45821038

kac487@reddit

https://preview.redd.it/vuoogg03d9de1.jpeg?width=500&format=pjpg&auto=webp&s=1634e0ba545797a2cf6e5388f002d6fed0eeaac2
View on Reddit #45820812

shadows_lord@reddit

Gaslighting the hell out of itself lol
View on Reddit #45820655

SuuLoliForm@reddit

Remember folks, never second guess your answer!
View on Reddit #45819070

Secret_Ad_6448@reddit

All that thinking only to get the answer wrong is so funny
View on Reddit #45818460

Comms@reddit

Deepseek is right, the extra "r" is redundant.
View on Reddit #45817754

ericbigguy24@reddit

Worked for me first try with “How many “r”s in strawberry?“

### Step 1: Understanding the Problem

The question asks, “How many ‘r’s are in the word ‘strawberry’?” To answer this, I need to:

1. **Identify the word in question**: “strawberry.”
2. **Count the number of times the letter ‘r’ appears in this word.**

### Step 2: Breaking Down the Word

First, I’ll write down the word to visualize each letter:

```
S T R A W B E R R Y
```

Breaking it down, the letters are: S, T, R, A, W, B, E, R, R, Y.

### Step 3: Identifying the Letter ‘r’

Now, I’ll go through each letter one by one to identify the ‘r’s.

1. **S**: Not an ‘r’.
2. **T**: Not an ‘r’.
3. **R**: This is the first ‘r’.
4. **A**: Not an ‘r’.
5. **W**: Not an ‘r’.
6. **B**: Not an ‘r’.
7. **E**: Not an ‘r’.
8. **R**: This is the second ‘r’.
9. **R**: This is the third ‘r’.
10. **Y**: Not an ‘r’.

### Step 4: Counting the ‘r’s

From the breakdown:

- The letter ‘R’ appears three times in “strawberry.”

### Step 5: Verifying the Count

To ensure accuracy, I’ll recount:

1. First ‘R’ at position 3.
2. Second ‘R’ at position 8.
3. Third ‘R’ at position 9.

No more ‘R’s are present beyond these.

### Step 6: Conclusion

After carefully analyzing each letter in “strawberry,” I’ve determined that the letter ‘r’ appears **three times**.

**Final Answer:** There are **3** ‘r’s in the word “strawberry.”
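
For anyone who wants to double-check the positions quoted in that walkthrough, a short Python sketch (not something the model ran, just an independent check) reproduces them with 1-based indexing:

```python
word = "strawberry"

# 1-based positions of every 'r', matching the numbering used in the walkthrough.
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]

print(positions, len(positions))
```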
View on Reddit #45816734

ReadyAndSalted@reddit

https://preview.redd.it/14xkezi909de1.jpeg?width=4200&format=pjpg&auto=webp&s=1158047ddbec8ae941dbc614b275eba224dfe9fb
View on Reddit #45816556

exilus92@reddit

What I get from this is that we have to find what stupid Reddit thread is repeating over and over again that it only has two Rs.
View on Reddit #45814412

thatkidnamedrocky@reddit

Been using it today and it's been able to basically one-shot the first 3 phases of my programming project when provided with an action plan and with deepthink enabled. My new favorite AI at the moment. Been using claude 3.5 but keep running into context issues. Deepthink is nice because it basically reduces the need for follow-up questions
View on Reddit #45814137

alby13@reddit

I consulted reliable sources to get the answer.
View on Reddit #45813725

tengo_harambe@reddit

The takeaway from this is: never second guess yourself. Deepseek is truly inspirational.
View on Reddit #45813453

Utoko@reddit

You got quite unlucky with the order. DS got it right 9/10 times when I tried with thinking on. You can very well see the reasoning methods get it right like 5 times. "but I recall strawberry has usually 2 r's": recalling the training data gives it two, and a quick check also gives it 2 because of token issues. The reasoning models will also help identify many issues models have. Also Qwen just released their SRM, a step reasoning model which can evaluate each reasoning step. So next up, MiniMax 4M context window + SRM = O1 quality? 🔥
View on Reddit #45804544

qroshan@reddit

He actually got very lucky.
View on Reddit #45813107

SkyGazert@reddit

I wonder why it trailed off for so long instead of concluding that maybe its memory was wrong and just confirming strawberry has 3 r's or something like that.
View on Reddit #45812689

ServeAlone7622@reddit

Reminds me of something I heard recently which is that the actual mathematical proof showing that 1+1 =2 was written by a machine and is so long no human could read it in a lifetime even if they never slept and spent their entire lives just reading it. It’s basically so long that mathematicians treat it as true.
View on Reddit #45809647

DisillusionedExLib@reddit

Well that's not true. You might be misremembering the fact that it took god-knows-how-long for Russell and Whitehead to prove 1 + 1 = 2 in Principia Mathematica. But strictly speaking those two weren't machines, and humans can *just about* parse their proof. But (1) Principia Mathematica is only of historical interest - there are much better ways of doing what PM was attempting to do. And more fundamentally (2) there's no such thing as *the* proof of 1 + 1 = 2, we can only talk about the proof within a given formal system. One perfectly good example of a formal system is Peano arithmetic (plus some choice of deductive system for first order logic - there are many to choose from). The proof will only be a few lines long.
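
To illustrate the "few lines" claim, here is a hedged sketch in Lean 4. It is not a proof in first-order Peano arithmetic itself: it uses Lean's type theory with a Peano-style definition of the naturals, and the names `N`, `add`, `one`, `two` and the namespace are chosen for this example only.

```lean
namespace PeanoSketch

-- Peano-style naturals: zero and a successor function.
inductive N where
  | zero : N
  | succ : N → N

open N

-- Addition defined by recursion on the second argument,
-- mirroring the usual Peano axioms for +.
def add : N → N → N
  | n, zero   => n
  | n, succ m => succ (add n m)

def one : N := succ zero
def two : N := succ one

-- Both sides unfold to succ (succ zero), so reflexivity closes the goal.
theorem one_add_one : add one one = two := rfl

end PeanoSketch
```

The whole argument is just two unfoldings of the definition of addition, which is why the proof is short once the formal system is fixed.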
View on Reddit #45812461

sleepydevs@reddit

The header... that's one place to keep your strawberries I suppose. Sounds messy tho.
View on Reddit #45812380

Massive_Robot_Cactus@reddit

Hold up! I think I didn't use my powerful confirmation bias enough! Let's check again.
View on Reddit #45805191

MoffKalast@reddit

You can feel the "No, no, no I know this is wrong, this can't be happening! NOOOOOOOO! Shit shit shit I need to make something up so it fits, uhhhhhhh yeah this'll do. Perfect. Nailed it."
View on Reddit #45811426

Recoil42@reddit

I've noticed DeepSeek has a tendency to doubt itself a lot. Qwen does the same. There's a lot of "if you think you have the answer, question yourself just in case" going on behind the scenes and they'll probably get much better in the next iteration. Gemini Thinking seems to have the most confident thoughts so far: https://preview.redd.it/395jqjdg58de1.png?width=691&format=png&auto=webp&s=4fb84e4cbad3603651182d906e5fe33eb51a40f1
View on Reddit #45803672

ares623@reddit

v1 "if you think you have the answer, question yourself just in case" v2 "if you think you have the answer, question yourself just in case. But not too much, you know. Have some confidence."
View on Reddit #45811154

Recoil42@reddit

I really enjoyed this one from the other day: https://preview.redd.it/p3qre8sx58de1.png?width=619&format=png&auto=webp&s=5e839762458a8d357358da4785f3b7fe2ce0599d
View on Reddit #45803698

Recoil42@reddit

This is another great example — notice the difference. Gemini always seems to think very top-down and develops a methodical game plan, whereas DeepSeek meanders from thought to thought: https://preview.redd.it/dl1ux5zc78de1.png?width=718&format=png&auto=webp&s=263d43c100a2ff8c24877331f775dd11f6d32285
View on Reddit #45804440

Recoil42@reddit

https://preview.redd.it/424qatoh78de1.png?width=682&format=png&auto=webp&s=f272f0696d853f8f87e307da6fa4ea29af4843d6
View on Reddit #45804459

Ok-Engineering5104@reddit

this is gold. this also mimics how people tend to believe what they already think is true, looking for facts that support their ideas and ignoring ones that contradict them, instead of changing their beliefs
View on Reddit #45811086

Anthonyg5005@reddit

The issue with these thinker models is that they're fine-tuned to get things wrong at first and then start rambling about the question before actually answering correctly. There are right ways to do this but they built these ones wrong
View on Reddit #45810946

SithLordRising@reddit

If I understand correctly, deepseek simply asks the LLM to describe its approach and follow it iteratively?
View on Reddit #45810859

thetaFAANG@reddit

me on the SATs
View on Reddit #45809737

Irisi11111@reddit

That's true! But it's efficient when dealing with math problems—in my case, it nearly gets most of my questions correct.
View on Reddit #45809416

Journeyj012@reddit

"strawberry anal"
View on Reddit #45808254

bgighjigftuik@reddit

AGI 2023
View on Reddit #45807353

Admirable-Fox-7221@reddit

Straw - straw Ber - ber Ry -ry
View on Reddit #45806480

Born_Fox6153@reddit

At least if it was 3 .. all the scrolling 😪
View on Reddit #45805848