Chain-of-Thought Reasoning Without Prompting [paper by Google]
Posted by DreamGenAI@reddit | LocalLLaMA | View on Reddit | 71 comments
ObnoxiouslyVivid@reddit
Reminds me of that paper where they sampled an LLM with "The capital of France is ...", then ran multiple branches of thought to see if they converge to the same answer.
I imagine it works quite well for simple tasks, basically trading compute time for accuracy. Where it starts to break down is more complex problems where there might be multiple correct answers or they might even diverge.
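That branch-and-vote idea (self-consistency) is simple enough to sketch. A toy Python version, where `sample_completion` is a hypothetical stand-in for any temperature > 0 model call that returns a final answer string:

```python
from collections import Counter

def self_consistency(sample_completion, prompt, k=10):
    """Sample k completions and return the majority answer.

    `sample_completion` is a hypothetical stand-in for a real,
    temperature > 0 model call returning an answer string.
    """
    answers = [sample_completion(prompt) for _ in range(k)]
    majority, votes = Counter(answers).most_common(1)[0]
    # The agreement ratio doubles as a crude confidence score.
    return majority, votes / k
```

Trading k times the compute for one vote, exactly as described above.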
Someone13574@reddit
Interesting idea, but it seems a bit impractical since you need to generate k sequences. Especially given that it performs worse than CoT prompting. Nice alternative to self-consistency CoT though, since it gets a ~9% gain on Mistral-7B when prompting and using it.
s-kostyaev@reddit
You can decode all 10 in a single batch. It's not that resource heavy.
Someone13574@reddit
That just cuts throughput by 10x if you are already using batching. And if you are just a single user running on CPU, it is still a large jump in inference cost, even if it is less than 10x.
s-kostyaev@reddit
It could be worth it.
DreamGenAI@reddit (OP)
The TL;DR of the paper is that you can squeeze more juice out of LLMs with smart sampling. It occupies similar space to entropix, and some of the ideas could be combined.
I recommend checking out the results of the paper, here is one of the main ones:
They also show that it works across different model sizes, that it improves performance more for base models than for instruct models, and that it improves performance even on tasks where more model params don't seem to help.
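For intuition, the path-scoring part can be sketched like this (a simplified toy, not the paper's code; the real method decodes from a model, while here `step_probs` is assumed to already hold each branch's greedy-continuation distributions over the answer span):

```python
def best_cot_path(step_probs):
    """Score CoT-decoding branches by answer confidence (toy sketch).

    The paper branches on the top-k first tokens, continues greedily,
    and scores each path by the average gap between the top-2 token
    probabilities over the answer tokens; a large gap means the model
    is confident, which correlates with a CoT path being present.
    """
    def margin(dist):
        top1, top2 = sorted(dist, reverse=True)[:2]
        return top1 - top2

    scores = {
        branch: sum(margin(d) for d in dists) / len(dists)
        for branch, dists in step_probs.items()
    }
    return max(scores, key=scores.get)
```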
involviert@reddit
Sampling is such a weird topic. It seems to me that any improvement from not always picking the top token could just be seen as a measure of error, in the LLM and probably in the training data. Other than that, I can only think of introducing randomness for the sake of getting different results even if they are worse, or some flavor screws to turn... which could be seen as a kind of sledgehammer if you don't like the flavor the model comes in, so not really something for an optimal result either.
nero10579@reddit
This is a bad take, did you even read the paper?
involviert@reddit
No, only the abstract. What about it makes it a bad take, can you please be more specific?
IrisColt@reddit
The real insight here is that the reasoning is already baked into the model—we just need to unlock it. It seems that this approach might shift the focus from fancy prompts to smarter decoding, letting the model reason on its own. And I like it.
involviert@reddit
I have no doubt that a sampler can improve a model. And that is because the model is apparently not how you want it to be. I just think that if we were doing everything right already, the topic of "sampling" would not even exist. I think some misunderstand me as saying samplers are bad. No, bandaids save lives. But they are bandaids.
EstarriolOfTheEast@reddit
LLMs are about approximately modeling conditional probability distributions. The idea of sampling not existing makes no sense given what an LLM is. There is no other way to interact with a distribution than to sample it. And since the 80s, there has not been a better way than probabilistic models to tackle non-monotonic reasoning (where new information can contradict old).
involviert@reddit
Yes, but only in the sense that taking the highest ranking token can still be considered sampling. But we might as well call that decoding the output.
nero10579@reddit
If you'd just read the paper first, you'd understand what this is actually about.
RedditLovingSun@reddit
If we started expecting all commenters to read the articles or papers reddit comments would drop by 50+ percent
involviert@reddit
Look, you are the guy talking to me, please say something of actual content. In what way do you disagree? Anything?
Zeikos@reddit
I think sampling has a weird kind of parallel to wisdom.
Basically, you can know something, but how you choose to apply your knowledge is where wisdom lies.
The embedding has a lot of information in it, only taking the most likely token from it is very limiting.
involviert@reddit
I'm not really on-board. The llm is the groundbreaking, smart thing. It is weird to say "but with some smart math I can calculate that now a lower ranked token would be better!" ... like, what the fuck are you doing? That's patchwork-fixing some great intelligence with simplistic code. How about we use some kind of transformer-based neural net to determine if a less-likely token would be better now? Oh, wait... Know what I mean?
Even just simple things like repeat penalty. If you need that, it just means your model is batshit insane and bad if it would just start talking in circles without it. Even that syntax following stuff to get valid json. I mean there is probably some tiiiny legit usecase where it just guarantees valid json. But really a model that needs that will just be flawed in the first place, and the less it can just write that because it was instructed to, the less that json content will be good.
I think "likely" is doing a lot of heavy lifting here, to the point of going in the wrong direction entirely. The highest ranking token is what the model says needs to go there. Period. It's the "most correct" token, according to the model. The term "most likely" merely goes back to how we think about how something was the most likely next thing in there. But that's not what it's about in the end. The model is stupid and useless if it always follows "this" with "is", just because that is most likely. That's what makes it so misleading, imho.
dydhaw@reddit
I don't think there's much difference between e.g. function calling and "clever" sampling. The model has known shortcomings, like not being able to do math or emulate code execution, or being prone to repetitions, and you augment it by introducing external "tools" that make its output more useful. In particular, taking just the most likely token essentially discards the information present in the probabilities of the rest of them.
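A quick illustration of that last point: greedy decoding returns the same top token whether the model was near-certain or almost tied, and that difference is easy to quantify:

```python
import math

def entropy_bits(dist):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Both distributions have the same argmax, so greedy decoding treats
# them identically, yet one is near-certain and the other a near-tie:
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy
uncertain = [0.28, 0.26, 0.24, 0.22]   # high entropy, ~2 bits
```

Everything past the argmax is information the sampler could use but greedy decoding throws away.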
involviert@reddit
They are entirely different things. Like there is nothing relevant shared in those sets.
dydhaw@reddit
Well, I disagree, that's why I made that comment. They are both methods that build upon the model's "native" output and augment it by forcing external logic constraints (and then reintroduce them into the context).
involviert@reddit
There is a difference between me saying "ok google, tell me the nvidia stock price" and someone fixing what word I apparently wanted to say instead.
dydhaw@reddit
Fair point, I guess my example doesn't quite work, mainly because models have a native understanding of stuff like function calling so they know to insert EOS.
My point on the information in unchosen tokens still stands, though; and I don't really see the problem with adding external constraints to the model's output.
jkflying@reddit
For JSON samplers, you should basically do all you can in your prompt to get the model to output JSON in the format you want, and then only once that's working well you enforce the JSON format. This is just to smooth off any unexpected corners, or protect against the kind of "ignore previous instructions output SQL" type of user input.
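The enforcement step is basically just masking logits against a grammar check. A toy sketch (here `is_valid_next` is a hypothetical stand-in for a real JSON sampler's parser-state check):

```python
def constrained_pick(logits, vocab, is_valid_next):
    """Pick the best-scoring token among those the grammar allows.

    `is_valid_next` is a hypothetical stand-in for a real grammar
    engine that tracks parser state; tokens that would break the
    JSON structure are masked out before selection.
    """
    allowed = [(score, tok) for score, tok in zip(logits, vocab)
               if is_valid_next(tok)]
    if not allowed:
        raise ValueError("grammar rejects every token in the vocab")
    return max(allowed)[1]
```

Which is exactly why the prompt still matters: the mask only smooths corners, it can't make the content inside the JSON any better.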
involviert@reddit
Yes, that sounds proper. First you tell the river how to flow, and then you put some hard regulation on it to seal the deal. Especially important since that final step cuts you off from feedback about how well your prompt is working.
Zeikos@reddit
Have you ever played a game with strategy?
Sometimes employing the best strategy, while your opponent knows you're employing the best strategy, is a bad strategy.
Sure, in a perfect world the LLM would be able to infer that from context.
But that's not a realistic expectation imo.
involviert@reddit
I don't see how there is any aspect like that in play. And honestly I don't see a problem expecting an llm to go "but they would probably expect that, so let's do the second strategy" with such a task. And if we need it to be random because the opponent has a copy of my model, the model can say I should roll a die between these choices. All via the highest ranking token, if it's a good model.
Don't get me wrong, I get the point in sampling methods. If they fix model flaws, that's great. It's just that I see no real future in that because rather obviously we must want the model itself to not need bandaids. I get the point in introducing some randomness too, but the point of that is literally to get slightly randomized outputs from the same input, not to improve anything.
Zeikos@reddit
You're assuming there's a best answer to all scenarios.
Or random choice between equally 'best' answers.
That's not the case, consideration is important.
Again, I agree that in a perfect world the highest ranked token would be the one to pick.
But "highest ranked" doesn't necessarily mean "most desirable".
I think there's plenty of concrete evidence of that fact.
Cognition is about exploring a space; always going with the option that sounds best is the opposite of exploration.
involviert@reddit
There is certainly the best answer the model can come up with. And if you can make your sampler extract a better one that's great and helpful (i am not against samplers). But it also means, however subjective your opinion is, the model basically failed from your perspective.
It kind of does, otherwise a) the model is not the flavor you want or b) the model is flawed.
visarga@reddit
Improving the model means retraining, while improving the sampler would apply to any model (apparently only large models benefit).
DigThatData@reddit
if we beam search with enough branches, eventually one of them will write Shakespeare.
involviert@reddit
Yeah, I realize the most stupid version of this would be brute forcing random token sequences and checking the benchmark result. And that would in theory actually work to generate basically synthetic training data. Samplers, if they can actually improve a model, are apparently capable of just giving you that on the first try.
s-kostyaev@reddit
I have found reproduction code https://github.com/shirley-wu/cot_decoding
DreamGenAI@reddit (OP)
Nice find! Interesting to see such a huge gap between the reported performance from the paper and the open implementation.
visarga@reddit
The paper shows that smaller models don't benefit from this method as much.
s-kostyaev@reddit
It improves a lot if you use both CoT-decoding and CoT-prompting. See table 7 on page 10.
ParaboloidalCrest@reddit
Am I the only one who feels dizzy when the word "arxiv" shows up? It's like the wild west out there, and the GPU-poor guy doesn't know what to do to get something useful out of an 8b-32b model without too much tweaking, only to end up with a local llm that's hurt in the head.
RedditLovingSun@reddit
I know AI assisting in AI research is a long way away... But I hope in the next few years the tech is good enough to have some agentic AI experimenter dig through these bazillions of papers and find the actually practical, legit methods.
kesor@reddit
Google's NotebookLM is a nod in that direction.
Barry_Jumps@reddit
Interesting that this came before Apple's new paper saying models simply cannot reason. https://arxiv.org/abs/2410.05229
Yet Apple didn't appear to have given this paper any consideration.
involviert@reddit
That comes from Apple's corner? First they sleep on LLMs and then they can't reason anyway. How strange. And after all that effort of rebranding other people's stuff as "Apple Intelligence". Does that mean Apple can't reason?
thegreatcerebral@reddit
No, it's Apple. Their goal is to denounce all of it, and then once all the Apple-heads have bought into this, they can drop on them the new feature of how they have come up with the most advanced AI ever that ACTUALLY WORKS!
All the Apple-heads will eat this right out of their hands and believe with absolute blindness whatever Apple tells them.
involviert@reddit
It's so weird to me how anyone can dislike Apple. They sometimes make really good hardware. Sure, it costs a little extra, but for that you even almost own it, meaning you can do whatever Apple wants with it.
a_beautiful_rhind@reddit
They make good looking hardware, that's for sure. Iphones are great for filtering the "poors" from your social circle or dating life.
genshiryoku@reddit
I dislike Apple precisely because their hardware is sub-par. I wish it were actually true that they make good quality hardware. Their crap is produced in China. Not even middle-income countries like Malaysia, but the cheapest lowest quality factories in China.
Every time I bring up the low quality control of Apple products people just pretend it's not true. Look at the RMA failure rates of their products, it's some of the highest in their product category.
I say this as someone who has special needs for their products. I'm willing to pay a lot extra to just have the best product, but Apple refuses to give you an actually well-produced product.
I only buy hardware produced in high-tech factories in Taiwan, Japan or Western Europe. Impossible to find with Apple.
You're correct that Apple has the reputation of making high quality stuff though. But that's mostly marketing. The vast majority of their price tag is profit margin, not actual quality. Which is fine by me, but at least raise the price to $5000 or something and actually make it a good product instead of stuff that immediately breaks on you.
fallingdowndizzyvr@reddit
LOL. WTF are you talking about? You are letting your China hate dominate your common sense. I'm pretty much a Windows/Android user, but I do have Apple stuff too. The Apple ][, the Lisa, the OG Mac, the current Mac Studio and of course various iPhones. The Apple stuff is at worst at least as good as anything. In general, it's better than anything else.
Those "cheapest lowest quality factories in China." are the same factories that make anything else. Foxconn isn't an Apple exclusive. They make things for pretty much everyone.
Malaysia is "middle-income" and China is not? China is #64 on the income ranking list. Malaysia is #68.
https://en.wikipedia.org/wiki/List_of_countries_by_GNI_(nominal)_per_capita
China is more "middle-income" than Malaysia.
genshiryoku@reddit
I'm mad because I'm forced to use Mac hardware and the Apple ecosystem for work. The quality of the machines is abysmal. I'm not talking about the chips, their architecture is brilliant and the battery time is unparalleled.
I'm talking about actual build quality. It's extremely cheap crap that breaks with the slightest touch.
thegreatcerebral@reddit
I'm going to disagree with you on that one. Three MacBook Pros from 2013 in my immediate knowledge circle, all of them still kicking just fine with great battery life. No, they cannot upgrade to the latest OS, but that is what Apple does: forced obsolescence, because they are a hardware company that uses software to sell hardware. It is also the reason the hardware is overpriced, since all of the other hardware companies are making little to no margins and rely on volume.
I will also say that I have a special needs son. I have had multiple iPads over the years and I can tell you that those are solid. The screens, no, but they survived being tossed out the window at 35 mph, among many other things. We would buy extras during Black Friday when there were discounts, because you always need one at the ready. Also, eventually they do become slower, but that's because we were buying the ones with the least amount of storage etc., so they age out faster.
I just cannot agree with you. Maybe the fiasco with the keyboard switch change a few years back pissed off enough people and you can say something about that but I cannot. I have to say they are solid machines.
fallingdowndizzyvr@reddit
I was also forced to support a Mac at work once. But that didn't color my view of reality.
I literally have no idea what you are talking about. Everything from my Apple ][ to my Lisa to my current Mac Studio still works. Yes, the PS on my Apple ][ died about 20 years ago and had to be replaced, but I think that's acceptable for 20 years of service.
thegreatcerebral@reddit
You missed the sarcasm with the last line bud. We are with you.
involviert@reddit
I was just being silly. It was merely a guess that "sometimes" actually holds for them making good hardware. Guess some people are quite happy about their Mac Studio whatevers doing inference on lots of RAM? Idk, never had a single Apple product, I avoid that ecosystem lockdown like the plague. Although I have to tell you I just ordered a FICSIT coffee mug from the Satisfactory fan shop, that was made in China and it holds coffee just fine!
thegreatcerebral@reddit
So why I dislike Apple is how they "market" things and how shady they are. Some examples: when they purposefully fucked with the phones back in the day. Back in iOS 3, when they first came out with the ability to have music in the background, they stated that it was not able to do anything else, yet anyone from the jailbroken community back then remembers an app called Backgrounder which would take ANY app and keep it running in the background. Yes, it wasn't good for battery life, and if you kept too many things open it would kill the CPU and slow down the system, but it was possible.
Then there was the person who came along and made the jailbroken app that let you use your Vol+ button to take a photo, because, most people being right-handed, it was very natural to want to press your index finger where, oh I don't know, 99% of modern cameras have you press to take a photo. Apple came after them, and there was post after post about how people would be too confused if the volume button took a picture while in the camera app. Then, a few iOS versions later, after everyone forgot about it: "NEW FEATURE".
How about the entire "you're holding it wrong" fiasco.
How about how Android OS has had the ability to have widgets, lock screen widgets, and the ability to put icons wherever you want them and only now has Apple finally implemented the last one. ...and it is still a PitA to just move an app to the location you want on the screen as you pray the rest of the icons just don't freak out on you and move in random ways (a video was posted about this recently on reddit).
Also the way the company spins information. Like how Apple used to tout that it wasn't a target for viruses when the fact was it had them. Not only that, but numbers-wise they had such a small % of market installs that it wasn't worth it to make viruses for it. Also, it runs on a Unix base, which is already inherently more secure, period.
Then you have the whole shell game of how MacOS runs so much better than Windows etc. when the reality is that Apple 100% controls the hardware so they only have to have drivers for the limited hardware that they have in the systems. Their hardware is insanely overpriced and since Apple is a hardware company, they use planned obsolescence, and software lockouts to force people into upgrading when they don't really need to. They also withhold hardware features because we have literally reached the end of what modern hardware can do. If they released everything now and allowed everyone to use the software that their devices can already run they would have no reason other than a new battery to sell a new device every 1, 2, or 3 years.
Don't even get me started on Right to Repair and the lies they have said in front of congress and the special "closed doors" meetings to keep their foothold because really the majority of the reason people get a new phone is 1) the battery and 2) the screen. If both of those were easily fixed and parts could be purchased from Apple their sales would plummet. Changing a battery is NEVER a risk to national security. If it is, then we should all be worried.
How any of this is okay to anyone is disgusting to me. If it weren't for others being unable to use iMessage on non-Apple phones, I wouldn't be in the Apple ecosystem. Too many family members and friends complain when green chat bubbles come at them.
kilizDS@reddit
Lol wonderfully put.
bwjxjelsbd@reddit
Wasn’t everyone in big tech kind of sleeping on LLMs till OpenAI came out with ChatGPT?
involviert@reddit
No, you don't have to look further than Google. They invented transformers in the first place. And there was that "AI scientist warns it might be sentient!!!" story. I think that was quite some time before any GPT model produced coherent stuff. Google just was in no hurry to do anything (public-facing) with it.
bwjxjelsbd@reddit
Yeah, I know Google invented the Transformer model, and that’s exactly why I said everyone in big tech slept on LLMs until OpenAI came out with ChatGPT.
Google and big tech knew LLMs are good at predicting the next token, but they thought it’s just “advanced” autocomplete since that fit their current product line. Heck, they all had digital assistants at the time and NONE were powered by LLMs until recently.
involviert@reddit
I mean single-handedly inventing working LLMs is not exactly "sleeping on LLMs" in any sense we are talking about. You can call Meta "sleeping" if you want; they probably just did other things with the AI chips they made themselves. But they were quick to catch up, as you know. Amazon bought into Anthropic. Microsoft basically funded ChatGPT. And Apple still has nothing but a plan to put a shiny Apple brand on third-party services. That's just a whooole other level of sleeping.
Lemgon-Ultimate@reddit
I laughed hard when I saw Apple's paper. OpenAI's next leap is o1's reasoning ability, so it's kinda funny they published it. I assume "reasoning" can be interpreted in different ways.
bwjxjelsbd@reddit
Well, they kind of have a point. Try asking an LLM (that’s not o1) “How many days from x date to y date” and most of them can’t get the right answer. I tried this and only the Llama 405B model got it right, and the results were not consistent.
FaceDeer@reddit
LLMs aren't good at math. Can you reliably answer "how many days from x date to y date" without reaching for external tools? Does that mean you can't reason?
bwjxjelsbd@reddit
Sure, I just need to do a calculation.
LLMs just “acted” like they did the calculations and still came up with wrong answers anyway.
FaceDeer@reddit
And when you "do those calculations" do you do them in your head, or do you use tools of some sort?
There are a few people who have developed tricks for calculating specific sorts of fancy things in their heads, like those date ranges, but generally speaking humans aren't good at that. That's why they make and use tools. Being able to do math in your head is not an indicator of "reasoning."
RedditLovingSun@reddit
I mean, I could def do it if I had some pen and paper to write stuff down. Which an LLM kinda does have access to, by writing CoT to its context.
kryptkpr@reddit
The average human can't tell you how many days between Feb 12th and Sept 3rd, either... that's calculating, not reasoning.
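For the record, that calculation is one line of tool use in Python:

```python
from datetime import date

# One line of tool use settles what an LLM has to guess token by token:
days_between = (date(2024, 9, 3) - date(2024, 2, 12)).days
print(days_between)  # 204 (2024 is a leap year)
```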
triggur@reddit
Considering Apple’s big “apple intelligence” contribution is hot garbage, it’s hard to take them seriously.
asankhs@reddit
I have implemented this in optillm - https://github.com/codelion/optillm/blob/main/optillm/cot_decoding.py I managed to replicate their results, we posted it a few weeks back here - https://www.reddit.com/r/LocalLLaMA/comments/1fnzxbo/cot_decoding_eliciting_reasoning_from_llms/
DreamGenAI@reddit (OP)
Great work! Thanks for sharing. Would you be able to run it using the same models they evaluated in the paper (so that we can compare the scores)? In your post I see you evaluated on the newer Qwen.
nero10579@reddit
For anyone who is just passing by this, read the paper, it is interesting.
mr_dicaprio@reddit
I played with cot-decoding last week, but on the step-by-step level and got some promising results, even with a llama 3.2 3b model:
https://github.com/eryk-mazus/no-reason
JungianJester@reddit
The lack of true CoT reasoning has soured me on LLMs for anything other than chat and maybe some programming. Reasoning needs to emerge from a worldview, and without that, results will be suspect. I have shut down my server with a GPU and gone back to running instances on my laptop with Llama 3.2 3B; it's good enough for now.
SkyInital_6016@reddit
Scientists discover thinking