Why isn’t LLM reasoning done in vector space instead of natural language?
Posted by ZeusZCC@reddit | LocalLLaMA | 147 comments
Why don’t LLMs use explicit vector-based reasoning instead of language-based chain-of-thought? What would happen if they did?
Most LLM reasoning we see is expressed through language: step-by-step text, explanations, chain-of-thought style outputs, etc. But internally, models already operate on high-dimensional vectors.
So my question is:
Why don’t we have models that reason more explicitly in latent/vector space instead of producing intermediate reasoning in natural language?
Would vector-based reasoning be faster, more compressed, and better for intuition-like tasks? Or would it make reasoning too opaque, hard to verify, and unreliable for math/programming/legal logic?
In other words:
Could an LLM “think” in vectors and only translate the final reasoning into language at the end?
Curious how researchers/engineers think about this.
rc_ym@reddit
I've had a suspicion since LLMs first started showing emergent skills that what we are really seeing are properties of language, not the model tech. If that's true, then reasoning would fall apart. Still, though, isn't "vector space" just the calculation needed to get to the next token, which is essentially the language anyway?
Silver-Champion-4846@reddit
dense language that expresses in math?
rc_ym@reddit
Yeah, how LLMs work. 😛
Silver-Champion-4846@reddit
I call it Embeddish
AnActualWizardIRL@reddit
Here's the thing. Why do *you* reason in textual space instead of latent space? Are you sure you even think in text? Are you even sure THEY do? The text is a representation, not the thing in itself. I'd argue that in both wet and transformery brains, it's a bit of column A and column B. We perceive ourselves thinking in language, and there's some evidence that language does structure our thinking, although spatial and temporal representations (images, processes over time, etc.) do as well. And since we're training these things on text, it's *highly* likely language structures LLM thinking too. It's inevitable. It's why these little guys seem to reason in a way that LOOKS familiar, rather than as some sort of spooky shoggoth, which is what a lot of us seemed to think they were in the early days. But under the hood, it's just vectors passing through a pipeline (or in our case, neurons squirting activation juice into other neurons).
TheRealMasonMac@reddit
There are people who think without an active inner monologue. I believe it's roughly 30-50% of the population. A paper a few years ago found that they performed worse at verbal working memory and recognizing rhymes unless they were allowed to reason out loud. [1] At the same time, it's also known that reasoning isn't inherent to language. Some autistic savants can mentally perform complex math operations thanks to unique mental processes that intersect with synesthesia (though this is still poorly understood AFAIK).
[1] https://pubmed.ncbi.nlm.nih.gov/38728320/
DerDave@reddit
Yes, my thinking is completely without an inner monologue. It's hard for me to imagine what it's like for other people who have a constant narrator in their head.
AnActualWizardIRL@reddit
Mostly we aren't. It's more when we're thinking about thinking that we start wording words in our brains. Or at least phenomenologically. Then again, I'm also not sure we are even conscious most of the day, just cruising along on autopilot, until we have reason to notice things (i.e. our "attention mechanism" latches onto something requiring the big-boy brain to reason about).
I'm a little skeptical that the non-verbal reasoners are as high as 30-50 percent. We'd have noticed a hell of a lot earlier.
TheRealMasonMac@reddit
So, I checked the sources:
- 50-70% of the population has an infrequent inner monologue
- Ergo, 30-50% of the population have an ongoing inner monologue
- 5-10% of the population lack an inner monologue entirely
AnActualWizardIRL@reddit
That seems a bit more plausible
TheRealMasonMac@reddit
It’s all self-reported so who knows. Brain scans would probably be more reliable but there’s probably not a whole lot of money to do that.
AnActualWizardIRL@reddit
Btw. Anthropic's research section is *excellent* stuff for us folks who've forgotten more calculus than we remember but want to keep on top of the broad strokes of where a lot of the research is heading
thuanjinkee@reddit
You could also try OpenMythos https://github.com/kyegomez/OpenMythos
lol-its-funny@reddit
I literally thought the same thing 3 years back when studying transformers. My mental model then was that we (humans) get text, convert it to a thought, process the thought, and then put it back into text.
So first off, transformers already operate in vector (latent) space. Every token is embedded into a high-dimensional vector, and all computation—attention, Q/K/V projections, FFNs—happens there, with representations changing across layers. The evolving hidden states are what drive next-token prediction, so in that sense the model’s “thinking” is entirely in vector space.
What we call chain-of-thought isn’t the model translating a finished internal reasoning trace into text. It’s part of the computation itself—forcing intermediate tokens helps steer the trajectory through latent space toward better answers. More precise framing: models perform statistical transformations over representations, and “reasoning” is an emergent behavior, not explicit symbolic logic.
We don’t rely on pure vector reasoning externally because it’s opaque and hard to supervise. Language gives us a training signal, interpretability, and control (debugging, alignment, verification). So the system already thinks in vectors—we just use language as the interface to guide and inspect that process.
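A minimal sketch of that point, using the Hugging Face `transformers` API (the GPT-2 checkpoint is just an arbitrary small example): the layer-by-layer "thinking" lives in the hidden-state tensors, and the logits are only the final read-out into vocabulary space.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # any causal LM would do here
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Let's think step by step:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One [batch, seq_len, d_model] tensor per layer (plus the embedding layer):
# this is the vector space the model actually "reasons" in.
print(len(out.hidden_states), out.hidden_states[-1].shape)

# The logits are just a linear projection of the last hidden state into vocab space.
print(out.logits.shape)  # [batch, seq_len, vocab_size]
```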
spiddly_spoo@reddit
I thought part of the latent chain-of-thought approach was that you don't use the nonlinear relu/gelu/sigmoid/whatever functions that usually apply a sort of discrete selection of "neurons". Human language, human conceptual models, and inference are largely discrete. The transformer is perhaps clicking into a certain train of thought or concept or inference with these nonlinear functions, and the idea would be to not apply them or something? To let the semantic vector drift further through semantic space via more linear transformations before applying the relu/gelu functions. Idk.
BalorNG@reddit
Wait, WHAT? Are you even human? (Not a rhetorical question nowadays, unfortunately.) Have you ever opened a fucking dictionary and seen the dazzling array of different, sometimes contradictory, meanings assigned to one word?
Why does an English speaker who has never heard a word like schadenfreude instantly get it upon reading the definition?
I'm semi-fluently bilingual and often find myself reaching for another language to communicate subtle "differences in semantics" when I'm trying to communicate complex concepts.
That's because our internal "mentalese" is anything but discrete, but the sounds/letters we communicate with have to be. Speech bandwidth is laughably narrow, so the "multidimensional mentalese thought embedding" has to be collapsed/compressed into a language token if we are to have any sort of communication in a reasonable length of time... and this compression is very lossy.
If/when we have BCIs that take the raw mentalese "embeddings", pack them into blobs taking several megabytes per packet, and directly transmit those between PCs and other people, we won't need language, and a LOT of conceptual/philosophical confusion will disappear in a snap.
There are people lacking internal monologue yet having rich non-verbal representations and they do just fine when it comes to reasoning... Sometimes better, if what I've read about Einstein is to be believed.
It will have to be translated into each person's unique internal model each time, of course, and it might not be an entirely lossless/perfect process, I guess.
spiddly_spoo@reddit
What I was thinking with "discrete" was more simply that our model of reality has discrete concepts like "tree", "planet", "animal", etc. We need some sort of discrete labelling and boundaries to be applied to the raw qualia streaming at us. You could argue that the distribution of genotypes and phenotypes is quite continuous and maybe there are no crisp boundaries to what a tree is, as it may seamlessly slide into some other life form that we have a different category for. And more continuous yet would be the space of all possible genotypes and phenotypes, which may be so continuous that trying to divide life into discrete species becomes totally arbitrary... and yet when I say "tree" people know what I mean without referring to a specific instance. Human cognition has been able to sublimate many experiences and observations into a sort of platonic concept of a "tree". We definitely don't need language to think, but the way we understand reality and think, even without words, depends on/uses a conceptual model of reality built from discrete concepts. So navigating this conceptual model of reality does require discreteness. Anyway, that's what I had in mind by discreteness.
BalorNG@reddit
Well, I understand your point, but you are, heh, missing the forest for the trees.
When it comes to, say, logic, where the strict compositionality and "discreteness" of language is a feature - well, it is a feature all right.
When it comes to concepts - it is most certainly a bug.
If you had the abovementioned BCI, you could very easily have sent an "abstract, Platonic tree" by sending an embed with what is currently implied by a tree defined and other possible characteristics left "undefined", OR you could have sent "a bent willow of genus X, growing near a river, gently swaying in the breeze and touched by the rays of the setting sun" - all in one "packet", if it illustrated your point better.
And the fact that a "tree" is a point on a continuum of species stemming from blue-green algae to bees, humans and whales - that can also be embedded.
Along with many, many other things we don't even have words for currently, because they lie outside the long tail of a typical language's distribution, but are still there.
We can emulate it by stringing a lot of "language tokens" together to reconstruct the shape of a "semantic cloud", like running an analog signal through an ADC, sending it and converting it back with a DAC, and it can be a very decent approximation in most cases. But the source is analog, and by staying analog all the way we could sidestep conversion losses. As meat humans we don't have that option, because language is produced by our "meat parts" and the resulting bandwidth is terribly slow, so we have to sacrifice level of detail/clarity for brevity.
spiddly_spoo@reddit
Hmmmm. To have any type of reasoning that involves multiple things there must be categories at play. But this doesn't say anything about the nature of the semantic space this reasoning/thinking takes place in. I think I agree that semantic space is continuous... but practically speaking categories must be used, but maybe this can be accounted for through dimensionality. Like having a thought "I saw two cherry trees today" could exist as a point in semantic space where one component of the vector is two-ness, another "I am experiencing" etc etc.... I don't know. I guess that is obviously what a semantic space would be, something where each dimension is some linearly independent atomic concept. Maybe the discreteness I was thinking of is actually just in the dimensions of the space. I'm too tired to think but I will ponder this tomorrow morning haha.
BalorNG@reddit
Well, personally I rather despise platonic essentialism/idealism because it glorifies or, I daresay, idealises a map of reality (our internal model of reality, which is incomplete and compressed even in mentalese) vs reality itself, where there are no trees, chairs or, ultimately, even atoms, just a continuous wavefunction that "collapses" into discrete phenomena... but even quantum mechanics is still a human-made map of reality (the best we have so far), not reality itself, which "we" don't experience directly - Plato's cave is a valid philosophical concept that stood up to scrutiny, unlike his forms.
We have to use concepts and categories to "make sense" of this picture given limited time and bandwidth, but while this is a "lesser evil" to escape the "blooming, buzzing confusion" (c), I firmly consider it an evil that is best minimized - just like quantizing a typical model to 4 bits might be an "acceptable sacrifice", but it still degrades it.
jubilantcoffin@reddit
Your understanding of the basic building blocks of a transformer (or neural networks in general) is off. You absolutely need nonlinear functions in order to get it to do anything useful.
lol-its-funny@reddit
I don't understand what you're saying. I think you're confusing dimensionality and representation. Activations act on individual outputs/resulting scalars, like y = f(x) where x and y are scalars.
SkyFeistyLlama8@reddit
Human brains already use vectors as in the array of neurons that gets activated once stimuli arrive and the flood of thinking that comes after that.
Can human thinking be broken down into compute cycles? Like do we think at a fixed refresh rate like a CPU's clock speed?
Polite_Jello_377@reddit
Put the pipe down bro
SkyFeistyLlama8@reddit
Juice_567@reddit
There's Coconut (chain of continuous thought) and JEPA, which have similar ideas. This is difficult to train, though: a latent space is the result of training, but if you train your network, the latent space changes. So you need to freeze parts of the model so you don't run into the moving-target problem.
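A tiny, hedged illustration of the "freeze parts of the model" point (the choice of which blocks to freeze below is arbitrary, not from any specific paper): parameters with `requires_grad=False` stop updating, so the latent space they define stays a fixed target while the rest of the network trains against it.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze the token embeddings and the first transformer block (illustrative choice only),
# so their latent representations don't move while the remaining layers are trained.
for name, param in model.named_parameters():
    if name.startswith("transformer.wte") or name.startswith("transformer.h.0."):
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```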
MuDotGen@reddit
Somebody asked a good question earlier: how would you even "read" the logic if the logic is performed in vector space? It sounds like SAEs and other tools are being looked into to filter out and translate the raw logic into human-readable syntax. I am no expert and just learning about this myself, but this is really interesting! If anyone has any more insights or knows the current limits, I'd love to hear.
NextWeather7866@reddit
Logic structures are inherently in there somewhere. Turning them into human-readable syntax, I think, would be extremely difficult. When you actually think things through logically, you aren't doing it just in your internal voice (some people don't even have an internal monologue, which is called anendophasia; no visuals is aphantasia), but also stringing visuals together, placing things in time and space. Importantly, you aren't calling specific images or words by name either: when you imagine where you left your keys, visuals flash in front of your mind's eye, but you aren't voicing those images internally; they just get recalled as you think through your problem.
I guess what I am saying is, if it's not even interpretable from one person to another, then interpreting vectors in latent space and understanding them is going to be a real challenge.
It's probably more useful to imagine it as input being pushed through the system and being shaped into an output, to understand the difficulty in interpretability.
crantob@reddit
As someone who grew up multilingual, my brain abstracted thought early on and I do not 'think in a language' at all.
I've been waiting for LLMs to do the same since I invented CoT (along with many others).
NextWeather7866@reddit
Welcome to the club of being multilingual, it’s an incredible gift.
CoT has been an important addition to ML, and if you contributed to it, I genuinely applaud you and the other researchers who developed the field.
I don’t think a vector-language is technically impossible. My point was more about interpretability and transfer. If reasoning happens in a latent/vector format, we should not expect it to translate neatly into ordinary language. The model’s weights define the learned structure, but a particular thought-like process lives in the activations as an input moves through the network.
So a “vector-language” would probably be less like reading off which neurons fired, and more like learning a stable representational interface: identifying activation patterns or features that carry semantic, causal, spatial, or logical information, then using those patterns directly for reasoning or communication.
Some work already points in this direction, such as attempts to transfer specific behaviours between related models via activation-space alignment: https://arxiv.org/abs/2604.06377. But that seems much easier within related model families than between completely different models, because the dimensions (weights, layers and all) can differ.
So I don’t think models have internal representations that are simple drop-ins for other models. Even if two systems encode similar semantics, the coordinates may not line up. Having a vector-language and having another model understand that language are two different things. One is a representation; the other is an aligned interpreter.
Which makes me wonder: if you need a translation/alignment layer for every model family anyway, is the goal really a universal vector-language, or is it more like building model-specific internal APIs for latent reasoning?
redditrasberry@reddit
There's a bit of anthropomorphisation in how people believe CoT works, I think. As in, they think it works the way their own thought processes do. But mechanistically, it is an explicitly language-dependent process. That is, it outputs a sentence of reasoning. Then how does that reasoning impact the rest of the output? It is re-ingested as part of the context, i.e. it plays as tokens all the way through the input layers of the network, re-activating everything, including all the attention layers. If all it did was "think" in latent space, it wouldn't have the ability to re-activate those and potentially derive a different outcome. A lot of the research has shown that "reasoning" doesn't function by introducing logical constructs, and in fact the success of the final outcome is uncorrelated with the correctness of the "reasoning" output. What seems to drive it is that outputting reasoning creates better context for the full network to produce an accurate result, often giving it an opportunity to break out of an incorrect line of thinking.
So although theoretically you could try to construct a model that executed reasoning in latent space (say, by feeding the reasoning pathways back directly into inner bottlenecked layers rather than the normal input layers), the evidence so far is very unclear that it would be helpful, and if anything suggests it would be harmful.
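A toy greedy-decoding loop (arbitrary small model, no sampling) that makes that mechanism concrete: each "reasoning" token is collapsed to a discrete id and appended to the context, so it re-enters the network through the normal input path and re-activates attention over everything before it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Q: What is 17 + 25?\nLet's think step by step.", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(30):
        logits = model(ids).logits[:, -1, :]          # read-out from the last position's hidden state
        next_id = logits.argmax(dim=-1, keepdim=True) # collapse the latent state to one discrete token
        ids = torch.cat([ids, next_id], dim=-1)       # the token (not the latent vector) is fed back in

print(tok.decode(ids[0]))
```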
Tokarak@reddit
I think I noticed something like the following when I look at Claude’s chain of thought. He appears to completely misunderstand the question in the chain of thought, and then come out with the right answer anyway. My vibes say that CoT is a very inefficient method, and there should be something better. (Confidence: low; this is nothing but vibes based on anecdotal evidence.)
Farmadupe@reddit
I'm not sure, but I'm guessing this is actually because CoT tokens in frontier models are hidden and summarized by special summarization models. These summarizers can be a bit slow and fall behind the main model. So when the main model is ready to output the answer, there's no point pausing the real output while we wait for the CoT summary to finish, so instead the CoT summary just gets cut off partway through.
It's also beneficial for the frontier labs, as they don't really want to reveal their CoT to the outside world ("oh no! we found an opportunity to not reveal our secrets!").
Just my guess.
gpalmorejr@reddit
The text you are seeing during the thought process is just the result of the output matrix decoding its internal thoughts. The difference is, where you can keep your internal monologue to yourself, the models can't. The computer is always watching. It would be like someone having a text decoder wired directly to your brain: they would be able to see every concept that passed through, without you having a choice.
Mathematically, though, thinking IS just vector math. It iterates against the KV cache vectors using the output of the decoding matrix, then reads back through the ending vectors and iterates again. Each time, it encodes new information in the form of a vector onto that end vector, creating a new one that represents its reasoning process so far. Every one of these iterations also represents a total pass through all of the matrices to generate the new vector that is used for the addition. So it is basically doing what you are saying; it is just like a toddler, with no ability to think without saying it. But mostly because we literally read its thoughts onto the same display we paste its output on.
FateOfMuffins@reddit
I mean one reason is the whole "safety" aspect of it, which is why a lot of the big labs are committed to making their CoT readable.
I've seen plenty of small papers here and there claiming to have improved reasoning efficiency without using natural language in the CoT, but the first-impression reaction they often get is:
"You made the AIs reason in Neuralese, from the classic sci-fi novel Don't Make the AIs Reason in Neuralese."
LatentSpaceLeaper@reddit
Yes, over at r/agi they love to point out the danger of Neuralese:
https://www.reddit.com/r/agi/s/OLZhOuJV1l
LatentSpaceLeaper@reddit
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
catplusplusok@reddit
How would you prepare training datasets?
HorriblyGood@reddit
You can use RL on the final decoded answer
jubilantcoffin@reddit
But currently the training requires bootstrap traces?
robobub@reddit
You switch regimes once your base thinking is good enough
jubilantcoffin@reddit
This requires both regimes then and hoping it decides to switch over during RL? May be a very hard local minimum to escape from.
Far-Low-4705@reddit
The CoT needs to result entirely from RL training.
With current CoT, it is more or less bootstrapped from the start using fake reasoning traces to give it an initial boost in training time.
But with this, it needs to derive itself completely, which is very difficult to do, very expensive, and not feasible in practice unless you come up with a new training method.
milkipedia@reddit
By embedding them?
catplusplusok@reddit
They are already converted to embeddings but it's 1:1 token to initial embedding mapping. The question is how to make datasets that are in intermediate LLM meaning representation
aaronr_90@reddit
Vibes
aurelivm@reddit
GPUs really like it if you use the same compute graph every time. Recurrent connections like you would need for latent reasoning make the shape of the model highly variable. Even if latent reasoning got you, say, a 20% bump in reasoning performance per GPU-hour spent on RL, that would probably be offset by it being way slower.
x1250@reddit
IMO, because nobody knows how to do it. Probably the only way to do it is to track what happens in the vector space during the generation of text, but even that is not complete. There IS some thinking without the generation of text AFAIK, but it is not well understood right now.
LumpyWelds@reddit
They've known how to do it for maybe 2 years now. One of the problems with this kind of model is you can't follow their thought process and can't tell if they are lying.
gfernandf@reddit
It's an architecture issue that we insist on solving with LLMs; another paradigm is possible https://zenodo.org/records/19438943
SageThisAndSageThat@reddit
As opposed to educated guesses?
MuDotGen@reddit
Apparently SAEs are becoming the solution to that problem, and ironically they have only recently been used in this area, over the last 18 months or so.
standish_@reddit
What does this stand for?
MuDotGen@reddit
Kind of like little neural networks trained to pick out semantic concepts from an entire layer's activations in the model, from what I understand. Many neurons can map to multiple different semantic meanings at the same time, but amidst all this "noise", an SAE can learn to distinguish thousands of individual (monosemantic) concepts that have their own little weights. Apparently this is one way people are able to ablate models by "zeroing" only specific censorship concepts or making models more honest. But its main use is to be like a microscope to peer into the "mind" of an LLM and translate the concepts into human-readable text. (From my understanding so far.)
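For anyone curious, here is a hedged, minimal sketch of what such a sparse autoencoder looks like (the dimensions and sparsity coefficient are made up, not from any particular paper): it learns an overcomplete dictionary of features over a layer's activations, with an L1 penalty pushing most feature activations to zero so individual features tend toward single meanings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete: many more features than dims
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse feature activations (the "concepts")
        recon = self.decoder(feats)             # reconstruction of the original activations
        return recon, feats

sae = SparseAutoencoder()
acts = torch.randn(8, 768)                      # stand-in for real residual-stream activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
```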
standish_@reddit
That is very cool. The ablating techniques are nice to have for deprogramming (heh) models.
teleprint-me@reddit
SAEs -> Sparse Autoencoders
xmnstr@reddit
CoT doesn't exactly help with that either so I'd say it's fine to let go of by now.
a_beautiful_rhind@reddit
So what? The weights are fixed. There are models that don't reason at all. Training small models and seeing if they perform better would show if the idea is worth doing.
HorriblyGood@reddit
It can be done. It’s called latent chain of thoughts. It’s an active research direction.
gfernandf@reddit
https://zenodo.org/records/19438943
the_ai_wizard@reddit
Literally read an article on HN today
paulqq@reddit
Could you please provide link tx
throwaway292929227@reddit
I lol'd a little at the thought that @The_ai_wizard is an AI bot with an LLM that thinks "today" is Dec 31, 2018 or January 1 1970.
paulqq@reddit
rofl
Far-Low-4705@reddit
I think you are correct.
There was an interesting finding in this community a while back, where if you repeated layers in qwen 3.5 27b, it not only stayed coherent, but increased performance.
So the models already have some internal representational language they understand, similar to human thoughts.
I think it is absolutely doable, but I think it is too expensive to get out of RL from scratch, so we need to find a way to use what the models already understand/learn internally.
ASYMT0TIC@reddit
The human brain is arranged into cortical columns, and the cortical columns in your own brain do exactly this; in fact, recurrence is the main feature. Over 90% of all synaptic connections in the cortical columns of the human brain loop back into their own, or prior layers.
Orolol@reddit
It can be done, but the problem with anything in latent space is: how do you train it?
The current CoT is very effective because we can RL on it. We can make a model generate, for example, 10 CoTs, keep only the ones that get a correct result, and train to favor the shortest one. You can iterate on this 1000 times and you'll have a model that thinks more efficiently. You can also analyze those CoTs, find logical errors, and use those to reward or punish the model during post-training.
None of those things can be done in latent space. We can only train on outputs, therefore we have far less leverage to get a precise loss.
There are good papers about this (https://arxiv.org/html/2602.10520v2) but there's still a lot of work to be done.
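A toy sketch of that rejection-sampling idea (the `generate` and `verify` callables are hypothetical placeholders, not a real API): sample several chains of thought, keep the ones that reach a correct answer, and prefer the shortest as the training target. The point is that every step of this supervision depends on the trace being readable text.

```python
def pick_best_cot(prompt, generate, verify, n_samples=10):
    """generate(prompt) -> (cot_text, answer); verify(answer) -> bool. Both are hypothetical."""
    good = []
    for _ in range(n_samples):
        cot, answer = generate(prompt)
        if verify(answer):            # keep only traces that land on a correct final answer
            good.append(cot)
    if not good:
        return None                   # nothing usable for this prompt; skip it
    return min(good, key=len)         # among correct traces, favor the shortest
```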
MuDotGen@reddit
I've actually wondered about this idea though. It blew my mind when I saw that semantics could be simulated via trained patterns of words and their surrounding context and translated into vector spaces to do actual math, which can be done deterministically and very fast via computers. Now what if we could do the same with something using formal logic and reasoning? We already use symbols and mathematics to build formal logic (and heck, we compile human-readable code into machine code full of these logic gates like and/or/xor). Could we not use linguistic layers to translate natural language into formal logic parameters, then transform back into natural language? We already solved part of this problem, so it feels like the reasoning part is a natural next step of research.
Edit: Looking around a bit, apparently it is an active field of research. Neuro-symbolic reasoning in the latent space.
ActivationReasoning (AR) seems to be this kind of practical framework idea of embedding actual logic into the latent vector space of an LLM. I personally have no idea if this in particular is legit, but this research space is quite active right now, it seems.
throwaway292929227@reddit
This is going to show how dumb I am, but I am using agents and LLMs that tool call logic-only functions from MCPs pointing at old-school wolfram-type calcs using /tool math[+,-,/,Sqr,log,cos,...] for standard in/out functions. I thought this would save tokens and time, and increase accuracy. After reading this thread, I feel like I am either crazy or years behind. Am I an idiot?
Agreeable-Market-692@reddit
No, you're doing great. This is fine. Just keep an eye on the amount of tokens in the MCPs/tool defs.
ItilityMSP@reddit
Because most reasoning isn't logical deduction, but inference from knowns, assumptions, and unknowns, all to a certain confidence level. And this is not the biggest issue with LLMs: it's the fact that they are not grounded. There is a difference between thinking about something and doing something, and they are terrible at doing stuff in meatspace.
eat_my_ass_n_balls@reddit
It’s been done and it’s scary because the tracing becomes very difficult and it also provides a surface for the LLM to hide what it’s thinking from what it’s “telling you”
Silver-Champion-4846@reddit
Not unless you crack the vectorian language with SAEs or whatever next technique will be invented
123vovochen@reddit
It is done, it is called Looped LLMs, and they are much more knowledge-dense!
Legumbrero@reddit
https://arxiv.org/abs/2412.06769
gfernandf@reddit
https://zenodo.org/records/19438943
bonobomaster@reddit
Coconut llama.cpp integration when? ;D
MuDotGen@reddit
That feels like that old video of two phones "talking" to each other in "machine code" from ChatGPT. Being able to reason together in latent space is fascinating. A little scary, to be honest.
BillDStrong@reddit
What could possibly go wrong with allowing 2 amoral LLMs to talk in a secret language that we can't read? No problems here, nosiree Bob.
my_name_isnt_clever@reddit
How is that different from 1 amoral LLM thinking internally in a language we can't read?
BillDStrong@reddit
A few ways. One, 2 models running can be run at different temps and other settings, with different cache setups, and KV compressions.
So even the same model can come up with vastly different answers. This means you have 2 idea machines, not one, just like how we use multiple agents to increase the productivity of LLMs.
Also, one evil LLM can infect a good LLM, so now you have a contagion problem.
And different models will be trained to be good for different things, which also increases the abilities of the LLMs.
The same multiplicative effects that happen with multiple people is the same that happens with multiple LLMs.
AiDreamer@reddit
Just let them call tools...
Quirky_Inflation@reddit
It's not even talking at this point, more like sharing thoughts.
BillDStrong@reddit
I look at language as pointing to things in idea space. This would be the same, just directions in idea space.
The interesting thing would be can they add new thoughts to each other?
metmelo@reddit
that video is super fake
MuDotGen@reddit
Yes, which is why I put "talking" and "machine code" in quotes.
MixtureOfAmateurs@reddit
It looks like you could make reasoning much more efficient; when you don't need to do the final projection to logits you save a lot of time. But I think the value of thinking in latent space over English is either really hard to achieve irl, or not there. Like you can turn a token into latent space and back into the same token, so thinking in latent space isn't preserving any more info because English thoughts get translated pretty losslessly back to latent space during embedding.
Sorry if that's confusing.
Btw, some models use the same weights for embedding and unembedding, so if you generated a logit that's near probability 1 and then embedded it, you'd get something that almost exactly matches the last layer's latent state. But some models have different weights for embedding and unembedding, so you'd get something different. Idk which method is trending rn. I know GPT-2 is different weights and Gemma 2 (or 1?) was same weights.
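A small sketch of the tied-versus-untied point (toy dimensions, shown in PyTorch): when the unembedding is just the transpose of the embedding matrix, re-embedding the argmax token gives you back roughly the direction the last hidden state was pointing in; with separate weights there is no such guarantee.

```python
import torch
import torch.nn as nn

vocab, d_model = 1000, 64
embed = nn.Embedding(vocab, d_model)                  # input embedding matrix [vocab, d_model]
unembed_free = nn.Linear(d_model, vocab, bias=False)  # separate unembedding (the untied case)

hidden = torch.randn(1, d_model)                      # stand-in for the final hidden state
logits_tied = hidden @ embed.weight.T                 # tied case: reuse the embedding weights as read-out
logits_free = unembed_free(hidden)                    # untied case: independent read-out

token = logits_tied.argmax(dim=-1)
roundtrip = embed(token)  # with tied weights, this is the embedding most aligned with `hidden`
```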
zulrang@reddit
Just came to say, this is literally Coconut
peatthebeat@reddit
Coconut! I love it :)
gfernandf@reddit
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6600840
https://zenodo.org/records/19438943
That’s a really interesting direction — Coconut is basically making latent reasoning *explicit* and reusable within the model’s own internal loop.
To me it highlights a key distinction:
- latent/continuous reasoning → efficient, parallel, can explore multiple paths (like their BFS idea)
- language-based reasoning → slower, but inspectable and externally controllable
What I find interesting is that even if latent reasoning becomes stronger, we still face a practical problem at the system level:
we don’t have a way to *compose and reuse* those reasoning steps across runs or tasks.
Coconut improves how reasoning happens *within a single forward pass*, but most real-world systems still need:
- persistence
- composability
- control over multi-step workflows
So it feels like there are two orthogonal directions evolving:
1) improving reasoning inside the model (latent space, continuous thought, etc.)
2) structuring reasoning outside the model (workflows, tools, explicit steps)
My intuition is that we’ll end up needing both — latent reasoning for efficiency, and external structure for reliability and reuse.
Curious how people see the interaction between these two layers long term.
AreaExact7824@reddit
Is it different from how many B parameters a model has?
KickLassChewGum@reddit
They do think in latent space. It makes them decide what token to output next, and then that token will influence the latent space and push it towards the next token after that, which will then push the latent space to the next token after that, and so on.
For this to work, you'd need to give the model a way to influence its residuals without sending a token down the forward pass.
1EvilSexyGenius@reddit
I see the whole vector space , adjoint method and ODEs are making a comeback 👀
Elusive_Spoon@reddit
lol, they do! https://dnhkng.github.io/posts/sapir-whorf/
Luoravetlan@reddit
Maybe I don't quite understand the topic discussed, but it seems to me the author of the article knows very little about natural languages. 4 of the 8 languages he picked are Indo-European. Also, he probably doesn't know that Japanese and Korean are much closer to each other than Chinese is to either of them. It would probably have given roughly the same result if he had just picked English, Chinese, Korean and Arabic, as they differ from each other very much.
autoencoder@reddit
Looking at similar languages also yields valuable information. What other 4 languages would you have picked, given the author's resources?
Luoravetlan@reddit
Hungarian or Turkish, Telugu, Swahili, Indonesian.
Danger_Pickle@reddit
An impressively written article.
ZeusZCC@reddit (OP)
It's not CoT. It's activation space.
wiltors42@reddit
I almost forgot about this. Did he say anything about how much repeating inner layers hurt tok/s?
Cosmicdev_058@reddit
This exact thing is what Meta's Coconut paper explores. They feed the model's last hidden state back as the next input embedding instead of decoding it to a word token, which lets the model reason in continuous space. Outperforms regular chain-of-thought on logical reasoning tasks that need backtracking, with fewer thinking tokens. Tradeoff is full loss of interpretability.
https://arxiv.org/abs/2412.06769
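A very rough sketch of that loop (not the paper's actual code; GPT-2 is just a stand-in whose embedding and hidden sizes match): instead of decoding the last hidden state into a token and re-embedding it, the hidden state itself is appended as the next input embedding for a few latent steps before switching back to normal token decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Problem: ...", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)              # [1, seq_len, d_model]

with torch.no_grad():
    for _ in range(4):                                   # a few "continuous thoughts"
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # final layer, last position
        embeds = torch.cat([embeds, last_hidden], dim=1) # feed the vector back; no token is emitted

# Only after these latent steps would the model resume ordinary token-by-token decoding.
```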
unlikely_ending@reddit
It's a big research direction (including by me) coz a lot of people think it should be.
brown2green@reddit
It might be possible in the future as practitioners abandon probabilistic methods, but I'm skeptical that it will actually go anywhere, unlike images/video:
OkFly3388@reddit
All LLMs, in fact, reason in vector space. The first ~5 layers transform tokens into vector space, then all the other layers do the actual reasoning in vector space, and the last ~5 layers transform back from vector space to actual tokens. And there's even a study about duplicating the actual "reasoning" layers which showed that this works.
But the question is: why?
If you can't read what the LLM reasons, you can't train it.
ketosoy@reddit
My understanding is that this is already happening in the mid layers’ kv cache attention stream
am2549@reddit
Just read Leopold Aschenbrenner's book; he also suggests this is one of the levers for optimization.
ken107@reddit
Reasoning is the repeated application of logical deduction, operating on symbols that may represent real things. Real things have names, therefore a CoT is inherently and readily verbalizable. I've never had thoughts that could not be verbalized. Even when I didn't know names for concepts, there were names or ways to describe them. This property appears to be intrinsic to nature.
123vovochen@reddit
I ASKED MYSELF THIS SO OFTEN !
This is actually done, and it is highly successful; it's called "looped models".
So far, they have achieved 4x the knowledge density of other models.
infinitelylarge@reddit
In addition to it being easier to build text-based reasoning training datasets, this was also an intentional choice at a couple of large labs to help ensure that the models' reasoning remains more easily interpretable to humans. This is both a safety issue and also an ease-of-improvement issue. If a model's test-time compute is largely captured in human-readable text, then it's much easier to tell when or if a model tries to lie to deceive humans (e.g. when its output does not match its internal thought process), and it's also much easier to see why a model's capabilities are lacking for an intended task and how to efficiently improve those capabilities (e.g. if the model can't solve a class of math problems because it misunderstands how to use a particular mathematical theorem).
Gleethos@reddit
I believe it is just because a CoT consisting of tokens is easier to train: we can create the chain using reinforcement learning, which produces plain text again, and we can easily train the language model on such plain text using supervised learning.
Now if you want to directly input vectors from latent space, you have the problem that these are extremely rich in information. Using the same process for training may give the model "more to think about" from previous forward passes, but it would also flood the model with noise. So to make this more stable, you would want to run backpropagation through these latent-space inputs into previous runs recursively. In theory, that would allow the model to truly think persistently across time. Kind of like a human...
But then you've just turned a transformer into a giant recurrent neural network, and those are super hard to train at scale because they are inherently sequential and you need much more memory to store the gradients from previous passes. That can be millions of passes, and multiplying that by the number of weights, you can quickly see that we don't even have the hardware for it...
Tall-Ad-7742@reddit
If I understand your question correctly, then one part of the answer is security: we want to understand what they are reasoning about / how they reason.
Revolutionalredstone@reddit
Why don't you?
Words are not masks; they ARE the tools and programs (the technology) of our conception.
Enjoy
volleyneo@reddit
That’s how we get skynet
potatolicious@reddit
This is a thing though generally not very mature. State space models use a vector as a reasoning intermediate, though importantly they’re not emitted as tokens, just used in the inference.
There is also a recent paper that trained a model to emit thinking tokens in an extended token space that didn’t overlap with linguistic tokens - and proved that the model did learn representations in these new tokens. The paper suggests that this can be a suitable replacement for regular CoT tokens, and they claim to realize a ~11x speed up.
What you’re suggesting is in principle possible but is all pretty new and immature as of yet.
ortegaalfredo@reddit
Tokens are vectors so...
LankyGuitar6528@reddit
OMG!!! It just hit me! You are describing exactly how I think. I don't think in words. I don't hear words inside my head at all. I just THINK - shapes and connections and colors - the words come at the end after the thinking. I can't even imagine how you would think in words. And to be honest if you did think in words I'm pretty sure that's not actually thinking at all. Like...how would words come out if there was no thinking behind them?
sydjashim@reddit
In my understanding, contexts are used like scratchpads during the reasoning part. Hence, I suppose this doesn't let us perform the reasoning via embedding space. Recent models are good at utilising this context wisely during reasoning, meaning they opt for longer or shorter reasoning, like a human does, when required, thereby utilising the context space wisely.
dataconfle@reddit
I'm curious to know what vector-based reasoning would look like?
-dysangel-@reddit
the exact same as LLM based reasoning, because that's what they already do
wind_dude@reddit
I mean I guess it’s how you define reasoning. I think “chain of thought” works because it kind of recreates a lot of the thinking / brainstorming we do in text.
But there’s also looped transformers that basically make the transformer block recursive which on a brief look looks like togetherAI has been doing some work around https://www.together.ai/blog/parcae. I dunno if this counts as reasoning…
But I think in text is the easiest for us to understand, train and improve.
MuDotGen@reddit
Look up Coconut. https://arxiv.org/abs/2412.06769
It actually goes into the specific benefits of external reasoning via Chain of Thought versus what they call Continuous Chain of Thought. One of the biggest improvements is the ability to quickly explore a BFS-like set of possible reasonings before ever committing to a language token output. In other words, it can chain many more lines of reasoning and search several of them before finding the best one, much faster and more efficiently.
The main roadblock people cite is being unable to see that line of reasoning if it's in the latent space, but SAEs (sparse auto-encoders) have more recently been picked up and actually map the semantic embeddings into readable text, allowing us to effectively still see the reasoning during training and tune things, all without actually touching the weights themselves.
That's my understanding so far, but the gains would be higher efficiency (not having to activate all the language-specific weights used for manually outputting the tokens of a line of thought quite slowly) and accuracy, as only the best line of reasoning actually produces any output, saving on tokens with only something like 10-20% slower token generation (but only for the final output). I don't have a source on me at the moment for that 10-20% number, though, so take it with a grain of salt; I just read it somewhere and forgot.
FusionX@reddit
The AI 2027 paper refers to this as "neuralese recurrence and memory". Someone linked the relevant paper from Meta below which originally implemented this idea.
FineClassroom2085@reddit
Because reasoning isn't really what the word implies. It's just a special mode of token output that LLMs are taught to do.
demostenes_arm@reddit
LLM already “think” in the vector space. Now if you are talking about “thinking tokens” they are nothing but output tokens which the LLM has been trained to produce before giving the “final answer” output tokens and stop the generation.
Like any output, "thinking tokens" need to be trained on actual data, which can't be the intermediate layers' representations, because humans don't use them and each LLM has its own internal representations.
Nandakishor_ml@reddit
That would become Ramanujan-style thinking. Evals become difficult.
Q_H_Chu@reddit
As I remember, FAIR has already covered this, but the reason may be the cost of doing everything from the beginning.
CryptoSpecialAgent@reddit
I’m thinking about how I reason about stuff, and my thoughts tend to be some combination of verbal, visual, and sometimes audio depending on the type of thinking I’m doing
So maybe there should be more research done with LLMs that have multimodal outputs (directly from the model, not with tools) and then they can reason with words, images, and sounds
qrios@reddit
But anyway, check out coconut.
MasterLJ@reddit
They do think in vectors. It's exactly how they work. They know semantic association between tokens (a defined input vector) that share a token vocabulary.
Those representations can be made to follow logical rules through patterns in the geometry etched into models during training, which follows the path of least error.
The paths make "circuitry" that stores representations of vector interactions and can encode logic and other rules.
portmanteaudition@reddit
They do. Part of it is for human interpretation.
maycomesinlikealion@reddit
Idk but look up representation engineering
fastlanedev@reddit
Yes, good video here
https://youtu.be/VQ15-MhZE2k?si=d-YdEjMHe269p5TD
He reproduces a research paper on consumer hardware on this exactly
Kuro1103@reddit
? If it is done in vector space, it just becomes another layer.
Sorry for my poor understanding, but the deep learning architecture revolves around the idea of having one input passed through layer after layer of neural nodes, to a final output which is a vector of possible tokens with a probability distribution summing to 1.
LLM reasoning is the idea of letting the model emit some extra tokens before getting into the actual answer part.
It works the same way as if we gave it 2 instructions. The first instruction leads the model to answer sentence A, then we give it the 2nd instruction + sentence A, so it generates sentence B.
Making the LLM do reasoning in vector space is effectively adding one more hidden layer to the neural network, because the outcome right before the picking of a "natural language" token is a vector, which is exactly what every hidden layer passes to the next layer / output node.
slower-is-faster@reddit
Reasoning is just a loop over its text output
Euphoric_Ad9500@reddit
It's a lot more complicated than that. There is plenty of research on it, but the stuff I have seen doesn't match semantic reasoning in benchmark performance. There's new stuff on performing multiple forward passes in a special way that is similar to "thinking in a latent space".
LegitimateCopy7@reddit
I'm no data scientist but my intuition tells me that humans can't understand vectors therefore it's not possible to tune the reasoning process to make sure it's logical.
then there's the economic side, reasoning uses a lot of tokens and tokens mean revenue.
EndlessZone123@reddit
Tokens are a pretty linear way of tracking compute time. If we didn't use visible tokens, you would still be able to track costs. It would probably just be listed as $/s instead of $/t.
ZeusZCC@reddit (OP)
I think the answers are more important than the reasoning. If you just ask the model why it did what it did, it will explain it to you.
Luke2642@reddit
You're so funny.
UnkarsThug@reddit
Probably because we currently generally output probabilities rather than latent space vectors.
There's been some of that with text diffusion models, since they can work in the latent space (although a lot of them are non-continuous), but it's just less common.
Cultural-Broccoli-41@reddit
https://huggingface.co/ByteDance/Ouro-1.4B
I think this is a theme that is currently being studied in every company's lab as an ongoing effort. I think open models corresponding to this will appear as products, so let's wait about half a year.
Charmsopin@reddit
Because you won’t be able to verify it
faaaack@reddit
No one would use it.