The LLMentalist Effect: How AI programmers and users and trick themselves
Posted by grauenwolf@reddit | programming | View on Reddit | 88 comments
gc3@reddit
This article is untrue. I have used Cursor to add major features to programs; it made the changes and showed me a diff to approve. I don't think I could hire a psychic to do this for me.
grauenwolf@reddit (OP)
You say it isn't true, then attack a strawman. The article never claimed that a psychic could write code for you. It says that people trick themselves into grossly overestimating the AI's capabilities.
gc3@reddit
It says it's all trickery
"The intelligence illusion is in the mind of the user and not in the LLM itself."
Then goes on to explain why this is. If it were an illusion, though, I could not get decent results. I could not get it to convert a Python program to C++, I could not get it to add drag-and-drop ability to a JavaScript browser application, I could not get it to analyze a bunch of points and draw the best-matching line (for which it wrote a Python program for me to run). But it (Cursor) does these things; I have done all of them using AI just in the past week at my job.
The other alternative he gives is:
"The tech industry has accidentally invented the initial stages a completely new kind of mind, based on completely unknown principles, using completely unknown processes that have no parallel in the biological world."
I don't buy this either. Therefore the article is completely incorrect. First, psychic-style trickery cannot create actual results, and I don't believe that these processes are without parallel in the biological world, especially as the mechanics of neural networks, which LLMs run on top of, were conceived as analogous to neurons. I don't think the principle behind how LLMs work (as opposed to machine learning to recognize and classify pictures, which is well understood) is really well understood, and it is true the creators were surprised by the level of seeming understanding that appears to be embedded in text.
grauenwolf@reddit (OP)
LOL Do you really think you're the first person to convert Python to C++? That's something I would expect it to have in its training data.
This is what they're talking about. You think it's intelligence, but it's just weighted guesses. It doesn't know what Python or C++ are.
gc3@reddit
Intelligence is mostly weighted guesses. It doesn't have to be 'real intelligence' to be useful. And if you ask it what C++ is it will give you a definition!
When using Cursor for a hard problem, it will first break the problem down into smaller steps and try to solve each one, coming up with a plan. If you watch it, you can see when it is getting off base and stop it and improve the prompt. So it has to be a little interactive. It still takes much less mental effort and time than doing it from scratch.
When translating from Python to C++ it replaced the numpy arrays with Eigen::MatrixX (I think; it's the weekend, maybe it was VectorX), which was a syntax I'd not used before. I had started to convert the Python to C++ myself and, after making a couple of mistakes (I found parts were hairy to convert since I had made a bad decision earlier), decided to use Cursor. I was surprised by the clarity of the conversion after learning the new syntax: while it was not as optimal, it resembled the original Python very closely, which I thought was good since the original Python was known to work.
grauenwolf@reddit (OP)
Humans think in terms of concepts, not tokens. Tokens are mapped to concepts; for example, the word "table" maps to the idea of tables, but the concept and the token are distinct.
LLMs only understand tokens. An LLM knows the tokens "table" and "mesa" are related, but it doesn't know what either of them means conceptually.
gc3@reddit
A token for table is mapped to the concept of tables inside the corpus of text it was trained on, the same way a machine learning model maps pictures of cats to the concept of a cat.
Really the concept of a table is spread out among millions of files of text input, and includes ideas such as an 'operating table', a 'kitchen table', and a 'spreadsheet table', which the LLM can somehow differentiate because it learned the textual concept of a table from that training.
Where LLMs fall short is any understanding of a table that is not represented in text, such as how it feels to sit at a dinner table. They can pretend to know the feeling because authors have written about it, but they have no firsthand experience.
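As an aside, the "relatedness" both commenters are arguing about is typically measured as cosine similarity between learned embedding vectors. A toy sketch, with made-up four-dimensional vectors standing in for the hundreds-of-dimensions embeddings real models learn from co-occurrence in text:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 for parallel vectors, near 0 for unrelated ones.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors purely for illustration; real embeddings are learned from text.
table = np.array([0.9, 0.1, 0.8, 0.2])
mesa = np.array([0.8, 0.2, 0.7, 0.3])        # appears in similar contexts to "table"
carburetor = np.array([0.1, 0.9, 0.2, 0.8])  # appears in very different contexts

print(cosine(table, mesa))        # ~0.99: the tokens land close together
print(cosine(table, carburetor))  # ~0.33: they land far apart
```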
grauenwolf@reddit (OP)
No it's not. There is literally no way to represent concepts in an LLM. All you can do is relate a token to other tokens. It knows that token 245753 is somehow associated with token 654347, but not what those numbers 'mean' beyond that.
gc3@reddit
You are just plain wrong. The idea of semantic distance between words was developed before LLMs and is crucial to understanding why they work.
LLMs basically exploit the labels provided by language to organize a neural model.
And as for your anecdote about not recognizing their own contradictions: if that were required for intelligence, over half the people in the country would fail.
grauenwolf@reddit (OP)
There isn't a "sematics" variable in neural nets. You're just making stuff up because you don't have a real argument for LLMs being sentient, let alone sapient.
s-mores@reddit
Reminds me of the guy who was vibe coding away and shouting from the rooftops how AI was all you need.
Then it deleted his production database and hallucinated 4 small ones in its place.
In the writeup there's something that stuck out to me: "I had told [the LLM agent] yesterday to freeze things" or similar, thinking that was an instruction.
Of course it's not; you can't instruct an LLM, you give it prompts and get statistically relevant responses back.
redactedbits@reddit
Kind of. This subreddit is used to takes that lack specificity and nuance, so if that's you, feel free to upvote the parent and just move on.
You can definitely enhance agents with memory. The language models themselves don't have much of a memory, though. Sometimes the provider, like Anthropic, will provide a system that manages memory across a bounded set of conversations for you. Memory, however, can mostly be thought of as an MCP tool, and the agent has to be organized in such a way that it uses it effectively. That's all to say you could instruct an agent for longer-term tasks, but it needs to be set up for that.
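A minimal sketch of that setup, assuming a hypothetical harness rather than any vendor's real memory API: the model itself stays stateless, and "memory" is just a tool that persists notes the harness feeds back into later prompts.

```python
import json
import pathlib

MEMORY_FILE = pathlib.Path("agent_memory.json")  # hypothetical on-disk note store

def remember(key: str, value: str) -> str:
    """Tool the agent can call to persist a fact across sessions."""
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    notes[key] = value
    MEMORY_FILE.write_text(json.dumps(notes))
    return f"stored {key}"

def recall() -> str:
    """Tool that returns previously stored notes as text."""
    return MEMORY_FILE.read_text() if MEMORY_FILE.exists() else "no stored memories"

def build_prompt(user_message: str) -> str:
    # Each model call starts from scratch; "memory" is just text the harness
    # prepends before the stateless model ever sees the message.
    return f"Known facts: {recall()}\n\nUser: {user_message}"
```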
Environmental-Bee509@reddit
LLMs decrease in efficiency as the context window grows larger
ShiitakeTheMushroom@reddit
So do humans, once you get to a certain point.
Environmental-Bee509@reddit
Yeah... But LLMs are way worse
BroBroMate@reddit
Yep, that's why various LLM tools like Cursor ship "clear the context, holy fuck what" functionality. It's amazing what crap gets spat out when the context gets too big.
redactedbits@reddit
An LLM, yes, but agents can and should be actively managing context size.
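One common way a harness does that, sketched below with hypothetical names: keep the newest turns verbatim and collapse everything older into a summary (the summarize callback stands in for another model call).

```python
def trim_context(messages, keep_last=10,
                 summarize=lambda msgs: "(summary of earlier turns)"):
    """Return a shortened message list: one summary entry plus the newest turns."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # Replace the old turns with a single synthetic message so the context
    # stays bounded no matter how long the conversation runs.
    return [{"role": "system", "content": summarize(older)}] + recent
```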
Chrykal@reddit
I think you're missing the point: the prompt you provide isn't an actual instruction at all, it's a series of data points used to generate a multi-dimensional vector that leads to an output. The LLM doesn't "understand" the prompt and therefore cannot be instructed by it, only directed.
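To make that concrete, a toy illustration with an invented three-word vocabulary and random weights: the prompt becomes token ids, the ids index an embedding matrix, and everything downstream is arithmetic on those vectors, not the execution of a command.

```python
import numpy as np

vocab = {"freeze": 0, "the": 1, "database": 2}            # invented toy vocabulary
embeddings = np.random.default_rng(0).normal(size=(3, 8)) # 8-dim toy embedding matrix

prompt = ["freeze", "the", "database"]
token_ids = [vocab[token] for token in prompt]
vectors = embeddings[token_ids]  # the "instruction" is now just a 3x8 array of floats

print(token_ids)      # [0, 1, 2]
print(vectors.shape)  # (3, 8)
```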
SkoomaDentist@reddit
And it's all just probabilities. If you manage to (intentionally or accidentally) skew those probabilities in a suitably bad direction, it doesn't matter what "instructions" you have given because the rest of your prompt overrides those in the vector space. A friend of mine has used this to jailbreak the censorship of every LLM he's tried by biasing the probabilities such that the system prompt is ineffectual.
redactedbits@reddit
LLMs "understand" intention and meaning in the same way that semantic search does. Is it fuzzy? Absolutely. I also used the word agent specifically and agents have clearly demonstrated that they can follow (and fuck up) basic instructions. The question posed by OP was about an agent and its ability to maintain long term tasks. I simply answered that it is possible with todays technology to do that - it'd just be another tool and you'd probably need to make some structural changes to each iteration of the prompt since most agents call a model multiple times. Call it a sleight of hand if you want but it's really no different sleights of hand than a lot of video game programming.
DynamicHunter@reddit
You can hold a human programmer accountable, morally, ethically, legally, etc. You can’t hold an AI “agent” accountable whatsoever.
astrange@reddit
That's what insurance policies are for. If that was an issue corporations would have insurance for it.
tragickhope@reddit
Oh, did Air Canada have LLM-specific insurance when they got sued over their LLM chat bot lying about offering partial refunds to bereavement flights?
cinyar@reddit
Does anyone offer insurance on LLMs fucking up? What data would they use to set up the rates?
astrange@reddit
Its nonexistence is evidence that it isn't needed.
Ok_Individual_5050@reddit
That it can't economically be provided.
exodusTay@reddit
People really think LLMs are intelligent while they are just really good at producing convincing bullshit, with a good chunk of it somewhat useful at times. But so many times I have "instructed" the LLM not to do something and it does it anyway, or it just gets stuck in a loop trying to solve a bug where it will tell you doing A will solve it, and when it doesn't it will tell you to do B, then A again later.
Syrupwizard@reddit
Sooooo much of the useful stuff is also just ripped nearly verbatim from stackexchange or Reddit too. So of course it provides good info sometimes… just like a search engine always has. Now it just has no accountability and a bunch of extra steps.
Isinlor@reddit
You can study in detail what ML systems are doing - extract exact algorithms that gradient descent is developing in weights. We can do it for simple systems and we do know that they produce provably correct solutions.
E.g. Neel Nanda's analysis of learned modular addition (x + y mod n):
Map inputs x, y → cos(wx), cos(wy), sin(wx), sin(wy) with a Discrete Fourier Transform, for some frequency w
By choosing a frequency w = 2πk/n we get a period dividing n, so this is a function of x + y (mod n)
Map to the output logits z with cos(w(x+y))cos(wz) + sin(w(x+y))sin(wz) = cos(w(x+y−z)) - this has the highest logit at z ≡ x + y (mod n), so softmax gives the right answer.
To emphasise, this algorithm was purely learned by gradient descent! I did not predict or understand this algorithm in advance and did nothing to encourage the model to learn this way of doing modular addition. I only discovered it by reverse engineering the weights.
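For concreteness, that learned algorithm can be written out numerically. A toy sketch: the trained network actually sums over several frequencies k, but a single frequency with gcd(k, n) = 1 already makes the argmax land on (x + y) mod n.

```python
import numpy as np

n = 113   # modulus (the value used in Nanda's grokking experiments)
k = 7     # one frequency; chosen so that gcd(k, n) = 1
w = 2 * np.pi * k / n

x, y = 41, 98
z = np.arange(n)  # every candidate answer

# "Logit" for each candidate z: cos(w(x+y))cos(wz) + sin(w(x+y))sin(wz) = cos(w(x+y-z)),
# which peaks exactly where z == (x + y) mod n.
logits = np.cos(w * (x + y)) * np.cos(w * z) + np.sin(w * (x + y)) * np.sin(w * z)

print(int(z[np.argmax(logits)]), (x + y) % n)  # 26 26
```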
flumsi@reddit
How does that even work when forecasters are worse than random chance?
Isinlor@reddit
I'm not worse than random chance. If I say that something will happen with 90% chance, it tends to be true approximately 90% of the time:
https://www.metaculus.com/accounts/profile/103304/?mode=track_record
In general, I'm quite a lot better than guessing 50/50 on yes/no questions.
vqrs@reddit
But an LLM isn't a simple neural network learning an algorithm that ends up being deterministic.
Isinlor@reddit
LLMs are 100% deterministic. The only intentional source of randomness is decoding: whatever method is used to sample from the softmax distribution over tokens that the LLM produces.
Therefore you can study the algorithms implemented inside LLM weights, e.g.:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-addition
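A toy sketch of that split, with made-up logits over a four-token vocabulary: the forward pass that produced the logits is deterministic, and randomness only enters at the sampling step, if at all.

```python
import numpy as np

def decode(logits, temperature=0.0, rng=None):
    """One decoding step over a vector of logits; the logits themselves come
    from a deterministic forward pass."""
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy: same logits -> same token, every time
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))  # sampled: can differ between runs

logits = np.array([1.2, 3.4, 0.7, 2.9])  # made-up logits
print(decode(logits))                    # always 1
print(decode(logits, temperature=0.8))   # may vary
```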
Beneficial-Ad-104@reddit
What? In what universe could you ever say models getting a gold in the IMO are not intelligent?
JimroidZeus@reddit
Because they’re not. They are networks of billions of weighted parameters. There is no “thinking”, there is no “intelligence”. They are only statistically likely outputs based on the input.
Beneficial-Ad-104@reddit
So? Why couldn't that be intelligence? Human brains also have a random stochastic element; that's beside the point. You have to argue that either solving the IMO doesn't require intelligence, or they somehow cheated.
What objective benchmark would you use for intelligence then? Or is it just some wishy-washy philosophical definition that categorically excludes all intelligence that is not human-based?
JimroidZeus@reddit
A human can consider possible outcomes from several different scenarios and decide on the best one. An LLM cannot do that. LLMs do not reason, they only provide you with the most statistically optimal response based on input.
They also cannot create anything net new. They can only regurgitate things they’ve seen before. They cannot combine things they’ve seen before to create something new.
I see another commenter has called you out on not providing a definition of intelligence. Care to provide one? I think I’ve touched on what I think a definition of intelligence means.
Hdmoney@reddit
I don't think that's entirely true. At least, it "feels like" they can create something new, given some initial ideas for guidance. But this could also be how humans work, afaict. We stand on the shoulders of giants.
JimroidZeus@reddit
It is entirely true.
They are only able to generate things they’ve been trained on.
They are not creating any new ideas.
Hdmoney@reddit
When I create a new noun-noun company name, is that creating something new or reconstituting training data? What about general relativity, whose ideas are based upon the work of all those before? Ideas of time, fields, gravity, and quantum theory, and all of the mathematics it's defined with.
Is general relativity something "new", or is it a "reconstitution" of existing ideas? What is the difference between "new" and "reconstitution"? I mean, I've certainly gotten seemingly "new" sentences from an LLM - it's not wholesale copying training material. I'd wager that "new" and "reconstitution" are one and the same.
JimroidZeus@reddit
Go argue straw men somewhere else.
The technology does not work as you’re describing. I’ve explained it to you as best I can.
If you don’t agree, fine, but you’re wrong.
Hdmoney@reddit
You've backed your argument with nothing but vibes, but I'll be specific.
Abstraction and generalization are what you’re trying to describe, and LLMs demonstrably do both. Backprop and gradient descent over vast hypothesis spaces learn functions that map patterns to meanings beyond the examples seen in training.
That’s why you can invent a brand-new word, define it once, and the model will use it correctly in fresh contexts. That’s generalization - a testable, reproducible behavior, not a vibe.
And if you want evidence that this extends past word games: when a system, under proctoring, cycles between an LLM and a formal verifier to reason its way through unseen Olympiad math problems - as DeepMind's AlphaProof + AlphaGeometry 2 and Gemini "Deep Think" have - that’s not memorization, that’s adaptive reasoning.
If that process doesn’t qualify as at least a form of intelligence, you need a better definition.
JimroidZeus@reddit
You’re missing the point entirely.
While the outputs of the LLM(s) working in conjunction appear to be "reasoning", they fundamentally are not.
It is literally just the statistically most probable output(s) based on the original input after it has been bounced back and forth a few times.
The “vibes” you’re referring to are actually real world experience in the field trying to use LLMs to solve real world problems. Also consuming dozens of white papers on the subject.
Current AI are powerful tools and they can do some incredibly surprising things, but they are not intelligent.
Hdmoney@reddit
No, I understand your point and I'm acknowledging that not all intelligence === reasoning - that's the bit you're missing. LLMs have useful intelligence which is not in the form of "reasoning", because it hasn't been trained to reason. They aren't generally intelligent, we all know that. It's not human intelligence, but it is something.
These models aren't just building "word prediction". If you look into the model, you can find actual semantics, and abstraction built on top of those semantics. Reasoning requires operating on the model's semantics, which we don't do in 2025. The text input we feed them kind of proxies that, but it's not ideal.
I'm writing a whitepaper on exactly that, for a bit over a month now. I'd love to share in a few months once it's more fleshed out.
JimroidZeus@reddit
I disagree that it’s intelligence. To me it is fundamentally a network of statistical weights and that’s it. Statistically generated output. Just for clarity, you’re quoting another commenter in your block quote, not me.
I’m not even necessarily specifically mean to imply “word prediction” models, but I guess it depends on what’s happening underneath. I was referring to the off the shelf options that the general population would be using.
I agree with what you said about reasoning requiring semantic operations, but depending on the specific mechanism, I don’t know that I’d agree that counts as “intelligence” either. 🤷
I think we just disagree and that’s okay!
JimroidZeus@reddit
The Oxford Languages dictionary defines intelligence as
‘the ability to acquire and apply knowledge and skills’
Since an LLM has neither “knowledge” that can be applied, nor “skills” that it can actually learn, I would say no, LLMs do not have intelligence.
cmsj@reddit
You can’t play the definitional high ground card unless you have a definition of intelligence which is also not wishy washy.
Beneficial-Ad-104@reddit
Well, a better definition is: can we construct the most difficult benchmark for the AI, the one on which it gets the lowest score, and then compare how an average human does on it? Once we can't even construct a benchmark the AI doesn't ace, that's a sign that we have a superhuman intelligence; in the meantime, the comparison between the human and AI scores gives us a measure of intelligence relative to humans.
cmsj@reddit
I would say your scenario is just a sign that we have models that can predict the answer to all test questions better than an average person. I don’t see any particular need to apply the label “intelligence” to that ability. I have a hammer that can hit any nail better than my fists can, but I don’t apply any labels to that, it’s a hammer. LLMs are prediction engines.
Drogzar@reddit
As per company request, we are evaluating the use of Gemini Pro for coding... so far, it has NOT ONCE been able to provide a solution to a problem whose answer I didn't already know (I test it by asking things I know and things I don't know, to evaluate the answers).
The likelihood of it just hallucinating functions that don't exist approaches 100% the moment your question is not a cookie-cutter thing.
It is also completely unable to explain why something would work and will 100% make up shit. On Friday it tried to convince me that just calling "WaitOnKeyRelease" (Unreal's GAS method) would automatically work if I drop it in my code, because Unreal's GAS magically knows which key was pressed at the time of an Ability activation... without any need to map the key to the ability or anything...
roland303@reddit
The word "model" inherently suggests that it is impossible to truly replicate the system being modeled; otherwise they would call them brains, or people, not models.
grauenwolf@reddit (OP)
But that's what they want. The AI investors are literally hoping to create slaves for themselves so they no longer have to pay us humans.
avanasear@reddit
this universe. have you ever actually attempted to use an LLM to assist in creating a programmatic solution? I have, and all it spat out was irrelevant boilerplate that looked good to people who didn't know what they were doing
pdabaker@reddit
I have, and it was quite useful, but it takes some practice to get used to the scope/detail needed to use them well.
currentscurrents@reddit
Uh, yes?
I feel like I'm taking crazy pills on reddit; every software developer I know in real life is using ChatGPT every day. Myself included.
Buttleston@reddit
For me it's the opposite, almost everyone I know, including myself, tried it and abandoned it except for very trivial one-offs. Like, I needed to dump the environment for all the pods in a k8s cluster so I asked claude to do it, it was close, and I went from there. But for real work? Not a chance.
I suspect that like in most things, people self select their peer group, or are selected by it. When I was 20, I thought there were not many 40yo programmers because I barely ever saw any. When I was 40, all my peers were also 40 and there were not many 20 year olds. People tend to hire people who are like them, whether on purpose or not
currentscurrents@reddit
Could be.
In general, younger people are faster to adopt new technologies, so I would not be shocked if adoption is lower among 40-something programmers.
Buttleston@reddit
It goes both ways though. All my co-workers are extremely experienced and talented. They have finely honed bullshit detectors from decades of "this new thing solves major programming problems!". And they work on things of significant complexity, and can tell the difference between something that is well architected and something that is not.
And also, can generally get stuff done faster - they have either done it before, or something like it, or they have tools they've already written that make it trivial, or they know a way to find something that will do it very quickly. That makes having an LLM do it less attractive, because it might not actually be faster even if it works.
And also, I have personally seen how badly AI shits the bed, and how much people don't catch it. I am not even saying those people CAN'T catch it, but they don't. I guess my operating theory is, if you're lazy or inexperienced enough that you'd rather have AI do it, you're also too lazy or inexperienced to do a good job evaluating the output. I very regularly deny PRs that are clearly AI generated because they're nonsense, and I have found egregious stuff in PRs that other people "reviewed"
avanasear@reddit
if you're recreating existing solutions that have been done time and time again then sure, I'm glad it works for you. it cannot do much more than that
Beneficial-Ad-104@reddit
Well these models have been acing programming competitions, which are generally significantly harder than many typical real life programming problems. Can they do everything? No, we don’t have AGI yet, but they can do very useful things already
cmsj@reddit
Programming competition wins are cool, but they're ultimately a demonstration of how well trained and tuned a model is to predict the answer to a programming competition question.
If what you do for a living involves answering programming competition questions, then it’s happy days for you (or sad days if you are now replaceable by an AI).
I happen to work in the enterprise software industry where we have to build and maintain tremendously complicated systems that run critical real-world infrastructure. AI can help with that, but would I replace a human with it? Not even close.
Beneficial-Ad-104@reddit
I wouldn’t replace those jobs fully either for such tasks right now as it has problems with hallucinations and long term planning, even though it’s incredibly useful. But just cause there are things it can’t do doesn’t mean I wouldn’t call it intelligent.
cmsj@reddit
I think a lot of the naming here is marketing fluff. Just call them code prediction machines and an appropriate level of scrutiny is obviously necessary. Using the word “intelligence” just doesn’t seem necessary.
avanasear@reddit
which competitions?
Beneficial-Ad-104@reddit
https://venturebeat.com/ai/google-and-openais-coding-wins-at-university-competition-show-enterprise-ai
cmsj@reddit
It very much depends on the type of solution you need. If it’s some pretty vanilla bash/python/javascript/typescript that’s implementing something that’s been implemented thousands of times before, it’s fine.
In the last year I’ve used it for those things and it was ok, but I’ve also tried to use it while working on some bleeding-edge Swift/SwiftUI code that is stretching brand new language/framework features to their limits, and all of the LLMs were entirely useless - to the point that they were generating code that wouldn’t even compile.
Xemorr@reddit
The claim being made here is one of an unsolved philosophical question. We haven't decided whether https://en.wikipedia.org/wiki/Chinese_room is truly intelligent or not.
You can't claim whether an LLM is intelligent or not, because we have not solved that scenario.
juanluisback@reddit
You can push back against people declaring it's intelligent.
Xemorr@reddit
His reasoning when he pushes back is wrong though. His claim assumes consciousness/intelligence CANNOT be emergent from simpler operations. You can only make the argument that you cannot claim either way.
Zwets@reddit
We could try to solve it now by applying 18th century unnecessarily cruel methods for animal sapience testing to the Chinese Room.
You insert a phrase in Chinese and receive the room's default response. You then insert the same phrase accompanied by jabbing whatever is inside the room with a stick, to induce the room's distress response, if it has one. This step may need to be repeated a dozen times or so before moving on.
Then you wait several days for the room to return to normal routines, before inserting the same phrase again, (twice to be certain); and the result should confirm if the room contains a machine responding to stimuli in a purely mechanical (non-sapient) fashion with no adaptation/self-preservation, or if the room contains a sapient person with a book and thus is capable of anticipating pain.
I do note that evidence exists that when poked with significant amounts of bad press, the aggregate of "models+their operating company/foundation" have together demonstrated sapience and changed their responses to specific stimuli in future replies. I do not think the same applies when considering the LLM as only the model.
...though one could argue that allowing previous prompts to be cleared after waiting several days means it is now a different Chinese Room than you started with?
Syrupwizard@reddit
Oh buddy…
mr_birkenblatt@reddit
Title gore; I have no clue what you're trying to say
fripletister@reddit
And that's why Reddit users shouldn't be editorializing titles
flumsi@reddit
Sure, the title on Reddit is a bit messed up, but if you read the article past the first paragraph (I know, that's tough for Reddit) it's pretty clear what the article is about, and I find it actually extremely insightful and interesting.
briddums@reddit
It’s a shame they changed the original title.
The Reddit title made me go WTF?
The original title would’ve had me clicking through before I even read the comments.
sakri@reddit
I found it a good read. I was expecting the old " when a tech is sufficiently advanced it's indistinguishable from magic", but it catalogues and digs into the techniques and sleights of hand used by conmen and grifters since forever, and finds (at least for me) very interesting parallels.
audioen@reddit
I thought this article would have had something to do with programming. It seems to complain about the language chatbots use for the most part, which indeed usually just spits your own prompt back at you, lightly massaged.
However, let's forget about chat. You can watch an LLM develop a feature all on its own, and maybe that helps you figure out that this technology can actually do some stuff. It's not a trick when it produces plainly correct code or notices something you didn't in your codebase and fixes it for you.
Full-Spectral@reddit
Arguably the bulk of the great ideas of the last century were mostly driven by desire, which LLMs don't have, and by endless speculation and failure (and the ability to recognize when the goal wasn't achieved but something else of value was discovered incidentally).
I don't see how LLMs would be particularly good at any of those things. Like any software it can do massive amounts of attempts at something, but it wouldn't 'know' if it accidentally found something else of value since it has no idea what value is unless someone tells it. And if someone can tell it, then the idea isn't particularly novel.
chipstastegood@reddit
LLMs may not be intelligent but it seems as if they’ve long passed the Turing test.
FlagrantlyChill@reddit
You can make it fail a Turing test by asking it a basic math question.
met0xff@reddit
Then half of mankind fails the Turing test
grauenwolf@reddit (OP)
You missed the point. Ask it a question that is easy for a computer but hard for a person to do in their head.
SkoomaDentist@reddit
And ask it a question that will result in the kind of nonsense / wrong answer that no human would ever give, even if they also can't answer correctly.
Thormidable@reddit
LLMs merely proved what a low bar the Turing test was.
Plazmaz1@reddit
I mean only sometimes. I think it's more about us failing than it is llms passing.
flumsi@reddit
Which is pretty meaningless. The first chatbots already passed the Turing test. The Turing test is based on a subject's beliefs about what a robot or computer can or can't do. If a computer does something the subject doesn't believe it should be able to do, they think it's a real human.