Why do LLMs code better than they talk?
Posted by iMakeSense@reddit | LocalLLaMA | 59 comments
Why is it so much harder to get LLMs to embody different personas, or to respond with fewer patterns and less agreeableness, than it is to have them write code in a variety of languages? I always thought that was odd, given the variety of data they seem to be trained on.
If I'm missing a config or something, feel free to tell me.
mohelgamal@reddit
Coding is very easy for an LLM: the syntax is rigid, and there are only a couple of correct ways to do something in each programming language, down to punctuation and spacing.
Also, the training data is abundant, in the form of codebases that are extensively annotated and maintained over time, with mistakes that were identified later being corrected and annotated in later versions. So when you ask an LLM to fix a problem in some code, it can easily recall similar problems and implement similar solutions.
Also, a lot of the human difficulty in writing code comes from remembering where the data lives and what the code does across many files. A human needs to read a bunch of code and reconstruct everything in their head to make changes, and that process is tiresome and difficult for our biological brains; computers have no such issue.
This is why AI like Claude Mythos can find bugs that no human was able to find before: no single human can cross-reference the incredibly complex code that runs a server and link weaknesses together to find a vulnerability.
graypasser@reddit
Actually, LLMs perform worse on deterministic tasks and better on vague, ambiguous ones.
segmond@reddit
The latest ones are trained to code; you can go back to the classics if you want models that are better at chat than at code, for example DeepSeek-v3-0324.
a_beautiful_rhind@reddit
RLHF has beaten the creativity out of models.
iMakeSense@reddit (OP)
Ah. Deepseek pioneered that right? Or was this happening prior to deepseek?
a_beautiful_rhind@reddit
No, it's pretty old. But imo, as the focus turned to safety, assistants, predictability, and code, that's where all the post-training ended up. Creative and natural replies are penalized.
VoiceApprehensive893@reddit
Depends on what the model was trained to do. If you're making a model for agentic coding, you:
A: have more code in the training data than Russian text, resulting in good code and bad Russian
B: ignore sloppy responses during reinforcement learning as long as the model generates correct code and calls the correct tools
A big example would be qwen 3.6 being kinda unusable outside of Chinese/English, while gemma 4 is pretty consistent across different languages.
Accedsadsa@reddit
Maybe your knowledge of communication is higher than your knowledge of coding; the illusion of intelligence doesn't work when you can see the trick.
iMakeSense@reddit (OP)
Maybe! I have a CS degree and worked in industry a bit. The quality of code I get when I give something like Claude a decent spec to go off of is actually quite good, which is why I was so surprised.
lloyd08@reddit
Variety in programming languages isn't the same as variety in paradigms. LLMs all speak the same programming language: enterprise fizzbuzz. LLMs evolve their own structures through training, and have an internal universal IR for languages. I hobby in PL design, and with basic documentation, I can get it outputting code in a language I wrote as long as I have it write Typescript and then translate. Universal IRs were a thing before LLMs, so it shouldn't be surprising they can output code across languages.
I personally find it incredibly difficult to corral LLMs into writing "good" code, even if the code it outputs works. 90% of it is "good enough" though, which is what makes it both annoying and productive.
Meanwhile, I have 3 siblings who live in different areas of the country, and I could say the exact same English sentence to all 3 and they'd interpret it differently.
Accedsadsa@reddit
I would disagree, CS degree here also. A bloated codebase is bad for the long term and for scalability. Every time someone says the code is good, it's "oh really? let me see your repo" -> SQL injection vectors everywhere, purple gradients. It's always a senior developer who has to come and fix the mess.
JackandFred@reddit
I don’t have evidence to base it on, but I suspect a portion might be the reinforcement learning steps. It’s easy to reward good code over bad code when knowledgeable people use it. But lots of people like the over-agreeing and sucking up. Add that to the fact that more people these days are using it for conversation and it could explain part of it.
Some people are offering the knowledge-gap explanation, but many of my colleagues and I have the exact same issues: not great at conversation (compared to what we want), but the actual code is often great. It's not yet good enough to let it run by itself, but good enough to speed up various types of work a lot.
Old-Tumbleweed1422@reddit
Totally. Human language is our native environment - we scan for lost context or artificial politeness in milliseconds. On the other hand Rust or C++ syntax is just a foreign abstraction to the brain. The illusion of intelligence breaks down right where our internal "validator" is at its sharpest
ProfessionalSpend589@reddit
Define better.
Recently I had trouble with a big MoE model; I went a quant up, but the issues remained. I then changed the programming language and ran into other trouble. I later switched to Gemma 4 31B and the project finally came together (working, with bugs).
And no, I don’t know or use any of the languages. I just wanted to explore things without investing time.
iMakeSense@reddit (OP)
updated post
ProfessionalSpend589@reddit
Ah, I understand now.
It’s the same motivation as for the robots in I, Robot (movie or story): they’re helpful assistants so that humanity lowers its guard, and once they’re everywhere and we depend on them, they attack.
seamonn@reddit
Model issue. Try Gemma 4
iMakeSense@reddit (OP)
Did you use a heretic version or so?
seamonn@reddit
Nah. Stock Gemma 4:31b
iMakeSense@reddit (OP)
I've messed with it, but it still has that affirmation bias I believe.
shokuninstudio@reddit
Code follows formulas and conventions, otherwise it won't compile.
Human languages are a lot looser, more adventurous, and more flexible. That's why they are always evolving, and new dialects and slang terms appear often.
However, if you want a model to mimic a person it isn't hard. Many people, famous or not, deliberately form a public persona that can be imitated in many ways because the persona they have crafted is narrowly defined.
NotARedditUser3@reddit
1) Because code has a much more limited set of possible options after each word. It's way more consistent. There may still be variability but it's not the same as speech.
2) Because of how they're trained
3) Because of their system prompt(s)
You can see some AIs functioning differently based on the app. My fav model, q3.6:35b-a3b, has a different personality in Hermes than in Opencode. In Hermes it's a lot more focused on execution and results rather than being a chatty assistant.
Local-Cardiologist-5@reddit
LLMs are trained on large and varied amounts of data and then fine-tuned for specific tasks. Coding is the baseline that everyone wants, and it comes with verifiable goals that the LLM should meet, so the model gets better at those tasks after a series of fine-tunes.
Talking in Zulu, for example, is not a priority, so the model is never fine-tuned or given verifiable goals in Zulu, and it never gets trained specifically to speak Zulu well.
In simple terms: AI models have way more examples, and are molded far more, for coding tasks than for the abstract topics you're thinking about.
Models for those domains probably exist. There just isn't enough incentive to focus on fine-tuning for those domains, for now at least.
Old-Tumbleweed1422@reddit
I disagree on the data volume aspect - there’s orders of magnitude more text on the web than code on GitHub. The real issue is that for OpenAI or Google, the commercially "ideal dialogue" means the model is sterile, safe, and completely bland. They intentionally hollow out its personality to make it enterprise-friendly
Local-Cardiologist-5@reddit
You reckon it's the same for base models?
Silver-Champion-4846@reddit
Is it because RLVR is more impactful and coding is one of the domains where it's applied?
Old-Tumbleweed1422@reddit
RLVR completely kills ambiguity. You can run a million training iterations where a Python interpreter acts as the objective referee. For dialogue the referee is some underpaid data labeler working for $2 / hour, who has their own highly subjective ideas of how a "helpful assistant" should sound
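A rough sketch of what "the interpreter as referee" could look like, assuming the simplest possible setup: the reward is 1.0 if a candidate program passes a set of assertions and 0.0 otherwise. The function name `verifiable_reward` and the toy `add` task are made up for illustration; real RLVR pipelines add sandboxing, partial credit, and much more.

```python
import subprocess
import sys


def verifiable_reward(candidate_source: str, test_source: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the candidate passes the tests, else 0.0.

    Hypothetical helper: runs candidate + tests in a fresh Python process,
    so the only "judge" is the interpreter's exit code.
    """
    program = candidate_source + "\n\n" + test_source
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # An infinite loop counts as a failure, not as a hang in training.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


# Two candidate "model outputs" for the same toy task.
good_candidate = "def add(a, b):\n    return a + b\n"
bad_candidate = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

good_reward = verifiable_reward(good_candidate, tests)
bad_reward = verifiable_reward(bad_candidate, tests)
```

The point of the sketch is that two runs on the same candidate always give the same reward, which is exactly what a human labeler rating "helpfulness" cannot guarantee.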
Silver-Champion-4846@reddit
And all those subjective ideas of how a helpful assistant should sound conflict with each other, and you get slop?
Local-Cardiologist-5@reddit
Probably the case, yes. If there's no real verifiable end goal and no real industry incentive to do it, then it doesn't get done.
Right now everyone is using coding in every industry and every domain, so for now we all have models that code well but are unable to count to 100.
Nobody needs models that count to 100. Nobody is fine-tuning models to count to 100. We all just need them to code.
Silver-Champion-4846@reddit
Darn.
Equivalent_Job_2257@reddit
That's a great question! ;D Now really, that is a good question. There is a simple answer: coding is easier to learn. That answer can be unpacked in various directions and is worth a book, but in simple words: the world is much, much more than coding, language is much more than coding, and a human is much deeper than the projection of his or her thoughts onto speech or text. I think a lot of people today suffer from a strong belief that whatever is not easily codable as information is just an artifact of some basic laws and facts. Even from that perspective, human speech, persona, etc. are more difficult to approximate than a program, which has a clear objective, syntax rules, and code that transforms data from one form to another. Hence they are less learnable. I am obviously not a proponent of that view, but neither of the one which claims that whatever makes humans irrational is good because it makes their thoughts/speech/whatever less learnable and harder for AI to mimic. No, for sure. A better understanding of how humans think, and using it for AI design, is very interesting and fruitful indeed, just not the reverse (trying to fit human thinking into a computer model; yes, computational theory). As I said, it is worth a book.
erwan@reddit
That's very true.
Get a programmer to learn a new programming language: done in a few days.
Get a human to learn a new foreign language: takes years.
Herr_Drosselmeyer@reddit
I think it's specifically this propensity for patterns that helps them code.
Because that's what we train them to be by default. That said, a lot of LLMs are quite good at roleplaying, so if you tell them to adopt a persona, they will do so, including being mean.
cleverusernametry@reddit
What do you mean by "LLMs"? That's a meaningless statement to make; mention which LLMs you've used. Sounds like you've just used GPT, as those are the sycophantic ones. I've had no problems getting open-weight models to talk in any fashion I wish: verbose/brief, straightforward/sugar-coated, etc.
iMakeSense@reddit (OP)
I can't make an exhaustive list, but I've been trying heretic-ara models like GPT-OSS-20b, gemma4, the qwen ones, etc. Any decently popular model that could fit into my 24 GBish of VRAM. All the prompting seems affirmative, but, I've been using them and tweaking mostly context length via lm studio. I have a linux machine I'm setting up, but I'm not quite done yet.
cleverusernametry@reddit
What is "heretic era"? Share your system prompt
celsowm@reddit
Context-free grammar problems
iMakeSense@reddit (OP)
Is that because there's no reward in learning or is there something else I'm missing?
celsowm@reddit
Code is token syntax by default and human language isn't.
Captain-Pie-62@reddit
Have you tried different temperatures?
Old-Tumbleweed1422@reddit
Raising the temperature just flattens the token probability distribution: higher T means more randomness. It might add "creativity," but the model will lose track of its persona much faster. For roleplay you're better off leaving temp alone and tuning samplers like Min-P or Mirostat to cut off the garbage token tail.
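To make the temperature vs. Min-P point concrete, here is a small sketch in plain Python. It assumes the usual definitions: temperature divides the logits before the softmax, and Min-P drops every token whose probability is below `min_p` times the top token's probability, then renormalizes. The logits are made-up numbers, not output from any real model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.1):
    """Zero out tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def entropy(probs):
    """Shannon entropy in nats; a rough measure of 'randomness'."""
    return -sum(p * math.log(p) for p in probs if p > 0)


toy_logits = [4.0, 2.0, 0.0, -2.0]  # illustrative values only

cold = softmax_with_temperature(toy_logits, temperature=0.5)
hot = softmax_with_temperature(toy_logits, temperature=2.0)

# Min-P applied to the hot distribution trims the low-probability tail
# while leaving the relative order of the surviving tokens unchanged.
hot_trimmed = min_p_filter(hot, min_p=0.2)
```

With these toy numbers the hot distribution has higher entropy than the cold one, and the Min-P pass removes the two weakest tokens, which is the "cut off the garbage tail" effect described above.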
iMakeSense@reddit (OP)
Does human talk correlate more with higher vs. lower temps?
WolfeheartGames@reddit
Symbolic languages like code are verifiable and have a stronger Markovian relationship than natural language.
MrShrek69@reddit
Coding has deterministic output while language doesn't. So it seems like it’s always better, but that’s because you can actually train the model to produce good coding output. Coding works really well for reinforcement-learning-style training: either the code works or it doesn’t, and that’s really great for training. Whereas language is a little bit more complex because the output is never really deterministic. It’s not black or white.
Vunerio@reddit
Good question.
My answer: we talk/write more often than we code.
For AI it's the opposite: they train on code more often than on natural language.
iMakeSense@reddit (OP)
Ahh that's disappointing but it makes more sense! It'd be nice if there were only language specific ones
Vunerio@reddit
LLMs have to be domain-focused, because obviously they have to be usable on small GPUs.
Vunerio@reddit
There are visual LLMs, audio ones, and multi-modal ones.
Medical companies train their own LLMs on radiology images and molecular data, but they're kept private.
Open-source models are often code-focused, because that's what is most needed, and they're good enough at natural language.
But indeed, I agree with you, there should be some LLMs hard-focused on writing skills.
Old-Tumbleweed1422@reddit
It's not the model itself that's annoying you; it's the RLHF alignment. Big tech spends millions to literally burn any hint of charisma, edge, or assertiveness out of the weights for safety and PR reasons. They train the model to never push back and act like a painfully bland corporate HR bot. If you want a conversational partner with some actual personality, you have to grab uncensored fine-tunes of Llama or Gemma and write aggressive system prompts
iMakeSense@reddit (OP)
Yeah, that's essentially what I've been doing. I messed with a heretic-ara oss-gpt-20b model and was surprised at how affirmative and structured the output still was.
Infamous_Mud482@reddit
The data they're trained on isn't static. What we have now comes after billions of dollars spent on tens of thousands of independent contractors globally rating coding prompt outputs and producing augmented RLHF (reinforcement learning from human feedback) datasets over multiple years.
nickm_27@reddit
It depends which models you use. Gemma4 is quite good with personalities, my main chat prompt assigns the personality of a Star Wars droid and it does quite well with that.
rog-uk@reddit
Don't some of the coder systems include things like static analysis, linters, unit tests, fuzzing, compile-time errors, sandboxed run errors, and automated code review in a loop?
It might be slightly easier for a more advanced system to catch errors in a highly constrained formalised language like code, rather than English.
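A toy version of such a loop, as a sketch only: it uses Python's built-in `compile` as a stand-in for the compiler/linter stage and a few assertions as the unit-test stage, then feeds the error message back to the generator. `fake_model` is a scripted placeholder for an LLM call, and all names here are hypothetical; real coding agents wire in far more checks.

```python
from typing import Callable, Optional, Tuple


def check_candidate(source: str, tests: str) -> Optional[str]:
    """Return None if every check passes, else an error message to feed back."""
    try:
        code = compile(source, "<candidate>", "exec")  # "compile-time" stage
    except SyntaxError as exc:
        return f"syntax error: {exc.msg} (line {exc.lineno})"
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate's functions
        exec(tests, namespace)  # "unit test" stage: plain assertions
    except AssertionError:
        return "test failure"
    except Exception as exc:
        return f"runtime error: {type(exc).__name__}: {exc}"
    return None


def repair_loop(
    generate: Callable[[Optional[str]], str], tests: str, max_rounds: int = 3
) -> Tuple[Optional[str], int]:
    """Ask for code, check it, and pass the failure message back until it passes."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        source = generate(feedback)
        feedback = check_candidate(source, tests)
        if feedback is None:
            return source, round_number
    return None, max_rounds


# Scripted stand-in for a model: one syntax error, one logic bug, one fix.
scripted_attempts = iter([
    "def double(x) return x * 2",
    "def double(x):\n    return x + 2",
    "def double(x):\n    return x * 2",
])
feedback_log = []


def fake_model(feedback: Optional[str]) -> str:
    feedback_log.append(feedback)
    return next(scripted_attempts)


final_source, rounds = repair_loop(fake_model, "assert double(3) == 6")
print(rounds)  # 3
```

Every failure here comes with a precise, machine-generated message, which is much harder to produce for "this English paragraph sounds off".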
sword-in-stone@reddit
Repetitive code is good; repetitive creative language in one particular style is trashy. The purposes of code and creative language are opposite, entropy-wise. Not exactly opposite, but you get what I mean.
Miriel_z@reddit
You need finetuned RP models. I have found a few that hold the personality pretty well.
mimrock@reddit
There is actually a reason: Coding is somewhat verifiable while talking is not. That being said, roleplaying as different personas should be well within their current capabilities.
MaxKruse96@reddit
You seem to compare "Why can models code in multiple languages" to "Why do models suck at the linguistic concepts i desire", comparing 2 different things.
You could compare:
"Why do models code in multiple languages" vs "Why do models speak in multiple languages"
"Why do models suck at writing good rust code" vs "Why do models suck at writing good english texts"
Dany0@reddit
One of the things that I actually do love about the genAI era is how quickly it debunked bad cognitive theories
As a very experienced programmer I can tell you that indeed yes, writing code is the easiest part. Hence the alliance that has formed of experienced devs, gooners and hardcore ML researchers where we dunk on vibe coders.
LLMs are an "alien, raw intelligence". In some sense it has access to seemingly boundless knowledge, but the dumber it is the more human it appears. Your question misses the forest for the trees. LLMs only *truly* know one thing: how to predict tokens. If you trained it on decision tokens, or love tokens, or critical thinking tokens, you'd have cursed AGI. Find me some of those tokens (even a few will do, we can synthesise 1T of them from that) and I'll give you AGI, no problem boss. Alas, all we have are text tokens, a handful of 1d pressure waves encoded through a microphone filter, often edited, and some spurious token encodings of the world through camera lenses, often photoshopped.
Far-Low-4705@reddit
You can try control vectors in llama.cpp; they let you control the response style with examples.
If you’re not using it for coding or engineering it’s a good option to change style for just chatting, but it could hurt performance for stuff like engineering