Why do LLMs code better than they talk?

[-]

mohelgamal@reddit

Coding is very easy for LLMs: the syntax is rigid, and there are only a couple of correct ways to do something in each programming language, down to punctuation and spacing.

Also, the training data is abundant in the form of code bases that are extensively annotated and maintained over time, with mistakes that are identified later being corrected and annotated in later versions. So when you ask an LLM to fix a problem in some code, it can easily draw on similar problems and implement similar solutions.

Also, a lot of human difficulty in writing code comes from remembering where the data lives and what the code does across many files. A human needs to read a bunch of code and reconstruct everything in their brain to make changes, and this process is tiresome and difficult for our biological brains. Computers have no such issue.

This is why an AI like Claude Mythos can find bugs that no human was able to find before: no single human can cross-reference the incredibly complex code that runs a server and link weaknesses together to find a vulnerability.

[-]

graypasser@reddit

Actually, LLMs perform worse on deterministic tasks and better on vague, ambiguous tasks.

[-]

segmond@reddit

The latest ones are trained to code, you can go back to the classics if you want models that are great at chat over code. For example, DeepSeek-v3-0324

[-]

a_beautiful_rhind@reddit

RLHF has beaten the creativity out of models.

[-]

iMakeSense@reddit (OP)

Ah. Deepseek pioneered that right? Or was this happening prior to deepseek?

[-]

a_beautiful_rhind@reddit

No it's pretty old. But imo, as the focus turned to safety, assistants, predictability, code.. that's where all post training ended up. Creative and natural replies are penalized.

[-]

VoiceApprehensive893@reddit

Depends on what the model was trained to do. If you're making a model for agentic coding:

A: have more code in the training data than Russian text, resulting in good code and bad Russian

B: ignore sloppy responses with reinforcement learning as long as the model generates correct code and calls the correct tools

A big example would be qwen 3.6 being kinda unusable outside of Chinese/English, while gemma 4 is pretty consistent across different languages.

[-]

Accedsadsa@reddit

Maybe your knowledge of communication is higher than your knowledge of coding. The illusion of intelligence doesn't work when you can see the trick.

[-]

iMakeSense@reddit (OP)

Maybe! I have a CS degree and worked in industry a bit. I'm actually quite impressed by the quality of code I get when I give something like Claude a decent spec to go off of, which is why I was so surprised.

[-]

lloyd08@reddit

Variety in programming languages isn't the same as variety in paradigms. LLMs all speak the same programming language: enterprise fizzbuzz. LLMs evolve their own structures through training, and have an internal universal IR for languages. I hobby in PL design, and with basic documentation, I can get it outputting code in a language I wrote as long as I have it write Typescript and then translate. Universal IRs were a thing before LLMs, so it shouldn't be surprising they can output code across languages.

I personally find it incredibly difficult to corral LLMs into writing "good" code, even if the code it outputs works. 90% of it is "good enough" though, which is what makes it both annoying and productive.

Meanwhile, I have 3 siblings who live in different areas of the country, and I could say the exact same English sentence to all 3 and they'd interpret it differently.

[-]

Accedsadsa@reddit

I would disagree (CS degree here also). A bloated codebase is bad for the long term and for scalability. Every time someone says "ohh, the code is good"? Oh really, let me see your repo -> SQL injection vectors everywhere, purple gradients. It's always a senior developer who has to come and fix the mess.

[-]

JackandFred@reddit

I don’t have evidence to base it on, but I suspect a portion might be the reinforcement learning steps. It’s easy to reward good code over bad code when knowledgeable people use it. But lots of people like the over-agreeing and sucking up. Add that to the fact that more people these days are using it for conversation, and it could explain part of it.

Some people are offering the knowledge-gap explanation, but many of my colleagues and I have the exact same issues. Not great at conversation (compared to what we want), but the actual code is often great. Not yet good enough to let it go by itself, but good enough to speed up various types of work a lot.

[-]

Old-Tumbleweed1422@reddit

Totally. Human language is our native environment - we scan for lost context or artificial politeness in milliseconds. On the other hand Rust or C++ syntax is just a foreign abstraction to the brain. The illusion of intelligence breaks down right where our internal "validator" is at its sharpest

[-]

ProfessionalSpend589@reddit

Define better.

Recently I had trouble with a big MoE model, went a quant up, but the issues still remained. I then changed the programming language and some other trouble came up. I later switched to Gemma 4 31B and the project finally came to be (working, with bugs).

And no, I don’t know or use any of the languages. I just wanted to explore things without investing time.

[-]

iMakeSense@reddit (OP)

updated post

[-]

ProfessionalSpend589@reddit

Ah, I understand now.

It’s the same motivation as for the robots in I, Robot (movie or story): they’re helpful assistants so that humanity lowers its guard, and when they’re everywhere and we depend on them, they attack.

[-]

seamonn@reddit

Model issue. Try Gemma 4

[-]

iMakeSense@reddit (OP)

Did you use a heretic version or so?

[-]

seamonn@reddit

Nah. Stock Gemma 4:31b

[-]

iMakeSense@reddit (OP)

I've messed with it, but it still has that affirmation bias I believe.

[-]

shokuninstudio@reddit

Code follows formulas and conventions otherwise it won't compile.

Human languages are a lot looser, more adventurous and flexible. That's why they are always evolving, and new dialects and slang terms appear often.

However, if you want a model to mimic a person it isn't hard. Many people, famous or not, deliberately form a public persona that can be imitated in many ways because the persona they have crafted is narrowly defined.

[-]

NotARedditUser3@reddit

1) Because code has a much more limited set of possible options after each word. It's way more consistent. There may still be variability but it's not the same as speech.
2) Because of how they're trained
3) Because of their system prompt(s)
You can see some AIs functioning differently based on the app. My fav model q3.6:35b-a3b has a different personality in Hermes than in Opencode. In Hermes it's a lot more focused on execution and results rather than being a chatty assistant.

[-]

Local-Cardiologist-5@reddit

LLMs are trained on varying amounts of data and then fine-tuned for specific tasks. For coding, which is the base that everyone wants, they have set verifiable goals that the LLM should meet, so it is better at those tasks after a series of fine-tunes.

Talking in Zulu, for example, is not the priority, so the model is never fine-tuned with verifiable goals in the Zulu language that would train it precisely on speaking Zulu.

In simple terms: AI models have way more examples of, and are molded more for, coding tasks than the abstract topics you're thinking about.

Models for those domains probably exist. There's just not enough incentive to focus on fine-tuning for those domains, for now at least.

[-]

Old-Tumbleweed1422@reddit

I disagree on the data volume aspect - there’s orders of magnitude more text on the web than code on GitHub. The real issue is that for OpenAI or Google, the commercially "ideal dialogue" means the model is sterile, safe, and completely bland. They intentionally hollow out its personality to make it enterprise-friendly

[-]

Local-Cardiologist-5@reddit

You reckon it's the same for base models?

[-]

Silver-Champion-4846@reddit

Is it because RLVR is more impactful and coding is one of the domains where it's applied?

[-]

Old-Tumbleweed1422@reddit

RLVR completely kills ambiguity. You can run a million training iterations where a Python interpreter acts as the objective referee. For dialogue the referee is some underpaid data labeler working for $2 / hour, who has their own highly subjective ideas of how a "helpful assistant" should sound
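
To make the "interpreter as referee" idea concrete, here is a minimal sketch of what a verifiable reward could look like. Everything in it is an assumption for illustration: the function name `verifiable_reward`, the convention that the model must define a function called `solution`, and the toy test cases. Real RLVR pipelines run candidates in a proper sandbox; plain `exec` is used here only to keep the sketch short.

```python
def verifiable_reward(candidate_source, test_cases, func_name="solution"):
    """Score model-written code by the fraction of test cases it passes.

    Hypothetical reward function for illustration only. exec() on
    untrusted code is unsafe; a real setup would isolate this step.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # SyntaxError lands here too
    except Exception:
        return 0.0  # code that does not even load earns nothing

    func = namespace.get(func_name)
    if not callable(func) or not test_cases:
        return 0.0

    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one input counts as a failed case
    return passed / len(test_cases)


# Three made-up "model outputs" for the task "add two numbers".
good = "def solution(a, b):\n    return a + b\n"
bad = "def solution(a, b):\n    return a - b\n"
broken = "def solution(a, b) return a + b"

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

for name, source in [("good", good), ("bad", bad), ("broken", broken)]:
    print(name, round(verifiable_reward(source, tests), 3))
```

The point of the sketch is that the score needs no human judgment: the same candidate always gets the same number, which is what a dialogue reward from human raters cannot guarantee.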

[-]

Silver-Champion-4846@reddit

And all those subjective ideas on how a helpful assistant should sound conflict with each other and you get slop?

[-]

Local-Cardiologist-5@reddit

Probably the case yes. If there's no real verifiable end goal, and no real incentive industry wise to do it then it's not done.

Right now everyone is using coding in every industry and every domain, so for now we all have models that code well but are unable to count to 100.

Nobody needs models that count to 100. Nobody is fine tuning models to count to 100. We all just need them to code

[-]

Silver-Champion-4846@reddit

Darn.

[-]

Equivalent_Job_2257@reddit

That's a great question! ;D Now really, that's a question. There is a simple answer: coding is easier to learn. This answer can be unpacked in various directions. It is worth a book, but in simple words: the world is much, much more than coding, language is much more than coding, and a human is much deeper than the text projection of his or her thoughts onto speech.

I think a lot of people today suffer from a strong belief that whatever is not easily codable as information is merely an artifact of some basic laws and facts. Even from this perspective, human speech, persona, etc. are more difficult to approximate than a program, which has a clear objective, syntax rules, and code that processes data from one form to another. Hence they are less learnable.

I am obviously not a proponent of this latter approach, but neither of the one which claims that whatever makes humans irrational is good because it makes their thoughts/speech/whatever less learnable and harder for AI to mimic. No, for sure. A better understanding of how humans think, and using it for AI design, is very interesting and fruitful indeed, just not the reverse (trying to fit human thinking into a computer model; yes, computational theory). As I said, it is worth a book.

[-]

erwan@reddit

That's very true.

Get a programmer to learn a new programming language: done in a few days.

Get a human to learn a new foreign language: takes years.

[-]

Herr_Drosselmeyer@reddit

Why's it so hard to get LLMs to embody different personas or respond in a way with less patterns or agree-ability than it is to have them write code in a variety of languages?

I think it's specifically this propensity for patterns that helps them code.

Why do they always sound like willing assistants

Because that's what we train them to be by default. That said, a lot of LLMs are quite good at roleplaying, so if you tell them to adopt a persona, they will do so, including being mean.

[-]

cleverusernametry@reddit

What do you mean "LLMs"? Meaningless statement to make - mention which LLMs you've used. Sounds like you've just used GPT, as those are the sycophantic ones. I've had no problems getting open weight models to talk in any fashion I wish - verbose/brief, straightforward/sugar coated etc.

[-]

iMakeSense@reddit (OP)

I can't make an exhaustive list, but I've been trying heretic-ara models like GPT-OSS-20b, gemma4, the qwen ones, etc. Any decently popular model that could fit into my 24 GBish of VRAM. All the prompting seems affirmative, but I've been using them and tweaking mostly context length via LM Studio. I have a Linux machine I'm setting up, but I'm not quite done yet.

[-]

cleverusernametry@reddit

What is "heretic era"? Share your system prompt

[-]

celsowm@reddit

Context-free grammar problems

[-]

iMakeSense@reddit (OP)

Is that because there's no reward in learning or is there something else I'm missing?

[-]

celsowm@reddit

Code is token syntax by default and human language is not

[-]

Captain-Pie-62@reddit

Have you tried different temperatures?

[-]

Old-Tumbleweed1422@reddit

Temperature just flattens the token probability distribution. Higher T means more randomness. It might add "creativity," but the model will lose track of its persona much faster. For roleplay you're better off leaving temp alone and tuning samplers like Min-P or Mirostat to cut off the garbage token tail
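
A small sketch of the two knobs mentioned above, using made-up logits for five tokens. The function names are my own, and the Min-P rule shown (keep tokens whose probability is at least `min_p` times the top token's probability, then renormalize) is the commonly described form; actual inference engines may order temperature and samplers differently, so treat this as an illustration rather than any engine's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p=0.1):
    """Drop tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def entropy_bits(probs):
    """Shannon entropy in bits; a rough measure of randomness."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical logits for five candidate tokens (illustrative only).
logits = [4.0, 2.0, 1.0, 0.0, -1.0]

cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
trimmed = min_p_filter(hot, min_p=0.2)

print("T=0.5 entropy:", round(entropy_bits(cold), 3))
print("T=2.0 entropy:", round(entropy_bits(hot), 3))
print("tokens kept after Min-P:", sum(1 for p in trimmed if p > 0))
```

Raising the temperature increases the entropy of the whole distribution, including the unlikely tail, while Min-P removes that tail relative to the top token and leaves the plausible candidates in play.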

[-]

iMakeSense@reddit (OP)

Does human talk correlate more with higher vs. lower temps?

[-]

WolfeheartGames@reddit

Symbolic systems like code are verifiable and have a stronger Markovian relationship than natural language.

[-]

MrShrek69@reddit

Coding has deterministic output while language doesn't. So it seems like it’s always better, but that’s because you can actually train the model to produce good coding output. Coding works really well for reinforcement-learning-style training: either the code works or it doesn’t, and that’s really great for training. Whereas language is a little bit more complex because the output is never really deterministic. It’s not black or white.

[-]

Vunerio@reddit

Good question.

My answer: we talk/write more often than we code.

For AI it's the opposite: they train on code more often than on natural language.

[-]

iMakeSense@reddit (OP)

Ahh that's disappointing but it makes more sense! It'd be nice if there were only language specific ones

[-]

Vunerio@reddit

LLMs have to be domain-focused, because obviously they need to be usable on small GPUs

[-]

Vunerio@reddit

There are visual LLMs, audio, or multi-modal.

Medical companies train their own LLMs on radiology images and molecular data, but they're kept private.

Open-source models are often code-focused, because that's what is most needed, and they're good enough at natural language.

But indeed, I agree with you, there should be some LLMs hard-focused on writing skills.

[-]

Old-Tumbleweed1422@reddit

It's not the model itself that's annoying you; it's the RLHF alignment. Big tech spends millions to literally burn any hint of charisma, edge, or assertiveness out of the weights for safety and PR reasons. They train the model to never push back and act like a painfully bland corporate HR bot. If you want a conversational partner with some actual personality, you have to grab uncensored fine-tunes of Llama or Gemma and write aggressive system prompts

[-]

iMakeSense@reddit (OP)

Yeah, that's essentially what I've been doing. I messed with a heretic-ara oss-gpt-20b model and was surprised at how affirmative and structured the output still was.

[-]

Infamous_Mud482@reddit

the data they're trained on isn't static, what we have now is after billions of dollars spent on tens of thousands of independent contractors globally rating coding prompt outputs and producing augmented RLHF (reinforcement learning from human feedback) datasets over multiple years

[-]

nickm_27@reddit

It depends which models you use. Gemma4 is quite good with personalities, my main chat prompt assigns the personality of a Star Wars droid and it does quite well with that.

[-]

rog-uk@reddit

Don't some of the coder systems include things like static analysis, linters, unit checking, fuzzing, compile-time errors, sandboxed run errors, and automated code review in a loop?

It might be slightly easier for a more advanced system to catch errors in a highly constrained formalised language like code, rather than English.
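
As a rough illustration of that kind of loop, here is a toy sketch with three checks: a syntax check, one lint-style rule, and a unit test. The check names, the rule chosen, and the hard-coded list of "attempts" standing in for successive model outputs are all assumptions made for the example; a real system would call a model and use real tools instead.

```python
import ast


def check_syntax(source):
    """Stand-in for a compile step: does the code parse at all?"""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"syntax error on line {err.lineno}: {err.msg}"
    return None


def check_no_bare_except(source):
    """Toy lint rule: flag a bare 'except:' clause."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return f"lint: bare 'except:' on line {node.lineno}"
    return None


def check_unit_test(source):
    """Toy unit test: the code must define add() with add(2, 3) == 5.

    exec() is unsafe for untrusted code; real systems use a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as err:
        return f"unit test failed: {type(err).__name__}: {err}"
    return None


def review(source):
    """Run all checks and collect feedback messages for the next attempt."""
    problem = check_syntax(source)
    if problem:
        return [problem]  # later checks need code that parses
    feedback = []
    for check in (check_no_bare_except, check_unit_test):
        problem = check(source)
        if problem:
            feedback.append(problem)
    return feedback


# Hard-coded stand-ins for successive model outputs (hypothetical).
attempts = [
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]

for number, source in enumerate(attempts, start=1):
    feedback = review(source)
    print(f"attempt {number}:", feedback or "all checks passed")
    if not feedback:
        break
```

Each failed check produces a specific message that can be fed back into the next attempt, which is the kind of signal that has no obvious equivalent for free-form English.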

[-]

sword-in-stone@reddit

Repetitive code is good; repetitive language in creative writing, stuck in one particular style, is trashy. The purposes of code and creative language are opposite, entropy-wise. Not exactly opposite, but you get what I mean.

[-]

Miriel_z@reddit

You need finetuned RP models. I have found a few that hold the personality pretty well.

[-]

mimrock@reddit

There is actually a reason: Coding is somewhat verifiable while talking is not. That being said, roleplaying as different personas should be well within their current capabilities.

[-]

MaxKruse96@reddit

You seem to compare "Why can models code in multiple languages" to "Why do models suck at the linguistic concepts I desire", which compares 2 different things.

You could compare:
"Why do models code in multiple languages" vs "Why do models speak in multiple languages"
"Why do models suck at writing good Rust code" vs "Why do models suck at writing good English texts"

[-]

Dany0@reddit

One of the things that I actually do love about the genAI era is how quickly it debunked bad cognitive theories

As a very experienced programmer I can tell you that indeed yes, writing code is the easiest part. Hence the alliance that has formed of experienced devs, gooners and hardcore ML researchers where we dunk on vibe coders.

LLMs are an "alien, raw intelligence". In some sense it has access to seemingly boundless knowledge, but the dumber it is the more human it appears. Your question misses the forest for the trees. LLMs only *truly* know one thing: how to predict tokens. If you trained it on decision tokens, or love tokens, or critical thinking tokens, you'd have cursed AGI. Find me some of those tokens (even a few will do, we can synthesise 1T of them from that) and I'll give you AGI, no problem boss. Alas, all we have are text tokens, a handful of 1D pressure waves encoded through a microphone filter (often edited), and some spurious token encodings of the world through camera lenses (often photoshopped).

[-]

Far-Low-4705@reddit

You can try control vectors in llama.cpp, which let you control the response style with examples.

If you’re not using it for coding or engineering it’s a good option to change style for just chatting, but it could hurt performance for stuff like engineering