Something weird is happening with LLMs and chess
Posted by paranoidray@reddit | LocalLLaMA | View on Reddit | 80 comments
paranoidray@reddit (OP)
A year ago, there was a lot of talk about large language models (LLMs) playing chess. Word was that if you trained a big enough model on enough text, then you could send it a partially played game, ask it to predict the next move, and it would play at the level of an advanced amateur.
This seemed important. These are “language” models, after all, designed to predict language.
Now, modern LLMs are trained on a sizeable fraction of all the text ever created. This surely includes many chess games. But they weren’t designed to be good at chess. And the games that are available are just lists of moves. Yet people found that LLMs could play all the way through to the end game, with never-before-seen boards.
Did the language models build up some kind of internal representation of board state?
abhuva79@reddit
I would say yes. As far as I understand this, the base idea is that these LLMs develop a kind of world model during training. It's not the same idea as AlphaGo Zero (which used Monte Carlo tree search coupled with a policy network; those systems definitely play at superhuman strength now), but it makes sense that they get better at predicting because there is some kind of internal representation of the world model of chess, so to speak.
The really hard part now is actually proving and understanding this (interpretability).
Late-Passion2011@reddit
LLMs do not develop world models. https://arxiv.org/pdf/2406.03689
Whotea@reddit
LLMs have an internal world model that can predict game board states: https://arxiv.org/abs/2210.13382
>We investigate this question in a synthetic setting by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network. By leveraging these intervention techniques, we produce “latent saliency maps” that help explain predictions
More proof: https://arxiv.org/pdf/2403.15498.pdf
Even more proof by Max Tegmark (renowned MIT professor): https://arxiv.org/abs/2310.02207
Given enough data all models will converge to a perfect world model: https://arxiv.org/abs/2405.07987
Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
Researchers find LLMs create relationships between concepts without explicit training, forming lobes that automatically categorize and group similar ideas together: https://arxiv.org/pdf/2410.19750 (NotebookLM explanation: https://notebooklm.google.com/notebook/58d3c781-fce3-4e5d-8a06-6acadfa87e7e/audio)
LLMs develop their own understanding of reality as their language abilities improve: https://news.mit.edu/2024/llms-develop-own-understanding-of-reality-as-language-abilities-improve-0814
In controlled experiments, MIT CSAIL researchers discover simulations of reality developing deep within LLMs, indicating an understanding of language beyond simple mimicry. After training on over 1 million random puzzles, they found that the model spontaneously developed its own conception of the underlying simulation, despite never being exposed to this reality during training. Such findings call into question our intuitions about what types of information are necessary for learning linguistic meaning, and whether LLMs may someday understand language at a deeper level than they do today. “At the start of these experiments, the language model generated random instructions that didn’t work. By the time we completed training, our language model generated correct instructions at a rate of 92.4 percent,” says MIT electrical engineering and computer science (EECS) PhD student and CSAIL affiliate Charles Jin
Late-Passion2011@reddit
That’s the point of the paper I posted: they give that illusion, but as soon as they encounter anything not explicitly in the training data, they fail, because there is no world model there, just the illusion of one.
When you give them basic tasks that they’ve been trained on, but in forms they have less data on (i.e., arithmetic that is not base 10), their performance plummets. They don’t understand arithmetic (which is what a world model implies) but have memorized enough data to create that illusion: https://arxiv.org/abs/2307.02477
Whotea@reddit
read through section 2
Late-Passion2011@reddit
I'm not reading a google doc. I'll read the studies on it. But I suppose we will know within a year either way whether these models do somehow develop a way of reasoning or not.
Whotea@reddit
The google doc contains links to studies
They already did. It’s called o1
Late-Passion2011@reddit
There is no reasoning in o1. Please, refer back to the studies.
Or simply use ChatGPT and give it a scenario that is simple and unlikely to be in its training data. It fails miserably, same as every other model, because there is no reasoning going on; it's a nice illusion. And I'm not saying they don't have a place in the economy - they likely do - but the number of tasks they can really handle in a real business setting well enough to justify their lack of accuracy is pretty low, and it's going to get much smaller if interest rates drop.
Whotea@reddit
So what’s all this in section 2?
Informal_Warning_703@reddit
This is like being shocked that an LLM can write a song, even though they weren’t designed to be good at song writing. Chess can be communicated via language. LLMs are designed to be good at predicting language. Any task that can be modeled in language, like chess, can be modeled by an LLM.
Really? You actually know that all language descriptions of chess are just lists of moves? Doubt it.
Except you don’t know what data exactly was or wasn’t in the training. First because the amount of data is so vast and we don’t have good tools for browsing every piece of data. Second because the companies are not (or are no longer) sharing the relevant information.
There’s a limited number of chess board states. In fact it’s so limited that we have been able to manually implement good chess engines for a long time (these ingredients would almost certainly also be in the training data). If we can’t verify exactly what was in the training data, we can’t verify that we’ve presented a novel state.
Whotea@reddit
A CS professor taught GPT 3.5 (which is way worse than GPT 4 and its variants) to play chess with a 1750 Elo: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
“gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://github.com/adamkarvonen/chess_gpt_eval It can beat Stockfish 2 in the vast majority of games and even win against Stockfish 9.
Google trained grandmaster-level chess (2895 Elo) without search in a 270 million parameter transformer model with a training dataset of 10 million chess games: https://arxiv.org/abs/2402.04494 In the paper, they present results for model sizes 9M (internal bot tournament Elo 2007), 136M (Elo 2224), and 270M, all trained on the same dataset. Which is to say, data efficiency scales with model size.
Impossible to do this through training without generalizing as there are AT LEAST 10^120 possible game states in chess: https://en.wikipedia.org/wiki/Shannon_number
There are only 10^80 atoms in the universe: https://www.thoughtco.com/number-of-atoms-in-the-universe-603795
Othello can play games with boards and game states that it had never seen before: https://www.egaroucid.nyanyan.dev/en/
Informal_Warning_703@reddit
This is a really great pseudo response… because it doesn’t address anything I said and gives irrelevant stats. For example, you cite the game-tree complexity, which isn’t what I referred to (board states, which by the way is about 10^50 for legal states and the distribution for play is going to be far smaller).
You say “Impossible to do this through training without generalizing.” Of course I never argued that LLMs don’t generalize. So try again….
Whotea@reddit
So how is it able to play when the training data does not contain anywhere close to 10^50 games? FYI even if it contained 10^49 games, that’s only 10% of every possible state.
Informal_Warning_703@reddit
There are slightly more possible character states for the English language (if we include common punctuation). LLMs are doing each the same way. You need to explain why you find one so unbelievable and not the other, especially given that you don't know what the distribution of data in the training was.
Whotea@reddit
Because there are patterns in language. You can’t pattern match to win a chess game
Informal_Warning_703@reddit
Assertion with no evidence. There are certainly patterns to how humans play chess; this is why some are recognized as playing unconventional moves.
Whotea@reddit
So how does it know which move to play next that will get it closer to winning when the game board is in a state it hasn’t seen before?
Informal_Warning_703@reddit
I already told you: same way they predict language when presented with sentences they’ve presumably never seen before. You haven’t shown why we need a different explanation.
You’ve tailored a ridiculous premise (that chess can’t be pattern matched) to arrive at the conclusion you’re trying to reach (that LLMs aren’t doing pattern matching for some number of tasks).
Whotea@reddit
Which pattern did it match this from? https://ai.stackexchange.com/questions/39310/what-is-the-significance-of-move-37-to-a-non-go-player
Informal_Warning_703@reddit
Are you actually so dense that you don’t realize your last 3 responses have all been substantively the same and, therefore, you aren’t somehow escaping the points I already made? Or is this just desperation at having the appearance of something to say in response?
AlphaGo isn’t the same architecture as an LLM, nor would it work for that sort of task, since language is an open-ended domain where there’s no definable policy or value network (in the sense used by architectures like AlphaGo, which are designed for a very narrow task with definable rules and goals) that an AI can use to self-evaluate.
When we are talking about the distribution of data for a deep neural network plus reinforcement learning architecture like AlphaGo, it isn’t simply set by its supervised learning stage; it also includes its Monte Carlo tree search strategy. That’s not something the transformer architecture of an LLM can do. Nor is it transferable to any other domain that doesn’t have the same clearly defined policy network and value network. (Meaning, they didn’t just take AlphaGo, tell it “Hey, now focus on protein structures!”, and rename it AlphaFold.) So finding a working move for a game that is classified as novel given its supervised learning stage is not at all analogous to what you are trying to make it out to be with LLMs and chess. There doesn’t need to be an ontologically significant understanding of Go to apply MCTS and find a novel winning move.
And you can’t dodge the fact that if you don’t know the training data of an LLM, then you have no basis to claim some board state does or does not fit within its distribution. Sad that you’re like a one-trick pony who’s put all his eggs in the “But what about chess!?” argument.
Whotea@reddit
You don’t have to know the training data to know that they didn’t train on anywhere close to 10^50 board states lol.
Informal_Warning_703@reddit
They wouldn’t have had to, you ignoramus. That’s why being in distribution vs out of distribution matters.
Whotea@reddit
It can do that. See section 2.
acutelychronicpanic@reddit
Why are people still debating if they have internal representations?
EstarriolOfTheEast@reddit
I'm not sure this summary accurately emphasizes the blog post's central puzzle. The blog post seems to be asking why only gpt-3.5-turbo-instruct is decent at chess while other models, including future ones from OpenAI, are not.
Here's a table from the article:
willdepue from OpenAI has a response to this:
https://x.com/willdepue/status/1857510504723525995
He goes over four hypotheses:
GPT-3.5-instruct was trained on more chess games: this appears correct, but for OpenAI models in general. According to their superalignment paper, OpenAI models are likely trained on more chess content than other models.
Essentially, extracting good LLM chess performance seems to be most impacted by input formatting and the quantity of chess data the model was trained on, not model architecture or instruction-tuning effects. Their chess performance is very sensitive to formatting (even small changes like spaces in PGN notation are harmful). He also states that fine-tuning on just 100 examples is enough to recover chess ability in the other OpenAI models.
That still leaves unclear the question of why only gpt-3.5-turbo-instruct of OpenAI's models responds appropriately to PGN inputs by default.
Pojiku@reddit
I'd speculate that this more accurately correlates with the shift to heavily filtered or synthetic data.
We still use the meme that LLMs are trained on "all text on the Internet" but that's not exactly true when accounting for the more rigorous data processing pipelines that may filter out content like move-by-move logs of chess games.
EstarriolOfTheEast@reddit
The reason it can't be that is that the OpenAI researcher states all their models are strong (for LLMs) at chess. He specifically mentions their paper on GPT-4, where they call out having trained on more chess data than other LLM creators. This means the inclusion of chess data is a deliberate choice, not an incidental one that would be filtered out by chance changes in pre-processing pipelines.
Kep0a@reddit
To me, isn't the obvious answer that GPT3.5 happened to be trained on lots of chess? There's no reason an LLM can't be good at chess if it's been trained on a lot of sequential chess moves. Training on a massive corpus of entire chess games seems like a waste so maybe it was culled.
EstarriolOfTheEast@reddit
Indeed, but the issue is the OpenAI researcher states all their models are trained on lots of chess. What's happening with the chat models is that the PGN string is followed by chat turn completion tokens. This is enough to completely throw them off. Since we have more control with open models, we can test this, with one major possible confounder: open models are not trained for chess the way OpenAI's are.
EstarriolOfTheEast@reddit
I suppose it's still worth a try. I think the way I'd approach it is to keep the game one response long. Tell the model it is to predict what chess move is next. We prefill or open the chat response, take the model's predictions and the opponent's turns, and append them to the game string in place. The model responds and closes the response, which we undo each turn. Smaller ones might need prepended examples of what they should do.
It's possible fine-tuning might be needed if the model is so sensitive that all preceding non-PGN-related tokens count as format interference.
Alternatively, just do not use a chat template. Or try a base model.
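For what it's worth, here's a minimal sketch of that prefill idea using Hugging Face transformers. The model name, instruction text, and opening moves are placeholders, not anything from the post:

```python
# Sketch of the "prefill the chat response with the PGN" idea.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any open chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "Continue this chess game. Reply only with the game moves in PGN."},
]
# Render the chat template up to the start of the assistant turn...
prefix = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ...then "open" the assistant response with the game so far, so the model's next
# tokens continue the PGN string instead of starting a fresh chat reply.
pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5"
inputs = tok(prefix + pgn_so_far, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# Parse the predicted move, play the opponent's reply, append both to pgn_so_far,
# and repeat; the model's end-of-turn token is stripped ("undone") each round.
```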
Internet--Traveller@reddit
There are a lot of books on chess - if it's trained with them, it will know all the openings and will be proficient in at least the first few moves. There is chess notation on strategy and endgames, but it will require a lot of examples for it to "pretend" that it can play chess.
Whotea@reddit
It’s impossible for anyone or anything to play chess without generalizing and understanding the rules.
freecodeio@reddit
Good article but all those words, and not a simple ELO comparison?
ResidentPositive4122@reddit
I've seen estimates anywhere from 1100 to 1800 "true" ELO. So 1300-2100 chessdotcom and 1500-2300 lichess? Maybe.
Whotea@reddit
A CS professor taught GPT 3.5 to play chess with a 1750 Elo: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
“gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://github.com/adamkarvonen/chess_gpt_eval
It can beat Stockfish 2 in the vast majority of games and even win against Stockfish 9.
Google trained grandmaster-level chess (2895 Elo) without search in a 270 million parameter transformer model with a training dataset of 10 million chess games: https://arxiv.org/abs/2402.04494 In the paper, they present results for model sizes 9M (internal bot tournament Elo 2007), 136M (Elo 2224), and 270M, all trained on the same dataset. Which is to say, data efficiency scales with model size.
Impossible to do this through training without generalizing as there are AT LEAST 10^120 possible game states in chess: https://en.wikipedia.org/wiki/Shannon_number
There are only 10^80 atoms in the universe: https://www.thoughtco.com/number-of-atoms-in-the-universe-603795
Othello can play games with boards and game states that it had never seen before: https://www.egaroucid.nyanyan.dev/en/
ozzie123@reddit
I'll share an even more interesting tidbit: the LLM will outperform the Elo of its training data, simply because sometimes even average players make great moves, and the LLM learned to make more of those moves.
Orolol@reddit
Yes, you should read this article about someone training an LLM to play Othello; while analysing the weights during inference, they realised that the LLM somewhat keeps a board state representation:
https://www.neelnanda.io/mechanistic-interpretability/othello
AloHiWhat@reddit
Yes, they do build internal states, but we don't know what they are, because we don't build them ourselves.
YetiTrix@reddit
No, words are linked by associations. The link between two words carries a lot of meaning and is multi-dimensional. It's hard to put into words, but the intelligence is in the connections between words, not just the words themselves.
Nuckyduck@reddit
Yes.
When we train LoRAs, we use dimensions; I use 64x16 or whatever. Basically, if the matrix of moves doesn't overlap the matrix of learning, LLMs of any N dimensions will learn the game.
StarFox122@reddit
Just wanted to say I really enjoyed your writing style.
Good article as well. The sample size being small for the GPT models makes me hesitate to draw conclusions about GPT-3.5-turbo-instruct, but it certainly suggests there’s something interesting there worth a closer look.
Thomas-Lore@reddit
It needs Claude 3.5 tested too. And maybe the original gpt-4. And Opus. But those models are very expensive.
ElectroSpore@reddit
Yes, I was listening to an interview with the Anthropic Claude team, and they were discussing that, even with just single-modal text models, the model develops representations of abstract concepts in its neural net, just like what happens in nature.
Depending on the model's complexity, it could be sentence level, paragraph level, or concept level. I.e., the model has a concept of the language; if you give it text in different languages, it can have concepts of places, and that is how it returns valid info if you ask it general questions like "Where is a nice place to go on vacation?" or "What place is similar to X?"
This often gets discussed as a vector space representation. Some models can build 3D layouts from an image, i.e., they understand both the representation of objects and where they are in space JUST from a 2D image.
Tobias783@reddit
This is the beauty of emergence
dizzydizzyd@reddit
Very interesting write up! I do wonder if your results point towards tokenization/embedding exploration changes between models. It's unlikely the training material changed _that_ much between different models (since it's quite time consuming to curate) - so I wonder if the issue with the LLM playing well is finding the right way to tokenize board state such that you can tickle the appropriate areas of the overall network.
ekcrisp@reddit
Chess ability would be a fun LLM benchmark
s101c@reddit
When a capable chess LLM emerges, /r/LocalLlama users will be like:
https://media0.giphy.com/media/L6EoLS78pcBag/giphy.gif
qrios@reddit
I'd be hesitant to conclude anything without a more carefully designed prompt. A couple of problems jump out:
A better prompt would be something like
And then have the language model predict however many tokens of the record are required before you see " 6."
This way, the language model is clear about the direction the game is supposed to go in, and the prompt more closely matches contexts in which it would have seen game records that go in given directions. As opposed to your current prompt, where the only hint is the [Result] part, and the text preceding it is extremely unnatural within the context of anything but instruction tuning datasets specifically for chess LLMs.
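The post's actual prompt isn't quoted above, but a hedged sketch of the kind of prompt being suggested (PGN-style headers including the [Result] tag, the moves so far, and generation cut off before " 6.") might look roughly like this; the player names and opening are purely illustrative:

```python
# Illustrative only: a PGN-shaped prompt with a [Result] header, completed by the
# legacy completions endpoint and cut off before move 6.
from openai import OpenAI

client = OpenAI()
prompt = (
    '[White "Garry Kasparov"]\n'   # illustrative header values
    '[Black "Random Amateur"]\n'
    '[Result "1-0"]\n'
    '\n'
    '1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O'
)
resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=16,
    temperature=0,
    stop=[" 6."],  # predict tokens of the record until " 6." would appear
)
print(resp.choices[0].text)
```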
chimera271@reddit
If the 3.5 instruct model is great and the base 3.5 is terrible, then it sounds like instruct is calling out to something that can solve for chess moves.
KallistiTMP@reddit
What I'd be very curious to see - does the behavior stay the same when you add a bunch of few shot examples, with the bot as the winning player?
Also, what about MoE models? Does Mixtral do any better?
Another thought, with the Gemma models I wonder if you could find any interesting features there using Gemma Scope.
ekcrisp@reddit
How did you account for illegal moves? I refuse to believe this never happened
Whotea@reddit
A CS professor taught GPT 3.5 (which is way worse than GPT 4 and its variants) to play chess with a 1750 Elo: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
“gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://github.com/adamkarvonen/chess_gpt_eval It can beat Stockfish 2 in the vast majority of games and even win against Stockfish 9.
Google trained grandmaster-level chess (2895 Elo) without search in a 270 million parameter transformer model with a training dataset of 10 million chess games: https://arxiv.org/abs/2402.04494 In the paper, they present results for model sizes 9M (internal bot tournament Elo 2007), 136M (Elo 2224), and 270M, all trained on the same dataset. Which is to say, data efficiency scales with model size.
Cool_Abbreviations_9@reddit
I don't think it's weird at all; memorization can look like planning given a humongous amount of data.
Whotea@reddit
Impossible to do this through training without generalizing as there are AT LEAST 10^120 possible game states in chess: https://en.wikipedia.org/wiki/Shannon_number
There are only 10^80 atoms in the universe: https://www.thoughtco.com/number-of-atoms-in-the-universe-603795
Othello can play games with boards and game states that it had never seen before: https://www.egaroucid.nyanyan.dev/en/
ReadyAndSalted@reddit
Read the post; the weirdness is not that they can play chess, it's that large, modern, "higher-benchmarking" models are all performing much, much worse than older, smaller, lower-benchmarking models.
ReadyAndSalted@reddit
I would be very interested to hear your explanation as to why modern models are performing much worse. Was it some switch in tokenisation strat or what?
paranoidray@reddit (OP)
We know that transformers trained specifically on chess games can be extremely good at chess. Maybe gpt-3.5-turbo-instruct happens to have been trained on a higher fraction of chess games, so it decided to dedicate a larger fraction of its parameters to chess.
That is, maybe LLMs sort of have little “chess subnetworks” hidden inside of them, but the size of the subnetworks depends on the fraction of data that was chess.
https://github.com/sgrvinod/chess-transformers
Whotea@reddit
Makes sense. Humans don’t know how to play chess until they train on it, so AI won’t be much different.
MoffKalast@reddit
Or more likely, 3.5-turbo received chess instruct tuning.
MoonGrog@reddit
I built a Python app to track piece placement and tried Claude, GPT-4, and Llama 3.2 13b, and none of them could handle the board state. They would try to make illegal moves; it was pretty terrible. I thought I could use this for measuring logic. I walked away.
Inevitable-Solid-936@reddit
Tried this myself with various local LLMs including Qwen and had the same results - ended up having to supply a list of legal moves. Playing a model against itself from a very small test seemed to always end in a stalemate.
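For anyone who wants to try this, here's a rough sketch of that kind of harness (assuming the python-chess library and a placeholder ask_llm() wrapper around whichever model is being tested) that tracks the board state and supplies the legal moves each turn:

```python
# Minimal harness sketch: python-chess tracks the board; the model only has to
# pick from the legal moves we hand it. ask_llm() is a placeholder.
import chess

def ask_llm(prompt: str) -> str:
    """Placeholder for whatever model call you use; should return a SAN move like 'Nf3'."""
    raise NotImplementedError

board = chess.Board()
while not board.is_game_over():
    legal = [board.san(m) for m in board.legal_moves]
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"Legal moves: {', '.join(legal)}\n"
        "Reply with exactly one of the legal moves."
    )
    move = ask_llm(prompt).strip()
    if move not in legal:
        move = legal[0]  # illegal or malformed reply: retry, forfeit, or fall back
    board.push_san(move)

print(board.result())
```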
pigeon57434@reddit
try o1
paranoidray@reddit (OP)
You specifically need to use gpt-3.5-turbo-instruct.
Nothing else works.
Sabin_Stargem@reddit
Hopefully the difference between GPT 3.5 Turbo Instruct and the other models will be figured out. Mistral Large 2 can't grasp how dice can modify a character, at least not when there are multiple levels to calculate.
MaasqueDelta@reddit
I don't think that's so weird at all. Notice one of the things he said:
The answer may be quite simple: quantization tends to give less accurate answers, making them worse than they should be. Maybe GPT-3.5-instruct is a full-quality (unquantized) model?
Either way, the fact that ONLY GPT 3.5-instruct gets good answers shows something odd is going on.
MrSomethingred@reddit
It reminds me of that paper where they trained GPT-2 to predict cellular automata behavior (I think), then built a LoRA trained on geometry puzzles, and found that the "intelligence" that emerged was highly transferable.
I'll find the actual paper later if someone else doesn't find it first.
adalgis231@reddit
GPT-3.5 being the best is the truly puzzling discovery. This needs an explanation.
ekcrisp@reddit
If you could give concrete reasons for why a particular model outputs something you would be solving AI's biggest problem
paranoidray@reddit (OP)
Only gpt-3.5-turbo-instruct!
gpt-3.5-turbo is as bad as the others.
jacek2023@reddit
Interesting, so the one and only model is a hidden AGI?
opi098514@reddit
So what’s interesting is there are a couple of models that are trained using chess notation. And those models are actually very good.
Healthy-Nebula-3603@reddit
...If you want an AI for chess, you can ask Qwen 32B to build such an AI... that takes literally a few seconds.
ResidentPositive4122@reddit
This is not interesting as a chess playing engine. This is interesting because it's mathematically impossible for the LLM to have "memorised" all the chess moves. So something else must be happening there. It's interesting because it can play at all, as in it will mostly output correct moves. So it "gets" chess. ELO and stuff don't matter that much, and obviously our top engines will smoke it. That's not the point.
UserXtheUnknown@reddit
There isn't a need to "memorize" all the chess moves. Memorizing all the games played by masters might be quite enough to give a decent "player" simulation against someone who plays following the same logic as a pro chess player (so, a program like Stockfish). Funnily enough, it would probably fail against some noob who plays very suboptimal moves, creating situations that never appear in the records.
Comprehensive-Pin667@reddit
That's not the point though
claythearc@reddit
I’ve been building a transformer-based chess bot for a while. My best guess on why newer stuff is performing worse is either tokenization changes (where it splits algebraic notation awkwardly) or chess being pruned from the training data since it’s not useful language data (whereas earlier models probably didn’t have as rigorous data pruning).
On the surface though a transformer seems reasonably well suited for chess, since moves can be pretty cleanly expressed as a token and the context of a game is really quite small. So there has to be something on the training / data / encoder side hurting it
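A quick, hedged way to poke at the tokenization guess is to look at how a PGN fragment actually splits into tokens, e.g. with tiktoken (the cl100k_base encoding here is an assumption; other model families use different tokenizers):

```python
# Inspect how a PGN fragment is split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
print([enc.decode([t]) for t in enc.encode(pgn)])
# If moves like "Nf3" get split across several tokens, or whitespace is merged
# differently between tokenizers, the same game string can look very different
# to different models, which would matter for move prediction.
```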
paranoidray@reddit (OP)
Check out: https://github.com/sgrvinod/chess-transformers
Down_The_Rabbithole@reddit
What kind of stupid clickbait title is this?
jacobpederson@reddit
So it's the tokenizer not understanding the moves on the other models? Come on, the suspense is killing us!