Something weird is happening with LLMs and chess
Posted by paranoidray@reddit | LocalLLaMA | View on Reddit | 80 comments
paranoidray@reddit (OP)
A year ago, there was a lot of talk about large language models (LLMs) playing chess. Word was that if you trained a big enough model on enough text, then you could send it a partially played game, ask it to predict the next move, and it would play at the level of an advanced amateur.
This seemed important. These are “language” models, after all, designed to predict language.
Now, modern LLMs are trained on a sizeable fraction of all the text ever created. This surely includes many chess games. But they weren’t designed to be good at chess. And the games that are available are just lists of moves. Yet people found that LLMs could play all the way through to the end game, with never-before-seen boards.
Did the language models build up some kind of internal representation of board state?
abhuva79@reddit
I would say yes. As far as I understand this, the base idea is that these LLMs develop a kind of world model during training. It's not the same idea as AlphaGo Zero (which used Monte Carlo tree search coupled with a policy network; those systems definitely play at superhuman strength now), but it makes sense that they get better at predicting because there is some kind of internal representation of the world model of chess, so to speak.
The really hard part now is actually proving and understanding this (interpretability).
Late-Passion2011@reddit
LLMs do not develop world models. https://arxiv.org/pdf/2406.03689
Whotea@reddit
LLMs have an internal world model that can predict game board states: https://arxiv.org/abs/2210.13382
>We investigate this question in a synthetic setting by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network. By leveraging these intervention techniques, we produce “latent saliency maps” that help explain predictions
More proof: https://arxiv.org/pdf/2403.15498.pdf
Even more proof by Max Tegmark (renowned MIT professor): https://arxiv.org/abs/2310.02207
Given enough data all models will converge to a perfect world model: https://arxiv.org/abs/2405.07987
Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
Researchers find LLMs create relationships between concepts without explicit training, forming lobes that automatically categorize and group similar ideas together: https://arxiv.org/pdf/2410.19750 (NotebookLM explanation: https://notebooklm.google.com/notebook/58d3c781-fce3-4e5d-8a06-6acadfa87e7e/audio)
LLMs develop their own understanding of reality as their language abilities improve: https://news.mit.edu/2024/llms-develop-own-understanding-of-reality-as-language-abilities-improve-0814
In controlled experiments, MIT CSAIL researchers discover simulations of reality developing deep within LLMs, indicating an understanding of language beyond simple mimicry. After training on over 1 million random puzzles, they found that the model spontaneously developed its own conception of the underlying simulation, despite never being exposed to this reality during training. Such findings call into question our intuitions about what types of information are necessary for learning linguistic meaning, and whether LLMs may someday understand language at a deeper level than they do today. “At the start of these experiments, the language model generated random instructions that didn’t work. By the time we completed training, our language model generated correct instructions at a rate of 92.4 percent,” says MIT electrical engineering and computer science (EECS) PhD student and CSAIL affiliate Charles Jin
Late-Passion2011@reddit
That’s the point of the paper I posted: they give that illusion, but as soon as they encounter anything not explicitly in the training data, they fail, because there is no world model there, just the illusion of one.
When you give them basic tasks that they’ve been trained on, but in forms they have less data on (i.e., arithmetic that is not base 10), their performance plummets. They don’t understand arithmetic (which is what a world model implies) but have memorized enough data to create that illusion: https://arxiv.org/abs/2307.02477
Whotea@reddit
read through section 2
Late-Passion2011@reddit
I'm not reading a google doc. I'll read the studies on it. But I suppose we will know within a year either way whether these models do somehow develop a way of reasoning or not.
Whotea@reddit
The google doc contains links to studies
They already did. It’s called o1
Late-Passion2011@reddit
There is no reasoning in o1. Please, refer back to the studies.
Or simply use ChatGPT and give it a scenario that is simple and unlikely to be in its training data. It fails miserably, same as every other model, because there is no reasoning going on; it's a nice illusion. And I'm not saying they don't have a place in the economy - they likely do - but the number of tasks they can really handle in a real business setting well enough to justify their lack of accuracy is pretty low, and it's going to get much smaller if interest rates drop.
Whotea@reddit
So what’s all this in section 2?
Informal_Warning_703@reddit
This is like being shocked that an LLM can write a song, even though they weren’t designed to be good at song writing. Chess can be communicated via language. LLMs are designed to be good at predicting language. Any task that can be modeled in language, like chess, can be modeled by an LLM.
Really? You actually know that all language descriptions of chess are just lists of moves? Doubt it.
Except you don’t know what data exactly was or wasn’t in the training. First because the amount of data is so vast and we don’t have good tools for browsing every piece of data. Second because the companies are not (or are no longer) sharing the relevant information.
There’s a limited number of chess board states. In fact it’s so limited that we have been able to manually implement good chess engines for a long time (these ingredients would almost certainly also be in the training data). If we can’t verify exactly what was in the training data, we can’t verify that we’ve presented a novel state.
Whotea@reddit
A CS professor taught GPT 3.5 (which is way worse than GPT 4 and its variants) to play chess with a 1750 Elo: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
“gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://github.com/adamkarvonen/chess_gpt_eval It can beat Stockfish 2 in the vast majority of games and even win against Stockfish 9.
Google trained grandmaster-level chess (2895 Elo) without search in a 270 million parameter transformer model with a training dataset of 10 million chess games: https://arxiv.org/abs/2402.04494 In the paper, they present results for model sizes 9M (internal bot tournament Elo 2007), 136M (Elo 2224), and 270M, all trained on the same dataset. Which is to say, data efficiency scales with model size.
Impossible to do this through training without generalizing as there are AT LEAST 10^120 possible game states in chess: https://en.wikipedia.org/wiki/Shannon_number
There are only 10^80 atoms in the universe: https://www.thoughtco.com/number-of-atoms-in-the-universe-603795
Othello can play games with boards and game states that it had never seen before: https://www.egaroucid.nyanyan.dev/en/
Informal_Warning_703@reddit
This is a really great pseudo response… because it doesn’t address anything I said and gives irrelevant stats. For example, you cite the game-tree complexity, which isn’t what I referred to (board states, which by the way is about 10^50 for legal states and the distribution for play is going to be far smaller).
You say “Impossible to do this through training without generalizing.” Of course I never argued that LLMs don’t generalize. So try again….
Whotea@reddit
So how is it able to play when the training data does not contain anywhere close to 10^50 games? FYI even if it contained 10^49 games, that’s only 10% of every possible state.
Informal_Warning_703@reddit
There are slightly more possible character states for the English language (if we include common punctuation). LLMs are doing each the same way. You need to explain why you find one so unbelievable and not the other, especially given that you don't know what the distribution of data in the training was.
Whotea@reddit
Because there are patterns in language. You can’t pattern match to win a chess game
Informal_Warning_703@reddit
Assertion with no evidence. There are certainly patterns to how humans play chess; this is why some are recognized as playing unconventional moves.
Whotea@reddit
So how does it know which move to play next that will get it closer to winning when the game board is in a state it hasn’t seen before?
Informal_Warning_703@reddit
I already told you: same way they predict language when presented with sentences they’ve presumably never seen before. You haven’t shown why we need a different explanation.
You’ve tailored a ridiculous premise (that chess can’t be pattern matched) to arrive at the conclusion you’re trying to reach (that LLMs aren’t doing pattern matching for some number of tasks).
Whotea@reddit
Which pattern did it match this from? https://ai.stackexchange.com/questions/39310/what-is-the-significance-of-move-37-to-a-non-go-player
Informal_Warning_703@reddit
Are you actually so dense that you don’t realize your last 3 responses have all been substantively the same and, therefore, you aren’t somehow escaping the points I already made? Or is this just desperation at having the appearance of something to say in response?
AlphaGo isn’t the same architecture as an LLM, nor would it work for that sort of task, since language is an open-ended domain where there’s no definable policy or value network (in the sense used by architectures like AlphaGo, which are designed for a very narrow task with definable rules and goals) that an AI can use to self-evaluate.
When we are talking about the distribution of data for a deep neural network plus reinforcement learning architecture like AlphaGo, it isn’t simply set by its supervised learning stage; it also includes its Monte Carlo tree search strategy. That’s not something the transformer architecture of an LLM can do. Nor is it transferable to any other domain that doesn’t have the same clearly defined policy network and value network. (Meaning, they didn’t just take AlphaGo, tell it “Hey, now focus on protein structures!”, and rename it AlphaFold.) So finding a working move for a game that is classified as novel given its supervised learning stage is not at all analogous to what you are trying to make it out to be with LLMs and chess. There doesn’t need to be an ontologically significant understanding of Go to apply MCTS and find a novel winning move.
And you can’t dodge the fact that if you don’t know the training data of an LLM, then you have no basis to claim some board state does or does not fit within its distribution. Sad that you’re like a one-trick pony who’s put all his eggs in the “But what about chess!?” argument.
Whotea@reddit
You don’t have to know the training data to know that they didn’t train on anywhere close to 10^50 board states lol.
Informal_Warning_703@reddit
They wouldn’t have had to, you ignoramus. That’s why being in distribution vs out of distribution matters.
Whotea@reddit
It can do that. See section 2.
acutelychronicpanic@reddit
Why are people still debating if they have internal representations?
EstarriolOfTheEast@reddit
I'm not sure this summary accurately emphasizes the blog post's central puzzle. The blog post seems to be asking why only gpt-3.5-turbo-instruct is decent at chess while other models, including future ones from OpenAI, are not.
Here's a table from the article:
willdepue from OpenAI has a response to this:
https://x.com/willdepue/status/1857510504723525995
He goes over four hypotheses:
GPT-3.5-instruct was trained on more chess games: this appears correct, but for OpenAI models in general. According to their superalignment paper, OpenAI models are likely trained on more chess content than other models.
Essentially, extracting good LLM chess performance seems to be most impacted by input formatting and the quantity of chess data the model was trained on, not model architecture or instruction-tuning effects. Their chess performance is very sensitive to formatting (even small changes like spaces in PGN notation are harmful). He also states that fine-tuning on just 100 examples is enough to recover chess ability in the other OpenAI models.
That still leaves unclear the question of why only gpt-3.5-turbo-instruct of OpenAI's models responds appropriately to PGN inputs by default.
Pojiku@reddit
I'd speculate that this more accurately correlates with the shift to heavily filtered or synthetic data.
We still use the meme that LLMs are trained on "all text on the Internet" but that's not exactly true when accounting for the more rigorous data processing pipelines that may filter out content like move-by-move logs of chess games.
EstarriolOfTheEast@reddit
The reason it can't be that is that the OpenAI researcher states all their models are strong (for LLMs) at chess. He specifically mentions their paper on GPT-4, where they call out having trained on more chess data than other LLM creators. This means the inclusion of chess data is a deliberate choice, not an incidental one that would be filtered out by chance changes in pre-processing pipelines.
Kep0a@reddit
To me, isn't the obvious answer that GPT3.5 happened to be trained on lots of chess? There's no reason an LLM can't be good at chess if it's been trained on a lot of sequential chess moves. Training on a massive corpus of entire chess games seems like a waste so maybe it was culled.
EstarriolOfTheEast@reddit
Indeed, but the issue is the OpenAI researcher states all their models are trained on lots of chess. What's happening with the chat models is that the PGN string is followed by chat turn completion tokens. This is enough to completely throw them off. Since we have more control with open models, we can test this, with one major possible confounder: open models are not trained for chess the way OpenAI's are.
EstarriolOfTheEast@reddit
I suppose it's still worth a try. I think the way I'd approach it is to keep the game one response long. Tell the model it is to predict what chess move is next. We prefill or open the chat response, take the model's predictions and the opponent's turns, and append them to the game string in place. The model responds and closes the response, which we undo each turn. Smaller ones might need prepended examples of what they should do.
It's possible fine-tuning might be needed if the model is so sensitive that all preceding non-PGN-related tokens count as format interference.
Alternatively, just do not use a chat template. Or try a base model.
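For what it's worth, here's a minimal sketch of that prefill idea using Hugging Face transformers. The model name, instruction text, and opening moves are placeholders, not anything from the post:

```python
# Sketch of the "prefill the chat response with the PGN" idea.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any open chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "Continue this chess game. Reply only with the game moves in PGN."},
]
# Render the chat template up to the start of the assistant turn...
prefix = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# ...then "open" the assistant response with the game so far, so the model's next
# tokens continue the PGN string instead of starting a fresh chat reply.
pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5"
inputs = tok(prefix + pgn_so_far, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# Parse the predicted move, play the opponent's reply, append both to pgn_so_far,
# and repeat; the model's end-of-turn token is stripped ("undone") each round.
```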
Internet--Traveller@reddit
There are a lot of books on chess - if it's trained with them, it will know all the openings and will be proficient in at least the first few moves. There is chess notation on strategy and endgames, but it will require a lot of examples for it to "pretend" that it can play chess.
Whotea@reddit
It’s impossible for anyone or anything to play chess without generalizing and understanding the rules.
freecodeio@reddit
Good article but all those words, and not a simple ELO comparison?
ResidentPositive4122@reddit
I've seen estimates anywhere from 1100 to 1800 "true" ELO. So 1300-2100 chessdotcom and 1500-2300 lichess? Maybe.
Whotea@reddit
A CS professor taught GPT 3.5 to play chess with a 1750 Elo: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
“gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://github.com/adamkarvonen/chess_gpt_eval
It can beat Stockfish 2 in the vast majority of games and even win against Stockfish 9.
Google trained grandmaster-level chess (2895 Elo) without search in a 270 million parameter transformer model with a training dataset of 10 million chess games: https://arxiv.org/abs/2402.04494 In the paper, they present results for model sizes 9M (internal bot tournament Elo 2007), 136M (Elo 2224), and 270M, all trained on the same dataset. Which is to say, data efficiency scales with model size.
Impossible to do this through training without generalizing as there are AT LEAST 10^120 possible game states in chess: https://en.wikipedia.org/wiki/Shannon_number
There are only 10^80 atoms in the universe: https://www.thoughtco.com/number-of-atoms-in-the-universe-603795
Othello can play games with boards and game states that it had never seen before: https://www.egaroucid.nyanyan.dev/en/
ozzie123@reddit
I'll share an even more interesting tidbit: the LLM will outperform the Elo of its training data, simply because sometimes even average players make great moves, and the LLM learned to make more of those moves.
Orolol@reddit
Yes, you should read this article about someone training an LLM to play Othello; while analysing the weights during inference, they realised that the LLM somewhat keeps a board state representation:
https://www.neelnanda.io/mechanistic-interpretability/othello
AloHiWhat@reddit
Yes, they do build internal states, but we don't know what they are, because we don't build them ourselves.
YetiTrix@reddit
No, words are linked by associations. The link between two words carries a lot of meaning and is multi-dimensional. It's hard to put into words, but the intelligence is in the connections between words, not just the words themselves.
Nuckyduck@reddit
Yes.
When we train LoRAs, we use dimensions; I use 64x16 or whatever. Basically, if the matrix of moves doesn't overlap the matrix of learning, LLMs of any N dimensions will learn the game.
StarFox122@reddit
Just wanted to say I really enjoyed your writing style.
Good article as well. The sample size being small for the GPT models makes me hesitate to draw conclusions about GPT-3.5-turbo-instruct, but it certainly suggests there’s something interesting there worth a closer look.
Thomas-Lore@reddit
It needs Claude 3.5 tested too. And maybe the original gpt-4. And Opus. But those models are very expensive.
ElectroSpore@reddit
Yes, I was listening to an interview with the Anthropic Claude team, and they were discussing that, even with just single-modal text models, the model develops representations of abstract concepts in its neural net, just like what happens in nature.
Depending on the model's complexity, it could be sentence level, paragraph level, or concept level. I.e., the model has a concept of the language; if you give it text in different languages, it can have concepts of places, and that is how it returns valid info if you ask it general questions like "Where is a nice place to go on vacation?" or "What place is similar to X?"
This often gets discussed as a vector space representation. Some models can build 3D layouts from an image, i.e., they understand both the representation of objects and where they are in space JUST from a 2D image.
Tobias783@reddit
This is the beauty of emergence
dizzydizzyd@reddit
Very interesting write up! I do wonder if your results point towards tokenization/embedding exploration changes between models. It's unlikely the training material changed _that_ much between different models (since it's quite time consuming to curate) - so I wonder if the issue with the LLM playing well is finding the right way to tokenize board state such that you can tickle the appropriate areas of the overall network.
ekcrisp@reddit
Chess ability would be a fun LLM benchmark
s101c@reddit
When a capable chess LLM emerges, /r/LocalLlama users will be like:
https://media0.giphy.com/media/L6EoLS78pcBag/giphy.gif
qrios@reddit
I'd be hesitant to conclude anything without a more carefully designed prompt. A couple of problems jump out:
A better prompt would be something like
And then have the language model predict however many tokens of the record are required before you see " 6."
This way, the language model is clear about the direction the game is supposed to go in, and the prompt more closely matches contexts in which it would have seen game records that go in given directions. As opposed to your current prompt, where the only hint is the [Result] part, and the text preceding it is extremely unnatural within the context of anything but instruction tuning datasets specifically for chess LLMs.
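The post's actual prompt isn't quoted above, but a hedged sketch of the kind of prompt being suggested (PGN-style headers including the [Result] tag, the moves so far, and generation cut off before " 6.") might look roughly like this; the player names and opening are purely illustrative:

```python
# Illustrative only: a PGN-shaped prompt with a [Result] header, completed by the
# legacy completions endpoint and cut off before move 6.
from openai import OpenAI

client = OpenAI()
prompt = (
    '[White "Garry Kasparov"]\n'   # illustrative header values
    '[Black "Random Amateur"]\n'
    '[Result "1-0"]\n'
    '\n'
    '1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O'
)
resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=16,
    temperature=0,
    stop=[" 6."],  # predict tokens of the record until " 6." would appear
)
print(resp.choices[0].text)
```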
chimera271@reddit
If the 3.5 instruct model is great and the base 3.5 is terrible, then it sounds like instruct is calling out to something that can solve for chess moves.
KallistiTMP@reddit
What I'd be very curious to see - does the behavior stay the same when you add a bunch of few shot examples, with the bot as the winning player?
Also, what about MoE models? Does Mixtral do any better?
Another thought, with the Gemma models I wonder if you could find any interesting features there using Gemma Scope.
ekcrisp@reddit
How did you account for illegal moves? I refuse to believe this never happened
Whotea@reddit
A CS professor taught GPT 3.5 (which is way worse than GPT 4 and its variants) to play chess with a 1750 Elo: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
“gpt-3.5-turbo-instruct can play chess at ~1800 ELO. I wrote some code and had it play 150 games against stockfish and 30 against gpt-4. It's very good! 99.7% of its 8000 moves were legal with the longest game going 147 moves.” https://github.com/adamkarvonen/chess_gpt_eval It can beat Stockfish 2 in the vast majority of games and even win against Stockfish 9.
Google trained grandmaster-level chess (2895 Elo) without search in a 270 million parameter transformer model with a training dataset of 10 million chess games: https://arxiv.org/abs/2402.04494 In the paper, they present results for model sizes 9M (internal bot tournament Elo 2007), 136M (Elo 2224), and 270M, all trained on the same dataset. Which is to say, data efficiency scales with model size.
Cool_Abbreviations_9@reddit
I don't think it's weird at all; memorization can look like planning given a humongous amount of data.
Whotea@reddit
Impossible to do this through training without generalizing as there are AT LEAST 10^120 possible game states in chess: https://en.wikipedia.org/wiki/Shannon_number
There are only 10^80 atoms in the universe: https://www.thoughtco.com/number-of-atoms-in-the-universe-603795
Othello can play games with boards and game states that it had never seen before: https://www.egaroucid.nyanyan.dev/en/
ReadyAndSalted@reddit
Read the post; the weirdness is not that they can play chess, it's that large, modern, "higher-benchmarking" models are all performing much, much worse than older, smaller, lower-benchmarking models.
ReadyAndSalted@reddit
I would be very interested to hear your explanation as to why modern models are performing much worse. Was it some switch in tokenisation strat or what?
paranoidray@reddit (OP)
We know that transformers trained specifically on chess games can be extremely good at chess. Maybe gpt-3.5-turbo-instruct happens to have been trained on a higher fraction of chess games, so it decided to dedicate a larger fraction of its parameters to chess.
That is, maybe LLMs sort of have little “chess subnetworks” hidden inside of them, but the size of the subnetworks depends on the fraction of data that was chess.
https://github.com/sgrvinod/chess-transformers
Whotea@reddit
Makes sense. Humans don’t know how to play chess until they train on it, so AI won’t be much different.
MoffKalast@reddit
Or more likely, 3.5-turbo received chess instruct tuning.
MoonGrog@reddit
I built a Python app to track piece placement and tried Claude, GPT-4, and Llama 3.2 13b, and none of them could handle the board state. They would try to make illegal moves; it was pretty terrible. I thought I could use this for measuring logic. I walked away.
Inevitable-Solid-936@reddit
Tried this myself with various local LLMs including Qwen and had the same results - ended up having to supply a list of legal moves. Playing a model against itself from a very small test seemed to always end in a stalemate.
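For anyone who wants to try this, here's a rough sketch of that kind of harness (assuming the python-chess library and a placeholder ask_llm() wrapper around whichever model is being tested) that tracks the board state and supplies the legal moves each turn:

```python
# Minimal harness sketch: python-chess tracks the board; the model only has to
# pick from the legal moves we hand it. ask_llm() is a placeholder.
import chess

def ask_llm(prompt: str) -> str:
    """Placeholder for whatever model call you use; should return a SAN move like 'Nf3'."""
    raise NotImplementedError

board = chess.Board()
while not board.is_game_over():
    legal = [board.san(m) for m in board.legal_moves]
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"Legal moves: {', '.join(legal)}\n"
        "Reply with exactly one of the legal moves."
    )
    move = ask_llm(prompt).strip()
    if move not in legal:
        move = legal[0]  # illegal or malformed reply: retry, forfeit, or fall back
    board.push_san(move)

print(board.result())
```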
pigeon57434@reddit
try o1
paranoidray@reddit (OP)
You specifically need to use gpt-3.5-turbo-instruct.
Nothing else works.
Sabin_Stargem@reddit
Hopefully the difference between GPT 3.5 Turbo Instruct and the other models will be figured out. Mistral Large 2 can't grasp how dice can modify a character, at least not when there are multiple levels to calculate.
MaasqueDelta@reddit
I don't think that's so weird at all. Notice one of the things he said:
The answer may be quite simple: quantization tends to give less accurate answers, making them worse than they should be. Maybe GPT-3.5-instruct is a full-quality (unquantized) model?
Either way, the fact that ONLY GPT 3.5-instruct gets good answers shows something odd is going on.
MrSomethingred@reddit
It reminds me of that paper where they trained GPT-2 to predict cellular automata behavior (I think), then built a LoRA trained on geometry puzzles, and found that the "intelligence" that emerged was highly transferable.
I'll find the actual paper later if someone else doesn't find it first.
adalgis231@reddit
GPT-3.5 being the best is the truly puzzling discovery. This needs an explanation.
ekcrisp@reddit
If you could give concrete reasons for why a particular model outputs something you would be solving AI's biggest problem
paranoidray@reddit (OP)
Only gpt-3.5-turbo-instruct!
gpt-3.5-turbo is as bad as the others.
jacek2023@reddit
Interesting, so the one and only model is a hidden AGI?
opi098514@reddit
So what’s interesting is there are a couple of models that are trained using chess notation. And those models are actually very good.
Healthy-Nebula-3603@reddit
...If you want an AI for chess, you can ask Qwen 32B to build such an AI... that takes literally a few seconds.
ResidentPositive4122@reddit
This is not interesting as a chess playing engine. This is interesting because it's mathematically impossible for the LLM to have "memorised" all the chess moves. So something else must be happening there. It's interesting because it can play at all, as in it will mostly output correct moves. So it "gets" chess. ELO and stuff don't matter that much, and obviously our top engines will smoke it. That's not the point.
UserXtheUnknown@reddit
There isn't a need to "memorize" all the chess moves. Memorizing all the games played by masters might be quite enough to give a decent "player" simulation against someone who plays following the same logic as a pro chess player (so, a program like Stockfish). Funnily enough, it would probably fail against some noob who plays very suboptimal moves, creating situations that never appear in the records.
Comprehensive-Pin667@reddit
That's not the point though
claythearc@reddit
I’ve been building a transformer-based chess bot for a while. My best guess on why newer stuff is performing worse is either tokenization changes (where it splits algebraic notation awkwardly) or chess being pruned from the training data since it’s not useful language data (whereas earlier models probably didn’t have as rigorous data pruning).
On the surface though a transformer seems reasonably well suited for chess, since moves can be pretty cleanly expressed as a token and the context of a game is really quite small. So there has to be something on the training / data / encoder side hurting it
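A quick, hedged way to poke at the tokenization guess is to look at how a PGN fragment actually splits into tokens, e.g. with tiktoken (the cl100k_base encoding here is an assumption; other model families use different tokenizers):

```python
# Inspect how a PGN fragment is split into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
print([enc.decode([t]) for t in enc.encode(pgn)])
# If moves like "Nf3" get split across several tokens, or whitespace is merged
# differently between tokenizers, the same game string can look very different
# to different models, which would matter for move prediction.
```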
paranoidray@reddit (OP)
Check out: https://github.com/sgrvinod/chess-transformers
Down_The_Rabbithole@reddit
What kind of stupid clickbait title is this?
jacobpederson@reddit
So it's the tokenizer not understanding the moves on the other models? Come on, the suspense is killing us!