My 1.2B model won 2 out of 5 poker tournaments against models up to 1T params.

Posted by Junior_Bake5120@reddit | LocalLLaMA | 11 comments

I made 6 LLMs play Texas Hold’em against each other. Ran 5 tournaments on my 16GB MacBook. The 1.2B local model won more than anything else.

Run	Winner	Size
1	Qwen	1.7B local
2	MiniMax	230B cloud
3	Liquid	1.2B local
4	Kimi	~1T cloud
5	Liquid	1.2B local

Lineup was Liquid lfm2.5 (1.2B, LM Studio, ~5s/decision), Qwen3 (1.7B, LM Studio, ~2.5 min), Claude Haiku 4.5, GPT-OSS (120B, Fireworks), MiniMax M2 (230B, Fireworks), and Kimi K2 (~1T, Fireworks).

Run 3 was wild. Liquid played 6 hands: 19 raises, 0 folds. Just raised everything no matter what cards it had. Won $5.98M from a $1M starting stack. GPT-OSS (120B) in the same run did 0 raises and 5 folds in 6 hands. The 120B model was too smart to bluff and it got blinded out.

Before you come for me: yes, 25 hands with 5K/10K blinds + 1K ante is basically a shove-or-fold format. This is not deep poker. The format punishes patience and rewards aggression. The big models “understand” poker well enough to fold bad hands.

Liquid doesn’t know what a bad hand looks like. So it raises everything. Against opponents who fold too much, that prints money. Not claiming small models are smarter at poker. I’m saying in this specific format, not knowing when to fold is an advantage. Which is kind of hilarious.
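
To put rough numbers on the "opponents who fold too much" point, here is a back-of-the-envelope sketch in Python. It is not from the Hive code. It assumes all six players are seated and each posts the 1K ante, and the 30K raise size, the single caller, and the 35% equity when called are made-up illustrative values.

```python
def raise_ev(pot, bet, fold_prob, equity_when_called):
    """Expected chips won by raising `bet` into `pot` against one opponent.

    Simplified model: the opponent either folds (we take the pot) or calls
    the full bet and the hand goes to showdown with no further betting.
    """
    win_when_called = equity_when_called * (pot + bet)   # we net the pot plus their call
    lose_when_called = (1 - equity_when_called) * bet    # we lose our own raise
    return fold_prob * pot + (1 - fold_prob) * (win_when_called - lose_when_called)


def breakeven_fold_prob(pot, bet):
    """Fold frequency at which a zero-equity bluff breaks even."""
    return bet / (pot + bet)


# Dead money before anyone acts: 5K small blind + 10K big blind + six 1K antes.
# Assumption: all six players are still in the tournament.
POT = 5_000 + 10_000 + 6 * 1_000
BET = 30_000  # hypothetical raise size, not taken from the tournament logs

print(f"break-even fold frequency: {breakeven_fold_prob(POT, BET):.1%}")
for fold_prob in (0.4, 0.6, 0.8):
    ev = raise_ev(POT, BET, fold_prob, equity_when_called=0.35)
    print(f"opponent folds {fold_prob:.0%}: EV = {ev:+,.0f}")
```

Under these assumptions, a raise with a completely dead hand breaks even once the opponent folds a bit under 59% of the time, and any real equity when called pushes that threshold lower. The model ignores multiway pots and later streets, so treat it as intuition only.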

I want to run longer tournaments (100+ hands, lower blinds) where hand-reading actually matters. If you have a local model you want to see at the table, drop it below. Especially curious about Mistral, Llama, Gemma 3. The framework also supports custom personas (personality traits, risk tolerance, fears) per player, so if you want to design a degenerate gambler or a paranoid folder, I’ll build it and run it.

Side note: my fan was screaming during the Phi-4 runs. Seven minutes per decision. It played 2 hands before getting eliminated. I only had enough RAM for one local model at a time (I was working while the project ran on the side), so the locals played sequentially.

Code and full result JSONs: https://github.com/chiruu12/Hive (tournament runner is in hive-arena/, results in tournaments/results/)

LocalLLaMA-ModTeam@reddit

Rule 3

nsdjoe@reddit

You probably already know this, but on top of the shove-or-fold format, only 5 tournaments is extremely high variance. You'd need dozens, or better yet hundreds, to be more confident in your results.
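
As a rough illustration of how noisy five tournaments are, the sketch below asks what pure luck would produce. It assumes a null model where each of the six players is equally likely to win each tournament and the tournaments are independent. That is an assumption for illustration, not a claim about how the models actually play.

```python
from math import comb, perm


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


PLAYERS = 6       # models at the table
TOURNAMENTS = 5   # runs in the post
WINS = 2          # wins by the 1.2B model

# Null model (assumption): every seat has the same 1-in-6 chance per tournament.
chance = 1 / PLAYERS

# Chance that one specific player wins at least 2 of 5 by luck alone.
p_specific = prob_at_least(WINS, TOURNAMENTS, chance)

# Chance that *some* player wins at least twice: the complement of
# all five tournaments having five different winners.
p_any = 1 - perm(PLAYERS, TOURNAMENTS) / PLAYERS**TOURNAMENTS

print(f"specific player wins >= {WINS} of {TOURNAMENTS}: {p_specific:.1%}")
print(f"any player wins >= {WINS} of {TOURNAMENTS}: {p_any:.1%}")
```

Under that null model, a given player wins two or more of five about 20% of the time, and some player at the table does so about 91% of the time. A 2-of-5 result is therefore not distinguishable from chance on its own.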

Junior_Bake5120@reddit (OP)

Ikr! I just wanted some feedback from fellow redditors, because I know I'll be missing a bunch of things that others can point out. The 5-tournament run was just a demo; I'll be doing around 100 to get a better picture. I'm making the framework better as well.

portmanteaudition@reddit

We don't mean 100. We mean more like 1 million.

Sure if you sponsor the credits/hardware lol

It sounds like you lack the hardware or money for benchmarking properly so you are settling for meaningless noise. Good luck kiddo.