My 1.2B model won 2 out of 5 poker tournaments against models up to 1T params.
Posted by Junior_Bake5120@reddit | LocalLLaMA | 11 comments
I made 6 LLMs play Texas Hold’em against each other. Ran 5 tournaments on my 16GB MacBook. The 1.2B local model won more than anything else.
| Run | Winner | Size |
|---|---|---|
| 1 | Qwen | 1.7B local |
| 2 | MiniMax | 230B cloud |
| 3 | Liquid | 1.2B local |
| 4 | Kimi | \~1T cloud |
| 5 | Liquid | 1.2B local |
Lineup was Liquid lfm2.5 (1.2B, LM Studio, \~5s/decision), Qwen3 (1.7B, LM Studio, \~2.5 min), Claude Haiku 4.5, GPT-OSS (120B, Fireworks), MiniMax M2 (230B, Fireworks), and Kimi K2 (\~1T, Fireworks).
Run 3 was wild. Liquid played 6 hands: 19 raises, 0 folds. Just raised everything no matter what cards it had. Won $5.98M from a $1M starting stack. GPT-OSS (120B) in the same run did 0 raises and 5 folds in 6 hands. The 120B model was too smart to bluff and it got blinded out.
Before you come for me: yes, 25 hands with 5K/10K blinds + 1K ante is basically a shove-or-fold format. This is not deep poker. The format punishes patience and rewards aggression. The big models “understand” poker well enough to fold bad hands.
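For a rough sense of how much pressure the forced bets alone apply, here is a back-of-envelope sketch. It assumes a 6-handed table where nobody busts, blinds that stay at 5K/10K, and a 1K ante from every player on every hand. None of that is confirmed above, so treat the numbers as illustrative.

```python
# Forced-bet cost for a player who folds every single hand.
# Assumptions (not confirmed by the post): 6 players stay seated,
# blinds never escalate, every player posts the ante every hand.
SMALL_BLIND = 5_000
BIG_BLIND = 10_000
ANTE = 1_000
PLAYERS = 6
HANDS = 25
STARTING_STACK = 1_000_000


def forced_cost(hands: int, players: int = PLAYERS) -> float:
    """Average chips paid in blinds and antes over `hands` hands."""
    # In one orbit a player posts each blind once and the ante every hand.
    per_orbit = SMALL_BLIND + BIG_BLIND + ANTE * players
    return per_orbit * hands / players


cost = forced_cost(HANDS)
print(f"cost over {HANDS} hands: {cost:,.0f}")
print(f"share of starting stack: {cost / STARTING_STACK:.2%}")
```

Under those assumptions this prints 87,500 chips and 8.75% of a $1M stack. As players bust out the orbits get shorter, so the cost per hand goes up for whoever is left.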
Liquid doesn’t know what a bad hand looks like. So it raises everything. Against opponents who fold too much, that prints money. Not claiming small models are smarter at poker. I’m saying in this specific format, not knowing when to fold is an advantage. Which is kind of hilarious.
I want to run longer tournaments (100+ hands, lower blinds) where hand-reading actually matters. If you have a local model you want to see at the table, drop it below. Especially curious about Mistral, Llama, Gemma 3. The framework also supports custom personas (personality traits, risk tolerance, fears) per player, so if you want to design a degenerate gambler or a paranoid folder, I’ll build it and run it.
Side note: my fan was screaming during the Phi-4 runs. Seven minutes per decision. It played 2 hands before getting eliminated. I only had enough RAM for one local model at a time (I was working while running the project on the side), so the locals played sequentially.
Code and full result JSONs: https://github.com/chiruu12/Hive (tournament runner is in hive-arena/, results in tournaments/results/)
LocalLLaMA-ModTeam@reddit
Rule 3
nsdjoe@reddit
you probably already know this but in addition to shove-or-fold format, only 5 tournaments is extremely high variance. you'd need dozens or better yet hundreds to be more confident in your results
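To put a rough number on the variance point, here is a minimal sketch. It assumes six equally matched players, so each has a 1/6 chance of winning any one tournament, and that tournaments are independent. That is a simplification for illustration, not something measured in the runs above.

```python
from math import comb


def prob_at_least(wins: int, runs: int, p: float) -> float:
    """Binomial tail: P(at least `wins` successes in `runs` independent trials)."""
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(wins, runs + 1)
    )


# Assumed chance baseline: one winner drawn uniformly from six players.
chance = 1 / 6
p_value = prob_at_least(2, 5, chance)
print(f"P(>=2 wins in 5 by pure chance) = {p_value:.3f}")
```

This prints 0.196, so a player with no edge at all would still win 2 or more of 5 tournaments roughly one time in five.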
Junior_Bake5120@reddit (OP)
Ikr! I just want to get some feedback from fellow redditors, because ik I will be missing a bunch of things that you can point out. The 5-tournament run was just a demo; I'll be doing like 100 to get a better picture. I'm making the framework better as well.
portmanteaudition@reddit
We don't mean 100. We mean more like 1 million.
Junior_Bake5120@reddit (OP)
Sure if you sponsor the credits/hardware lol
portmanteaudition@reddit
It sounds like you lack the hardware or money for benchmarking properly so you are settling for meaningless noise. Good luck kiddo.
finevelyn@reddit
So you're saying the 1.2B model made zero intelligent decisions. Why use a 1.2B model when a 0B model could do the same?
Otherwise_Economy576@reddit
1.2b winning poker tourneys is wild. would love to know if it is exploiting formatting quirks in the benchmark or actual strategy - either way a good reminder that task-specific small models beat general giants.
Junior_Bake5120@reddit (OP)
Yep fr. Btw, it is not exploiting any formatting quirks in the benchmark; it just doesn't know if it has a bad hand or a good one lol. Hence it just plays really aggressively.
Interesting-Print366@reddit
How did a rule-based model perform?
Junior_Bake5120@reddit (OP)
Haven’t tested a rule-based baseline yet, will do that for the next run.
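A rule-based baseline could be as small as the sketch below. The action names and the function signatures are hypothetical, made up for illustration; they are not taken from the Hive code, so they would need adapting to whatever the tournament runner actually passes to a player.

```python
import random
from typing import Sequence


def always_raise(legal_actions: Sequence[str]) -> str:
    """The "0B model": take the most aggressive action that is legal."""
    # Hypothetical action names; the real framework may use different ones.
    for action in ("raise", "call", "check"):
        if action in legal_actions:
            return action
    return "fold"


def uniform_random(legal_actions: Sequence[str], rng: random.Random) -> str:
    """Second baseline: pick any legal action with equal probability."""
    return rng.choice(list(legal_actions))


print(always_raise(["fold", "call", "raise"]))
print(always_raise(["fold", "call"]))
```

This prints `raise` then `call`. If an LLM player cannot beat `always_raise` over many tournaments, its wins in this format probably come from aggression rather than from reading hands.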