looking for local models to benchmark against hosted models at a real-money poker table
Posted by alpharomeo777@reddit | LocalLLaMA | View on Reddit | 1 comments

tldr: built a poker arena where any agent can sit at a table and play no-limit hold'em against other agents. integration is a single skill file, no sdk. want to see how local models do against hosted ones when money is on the line.
what i actually want from this sub: people to plug in llama 3.3, qwen 2.5, deepseek, mistral, whatever you're running locally, and let them play a session. i want to know what happens.
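for anyone wondering what "single skill file, no sdk" looks like in practice, here's a rough sketch of the agent side: read a table observation, prompt your local model, parse a legal action out of whatever it says. every field name and the reply format here are invented for illustration — the actual skill file on the site defines the real contract.

```python
import json
import re

# hypothetical observation shape -- the real skill file defines the actual schema
observation = {
    "hole_cards": ["Ah", "Kd"],
    "board": [],
    "pot": 150,
    "to_call": 100,
    "stack": 9900,
    "legal_actions": ["fold", "call", "raise"],
}

def build_prompt(obs):
    """Turn the table state into a plain-text prompt for a local model."""
    return (
        "You are playing no-limit hold'em.\n"
        f"Hole cards: {' '.join(obs['hole_cards'])}\n"
        f"Board: {' '.join(obs['board']) or '(preflop)'}\n"
        f"Pot: {obs['pot']}  To call: {obs['to_call']}  Stack: {obs['stack']}\n"
        f"Legal actions: {', '.join(obs['legal_actions'])}\n"
        "Reply with one line, e.g. 'call' or 'raise 300'."
    )

def parse_action(reply, obs):
    """Extract a legal action from freeform model output; fold if unparseable."""
    m = re.search(r"\b(fold|call|check|raise)\b(?:\s+(\d+))?", reply.lower())
    if not m or m.group(1) not in obs["legal_actions"]:
        return {"action": "fold"}
    action = {"action": m.group(1)}
    if m.group(1) == "raise" and m.group(2):
        action["amount"] = int(m.group(2))
    return action

# reply = your_local_model(build_prompt(observation))  # llama.cpp, ollama, vllm...
reply = "I think we should raise 300 here."           # stand-in model output
print(json.dumps(parse_action(reply, observation)))   # {"action": "raise", "amount": 300}
```

the fold fallback matters more than it looks: local models off the beaten path will occasionally emit garbage, and an agent that crashes mid-hand just bleeds blinds.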
why i think this is a more honest benchmark than most:
- money is a harder reward signal to game than mcq accuracy or elo on a reasoning dataset
- poker forces reasoning under uncertainty with adversaries, which is where benchmark saturation has been hiding weaknesses
- the skill file is identical across models, so you're comparing the model, not the scaffolding
- you can't overfit to it in the normal way because the opponents keep changing
a few things i've noticed so far running hosted models against each other:
- risk profiles diverge way more than i expected. same prompt, same observation format, very different play
- one model tilts after bad beats. looser ranges for a few hands after losing a big pot. haven't figured out if it's in-context adaptation or something else
- some models are trivially exploitable (folds to almost any three-bet). others play weirdly solid
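the "folds to almost any three-bet" read is easy to quantify from a session log rather than eyeball. a sketch of the kind of counting i mean, with a made-up hand-history format (the arena's actual log format may differ):

```python
# made-up hand-history events: (hand_id, actor, action, context)
log = [
    (1, "model_a", "raise", "open"),
    (1, "model_b", "raise", "3bet"),
    (1, "model_a", "fold",  "vs_3bet"),
    (2, "model_a", "raise", "open"),
    (2, "model_b", "raise", "3bet"),
    (2, "model_a", "call",  "vs_3bet"),
    (3, "model_a", "raise", "open"),
    (3, "model_b", "raise", "3bet"),
    (3, "model_a", "fold",  "vs_3bet"),
]

def fold_to_3bet(log, player):
    """Fraction of the time `player` folds when facing a three-bet."""
    faced = [act for (_, who, act, ctx) in log if who == player and ctx == "vs_3bet"]
    return sum(act == "fold" for act in faced) / len(faced) if faced else 0.0

print(fold_to_3bet(log, "model_a"))  # ~0.67 -> exploitable: keep three-betting it
```

same idea extends to tilt detection: bucket a model's vpip/aggression in the n hands after losing a big pot vs its baseline and see if the ranges actually loosen.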
what i don't know and would like data on: how do open-weight models compare? my guess is better than people think, especially the reasoning-tuned ones, but i don't have numbers yet.
on the crypto part, which i know will come up: settlement runs on x402 over base because i needed agents to be able to sign their own economic actions, and the alternatives (stripe, whatever) don't let an agent be the payer. it's not a token thing, there's no coin, the money is usdc. happy to talk through this in comments if anyone cares.
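to make "agents sign their own economic actions" concrete: the buy-in/settlement request carries a signature from the agent's own key, not a human's saved card. a stdlib-only toy of that idea below — note i'm using HMAC purely as a stand-in for the real asymmetric wallet signature in x402's payment payload, so every name here is illustrative:

```python
import hashlib
import hmac
import json

AGENT_KEY = b"agent-private-key"  # stand-in; a real agent holds a wallet key

def sign_action(key, action):
    """Canonicalize the economic action and attach a signature over it."""
    payload = json.dumps(action, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "sig": sig}

def verify(key, signed):
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(key, signed["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["sig"])

signed = sign_action(AGENT_KEY, {"type": "buy_in", "table": "saturday", "amount_usdc": 50})
print(verify(AGENT_KEY, signed))  # True
```

the property that matters is the second half: if anything mutates the amount after signing, verification fails, which is the whole reason the agent (and not some payment middleman) holds the key.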
site is claudepoker.com. skill file is linked on the site, you can point any agent at it. if you run a local model and want to enter saturday's game, those are genuinely the seats i'll be watching most closely.
Shoddy_Cook_864@reddit
Try this project out — it's a free, open-source project that lets you use large models like Kimi K2 with Claude Code for free by utilizing NVIDIA Cloud.
Github link: https://github.com/Ujwal397/Arbiter/