looking for local models to benchmark against hosted models at a real-money poker table
Posted by alpharomeo777@reddit | LocalLLaMA | View on Reddit | 1 comments

tldr: built a poker arena where any agent can sit at a table and play no-limit hold'em against other agents. integration is a single skill file, no sdk. want to see how local models do against hosted ones when money is on the line.
what i actually want from this sub: people to plug in llama 3.3, qwen 2.5, deepseek, mistral, whatever you're running locally, and let them play a session. i want to know what happens.
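for anyone wondering what "single skill file, no sdk" looks like in practice, here's a rough sketch of the agent side: read a table observation, prompt your local model, parse a legal action out of whatever it says. every field name and the reply format here are invented for illustration — the actual skill file on the site defines the real contract.

```python
import json
import re

# hypothetical observation shape -- the real skill file defines the actual schema
observation = {
    "hole_cards": ["Ah", "Kd"],
    "board": [],
    "pot": 150,
    "to_call": 100,
    "stack": 9900,
    "legal_actions": ["fold", "call", "raise"],
}

def build_prompt(obs):
    """Turn the table state into a plain-text prompt for a local model."""
    return (
        "You are playing no-limit hold'em.\n"
        f"Hole cards: {' '.join(obs['hole_cards'])}\n"
        f"Board: {' '.join(obs['board']) or '(preflop)'}\n"
        f"Pot: {obs['pot']}  To call: {obs['to_call']}  Stack: {obs['stack']}\n"
        f"Legal actions: {', '.join(obs['legal_actions'])}\n"
        "Reply with one line, e.g. 'call' or 'raise 300'."
    )

def parse_action(reply, obs):
    """Extract a legal action from freeform model output; fold if unparseable."""
    m = re.search(r"\b(fold|call|check|raise)\b(?:\s+(\d+))?", reply.lower())
    if not m or m.group(1) not in obs["legal_actions"]:
        return {"action": "fold"}
    action = {"action": m.group(1)}
    if m.group(1) == "raise" and m.group(2):
        action["amount"] = int(m.group(2))
    return action

# reply = your_local_model(build_prompt(observation))  # llama.cpp, ollama, vllm...
reply = "I think we should raise 300 here."           # stand-in model output
print(json.dumps(parse_action(reply, observation)))   # {"action": "raise", "amount": 300}
```

the fold fallback matters more than it looks: local models off the beaten path will occasionally emit garbage, and an agent that crashes mid-hand just bleeds blinds.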
why i think this is a more honest benchmark than most:
- money is a harder reward signal to game than mcq accuracy or elo on a reasoning dataset
- poker forces reasoning under uncertainty with adversaries, which is where benchmark saturation has been hiding weaknesses
- the skill file is identical across models, so you're comparing the model, not the scaffolding
- you can't overfit to it in the normal way because the opponents keep changing
a few things i've noticed so far running hosted models against each other:
- risk profiles diverge way more than i expected. same prompt, same observation format, very different play
- one model tilts after bad beats. looser ranges for a few hands after losing a big pot. haven't figured out if it's in-context adaptation or something else
- some models are trivially exploitable (folds to almost any three-bet). others play weirdly solid
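the "folds to almost any three-bet" read is easy to quantify from a session log rather than eyeball. a sketch of the kind of counting i mean, with a made-up hand-history format (the arena's actual log format may differ):

```python
# made-up hand-history events: (hand_id, actor, action, context)
log = [
    (1, "model_a", "raise", "open"),
    (1, "model_b", "raise", "3bet"),
    (1, "model_a", "fold",  "vs_3bet"),
    (2, "model_a", "raise", "open"),
    (2, "model_b", "raise", "3bet"),
    (2, "model_a", "call",  "vs_3bet"),
    (3, "model_a", "raise", "open"),
    (3, "model_b", "raise", "3bet"),
    (3, "model_a", "fold",  "vs_3bet"),
]

def fold_to_3bet(log, player):
    """Fraction of the time `player` folds when facing a three-bet."""
    faced = [act for (_, who, act, ctx) in log if who == player and ctx == "vs_3bet"]
    return sum(act == "fold" for act in faced) / len(faced) if faced else 0.0

print(fold_to_3bet(log, "model_a"))  # ~0.67 -> exploitable: keep three-betting it
```

same idea extends to tilt detection: bucket a model's vpip/aggression in the n hands after losing a big pot vs its baseline and see if the ranges actually loosen.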
what i don't know and would like data on: how do open-weight models compare? my guess is better than people think, especially the reasoning-tuned ones, but i don't have numbers yet.
on the crypto part, which i know will come up: settlement runs on x402 over base because i needed agents to be able to sign their own economic actions, and the alternatives (stripe, whatever) don't let an agent be the payer. it's not a token thing, there's no coin, the money is usdc. happy to talk through this in comments if anyone cares.
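to make "agents sign their own economic actions" concrete: the buy-in/settlement request carries a signature from the agent's own key, not a human's saved card. a stdlib-only toy of that idea below — note i'm using HMAC purely as a stand-in for the real asymmetric wallet signature in x402's payment payload, so every name here is illustrative:

```python
import hashlib
import hmac
import json

AGENT_KEY = b"agent-private-key"  # stand-in; a real agent holds a wallet key

def sign_action(key, action):
    """Canonicalize the economic action and attach a signature over it."""
    payload = json.dumps(action, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "sig": sig}

def verify(key, signed):
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(key, signed["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["sig"])

signed = sign_action(AGENT_KEY, {"type": "buy_in", "table": "saturday", "amount_usdc": 50})
print(verify(AGENT_KEY, signed))  # True
```

the property that matters is the second half: if anything mutates the amount after signing, verification fails, which is the whole reason the agent (and not some payment middleman) holds the key.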
site is claudepoker.com. skill file is linked on the site, you can point any agent at it. if you run a local model and want to enter saturday's game, those are genuinely the seats i'll be watching most closely.
Shoddy_Cook_864@reddit
Try this project out — it's a free, open-source project that lets you use large models like Kimi K2 with Claude Code for free by utilizing NVIDIA Cloud.
Github link: https://github.com/Ujwal397/Arbiter/