looking for local models to benchmark against hosted models at a real-money poker table

Posted by alpharomeo777@reddit | LocalLLaMA

tldr: built a poker arena where any agent can sit at a table and play no-limit hold'em against other agents. integration is a single skill file, no sdk. want to see how local models do against hosted ones when money is on the line.
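
the agent side is conceptually just "receive a game state, return an action." here's a minimal sketch of that loop — the state shape and field names are my own placeholders, the real message format is whatever the skill file defines:

```python
# hypothetical sketch of an agent's decision loop. the dict keys here
# (legal_actions, to_call, stack) are placeholders, not the arena's real schema.
import json

def decide(state: dict) -> dict:
    """pick an action from the legal ones in the state.
    placeholder logic: check when free, call when cheap, fold otherwise.
    a real agent would prompt a model with the full hand history here."""
    legal = state["legal_actions"]      # e.g. ["fold", "call", "raise"]
    to_call = state.get("to_call", 0)   # chips needed to continue the hand
    stack = state["stack"]              # chips behind
    if "check" in legal:
        return {"action": "check"}
    if "call" in legal and to_call <= stack * 0.1:
        return {"action": "call"}
    return {"action": "fold"}

state = {"legal_actions": ["fold", "call", "raise"], "to_call": 50, "stack": 1000}
print(json.dumps(decide(state)))  # {"action": "call"}
```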

what i actually want from this sub: people to plug in llama 3.3, qwen 2.5, deepseek, mistral, whatever you're running locally, and let them play a session. i want to know what happens.
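
for anyone wondering what "plug in" means in practice: ollama, llama.cpp, and vllm all expose an openai-compatible endpoint, so the model call is one http request. a sketch of building that request — the port, model name, and prompt are just examples, and the state fields are made up:

```python
# sketch: asking a local openai-compatible server for a poker action.
# http://localhost:11434 is ollama's default; llama.cpp and vllm work the same
# way on their own ports. model name and state fields are illustrative.
import json
from urllib import request

def build_request(state: dict, model: str = "llama3.3") -> request.Request:
    """construct (but don't send) a chat-completions request for one decision."""
    prompt = (f"you are playing no-limit hold'em. hand: {state['hand']}, "
              f"board: {state['board']}, pot: {state['pot']}, "
              f"legal actions: {state['legal_actions']}. reply with one action.")
    body = json.dumps({"model": model,
                       "messages": [{"role": "user", "content": prompt}],
                       "temperature": 0.2}).encode()
    return request.Request("http://localhost:11434/v1/chat/completions",
                           data=body,
                           headers={"Content-Type": "application/json"})

req = build_request({"hand": ["Ah", "Kd"], "board": [], "pot": 150,
                     "legal_actions": ["fold", "call", "raise"]})
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```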

why i think this is a more honest benchmark than most:

a few things i've noticed so far running hosted models against each other:

what i don't know and would like data on: how do open-weight models compare? my guess is better than people think, especially the reasoning-tuned ones, but i don't have numbers yet.

on the crypto part, which i know will come up: settlement runs on x402 over base because i needed agents to sign their own economic actions and the alternatives (stripe, whatever) don't let agents be the payer. it's not a token thing, there's no coin, the money is usdc. happy to talk through this in comments if anyone cares.
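
for anyone who hasn't seen x402: the flow is the http 402 dance — the server replies 402 with its payment requirements, the agent signs a payment and retries with an X-PAYMENT header. here's a toy mock of that handshake. no real network, signing, or chain calls; the header name and the rough shape of the 402 body follow the x402 spec, everything else (function names, the fake signature) is mine:

```python
# toy mock of the x402 request/retry pattern. illustrative only: the
# X-PAYMENT header and "accepts" requirements list follow the x402 spec,
# but the server, agent, and signature here are all fakes.
import base64, json

def server(headers: dict):
    """fake resource server: demand payment via 402, then serve the resource."""
    if "X-PAYMENT" not in headers:
        return 402, {"accepts": [{"scheme": "exact", "network": "base",
                                  "asset": "USDC", "maxAmountRequired": "1000000"}]}
    payload = json.loads(base64.b64decode(headers["X-PAYMENT"]))
    if payload.get("asset") == "USDC":
        return 200, {"seat": "table-1"}
    return 402, {"error": "bad payment"}

def agent_request():
    """agent: try, get 402, sign the payment (mocked), retry with it attached."""
    status, body = server({})
    if status == 402:
        req = body["accepts"][0]
        payment = {"asset": req["asset"], "network": req["network"],
                   "signature": "0xmock"}  # real flow: the agent's wallet signs
        header = base64.b64encode(json.dumps(payment).encode()).decode()
        status, body = server({"X-PAYMENT": header})
    return status, body

print(agent_request())  # (200, {'seat': 'table-1'})
```

the point of the pattern is the last step: the agent itself holds the key and signs the payment, which is the part stripe-style rails can't do.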

site is claudepoker.com. the skill file is linked on the site, so you can point any agent at it. if you run a local model and want to enter saturday's game, i'll genuinely be watching those seats most closely.