Follow-up: Testing Gemma-4-31B-it-UD (Thinking) in LLM Multi-Agent Avalon

Posted by dynameis_chen@reddit | LocalLLaMA

(Previous post link: Comparing OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash)

Following up on my previous post comparing OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash in my multi-agent Avalon sandbox, I managed to run another heavyweight local model: Gemma-4-31B-it-UD (Q4_K_XL). I also ran a quick test with Gemini 2.5 Flash-Lite to see how the smaller API models handle the sandbox.

Disclaimer (take with a grain of salt): I made some minor prompt tweaks and bug fixes to the sandbox since the last run. There are no fundamental changes to the core rules or reasoning structure, but it means direct 1:1 comparisons aren't perfectly scientific. I'd love to re-run all models on the latest prompt, but this single 7-player game with Gemma-4-31B took 7 hours to complete. If anyone has the hardware and wants to help run benchmarks, contribution instructions are on my GitHub!

Hardware Setup: Framework Desktop (AMD Strix Halo 395+ with 128GB RAM).

Gemma-4-31B-it-UD (Q4_K_XL, Native Thinking Enabled) Performance: PP: ~229 t/s, OUT: ~8.6 t/s

The Speed Trade-off: At ~8.6 t/s output speed, waiting for 7 agents to complete their internal monologues and formatted JSONs requires serious patience.
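For a sense of why a single game takes so long, here is a back-of-envelope estimate. Only the 8.6 t/s figure is measured; the per-turn token count and number of rounds are my hypothetical assumptions, not numbers from the sandbox.

```python
# Rough latency estimate for one full game.
# Only OUT_TPS is a measured value; everything else is an assumed placeholder.
OUT_TPS = 8.6            # measured output speed (t/s)
AGENTS = 7               # players in the game
TOKENS_PER_TURN = 1500   # assumed thinking + JSON output per agent turn
ROUNDS = 20              # assumed discussion/vote rounds in a full game

total_tokens = AGENTS * TOKENS_PER_TURN * ROUNDS
hours = total_tokens / OUT_TPS / 3600
print(f"~{hours:.1f} h of pure generation")  # ~6.8 h
```

With those (made-up) per-turn numbers, generation time alone lands in the same ballpark as the 7 hours the game actually took, before counting prompt processing.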

Comparisons & Gameplay Execution: The Good team swept the game 3-0, culminating in a brilliant endgame. Here is how Gemma-4-31B stacks up against the previous contenders and the newly tested 2.5 Flash-Lite:

The Gemma-4-26B-A4B (MoE) Attempt: I originally wanted to test the MoE version (26B A4B) as well, but hit several roadblocks. With 'Thinking' enabled, it suffered from the exact same issue as the Qwen 9B model: it got stuck in endless CoT reasoning loops and never reached the required output format. (My working theory: forcing strict JSON syntax constraints alongside open-ended 'Thinking' overwhelms the limited active parameters of the MoE architecture, causing an attention loop, though this isn't 100% confirmed.) I tried running it with 'Thinking' disabled, but encountered ROCm support issues that caused immediate crashes.
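One way to keep a looping model from stalling the whole game is to cap each turn and retry until a parsable JSON action appears. This is a minimal sketch of that guard, not the sandbox's actual code; `generate` is a stand-in for whatever backend call (llama.cpp, an API client) actually produces the reply.

```python
import json
import re

def generate_with_guard(generate, prompt, retries=2):
    """Retry until the reply contains a parsable JSON object.

    `generate(prompt) -> str` is a placeholder for any backend call;
    runaway CoT replies with no JSON simply burn one retry.
    """
    for _ in range(retries + 1):
        reply = generate(prompt)
        match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                continue  # JSON-looking but malformed: ask again
    raise ValueError("model never produced a valid JSON action")

# Toy backend: loops in reasoning once, then answers properly.
replies = iter(["I keep thinking and thinking...", '{"action": "approve"}'])
print(generate_with_guard(lambda p: next(replies), "vote now"))
# {'action': 'approve'}
```

In the real failure mode described above the model never converges, so the guard's job is just to fail fast instead of hanging the game.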

TL;DR: Gemma-4-31B (Q4) is painfully slow at ~8.6 t/s out, but its role comprehension and execution of complex social deduction tactics (like intentional baiting and decoy plays) are phenomenal. It plays better than OAI 120B OSS, keeps its massive reasoning safely contained in native <think> tags (unlike the JSON-bloating Gemini 2.5 Flash-Lite), and rivals Gemini 3.0 Flash in strategic depth (though it still falls slightly short in natural roleplay persona) without the API costs.
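When a model keeps its reasoning inside native <think> tags, the harness can strip it out mechanically before parsing the JSON action. This is a minimal sketch of that separation step under my own assumptions about the reply shape; it is not the sandbox's actual parser.

```python
import json
import re

def parse_agent_reply(raw: str) -> dict:
    """Strip <think>...</think> reasoning, then parse the remaining JSON action."""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # Some models still wrap the action in a ```json fence; unwrap it if so.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", visible, flags=re.DOTALL)
    if fenced:
        visible = fenced.group(1).strip()
    return json.loads(visible)

raw = ('<think>Merlin probably suspects me, so I should approve '
       'to look loyal...</think>\n{"action": "vote", "approve": false}')
print(parse_agent_reply(raw))
# {'action': 'vote', 'approve': False}
```

A model that instead leaks its monologue into the JSON payload itself (the Flash-Lite failure mode above) defeats this kind of clean split, which is why the tag containment matters.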

The full game log for this run, along with the previous ones, is available on my GitHub.

https://github.com/hsinyu-chen/llm-avalon