Follow-up: Testing Gemma-4-31B-it-UD (Thinking) in LLM Multi-Agent Avalon

Posted by dynameis_chen@reddit | LocalLLaMA

(Previous post link: Comparing OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash)

Following up on my previous post comparing OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash in my multi-agent Avalon sandbox, I managed to run another heavyweight local model: Gemma-4-31B-it-UD (Q4_K_XL). I also ran a quick test with Gemini 2.5 Flash-Lite to see how the smaller API models handle the sandbox.

Disclaimer (take with a grain of salt): I made some minor prompt tweaks and bug fixes to the sandbox since the last run. There are no fundamental changes to the core rules or reasoning structure, but it means direct 1:1 comparisons aren't perfectly scientific. I'd love to re-run all models on the latest prompt, but this single 7-player game with Gemma-4-31B took 7 hours to complete. If anyone has the hardware and wants to help run benchmarks, contribution instructions are on my GitHub!

Hardware Setup: Framework Desktop (AMD Strix Halo 395+ with 128GB RAM).

Gemma-4-31B-it-UD (Q4_K_XL, Native Thinking Enabled) Performance: PP: ~229 t/s, OUT: ~8.6 t/s

The Speed Trade-off: At ~8.6 t/s output speed, waiting for 7 agents to complete their internal monologues and formatted JSONs requires serious patience.
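For a sense of why a single game takes so long, here is a back-of-envelope estimate. Only the 8.6 t/s figure is measured; the per-turn token count and number of rounds are my hypothetical assumptions, not numbers from the sandbox.

```python
# Rough latency estimate for one full game.
# Only OUT_TPS is a measured value; everything else is an assumed placeholder.
OUT_TPS = 8.6            # measured output speed (t/s)
AGENTS = 7               # players in the game
TOKENS_PER_TURN = 1500   # assumed thinking + JSON output per agent turn
ROUNDS = 20              # assumed discussion/vote rounds in a full game

total_tokens = AGENTS * TOKENS_PER_TURN * ROUNDS
hours = total_tokens / OUT_TPS / 3600
print(f"~{hours:.1f} h of pure generation")  # ~6.8 h
```

With those (made-up) per-turn numbers, generation time alone lands in the same ballpark as the 7 hours the game actually took, before counting prompt processing.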

Comparisons & Gameplay Execution: The Good team swept the game 3-0, culminating in a brilliant endgame. Here is how Gemma-4-31B stacks up against the previous contenders and the newly tested 2.5 Flash-Lite:

The Gemma-4-26B-A4B (MoE) Attempt: I originally wanted to test the MoE version (26B A4B) as well, but hit several roadblocks. With 'Thinking' enabled, it suffered from the exact same issue as the Qwen 9B model: it got stuck in endless CoT reasoning loops and never reached the required output format. (My working theory: forcing strict JSON syntax constraints alongside open-ended 'Thinking' overwhelms the limited active parameters of the MoE architecture, causing an attention loop, though this isn't 100% confirmed.) I tried running it with 'Thinking' disabled, but encountered ROCm support issues that caused immediate crashes.
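One way to keep a looping model from stalling the whole game is to cap each turn and retry until a parsable JSON action appears. This is a minimal sketch of that guard, not the sandbox's actual code; `generate` is a stand-in for whatever backend call (llama.cpp, an API client) actually produces the reply.

```python
import json
import re

def generate_with_guard(generate, prompt, retries=2):
    """Retry until the reply contains a parsable JSON object.

    `generate(prompt) -> str` is a placeholder for any backend call;
    runaway CoT replies with no JSON simply burn one retry.
    """
    for _ in range(retries + 1):
        reply = generate(prompt)
        match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                continue  # JSON-looking but malformed: ask again
    raise ValueError("model never produced a valid JSON action")

# Toy backend: loops in reasoning once, then answers properly.
replies = iter(["I keep thinking and thinking...", '{"action": "approve"}'])
print(generate_with_guard(lambda p: next(replies), "vote now"))
# {'action': 'approve'}
```

In the real failure mode described above the model never converges, so the guard's job is just to fail fast instead of hanging the game.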

TL;DR: Gemma-4-31B (Q4) is painfully slow at ~8.6 t/s out, but its role comprehension and execution of complex social deduction tactics (like intentional baiting and decoy plays) are phenomenal. It plays better than OAI 120B OSS, keeps its massive reasoning safely contained in native <think> tags (unlike the JSON-bloating Gemini 2.5 Flash-Lite), and rivals Gemini 3.0 Flash in strategic depth (though it still falls slightly short in natural roleplay persona) without the API costs.
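When a model keeps its reasoning inside native <think> tags, the harness can strip it out mechanically before parsing the JSON action. This is a minimal sketch of that separation step under my own assumptions about the reply shape; it is not the sandbox's actual parser.

```python
import json
import re

def parse_agent_reply(raw: str) -> dict:
    """Strip <think>...</think> reasoning, then parse the remaining JSON action."""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # Some models still wrap the action in a ```json fence; unwrap it if so.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", visible, flags=re.DOTALL)
    if fenced:
        visible = fenced.group(1).strip()
    return json.loads(visible)

raw = ('<think>Merlin probably suspects me, so I should approve '
       'to look loyal...</think>\n{"action": "vote", "approve": false}')
print(parse_agent_reply(raw))
# {'action': 'vote', 'approve': False}
```

A model that instead leaks its monologue into the JSON payload itself (the Flash-Lite failure mode above) defeats this kind of clean split, which is why the tag containment matters.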

The full game log for this run, along with the previous ones, is available on my GitHub.

https://github.com/hsinyu-chen/llm-avalon