Gemma 4 constantly repeating the same token
Posted by leorgain@reddit | LocalLLaMA | 10 comments
I've been updating to the llama.cpp nightlies as they've come out, but for the life of me I can't get Gemma 4 31b to stop repeating the same tokens after a couple of messages. It starts out fine, but after the third or fourth reply it just repeats the last two or three tokens it output. I've tried deactivating all samplers and then entering Google's recommended settings (even turning on min-p, but that didn't work either), re-downloading the quants (bartowski's Q6_K_L), and activating XTC, DRY, or both at the same time.
Does anyone have any ideas as to what's going on?
Side note: I've noticed models like Step 3.5 and Gemma 4 having weird issues with the word "of", either merging it with the previous word or hyphenating it. That one is less annoying, but if anyone has ideas on that too I'd appreciate it.
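For reference, here's a minimal sketch of the sampler settings involved, as a request payload in the shape llama.cpp's OpenAI-compatible server accepts. The values (temperature 1.0, top_k 64, top_p 0.95, min_p 0.0) are the ones Google published for Gemma 3; whether Gemma 4 uses the same recommendations is an assumption.

```python
# Sampler settings Google published for Gemma 3 (assumed here to carry
# over to Gemma 4). Payload shape follows llama.cpp's OpenAI-compatible
# /v1/chat/completions endpoint.
payload = {
    "model": "gemma",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 1.0,  # Google's recommended value for Gemma 3
    "top_k": 64,
    "top_p": 0.95,
    "min_p": 0.0,        # min-p effectively disabled by default
}
```

Comparing a payload like this against what your frontend actually sends (e.g. in SillyTavern's request log) is a quick way to spot a stray sampler or bias setting.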
leorgain@reddit (OP)
It seems like I fixed it on my end. I had a generic "anti-slop" ban list. When using SillyTavern I took a look at my outgoing requests and noticed I had a logit bias set. When converting the tokens I saw one of them was Gemma's token (2), along with a couple of others. Once I cleared the list the logit bias went away and everything worked as expected. Since this change, "of" works properly again and, fingers crossed, I haven't had a repeating output yet. So if you're running into this issue, check to make sure you don't have tokens you don't mean to ban in your logit bias.
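A hypothetical helper illustrating the fix above: strip negative biases on token IDs the model needs before sending a request. Token ID 2 is taken from the post as a token Gemma relies on; real special-token IDs depend on the tokenizer, so treat the set below as a placeholder.

```python
# Placeholder set of token IDs the model needs to emit; check your
# model's tokenizer for the actual special-token IDs.
PROTECTED_TOKEN_IDS = {2}

def sanitize_logit_bias(logit_bias):
    """Drop bias entries that would suppress protected token IDs,
    while keeping any positive (boosting) biases on them."""
    return {
        tid: bias
        for tid, bias in logit_bias.items()
        if not (tid in PROTECTED_TOKEN_IDS and bias < 0)
    }

# An "anti-slop" ban list that accidentally bans token 2:
bias = {2: -100.0, 1024: -100.0}
clean = sanitize_logit_bias(bias)  # -> {1024: -100.0}
```

Running a check like this over the outgoing `logit_bias` field would have caught the bad entry before it ever reached the model.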
leorgain@reddit (OP)
UPDATE: If I turn thinking off, it magically works again. It seems like thinking is bugged.
Electronic-Metal2391@reddit
I'm having this issue with Koboldcpp. In LM Studio, it works just fine out of the box.
charnet3d@reddit
LM Studio uses an older, more stable release of llama.cpp; this is a recent bug: https://github.com/ggml-org/llama.cpp/issues/21726
I just tested it, and it's patched now in release b8757, so just update to a more recent version.
Flashy-Split-8602@reddit
Hehe, same here! I thought it was a mode issue?
charnet3d@reddit
Hey I reported a bug about this today!
https://github.com/ggml-org/llama.cpp/issues/21726
It seems they introduced a regression in b8702; a new release should come out in the next few hours, as the fix has just been merged.
In the meantime, if you need to disable KV cache offloading (the -nkvo argument), just use version b8701. Otherwise, with KV cache offloading left on (e.g. running a low-context model fully on the GPU), the newer versions work fine.
This actually teaches us not to follow bleeding-edge builds blindly, to keep an eye on issues like this, and to retest older versions.
LeonidasTMT@reddit
I faced that too and just assumed it was because it ran out of context length.
FoxiPanda@reddit
Did you grab the new chat templates that got updated overnight last night?
See: https://old.reddit.com/r/LocalLLaMA/comments/1shs6sx/more_gemma4_fixes_in_the_past_24_hours/
leorgain@reddit (OP)
Sadly that didn't work either. I was hopeful for a bit, but then it started again after a few messages, like normal.
FoxiPanda@reddit
Can you tell us about your whole setup, then? The llama.cpp version string plus how you built (or installed) it, the full model launch parameters (redact any personal information, though there usually isn't any), which harness you're using, how you're interacting with the LLM, etc.
I actually don't have these problems with my Gemma-4 usage, but maybe you're doing something differently from me somehow.