Model stuck in some thinking zone where it keeps saying a similar thing again and again
Posted by BitGreen1270@reddit | LocalLLaMA | 13 comments
I experienced this with the Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. In thinking mode, it starts repeating things that all sound similar:
I must do ....
I have to do ...
I need to do ...
Is this a known issue with lower quantization? I usually run it with --fit on -c 16384 --fit-target 2000
into_devoid@reddit
Are you using Google’s recommended sampling settings? Is your context filling?
BitGreen1270@reddit (OP)
Seems like the default params (from unsloth) are:
I'll try these out from now on.
Caffdy@reddit
how did it go?
Makers7886@reddit
Qwen3.5 and 3.6 are also sensitive to parameters, so much so that I bench all 4 "modes" they provide against my project endpoint to determine which is best for the job. Whenever I've hit looping, bad tool calls, overthinking, etc., it has always come down to bad settings for me.
Mind you I am using minimum int8 and unquantized cache.
Caffdy@reddit
I only got looping while using Gemma4 in SillyTavern. Do you think it's a problem with the settings?
redmctrashface@reddit
This is not the first time I read "bad tool calls". What are these exactly? (Noob here)
Makers7886@reddit
LLMs can run "commands" (tool calls), but any typo, mistake, or hallucination can cause the command to fail. The LLM is still just generating text, and just like typing a command into a terminal, it needs to be exactly right.
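To make that concrete, here is a toy illustration (the tool name and schema are hypothetical, not tied to any particular framework): a tool call is just structured text the model emits, and a single missing character makes it unparseable.

```python
import json

def parse_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call; raises if the text is malformed."""
    call = json.loads(raw)  # fails on any JSON typo the model makes
    if "name" not in call or "arguments" not in call:
        raise ValueError("missing required tool-call fields")
    return call

good = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Berlin"}'  # missing brace

print(parse_tool_call(good)["name"])  # get_weather
try:
    parse_tool_call(bad)
except json.JSONDecodeError:
    print("malformed tool call rejected")
```

A "bad tool call" in the thread's sense is any output like `bad` above: the model was supposed to produce machine-readable text and got it slightly wrong.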
redmctrashface@reddit
Oh OK, now I can picture exactly the kind of output it can produce. Thanks for the clarification!
BitGreen1270@reddit (OP)
Thanks for sharing. I don't have the machine for Q8, but so far I've managed to avoid quantizing the cache. Once I get into longer workflows I might have no choice but to tweak that one as well.
BitGreen1270@reddit (OP)
Sorry, what do you mean by sampling settings? Do you mean temp? No, I haven't; I'll look up their recommendations.
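For context, "sampling settings" usually means per-request parameters like temperature, top_p, and top_k. A minimal sketch of a request body for llama-server's OpenAI-compatible chat endpoint; the numeric values here are placeholders, not Google's or Unsloth's recommendations, so substitute whatever the model card suggests:

```python
import json

def build_request(prompt: str, temperature: float = 0.7,
                  top_p: float = 0.95, top_k: int = 40) -> str:
    """Build a JSON body for llama-server's /v1/chat/completions endpoint.
    The default values are placeholders - use the vendor's recommended ones."""
    return json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    })

body = build_request("Hello")
print(json.loads(body)["temperature"])  # 0.7
```

Passing these per request overrides whatever defaults the server launched with, so it's an easy way to A/B the recommended settings against your current ones.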
lit1337@reddit
Yeah, from what I've seen this happens with quantized MoE models. Both Gemma 4 and Qwen 3.6 do this at Q3/Q4, and I've hit it on my own quants too. I don't think it's a sampling thing. I think what's going on is the KV cache builds up tiny rounding errors with every token during thinking mode. After enough internal reasoning tokens those errors stack up and the model gets stuck in a loop it can't get out of. Longer thinking = worse. It's not about temperature or top_p; it's the quantization degrading the attention cache over time.
Stuff I've noticed:
- shorter context helps since there's less room for errors to pile up
- not all quants are equal here, some layers are way more sensitive than others
- full precision KV cache (--ctk f16 --ctv f16) reduces it but costs more VRAM
- the actual fix has to come from the quantization side, protecting the right tensors
It's not something you're doing wrong. It's a real limitation of uniform quantization on these MoE architectures. The models weren't built with this in mind and nobody's really solved it yet at the quant level.
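The accumulation idea above can be shown with a toy simulation. This is not the actual KV-cache arithmetic, just the general principle that a value re-rounded to a coarse grid at every step drifts further and further from its full-precision counterpart as the sequence grows:

```python
def quantize(x: float, step: float) -> float:
    """Round a value to a fixed grid, simulating low-precision storage."""
    return round(x / step) * step

full, quant = 0.0, 0.0
step = 0.05  # coarse grid stands in for a low-bit cache format
for i in range(1, 1001):
    delta = 0.013 * (i % 7)  # arbitrary per-token update
    full += delta
    quant = quantize(quant + delta, step)  # re-rounded at every step

print(abs(full - quant))  # drift grows with sequence length
```

Shrinking `step` (i.e., keeping the cache at higher precision) makes the drift much smaller, which matches the observation that a full-precision KV cache reduces the looping.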
BitGreen1270@reddit (OP)
Thanks, that's very helpful to know. How do I break out of this loop? I'm building an app on top of llama-server. Is there a way to detect that the model is stuck in a loop, then cancel the request and retry? Maybe restrict the number of thinking tokens? But that's hard to quantify unless my prompts are super predictable.
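One practical heuristic (a sketch on the app side, not a llama-server feature): as tokens stream in, check whether the tail of the output keeps recurring in the recent text, and if so, cancel the request and retry. A minimal repetition detector, with `window` and `min_repeats` as tunable guesses:

```python
def is_looping(text: str, window: int = 80, min_repeats: int = 3) -> bool:
    """Return True if the last `window` chars of `text` appear at least
    `min_repeats` times in the recent output - a cheap signal that
    generation is stuck in a repetition loop."""
    if len(text) < window * min_repeats:
        return False
    tail = text[-window:]
    recent = text[-window * min_repeats * 2:]
    return recent.count(tail) >= min_repeats

# While streaming from llama-server, run this check every N tokens:
stuck = "I need to do X. " * 20  # simulated stuck thinking output
print(is_looping(stuck))  # True
print(is_looping("a normal, non-repetitive answer"))  # False
```

Run the check periodically on the accumulated stream and abort the HTTP request when it fires; false positives are possible on genuinely repetitive content (e.g., tables), so the thresholds need tuning for your prompts.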