Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops)

Posted by sid351@reddit | LocalLLaMA | 27 comments

I've got to the point where I need some help.

I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it just outputs "/" characters while it's "thinking". It loops, spitting out "/" until the max-token limit is hit, so you see things like "Thinking: Some word ////////////////////////////". In my troubleshooting with Claude AI, the term "zombie loop" keeps getting thrown around.

It doesn't seem to be time-bound, as it doesn't follow any sort of routine (not once over the weekend, four times today). Claude seems to think it's some mishandling of special characters, but I think that's junk: it's not consistent, and I've not found a way to trigger a Zombie loop deliberately.

I tried swapping over to Gemma 4, and the same "thinking" loop happened eventually, but with repeating words instead of the "/" character. That rules out a problem with any one model.
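For what it's worth, the loop is easy to detect programmatically even if the trigger isn't. A sketch (the function name and thresholds are my own invention) that flags the two failure shapes described above, a single repeated character and a repeated word:

```python
import re

def looks_like_zombie(text: str, min_run: int = 30) -> bool:
    """Heuristic check for the two loop shapes described above:
    one character repeated (Qwen-style "////...") or one short word
    repeated (Gemma-style). Thresholds are arbitrary starting points."""
    tail = text[-500:]  # only the end of the output matters
    # Same character repeated min_run or more times in a row
    if re.search(r"(.)\1{%d,}" % (min_run - 1), tail):
        return True
    # Same short word repeated min_run or more times with separators
    if re.search(r"\b(\w{1,12})(?:\W+\1){%d,}\b" % (min_run - 1), tail):
        return True
    return False
```

Something like this could sit in the middleware and abort a streamed response early instead of burning the whole token budget.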

This is the hardware I'm using: two GPUs with 16 GB of VRAM each and 64 GB of system RAM (exact usage numbers further down).

I started off on LM Studio, had the issue there, so I switched to llama-server (llama.cpp) a few weeks ago. I've updated to the latest release of llama.cpp (earlier today) and still see the issue.

I don't think it's related to the full context or cache, as I had a long (for me) OpenCode session this morning without any issues, then having it review a few new tickets (the initial incoming email) from FreshDesk caused the Zombie loop to happen.

Claude has got to the point where it insists this is due to the model being served some magical combination of special characters, but that sets off the "BS" alarm in my head.

Here's my current llama-server argument list:

```
-m C:\LLM\Qwen3.6-35B-A3B-Q4_K_M.gguf
--fit-ctx 131072
--mlock
-ub 2048
-np 1
--top-k 20
--mmproj C:\LLM\mmproj\Qwen3.6-35B-A3B-GGUF\mmproj-F16.gguf
-ctv q4_0
-ctk q4_0
-a internal-alias
--metrics
--tensor-split 1,1
--no-mmap
--log-timestamps
--log-prefix
--jinja
--threads 10
--fit on
--fit-target 256
-fa on
--cache-ram 2048
-b 2048
--temp 1.0
--top-p 0.95
--min-p 0.0
--presence-penalty 1.5
--repeat-penalty 1.0
--reasoning-budget 2048
--host 0.0.0.0
--port 1234
--api-key [REDACTED...obviously...]
```
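One thing worth A/B testing before blaming special characters: heavy penalties (that --presence-penalty 1.5 is on the aggressive end) are a known way to push a sampler toward degenerate output once the "normal" tokens have all been penalized. llama-server's OpenAI-compatible endpoint accepts per-request sampling overrides, so this can be tested without restarting the server. A sketch, with illustrative values and a placeholder API key:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Chat request with per-request sampling overrides. llama-server
    accepts its own sampling fields (min_p, repeat_penalty) alongside
    the OpenAI-style ones; values here are test points, not recommendations."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "min_p": 0.05,            # non-zero min_p prunes low-probability junk tokens
        "presence_penalty": 0.0,  # override the server-wide 1.5 for this request
        "repeat_penalty": 1.1,    # mild repetition penalty instead
        "max_tokens": 2048,
    }

def ask(prompt: str, base_url: str = "http://127.0.0.1:1234") -> dict:
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder
        },
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

If the loops stop with the penalty zeroed per-request, the fix is a server-side config change rather than anything to do with the inputs.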

VRAM looks fine (tight, but fine) at 13.8/16 GB on GPU 0 and 12/16 GB on GPU 1. I think it's not 1:1 because the mmproj is getting loaded on GPU 0 (maybe?). I want to keep image processing live.

System RAM is golden at 10.1/64 GB used, so I'm open to moving something that way if it helps stability.

When it's working, I'm getting ~90 t/s on average.

For now, I have a "health check" loop running before each prompt is sent (managed by a self-hosted n8n instance on another computer on the LAN); if the check fails, it restarts the llama-server service, and the model is back up quickly enough.
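In case it helps anyone doing the same, here's roughly what that health check amounts to. llama-server exposes a /health endpoint (200 with a small JSON body once the model is loaded, 503 while it's still loading); the restart command is a placeholder for however your instance is actually hosted:

```python
import json
import subprocess
import urllib.error
import urllib.request

def parse_health(status: int, body: str) -> bool:
    """llama-server's /health returns 200 with {"status":"ok"} when ready."""
    if status != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except json.JSONDecodeError:
        return False

def is_healthy(base_url: str = "http://127.0.0.1:1234") -> bool:
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=5) as r:
            return parse_health(r.status, r.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False

def restart_server() -> None:
    # Placeholder: restart however your llama-server is actually hosted
    subprocess.run(
        ["powershell", "-Command", "Restart-Service llama-server"],
        check=False,
    )
```

Note that /health only says the server is up, not that it's producing sane output, so it won't catch a loop mid-generation on its own.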

Has anyone got any ideas for a solid fix for this? I'm not after plasters/band-aids over axe wounds; I want to get this sorted, even if that means dropping to a weaker quant.