Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops)
Posted by sid351@reddit | LocalLLaMA | 27 comments
I've got to the point where I need some help.
I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking". It just loops through spitting out / until the max tokens is hit so you see things like "Thinking: Some word ////////////////////////////". In my troubleshooting with Claude AI the term "zombie loop" is getting thrown around.
It doesn't seem time bound, as it doesn't happen on any sort of routine (not once over the weekend, 4 times today). Claude seems to think it's some mishandling of special characters, but I think that's junk, as it's not consistent and I've not found a way to trigger a Zombie loop deliberately.
I tried swapping over to Gemma 4, and the same "thinking" loop eventually happened, but with repeating words instead of the "/" character. That rules out a problem with any single model.
This is the hardware I'm using:
- GPU = 2x RTX 5060 Ti 16GB (32GB VRAM total)
- RAM = 64GB DDR5
- CPU = Intel Core Ultra 5 225F
- Storage = 1TB Predator SSD GM6
- Motherboard = MSI MEG Z890 ACE
- PSU = 1000W
- OS = Windows 11 Pro
I started off on LM Studio, had the issue there, so switched to Llama server (llama.cpp) a few weeks ago. I've updated to the latest release of llama.cpp (earlier today) and still see the issue.
I don't think it's related to the full context or cache, as I had a long (for me) OpenCode session this morning without any issues, then having it review a few new tickets (the initial incoming email) from FreshDesk caused the Zombie loop to happen.
Claude has got to the point where it insists this is due to the model being served some magical combination of special characters, but that sets off the "BS" alarm in my head.
Here's my current llama server argument list:
-m C:\LLM\Qwen3.6-35B-A3B-Q4_K_M.gguf
--fit-ctx 131072
--mlock
-ub 2048
-np 1
--top-k 20
--mmproj C:\LLM\mmproj\Qwen3.6-35B-A3B-GGUF\mmproj-F16.gguf
-ctv q4_0
-ctk q4_0
-a internal-alias
--metrics
--tensor-split 1,1
--no-mmap
--log-timestamps
--log-prefix
--jinja
--threads 10
--fit on
--fit-target 256
-fa on
--cache-ram 2048
-b 2048
--temp 1.0
--top-p 0.95
--min-p 0.0
--presence-penalty 1.5
--repeat-penalty 1.0
--reasoning-budget 2048
--host 0.0.0.0
--port 1234
--api-key [REDACTED...obviously...]
VRAM looks fine (tight, but fine) at GPU 0 @ 13.8/16 GB and GPU 1 @ 12/16GB in use. I think it's not 1:1 because the mmproj is getting loaded on GPU 0 (maybe?). I want to keep image processing live.
System RAM is golden at 10.1/64GB used, so I'm open to moving something that way if it helps stability.
When it's working, I'm getting ~90 t/s on average.
For now, I have a "health check" loop running before a prompt is sent (I'm using n8n self-hosted on another computer on the LAN to manage that), and if it fails, it restarts the llama server service. Quickly enough, the model is back up and running.
Has anyone got any ideas for a solid fix for this? I'm not after plasters/band-aids over axe wounds, I want to get this sorted. Even if that means having to go for a weaker Q.
ai_guy_nerd@reddit
Zombie loops usually happen when the model hits a token sequence it can't escape, especially in thinking modes where the internal logic gets stuck on a specific pattern. It often feels like a hardware issue but it is almost always a sampling or temperature problem.
Trying a different sampler or slightly bumping the temperature can sometimes break the loop. If you are using llama.cpp, check if the repetition penalty is too low or if a specific prompt template is triggering it.
For those building more complex systems, tools like OpenClaw or custom agent harnesses usually handle this by implementing a timeout or a 'sanity check' on the output to force a reset when the model starts repeating characters.
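A 'sanity check' like the one described here can be a simple heuristic over the tail of the output. This is a minimal sketch of my own (the thresholds are arbitrary starting points, not values from any particular agent harness):

```python
def looks_like_zombie_loop(text: str, max_run: int = 30, window: int = 200) -> bool:
    """Heuristic check for degenerate output: a long run of one character,
    or a short chunk repeated back-to-back at the tail of the output."""
    tail = text[-window:]
    # 1. Long run of a single character (e.g. "////////////...").
    run = 1
    for prev, cur in zip(tail, tail[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_run:
            return True
    # 2. A short chunk repeated five times in a row at the very end.
    for size in range(3, 20):
        chunk = tail[-size:]
        if len(tail) >= size * 5 and tail[-size * 5:] == chunk * 5:
            return True
    return False
```

A harness would call this on each streamed chunk and abort (or restart) the generation as soon as it returns True, rather than waiting for the max-token limit.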
sid351@reddit (OP)
Thanks.
When you say "bumping" the temperature, do you mean up (so 0.7 -up-> 1.0) or down (so 1.0 -down-> 0.7)?
How would I change the sampler? Does that mean adjust the top p and min p values?
I am using llama.cpp, and the repetition penalty is set to 1.0 at the moment (which I think means it's disabled).
I'm building the logic out in n8n, and I've put my check (called "Health Check") in front of the "AI Agent" node where the prompt I really want to be processed is sent. It's really basic, it checks the /health for "ok" and then sends a small test prompt. If the response from that deviated from the specific text the LLM is supposed to reply with, the llama server service gets restarted.
I've added some sanitising to all system prompts and user prompts that get sent in to the AI Agent as well now, so that should help. Do you have any references on what token sequences I should be avoiding in prompts? I'd be happy to sanitise them all out if I can get some sort of list.
H_DANILO@reddit
Loops generally happen when the context overflows. Try increasing the context and setting it to a fixed size, and configure the tool (like OpenCode) to compact before it's full. I generally run llama.cpp with a 250k context and set OpenCode to limit itself to 220-230k; that way, if it overflows, it doesn't go into a loop and has room to compact.
sid351@reddit (OP)
As in the 128k context?
It seems that one "ticket review", which is set to max out at 8k tokens, can cause the Zombie loop.
Conversely an OpenCode session this morning got up to 35k with no issues.
Or do you mean another context?
H_DANILO@reddit
yeah the 128k context
sid351@reddit (OP)
How does stuff move in, and out, of the context?
The health check has restarted the services twice this evening. Glancing at the execution logs, it looks like that's after only reviewing one ticket each time.
There's no way I can believe that's hitting 128k on its own, especially when each LLM call is limited to a max of 8k tokens in the n8n flow.
H_DANILO@reddit
I don't know how your flow is utilizing it, but normally, if you create a loop of question & answers back and forth, that builds the context in and will definitely exceed the limit
sid351@reddit (OP)
I've got two different things connected in to the model directly:
The n8n AI Agent was set up with a "Simple Memory" (stored in n8n cache), with the session ID being either the FreshDesk ticket number or the Rocket Chat "room" id. It was set with a memory window of 20. As the ticket numbers increment, and they're only being processed when they're created for now, this shouldn't have been bringing any history into the context, but it WAS adding it when I've been reloading flows while troubleshooting.
I've updated the flows to allow me to turn the memory on or off, and set the memory window value separately, with the call to the AI Agent flow.
Ticket Triage now does NOT use memory, and RocketChat handler does.
As I move forward and use the LLM for more FreshDesk stuff, then I'll probably turn memory back on, but keep the window low (like 2 or 3) to keep the context trimmed and focused.
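That memory window behaves like a sliding buffer of recent exchanges per session ID. A rough sketch of the idea (class and method names are my own, not n8n's):

```python
from collections import deque

class SlidingMemory:
    """Keep only the last `window` exchanges per session, so old turns
    never leak back into the prompt context."""

    def __init__(self, window: int = 3):
        self.window = window
        self.sessions: dict[str, deque] = {}

    def add(self, session_id: str, user_msg: str, assistant_msg: str) -> None:
        # deque(maxlen=...) silently drops the oldest exchange once full.
        buf = self.sessions.setdefault(session_id, deque(maxlen=self.window))
        buf.append((user_msg, assistant_msg))

    def context(self, session_id: str) -> list[tuple[str, str]]:
        """Exchanges to prepend to the next prompt for this session."""
        return list(self.sessions.get(session_id, []))
```

With a window of 2-3, the context stays bounded no matter how long a ticket or chat room lives.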
I've also added some prompt sanitising to both the system prompt and user prompt being delivered to the AI Agent node, as I now think Claude may be right, and the inclusion of certain characters (or token patterns) essentially confuses the LLM into thinking it needs to make, or handle, a tool call (or similar).
I'll keep a close eye on it and see how I get on with these changes.
lit1337@reddit
Same issue here, hit it during benchmarking on both Gemma 4 and Qwen models. Setting the reasoning budget to 0 kills the zombie loops immediately, but that's a band-aid if you actually want thinking.

The real culprit is probably your -ctv q4_0 -ctk q4_0. A quantized KV cache accumulates drift during long reasoning chains: the thinking phase generates hundreds of tokens feeding back into a degraded cache, compounding errors until the model falls into a repetition attractor. That's why it's not consistent; it depends on how long the reasoning chain runs before the drift hits critical mass.

--presence-penalty 1.5 isn't helping either. During thinking, it penalizes tokens the model has already used, which pushes it toward garbage tokens like "/" once the normal vocabulary gets penalized out.

I'd try: switch the KV cache to f16 (you have 64GB of system RAM, plenty of room), drop the presence penalty to 0.6-0.8, and if it still happens, cut --reasoning-budget to 512 instead of killing it entirely. That should sort it without losing reasoning.
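The presence-penalty effect described above can be sketched as a toy example. This is an illustration of the mechanism (a flat subtraction from every token already seen), not llama.cpp's actual implementation, and the scores are made up:

```python
def apply_presence_penalty(logits: dict[str, float], seen: set[str],
                           penalty: float) -> dict[str, float]:
    """Subtract a flat penalty from every token that has already appeared,
    mirroring how a presence penalty reshapes next-token scores."""
    return {tok: score - (penalty if tok in seen else 0.0)
            for tok, score in logits.items()}

# Toy vocabulary: normal words the model already used while thinking,
# plus one junk token it has not emitted yet.
logits = {"the": 2.0, "answer": 1.8, "is": 1.7, "/": 0.9}
seen = {"the", "answer", "is"}

mild = apply_presence_penalty(logits, seen, penalty=0.5)
harsh = apply_presence_penalty(logits, seen, penalty=1.5)

best_mild = max(mild, key=mild.get)    # "the": 2.0 - 0.5 = 1.5 still beats 0.9
best_harsh = max(harsh, key=harsh.get) # "/": 0.9 now beats every penalized word
```

At 1.5 the penalty is large enough that the unseen junk token outranks the entire everyday vocabulary, which matches the "/" spam in the original post.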
sid351@reddit (OP)
Unfortunately moving to -ctk f16 and -ctv f16 (the defaults btw), and reducing the --presence-penalty to 0.6 didn't prevent the Zombie loop.
I'll try cutting the reasoning-budget now.
sid351@reddit (OP)
Reducing the --reasoning-budget made things worse, in a more obscure way. I started getting injections of random tokens and words repeated in English and then Chinese in what looked like half-formed responses that fit the constraints from the system and user prompt.
I'm looking at heavily sanitising my user and system prompts before feeding them to the "AI Agent" node in n8n now, and will continue monitoring it.
sid351@reddit (OP)
Thanks for the advice, I'll give it a go and see what happens.
WetSound@reddit
What's in the context when it happens? Sounds like normal context rot
sid351@reddit (OP)
How do I check/verify the context properly instead of guessing?
With the examples that happened today, it doesn't seem like enough has been passed to the model for it to hit 128k context, but that's only based on what I think, and not on what I know.
WetSound@reddit
It's not about the max context length. That's not the only way a model can go haywire. Weird, repetitive or junk content in the context can cause problems, there is no guarantee for correctness.
Context management is paramount.
Just store all communication and have a look after an incident.
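One simple way to "store all communication" for post-incident review is an append-only JSONL transcript. A minimal sketch (file name and field names are illustrative):

```python
import json
import time
from pathlib import Path

LOG = Path("llm_transcript.jsonl")

def log_exchange(session_id: str, prompt: str, response: str,
                 path: Path = LOG) -> None:
    """Append one request/response pair as a JSON line for later inspection."""
    record = {
        "ts": time.time(),
        "session": session_id,
        "prompt": prompt,
        "response": response,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_session(session_id: str, path: Path = LOG) -> list[dict]:
    """Pull back every exchange for one session after an incident."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [r for r in map(json.loads, f) if r["session"] == session_id]
```

After a zombie loop fires, `load_session("<ticket id>")` shows exactly what was in the context when the model went off the rails.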
TokenRingAI@reddit
Why do you have your KV cache quantized so heavily?
sid351@reddit (OP)
It was one of the many suggestions in my Claude AI troubleshooting.
This is the -ctv and -ctk settings, yeah?
Due-Function-4877@reddit
I run q4 as well and it degrades the output. That's true. With that said, I'm not seeing loops like this in Cline.
It's a MoE model and you have 24GB, so you can dump the experts on the CPU and run the model with a bigger context window if you want. You should be able to get quite a bit more, though you'll have to wait a long time for prompt processing. If you're not interested in a huge context, look up a calculator and try out q8.
a-babaka@reddit
It happened with Qwen3.5 122B UD_Q6_XL from Unsloth. I changed it to a Q5 from AesSedai and everything became fine. Just try another quant.
sid351@reddit (OP)
That's the next thing to try on my list.
Any advice on ...providers to consider? It seems there are hundreds on Hugging Face.
a-babaka@reddit
Subjectively: Ubergarm, AesSedai
chimph@reddit
Running the same model at Q6 in OpenCode and have no issues.. though I did when I first set it up. Since then I have this in my agents.md file.. maybe try it out yourself:
sid351@reddit (OP)
What issues were you seeing before you developed this prompt?
My session with OpenCode this morning got up to 35k+ and had no issues whatsoever.
I have it review one ticket (think a basic HTML email) from FreshDesk (via an n8n flow) and bang: Zombie. There were no tool calls involved, just straight text analysis (the tools aren't "lit up" in the n8n execution history, so I know there were no tool calls).
milkipedia@reddit
This is solid. I've got a lot of similar stuff but I'm going to borrow a couple of ideas I see here that I don't have.
chimph@reddit
Yeah I think the stop conditions are probably doing a lot for stopping any looping as I see the model often internalises an error and finds a different way to proceed.
phidauex@reddit
Dumb question, are you using CUDA toolkit 13.1 or 13.2? There is a known issue with these models and 13.2.
sid351@reddit (OP)
13.1.
Not a dumb question, it's important context I forgot to include.