How can you stop your model from looping

Posted by chocofoxy@reddit | LocalLLaMA | View on Reddit | 31 comments

So i thought this is a small model issue but when i added a new gpu and i am able to run low mid model like Qwen 3.6 35b q4 or q5 this issue still exists now its not as much as small model but it does break when linking the model to copilot chat or Hermes the model mid task will start loop thinking or looping generating more than 40k token or generating a wrong tool call

[-]

tonyboi76@reddit

two angles. the sampler stuff others said is right, for qwen3 use their recommended params (temp ~0.6, top_p 0.95, top_k 20), bump repeat/presence penalty a bit, and a DRY sampler if your runtime has it kills repetition loops well.

but the part id actually check first: it loops when linked to copilot chat or hermes, but the model runs fine otherwise? that smells like the chat template, not the model. if the integration sends the wrong template or stop tokens, qwen3 especially will loop or never stop. run the same model standalone in llama.cpp or ollama with the proper qwen3 template and see, if it behaves there then its the integration mangling the prompt, not your sampler or the quant.

[-]

Queasy-Contract9753@reddit

What DRY settings do you use? I've been trying to get Qwen 3.5 0.8b and 2b to work on llama cpp. With thinking of and a jinja template it kind of kills it half the time. But still only half.

[-]

tonyboi76@reddit

i run dry-multiplier 0.8, dry-base 1.75, dry-allowed-length 2, thats the usual sane starting point in llama.cpp. nudge the multiplier toward 1.0 if it still loops.

but honestly at 0.6b and 1.7b the model is most of your problem, not the sampler, those are small enough that DRY is mostly just masking that they fall into loops because theyre weak. the half-the-time thing also smells like the template, qwen3 is picky about it, if the jinja template or the think on/off handling is slightly off it loops or never emits a stop. id double check youre using the exact template for no-think mode. realistically though if you can fit a 4b youll spend way less time fighting this.

[-]

Specter_Origin@reddit

I never was able to stop looping with Qwen, never had that issue with Gemma though...

[-]

cu-pa@reddit

i think qwen3.6 on q4 quant is buggy, they overthink even i just say hi

[-]

Specter_Origin@reddit

Well I also tried q6/q8 or even not self hosted (as in via openrouter) in all cases I always got looping and overthinking issues with QWEN 3.5/3.6 .

[-]

DeProgrammer99@reddit

I'm messing around with a llama.cpp branch that allows custom samplers as extensions (outside DLLs), and my example extension is specifically a loop-breaker. I don't run into loops that often with Qwen3.6-27B, though, so there might be something wrong with the quant you're using or the llama.cpp build you're using or whatever that should be addressed before resorting to this kind of approach.

[-]

Queasy-Contract9753@reddit

Which samplers help you with loops?

[-]

DeProgrammer99@reddit

"If looping then inject 'I should stop looping''" ones. I check if the last 20 or 30 tokens contain 3 or fewer distinct ones and if the last 20 lines had 5 with no more than 1 token of difference and things like that. The point is to have no effect on normal logits and a big effect on breaking loops; it just doesn't reduce their probability of starting in the first place, unlike presence penalty and DRY.

[-]

Sisaroth@reddit

I didn't have any looping with Q5. Then i only lowered temp and presence_penalty and it got in a loop doing the same task (I reset all changes in git, same prompt). Could be random bad luck maybe.

People say to use low presence_penalty for coding but then it gets stuck so idk.

[-]

stormy1one@reddit

I previously had issues but latest vLLM and froggeric’s chat template fix has been working well running 27B FP8 quant. https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

[-]

Jammystocker@reddit

This seems to work for me too, at least so far.

v11 were still looping after a lot of tool calls, but v19 manage to make tool calls up to my hermes limit (default 90 calls, first time I ever reached that) and it never looped.

[-]

Blizado@reddit

Did you tried their GGUFs? https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

If there are some problems with GGUFs Unsloth normally update their quants to fix issues.

They also have a guide about this model which parameters should be used etc. https://unsloth.ai/docs/models/qwen3.6

[-]

mukz_mckz@reddit

Try playing around with repeat and presence penalties. That solved the issue for most quants. Sometimes, I just had to bump it up by 1 quant level, nothing else could solve it.

[-]

Blizado@reddit

Trying quants from a different source can help too sometimes. It can happen that some GGUFs from one source are broken and if I remember right there was a issue with Qwen GGUFs and why Unsloth updated their GGUFs.

[-]

Long_comment_san@reddit

Why people run q4 over q8 for 35b? RAM issues? You don't need to fit entire MOE model into VRAM!

[-]

Blizado@reddit

Did you forget RAM prices? RAM is a general issue. You are right that no everything need to be in VRAM, but the more is in RAM the slower the LLM gets. If you want to have quick answers you want to have as much as possible in VRAM.

[-]

no_witty_username@reddit

Probably a settings issue. If not set up properly this is common behavior for many models. Id suggest you have your coding agent take a look at the hyperparameters, it will usually find the culprit

[-]

kevin_1994@reddit

Use q6 or q8. q4 for 3b active params is too much compression imo

[-]

hidden2u@reddit

This has been my experience

[-]

IntrepidDig1581@reddit

so the looping issue with tool calls is usually a context window problem, the model loses track of where it is in the task and starts re-evaluating from scratch, especially past like 8k tokens in, and qwen 35b still does this without proper stop conditions

[-]

Sudden-Echo-8976@reddit

I use little-coder that enforces a thinking cap and kills thinking if it thinks for too long. When removing the thinking cap I found that the model is able to identify when it's looping and jumps to action on its own. It just takes a lot of tokens before it does so though. I use Qwopus3.6

[-]

Own_Mix_3755@reddit

It depends what you use to serve Qwen. Each tool (vLLM, Ollama, llama.cpp) has different fixes applied to it as same as different fixes awaiting in pull requests. One part is definetelly froggeric fixed chat template (which helps alot) and for me personally using vLLM I had to apply this: https://github.com/vllm-project/vllm/pull/40861 to finally get it working, I suspect there are much more edge cases where it might still fail and there is also alot more other patches (some are slowly getting merged).

For the Hermes I have also lowered the number of thinking tokens it can use per turn and you can play with presence penalty parameter for the model itself. I’ve seen people using 1.5 (which is quite aggresive towards not repeating almost any text at all) and me personally I have been running 1.2.

[-]

Sofakingwetoddead@reddit

I just punch it.

[-]

ikkiho@reddit

if it loops only when bridged through copilot/hermes but runs clean standalone, that's the chat template (as others said). one more thing worth checking on qwen3: is the wrapper leaving it in extended thinking mode? if /think is on with no hard cap, the trace can spiral, and 40k tokens of slop is exactly what that looks like. /no_think in the system prompt usually kills the runaway, separate from any sampler tuning.

[-]