Qwen 3.6 - Loops and repetitions
Posted by Safe-Buffalo-4408@reddit | LocalLLaMA | 10 comments
I seldom experience loops, in either reasoning or responses, using Qwen 3.6 27B Q8 with a 256k context window in Agent Zero.
But the 35B A3B Q8 with 256k context window gets constant loops and is basically unusable within Agent Zero.
What is your experience with these loops and repetitions?
Is there a good way to prevent these kinds of loops and repetitions?
maxpayne07@reddit
Very bizarre, but I only get repetitions at Q8 XXL.
Juulk9087@reddit
System prompt. Lol use one
vevi33@reddit
I've used it for days and never had a single loop with 120k context. Make sure your temp is not too low. The lowest should be 0.65, but if you have looping issues, increase it to 0.75. Avoid presence and repetition penalty if you can; that said, repetition penalty worked better with the MoE model. Something like 1.1 rep penalty applied only to the last 368 tokens (so output quality won't really be affected, mostly thinking).
But with the 27B this was never needed for me.
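To make the windowed repetition penalty concrete, here is a minimal sketch of the mechanism described above (function name and details are illustrative, not any particular library's API): only tokens that appeared in the last N generated tokens get penalized, so earlier output, and therefore overall quality, is mostly unaffected.

```python
def apply_repetition_penalty(logits, generated, penalty=1.1, window=368):
    """Dampen the logits of tokens seen in the last `window` generated tokens.

    Positive logits are divided by the penalty, negative ones multiplied,
    so repeated tokens always become less likely. Illustrative sketch only.
    """
    recent = set(generated[-window:])  # token ids seen inside the window
    out = list(logits)
    for tok in recent:
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive logit
        else:
            out[tok] *= penalty  # push negative logit further down
    return out

# Usage: token id 2 was generated recently, so only its logit is damped.
damped = apply_repetition_penalty([1.0, 2.0, 3.0], generated=[2])
```

Because only the window is penalized, a token repeated far outside the last 368 tokens sampled is left untouched.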
DigRealistic2977@reddit
Of course it loops, that model literally sucks. All the hype and good evaluations of Qwen here are one-shot, synthetic, and scripted.
Literally all Qwen models, from the early ones to the latest 3.6, loop a lot at low temps or high temps, so yeah, there are no workarounds. Even quants or UD won't save you.
I guess all models have a tendency to loop, but damn, Qwen models are the worst, and they're only good at around 16-32k ctx; beyond that it's hopes and prayers.
_manteca@reddit
Same experience. Qwen 3.6 loops a lot.
Mid-conversation, it hallucinates that the user sent the original request again, and starts all over again.
I think it has to do with self-reflection: it wants to "recap" the task by repeating the original message, but ends up looping on itself.
cosmicnag@reddit
Try some of the finetunes on HF, like the ones with Opus reasoning dataset distillation, etc.
milpster@reddit
As Unsloth recommends, I turn up presence_penalty slightly:
https://unsloth.ai/docs/models/qwen3.6
presence_penalty = 0.0 to 2.0. By default this is off, but you can use it to reduce repetitions; a higher value may result in a slight decrease in performance. 0.9 is the value that works for me so far.
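For contrast with repetition penalty, presence penalty is a flat, one-time subtraction applied to any token that has already appeared at least once, regardless of how often. A rough sketch of the idea (illustrative only, not Unsloth's or any inference engine's actual implementation):

```python
def apply_presence_penalty(logits, generated, penalty=0.9):
    """Subtract a flat `penalty` from the logit of every token that has
    appeared at least once in `generated`. Frequency doesn't matter:
    a token seen ten times is penalized exactly as much as one seen once.
    Illustrative sketch of the standard presence-penalty semantics."""
    seen = set(generated)
    return [l - penalty if i in seen else l for i, l in enumerate(logits)]

# Usage: token 0 appeared twice and token 2 once; both get the same -0.9.
penalized = apply_presence_penalty([1.0, 2.0, 3.0], generated=[0, 0, 2])
```

The flat subtraction is why cranking it too high can hurt quality: it suppresses legitimately repeated tokens (code identifiers, names) just as hard as degenerate loops.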
GregoryfromtheHood@reddit
The more presence penalty I add, the more loops I get. At 1-1.5 it loops like crazy. 0.0 is the best so far in my experience on longer-context tasks. With lower context usage it doesn't seem to loop at all for me, even with a high presence penalty.
milpster@reddit
Interesting. I also run 256k context. Here is my full cmd in case that might help you:
LD_LIBRARY_PATH=/opt/rocm-6.1.0/lib:$LD_LIBRARY_PATH HSA_OVERRIDE_GFX_VERSION=9.0.6 HSA_OVERRIDE_WAVEFRONT_SIZE=64 HSA_ENABLE_SDMA=0 HSA_XNACK=1 ROCBLAS_INTERNAL_FP16_ALT_IMPL=1 ROCBLAS_LAYER=0 ROCBLAS_INTERNAL_FP16_ALT_IMPL=1 ROCBLAS_TENSILE_LIBPATH=/opt/rocm/lib/rocblas/library HSA_OVERRIDE_GFX_VERSION=9.0.6 USE_MLOCK=true ~/dev/llama.cpp/build/bin/llama-server -m ~/ai/ai/Qwen3.6-27B.i1-Q4_K_M.gguf --ctx-size 262144 --threads-batch 11 --threads 6 --no-mmap -fa on -ngl 333 -b 2048 -ub 896 -cram -1 --ctx-checkpoints 200 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 0.9 --repeat-penalty 1.0 --device Vulkan1,ROCm0 --chat-template-file /home/srcds/dev/cuda_llama.cpp/chat_template.jinja --chat-template-kwargs '{"preserve_thinking": true}' --port 8009 -np 1 -ctk q4_0 -ctv q4_0 --spec-type ngram-mod --spec-ngram-mod-n-match 16 --spec-draft-n-min 4 --spec-draft-n-max 24 -ts 30,70
uti24@reddit
I've also found that the default temperature of 0.1 in LM Studio makes it loop, and lifting the temperature to 0.5 helps a lot. I think the Qwen3.6 repo says to raise the temperature even more.
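The link between low temperature and looping is easy to see from the math: temperature divides the logits before softmax, so a very low temp like 0.1 concentrates nearly all probability on the single top token, making sampling nearly greedy and prone to repeating itself. A small sketch (this is just standard softmax-with-temperature, nothing model-specific):

```python
import math

def softmax_with_temperature(logits, temp):
    """Standard temperature-scaled softmax. Lower temp sharpens the
    distribution toward the top token, which is one reason very low
    temperatures encourage repetition loops."""
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Usage: with logits [1.0, 2.0], temp 0.1 puts almost all mass on
# the top token, while temp 1.0 leaves a much flatter distribution.
sharp = softmax_with_temperature([1.0, 2.0], 0.1)
flat = softmax_with_temperature([1.0, 2.0], 1.0)
```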