qwen 3.6 27B looping problem
Posted by jacek2023@reddit | LocalLLaMA | 12 comments
Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched my pipeline from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests, but somewhere after exceeding 100k context qwen keeps getting into loops. Do you have any solution for this?



I tried to break it and tell it to start over, try again, etc., but it keeps looping
my current command is:
CUDA_VISIBLE_DEVICES=0,1,2 llama-server -c 200000 -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-UD-Q8_K_XL.gguf --host 0.0.0.0 --jinja -fa on --keep 4096 -b 8192 --spec-type ngram-mod --parallel 1 --ctx-checkpoints 24 --checkpoint-every-n-tokens 8192 --cache-ram 65536
computehungry@reddit
I went back to 3.5. I also accept that 65k, give or take, is the effective max context, and manage my use around that limitation.
jacek2023@reddit (OP)
on the model page I see: Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
computehungry@reddit
Yeah, I'm saying the model gets too dumb at 65k, so I just treat that as the max and make the workload smaller for each run. I run at Q4 though; it might be better at Q8.
jacek2023@reddit (OP)
I am observing various issues after 100k
Pablo_the_brave@reddit
You haven't set any sampler settings and you're using the default jinja template from the model. Those are two red flags. Focus on those.
jacek2023@reddit (OP)
This is with the settings from the upvoted comment
MrShrek69@reddit
I always find 56-65k is basically the max before most of them start failing tool calls or getting lost in the context. That's why context management is important for programming. It's still a shit ton of space to work with. Just use more sessions and have the agent write to md files so it can pick up where it left off.
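A minimal sketch of that hand-off pattern, assuming a HANDOFF.md file name (purely illustrative): have the agent append a short summary at the end of each session, then feed it back in at the start of the next one:

# agent (or you) appends a summary when the session ends
echo "## Session 3: parser refactor done, tests green; next: update docs" >> HANDOFF.md
# seed the next session by pasting this back into context
cat HANDOFF.md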
WetSound@reddit
How are you reaching such long contexts? I /new for every new task and have no problems, even when that still gets me over 100k.
jacek2023@reddit (OP)
what are your tasks? I have lots of docs and code
fahrenhe1t@reddit
Try:
--repeat-penalty 1.1 or --presence-penalty 0.5
Test with either/or, not both at the same time. I added --repeat-penalty 1.1 to my config and it helped significantly.
LetsGoBrandon4256@reddit
Double check your sampler settings
https://huggingface.co/Qwen/Qwen3.6-27B
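Putting the two suggestions above together, a sketch of what the OP's command could look like with explicit sampler settings added. The temp/top-p/top-k/min-p values below are placeholders, not necessarily what the Qwen3.6 model card recommends, so verify them on the page above, and use --repeat-penalty or --presence-penalty, not both:

# sampler values below are illustrative; check the model card for the recommended ones
CUDA_VISIBLE_DEVICES=0,1,2 llama-server -c 200000 \
  -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  --host 0.0.0.0 --jinja -fa on --keep 4096 -b 8192 \
  --spec-type ngram-mod --parallel 1 --ctx-checkpoints 24 \
  --checkpoint-every-n-tokens 8192 --cache-ram 65536 \
  --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0 \
  --repeat-penalty 1.1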
mister2d@reddit
Have you tried with preserve thinking on?
chat-template-kwargs = {"preserve_thinking": true}
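If you launch through the llama-server command line rather than a config file, the same idea would look roughly like this (assuming your build supports the --chat-template-kwargs flag and that the Qwen3.6 chat template actually reads a preserve_thinking key; treat both as things to verify):

# pass the template kwarg as JSON; flag availability depends on your llama.cpp build
llama-server -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-UD-Q8_K_XL.gguf --jinja \
  --chat-template-kwargs '{"preserve_thinking": true}'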