Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop
Posted by Kirys79@reddit | LocalLLaMA | 14 comments
Hi all
I'm running Qwen3.6-27B-UD-Q6_K_XL.gguf using llama-swap and llama-server with these parameters (stolen from some posts on this subreddit):
llama-server \
-m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \
--mmproj /models/Qwen3.6-27B/mmproj-F32.gguf \
--n-gpu-layers 99 \
--no-mmproj-offload \
--port ${PORT} \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
-n 32768 \
-np 1 \
--no-context-shift \
--no-webui \
--timeout 600 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 \
--ctx-size ${default_ctx} \
--batch-size 4096 \
--ubatch-size 2048
"default_ctx": 131072
The funny part is that the thinking sometimes gets stuck in a loop.
I mainly use aider to check my code, not to write anything; usually I'm just asking it to check my code for bugs or security issues.
I asked it to check a bash script today and it got into this loop:
Potential Improvements/Notes:
• dialog dimensions: 22 76 16 might be tight for some terminals, but functional.
• Minor: echo "Invalid choice"; has an unnecessary semicolon.
I will structure the response clearly and briefly. I will reply in English. I will not edit the files. I will act as an expert code analyst. I will answer the question. I will check the code. I will provide
feedback. I will be concise. I will be accurate. I will be helpful. I will be professional. I will be clear. I will be structured. I will be complete. I will be correct. I will be consistent. I will be coherent.
I will be concise. I will be precise. I will be accurate. I will be reliable. I will be trustworthy. I will be credible. I will be authoritative. I will be expert. I will be professional. I will be helpful. I
will be useful. I will be valuable. I will be beneficial. I will be advantageous. I will be profitable. I will be worthwhile. I will be significant. I will be important. I will be relevant. I will be pertinent.
I will be applicable. I will be appropriate. I will be suitable. I will be fitting. I will be proper. I will be correct. I will be right. I will be accurate. I will be precise. I will be exact. I will be
specific. I will be detailed. I will be thorough. I will be comprehensive. I will be complete. I will be exhaustive. I will be extensive. I will be wide-ranging. I will be broad. I will be general. I will be
universal. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will
be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will
be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I
will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be
international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international. I will be global. I will be worldwide. I will be international.
and it goes on unless I hit ctrl+c.
Do you see any mistake in my llama-server settings that might be the cause?
Do any of you have the same issue?
Thanks
K.
L0ren_B@reddit
I had a similar issue with llama.cpp. Switched to vLLM as per https://github.com/noonghunna/club-3090/tree/master and it's amazing! No repeats, no stalling, tool usage just works.
dampflokfreund@reddit
Presence Penalty should be set at 1.5, as per Qwen's recommendation. That should prevent the loops.
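In the OP's command that would just mean changing the existing flag, e.g.:
--presence-penalty 1.5 \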
Kirys79@reddit (OP)
Thanks I'll try this too
Kirys79@reddit (OP)
thank you all for all the hints
StardockEngineer@reddit
All of you need to chill out on all the settings. Holy hell. So many people tuning things that don't need tuning and setting things that are already on by default.
LirGames@reddit
There's an issue with tool calling in Qwen3.6 27B. There have been a bunch of posts. But essentially it appears to be giving the answer within the think tags.
I personally fixed it by removing the preserve_thinking setting (not the whole thinking ability). Never missed a tool call after that for the past two days, and answers have been quite solid even on medium/large repositories at 70-100K context. Requires a bit of code review, but that's true for any model as far as I'm concerned.
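Applied to the OP's command, that presumably means dropping this line (the exact change is a guess; the comment doesn't spell it out):
--chat-template-kwargs '{"preserve_thinking": true}' \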
H_DANILO@reddit
32k context is most likely the issue.
Too little.
DeProgrammer99@reddit
I've had it degenerate into a loop a fair number of times, as with all small or quantized models. Checking my database with a fancy query to look for any word that repeats >25 times in the same cell, it looks like probably about 800 times out of ~25k, and that's with very short context (reasoning limit 512 and output limit 1024). In many cases, it's just <br /> over and over (I told it to format the output as HTML).
I wish llama-server had a mechanism for catching and interjecting in obvious loops without affecting the probability of every token. That's the main reason I still like to use LlamaSharp in some projects rather than web APIs: I've written several similar pieces of code for breaking loops of the same 1-3 tokens or the same entire line consecutively, but by using LlamaSharp, I lose/have to rebuild all sorts of llama-server features like speculative decoding, model-agnostic tool calling, and parsing out the reasoning. (Plus I'd have to shove any sampling constraints into GBNF or JSON schema format, which isn't always possible.)
UniForceMusic@reddit
The bigger the prompt, the less it happens, for me at least.
Small questions it will overthink like crazy. Big ol' system prompt from Opencode and it doesn't think twice.
jessez05@reddit
Yes, I also noticed this pattern.
jessez05@reddit
I wouldn't count on it, but I used to use aider and often encountered loops like this. I've been using Pi Agent for three days now and haven't had any problems like this yet.
RTX 5090
Long_comment_san@reddit
Presence penalty 1.5 I think
jwpbe@reddit
The cache quant and your ubatch size.
You could fit that much context in if you used -ub 1024, or the default 512, without a quantized cache. It's not free.
Get this quant and ik_llama:
https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF
Alternatively, get rid of the cache-quant settings and your context settings, and see what will fit without them.
Try this:
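A minimal sketch of what that might look like, reusing the OP's model path, dropping the KV-cache quantization, and going back to the default ubatch size (the exact flag set here is a guess, not the commenter's actual command):
llama-server \
-m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \
--n-gpu-layers 99 \
--port ${PORT} \
--flash-attn on \
--ctx-size 131072 \
-ub 512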
robert896r1@reddit
I found thinking to just not work for me. Running with this has been good on a 5090:
~/llama.cpp/build/bin/llama-server \
-m ~/models/qwen3.6-27b/Qwen_Qwen3.6-27B-Q6_K_L.gguf \
--alias qwen-27b-q6k-nothink \
--api-key local \
--jinja \
--reasoning off \
--chat-template-kwargs '{"enable_thinking":false}' \
-ngl 999 \
-np 1 \
-c 131072 \
-n 8192 \
-fa on \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.0 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--frequency-penalty 0.0 \
--no-context-shift \
--host 127.0.0.1 \
--port 8081