Qwen3.6 35b a3b getting stuck in looped reasoning?
Posted by EggDroppedSoup@reddit | LocalLLaMA | 4 comments
Some might think this is obvious, but I was using IQ4_XS for the longest time and recently switched to the Q4_K_XL quant of Qwen because I saw someone post that it was faster for offloading scenarios. Running with offloading on 32 GB RAM and a 5060 with 8 GB VRAM, I was getting around 40 t/s with IQ4_XS and now around 27 t/s with Q4_K_XL. Much larger size and much lower KLD according to Unsloth, but I'm getting looped reasoning that wastes compute time.
Any config tweaks to fix this? I don't think I got this when running the other version, or even IQ4 NL XL.
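Before blaming the quant, it can help to confirm the output really is looping rather than just long. A minimal sketch of a loop check (a hypothetical helper, not part of llama.cpp): a genuinely looping tail has very few distinct n-grams, so flag the stream when the unique-n-gram ratio in the last window collapses.

```python
# Hypothetical loop detector: if the last `window` tokens are one
# repeating pattern, almost all of their n-grams coincide, so the
# unique-n-gram ratio collapses toward 1/period.
def looks_looped(tokens, n=8, window=200, max_unique_ratio=0.1):
    """True if the last `window` tokens are dominated by repeats."""
    tail = tokens[-window:]
    total = len(tail) - n + 1
    if total < n:  # too short to judge
        return False
    grams = {tuple(tail[i:i + n]) for i in range(total)}
    return len(grams) / total < max_unique_ratio

# A degenerate "Wait, let me reconsider." loop trips the check;
# varied text does not.
assert looks_looped(["Wait,", "let", "me", "reconsider."] * 60)
assert not looks_looped([str(i) for i in range(240)])
```

Running something like this over the reasoning stream distinguishes "the quant got slower" from "the quant started looping", which are different problems with different fixes.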
Below is my config, obtained from multiple benchmark runs just testing different things:
param(
[string]$ModelPath = '',
[string]$ModelFileName = 'Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf',
[string]$ServerExePath = '',
[string]$PreferredServerExePath = '.\llama.cpp-b8838-win-cuda-13.1-x64\llama-server.exe',
[string]$ListenHost = '127.0.0.1',
[int]$Port = 11434,
[int]$CtxSize = 128000,
[int]$GpuLayers = 99,
[int]$CpuMoeLayers = 38,
[int]$Threads = 16,
[int]$Parallel = 1,
[int]$BatchSize = 2048,
[int]$UBatchSize = 2048,
[int]$ThreadsBatch = 8,
[bool]$ContBatching = $true,
[bool]$KVUnified = $true,
[int]$CacheRAMMiB = 4096,
[int]$FitTargetMiB = 128,
[string]$ModelAlias = 'qwen3.6-35b-a3b-ud-q4-k-xl',
[double]$Temperature = 0.6,
[double]$TopP = 0.95,
[int]$TopK = 20,
    [double]$MinP = 0.0,
[double]$PresencePenalty = 0,
[ValidateSet('on', 'off', 'auto')]
[string]$Reasoning = 'on',
[string]$ReasoningFormat = 'deepseek-legacy',
[int]$ReasoningBudget = -1,
[ValidateSet('kv', 'native', 'off')]
[string]$TurboQuantMode = 'kv',
[string]$CacheTypeK = 'q8_0',
[string]$CacheTypeV = 'q8_0',
[ValidateSet('none', 'ngram-cache', 'ngram-simple', 'ngram-map-k', 'ngram-map-k4v', 'ngram-mod')]
[string]$SpeculativeType = 'none',
[int]$SpeculativeNgramSizeN = 8,
[int]$SpeculativeNgramSizeM = 48,
[int]$SpeculativeNgramMinHits = 1,
[string]$TurboQuantNativeArgs = '',
[string]$ApiKey = '',
[switch]$DisableFlashAttention,
[switch]$DisableFit = $true,
[switch]$ForceRestart
)
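For reference, here is roughly how those parameters would map onto stock llama-server flags. This is a hedged sketch only, since the wrapper script that consumes the param block isn't shown; the flag names are the standard llama.cpp ones.

```powershell
# Illustrative mapping from the params above to llama-server arguments.
# The real wrapper script isn't shown, so treat this as a sketch.
$serverArgs = @(
    '--model',        (Join-Path $ModelPath $ModelFileName)
    '--alias',        $ModelAlias
    '--host',         $ListenHost
    '--port',         $Port
    '--ctx-size',     $CtxSize
    '--n-gpu-layers', $GpuLayers
    '--n-cpu-moe',    $CpuMoeLayers   # keep this many MoE expert layers on CPU
    '--threads',      $Threads
    '--batch-size',   $BatchSize
    '--ubatch-size',  $UBatchSize
    '--cache-type-k', $CacheTypeK     # q8_0-quantized KV cache
    '--cache-type-v', $CacheTypeV
    '--temp',         $Temperature
    '--top-p',        $TopP
    '--top-k',        $TopK
    '--min-p',        $MinP
)
& $PreferredServerExePath @serverArgs
```

Worth noting: temp 0.6 / top-p 0.95 / top-k 20 matches Qwen's recommended thinking-mode sampling, and the Qwen3 model card suggests raising presence_penalty (0 to 2) specifically to reduce endless repetition, so `$PresencePenalty` is an obvious knob to try before changing quants.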
moimereddit@reddit
IME the quantizer vendor matters too... try the different ones available... Bartowski's have been the most stable for me.
FinBenton@reddit
Just cap the reasoning, I'm using a 4k max.
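The cap can also be enforced client-side if your server build doesn't accept a token-valued reasoning budget. A minimal sketch, assuming a `<think>`...`</think>` chat template and using a plain generator to stand in for the server's token stream (both assumptions, adapt to your setup): stop consuming once the budget is spent, so no further compute is wasted.

```python
# Hedged sketch: stream tokens, count the ones inside the reasoning
# block, and abort the stream once the budget is spent. The token
# stream here is any iterable standing in for a streaming response.
def capped_stream(token_stream, budget=4000):
    """Yield tokens, cutting the stream after `budget` reasoning tokens."""
    in_think, spent = False, 0
    for tok in token_stream:
        if tok == "<think>":
            in_think = True
        elif tok == "</think>":
            in_think = False
        elif in_think:
            spent += 1
            if spent > budget:
                yield "</think>"  # close the block ourselves and bail out
                return
        yield tok

toks = ["<think>"] + ["hm"] * 100 + ["</think>", "answer"]
out = list(capped_stream(toks, budget=5))
assert out == ["<think>"] + ["hm"] * 5 + ["</think>"]
```

Against a real server you would close the HTTP connection at the same point, which is what actually stops the generation.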
Undici77@reddit
I'm experiencing the same issue and more: I've found that in long-context work, new models aren't as good as the benchmarks suggest. I made a post about my experience:
https://www.reddit.com/r/LocalLLaMA/comments/1stbohn/qwen_models_for_coding_using_qwencode_my/
SM8085@reddit
For an automated task, I set my reasoning budget to 10k tokens.
When it hits that budget limit, it seems to yeet it into non-reasoning mode.