Qwen3.6 35b a3b getting stuck in looped reasoning?

Posted by EggDroppedSoup@reddit | LocalLLaMA | View on Reddit | 4 comments

This might be obvious to some, but I'd been using IQ4_XS for the longest time and recently switched to the Q4_K_XL quant of Qwen because I saw someone post that it was faster for offloading scenarios. I'm offloading with 32 GB of RAM and a 5060 GPU with 8 GB of VRAM: I was getting around 40 t/s with IQ4_XS and now get around 27 with Q4_K_XL. It's a much larger file with much lower KLD according to Unsloth, but I'm getting looped reasoning that wastes compute time.

Any config tweaks to fix this? I don't think I saw this when running the other version, or even IQ4_NL_XL.

Below is the config I arrived at from multiple benchmark runs, just testing different things:

param(
    [string]$ModelPath = '',
    [string]$ModelFileName = 'Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf',
    [string]$ServerExePath = '',
    [string]$PreferredServerExePath = '.\llama.cpp-b8838-win-cuda-13.1-x64\llama-server.exe',
    [string]$ListenHost = '127.0.0.1',
    [int]$Port = 11434,
    [int]$CtxSize = 128000,
    [int]$GpuLayers = 99,
    [int]$CpuMoeLayers = 38,
    [int]$Threads = 16,
    [int]$Parallel = 1,
    [int]$BatchSize = 2048,
    [int]$UBatchSize = 2048,
    [int]$ThreadsBatch = 8,
    [bool]$ContBatching = $true,
    [bool]$KVUnified = $true,
    [int]$CacheRAMMiB = 4096,
    [int]$FitTargetMiB = 128,
    [string]$ModelAlias = 'qwen3.6-35b-a3b-ud-q4-k-xl',
    [double]$Temperature = 0.6,
    [double]$TopP = 0.95,
    [int]$TopK = 20,
    [double]$MinP = 0.0,
    [double]$PresencePenalty = 0,
    [ValidateSet('on', 'off', 'auto')]
    [string]$Reasoning = 'on',
    [string]$ReasoningFormat = 'deepseek-legacy',
    [int]$ReasoningBudget = -1,
    [ValidateSet('kv', 'native', 'off')]
    [string]$TurboQuantMode = 'kv',
    [string]$CacheTypeK = 'q8_0',
    [string]$CacheTypeV = 'q8_0',
    [ValidateSet('none', 'ngram-cache', 'ngram-simple', 'ngram-map-k', 'ngram-map-k4v', 'ngram-mod')]
    [string]$SpeculativeType = 'none',
    [int]$SpeculativeNgramSizeN = 8,
    [int]$SpeculativeNgramSizeM = 48,
    [int]$SpeculativeNgramMinHits = 1,
    [string]$TurboQuantNativeArgs = '',
    [string]$ApiKey = '',
    [switch]$DisableFlashAttention,
    [switch]$DisableFit = $true,
    [switch]$ForceRestart
)
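For reference, the sampler parameters above would map onto llama-server flags roughly as in the sketch below. The presence-penalty and DRY-sampler flags at the end are the tweaks most often suggested for breaking reasoning loops; the specific penalty values are illustrative assumptions, not settings I've benchmarked:

```powershell
# Hedged sketch: launching llama-server with the config above, plus
# two anti-loop sampler options. The numeric values for the last two
# flags are assumptions to experiment with, not tuned settings.
& '.\llama.cpp-b8838-win-cuda-13.1-x64\llama-server.exe' `
    --model '.\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf' `
    --host 127.0.0.1 --port 11434 `
    --ctx-size 128000 `
    --n-gpu-layers 99 --n-cpu-moe 38 `
    --threads 16 --threads-batch 8 `
    --batch-size 2048 --ubatch-size 2048 `
    --cache-type-k q8_0 --cache-type-v q8_0 `
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 `
    --presence-penalty 1.0 `
    --dry-multiplier 0.8
```

If the loops persist, another variable worth isolating is the q8_0 KV cache: trying a run with the cache-type flags left at their f16 default would tell you whether cache quantization, rather than the weight quant, is what changed.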