qwen3.6 just stops
Posted by robertpro01@reddit | LocalLLaMA | 48 comments

Sometimes qwen 3.6 just stops in the middle of a task. Is there a way to avoid it?
This is with the qwen-code CLI, but it also happens in opencode.
I'm running vLLM with docker compose:
services:
  vllm-qwen36-27b-dual-dflash-noviz:
    image: vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
    container_name: vllm-qwen36-27b-dual-dflash-noviz
    restart: on-failure
    ports:
      - "${BIND_HOST:-0.0.0.0}:${PORT:-8080}:8000"
    volumes:
      - ${MODEL_DIR:-/home/ai/models/vllm}:/root/.cache/huggingface
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/torch_compile:/root/.cache/vllm/torch_compile_cache
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/triton:/root/.triton/cache
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - NCCL_CUMEM_ENABLE=0
      - NCCL_P2P_DISABLE=1
      - VLLM_NO_USAGE_STATS=1
      - VLLM_USE_FLASHINFER_SAMPLER=1
      - OMP_NUM_THREADS=1
      - PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True,max_split_size_mb:512}
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "2"]
              capabilities: [gpu]
    entrypoint:
      - /bin/bash
      - -c
      - |
        exec vllm serve ${VLLM_ENFORCE_EAGER:+--enforce-eager} "$@"
      - --
    command:
      - --model
      - /root/.cache/huggingface/qwen3.6-27b-autoround-int4
      - --served-model-name
      - qwen
      - --quantization
      - auto_round
      - --dtype
      - bfloat16
      - --tensor-parallel-size
      - "2"
      - --disable-custom-all-reduce
      - --max-model-len
      - "${MAX_MODEL_LEN:-185000}"
      - --gpu-memory-utilization
      - "${GPU_MEMORY_UTILIZATION:-0.95}"
      - --max-num-seqs
      - "${MAX_NUM_SEQS:-2}"
      - --max-num-batched-tokens
      - "8192"
      - --language-model-only
      - --trust-remote-code
      - --reasoning-parser
      - qwen3
      - --default-chat-template-kwargs
      - '{"enable_thinking": true}'
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --speculative-config
      - '{"method":"dflash","model":"/root/.cache/huggingface/qwen3.6-27b-dflash","num_speculative_tokens":5}'
      - --host
      - 0.0.0.0
      - --port
      - "8000"
Based on https://github.com/noonghunna/club-3090
Any ideas how to improve?
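For anyone reproducing: a quick smoke test against this deployment could look like the sketch below (assuming the defaults above, i.e. served model name qwen on host port 8080; the openai package is just one option, any OpenAI-compatible client works).

# Quick smoke test against the vLLM server defined above (a sketch).
# Assumes the server is reachable on localhost:8080 and serves "qwen".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Confirm the model is actually registered before chasing anything else.
print([m.id for m in client.models.list()])

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Say hello and stop."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
print("finish_reason:", resp.choices[0].finish_reason)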
c0lumpio@reddit
As a workaround you can use a harness with a Ralph loop, for example oh-my-opencode. It will force the model to continue until it says "I promise I've finished".
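A minimal sketch of that loop idea (not oh-my-opencode's actual implementation; the sentinel phrase, model name, and endpoint are placeholders):

# Minimal "Ralph loop" sketch: keep nudging the model until it emits
# an explicit completion phrase. NOT oh-my-opencode's real code.
from openai import OpenAI

DONE_PHRASE = "I promise I've finished"  # placeholder sentinel
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

messages = [{"role": "user", "content": "Do the task. End with: " + DONE_PHRASE}]
for _ in range(20):  # hard cap so a broken run can't spin forever
    resp = client.chat.completions.create(model="qwen", messages=messages)
    reply = resp.choices[0].message.content or ""
    messages.append({"role": "assistant", "content": reply})
    if DONE_PHRASE in reply:
        break  # the model says it's done
    # It stopped early: nudge it to pick up where it left off.
    messages.append({"role": "user", "content": "You are not done. Continue."})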
anzzax@reddit
It's a known bug with the qwen tool-call parser. I use a vLLM build with this PR applied; it's much better, but not all issues are fixed yet:
https://github.com/vllm-project/vllm/pull/40861
javasux@reddit
Yup, this is the answer. The same happened to me. The solution for me is to use llama.cpp for now.
robertpro01@reddit (OP)
I wish moving to llama.cpp were the answer. I have another GPU running qwen 3.6 35b on llama.cpp, and it has exactly the same issue.
Healthy-Nebula-3603@reddit
I'm using llamacpp-server and have never seen such behavior.
I'm on opencode with llamacpp-server.
My .md file strictly instructs the model to work fully autonomously, without asking questions or reporting progress to the user.
Also, I'm not using the weaker MoE model, only the qwen 27b dense one.
exact_constraint@reddit
Same. I’m just getting used to saying “continue”.
javasux@reddit
That is quite interesting. Do you compile llama.cpp or use a precompiled binary? I compile it myself, and I never saw any issue with llama.cpp, while I saw the same issue over and over with vLLM. SGLang just refused to run the qwen3.6 quant, btw.
robertpro01@reddit (OP)
I'm a newbie; I compiled stock llama.cpp.
I moved to vLLM after discovering club-3090.
brakx@reddit
Yes I’ve seen the same thing happen in llama.cpp using 3.6 27b
looselyhuman@reddit
Holy chaos in that repo. Kind of questioning vllm rn.
anzzax@reddit
Yeah, that's 100% true, but we have to appreciate the level of complexity there: support for a wide variety of LLM architectures, attention backends, weight and KV quantization, multiple families of supported hardware, tool-call parsers, spec decoding, guided decoding, and so on. It's crazy when you try to comprehend the full scope, and all of it is glued together with flaky Python. I tried a few "alternatives", but they are very shallow and in the end give you worse performance than vLLM, despite being hyped for being written in Rust with all possible optimizations. I think only SGLang is worth checking.
Anbeeld@reddit
It mistakenly outputs EOS instead of continuing. Had to fix that for https://github.com/Anbeeld/beellama.cpp
llitz@reddit
You also seem to have disabled speculative decoding during reasoning, am I reading the code right?
Can you explain the reason behind that?
Anbeeld@reddit
Nope, it's enabled.
llitz@reddit
All right! I will share your repo with some of the folks trying stuff on vllm as well.
robertpro01@reddit (OP)
Thanks! Will try it later today. Quick question: how does it work with two 3090s? I would be looking to use Q8 with it.
Do you have any guide for gemma4 31b?
Anbeeld@reddit
Unfortunately I don't have a multi-GPU setup, so for now it's at the stage of "send me a log and I'll fix the issues". But at least the baseline without DFlash should work.
cleversmoke@reddit
I had the same issue with the MTP PR. I'm testing with --ctx-checkpoints kept at 16 (the default is 32) and --ctx-cache unset. The default of 32 would OOM the service, while ctx-cache 4096 would stop the agent midway like yours.
llitz@reddit
This seems to be more related to MTP and dflash in vLLM; some of us have seen broken responses that put tokens in the wrong place after enabling these.
Clank75@reddit
Hmm? I've had this problem with 27b in Llama.cpp for a while. Without MTP.
llitz@reddit
That is... "great" to hear, and bad too xD
I haven't run llama.cpp in a while, and I don't think I've run it with 3.6 27b. I appreciate you replying to me.
Makers7886@reddit
Have you tried turning on the qwen3.6 preserve-thinking parameter? I've seen some people say it causes problems, but I've had it on since day 0 and q3.6 27b int8 has been super solid. I don't use dflash though, as it caused me issues in agentic harness situations (prefix cache misses, slow responses negating the speedup).
my_name_isnt_clever@reddit
Yeah, I run it with preserve thinking, no dflash, llama.cpp at Q6 and I haven't seen this issue once. Unless Hermes has something built in to force it to keep going, since that's what I've been using primarily.
llitz@reddit
Hermes sends a nudge; it usually says "received empty tool call - nudging to continue". There's another case too, but it nudges in both cases.
Emergency-Map9861@reddit
I've had this issue occur consistently with Qwen3.6-27b on llama.cpp and the qwen-code CLI. I think it's an issue with the model itself. I tried different quants, from Q4 up to Q8, from unsloth and bartowski, and they all behave the same way. Qwen3.6-35B-A3B doesn't have this issue. It's a real shame for such a great model.
llitz@reddit
On llama.cpp? Did you have MTP enabled by any chance?
Kindly-Cantaloupe978@reddit
3.5-27b doesn’t have this problem so I am now back to the older version which actually works quite well
Hot_Turnip_3309@reddit
Same, I use 3.5.
nakedspirax@reddit
27b does this to me too. I have 128GB of VRAM. I think it's the context overfilling. I don't have trouble with qwen 3.5 35 a3b.
leonbollerup@reddit
That!!! I have noticed exactly the same: 3.5 does not seem to have the problem that 3.6 has.
nakedspirax@reddit
Oh damn. I meant 3.6.... haha
ionizing@reddit
I forget the different root causes at this point, because there were a few circumstances that led to models stopping randomly, and most of them came down to how I was handling the tool loops and the conversation pattern sent to llama-server, etc. I worked those out slowly, until qwen 3.6 came along and it also just stops. I have not solved this one; I don't think it's my application this time.
Regardless, I added auto-stop detection which re-prompts the model to continue, and that has been working at least.
robertpro01@reddit (OP)
How did you add the auto-stop detection?
Ell2509@reddit
Before guessing, check finish_reason on the stalled response. If it's length, your client's max_tokens is simply exhausted (with thinking on, a single turn can easily burn 4-8k tokens). If it's stop, something fired a stop sequence. If it's tool_calls, the model thinks it called a tool and the CLI isn't handling it. That tells you which thing to chase.
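Something like this is enough to see it (a sketch, non-streaming; with streaming you'd read finish_reason off the final chunk instead):

# Log finish_reason on the exact request that stalls (a sketch).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "<the prompt that stalls>"}],
    max_tokens=8192,  # generous ceiling, so "length" is meaningful
)
choice = resp.choices[0]
# "length" = max_tokens exhausted; "stop" = stop sequence or EOS;
# "tool_calls" = the model believes it called a tool.
print("finish_reason:", choice.finish_reason)
print("tail of content:", repr(choice.message.content)[-200:])
if choice.message.tool_calls:
    print("tool calls:", choice.message.tool_calls)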
After that, in order of likelihood: enable_thinking: true combined with --tool-call-parser qwen3_coder. This combo is fragile. The reasoning parser can strip a tool call that straddles a think block, and the client ends up with an empty assistant turn, which looks exactly like "stopped mid-task". Qwen's own coder guidance is to disable thinking for tool-calling workflows. Try '{"enable_thinking": false}' first; it's the highest-yield change by far.
The dflash speculative config. This is the most experimental piece in the stack. Draft/target divergence around tool-call delimiters is a known failure mode, and you've got two different quant regimes (dflash draft + AutoRound INT4 target) deciding on structural tokens. Comment out the whole --speculative-config block and retest. You'll lose throughput, but you'll localise the problem.
The marlin.py / MPLinearKernel.py patch mounts from the club-3090 repo. If those touch padding or dequant scales, occasional bad-token emission ending in EOS is plausible. Trivial to rule out: just comment out the volume mounts. Side notes: max-num-batched-tokens 8192 is small for a 185k-context deployment with chunked prefill; it won't cause stalls, but it hurts TTFT. Worth bumping to 16-32k once correctness is sorted. Leave NCCL_P2P_DISABLE and --disable-custom-all-reduce alone while debugging.
So do it in this order: log finish_reason, disable thinking, disable spec decoding, drop the patches. One change at a time.
Prudent-Ad4509@reddit
I'll join the chorus of people who moved from 27b to 35b to avoid a similar issue. Looks like I will have to use the older 122b instead of 27b, then.
Roughy@reddit
I'm not sure if this is the same problem, but if it manifests as the model just cutting off mid-thinking and doing nothing afterwards, then it's probably the same thing.
Experienced on a very vanilla 27b setup on llama.cpp main.
tl;dr the model will sometimes return a random EOS while thinking, even though it hasn't actually finished.
You can run llama with --ignore-eos, but that just results in everything running forever, so instead I patched common_reasoning_budget_apply to only ignore EOS while a <think> tag is open. My test case usually reproduces it within 3 runs, so it /appears/ to be fixed, but never say never.
I have zero knowledge of the llama codebase and I imagine this is probably a terrible solution that may have unintended side effects.
I sorta figure the issue is so prevalent that someone who knows what they're doing will fix it properly. I'll have another look at existing issues once I'm absolutely sure, and open one if there's nothing relevant.
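For anyone who can't patch llama.cpp, the same idea can be approximated client-side. A rough sketch, assuming the server streams the raw <think> tags through to the client:

# Client-side approximation of the same fix (a sketch): if the stream
# ends while a <think> block is still open, treat it as a bogus EOS
# and ask the model to continue. Assumes <think> tags reach the client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
messages = [{"role": "user", "content": "<long task here>"}]

for _ in range(5):  # retry budget
    text = ""
    stream = client.chat.completions.create(
        model="qwen", messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            text += chunk.choices[0].delta.content
    if text.count("<think>") <= text.count("</think>"):
        break  # thinking block closed normally: accept the answer
    # Bogus EOS mid-think: carry the partial text and nudge.
    messages.append({"role": "assistant", "content": text})
    messages.append({"role": "user", "content": "Continue."})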
iamapizza@reddit
It's qwendlejack
SimilarWasabi4696@reddit
Ran into the same wall with Qwen 3.6 stopping mid-response. After a lot of trial and error, I'd put money on the dflash speculative decoding + enable_thinking: true interaction. The model generates N speculative tokens, the thinking template interferes with the draft model's output, vLLM rejects the speculation, and the generation stalls. Disabling dflash and dropping enable_thinking from the chat template kwargs fixed it on my setup: same vLLM version, same 27B model. Lost maybe 5-12% throughput, but it stopped hanging. Also, gpu-memory-utilization 0.95 + max-model-len 185000 on two GPUs is tight for 27B; the KV cache at 185k tokens with tensor parallelism will eat VRAM fast. If you're already at 0.95, there's no headroom for the speculative draft model. Worth checking whether the OOM killer is silently killing the draft worker.
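Back-of-envelope numbers for why that's tight (the layer/head counts below are placeholders, not the real qwen3.6-27b config; read the actual values from the model's config.json):

# Rough KV-cache cost at full context. ALL dims are ASSUMED
# placeholders -- substitute the values from the model's config.json.
num_layers   = 48       # assumed
num_kv_heads = 8        # assumed (GQA)
head_dim     = 128      # assumed
dtype_bytes  = 2        # bf16 KV cache
tokens       = 185_000  # --max-model-len

# 2x for K and V, per token, across all layers.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
total_gib = bytes_per_token * tokens / 2**30
print(f"{bytes_per_token} B/token -> {total_gib:.1f} GiB at {tokens} tokens")
# With these placeholder dims: ~0.19 MiB/token, ~34 GiB total,
# split across the two GPUs at TP=2, before weights and activations.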
leonbollerup@reddit
I think it's a problem with 3.6, as I don't have the same problem with 3.5 (both 27b and 35b, confirmed on my setup).
Would love it if somebody else could confirm.
DeltaSqueezer@reddit
I created a plug-in called 'cattleprod' ;)
robertpro01@reddit (OP)
Do you mind sharing it?
DeltaSqueezer@reddit
Unfortunately, it's integrated into my own custom system, so it's nothing you can just drop into opencode, but you could probably get opencode to whip up suitable code easily.
brownman19@reddit
Lmao why you getting downvoted. Most logical answer. Wild 🤪
Warm-Attempt7773@reddit
Great name
switchbanned@reddit
So it's not just me
robertpro01@reddit (OP)
No haha
FutureIsMine@reddit
It's something with the model. What I've noticed is that this issue happens more and more at higher token counts. As many others in this thread have said, you really do need to keep prodding the model with a "keep going" kind of nudge.
YourNightmar31@reddit
This happens to me a lot.