qwen3.6 just stops

Posted by robertpro01@reddit | LocalLLaMA

Sometimes Qwen 3.6 just stops in the middle of a task. Is there a way to avoid this?

This is with the qwen-code CLI, but it also happens in opencode.
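
In case it helps narrow things down, here is a minimal check against the endpoint (localhost:8080 and the served model name qwen are the defaults from the config below) to see whether the server reports a token-limit cutoff; finish_reason comes back "length" on a cutoff and "stop" when the model ends on its own:

curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 512}' \
  | jq -r '.choices[0].finish_reason'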

Running with vLLM via Docker Compose:

services:
  vllm-qwen36-27b-dual-dflash-noviz:
    image: vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
    container_name: vllm-qwen36-27b-dual-dflash-noviz
    restart: on-failure
    ports:
      - "${BIND_HOST:-0.0.0.0}:${PORT:-8080}:8000"
    volumes:
      - ${MODEL_DIR:-/home/ai/models/vllm}:/root/.cache/huggingface
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/torch_compile:/root/.cache/vllm/torch_compile_cache
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/triton:/root/.triton/cache
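      # patched Marlin kernels mounted read-only over the installed vLLM files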
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
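      # NCCL tweaks: disable the cuMem allocator and GPU peer-to-peer transfers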
      - NCCL_CUMEM_ENABLE=0
      - NCCL_P2P_DISABLE=1
      - VLLM_NO_USAGE_STATS=1
      - VLLM_USE_FLASHINFER_SAMPLER=1
      - OMP_NUM_THREADS=1
      - PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True,max_split_size_mb:512}
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "2"]
              capabilities: [gpu]
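    # bash -c wrapper: lets compose inject the optional --enforce-eager flag from
    # the host env; the trailing "--" fills $0 so the command: list lands in "$@"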
    entrypoint:
      - /bin/bash
      - -c
      - |
        exec vllm serve ${VLLM_ENFORCE_EAGER:+--enforce-eager} "$@"
      - --
    command:
      - --model
      - /root/.cache/huggingface/qwen3.6-27b-autoround-int4
      - --served-model-name
      - qwen
      - --quantization
      - auto_round
      - --dtype
      - bfloat16
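      # shard across the two reserved GPUs (device_ids 0 and 2 above)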
      - --tensor-parallel-size
      - "2"
      - --disable-custom-all-reduce
      - --max-model-len
      - "${MAX_MODEL_LEN:-185000}"
      - --gpu-memory-utilization
      - "${GPU_MEMORY_UTILIZATION:-0.95}"
      - --max-num-seqs
      - "${MAX_NUM_SEQS:-2}"
      - --max-num-batched-tokens
      - "8192"
      - --language-model-only
      - --trust-remote-code
      - --reasoning-parser
      - qwen3
      - --default-chat-template-kwargs
      - '{"enable_thinking": true}'
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --enable-prefix-caching
      - --enable-chunked-prefill
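      # dflash speculative decoding: the draft model proposes 5 tokens per step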
      - --speculative-config
      - '{"method":"dflash","model":"/root/.cache/huggingface/qwen3.6-27b-dflash","num_speculative_tokens":5}'
      - --host
      - 0.0.0.0
      - --port
      - "8000"

Based on https://github.com/noonghunna/club-3090

Any ideas on how to fix or improve this?