Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP?

Posted by fragment_me@reddit | LocalLLaMA

I'm a daily llama.cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP doesn't seem to work with this model in vLLM. Enabling MTP gives me this error:

(APIServer pid=1) NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface.
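For anyone who wants to dig in: since the command works with PP once the speculative config is removed, the base model presumably does implement `SupportsPP`, and it's the MTP draft-model class that doesn't. A rough way to check inside the image (the dist-packages path and the *mtp* file naming are both guesses; adjust for your image):

# Lists MTP model files that do NOT mention SupportsPP (path and file
# naming are assumptions, not confirmed for this nightly).
sudo docker run --rm --entrypoint bash vllm/vllm-openai:nightly -c \
  "cd /usr/local/lib/python3*/dist-packages/vllm/model_executor/models && grep -L 'SupportsPP' *mtp*.py"

If the file for this architecture shows up in that list, it's a missing feature in vLLM rather than a config problem on my end.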

Does this work for anyone?

MTP with this model would be really nice: it's a powerful model, but generation could be faster.

Removing the speculative (MTP) config from the command below works, but that's obviously not what I want.

sudo docker run --runtime nvidia -d --gpus '"device=1,2"' --ipc=host \
  --name qwen3.6 --restart always -p 8000:8000 \
  -v vllm-hf-cache:/root/.cache/huggingface \
  --env "PYTORCH_ALLOC_CONF=expandable_segments:True" \
  vllm/vllm-openai:nightly \
  cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
  --served-model-name Qwen3.6-27B \
  --max-model-len 200000 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --tool-call-parser qwen3_coder
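
Not a fix, but a possible workaround while that combination is unsupported: since only pipeline parallelism trips the `SupportsPP` check and both GPUs are on one host anyway, tensor parallelism might let me keep MTP. A sketch of the same command with the parallelism flags swapped (untested with this exact quant; assumes the two GPUs are identical, and uses a new container name to avoid clashing with the existing one):

sudo docker run --runtime nvidia -d --gpus '"device=1,2"' --ipc=host \
  --name qwen3.6-tp --restart always -p 8000:8000 \
  -v vllm-hf-cache:/root/.cache/huggingface \
  --env "PYTORCH_ALLOC_CONF=expandable_segments:True" \
  vllm/vllm-openai:nightly \
  cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
  --served-model-name Qwen3.6-27B \
  --max-model-len 200000 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --tool-call-parser qwen3_coder

And a quick sanity check against the OpenAI-compatible endpoint once it's up:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-27B", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'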