Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP?

Posted by fragment_me@reddit | LocalLLaMA | View on Reddit | 22 comments

I'm a daily llama-cpp user, and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM. Enabling MTP gives me this error "(APIServer pid=1) NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface."

Does this work for anyone?

MTP with this model would be really nice as it's powerful, but could be faster in terms of generation.

Removing the speculative (MTP) config from the below works but obviously is not what I want.

sudo docker run --runtime nvidia -d --gpus '"device=1,2"' --ipc=host \
  --name qwen3.6 --restart always -p 8000:8000 \
  -v vllm-hf-cache:/root/.cache/huggingface \
--env "PYTORCH_ALLOC_CONF=expandable_segments:True" \
  vllm/vllm-openai:nightly \
   cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
    --served-model-name Qwen3.6-27B \
  --max-model-len 200000 \
  --kv-cache-dtype auto \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-prefix-caching \
   --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --tool-call-parser qwen3_coder

[-]

Weekly_Comfort240@reddit

I just use tensor-parallel-size 2 for my VLLM. I believe I tried pipeline-parallel-size 2 and also failed in this, but tensor parallelism works fine with MTP on my 2 RTX A6000's and I'm getting a solid 19 tokens/second - very usable for agentic stuff. Here's my vllm docker config (docker-compole.yml , just enter 'docker compose up') - the NCCL stuff is necessary because the latest nvidia drivers bork stuff. For my agent stuff, 2 speculative tokens was a little wasted but 1 seems to be the sweet spot.

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    environment:
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - NCCL_P2P_DISABLE=1
      - NCCL_SHM_DISABLE=0
      - NCCL_IB_DISABLE=1
      - NCCL_CUMEM_ENABLE=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0','1']
              capabilities: [gpu]
    volumes:
      - /opt/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host # Prevents shared memory bottlenecks during tensor parallelism
    command: >
      --model QuantTrio/Qwen3.6-27B-AWQ
      --tensor-parallel-size 2
      --max-model-len 262144
      --gpu-memory-utilization 0.95
      --enable-prefix-caching
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-num-seqs 4
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'
    restart: unless-stopped

[-]

fragment_me@reddit (OP)

Hmm, 19 t/s is a little disappointing. I am pretty sure that is llama-cpp speed for Q8. I am just using 2x 3090. Are your GPUs hooked up at PCIE 4 x16? My server mobo only supports PCIE 3 x16 so I am worried that the tensor parallelism BW will be too much for this system.

[-]

McSendo@reddit

Does your mobile support bifurcation x8/x8? You should get around 80 to 100 t/s single requests, and a lot more concurrent with that quant.

[-]

fragment_me@reddit (OP)

Are you asking me or the guy who runs it in vLLM? My mobo does support it. But I don't see how it would help here. Right now I have each GPU in some PCI risers, unfortunately one of them is on 3.0 x8 while the other is on 3.0 x16. I tested tensor parallelism (in llama cpp), and it was pretty slow. I confirmed via nvidia-smi that the PCIE BW was being saturated. I don't have hopes that even PCIE 3.0 x16 would work well considering how slow it was. Even doubling it wouldn't have made it faster than pipeline parallelism. I suspect PCIE 4 is the minimum for tensor parallelism.

[-]

McSendo@reddit

I was asking you. There's no harm in trying since you have vllm setup already. One of machines have 2x3090 on an ancient x370 gaming carbon pro 3.0 x8/x8 and I get those speeds. There are other data points from another thread that confirm the same.

[-]

McSendo@reddit

or try the other cyanwiki awq quant. I use the official FP8.

[-]

fragment_me@reddit (OP)

Wow so I just tried tensor parallelism in vLLM even with PCIE 3 it works well. Only 500MB/s transfer between the GPUs, and tok/s is 60-70 with MTP. I guess the vLLM tensor parallelism is much more efficient than llama CPP. Thanks!

[-]

fragment_me@reddit (OP)

I was using the AWQ-BF16-INT4 image, I'll try the FP8 today since it's only 30GB. 80 to 100 t/s single request on 27B would be really nice.

[-]

Weekly_Comfort240@reddit

For an X670E ProArt mobo, I believe it's a chipset mux chip that automatically splits the PCI 5.0 x16 into two PCI 5.0 x8 - but since RTX A6000's are PCI 4.0, they're running at PCI 4.0 x8. The only thing I can think of right now to enhance performance would be an NVlink bridge.

[-]

etaoin314@reddit

I was getting 70 tps on one 3090, when I went to q8 spread over 2 cards it went up to 85tps this is on vllm with the optimizations posted here a couple days ago

[-]

fragment_me@reddit (OP)

Interesting, I only get about 25-30 with pipeline parallelism on llama CPP (2x 3090). 1x 3090 is about the same, slightly better. What image were you using? I use q8_0 and above. Also, are we talking about Q3.5/6 27B or the MoE ?

[-]

TheOnlyBen2@reddit

Looks great, what context size ?

Any chance you could share your compose file or configuration ?

[-]

I get around 42t/s on lcpp and a similar bench on the autoround quant showed 25t/s officially, even though vllm was showing 65 occasionally.

Claude believes it was consistently doing 57t/s. This took an entire day of recompiling too.

I cbf for now. 42 is fine.

[-]

fragment_me@reddit (OP)

Thanks for the info