Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP?
Posted by fragment_me@reddit | LocalLLaMA | 22 comments
I'm a daily llama-cpp user, and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM. Enabling MTP gives me this error "(APIServer pid=1) NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface."
Does this work for anyone?
MTP with this model would be really nice: it's powerful, but generation could be faster.
Removing the speculative (MTP) config from the command below works, but that's obviously not what I want.
sudo docker run --runtime nvidia -d --gpus '"device=1,2"' --ipc=host \
--name qwen3.6 --restart always -p 8000:8000 \
-v vllm-hf-cache:/root/.cache/huggingface \
--env "PYTORCH_ALLOC_CONF=expandable_segments:True" \
vllm/vllm-openai:nightly \
cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
--served-model-name Qwen3.6-27B \
--max-model-len 200000 \
--kv-cache-dtype auto \
--enable-chunked-prefill \
--gpu-memory-utilization 0.95 \
--language-model-only \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--enable-prefix-caching \
--tensor-parallel-size 1 \
--pipeline-parallel-size 2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--tool-call-parser qwen3_coder
Weekly_Comfort240@reddit
I just use tensor-parallel-size 2 for my vLLM. I believe I tried pipeline-parallel-size 2 and also hit this failure, but tensor parallelism works fine with MTP on my 2x RTX A6000s and I'm getting a solid 19 tokens/second - very usable for agentic stuff. Here's my vLLM docker config (docker-compose.yml, just run 'docker compose up') - the NCCL stuff is necessary because the latest NVIDIA drivers bork things. For my agent stuff, 2 speculative tokens was a bit wasteful; 1 seems to be the sweet spot.
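Not their actual file, but a minimal sketch of what such a compose setup could look like - the NCCL variable, GPU device IDs, quant choice, and memory setting here are assumptions pieced together from this thread, not the commenter's exact config:

# docker-compose.yml - illustrative sketch only
services:
  vllm:
    image: vllm/vllm-openai:nightly
    runtime: nvidia
    ipc: host
    restart: always
    ports:
      - "8000:8000"
    volumes:
      - vllm-hf-cache:/root/.cache/huggingface
    environment:
      # assumed workaround for NCCL breakage on recent NVIDIA drivers
      - NCCL_P2P_DISABLE=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    command:
      - cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4
      - --served-model-name
      - Qwen3.6-27B
      - --tensor-parallel-size
      - "2"
      - --speculative-config
      - '{"method":"mtp","num_speculative_tokens":1}'
      - --gpu-memory-utilization
      - "0.95"
volumes:
  vllm-hf-cache: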
fragment_me@reddit (OP)
Hmm, 19 t/s is a little disappointing. I am pretty sure that is llama-cpp speed for Q8. I am just using 2x 3090. Are your GPUs hooked up at PCIE 4 x16? My server mobo only supports PCIE 3 x16 so I am worried that the tensor parallelism BW will be too much for this system.
McSendo@reddit
Does your mobo support bifurcation x8/x8? You should get around 80 to 100 t/s on single requests, and a lot more concurrent with that quant.
fragment_me@reddit (OP)
Are you asking me or the guy who runs it in vLLM? My mobo does support it, but I don't see how it would help here. Right now I have each GPU on a PCIe riser; unfortunately one of them is on 3.0 x8 while the other is on 3.0 x16. I tested tensor parallelism (in llama.cpp), and it was pretty slow. I confirmed via nvidia-smi that the PCIe BW was being saturated. I don't have high hopes that even PCIe 3.0 x16 would work well considering how slow it was. Even doubling it wouldn't have made it faster than pipeline parallelism. I suspect PCIe 4 is the minimum for tensor parallelism.
McSendo@reddit
I was asking you. There's no harm in trying since you have vLLM set up already. One of my machines has 2x 3090 on an ancient X370 Gaming Carbon Pro at 3.0 x8/x8 and I get those speeds. There are other data points from another thread that confirm the same.
McSendo@reddit
Or try the other cyankiwi AWQ quant. I use the official FP8.
fragment_me@reddit (OP)
Wow, so I just tried tensor parallelism in vLLM and even with PCIe 3 it works well. Only ~500 MB/s transfer between the GPUs, and tok/s is 60-70 with MTP. I guess vLLM's tensor parallelism is much more efficient than llama.cpp's. Thanks!
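For reference, a sketch of the working command - it's essentially the original one from the post above with the parallelism flags swapped, everything else left as in that config:

sudo docker run --runtime nvidia -d --gpus '"device=1,2"' --ipc=host \
--name qwen3.6 --restart always -p 8000:8000 \
-v vllm-hf-cache:/root/.cache/huggingface \
--env "PYTORCH_ALLOC_CONF=expandable_segments:True" \
vllm/vllm-openai:nightly \
cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
--served-model-name Qwen3.6-27B \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder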
fragment_me@reddit (OP)
I was using the AWQ-BF16-INT4 quant; I'll try the FP8 today since it's only 30GB. 80 to 100 t/s single-request on 27B would be really nice.
Weekly_Comfort240@reddit
For an X670E ProArt mobo, I believe it's a chipset mux that automatically splits the PCIe 5.0 x16 into two PCIe 5.0 x8 - but since the RTX A6000s are PCIe 4.0, they're running at PCIe 4.0 x8. The only thing I can think of right now to enhance performance would be an NVLink bridge.
etaoin314@reddit
I was getting 70 tps on one 3090; when I went to Q8 spread over 2 cards it went up to 85 tps. This is on vLLM with the optimizations posted here a couple of days ago.
fragment_me@reddit (OP)
Interesting, I only get about 25-30 with pipeline parallelism on llama.cpp (2x 3090). 1x 3090 is about the same, slightly better. What quant were you using? I use q8_0 and above. Also, are we talking about Q3.5/6 27B or the MoE?
TheOnlyBen2@reddit
Looks great, what context size?
Any chance you could share your compose file or configuration?
datbackup@reddit
which motherboard are you using?
Medium_Chemist_4032@reddit
Would appreciate a link
One-Replacement-37@reddit
vLLM nightly + 3x A40/3090 + DFlash = 210 tok/sec
datbackup@reddit
what motherboard?
One-Replacement-37@reddit
H12DSG-O-CPU
Medium_Chemist_4032@reddit
210 decode? Can you give more info? I'm at 60-ish on TP4 vLLM nightly.
Bootes-sphere@reddit
Have you tried disabling MTP and just running pipeline parallelism solo? Qwen 27B should distribute reasonably well across GPUs without it. Or flip it: enable MTP but use tensor parallelism instead. Less elegant, but usually stable.
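Concretely, the first option is just the OP's original command with the speculative config dropped (a sketch - the OP noted above that this variant does start):

sudo docker run --runtime nvidia -d --gpus '"device=1,2"' --ipc=host \
--name qwen3.6 --restart always -p 8000:8000 \
-v vllm-hf-cache:/root/.cache/huggingface \
vllm/vllm-openai:nightly \
cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 \
--served-model-name Qwen3.6-27B \
--tensor-parallel-size 1 \
--pipeline-parallel-size 2 \
--gpu-memory-utilization 0.95

The second option (MTP with tensor parallelism) is the command sketched earlier in the thread.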
Miserable-Dare5090@reddit
I'm getting 20-ish fitting the model into one 24GB card, no MTP. 40 for gemma4 using e2b as a drafter. Qwen 35 with the MoE nicely offloaded to CPU runs off a 4060 Ti at 40 tps as well.
Ok-Measurement-1575@reddit
It's a long way from being ready yet.
Even with the 50-60% acceptance rate I was seeing, I'm not convinced it's appreciably faster than llama.cpp.
I get around 42 t/s on lcpp, and a similar bench on the AutoRound quant officially showed 25 t/s, even though vLLM was showing 65 occasionally.
Claude believes it was consistently doing 57 t/s. This took an entire day of recompiling too.
I cbf for now. 42 is fine.
fragment_me@reddit (OP)
Thanks for the info