Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)

Posted by JohnTheNerd3@reddit | LocalLLaMA | 140 comments

Hi everyone!

I've been trying to run the new Qwen models as efficiently as possible on my setup - and I seem to be getting higher performance than I've seen reported elsewhere, so I wanted to share my scripts and metrics!

The video above shows ideal conditions - due to the nature of MTP (multi-token prediction), decoding does slow down once your response requires more intelligence and creativity, since fewer drafted tokens get accepted. Even in the worst case, though, I rarely see my decode speeds drop below 60t/s. And for multi-user throughput, I have seen as high as 585t/s across 8 requests.
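To see why acceptance rate matters so much, here is a back-of-envelope sketch (the acceptance rate below is an assumed illustrative number, not a measurement from my setup): if MTP drafts k tokens per step and each draft token is accepted independently with probability a, a verification step emits (1 - a^(k+1)) / (1 - a) tokens on average.

```shell
# Back-of-envelope sketch with assumed numbers (not measured):
# expected tokens emitted per verification step = (1 - a^(k+1)) / (1 - a)
k=5     # draft tokens per step (matches num_speculative_tokens below)
a=0.8   # assumed per-token acceptance rate
awk -v a="$a" -v k="$k" \
  'BEGIN { printf "expected tokens per step: %.2f\n", (1 - a^(k+1)) / (1 - a) }'
# prints "expected tokens per step: 3.69"
```

As the acceptance rate drops (harder, more creative text), the expected tokens per step approaches 1 and you fall back toward plain autoregressive decode speed.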

To achieve this, I had to patch vLLM.

The tool call parser for Qwen3 Coder (which vLLM also uses for Qwen3.5) seems to have a bug where tool calling is inaccurate when MTP is enabled, so I cherry-picked this pull request into the current main branch (along with another pull request that fixes reasoning content being lost when using LiteLLM). My fork with the cherry-picked fixes is available on my GitHub if you'd like to use it, but please keep in mind that I am unlikely to maintain it.
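If you'd rather cherry-pick the fixes yourself instead of using my fork, the workflow looks roughly like this. The branch name and `<PR_NUMBER>` placeholder are illustrative - substitute the actual pull requests linked above:

```shell
# Illustrative sketch: cherry-pick an open pull request onto vLLM main.
# <PR_NUMBER> is a placeholder for the actual fix PR - repeat for each one.
git clone https://github.com/vllm-project/vllm.git
cd vllm

# GitHub exposes every pull request at the refspec pull/<PR_NUMBER>/head
git fetch origin "pull/<PR_NUMBER>/head:pr-fix"

# replay the PR's commits on top of the current main branch
git checkout main
git cherry-pick main..pr-fix
```

Then rebuild with the build script below so the editable install picks up the patched sources.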

Prefill speeds appear to be really good too, at ~1500t/s.

My current build script is:

#!/bin/bash

# activate the vLLM virtualenv
. /mnt/no-backup/vllm-venv/bin/activate

# build against CUDA 12.4
export CUDACXX=/usr/local/cuda-12.4/bin/nvcc
# limit parallel compile jobs to keep memory usage down during the build
export MAX_JOBS=1
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

cd vllm

# editable install, so the cherry-picked changes are built in place
pip3 install -e .

And my current launch script is:

#!/bin/bash

# activate the vLLM virtualenv
. /mnt/no-backup/vllm-venv/bin/activate

# use both 3090s
export CUDA_VISIBLE_DEVICES=0,1
# disable Ray's memory monitor so it doesn't kill workers under memory pressure
export RAY_memory_monitor_refresh_ms=0
# work around NCCL cuMem allocator issues seen with some driver versions
export NCCL_CUMEM_ENABLE=0
# reduce GPU activity when no requests are in flight
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_ENABLE_CUDAGRAPH_GC=1
# use FlashInfer's sampling kernels
export VLLM_USE_FLASHINFER_SAMPLER=1

# key flags: FlashInfer attention backend, MTP speculative decoding with
# 5 draft tokens, tensor parallelism across both GPUs, and -O3 compilation
vllm serve /mnt/no-backup/models/Qwen3.5-27B-AWQ-BF16-INT4 --served-model-name=qwen3.5-27b \
--quantization compressed-tensors \
--max-model-len=170000 \
--max-num-seqs=8 \
--block-size 32 \
--max-num-batched-tokens=2048 \
--swap-space=0 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend FLASHINFER \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \
--tensor-parallel-size=2 \
-O3 \
--gpu-memory-utilization=0.9 \
--no-use-tqdm-on-load \
--host=0.0.0.0 --port=5000

deactivate
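Once the server is up, it speaks the OpenAI-compatible API, so a quick smoke test looks like the following (the model name matches --served-model-name above; adjust the host and port if you changed them):

```shell
# Smoke test against the OpenAI-compatible endpoint started above.
# Assumes the server is reachable at localhost:5000 (see --host/--port).
curl -s http://localhost:5000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen3.5-27b",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 32
      }'
```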

Hope this helps someone!