Qwen3.5 27B running at ~65tps with DFlash speculation on 2x 3090
Posted by Kryesh@reddit | LocalLLaMA | View on Reddit | 16 comments
Opteron67@reddit
Failed: Cuda error /home/_/vllm/csrc/custom_all_reduce.cuh:455 'an illegal memory access was encountered'
AdamDhahabi@reddit
That looks very cool for multi-GPU builds on consumer mainboards, where poor PCIe bandwidth rules out tensor parallel.
marutichintan@reddit
Currently I'm running a 122B on 4x3090; I'm waiting for DFlash.
wullyfooly@reddit
Please update us on the result! Very curious about the performance.
Opteron67@reddit
160 tps at concurrency 1, 620 tps batched, on Qwen3.5 27B fp8 with dual 5090s.
Kryesh@reddit (OP)
Testing out https://huggingface.co/z-lab/Qwen3.5-27B-DFlash to see how it works - pleasantly surprised by the performance after getting ~25tps in llama.cpp.
Command: uv run vllm serve cyankiwi/Qwen3.5-27B-AWQ-4bit --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 8, "draft_tensor_parallel_size": 2}' --attention-backend flash_attn --max_num_seqs 4 --max-num-batched-tokens 12288 -tp 2 --gpu-memory-utilization 0.80 --max-model-len -1 --reasoning-parser qwen3 --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder
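If anyone wants to sanity-check the tps numbers themselves, here's a minimal sketch against vLLM's OpenAI-compatible endpoint (assumes the server from the command above is up on the default port 8000 and that you have the `openai` package installed; the prompt and `max_tokens` are arbitrary illustrative choices):

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput for a single request."""
    return completion_tokens / elapsed_s


def measure(base_url: str = "http://localhost:8000/v1",
            model: str = "cyankiwi/Qwen3.5-27B-AWQ-4bit") -> float:
    """Time one generation against a running vLLM server.

    Needs the `openai` package and a live server, so this is only
    defined here, not called at import time.
    """
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=256,
    )
    elapsed = time.perf_counter() - start
    return tokens_per_second(resp.usage.completion_tokens, elapsed)
```

Note this measures wall-clock for a single request including prefill, so it will read slightly below the steady-state decode rate.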
koljanos@reddit
That’s weird, with nvlink I can run 6 bit quant at 170k context window, at the same tps, you want my settings?
Kryesh@reddit (OP)
There are several reasons I won't get max performance on my current setup. It's a desktop, so I need VRAM for running my UI etc., and vLLM doesn't do asymmetric offloading, so the second card isn't using all of its available memory. The DFlash model is 3.5 GB, which takes up memory that could be used for context, and I don't have an NVLink bridge for faster tensor parallelism.
tomz17@reddit
--reasoning-parser qwen3 and --tool-call-parser qwen3_coder - are these correct?
kms_dev@reddit
How about concurrent requests? What's the max throughput in that case, at maximum GPU utilization?
szansky@reddit
And how's it going? Okay? Smoothly?
ReentryVehicle@reddit
How does it compare to running the official fp8/some 4bit with the built-in MTP normally? Looking at your acceptance rates it looks like anything beyond 3 tokens is a bit pointless, no?
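(For anyone wondering why long drafts stop paying off: under the standard speculative-decoding analysis, with per-token acceptance rate a and draft length k, the expected tokens produced per target-model forward pass is the geometric sum (1 - a^(k+1)) / (1 - a), which saturates fast. Quick sketch - the 0.7 acceptance rate below is an assumed illustrative value, not OP's actual number:)

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens accepted per target forward pass in speculative
    decoding: geometric series (1 - a^(k+1)) / (1 - a), capped at k + 1."""
    a = accept_rate
    if a >= 1.0:
        return draft_len + 1
    return (1 - a ** (draft_len + 1)) / (1 - a)


# With a = 0.7, going from k = 3 to k = 8 barely helps:
for k in (1, 3, 8):
    print(k, round(expected_tokens_per_step(0.7, k), 2))
```

At a = 0.7 you get roughly 2.5 tokens/step at k = 3 versus about 3.2 at k = 8, while every extra drafted token still costs draft-model compute, which is consistent with the "beyond 3 is a bit pointless" read.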
putrasherni@reddit
What in the abracadabra is this vodoo? Love it.
roosterfareye@reddit
Vodoo?! Well if ain't Voodoo, it's Vodoo! Give me that sweet Vodoo!
-dysangel-@reddit
Jesus Chris, Patron Saint of Typos :0
https://arxiv.org/abs/2602.06036
Addyad@reddit
Niceeee