Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working
Posted by do_u_think_im_spooky@reddit | LocalLLaMA | 19 comments
I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards.
Hardware:
- 2x RTX 5060 Ti 16GB
- 32GB total VRAM
- Proxmox LXC
- 16 vCPU
- ~60GB RAM
- CUDA 13 / Torch 2.11 nightly
- vLLM nightly: 0.19.2rc1.dev
- Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
vLLM launch shape:
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
--served-model-name qwen36-nvfp4-mtp \
--tensor-parallel-size 2 \
--max-model-len 204800 \
--max-num-batched-tokens 8192 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--quantization modelopt \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
--reasoning-parser qwen3 \
--language-model-only \
--generation-config vllm \
--disable-custom-all-reduce \
--attention-backend TRITON_ATTN
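For context on how I'm reading the tok/s numbers below: once the server is healthy, a quick streaming request against the OpenAI-compatible endpoint is enough for a ballpark. Rough Python sketch, not my exact harness; it assumes the default port 8000 and the served model name from the command above:

import time
from openai import OpenAI

# vLLM's OpenAI-compatible server from the command above, default port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.time()
completion_tokens = 0
stream = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user",
               "content": "Explain speculative decoding in about 500 words."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries token usage
)
for chunk in stream:
    if chunk.usage is not None:
        completion_tokens = chunk.usage.completion_tokens
elapsed = time.time() - start
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> ~{completion_tokens / elapsed:.1f} tok/s")

The elapsed time includes prefill, so for a short prompt like this it slightly understates pure decode speed.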
Performance so far:
- 8K context, MTP n=1: ~50–52 tok/s
- 8K context, MTP n=3: ~62–66 tok/s
- 32K context: ~59–66 tok/s
- 204800 context starts and works, but is tight
- Idle VRAM at 204k: ~14.45GiB per GPU
- After a 168k-token prefill: ~15.65GiB per GPU
- 168k-token needle/retrieval smoke test passed in ~256s (sketch of the test after this list)
- Near-limit test correctly rejected prompt+output over the 204800 window
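The needle/retrieval check mentioned above is basically a wall of filler with one odd fact buried in the middle, then asking the model to pull it back out. Illustrative sketch, not my exact script (the filler, needle, and repeat count are made up; check the usage field or server logs for the real prompt token count):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Build a ~160-170k-token haystack with a single "needle" fact in the middle.
# The filler sentence is roughly 11 tokens, so ~15000 repeats is the right ballpark.
filler = "The quick brown fox jumps over the lazy dog. " * 15000
needle = "The secret launch code is PINEAPPLE-7742. "
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user",
               "content": haystack + "\n\nWhat is the secret launch code? Answer with just the code."}],
    max_tokens=64,
)
print(resp.usage.prompt_tokens, "prompt tokens")
print(resp.choices[0].message.content)  # expect PINEAPPLE-7742

The near-limit rejection test is the same idea with the repeat count pushed until prompt + max_tokens exceeds the 204800 window, at which point the server returns a context-length error instead of generating.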
Thinking mode works too, but you need to give it enough output budget. With low max_tokens, Qwen can spend the whole cap on reasoning and return no final content. Around 1024+ is fine for small prompts, and 4096–8192 is safer for actual reasoning tasks.
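To make the budget point concrete: with --reasoning-parser qwen3 the chain of thought comes back in a separate reasoning_content field on the message, and when max_tokens is too low you get reasoning but empty content. Minimal sketch; reasoning_content is a vLLM extension rather than part of the official OpenAI schema, so I guard the access:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user", "content": "Is 2027 a prime number? Think it through."}],
    max_tokens=4096,  # with something like 256, the whole budget can go to reasoning
)
msg = resp.choices[0].message
# reasoning_content comes from vLLM's reasoning parser, not the standard OpenAI schema
reasoning = getattr(msg, "reasoning_content", None) or ""
print("reasoning chars:", len(reasoning))
print("answer:", msg.content)
if not msg.content:
    print("empty final answer -> bump max_tokens")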
Caveats:
- 204k context is right on the edge with 2x16GB. gpu_memory_utilization=0.94 failed KV allocation; 0.95 worked.
- Startup takes several minutes due to compile/autotune.
- Logs show FlashInfer autotuner OOM fallbacks during startup, but the server still becomes healthy.
- I had better luck with TRITON_ATTN for the text path.
- This is not a high-concurrency config: max_num_seqs=1.
Overall: dual 5060 Ti 16GB seems surprisingly usable for Qwen3.6 27B if you use the right checkpoint/runtime combo. It’s not roomy, but it works.
pepedombo@reddit
These results (50-60 tps) are without thinking/reasoning? I tried to set up vLLM via Docker but on 5070+5060 I ended up worse than llama.cpp. I'm using q5/q6 f16 128k on 2-3 GPUs and I can live with 20 tps, but every time I see vLLM and its results I wonder where I go wrong 😄
bonobomaster@reddit
5070 Ti + 3060 Ti with 24 GB VRAM altogether and a q5 quant already get 22 tk/s with llama on an older PCIe 4 board in my case.
MXFP4 quant 31 tk/s.
So there is that: two wildly different cards, llama, no --split-mode tensor, and slower quants compared to NVFP4.
And then there is the speed kicker on vllm: MTP, which is sadly not supported in llama.
In short: 20 tk/s on two 5000 gen cards has much potential for optimization even with llama.cpp.
gingerbeer987654321@reddit
Any AI, even the free one with Google search, should be able to tweak your command line until you get vLLM working.
pepedombo@reddit
I got tired after 2 days of tinkering with GPT Plus 😄 There might be an issue with my setup, because I'm running one PCIe x16 and three PCIe x1 risers, so the bandwidth is bottlenecked. Loading safetensors in vLLM takes ages, so I frequently go back to llama. Bandwidth problems occur mostly with dense models; I can spot it while running qwen 27b-Q8 F16 at ~14 tps, and I can run it in parallel and get 2x11 tps.
Anyway - if that 50tps is for non-thinking mode then it doesn't convince me, because I'm lurking for quality.
Turbulent_War4067@reddit
I'm the same.
Mount_Gamer@reddit
I am curious which gen of PCIe you're running on? I am tempted by two 5060 Tis but might need to upgrade my setup, as my 5650G Pro only runs with PCIe 3.
do_u_think_im_spooky@reddit (OP)
The GPUs are running inside a Dell Optiplex with dual Xeon E5 2680 v4 CPUs and 128GB 2400MHz DDR4
gingerbeer987654321@reddit
Ditto, I run 2x5060 in a dell r730 server
Dense models are best as the communication speed between cards is bad (PCIe 3 -> CPU 1 -> CPU 2 -> PCIe 3 on the other bus).
MoE models really suffer, so 27b is faster than 36b-a3b, sort of thing.
andy_potato@reddit
I’m using dual 5060ti and loving it so far
jjh47@reddit
If you get all of the model layers to offload to the GPU then PCIe speed doesn’t really matter. It’s only an issue if you run some of the layers on the CPU, which is also much slower, even with PCIe 5.
fasti-au@reddit
Stop. Go to the internet, search "qwen 3.6 llama cpp turboquant" and do that, then ask Claude or a decent model to look at the spreads/timings and undervolt the card, and enjoy approx 500k context at 200 tps.
TapAggressive9530@reddit
If you are not running it with full precision (BF16) you are not running it at all - no matter how many tps you are getting. Yes, you can tweak and quant, but qwen 3.6 is complete garbage in a quantized state.
kaliku@reddit
FP8 isn't garbage, although it sometimes mangles tool calls and thinking. Maybe 2% of the calls. Wonder if it's because of FP8. It's the Qwen FP8...
anzzax@reddit
Thanks for this config and model. I didn't expect that with a 27B dense model I could get 20 tok/s on a DGX Spark; on a 5090 it's going to be >100 tok/s.
SocialDinamo@reddit
My second 5060 Ti 16GB is coming in today. I was looking for exactly this and you provided, thank you brozzer!
patricious@reddit
Great choice to use the NVFP4 model variant, as your 2x 5060s have native support for it. llama.cpp also added official support for it an hour ago lol. Currently building a new server around that.
rpkarma@reddit
NVFP4 requires QAT to really recover the intelligence the quantisation loses. Not all NVFP4 quants will be the same.
DeltaSqueezer@reddit
What's the prefill speed? Also, did you try FlashAttention 2?
Lyceum_Tech@reddit
thanks for the detailed numbers man. really helpful
quick question - how’s the stability at 20k context when you’re actually chatting or running longer sessions? any random crashes?
appreciate you posting the full setup too