Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working
Posted by do_u_think_im_spooky@reddit | LocalLLaMA | 19 comments
I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards.
Hardware:
- 2x RTX 5060 Ti 16GB
- 32GB total VRAM
- Proxmox LXC
- 16 vCPU
- ~60GB RAM
- CUDA 13 / Torch 2.11 nightly
- vLLM nightly: 0.19.2rc1.dev
- Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
vLLM launch shape:
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
--served-model-name qwen36-nvfp4-mtp \
--tensor-parallel-size 2 \
--max-model-len 204800 \
--max-num-batched-tokens 8192 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--quantization modelopt \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
--reasoning-parser qwen3 \
--language-model-only \
--generation-config vllm \
--disable-custom-all-reduce \
--attention-backend TRITON_ATTN
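For context on how I'm reading the tok/s numbers below: once the server is healthy, a quick streaming request against the OpenAI-compatible endpoint is enough for a ballpark. Rough Python sketch, not my exact harness; it assumes the default port 8000 and the served model name from the command above:

import time
from openai import OpenAI

# vLLM's OpenAI-compatible server from the command above, default port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.time()
completion_tokens = 0
stream = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user",
               "content": "Explain speculative decoding in about 500 words."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries token usage
)
for chunk in stream:
    if chunk.usage is not None:
        completion_tokens = chunk.usage.completion_tokens
elapsed = time.time() - start
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> ~{completion_tokens / elapsed:.1f} tok/s")

The elapsed time includes prefill, so for a short prompt like this it slightly understates pure decode speed.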
Performance so far:
- 8K context, MTP n=1: ~50–52 tok/s
- 8K context, MTP n=3: ~62–66 tok/s
- 32K context: ~59–66 tok/s
- 204800 context starts and works, but is tight
- Idle VRAM at 204k: ~14.45GiB per GPU
- After a 168k-token prefill: ~15.65GiB per GPU
- 168k-token needle/retrieval smoke test passed in ~256s (sketch of the test after this list)
- Near-limit test correctly rejected prompt+output over the 204800 window
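The needle/retrieval check mentioned above is basically a wall of filler with one odd fact buried in the middle, then asking the model to pull it back out. Illustrative sketch, not my exact script (the filler, needle, and repeat count are made up; check the usage field or server logs for the real prompt token count):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Build a ~160-170k-token haystack with a single "needle" fact in the middle.
# The filler sentence is roughly 11 tokens, so ~15000 repeats is the right ballpark.
filler = "The quick brown fox jumps over the lazy dog. " * 15000
needle = "The secret launch code is PINEAPPLE-7742. "
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user",
               "content": haystack + "\n\nWhat is the secret launch code? Answer with just the code."}],
    max_tokens=64,
)
print(resp.usage.prompt_tokens, "prompt tokens")
print(resp.choices[0].message.content)  # expect PINEAPPLE-7742

The near-limit rejection test is the same idea with the repeat count pushed until prompt + max_tokens exceeds the 204800 window, at which point the server returns a context-length error instead of generating.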
Thinking mode works too, but you need to give it enough output budget. With low max_tokens, Qwen can spend the whole cap on reasoning and return no final content. Around 1024+ is fine for small prompts, and 4096–8192 is safer for actual reasoning tasks.
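To make the budget point concrete: with --reasoning-parser qwen3 the chain of thought comes back in a separate reasoning_content field on the message, and when max_tokens is too low you get reasoning but empty content. Minimal sketch; reasoning_content is a vLLM extension rather than part of the official OpenAI schema, so I guard the access:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user", "content": "Is 2027 a prime number? Think it through."}],
    max_tokens=4096,  # with something like 256, the whole budget can go to reasoning
)
msg = resp.choices[0].message
# reasoning_content comes from vLLM's reasoning parser, not the standard OpenAI schema
reasoning = getattr(msg, "reasoning_content", None) or ""
print("reasoning chars:", len(reasoning))
print("answer:", msg.content)
if not msg.content:
    print("empty final answer -> bump max_tokens")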
Caveats:
- 204k context is right on the edge with 2x16GB. gpu_memory_utilization=0.94 failed KV allocation; 0.95 worked.
- Startup takes several minutes due to compile/autotune.
- Logs show FlashInfer autotuner OOM fallbacks during startup, but the server still becomes healthy.
- I had better luck with TRITON_ATTN for the text path.
- This is not a high-concurrency config: max_num_seqs=1.
Overall: dual 5060 Ti 16GB seems surprisingly usable for Qwen3.6 27B if you use the right checkpoint/runtime combo. It’s not roomy, but it works.
pepedombo@reddit
These results (50-60 tps) are without thinking/reasoning? I tried to set up vLLM via Docker but on 5070+5060 I ended up worse than llama.cpp. I'm using q5/q6 f16 128k on 2-3 GPUs and I can live with 20 tps, but every time I see vLLM and its results I wonder where I go wrong 😄
bonobomaster@reddit
5070 Ti + 3060 Ti with 24 GB VRAM altogether and a q5 quant already get 22 tk/s with llama on an older PCIe 4 board in my case.
MXFP4 quant 31 tk/s.
So there is that: two wildly different cards, llama, no --split-mode tensor, and slower quants compared to NVFP4.
And then there is the speed kicker on vllm: MTP, which is sadly not supported in llama.
In short: 20 tk/s on two 5000 gen cards has much potential for optimization even with llama.cpp.
gingerbeer987654321@reddit
Any AI, even the free one with Google search, should be able to tweak your command line until you get vLLM working.
pepedombo@reddit
I got tired after 2 days of tinkering with GPT Plus 😄 There might be an issue with my setup, because I'm running one PCIe x16 and three PCIe x1 risers, so the bandwidth is bottlenecked. Loading safetensors in vLLM takes ages, so I frequently go back to llama. Bandwidth problems occur mostly with dense models; I can spot it while running qwen 27b-Q8 F16 at ~14 tps, and I can run it in parallel and get 2x11 tps.
Anyway - if that 50tps is for non-thinking mode then it doesn't convince me, because I'm lurking for quality.
Turbulent_War4067@reddit
I'm the same.
Mount_Gamer@reddit
I am curious which gen of PCIe you're running on? I am tempted by two 5060 Tis but might need to upgrade my setup, as my 5650G Pro only runs with PCIe 3.
do_u_think_im_spooky@reddit (OP)
The GPUs are running inside a Dell Optiplex with dual Xeon E5 2680 v4 CPUs and 128GB 2400MHz DDR4
gingerbeer987654321@reddit
Ditto, I run 2x5060 in a dell r730 server
Dense models are best as the communication speed between cards is bad (PCIe 3 -> CPU 1 -> CPU 2 -> PCIe 3 on the other bus).
MoE models really suffer, so 27b is faster than 36b-a3b, sort of thing.
andy_potato@reddit
I’m using dual 5060ti and loving it so far
jjh47@reddit
If you get all of the model layers to offload to the GPU then PCIe speed doesn’t really matter. It’s only an issue if you run some of the layers on the CPU, which is also much slower, even with PCIe 5.
fasti-au@reddit
Stop. Go to the internet, search "qwen 3.6 llama cpp turboquant" and do that, then ask Claude or a decent model to look at the spreads/timings and undervolt the card, and enjoy approx 500k context at 200 tps.
TapAggressive9530@reddit
If you are not running it with full precision (BF16) you are not running it at all - no matter how many tps you are getting. Yes, you can tweak and quant, but qwen 3.6 is complete garbage in a quantized state.
kaliku@reddit
FP8 isn't garbage, although it sometimes mangles tool calls and thinking. Maybe 2% of the calls. Wonder if it's because of FP8. It's the Qwen FP8...
anzzax@reddit
Thanks for this config and model. I didn't expect that with a 27B dense model I could get 20 tok/s on a DGX Spark; on a 5090 it's going to be >100 tok/s.
SocialDinamo@reddit
My second 5060 Ti 16GB is coming in today. I was looking for exactly this and you provided, thank you brozzer!
patricious@reddit
Great choice to use the NVFP4 model variant, as your 2x 5060s have native support for it. llama.cpp also added official support for it an hour ago lol. Currently building a new server around that.
rpkarma@reddit
NVFP4 requires QAT to really recover the intelligence the quantisation loses. Not all NVFP4 quants will be the same.
DeltaSqueezer@reddit
What's the prefill speed? Also, did you try FlashAttention 2?
Lyceum_Tech@reddit
thanks for the detailed numbers man. really helpful
quick question - how’s the stability at 20k context when you’re actually chatting or running longer sessions? any random crashes?
appreciate you posting the full setup too