Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working

Posted by do_u_think_im_spooky@reddit | LocalLLaMA | 19 comments

I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards.

Hardware:

- 2× RTX 5060 Ti 16GB (32GB VRAM total), tensor parallel across both cards

vLLM launch shape:

vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --served-model-name qwen36-nvfp4-mtp \
  --tensor-parallel-size 2 \
  --max-model-len 204800 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --quantization modelopt \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --language-model-only \
  --generation-config vllm \
  --disable-custom-all-reduce \
  --attention-backend TRITON_ATTN
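One reason the 204k context fits at all: with `--kv-cache-dtype fp8`, each KV element is a single byte. A back-of-envelope sizing sketch — the layer/head/dim numbers below are assumptions for illustration, not the actual Qwen3.6 27B config; read the real values from the checkpoint's config.json:

```python
# Rough KV-cache sizing for an fp8 KV cache at full context length.
# ASSUMED architecture dims (illustrative GQA config, not the real model):
# 48 layers, 8 KV heads, head_dim 128.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=1):
    """K and V stored per layer per token; fp8 => 1 byte per element."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

total = kv_cache_bytes(204_800, layers=48, kv_heads=8, head_dim=128)
print(f"{total / 2**30:.2f} GiB")
```

Under these assumed dims the full-length cache comes out around 18.75 GiB, which TP=2 then splits across the two cards — tight but plausible next to the NVFP4 weights, and consistent with needing `--gpu-memory-utilization 0.95`.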

Performance so far:

- ~60 tok/s generation with MTP speculative decoding; the full 204,800-token context loads and serves without OOM

Thinking mode works too, but you need to give it enough output budget. With a low max_tokens cap, Qwen can spend the entire budget on reasoning tokens and return no final content. Around 1024+ is fine for small prompts; 4096–8192 is safer for real reasoning tasks.
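As a sketch of what that budget rule looks like in a request against vLLM's OpenAI-compatible endpoint (the helper name and payload contents are illustrative; the model name matches `--served-model-name` from the launch command):

```python
# Hedged sketch: pick a max_tokens budget so a thinking-mode response
# doesn't burn the whole cap on reasoning and return an empty answer.
# Thresholds follow the rule of thumb above; this helper is hypothetical.

def pick_max_tokens(task: str) -> int:
    """~1024 covers small prompts; 8192 leaves headroom for real reasoning."""
    return 1024 if task == "small" else 8192

payload = {
    "model": "qwen36-nvfp4-mtp",  # matches --served-model-name
    "messages": [
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
    # Budget for reasoning tokens PLUS the final answer:
    "max_tokens": pick_max_tokens("reasoning"),
}
```

POST that as JSON to the server's `/v1/chat/completions` route (vLLM defaults to port 8000) and the reasoning content comes back separately when `--reasoning-parser qwen3` is set.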

Caveats:

Overall: dual 5060 Ti 16GB seems surprisingly usable for Qwen3.6 27B if you use the right checkpoint/runtime combo. It’s not roomy, but it works.