Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19

Posted by Kindly-Cantaloupe978@reddit | LocalLLaMA

Thanks to the community, Qwen3.6-27B keeps getting faster. The following improves upon my recipe from yesterday and delivers a whopping 100+ tps (TG).

Model: https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound

- MTP supported

- KLD is decent, especially considering it's the smallest model: https://www.reddit.com/r/LocalLLaMA/comments/1ssyukx/qwen3627b_klds_ints_and_nvfps/

- The smaller model size allows for full native 256k context window
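To see why the smaller footprint leaves room for the full window, here's a back-of-the-envelope KV-cache estimate. The layer/head/dim numbers below are hypothetical placeholders for illustration, not Qwen3.6-27B's actual architecture:

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, dtype_bytes: int) -> float:
    """KV-cache size for one sequence, in GiB.

    Each layer stores two tensors (K and V) of [num_kv_heads, head_dim]
    per token, at dtype_bytes per element.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len / 2**30

# Hypothetical GQA shape -- placeholders, not the model's real config.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
CTX = 262_144  # the full 256k window from the launch config

fp8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, 1)   # fp8_e4m3: 1 byte
fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2)  # fp16 baseline: 2 bytes
print(f"fp8: {fp8:.1f} GiB, fp16: {fp16:.1f} GiB")  # fp8 halves the cache
```

Whatever the real shape is, the fp8 KV cache (`--kv-cache-dtype fp8_e4m3` below) is half the size of an fp16 one, which is what makes the full native window fit next to the int4 weights.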

Tokens per second (TG): 105-108 tps

Special credit to this post, which helped me discover the Lorbus quant: https://www.reddit.com/r/Olares/comments/1svg2ad/qwen3627b_at_85100_ts_on_a_24gb_rtx_5090_laptop/

Note that I didn't mess with TQ in my setup, as I can already hit the model's native max context length without it.

vLLM launch config:

```shell
args=(
  vllm serve "/root/autodl-tmp/llm-models"
  --max-model-len "262144"
  --gpu-memory-utilization "0.93"
  --attention-backend flashinfer
  --performance-mode interactivity
  --language-model-only
  --kv-cache-dtype "fp8_e4m3"
  --max-num-seqs "2"
  --skip-mm-profiling
  --quantization auto_round
  --reasoning-parser qwen3
  --enable-auto-tool-choice
  --enable-prefix-caching
  --enable-chunked-prefill
  --tool-call-parser qwen3_coder
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  --host "0.0.0.0"
  --port "6006"
)
"${args[@]}"  # run the server
```
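If you want to reproduce the TG numbers yourself, here's a minimal stdlib-only sketch against vLLM's OpenAI-compatible endpoint, assuming the host/port from the launch config above and that the served model name is the model path; `measured_tps` is my own helper, not a vLLM API:

```python
import json
import time
import urllib.request


def measured_tps(num_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: completion tokens per wall-clock second."""
    return num_tokens / elapsed_s if elapsed_s > 0 else 0.0


def main() -> None:
    # Non-streaming chat completion; host/port match the launch config,
    # and vLLM serves the model under the path it was launched with.
    req = urllib.request.Request(
        "http://127.0.0.1:6006/v1/chat/completions",
        data=json.dumps({
            "model": "/root/autodl-tmp/llm-models",
            "messages": [{"role": "user", "content": "Explain MTP briefly."}],
            "max_tokens": 256,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - t0
    tokens = body["usage"]["completion_tokens"]
    print(f"{measured_tps(tokens, elapsed):.1f} tps (TG)")


# main()  # uncomment with the server running
```

This measures end-to-end throughput including prefill, so for long prompts the number will come out a bit below the pure TG rate.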