Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
Posted by Maheidem@reddit | LocalLLaMA | 6 comments
So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF.
This is not a "best possible setup" claim. More like: this is what I got working, here are the exact params, here are the numbers, and maybe it helps other 5090 owners avoid some guessing.
The short version:
- Single RTX 5090, 32GB VRAM
- Model: Peutlefaire/Qwen3.6-27B-NVFP4
- vLLM: 0.20.1.dev0+g88d34c640.d20260502
- Torch: 2.13.0.dev20260430+cu130
- Driver: 595.58.03
- Quantization: compressed-tensors
- Attention backend: flashinfer
- KV cache: fp8_e4m3
- MTP enabled with 3 speculative tokens
- Text-only mode
- Public claim I am comfortable with: 200k context, not 220k/262k
The vLLM model endpoint reports max_model_len: 230400, but I only benchmarked up to 200k context depth. I am intentionally keeping the claim at 200k because that is what I actually validated with repeated runs.
Here are the main vLLM args:
vllm serve Peutlefaire/Qwen3.6-27B-NVFP4 \
--host 0.0.0.0 --port 8082 \
--safetensors-load-strategy=prefetch \
--tensor-parallel-size 1 \
--attention-backend flashinfer \
--performance-mode interactivity \
--language-model-only \
--skip-mm-profiling \
--kv-cache-dtype fp8_e4m3 \
--gpu-memory-utilization 0.95 \
--max-model-len 230400 \
--max-num-seqs 1 \
--max-num-batched-tokens 4096 \
--enable-chunked-prefill \
--enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--quantization compressed-tensors \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
--trust-remote-code
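Once the server is up, the max_model_len claim from above is easy to verify yourself; recent vLLM builds report it on the OpenAI-compatible models endpoint. A quick sketch, assuming jq is installed and the port from the command above:

```bash
# Ask the OpenAI-compatible endpoint what context limit it actually advertises.
curl -s http://localhost:8082/v1/models | jq '.data[0].max_model_len'
# Expected output for this config: 230400
```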
Startup log had the important bits I wanted to see:
- Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
- Available KV cache memory: 8.3 GiB
- Maximum concurrency for 230,400 tokens per request: 1.00x
After the run, nvidia-smi showed about 30478 MiB / 32607 MiB used, with the vLLM EngineCore process using around 29998 MiB.
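If you want to watch headroom live while a benchmark runs, a plain nvidia-smi loop is enough (standard nvidia-smi flags, nothing vLLM-specific):

```bash
# Print used/total VRAM once per second during the run.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv --loop=1
```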
llama-benchy numbers
All of this was with:
- llama-benchy 0.3.7
- --pp 2048 --tg 480 --latency-mode generation --skip-coherence
- concurrency 1
- War and Peace text as the long-context source
Context ladder
| context depth | prefill tok/s | generation tok/s | TTFT |
|---|---|---|---|
| 0 | 28470 | 86.3 | 0.2s |
| 1k | 20901 | 94.5 | 0.3s |
| 5k | 14593 | 82.3 | 0.6s |
| 10k | 12805 | 88.8 | 1.0s |
| 20k | 10564 | 88.3 | 2.2s |
| 50k | 7277 | 89.0 | 7.3s |
| 100k | 4834 | 62.7 | 21.2s |
| 150k | 3617 | 75.5 | 42.1s |
| 200k | 2893 | 63.4 | 69.9s |
Then I ran a separate 10-run stability pass at 200k, with --exit-on-first-fail, just to make sure it was not a lucky single run.
200k stability run
pp=2048, tg=480, depth=200000, runs=10, no cache:
- 10/10 runs completed
- exit status 0
- mean prefill: 2883 tok/s
- mean generation: 73.6 tok/s
- generation stddev: 13.5 tok/s
- mean TTFT: 70.2s
- wall time: 12:48.79
Per-run generation speed:
73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37 tok/s
So I would not cherry-pick the best single run (110.71 tok/s) from the stability pass. The more honest number for this setup is probably around 65-75 tok/s generation at 200k, depending on the run.
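If you want to sanity-check those summary stats against the per-run values, a throwaway awk one-liner reproduces the reported mean and (population) stddev:

```bash
# Recompute mean and population stddev from the ten per-run generation speeds.
printf '%s\n' 73.04 75.12 63.24 75.94 59.02 110.71 64.11 68.18 72.55 74.37 |
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "mean=%.1f, stddev=%.1f tok/s\n", m, sqrt(ss / n - m * m) }'
# Prints: mean=73.6, stddev=13.5 tok/s
```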
Prefix cache behavior
I also tested prefix caching separately. At 200k:
| run | prefill tok/s | generation tok/s | TTFT |
|---|---|---|---|
| cold | 2911 | 65.2 | 68.8s |
| warm | 761 | 59.6 | 2.8s |
The warm-cache prefill number is not directly comparable to cold prefill, but the TTFT drop is the useful part. For local coding / agent workflows where you keep reusing a huge prefix, this is the thing that actually feels different.
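If you want to feel the warm-cache effect without llama-benchy, just send the same long prefix twice and compare request times. A minimal sketch against the OpenAI-compatible endpoint; long_prefix.txt is a placeholder for whatever huge prefix you reuse, and with a tiny max_tokens the total time is dominated by prefill, so the warm-run drop roughly tracks the TTFT drop:

```bash
# Send the same long prefix twice; the second request should hit the prefix cache.
PROMPT=$(cat long_prefix.txt)
BODY=$(jq -n --arg p "$PROMPT" \
  '{model: "Peutlefaire/Qwen3.6-27B-NVFP4", prompt: $p, max_tokens: 16}')
for run in cold warm; do
  t=$(curl -s -o /dev/null -w '%{time_total}' \
    http://localhost:8082/v1/completions \
    -H 'Content-Type: application/json' -d "$BODY")
  echo "$run: ${t}s"
done
```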
MTP telemetry
From the vLLM log across the benchmark run:
- Mean MTP acceptance length: 2.28
- Average draft acceptance: 42.7%
- Max observed GPU KV cache usage: 88.0%
The acceptance rate moved around a lot, so I am curious if other people get better numbers with num_speculative_tokens=2 instead of 3. I started with 3 because it was stable here, but I am not claiming it is optimal.
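For anyone who wants to A/B that, the only thing that changes in the serve command above is the speculative-config, and you can pull the acceptance stats back out of the server log. The grep pattern is a starting point, since the exact log wording varies between vLLM versions, and vllm.log is a placeholder path:

```bash
# Variant of the serve command above: 2 draft tokens instead of 3
# (all other flags unchanged, shown alone for brevity).
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'

# Then compare acceptance across runs.
grep -i "acceptance" vllm.log | tail -n 20
```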
Caveats
A few things worth saying clearly:
- I did not run an accuracy benchmark here. This is performance/stability only.
- vLLM warns about NVFP4 global scales possibly reducing accuracy. So if you care about coding quality, do your own evals.
- Prefix caching with the Mamba cache align mode is still marked experimental by vLLM.
- FlashInfer + spec decode forced CUDAGraph mode to piecewise.
- I did not test vision/multimodal. This was text-only.
- I did not validate 220k or 262k. The number I can stand behind from this run is 200k.
At this point I am pretty happy with this as a local 5090 setup. Not perfect, and not pretending it replaces every cloud model, but for long local coding sessions it finally feels like the card is doing what I bought it for.
If anyone else is running Qwen3.6 27B on a 5090, especially NVFP4 or FP8 with vLLM, I would really like to compare params and MTP settings. Also curious if someone has cleaner settings for max_num_batched_tokens with MTP, because vLLM does warn that 4096 may be suboptimal.
I have the raw llama-benchy JSON/stdout/stderr and full vLLM logs saved locally. Can upload them somewhere if people want to inspect the full audit trail.
Anbeeld@reddit
It's quite possible to fit 200k context even into 24GB VRAM, let alone a 5090 with its 32GB.
MutantEggroll@reddit
Please share your configuration to achieve this.
Otherwise-Director17@reddit
I’m using this model… The quality is great and it works with images.
https://hugging-face.co/Lorbus/Qwen3.6-27B-int4-AutoRound
I'm also on a 5090 and I have similar settings to you. I keep 200k context and the same MTP config, but my acceptance rate is no less than 65%. I get higher throughput with thinking enabled, which is probably why my acceptance rate is higher.
75-130 tok/s
cibernox@reddit
Using circa 30B dense models in Q4 at 60+ tok/s with 128k+ context on consumer hardware is going to be quite the revolution, really. That is actually very capable and usable.
Bulky-Priority6824@reddit
2 years from now or whatever all of this will be Atari talk
Bulky-Priority6824@reddit
Single 5060ti 16gb