Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM

Posted by Maheidem@reddit | LocalLLaMA | View on Reddit | 6 comments

So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are about 48GB cards, FP8, or llama.cpp/GGUF.

This is not a "best possible setup" claim. More like: this is what I got working, here are the exact params, here are the numbers, and maybe it helps other 5090 owners avoid some guessing.

The short version:

The vLLM model endpoint reports max_model_len: 230400, but I only benchmarked up to 200k context depth. I am intentionally keeping the claim at 200k because that is what I actually validated with repeated runs.
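
If you want to double-check what the endpoint reports on your own box, the model card from the OpenAI-compatible API includes it on recent vLLM builds (hedged sketch, assumes jq is installed; the exact field location may differ between vLLM versions):

curl -s http://localhost:8082/v1/models | jq '.data[0].max_model_len'
# expect 230400 with the args below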

Here are the main vLLM args:

vllm serve Peutlefaire/Qwen3.6-27B-NVFP4 \
  --host 0.0.0.0 --port 8082 \
  --safetensors-load-strategy=prefetch \
  --tensor-parallel-size 1 \
  --attention-backend flashinfer \
  --performance-mode interactivity \
  --language-model-only \
  --skip-mm-profiling \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 230400 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --no-disable-hybrid-kv-cache-manager \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --quantization compressed-tensors \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --trust-remote-code
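
Once it is up, a minimal smoke test against the OpenAI-compatible endpoint looks like this (just an illustrative request, not part of the benchmark):

curl -s http://localhost:8082/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Peutlefaire/Qwen3.6-27B-NVFP4",
        "messages": [{"role": "user", "content": "Say hi in five words."}],
        "max_tokens": 32
      }' | jq -r '.choices[0].message.content'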

Startup log had the important bits I wanted to see:

After the run, nvidia-smi showed about 30478 MiB / 32607 MiB used, with the vLLM EngineCore process using around 29998 MiB.
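
For anyone who wants to watch the same numbers, these are the standard nvidia-smi queries I would use to reproduce the readings above (nothing exotic):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader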

llama-benchy numbers

All of this was with:

Context ladder

context depth   prefill tok/s   generation tok/s   TTFT
0               28470           86.3                0.2s
1k              20901           94.5                0.3s
5k              14593           82.3                0.6s
10k             12805           88.8                1.0s
20k             10564           88.3                2.2s
50k              7277           89.0                7.3s
100k             4834           62.7               21.2s
150k             3617           75.5               42.1s
200k             2893           63.4               69.9s
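
One sanity check on the ladder: TTFT tracks depth divided by prefill throughput almost exactly, so the long-context end is prefill-bound as expected. For the 200k row:

awk 'BEGIN { printf "%.1f s\n", 200000 / 2893 }'
# 69.1 s predicted vs 69.9 s measured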

Then I ran a separate 10-run stability pass at 200k, with --exit-on-first-fail, just to make sure it was not a lucky single run.

200k stability run

pp=2048, tg=480, depth=200000, runs=10, no cache:

Per-run generation speed:

73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37 tok/s

So I would not cherry-pick the 93 tok/s 200k result from the smaller sweep. The more honest number for this setup is probably around 65-75 tok/s generation at 200k, depending on the run.
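
For reference, a quick aggregate over those ten runs (plain shell; the 110.7 outlier pulls the mean up a bit, and the median formula assumes an even run count, 10 here):

printf '%s\n' 73.04 75.12 63.24 75.94 59.02 110.71 64.11 68.18 72.55 74.37 | sort -n |
  awk '{ a[NR] = $1; s += $1 }
       END { printf "mean=%.1f median=%.1f min=%.1f max=%.1f tok/s\n",
             s/NR, (a[NR/2] + a[NR/2 + 1]) / 2, a[1], a[NR] }'
# mean=73.6 median=72.8 min=59.0 max=110.7 tok/s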

Prefix cache behavior

I also tested prefix caching separately. At 200k:

run    prefill tok/s   generation tok/s   TTFT
cold   2911            65.2               68.8s
warm    761            59.6                2.8s

The warm-cache prefill number is not directly comparable to cold prefill, but the TTFT drop is the useful part. For local coding / agent workflows where you keep reusing a huge prefix, this is the thing that actually feels different.
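
A rough way to feel the warm-cache effect from the client side is to send the same huge prefix twice and compare wall times (hypothetical sketch; big_prefix.txt stands in for the reused context, and total request time is only a crude TTFT proxy since this is not streaming):

PREFIX="$(cat big_prefix.txt)"   # placeholder for the large shared prefix
jq -n --arg ctx "$PREFIX" '{
  model: "Peutlefaire/Qwen3.6-27B-NVFP4",
  messages: [
    { role: "system", content: $ctx },
    { role: "user",   content: "Summarize the open TODOs." }
  ],
  max_tokens: 64
}' > /tmp/req.json

# first call prefills from scratch, second should hit the prefix cache
for run in cold warm; do
  echo "$run:"
  time curl -s http://localhost:8082/v1/chat/completions \
    -H 'Content-Type: application/json' -d @/tmp/req.json > /dev/null
done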

MTP telemetry

From the vLLM log across the benchmark run:

The acceptance rate moved around a lot, so I am curious if other people get better numbers with num_speculative_tokens=2 instead of 3. I started with 3 because it was stable here, but I am not claiming it is optimal.
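
If anyone wants to try that comparison, the only change versus the serve command above would be this line (I have not benchmarked this variant myself):

  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \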

Caveats

A few things worth saying clearly:

At this point I am pretty happy with this as a local 5090 setup. Not perfect, and not pretending it replaces every cloud model, but for long local coding sessions it finally feels like the card is doing what I bought it for.

If anyone else is running Qwen3.6 27B on a 5090, especially NVFP4 or FP8 with vLLM, I would really like to compare params and MTP settings. Also curious if someone has cleaner settings for max_num_batched_tokens with MTP, because vLLM does warn that 4096 may be suboptimal.

I have the raw llama-benchy JSON/stdout/stderr and full vLLM logs saved locally. Can upload them somewhere if people want to inspect the full audit trail.

