Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
Posted by Maheidem@reddit | LocalLLaMA | 6 comments
So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF.
This is not a "best possible setup" claim. More like: this is what I got working, here are the exact params, here are the numbers, and maybe it helps other 5090 owners avoid some guessing.
The short version:
- Single RTX 5090, 32GB VRAM
- Model: Peutlefaire/Qwen3.6-27B-NVFP4
- vLLM: 0.20.1.dev0+g88d34c640.d20260502
- Torch: 2.13.0.dev20260430+cu130
- Driver: 595.58.03
- Quantization: compressed-tensors
- Attention backend: flashinfer
- KV cache: fp8_e4m3
- MTP enabled with 3 speculative tokens
- Text-only mode
- Public claim I am comfortable with: 200k context, not 220k/262k
The vLLM model endpoint reports max_model_len: 230400, but I only benchmarked up to 200k context depth. I am intentionally keeping the claim at 200k because that is what I actually validated with repeated runs.
Here are the main vLLM args:
vllm serve Peutlefaire/Qwen3.6-27B-NVFP4 \
--host 0.0.0.0 --port 8082 \
--safetensors-load-strategy=prefetch \
--tensor-parallel-size 1 \
--attention-backend flashinfer \
--performance-mode interactivity \
--language-model-only \
--skip-mm-profiling \
--kv-cache-dtype fp8_e4m3 \
--gpu-memory-utilization 0.95 \
--max-model-len 230400 \
--max-num-seqs 1 \
--max-num-batched-tokens 4096 \
--enable-chunked-prefill \
--enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--quantization compressed-tensors \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
--trust-remote-code
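Once the server is up, the max_model_len claim from above is easy to verify yourself; recent vLLM builds report it on the OpenAI-compatible models endpoint. A quick sketch, assuming jq is installed and the port from the command above:

```bash
# Ask the OpenAI-compatible endpoint what context limit it actually advertises.
curl -s http://localhost:8082/v1/models | jq '.data[0].max_model_len'
# Expected output for this config: 230400
```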
Startup log had the important bits I wanted to see:
- Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
- Available KV cache memory: 8.3 GiB
- Maximum concurrency for 230,400 tokens per request: 1.00x
After the run, nvidia-smi showed about 30478 MiB / 32607 MiB used, with the vLLM EngineCore process using around 29998 MiB.
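If you want to watch headroom live while a benchmark runs, a plain nvidia-smi loop is enough (standard nvidia-smi flags, nothing vLLM-specific):

```bash
# Print used/total VRAM once per second during the run.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv --loop=1
```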
llama-benchy numbers
All of this was with:
- llama-benchy 0.3.7
- --pp 2048 --tg 480 --latency-mode generation --skip-coherence
- concurrency 1
- War and Peace text as the long-context source
Context ladder
| context depth | prefill tok/s | generation tok/s | TTFT |
|---|---|---|---|
| 0 | 28470 | 86.3 | 0.2s |
| 1k | 20901 | 94.5 | 0.3s |
| 5k | 14593 | 82.3 | 0.6s |
| 10k | 12805 | 88.8 | 1.0s |
| 20k | 10564 | 88.3 | 2.2s |
| 50k | 7277 | 89.0 | 7.3s |
| 100k | 4834 | 62.7 | 21.2s |
| 150k | 3617 | 75.5 | 42.1s |
| 200k | 2893 | 63.4 | 69.9s |
Then I ran a separate 10-run stability pass at 200k, with --exit-on-first-fail, just to make sure it was not a lucky single run.
200k stability run
pp=2048, tg=480, depth=200000, runs=10, no cache:
- 10/10 runs completed
- exit status 0
- mean prefill: 2883 tok/s
- mean generation: 73.6 tok/s
- generation stddev: 13.5 tok/s
- mean TTFT: 70.2s
- wall time: 12:48.79
Per-run generation speed:
73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37 tok/s
So I would not cherry-pick the best single run (110.71 tok/s) from the stability pass. The more honest number for this setup is probably around 65-75 tok/s generation at 200k, depending on the run.
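If you want to sanity-check those summary stats against the per-run values, a throwaway awk one-liner reproduces the reported mean and (population) stddev:

```bash
# Recompute mean and population stddev from the ten per-run generation speeds.
printf '%s\n' 73.04 75.12 63.24 75.94 59.02 110.71 64.11 68.18 72.55 74.37 |
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "mean=%.1f, stddev=%.1f tok/s\n", m, sqrt(ss / n - m * m) }'
# Prints: mean=73.6, stddev=13.5 tok/s
```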
Prefix cache behavior
I also tested prefix caching separately. At 200k:
| run | prefill tok/s | generation tok/s | TTFT |
|---|---|---|---|
| cold | 2911 | 65.2 | 68.8s |
| warm | 761 | 59.6 | 2.8s |
The warm-cache prefill number is not directly comparable to cold prefill, but the TTFT drop is the useful part. For local coding / agent workflows where you keep reusing a huge prefix, this is the thing that actually feels different.
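If you want to feel the warm-cache effect without llama-benchy, just send the same long prefix twice and compare request times. A minimal sketch against the OpenAI-compatible endpoint; long_prefix.txt is a placeholder for whatever huge prefix you reuse, and with a tiny max_tokens the total time is dominated by prefill, so the warm-run drop roughly tracks the TTFT drop:

```bash
# Send the same long prefix twice; the second request should hit the prefix cache.
PROMPT=$(cat long_prefix.txt)
BODY=$(jq -n --arg p "$PROMPT" \
  '{model: "Peutlefaire/Qwen3.6-27B-NVFP4", prompt: $p, max_tokens: 16}')
for run in cold warm; do
  t=$(curl -s -o /dev/null -w '%{time_total}' \
    http://localhost:8082/v1/completions \
    -H 'Content-Type: application/json' -d "$BODY")
  echo "$run: ${t}s"
done
```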
MTP telemetry
From the vLLM log across the benchmark run:
- Mean MTP acceptance length: 2.28
- Average draft acceptance: 42.7%
- Max observed GPU KV cache usage: 88.0%
The acceptance rate moved around a lot, so I am curious if other people get better numbers with num_speculative_tokens=2 instead of 3. I started with 3 because it was stable here, but I am not claiming it is optimal.
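For anyone who wants to A/B that, the only thing that changes in the serve command above is the speculative-config, and you can pull the acceptance stats back out of the server log. The grep pattern is a starting point, since the exact log wording varies between vLLM versions, and vllm.log is a placeholder path:

```bash
# Variant of the serve command above: 2 draft tokens instead of 3
# (all other flags unchanged, shown alone for brevity).
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'

# Then compare acceptance across runs.
grep -i "acceptance" vllm.log | tail -n 20
```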
Caveats
A few things worth saying clearly:
- I did not run an accuracy benchmark here. This is performance/stability only.
- vLLM warns about NVFP4 global scales possibly reducing accuracy. So if you care about coding quality, do your own evals.
- Prefix caching with the Mamba cache align mode is still marked experimental by vLLM.
- FlashInfer + spec decode forced CUDAGraph mode to piecewise.
- I did not test vision/multimodal. This was text-only.
- I did not validate 220k or 262k. The number I can stand behind from this run is 200k.
At this point I am pretty happy with this as a local 5090 setup. Not perfect, and not pretending it replaces every cloud model, but for long local coding sessions it finally feels like the card is doing what I bought it for.
If anyone else is running Qwen3.6 27B on a 5090, especially NVFP4 or FP8 with vLLM, I would really like to compare params and MTP settings. Also curious if someone has cleaner settings for max_num_batched_tokens with MTP, because vLLM does warn that 4096 may be suboptimal.
I have the raw llama-benchy JSON/stdout/stderr and full vLLM logs saved locally. Can upload them somewhere if people want to inspect the full audit trail.
Anbeeld@reddit
It's quite possible to fit 200k context even into 24GB VRAM, let alone a 5090 with its 32GB.
MutantEggroll@reddit
Please share your configuration to achieve this.
Otherwise-Director17@reddit
I’m using this model… The quality is great and it works with images.
https://hugging-face.co/Lorbus/Qwen3.6-27B-int4-AutoRound
I'm also on a 5090 and I have similar settings to you. I keep 200k context and the same MTP config, but my acceptance rate is no less than 65%. I get higher throughput with thinking enabled, which is probably why my acceptance rate is higher.
75-130 tok/s
cibernox@reddit
Using circa 30B dense models in Q4 at 60+ tok/s with 128k+ context on consumer hardware is going to be quite the revolution, really. That is actually very capable and usable.
Bulky-Priority6824@reddit
2 years from now or whatever all of this will be Atari talk
Bulky-Priority6824@reddit
Single 5060ti 16gb