Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB

Posted by __JockY__@reddit | LocalLLaMA | 160 comments

----START HUMAN TEXT----

Hi all,

I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved. It's all amazing work, but at the end of the day a quantized model with a quantized KV cache will compound errors faster than an unquantized setup, and that noticeably impacts agentic coding.

I figured a 48GB GPU offered just enough VRAM to sidestep most of that quantization nastiness in favor of genuinely good options like Blackwell-accelerated FP8. Luckily, Qwen released their own FP8 variant of the 27B model.

I'm serious when I say: I think we might have an answer to all those "what do I buy for $10k?" posts. A pro5k, 64GB RAM, a decent CPU/mobo, and it will run the FP8 quant of 27B with Blackwell hardware acceleration and non-quantized KV like a champ. It's quiet, cool enough, small, fast... really great.
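
For a rough sanity check on why 48GB works: FP8 weights are about 1 byte per parameter, so the 27B model takes roughly 27GB, leaving ~20GB for KV cache and activations. The BF16 KV cache grows linearly with context; here's a back-of-envelope in bash using an illustrative GQA layout (48 layers, 4 KV heads, head dim 128 — assumed numbers for illustration, not necessarily the real Qwen3.6 config):

# bytes per token = 2 bytes (bf16) * 2 (K and V) * layers * kv_heads * head_dim
echo $(( 2 * 2 * 48 * 4 * 128 ))       # 98304 bytes, i.e. 96 KiB per token
echo $(( 98304 * 196608 / 1024**3 ))   # ~18 GiB for the full 196,608-token context

Numbers in that ballpark are why a ~196k BF16 context can fit alongside the weights on a 48GB card.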

The end recipe. First, these environment variables:

# Kernels: FlashInfer FP8 MoE paths, force Marlin FP8 GEMMs
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
# Engine: back off polling when idle, include CUDA graphs in the memory
# profile estimate, log engine stats every 2s, spawn workers instead of forking
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_LOG_STATS_INTERVAL=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Loading/runtime: fast safetensors loads straight to GPU, stable device
# ordering, TF32 for any fp32 matmuls, less allocator fragmentation
export SAFETENSORS_FAST_GPU=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True

Then the serve command:

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --performance-mode interactivity \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --gpu-memory-utilization 0.975 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "max_cudagraph_capture_size": 16, "mode": "VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --max-model-len 196608 \
  --kv-cache-dtype bfloat16 \
  --enable-prefix-caching
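
Once it's up, a quick smoke test against the OpenAI-compatible endpoint vLLM serves on that port (the model name matches the path passed to vllm serve; the prompt is just a placeholder):

curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Qwen/Qwen3.6-27B-FP8",
        "messages": [{"role": "user", "content": "Write a one-line bash command to count files in a directory."}],
        "max_tokens": 128
      }'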

Performance

I'm running proper benchmarks right now and will update this post later, but in general: writing code with MTP=2 yields 60-90 TPS, which I find perfectly acceptable for daily use. Furthermore, because we're running FP8 weights with a non-quantized KV cache, we get the benefits of long Claude-style agentic sessions without early compaction, endless loops, etc. It's truly minimally quantized.
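
If you want to reproduce TPS numbers yourself, recent vLLM ships a benchmark CLI; a sketch along these lines should work (flag names can shift between versions, so check vllm bench serve --help):

# random-prompt throughput test against the running server
vllm bench serve \
  --host 0.0.0.0 --port 8080 \
  --model Qwen/Qwen3.6-27B-FP8 \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 512 \
  --num-prompts 32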

----END HUMAN TEXT----
