Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19
Posted by Kindly-Cantaloupe978@reddit | LocalLLaMA | 27 comments
Thanks to the community, the Qwen3.6-27B speed keeps getting better. The following improves on my recipe from yesterday and delivers a whopping 100+ tps (TG).
Model: https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound
- MTP supported
- KLD is decent especially being the smallest model https://www.reddit.com/r/LocalLLaMA/comments/1ssyukx/qwen3627b_klds_ints_and_nvfps/
- The smaller model size allows for full native 256k context window
Tokens per second (TG): 105-108 tps
Special credit to this post, which helped me discover the Lorbus quant: https://www.reddit.com/r/Olares/comments/1svg2ad/qwen3627b_at_85100_ts_on_a_24gb_rtx_5090_laptop/
Note that I didn't mess with TQ in my setup as I can already hit the max context length native to the model without it.
Vllm launch config:
args=(
vllm serve "/root/autodl-tmp/llm-models"
--max-model-len "262144"
--gpu-memory-utilization "0.93"
--attention-backend flashinfer
--performance-mode interactivity
--language-model-only
--kv-cache-dtype "fp8_e4m3"
--max-num-seqs "2"
--skip-mm-profiling
--quantization auto_round
--reasoning-parser qwen3
--enable-auto-tool-choice
--enable-prefix-caching
--enable-chunked-prefill
--tool-call-parser qwen3_coder
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
--host "0.0.0.0"
--port "6006"
)
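A quick way to sanity-check the server once it's up is to hit the OpenAI-compatible endpoints on port 6006. Note that vLLM registers the model under the served path unless --served-model-name is set, so check /v1/models first and adjust the name below if it differs:

# list the registered model name(s)
curl -s http://localhost:6006/v1/models

# minimal chat completion against the served model
curl -s http://localhost:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/root/autodl-tmp/llm-models",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'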
Own_Mix_3755@reddit
The question for me is - if you have enough RAM/VRAM headroom, is it better to use 27B INT4 or 35B A3B?
Running both in FP8 makes the 27B a lot slower. I would love to get better speed on the Nvidia DGX Spark, but it is bandwidth limited. The question is whether it's better to go with INT4 27B (which might be dumbed down a little) or go FP8 35B A3B directly.
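For a rough sense of why: on a bandwidth-bound box, decode speed is roughly memory bandwidth divided by the bytes of weights touched per token. A back-of-envelope sketch (the ~273 GB/s figure for the Spark and the per-token byte counts are approximations; KV cache traffic is ignored):

# Rough decode-speed ceilings on a bandwidth-bound host (very approximate).
bw=273   # GB/s, assumed DGX Spark memory bandwidth
awk -v bw="$bw" 'BEGIN {
  printf "27B dense @ INT4 (~13.5 GB touched/token): ~%.0f tok/s ceiling\n", bw/13.5
  printf "35B A3B  @ FP8  (~3 GB active/token):      ~%.0f tok/s ceiling\n", bw/3.0
}'

Purely on bandwidth, the A3B should be much faster even at a higher bit width; whether INT4 dumbs down the dense 27B too much is the open question.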
oxygen_addiction@reddit
Why not both with llama-swap? If you need speed (code scaffolding), go to the 35B. If you need intelligence (planning and implementation), go to the 27B.
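Something like this as a llama-swap config (a sketch only; the config keys and the ${PORT} macro are from llama-swap's README as I remember it, and the GGUF paths are placeholders, so double-check against the current docs):

# llama-swap.yaml (hypothetical model files)
models:
  "qwen3.6-35b-a3b":
    cmd: llama-server --port ${PORT} -m /models/qwen3.6-35b-a3b-q4.gguf -ngl 99
  "qwen3.6-27b":
    cmd: llama-server --port ${PORT} -m /models/qwen3.6-27b-q4.gguf -ngl 99

# point your client at the proxy and pick the model per request
# (flags assumed; check llama-swap --help)
llama-swap --config llama-swap.yaml --listen :8080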
Abishek_Muthian@reddit
Is llama-swap still useful when using llama-server in router mode?
Pentium95@reddit
27B Is dense -> smarter
35B Is MoE -> faster
You can't have it both ways; choose whether you want speed or intelligence.
mintybadgerme@reddit
Is there an optimal setup/quant for 27B on a 5060 Ti with 16GB VRAM and 64GB RAM? I've been trying the Unsloth IQ4_XS via LM Studio and VS Code and it's really slow. Really, really slow. :)
Important_Quote_1180@reddit
27B Local Inference on Single RTX 3090
qwen3.6-27B-AutoRound (INT4), vLLM 0.19.2rc1.dev21, 24GB VRAM. 71–83 tok/s after warmup.
- Turboquant 3-bit NC KV Cache: Compresses KV state to 3-bit non-uniform quantization. Enables 125K context window within 24GB VRAM without OOM.
- MTP n=3 Speculative Decoding: Three auxiliary heads draft tokens per forward pass, verified atomically against main head. ~3× throughput multiplier vs. non-speculative baselines.
- Cudagraph PIECEWISE Mode: Captures only attention-op boundaries instead of full-graph replay. Eliminates degenerate repetition loops caused by stale MTP state in FULL_AND_PIECEWISE mode on multi-GPU hosts.
- Chunked Prefill + Prefix Caching: max-num-batched-tokens=4121 with max-num-seqs=1. First post-restart request incurs ~29s cudagraph compilation; subsequent requests stabilize at 12–14s for 1024-token generation.
PreparationTrue9138@reddit
Hi, where can I find the best turboquant fork/pr? Or do you use official latest release candidate version mentioned in your post?
oxygen_addiction@reddit
vLLM 0.19.2rc1.dev21
Important_Quote_1180@reddit
vLLM Stack — qwen3.6-27b-autoround on RTX 3090
Model: qwen3.6-27b-autoround-int4 (AutoRound INT4 quantization) served via vLLM nightly (dev21) on port 8020. Context window: 125K tokens. KV cache uses TurboQuant 3-bit NC. Speculative decoding via MTP with 3 draft tokens. Cudagraph mode set to PIECEWISE — this is the critical setting that makes MTP work without garbling output (the default FULL mode breaks speculative decoding on this rig).
Hardware: RTX 3090 24GB, NVIDIA driver 580.126, GPU memory at 97% utilization (23.1GB of 24.5GB). Running at 348W out of a 350W power limit, 66°C, 98% utilization during benchmark.
Key launch flags: --gpu-memory-utilization 0.97, --max-num-seqs 1, --max-num-batched-tokens 4128, --enable-chunked-prefill, --enable-prefix-caching, --reasoning-parser qwen3, --tool-call-parser qwen3_coder, --kv-cache-dtype turboquant_3bit_nc, --compilation-config.cudagraph_mode PIECEWISE, --speculative-config for MTP with 3 speculative tokens. Also applies Genesis unified patch and tolist cudagraph patch at container startup.
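Roughly reconstructed as a launch command from those flags (not the exact script; the model path, the max-model-len value for the 125K window, and the --compilation-config JSON syntax are assumptions, and the Genesis/tolist patches are applied separately at container startup):

vllm serve /models/qwen3.6-27b-autoround-int4 \
  --port 8020 \
  --max-model-len 125000 \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4128 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --kv-cache-dtype turboquant_3bit_nc \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder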
Live benchmark results from 2026-04-26: 100-token output generated at 82.4 tok/s in 1.21s total. 400-token output at 82.1 tok/s in 4.87s. 800-token output at 71.3 tok/s in 11.22s. Time-to-first-token estimated at 0.3-0.6 seconds depending on prompt length. Sustained baseline is roughly 67-89 tok/s depending on workload shape.
The PIECEWISE cudagraph setting costs about 15-20% throughput versus theoretical FULL mode speeds (which could hit 100+ tok/s) but FULL mode produces garbled, repeating output when combined with MTP speculative decoding on this hardware. The tradeoff is worth it — clean output at 82 tok/s beats garbled output at 108 tok/s.
Bottom line: 27B parameter model, INT4 quantized, running single-GPU on a consumer 3090, delivering 82 tokens per second with sub-second first-token latency and full reasoning/tool-calling support.
allknowncloud@reddit
Nice, what is your vllm config/parameters? And do you use it with multimodal enabled?
Born-Caterpillar-814@reddit
Interestingly, I was not able to run the full context length on a 5090 using your vLLM launch config without going OOM. I am using vLLM 0.19.1 though. I was able to start with 131k context. The GPU does not run anything else (e.g. monitor output). Any idea why this happens?
Performance-wise it's fast; I still have to test how good the coding output is.
Kindly-Cantaloupe978@reddit (OP)
you need to patch the kv calcs issue (see links to my previous posts in OP)
audiophile_vin@reddit
Try the vllm nightly image
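Something along these lines should do it (the image tag and wheel index are from memory, so verify against the vLLM docs):

# nightly Docker image (tag assumed; check Docker Hub)
docker pull vllm/vllm-openai:nightly
# or nightly wheels (index URL assumed from the vLLM install docs)
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly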
gliptic@reddit
Are the linked KLD measurements using fp8 KV cache, though?
Dany0@reddit
f16 kv
Kindly-Cantaloupe978@reddit (OP)
IDK, but this would still suggest choosing this quant over NVFP4 given better KLD and smaller model size
WetSound@reddit
I think I have to dual-boot, I'm only getting 70-80 tps in WSL
ComfyUser48@reddit
What is the difference in quality vs unsloth official quants? Is it like Q8? Q6? Help me understand
Kindly-Cantaloupe978@reddit (OP)
It's INT4 (so 4-bit)
ComfyUser48@reddit
So this is comparable to Unsloth Q4, just faster. Should I expect similar performance in coding agents?
HareMayor@reddit
No, INT4 is the oldest format for running in 4-bit (RTX 20 series era).
After that, Q4 GGUFs came out that are significantly better, and then Unsloth's UD Q4s, which apparently have the best size-to-quality ratio (that's the whole reason Unsloth is famous).
And the latest is NVFP4, which has quality close to Q8 but size close to Q5-Q6.
NVFP4's speed benefits only apply to the 50 series, but it will still run like any model of comparable size on older cards.
ImportantSignal2098@reddit
Ah yes when I say 4 I usually mean 5-6.
Ell2509@reddit
They are using proper terms. NVFP4 is the name. The size is comparable to an older Q5 or Q6, with older Q8 performance, and on 5000-series cards that also comes with up to double the speed.
Kindly-Cantaloupe978@reddit (OP)
Give it a try - so far works very well for me
YourNightmar31@reddit
Is there any 27B INT4 GGUF somewhere? Or am I asking for something stupid? :)
Kindly-Cantaloupe978@reddit (OP)
There should be, but I don't know if it will get you the same speed with llama.cpp or another server.
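If a Q4 GGUF does show up, a minimal llama-server launch would look something like this (the filename is a placeholder; nobody in this thread has confirmed an official GGUF upload):

llama-server \
  --model /models/Qwen3.6-27B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --port 8080

Whether it matches the vLLM numbers, which lean on MTP speculative decoding, is another question.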
Tormeister@reddit
Relevant thread for 27B KLDs