With 48GB VRAM on vLLM, Qwen3.6-27B-AWQ-INT4 gets only 120k ctx (fp8 KV cache). Is that normal?
Posted by Historical-Crazy1831@reddit | LocalLLaMA | 5 comments
I am using cyankiwi/Qwen3.6-27B-AWQ-INT4 with vLLM to get the speedup from speculative decoding. The model takes 20.5GB, so it should leave my 2x3090 system plenty of free VRAM, but I find it very tight. vLLM output:
(EngineCore pid=1638) INFO 04-22 19:45:40 [kv_cache_utils.py:1316] GPU KV cache size: 121,504 tokens
(EngineCore pid=1638) INFO 04-22 19:45:40 [kv_cache_utils.py:1321] Maximum concurrency for 160,000 tokens per request: 2.66x
I am running on WSL2. My vllm configuration is like:
nohup vllm serve "$MODEL" \
--served-model-name qwen3.6-27b \
--api-key "$VLLM_API_KEY" \
--max-model-len 160000 \
--max-num-seqs 2 \
--block-size 32 \
--kv-cache-dtype fp8_e4m3 \
--max-num-batched-tokens 8192 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--no-enforce-eager \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--attention-backend FLASHINFER \
--speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
--tensor-parallel-size 2 \
-O3 \
--gpu-memory-utilization 0.81 \
--chat-template /home/rum/tzhao/vllm/chat_template_dynamic_thinking.jinja \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--no-use-tqdm-on-load \
--host "$HOST" \
--port "$PORT" \
> "$LOG_FILE" 2>&1 &
My questions are:
- I am already using an fp8 KV cache and still only get ~120k context. Is that normal?
- VRAM usage keeps increasing as the context gets longer. I have to set gpu-memory-utilization below about 0.83, otherwise it eventually OOMs. Is that normal? Shouldn't vLLM pre-allocate the VRAM and never take more than it is allowed?
Thanks
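For a rough sanity check of the reported 121,504-token cache, the per-token KV cache size is 2 (K and V) × layers × KV heads × head dim × bytes per element. The hyperparameters below (48 layers, 8 KV heads, head dim 128) are placeholder assumptions for illustration, not the actual Qwen3.6-27B config; substitute the real values from the model's config.json:

```shell
#!/usr/bin/env bash
# Back-of-envelope KV cache sizing. NOTE: the layer/head numbers below are
# placeholder assumptions, NOT the real Qwen3.6-27B hyperparameters --
# read the real ones from the model's config.json.
LAYERS=48        # num_hidden_layers (assumed)
KV_HEADS=8       # num_key_value_heads (assumed)
HEAD_DIM=128     # head_dim (assumed)
BYTES=1          # fp8 KV cache -> 1 byte per element

# K and V each store LAYERS * KV_HEADS * HEAD_DIM elements per token.
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
echo "KV cache per token: ${PER_TOKEN} bytes"          # 98304 bytes = 96 KiB

# Memory needed for the 121,504 tokens vLLM reported:
TOTAL=$((121504 * PER_TOKEN))
echo "Reported cache: $((TOTAL / 1024 / 1024)) MiB"    # 11391 MiB under these assumptions
```

On the budget side: vLLM profiles a forward pass at startup and pre-allocates the KV cache up to `--gpu-memory-utilization`. With 0.81 on two 24 GiB cards, that is roughly 0.81 × 24 ≈ 19.4 GiB per GPU; after ~10.3 GiB of weights per GPU (TP=2), plus activations, CUDA graph capture, and the MTP draft head, a KV budget around this size is plausible. Memory growing past that budget at runtime would point at something outside vLLM's pre-allocation (e.g. CUDA graphs or allocator fragmentation under WSL2).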
iVoider@reddit
Set max-num-seqs to 1, or run Linux natively alongside Windows. WSL is very buggy for GPU work.
Mart-McUH@reddit
I don't use vLLM and have only used Qwen 3.5 27B so far (not yet 3.6), but for what it's worth: with 40GB VRAM (4090 + 4060 Ti) and a Q6 GGUF quant I could run 128k context in full 16-bit precision entirely in VRAM, with room to spare. So with 48GB VRAM I am pretty sure you could run Q8 with a 16-bit 128k context.
For reference, the Q6_K GGUF takes ~21.4 GB (23 GiB).
If a 20.5GB quant with an 8-bit KV cache does not fit in 48GB of VRAM, then something is badly optimized IMO.
Historical-Crazy1831@reddit (OP)
Thanks! I am going to try llama.cpp. The only reason I am using vLLM is that it supports speculative decoding; I am not sure whether llama.cpp supports it yet. It felt odd to me, which is why I am asking here whether it comes down to wrong settings on my side.
Mart-McUH@reddit
llama.cpp supports speculative decoding with another model acting as the draft model. But it does not support newer features like multi-token prediction (which may be what you were using, i.e. a kind of speculative decoding with the same model).
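For reference, classic draft-model speculative decoding in llama.cpp looks roughly like the sketch below. The model paths are placeholders, and the draft flag names have varied across llama.cpp versions (older builds used a single --draft N), so verify with `llama-server --help` on your build:

```shell
# Sketch: speculative decoding in llama.cpp with a separate small draft model.
# Model paths are placeholders; adjust flags to your llama.cpp version.
llama-server \
  -m ./qwen-27b-q6_k.gguf \
  -md ./qwen-small-draft-q8_0.gguf \
  --draft-max 16 \
  --draft-min 1 \
  -c 131072 \
  -ngl 99
```

The draft model must share the target model's tokenizer and vocabulary, otherwise the drafted tokens cannot be verified against the target's distribution.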
GroundbreakingMall54@reddit
Yeah, 120k feels tight, but that's just how vLLM works with fp8: the KV cache chews through VRAM fast. Either drop the batch size or bite the bullet and use less context.