With 48GB VRAM on vLLM, Qwen3.6-27B-AWQ-INT4 only gets 120k ctx (fp8 KV cache). Is that normal?

Posted by Historical-Crazy1831@reddit | LocalLLaMA | View on Reddit | 5 comments

I am using cyankiwi/Qwen3.6-27B-AWQ-INT4 with vLLM to get the speedup from speculative decoding. The model weights take 20.5GB, so my 2x3090 system should have plenty of free VRAM left, but it turns out to be very tight. vLLM output:

    (EngineCore pid=1638) INFO 04-22 19:45:40 [kv_cache_utils.py:1316] GPU KV cache size: 121,504 tokens
    (EngineCore pid=1638) INFO 04-22 19:45:40 [kv_cache_utils.py:1321] Maximum concurrency for 160,000 tokens per request: 2.66x

I am running on WSL2. My vLLM configuration is:

        nohup vllm serve "$MODEL" \
            --served-model-name qwen3.6-27b \
            --api-key "$VLLM_API_KEY" \
            --max-model-len 160000 \
            --max-num-seqs 2 \
            --block-size 32 \
            --kv-cache-dtype fp8_e4m3 \
            --max-num-batched-tokens 8192 \
            --enable-prefix-caching \
            --enable-auto-tool-choice \
            --no-enforce-eager \
            --reasoning-parser qwen3 \
            --tool-call-parser qwen3_coder \
            --attention-backend FLASHINFER \
            --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
            --tensor-parallel-size 2 \
            -O3 \
            --gpu-memory-utilization 0.81 \
            --chat-template /home/rum/tzhao/vllm/chat_template_dynamic_thinking.jinja \
            --default-chat-template-kwargs '{"enable_thinking": false}' \
            --no-use-tqdm-on-load \
            --host "$HOST" \
            --port "$PORT" \
            > "$LOG_FILE" 2>&1 &

My questions are:

  1. I am already using an fp8 KV cache and still only get ~120k ctx. Is that normal?

  2. The VRAM usage keeps increasing as the context gets longer. I have to set `gpu-memory-utilization` to below about 0.83, otherwise it eventually OOMs. Is that normal? Shouldn't vLLM pre-allocate the VRAM and never take more than it is allowed?
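For question 1, a rough sanity check helps: KV cache per token scales with num_layers × num_kv_heads × head_dim × dtype_bytes, counted twice (K and V). The architecture numbers below are placeholder assumptions, not the real Qwen3.6-27B values; read the actual `num_hidden_layers`, `num_key_value_heads`, and `head_dim` from the model's config.json before trusting the result:

```python
# Back-of-envelope KV cache sizing. The layer/head numbers used below are
# ASSUMED placeholders -- substitute the real values from the model's
# config.json (num_hidden_layers, num_key_value_heads, head_dim).

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed example: 48 layers, 8 GQA KV heads, head_dim 128, fp8 cache (1 byte)
per_token = kv_bytes_per_token(48, 8, 128, 1)
print(per_token)                      # -> 98304 bytes (~96 KiB) per token
print(121_504 * per_token / 2**30)    # GiB needed for the reported 121,504 tokens
```

With these assumed numbers, the 121,504-token cache would need roughly 11 GiB. Out of 48GB × 0.81 ≈ 38.9GB of budget, 20.5GB goes to weights before activations, CUDA graphs, and the speculative-decoding draft head take their share, so landing near that figure is plausible rather than obviously broken. This is only a sketch under those assumptions, not a definitive accounting of vLLM's allocator.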

Thanks