5060ti quad-chads - vllm (the reluctant arc) - pp and tg talk

Posted by see_spot_ruminate@reddit | LocalLLaMA

Okay, so I have this quad 5060 Ti setup, and people have been nagging me forever to try vLLM. I thought it would be too complicated, like varsity golf or putting on both legs of your pants at the same time. Turns out it was just laziness.

tl;dr

pp on a >10k-token prompt (a browser car-racing game with way too much detail, to the point it was slowing my browser down) = Avg prompt throughput: 1444.9 tokens/s

tg on the follow-up (asking it to make the game run at more than 1 frame per second) = Avg generation throughput: 47.4 tokens/s

draft acceptance (speculative decoding) = Avg Draft acceptance rate: ranged from 70.4% to 97.6%

Now, these numbers are from the logs (journalctl -f -u vllm.service), and I have found it hard to just grab the final pp and tg figures like I am used to with llama.cpp. If you know a different way, then I am all ears.
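The best I have managed so far is that the throughput lines are at least greppable. A quick sketch for pulling the last reported figures out of the journal (assumes GNU grep, for the -P flag):

    # last reported prompt-processing (pp) figure, tokens/s
    journalctl -u vllm.service --no-pager | grep -oP 'Avg prompt throughput: \K[0-9.]+' | tail -n 1

    # last reported generation (tg) figure, tokens/s
    journalctl -u vllm.service --no-pager | grep -oP 'Avg generation throughput: \K[0-9.]+' | tail -n 1

vLLM also exposes Prometheus counters on a /metrics endpoint on the serving port, which is probably the less caveman way to track this over time.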

Okay, so it was actually fairly easy in the end to get vLLM working. Here are the steps I took on my Linux server.

  1. mkdir vllm && cd vllm

  2. uv venv

  3. source .venv/bin/activate

  4. uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

  5. vllm serve Qwen/Qwen3.6-27B-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --host 0.0.0.0 --port 9999 \
    --quantization="fp8" \
    --max-num-seqs 2 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --language-model-only

  6. profit.
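For a quick sanity check once it is up: vllm serve speaks the OpenAI-compatible API, so plain curl works. Port 9999 is what I set above, and the model name has to match whatever you served:

    curl http://localhost:9999/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "Qwen/Qwen3.6-27B-FP8", "messages": [{"role": "user", "content": "say hi"}], "max_tokens": 32}'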

I also set it up as a systemd service so I can control it more easily and monitor the log output at will (a sketch of the unit file is below). I guess I am just writing this up so others can learn from my laziness and/or scold me for my sloth.
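In case it saves someone a search, here is a minimal sketch of that unit file. The user and paths are placeholders for my layout (venv at ~/vllm/.venv from the steps above), so adjust for your box:

    # /etc/systemd/system/vllm.service
    [Unit]
    Description=vLLM OpenAI-compatible server
    After=network-online.target

    [Service]
    # placeholder user + paths, edit to taste
    User=youruser
    WorkingDirectory=/home/youruser/vllm
    ExecStart=/home/youruser/vllm/.venv/bin/vllm serve Qwen/Qwen3.6-27B-FP8 \
        --tensor-parallel-size 4 \
        --max-model-len 262144 \
        --reasoning-parser qwen3 \
        --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
        --host 0.0.0.0 --port 9999 \
        --quantization="fp8" \
        --max-num-seqs 2 \
        --enable-prefix-caching \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder \
        --language-model-only
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Then reload and enable it, after which the journalctl commands earlier in the post work as written:

    sudo systemctl daemon-reload
    sudo systemctl enable --now vllm.service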