5060ti quad-chads - vllm (the reluctant arc) - pp and tg talk
Posted by see_spot_ruminate@reddit | LocalLLaMA | 16 comments
Okay, so I have this quad 5060ti setup and for forever I have had people nagging me to try vllm. I thought it was too complicated, like varsity golf or putting on both legs of pants at the same time. Turns out, it was just laziness.
tl;dr
pp on a >10k token prompt (a car racing game in the browser with so much detail it was slowing my browser down) = Avg prompt throughput: 1444.9 tokens/s
tg on the follow up (asking it to make that car racing game not run at 1 frame per second) = Avg generation throughput: 47.4 tokens/s
draft acceptance = Avg Draft acceptance rate: 70.4% to 97.6%
Now this is from the logs (journalctl -f -u vllm.service), and I have found it hard to just grab the final pp and tg numbers like I am used to with llama.cpp. If you know a different way, then I am all ears.
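The closest I have found is just filtering the metrics lines out of the service log, something like this (the grep pattern matches the throughput lines vllm prints for me; adjust it if your version logs them differently):

# pull only the throughput lines from the vllm service log and show the last few
journalctl -u vllm.service --no-pager | grep -E "Avg (prompt|generation) throughput" | tail -n 5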
Okay, so it was actually fairly easy in the end to get vllm to work. Here are the steps I took on my Linux server:
- mkdir vllm && cd vllm
- uv venv
- source .venv/bin/activate
- uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
- vllm serve Qwen/Qwen3.6-27B-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --host 0.0.0.0 --port 9999 \
    --quantization="fp8" \
    --max-num-seqs 2 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --language-model-only
- profit.
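If you want a quick sanity check that the server is actually up before pointing anything at it, a plain curl against the OpenAI-compatible endpoint works (port 9999 and the model name come from the serve command above; swap in whatever you served):

curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.6-27B-FP8", "messages": [{"role": "user", "content": "say hi"}], "max_tokens": 32}'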
I also then just set it up as a systemd service so I can control it more easily and monitor the log output at will. I guess I am just making this so others can learn from my laziness and/or scold me for my sloth.
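For anyone who wants to copy that part too, a minimal unit file along these lines should do it (the user, paths, and venv location are assumptions based on the steps above, not a verbatim copy of mine; tack the rest of the serve flags onto the ExecStart line):

[Unit]
Description=vLLM server
After=network-online.target

[Service]
# assumes the venv from the steps above lives at /home/youruser/vllm/.venv
User=youruser
WorkingDirectory=/home/youruser/vllm
ExecStart=/home/youruser/vllm/.venv/bin/vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --host 0.0.0.0 --port 9999
Restart=on-failure

[Install]
WantedBy=multi-user.target

Drop it in /etc/systemd/system/vllm.service, run systemctl daemon-reload && systemctl enable --now vllm.service, and then journalctl -f -u vllm.service gives you the logs like above.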
Puzzleheaded_Base302@reddit
The vllm numbers look very good for $2000 worth of GPUs. This is the most cost-effective way to run local LLMs.
see_spot_ruminate@reddit (OP)
Too bad my $2k of gpu power is likely now $3k or more.
Makers7886@reddit
I'm on 4x3090s running q3.6 27b int8 at a solid 53 t/s without MTP/Dflash (I prefer lower-latency responses). I would imagine you should be able to match/beat my speeds. What are your other system specs?
see_spot_ruminate@reddit (OP)
I don't think so. I think it runs well but at some point the 5060ti just does not have the same hp. I'm not mad though, I spent less money at least (crying in less bandwidth noises from afar)
NickCanCode@reddit
Have you adjusted your cards' config, or are you using the defaults? Undervolt + overclock will use less power while also gaining a little performance when done right. The tps gain may not be significant, but you use less power and generate less heat while gaining a few tok/s for free. There is really no reason not to do it.
FYI:
https://www.reddit.com/r/StableDiffusion/comments/1s9i1yo/a_reminder_guys_undervolt_your_gpus_immediately/
There is a YouTube video mentioned in the comments on how to do it **correctly**. As for the settings for the 5060ti, you can Google them, as I am not using this card.
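If you just want a quick, reversible version of this on a headless Linux box, capping the power limit from the CLI gets you part of the way there (a power cap is not a true undervolt, and the 120 W below is only an illustrative number, not a tested 5060ti setting):

# show the current, default, and min/max allowed power limits
nvidia-smi -q -d POWER
# cap GPU 0 at 120 W (needs root; repeat with -i 1..3 for the other cards)
sudo nvidia-smi -i 0 -pl 120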
see_spot_ruminate@reddit (OP)
I have thought about it, but the entire system (according to my UPS) only uses like 80w idle and 300w on inference. Each card seems to use about 50w to 60w when active and 5w to 8w idle. Again, it's not taking up that much power. If my electric rates go up, I might consider it, but for now it's pretty cheap to run.
Makers7886@reddit
I had to look it up; you are right, 3090s have a little more compute at the cost of wattage. Still, that's a good result you got with those 5060tis.
tmvr@reddit
You are talking about tg at 53 tok/s, where the limiting factor is memory bandwidth, not compute. You are faster because you have about double the memory bandwidth: 936 GB/s per card compared to OP's 448 GB/s.
grumd@reddit
Worth also mentioning that Q8_K_XL will be higher quality than FP8, closer to F16
see_spot_ruminate@reddit (OP)
There is always a trade off: good, cheap, or fast. Can’t have all 3.
grumd@reddit
Have you tried "--split-mode row" with llama? The default is "layer" and it's not as performant as row. Try it with row.
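For reference, the flag goes on the server invocation, something like this (model path, layer count, and port are placeholders, not OP's actual command):

llama-server -m /path/to/model.gguf -ngl 99 --split-mode row --host 0.0.0.0 --port 8080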
see_spot_ruminate@reddit (OP)
I’ve tried it before. It was marginally better. Ik_llama split mode graph can get close to vllm though.
iMakeSense@reddit
What's your setup? Do you have a two slot motherboard with two external?
see_spot_ruminate@reddit (OP)
2 internal and 2 external.
The 2 internal are on x8 lanes (card max) and x1 lanes (due to shitty bifurcation options).
The 2 external are on x4 lanes each, through NVMe-to-OCuLink adapters on AG01 eGPUs.
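If you want to double-check what lanes the cards actually negotiated in a mixed setup like this, nvidia-smi can show it (standard commands, nothing specific to my build):

# GPU-to-GPU connection matrix
nvidia-smi topo -m
# current PCIe link width per card
nvidia-smi --query-gpu=index,name,pcie.link.width.current --format=csv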
overand@reddit
What were your performance numbers like with llama.cpp on the same setup?
see_spot_ruminate@reddit (OP)
Re-ran it in llama.cpp; I will put the full command in the main post with an edit, but here it is for you, with the same prompt as before in vllm:
prompt eval time = 13116.58 ms / 14448 tokens ( 0.91 ms per token, 1101.51 tokens per second)
eval time = 839108.37 ms / 9638 tokens ( 87.06 ms per token, 11.49 tokens per second)
total time = 852224.96 ms / 24086 tokens