2x Instinct MI50 32G running vLLM results

Posted by NaLanZeYu@reddit | LocalLLaMA

I picked up these two AMD Instinct MI50 32G cards from a second-hand trading platform in China. Each card cost me 780 CNY, plus an additional 30 CNY for shipping. I also grabbed two cooling fans to go with them, each costing 40 CNY. In total, I spent 1730 CNY, which is approximately 230 USD.

Even though it’s a second-hand trading platform, the seller claimed they were brand new. Three days after I paid, the cards arrived at my doorstep. Sure enough, they looked untouched, just like the seller promised.

The MI50 cards can’t output video (even though they have a miniDP port). To use them, I had to disable CSM completely in the motherboard BIOS and enable the Above 4G decoding option.

System Setup

Hardware Setup

One MI50 is plugged into a PCIe 3.0 x16 slot, and the other is in a PCIe 3.0 x8 slot. There’s no Infinity Fabric Link between the two cards.
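Without an Infinity Fabric Link, any traffic between the two cards presumably has to cross PCIe, so the x8 slot is the narrower path. As a rough sketch of the theoretical per-direction limits, assuming the standard PCIe 3.0 figures of 8 GT/s per lane with 128b/130b encoding (these numbers come from the PCIe spec, not from measurements on this system, and the function name is just for illustration):

```python
# Theoretical PCIe 3.0 bandwidth per direction, ignoring protocol overhead
# beyond line encoding. Assumed constants: 8 GT/s per lane, 128b/130b encoding.
GT_PER_SECOND_PER_LANE = 8.0
ENCODING_EFFICIENCY = 128 / 130
BITS_PER_BYTE = 8


def pcie3_bandwidth_gb_per_s(lanes: int) -> float:
    """Return the theoretical one-way bandwidth in GB/s for a PCIe 3.0 link."""
    return lanes * GT_PER_SECOND_PER_LANE * ENCODING_EFFICIENCY / BITS_PER_BYTE


if __name__ == "__main__":
    for lanes in (16, 8):
        print(f"PCIe 3.0 x{lanes}: {pcie3_bandwidth_gb_per_s(lanes):.2f} GB/s")
```

Real-world throughput will be lower than these ceilings, and the slower link likely bounds the exchange between the two cards.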

Software Setup

I used a modified build of vLLM, because official vLLM support on AMD platforms has issues: GGUF, GPTQ, and AWQ quantized models all have problems.

vllm serve Parameters

docker run -it --rm --shm-size=2g --device=/dev/kfd --device=/dev/dri \
    --group-add video -p 8000:8000 -v /mnt:/mnt nalanzeyu/vllm-gfx906:v0.9.0-rocm6.3 \
    vllm serve --max-model-len 8192 --disable-log-requests --dtype float16 \
    /mnt/<MODEL_PATH> -tp 2
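Once the container is up, the server should be reachable on port 8000 through vLLM's OpenAI-compatible API. Below is a minimal sketch of a client using only the Python standard library. It assumes the server is on localhost and that the model name is the same path passed to vllm serve (the default when no served model name is set). The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=64,
                             base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires the server to be running; replace the placeholder path):
# print(complete("/mnt/<MODEL_PATH>", "Hello, my name is"))
```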

vllm bench Parameters

# for decode
vllm bench serve \
    --model /mnt/<MODEL_PATH> \
    --num-prompts 8 \
    --random-input-len 1 \
    --random-output-len 256 \
    --ignore-eos \
    --max-concurrency <CONCURRENCY>

# for prefill
vllm bench serve \
    --model /mnt/<MODEL_PATH> \
    --num-prompts 8 \
    --random-input-len 4096 \
    --random-output-len 1 \
    --ignore-eos \
    --max-concurrency 1
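To repeat the decode run at each concurrency level, the commands above can be generated from a small script. This is only a sketch: it uses the same flags as the commands above, and the concurrency levels 1, 2, 4, and 8 are taken from the results tables. The function names are hypothetical.

```python
def bench_command(model_path, concurrency, input_len, output_len, num_prompts=8):
    """Return the vllm bench serve argument list for one benchmark run."""
    return [
        "vllm", "bench", "serve",
        "--model", model_path,
        "--num-prompts", str(num_prompts),
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--ignore-eos",
        "--max-concurrency", str(concurrency),
    ]


def decode_sweep(model_path, levels=(1, 2, 4, 8)):
    """Decode runs: 1 input token, 256 output tokens, one run per level."""
    return [bench_command(model_path, c, input_len=1, output_len=256)
            for c in levels]


def prefill_command(model_path):
    """Prefill run: 4096 input tokens, 1 output token, concurrency 1."""
    return bench_command(model_path, 1, input_len=4096, output_len=1)


# Example (requires vllm on PATH and a running server):
# import subprocess
# for cmd in decode_sweep("/mnt/<MODEL_PATH>"):
#     subprocess.run(cmd, check=True)
```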

Results

~70B 4-bit

| Model | Size | Quant | 1x Concurrency | 2x Concurrency | 4x Concurrency | 8x Concurrency | Prefill |
|---|---|---|---|---|---|---|---|
| Qwen2.5 | 72B | GPTQ | 17.77 t/s | 33.53 t/s | 57.47 t/s | 53.38 t/s | 159.66 t/s |
| Llama 3.3 | 70B | GPTQ | 18.62 t/s | 35.13 t/s | 59.66 t/s | 54.33 t/s | 156.38 t/s |

~30B 4-bit

| Model | Size | Quant | 1x Concurrency | 2x Concurrency | 4x Concurrency | 8x Concurrency | Prefill |
|---|---|---|---|---|---|---|---|
| Qwen3 | 32B | AWQ | 27.58 t/s | 49.27 t/s | 87.07 t/s | 96.61 t/s | 293.37 t/s |
| Qwen2.5-Coder | 32B | AWQ | 27.95 t/s | 51.33 t/s | 88.72 t/s | 98.28 t/s | 329.92 t/s |
| GLM 4 0414 | 32B | GPTQ | 29.34 t/s | 52.21 t/s | 91.29 t/s | 95.02 t/s | 313.51 t/s |
| Mistral Small 2501 | 24B | AWQ | 39.54 t/s | 71.09 t/s | 118.72 t/s | 133.64 t/s | 433.95 t/s |

~30B 8-bit

| Model | Size | Quant | 1x Concurrency | 2x Concurrency | 4x Concurrency | 8x Concurrency | Prefill |
|---|---|---|---|---|---|---|---|
| Qwen3 | 32B | GPTQ | 22.88 t/s | 38.20 t/s | 58.03 t/s | 44.55 t/s | 291.56 t/s |
| Qwen2.5-Coder | 32B | GPTQ | 23.66 t/s | 40.13 t/s | 60.19 t/s | 46.18 t/s | 327.23 t/s |
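One way to read these numbers is to look at how total decode throughput scales relative to a single stream, and how much each stream gets. The sketch below does that for three rows copied from the tables above. The helper names are hypothetical and the rounding to two decimals is my own choice.

```python
# Decode throughput in t/s at 1x, 2x, 4x, 8x concurrency (from the tables).
CONCURRENCY = [1, 2, 4, 8]
DECODE_TPS = {
    "Qwen2.5 72B GPTQ (4-bit)": [17.77, 33.53, 57.47, 53.38],
    "Qwen3 32B AWQ (4-bit)": [27.58, 49.27, 87.07, 96.61],
    "Qwen3 32B GPTQ (8-bit)": [22.88, 38.20, 58.03, 44.55],
}


def speedups(tps):
    """Total throughput at each level divided by the 1x throughput."""
    return [round(t / tps[0], 2) for t in tps]


def per_stream(tps):
    """Average throughput seen by each concurrent request."""
    return [round(t / c, 2) for t, c in zip(tps, CONCURRENCY)]


if __name__ == "__main__":
    for name, tps in DECODE_TPS.items():
        print(name)
        print("  speedup vs 1x:", speedups(tps))
        print("  t/s per stream:", per_stream(tps))
```

On this data, the 72B 4-bit and the 32B 8-bit models peak at 4x concurrency and lose total throughput at 8x, while the 32B 4-bit AWQ model still gains a little at 8x.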