Bench 8xMI50 MiniMax M2.7 AWQ @ 64 tok/s peak (vllm-gfx906-mobydick)
Posted by ai-infos@reddit | LocalLLaMA | 9 comments
Inference engine used (vllm fork): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main
Huggingface Quants used: cyankiwi/MiniMax-M2.7-AWQ-4bit
Relevant commands to run:
docker run -it --name vllm-gfx906-mobydick-mixa3607 -v ~/llm/models:/models --network host --device=/dev/kfd --device=/dev/dri --group-add video \
--group-add $(getent group render | cut -d: -f3) --ipc=host mixa3607/vllm-gfx906:0.19.1-rocm-7.2.1-aiinfos-20260405173349
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG NCCL_DEBUG=INFO vllm serve \
  /models/MiniMax-M2.7-AWQ-4bit \
--served-model-name MiniMax-M2.7-AWQ-4bit \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--trust-remote-code \
--max-model-len 196608 \
--gpu-memory-utilization 0.94 \
--enable-log-requests \
--enable-log-outputs \
--log-error-stack \
--dtype float16 \
--tensor-parallel-size 8 --port 8000 2>&1 | tee log.txt
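Once the server is up it exposes the standard OpenAI-compatible API on the port from the command above; a minimal smoke-test sketch (the model name and port come from the serve command, the helper name and prompt are just illustrative):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request for the local vllm server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("MiniMax-M2.7-AWQ-4bit", "Write a haiku about GPUs.")
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```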
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
--dataset-name random \
--random-input-len 10000 \
--random-output-len 1000 \
--num-prompts 4 \
--request-rate 10000 \
--ignore-eos 2>&1 | tee logb.txt
RESULTS
============ Serving Benchmark Result ============
Successful requests: 4
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 125.90
Total input tokens: 40000
Total generated tokens: 4000
Request throughput (req/s): 0.03
Output token throughput (tok/s): 31.77
Peak output token throughput (tok/s): 64.00
Peak concurrent requests: 4.00
Total token throughput (tok/s): 349.48
---------------Time to First Token----------------
Mean TTFT (ms): 37281.45
Median TTFT (ms): 37480.25
P99 TTFT (ms): 58355.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 88.39
Median TPOT (ms): 88.22
P99 TPOT (ms): 109.47
---------------Inter-token Latency----------------
Mean ITL (ms): 88.39
Median ITL (ms): 66.85
P99 ITL (ms): 73.62
==================================================
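The derived figures in the table follow directly from the raw totals; a quick sanity check (all inputs taken from the results above):

```python
# Recompute the benchmark's derived throughput figures from its raw totals.
duration_s = 125.90
input_tokens = 40_000    # 4 prompts x 10000 random input tokens
output_tokens = 4_000    # 4 prompts x 1000 output tokens (--ignore-eos)
num_requests = 4

output_tps = output_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s
req_per_s = num_requests / duration_s

print(f"{output_tps:.2f} {total_tps:.2f} {req_per_s:.2f}")
# → 31.77 349.48 0.03  (matches the reported lines)
```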

FINAL NOTES:
To me, perf is "acceptable" for agentic coding use cases, and the output quality is pretty good for the model's size. This setup might be a reliable alternative to a 3090 build (it's much cheaper) or to a CPU/GPU hybrid setup (it's faster at prefill/decode). Don't hesitate to ask any questions.
twnznz@reddit
How many PCIe lanes to each card and what PCIe speed? How is PP@4096?
ai-infos@reddit (OP)
pcie 3.0 (8GT/s), and it was supposed to be 6 gpus at x16 and 2 at x8
but recently i noticed that the slimsas risers i use are not steady, and some of my pcie links might have been downgraded to fewer lanes or even a lower pcie speed...
so i moved the 6 cards onto linkup risers, updated the gpu vbios to also enable p2p, and now everything runs at pcie 4.0 (16GT/s): x16 for 6 gpus and x8 for 2 gpus
i launched the same benchmark and the results were not so different from before (despite the big pcie speed jump and the "true p2p"...)
i didn't try PP@4096, but for a 14710-token prompt the prefill phase took 23s (single request), so ~639 tok/s PP
(which is pretty decent for this old gpu with low compute: 26.5 TFLOPS fp16)
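The PP number quoted above is just prompt tokens over prefill time; a trivial check with the figures from the reply:

```python
# Prefill throughput from the single-request timing quoted above.
prompt_tokens = 14_710
prefill_seconds = 23

pp_tok_per_s = prompt_tokens / prefill_seconds
print(int(pp_tok_per_s))  # → 639 tok/s, as stated
```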
TechSwag@reddit
As a fellow Mi50 owner, very interesting to see.
Just curious - what's the rough performance delta between the vLLM fork and llama.cpp? I have 3x Mi50s and I've got my llama.cpp/llama-swap stack down pretty good, but always looking for better performance.
ai-infos@reddit (OP)
i haven't tried llama.cpp in a long time, but last time i did (qwen3 235b), perf (prefill and decode) was always lower than vllm (especially as context grows)
vllm might be harder to set up, but once it works, it usually works better than llama.cpp (thanks to tensor parallelism, MTP, chunked prefill, etc.)
sleepingsysadmin@reddit
That's an interesting looking setup. Are those gpus just laying there?
ai-infos@reddit (OP)
thanks and yes, the gpus are attached with garden wires on a shelf
xandep@reddit
*chef's kiss*
Makers7886@reddit
Those are really good peak speeds. I need to re-bench, because I swear I got 60 t/s via vllm with the same quant but on 8x3090s, and I recall it being a sustained, solid 60. I didn't like the model for my purposes, so I didn't test much other than running it through comparison benches (it scored between a 397b 4bit and a 122b fp8).
ai-infos@reddit (OP)
thanks, good to know. the mean here is around 30 tok/s actually, and when i send a big prompt for code review (16k+ tok), it drops to around 26 tok/s
an 8x3090 setup will normally also get much better prefill speed (thousands of tok/s vs ~350 tok/s here)