Bench 8xMI50 MiniMax M2.7 AWQ @ 64 tok/s peak (vllm-gfx906-mobydick)

Posted by ai-infos@reddit | LocalLLaMA

Inference engine used (vllm fork): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main

Hugging Face quants used: cyankiwi/MiniMax-M2.7-AWQ-4bit

Relevant commands to run:

docker run -it --name vllm-gfx906-mobydick-mixa3607 -v ~/llm/models:/models --network host --device=/dev/kfd --device=/dev/dri --group-add video \
  --group-add $(getent group render | cut -d: -f3) --ipc=host mixa3607/vllm-gfx906:0.19.1-rocm-7.2.1-aiinfos-20260405173349

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG NCCL_DEBUG=INFO vllm serve \
    /llm/models/MiniMax-M2.7-AWQ-4bit \
    --served-model-name MiniMax-M2.7-AWQ-4bit \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --max-model-len 196608 \
    --gpu-memory-utilization 0.94 \
    --enable-log-requests \
    --enable-log-outputs \
    --log-error-stack \
    --dtype float16 \
    --tensor-parallel-size 8 --port 8000 2>&1 | tee log.txt
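Once the server is up, it exposes the usual OpenAI-compatible API on the port given above. A minimal sketch of the request body you'd POST to `/v1/chat/completions` (the model name must match `--served-model-name`; the prompt is just an example):

```python
import json

# Assumes the serve command above: port 8000, served model name "MiniMax-M2.7-AWQ-4bit".
payload = {
    "model": "MiniMax-M2.7-AWQ-4bit",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}

# POST this to http://localhost:8000/v1/chat/completions, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```

The same payload works with any OpenAI-compatible client pointed at `http://localhost:8000/v1`.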

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
  --dataset-name random \
  --random-input-len 10000 \
  --random-output-len 1000 \
  --num-prompts 4 \
  --request-rate 10000 \
  --ignore-eos 2>&1 | tee logb.txt

RESULTS

============ Serving Benchmark Result ============
Successful requests:                     4
Failed requests:                         0
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  125.90
Total input tokens:                      40000
Total generated tokens:                  4000
Request throughput (req/s):              0.03
Output token throughput (tok/s):         31.77
Peak output token throughput (tok/s):    64.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          349.48
---------------Time to First Token----------------
Mean TTFT (ms):                          37281.45
Median TTFT (ms):                        37480.25
P99 TTFT (ms):                           58355.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.39
Median TPOT (ms):                        88.22
P99 TPOT (ms):                           109.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.39
Median ITL (ms):                         66.85
P99 ITL (ms):                            73.62
==================================================
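As a sanity check, the aggregate throughput lines follow directly from the raw token counts and the benchmark duration reported above:

```python
# Recompute the throughput figures from the table's raw counts.
total_input = 40_000   # Total input tokens
total_output = 4_000   # Total generated tokens
duration_s = 125.90    # Benchmark duration (s)
num_requests = 4

output_tps = total_output / duration_s                 # output token throughput
total_tps = (total_input + total_output) / duration_s  # total token throughput
req_per_s = num_requests / duration_s                  # request throughput

print(round(output_tps, 2), round(total_tps, 2), round(req_per_s, 2))
# → 31.77 349.48 0.03
```

Note that the 64 tok/s peak is roughly double the 31.77 tok/s mean because the mean is averaged over the whole run, including the ~37 s of prefill before the first token.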

FINAL NOTES:

To me, perf is "acceptable" for agentic coding use cases, and the output quality is pretty good for the model's size. This setup might be a reliable alternative to a 3090 build (much cheaper) or to a CPU/GPU setup (faster prefill and decode). Don't hesitate to ask any questions.