MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)

Posted by ai-infos@reddit | LocalLLaMA | 67 comments


TL;DR: The numbers in the title are for single-request inference, measured with two prompts of 1k and 15k tokens.
So: no MTP (it's slower for large prompts), no DFlash (it works too, but is also slower for large prompts), and no quantization (full precision was the goal), and the results are pretty good for a 2018 card.
(The benchmark was run with TP8, but the unquantized model also fits with TP2 and runs pretty fast too, around 34 tps TG.)

IMO, fully usable with Claude Code, Hermes, or any other agentic harness.
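For reference, any OpenAI-compatible client can talk to the endpoint that the serve command below exposes. A minimal curl sketch (host, port, and served model name taken from that command; the prompt is just a placeholder):

```shell
# Smoke-test the OpenAI-compatible chat endpoint vLLM serves on port 8000
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3.6-27B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```

An agentic harness is pointed at the same base URL (`http://localhost:8000/v1`) the way it would be at any OpenAI-compatible backend.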

I think there's still room to go higher by updating the software and hardware stacks, e.g. a PCIe switch with lower latency, more optimized DFlash/MTP without overhead for ROCm/gfx906, etc.

Inference engine used (vLLM fork v0.20.1 with ROCm 7.2.1): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main

Hugging Face model used (no quant): Qwen/Qwen3.6-27B

Main commands to run:

docker run -it --name vllm-gfx906-mobydick \
  -v /llm:/llm \
  --network host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --group-add $(getent group render | cut -d: -f3) \
  --ipc=host \
  aiinfos/vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \
    /llm/models/Qwen3.6-27B \
    --served-model-name Qwen3.6-27B \
    --dtype float16 \
    --max-model-len auto \
    --max-num-batched-tokens 8192 \
    --block-size 64 \
    --gpu-memory-utilization 0.98 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --mm-processor-cache-gb 1 \
    --limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 --skip-mm-profiling \
    --default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 \
    --port 8000 2>&1 | tee log.txt

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
  --dataset-name random \
  --random-input-len 10000 \
  --random-output-len 1000 \
  --num-prompts 4 \
  --request-rate 10000 \
  --ignore-eos 2>&1 | tee logb.txt
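Since `--ignore-eos` forces exactly `--random-output-len` tokens per request and the random dataset fixes the input length, the token totals in the results below are fully determined by the bench flags; a quick check:

```shell
# Token totals implied by the bench config: 4 prompts, 10k in / 1k out each
awk 'BEGIN {
  n = 4; in_len = 10000; out_len = 1000
  printf "total input tokens:  %d\n", n * in_len    # 40000, as in the results
  printf "total output tokens: %d\n", n * out_len   # 4000, as in the results
}'
```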

RESULTS:

============ Serving Benchmark Result ============
Successful requests:                     4
Failed requests:                         0
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  121.54
Total input tokens:                      40000
Total generated tokens:                  4000
Request throughput (req/s):              0.03
Output token throughput (tok/s):         32.91
Peak output token throughput (tok/s):    56.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          362.03
---------------Time to First Token----------------
Mean TTFT (ms):                          32874.56
Median TTFT (ms):                        35622.63
P99 TTFT (ms):                           47843.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          88.66
Median TPOT (ms):                        85.94
P99 TPOT (ms):                           108.67
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.66
Median ITL (ms):                         73.61
P99 ITL (ms):                            74.26
==================================================
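A quick sanity check of the aggregates above from the raw counts (note this 4-way concurrent 10k-prompt run is a different, heavier scenario than the single-request 52.8 tps TG quoted in the title):

```shell
# Cross-check the reported aggregates from the raw numbers in the table
awk 'BEGIN {
  dur  = 121.54   # benchmark duration (s)
  gen  = 4000     # total generated tokens
  tpot = 88.66    # mean time per output token (ms)
  printf "output throughput:  %.2f tok/s\n", gen / dur     # ~32.91, matches
  printf "per-stream decode:  %.2f tok/s\n", 1000 / tpot   # ~11.28 per request
}'
```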