MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant)
Posted by ai-infos@reddit | LocalLLaMA | View on Reddit | 67 comments
TL;DR: The results in the title are for single-request inference with 2 prompts of 1k and 15k tokens.
So no MTP (as it’s slower for big prompts), no DFlash (works too, but slower for big prompts), no quant used (full precision wanted), and the results are pretty good for a 2018 card.
(The bench was done with TP8, but the unquantized model also fits with TP2 and works pretty fast too, around 34 tps TG)
IMO, fully usable with Claude Code or Hermes or any other agentic harness.
I think there’s still room to go higher (by updating the software & hardware stacks, e.g. using a PCIe switch with lower latency, a more optimized dflash/mtp without overhead for rocm/gfx906, etc.)
Inference engine used (vllm fork v0.20.1 with rocm7.2.1): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main
Hugging Face model used (unquantized): Qwen/Qwen3.6-27B
Main commands to run:
docker run -it --name vllm-gfx906-mobydick -v /llm:/llm --network host --device=/dev/kfd --device=/dev/dri --group-add video --group-add $(getent group render | cut -d: -f3) --ipc=host aiinfos/vllm-gfx906-mobydick:v0.20.1rc0.x-rocm7.2.1-pytorch2.11.0
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm serve \
/llm/models/Qwen3.6-27B \
--served-model-name Qwen3.6-27B \
--dtype float16 \
--max-model-len auto \
--max-num-batched-tokens 8192 \
--block-size 64 \
--gpu-memory-utilization 0.98 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--mm-processor-cache-gb 1 \
--limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 --skip-mm-profiling \
--default-chat-template-kwargs '{"min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \
--tensor-parallel-size 8 \
--host 0.0.0.0 \
--port 8000 2>&1 | tee log.txt
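Once the server is up, a quick sanity check against vLLM's OpenAI-compatible endpoint can look like this (the prompt body is only illustrative; the model name matches --served-model-name above):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-27B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'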
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
--dataset-name random \
--random-input-len 10000 \
--random-output-len 1000 \
--num-prompts 4 \
--request-rate 10000 \
--ignore-eos 2>&1 | tee logb.txt
RESULTS:
============ Serving Benchmark Result ============
Successful requests: 4
Failed requests: 0
Request rate configured (RPS): 10000.00
Benchmark duration (s): 121.54
Total input tokens: 40000
Total generated tokens: 4000
Request throughput (req/s): 0.03
Output token throughput (tok/s): 32.91
Peak output token throughput (tok/s): 56.00
Peak concurrent requests: 4.00
Total token throughput (tok/s): 362.03
---------------Time to First Token----------------
Mean TTFT (ms): 32874.56
Median TTFT (ms): 35622.63
P99 TTFT (ms): 47843.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 88.66
Median TPOT (ms): 85.94
P99 TPOT (ms): 108.67
---------------Inter-token Latency----------------
Mean ITL (ms): 88.66
Median ITL (ms): 73.61
P99 ITL (ms): 74.26
==================================================
WhatTheFlukz@reddit
mtp has definitely been faster than no mtp on my mi50, I'm using a custom fork with MTP and rotorquant though
ai-infos@reddit (OP)
interesting but how much faster in TG with 10k+ prompts? and does it slow down PP speed?
MaruluVR@reddit
For your workflow how big is the difference between full F16 and Q8?
Seeing people run full F16 is rare so I am curious, I personally am using Q8KM.
ai-infos@reddit (OP)
sometimes for simple frontend bugs, it just fails to fix them in q8_0 while it succeeds with full F16
Q8KM is much slower than Q8_0 on MI50 setups so i don't use it
I also made a triton kernel for fp8 in the vllm mobydick fork (as fp8 is much closer to f16 than a Q8 quant), but speed is actually much worse, ~24 tps TG at TP4 (the kernel is still not fully optimized)
kwizzle@reddit
How are you cooling those gpus? And is it very loud?
koguma@reddit
Thingiverse has a bunch of stuff for these cards, or grab stuff from AliExpress. But you need blower fans attached to the back. You'll need either a giant ass case like the Phanteks Enthoo Pro 2 server case, or one of those open shell cases. I'm still messing with cooling, that's honestly my biggest bane right now. I need to swap the cards around (one doesn't post initially on power up, only on reboot, I need to move it to a CPU pcie slot) and have to reattach different fans...
ai-infos@reddit (OP)
2 small fans 50mm per gpu
they were sold with the gpu and yes that's quite loud (even at 50%) as they are not high-quality fans
(more details on related setup here: https://github.com/ai-infos/guidances-setup-16-mi50-deepseek-v32/tree/main )
into_devoid@reddit
I don’t see the point over llama.cpp. With 2xmi50 you get 50t/s with mtp, and you can run 4 agents like that with 8 cards.
ai-infos@reddit (OP)
is the perf at 50t/s also for prompts of 10k+ tok with some context depth? i don't think so
and what about the PP (prompt processing) speed?
in general, MTP is not good for agentic stuff, as prompts are quite big and MTP adds overhead, slowing down the overall speed with big prompts
into_devoid@reddit
It holds steady at 10k+, drops to 38+ around 20k, then starts to increase again up to 42k as context grows. Odd behavior, but consistent.
As a commenter mentioned, the latest pr fixes slowed it down on mi50 hardware.
PP is a little more than half on this early PR, but I had it running an agentic workflow and it was quick enough. 8 cards will of course increase PP, but it's plenty usable even with the early PR MTP penalty.
ai-infos@reddit (OP)
thanks for the info, i might give another shot to llama cpp to check how it performs on my setup (in comparison with vllm)
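For comparing the two engines, a minimal llama-bench run at a long prompt length could look like the sketch below (the GGUF path is hypothetical; -p sets the prompt tokens for the PP measurement, -n the generated tokens for TG, and -ngl 99 offloads all layers to the GPUs):
llama-bench -m /llm/models/Qwen3.6-27B-Q8_0.gguf -p 10240 -n 256 -ngl 99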
legit_split_@reddit
Yeah PP is the main bottleneck for agentic use
gh0stwriter1234@reddit
Maybe with the original PR, but perf has regressed... it's just not ready yet.
yeah-ok@reddit
I noticed that too, a lot of the MTP code was done as fast inline code, and now that it's been made safer/proper it's become slower by quite a margin.
gh0stwriter1234@reddit
I think one commenter there summed it up: there is some overhead that is no longer avoided, and rolling back to a partial match doesn't work currently, so there are several places where perf is lost.
Right_Weird9850@reddit
mtp on mi50 with 27b? Where? I've been doing all sorts of gymnastics with llama.cpp on mi50. How? Help pls
exaknight21@reddit
I have the Mi50 32 GB and use TurboQuant k_8bit and v_8bit - with MTP and Qwen 3.6 - 35B - A3B. It's pretty impressive. 70K context all on GPU, no offloading.
I use OP’s mobydick vLLM fork for Qwen3.5:4B thinking, 16K context, 4096 max gen, with 10 concurrent users to test my SaaS.
My card is power capped to 220 watts, I don't see a reason to make it higher.
I'll share the llama.cpp fork of a fella that I used, and the way I dockerized it. Currently travelling.
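The exact fork and flags used above aren't given; as a rough sketch, mainline llama.cpp exposes 8-bit K/V-cache quantization roughly like this (model path and context size are illustrative, and depending on the build the quantized V-cache also needs flash attention enabled via -fa):
llama-server -m /llm/models/Qwen3.6-35B-A3B-Q8_0.gguf \
  -ngl 99 -c 70000 \
  --cache-type-k q8_0 --cache-type-v q8_0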
koguma@reddit
How are you capped to 220 watts? I've been slowly building out my system with two mi50's but I got stuck on the cooling, and I also need to keep power levels down.
exaknight21@reddit
Not sure, but it's been like this since I got it. I got the fan blower and shroud separately from eBay. Works nice.
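For anyone who does want to set the cap explicitly, rocm-smi can do it per device; a sketch assuming device 0 and the 220 W figure mentioned above:
sudo rocm-smi -d 0 --setpoweroverdrive 220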
EugenePopcorn@reddit
Try a Q4_1-MTP quant. I'm getting 45t/s with just a single card.
Practical-Collar3063@reddit
What is the PCIe configuration of your set up ? PCIe 4.0 x8 ? for tensor parallelism and dense models it seems to be quite important.
ai-infos@reddit (OP)
PCIe 4.0 x8 at 14GB/s, where 3 of them have been downgraded to 7GB/s
actually for TP, one of the most important things is also the latency you get between them (with PCIe 3.0, which has equivalent latency on my motherboard, the token speed difference wasn't much, 0-5%)
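A quick way to confirm the negotiated link speed/width per card is lspci's link status output (1002 is AMD's PCI vendor ID):
sudo lspci -d 1002: -vv | grep -E 'LnkCap:|LnkSta:'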
TrahlenYT@reddit
your post gave me the idea: is there a website where people post their hardware configuration, vllm/llama.cpp/ollama settings and their token speeds on tested models? just like spark arena?
Mordred500@reddit
whatcanI.run
Evanisnotmyname@reddit
Spammy/dangerous website.
It's canIrununit
Mordred500@reddit
Elaborate pls
bigattichouse@reddit
Commenting, because I also have an MI50 (32G)
k_means_clusterfuck@reddit
Commenting, because I have 4 MI50s (32G) that I still need to put in a rig.
CheatCodesOfLife@reddit
!remind me 1 day
bigattichouse@reddit
no pressure
RemindMeBot@reddit
I will be messaging you in 1 day on 2026-05-14 23:55:50 UTC to remind you of this link
Vegetable-Score-3915@reddit
Same boat, 2 mi50s
havnar-@reddit
Wait… these are super cheap? What’s the catch
u23043@reddit
They aren't that cheap ($200 for 16GB, >$400 for 32GB), and the catch is that OP is using 8 of them
Such_Advantage_6949@reddit
Exactly, I kept wondering where the catch was, until I saw the 8 of them
Momsbestboy@reddit
Yep, 8 is the problem. 8x 200W is 1.6kW, and 8x300USD is 2.4k USD. Now wait for the Radeon 9600/9600D, and you will be able to buy at least two of them for the same price....
Such_Advantage_6949@reddit
I'm all Nvidia, have 248GB of Nvidia VRAM, but it's a mix of GPUs and I can't use TP with vllm due to that
WhatTheFlukz@reddit
i just got two 32gb ones recently for $335 each
XccesSv2@reddit
I got mine for 190€ 7 months ago (32GB of course!), that's a good deal, but at the current price I wouldn't buy them.
Etroarl55@reddit
Not officially supported by ROCM or AMD anymore I believe
General_Service_8209@reddit
Exactly that, unless you get into sketchy territory, you are stuck with ROCm 6.3 or earlier. Also, while these cards have lots of memory with good bandwidth even by current standards, their GPUs unfortunately haven't aged as well, and you can easily end up compute bound instead of memory bound, especially when you do training.
koguma@reddit
You can use rocm 7.2 (like OP is using). Country Boy Computers on YT shows how and has install instructions.
NickCanCode@reddit
power consumption I guess
Healthy-Nebula-3603@reddit
Interesting
vucamille@reddit
How did you make rocm 7.x work? I have rocm 7.0 and copied the kernels from an older rocblas repo but with llama.cpp, qwen 3.6 is not working.
koguma@reddit
https://youtube.com/shorts/EPZSDHWH13M?si=eAHYGK3HU8fnNaHb https://countryboycomputersbg.com/dual-instinct-mi50-32gb-running-moe-models-with-self-built-llama-cpp-gpt-oss20b-qwen330b-and-gpt-oss120b/
Fyi, use the llama.cpp install. Ollama dropped mi50 support.
vucamille@reddit
This is what I did earlier this year. It works well with Qwen 3, but not Qwen 3.6. A few months ago, the official statement from llama.cpp devs was that this approach (copying the old rocBLAS kernel files) is not right and will not be supported by llama.cpp, at least not in the main branch.
brahh85@reddit
now rocm is easier, amd supports it again, unofficially
https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/comment/oku2hli/?context=3
vucamille@reddit
Awesome! I have 2x MI50 16GB and will definitely try tensor parallelism
ai-infos@reddit (OP)
i got the docker image from mixa3607: https://hub.docker.com/r/mixa3607/rocm-gfx906/tags
that I reused to compile the vllm mobydick fork (when I checked his code, I saw that he also copied all the Tensile rocBLAS files related to gfx906 and also does a fresh compilation of RCCL for gfx906)
he also has llama.cpp docker images with rocm 7.2.1, but I don't know if qwen3.6 works there (I don't use gguf quants and prefer full precision)
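A rough sketch of trying one of those images (the tag is a placeholder to be picked from the Docker Hub tags page above, and whether the ROCm CLI tools are baked into the image is an assumption; the device/group flags mirror the docker run command from the post):
docker pull mixa3607/rocm-gfx906:<tag>
docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host mixa3607/rocm-gfx906:<tag> rocm-smi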
vucamille@reddit
Looks like docker is the way to go. Will try it out! I only have 32GB VRAM so I need the quants to leave me enough memory for the context window. Even with everything in VRAM, it starts getting slow when the context fills up, so system RAM is not an option.
Right now I am using qwen 3 30b with unsloth quants, but it sometimes behaves in a strange way. For instance, when asked to implement a function and a test, it just printfs the correct answer... I really want to try out qwen 3.6 to compare.
laul_pogan@reddit
The --mm-processor-cache-gb 1 --limit-mm-per-prompt --skip-mm-profiling combo is the right text-only approach. If the fork picks up --language-model-only (in vLLM nightly now), that's cleaner; it skips multimodal profiling entirely rather than neutering it after the fact. Same net result, less ceremony.
--load-format fastsafetensors gives 4-7x faster shard loading on cold start across TP ranks. No throughput change once loaded, but restart cycles get much faster at TP8 with 27B shards.
The 32-47s TTFT under concurrent load is roughly expected. At your reported 1569 tok/s PP single-inference, 10k tokens is ~6.4s prefill per request. Four prompts prefilling with partial batching lands you around that TTFT range. Tuning --max-num-batched-tokens lower improves TTFT fairness for short requests under mixed load without hurting peak PP.
At 0.98 gpu-memory-utilization with TP8, the VRAM math is generous (8x32GB vs ~54GB for fp16 27B), so you have real headroom. On NVIDIA with tighter per-card margins, hard hangs at 0.85+ on 27B are a real failure mode; AMD HBM may behave differently, but if you see intermittent OOM under concurrent load, 0.90-0.92 is the first thing to try.
The dFlash disable on long prompts is interesting. If attention isn't the bottleneck at long context (compute or memory bandwidth elsewhere dominates), that's a useful calibration point for the ROCm fork.
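Folded back into the serve command from the post, those suggestions would amount to something like this trimmed variant (values are illustrative, and --load-format fastsafetensors assumes the fastsafetensors package is available in the image):
vllm serve /llm/models/Qwen3.6-27B \
  --served-model-name Qwen3.6-27B \
  --dtype float16 \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.92 \
  --load-format fastsafetensors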
Ok-Measurement-1575@reddit
This is on par with my 3090s, I think?
I thought these were shit? Can you do a llama-bench?
koguma@reddit
See my post above, the YT vid shows a bench
a_beautiful_rhind@reddit
If this is 8 gpu, the dirty secret is that spreading out the compute can increase speeds. It's like when I run a model with TP on 2x3090 vs 4x. My textgen speed goes up in a properly working TP backend.
A benchmark of the same model with 2 and 4 mi50s would be more relevant for those purchasing.
dsanft@reddit
That's 362 tok/s PP, but split across 4 concurrent requests.
ai-infos@reddit (OP)
not exactly, the 362 is the "total token throughput", which includes the "output token throughput" as well
if you want the pure PP across the 4 requests, you have to do 40k/32.8s = 1219 tok/s PP
And for a single request (of 15k tok) with warmup, i even got 1569 tok/s PP
dsanft@reddit
Sorry I didn't see you're using 8 cards, that's actually probably reasonable.
Ok-Measurement-1575@reddit
Yeh, this makes more sense now.
ai-infos@reddit (OP)
no worries, just for info, with tp4, i've got ~700 tps PP (single user)
the goal of this bench was to show that by combining the max number of mi50s with TP, the PP increases
(nb: with this model, we can go to TP 8 max, as the attention head count is 24 and some tensors of the encoder MM layers require a count divisible by 16, so 8 was the only possible max)
mdda@reddit
I'm probably being dense, but where did you say how much VRAM there is per GPU (I'm guessing 32Gb), and how many MI50s are there?
ai-infos@reddit (OP)
above bench is with 8 gpu
(2 GPUs is the minimum for the bf16 model, at ~56GB)
gh0stwriter1234@reddit
Some MI50s are 16GB and others are 32GB; you have to buy the model you want.
Honestly, for what it is, the 16GB ones are pretty OK because the ratio of compute to bandwidth is better.
Chordless@reddit
When you say "no Quant", do you mean Q8? The full F16 version of this model would take like 54GB of VRAM, and your card has 32GB.
ai-infos@reddit (OP)
the above bench is with TP8, so 8 GPUs, but it can fit on 2 GPUs too
Chordless@reddit
Ooh, 8 GPUs. Got it, thanks.
putrasherni@reddit
Prompt processing ?