Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6
Posted by ai-infos@reddit | LocalLLaMA | View on Reddit | 49 comments
[32 MI50 32GB setup]
moonshotai/Kimi-K2.6 int4 @ 9.7 tok/s TG (136-token output) and 263 tok/s PP (14,564-token input) on vllm-gfx906-mobydick
Github link of vllm fork: https://github.com/ai-infos/vllm-gfx906-mobydick
Power draw: ~640W (idle) / ~4800W (peak inference)
Is it worth it? No, unless you’ve got solar panels or free energy…
Setup details:
It’s just two 16-GPU nodes that I plugged together with a 10G Ethernet cable. You can find details on a single 16-GPU node here:
https://github.com/ai-infos/guidances-setup-16-mi50-deepseek-v32
Commands I run (node 0, then node 1):
NCCL_SOCKET_IFNAME=eno1 GLOO_SOCKET_IFNAME=eno1 PYTHONUNBUFFERED=1 VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=1200 OMP_NUM_THREADS=4 \
FLASH_ATTENTION_TRITON_AMD_REF="TRUE" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG \
python3 -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node=16 --master_addr=10.0.0.8 --master_port=29500 /llm/models/shared/openai_server_kimi.py 2>&1 | tee log.txt
NCCL_SOCKET_IFNAME=eno1 GLOO_SOCKET_IFNAME=eno1 PYTHONUNBUFFERED=1 VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=1200 OMP_NUM_THREADS=4 \
FLASH_ATTENTION_TRITON_AMD_REF="TRUE" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" VLLM_LOGGING_LEVEL=DEBUG \
python3 -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node=16 --master_addr=10.0.0.8 --master_port=29500 /llm/models/shared/openai_server_kimi.py 2>&1 | tee log.txt
the script "openai_server_kimi.py" is just based on official vllm example with torchrun (modified to support openai api..and not really optimized... the vllm default command that included torchrun didn't work for me, need more investigation to debug...), i can share it on github too if there's any interest (but need to be more optimized)
ps: I still haven’t written a full setup guide for this because I’m not really satisfied with the performance. First, this setup runs at PCIe Gen3 x8 and Gen4 x4; all links are supposed to reach 7GB/s, but one is stuck at 3.5GB/s (due to riser instability). Theoretically, if I manage a new setup with max PCIe bandwidth (28GB/s at x16, or 14GB/s at x8) in TP8 PP4 (or TP4 PP8), and with an optimized vLLM software stack, I believe we could jump to 600-1000 PP and 9-12 TG (without MTP). At that point this setup might be interesting compared to a hybrid setup (DDR5 + RTX 6000 Pro, etc.), but I think I’m done with all of it and might just enjoy small models, which are much faster on smaller setups.
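For reference, those per-link numbers come from this kind of back-of-the-envelope (the ~0.88 efficiency factor on top of the raw line rate is an assumption; real risers and switches will vary):

```python
# Rough PCIe link bandwidth estimate; 0.88 effective efficiency is assumed.
GBPS_PER_LANE = {"gen3": 0.985, "gen4": 1.969}  # GB/s per lane after encoding overhead
EFFICIENCY = 0.88                               # assumed protocol/riser overhead

def link_bw(gen: str, lanes: int) -> float:
    return GBPS_PER_LANE[gen] * lanes * EFFICIENCY

for gen, lanes in [("gen3", 8), ("gen4", 4), ("gen4", 8), ("gen4", 16)]:
    print(f"{gen} x{lanes}: ~{link_bw(gen, lanes):.1f} GB/s")
# gen3 x8 ≈ 6.9, gen4 x4 ≈ 6.9, gen4 x8 ≈ 13.9, gen4 x16 ≈ 27.7 GB/s
```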
Feel free to ask any questions and/or share any comments.
FullOf_Bad_Ideas@reddit
This is beautiful.
Can it maintain 8-16 concurrent requests at low context?
ai-infos@reddit (OP)
1/ I didn’t run that test because I’d have to debug vLLM (and maybe Ray) to get the vLLM OpenAI server API working.
My slop vibe-coded script ( https://github.com/ai-infos/guidances-setup-32-mi50-kimi-k26/blob/main/openai_server_kimi_torchrun_vllm.py ) uses only the vLLM engine plus FastAPI, for single-user inference.
But I think that even if I get it working properly with vLLM, the 10G Ethernet link between the two pipeline stages could become a bottleneck, depending on the prompt/batch sizes (see the rough estimate at the end of this reply).
2/ Recently I saw this: https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/3 , which got an impressive PP of 700 (at low context) with DDR5 + an RTX 6000 Pro (and I think even a single 5090 can produce decent PP with DDR5). If I had seen that at the beginning of last year, that setup would have been the better choice (the price difference wasn’t that big, I think).
Anyway, I’m going back to my 6x3090 setup to try running MiniMax M2.7 with vLLM and LMCache (with RAM offloading for full context), and I’ll use my own brain for agentic code (I think it will consume less than 4.8kW for the same or better output, lol; but I’ll come back to this kind of big setup when LLM + multi-agent harnesses become really smart with zero humans in the loop).
3/ 405*2 = 810, so yes, theoretically it could run it, since there’s 1024GB of VRAM (I never thought of that, as that LLM wasn’t so powerful).
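Rough back-of-the-envelope on that 10G link (hidden size and activation dtype are my assumptions, nothing here is measured):

```python
# Per token crossing the PP stage boundary, one hidden-state vector goes over the wire.
HIDDEN = 7168        # assumed hidden size for this model
BYTES = 2            # assumed fp16 activations
LINK_BPS = 10e9 / 8  # 10GbE ≈ 1.25 GB/s

def transfer_ms(tokens: int) -> float:
    return tokens * HIDDEN * BYTES / LINK_BPS * 1e3

print(f"decode, batch 16:   {transfer_ms(16):.2f} ms/step")    # tiny, latency dominates
print(f"prefill, 14564 tok: {transfer_ms(14564):.0f} ms/pass")  # starts to hurt
```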
FullOf_Bad_Ideas@reddit
1 - it's TP16 DP2 right? I don't think 10G link would be a bottleneck as long as you don't TP over it. Even with larger batches. I think you can get 200 t/s output out of it with concurrency.
2 - yeah cpu offload can work for Kimi but right now it's probably much more expensive than even 32 Mi50s, and I think it'd start tanking massively with concurrency as it would stress RAM more.
You can also try to run low-bpw Qwen 3.5 397B exl3 quant, it may (or not) be better than Minimax M2.7. https://huggingface.co/cpral/qwen397b_28bpw_opt_2026_04_13 - it should be possible to make an even better low bit quant but I didn't have a need for it yet so I didn't focus on it
beryugyo619@reddit
so what's the tg if 32-way queued? is it like 32x of that 9.7tok/s speed?
ai-infos@reddit (OP)
TG is mostly memory-bandwidth bound, so with 32 MI50s at ~1TB/s each, you might think the whole thing behaves like a unified system with 32TB/s.
But that’s not the case: the GPUs are interconnected over PCIe (at Gen3 x8 you get ~7GB/s; Gen4 x4 is ~7GB/s unidirectional too) >> that’s one of the bottlenecks limiting TG (and because of the 10G cable we’re forced to use TP16 PP2 instead of TP32, so there’s another bottleneck there, etc.)
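A quick roofline-style sanity check of why (assuming ~32B active params for this MoE; everything here is estimated from public specs, not measured):

```python
# If TG were purely HBM-bandwidth bound, this rig would do on the order of 1000 tok/s.
ACTIVE_PARAMS = 32e9     # assumed ~32B active params per token
BYTES_PER_PARAM = 0.5    # int4 weights
HBM_BW = 1e12            # ~1 TB/s per MI50
GPUS_PER_STAGE = 16      # TP16 per node, PP2 across nodes

bytes_per_stage = ACTIVE_PARAMS * BYTES_PER_PARAM / 2       # half the layers per stage
t_token = 2 * bytes_per_stage / (GPUS_PER_STAGE * HBM_BW)   # both stages, in sequence
print(f"bandwidth-only ceiling: ~{1 / t_token:.0f} tok/s")  # ~1000 tok/s
# Actual single-stream TG is ~9.7 tok/s, so per-layer all-reduce latency over PCIe
# (and the 10G hop) is what dominates, not the raw HBM bandwidth.
```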
beryugyo619@reddit
I mean, Kimi K2.6 is a 1T-32B MoE, so if it’s distributed over 32 cards, wouldn’t TG scale nearly linearly up to 32 concurrent streams for benchmark tasks, and then get exponentially slower, in theory, as long as tasks weren’t fighting over the same experts?
xandep@reddit
Mad respect (for you and the cards).
I just love my 32GB MI50. Now that they’re expensive at around 400-500 bucks, they may not be The Best card to purchase (I guess?), but I’m getting 1100pp/100tg (max) on Qwen3.6 35B (around 300pp/30tg on the 27B) at about 180W with full F16 context. I don’t know of another card near that price that can do the same.
2 of those and (fingers crossed) Qwen3.6 122B would be SOLID.
ai-infos@reddit (OP)
agree with you
Now at 400-500 it’s not as interesting, especially if you can get a 3090 at 900-1000, since PP is generally 4x faster on a 3090.
But yeah, it does amazingly well on Qwen3.6 35B; someone ran a benchmark with this vLLM fork and it’s quite impressive:
https://arkprojects.space/wiki/AMD_GFX906/vllm/benchmark/Qwen3.6-35B-A3B
3000pp+ 69tg+ !
Similar-Republic149@reddit
Woah, you have to share your config; for Qwen 3.6 35B I only get 62 tk/s.
AustinM731@reddit
Does Ray not work with these GPUs across multiple nodes? Or do you just get better performance using torch.distributed.run?
ai-infos@reddit (OP)
Ray worked with DeepSeek V3.2 (TP8 PP2), but I couldn’t make it work with GLM 5.1 AWQ or this Kimi K2.6 int4 model... I got hangs during the first prompt and don’t know why...
(When I tried torch.distributed.run, aka torchrun, on DeepSeek V3.2 and compared it to Ray, I didn’t notice any speed difference for a single user, but I guess there might be a difference for multi-user inference.)
MundanePercentage674@reddit
Why buy so many MI50s instead of something else?
ai-infos@reddit (OP)
Cheap af when I bought them (~$110 per MI50 32G on Alibaba, without shipping and insurance; not the case anymore today...)
MundanePercentage674@reddit
Bro, the power bill plus one loose GPU cable connection... happy troubleshooting 😂.
No_Algae1753@reddit
640 WATTS AT IDLE ?!?!?! WTF
FullstackSensei@reddit
Wanna bet it's a few hundred watts more?
I have several 32GB Mi50s. They idle at ~20W each on average. There are a ton of fans there, probably a ton of active PCIe switch cards, there are the base systems running the show, and who knows how many PSUs all pitching in. I’d say it’s over 1kW idle.
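A rough breakdown of where that could come from (every figure below is a guess):

```python
# Ballpark idle-power estimate for a rig like this; every number is an assumption.
gpus         = 32 * 20    # ~20 W idle per Mi50
fans         = 32 * 6     # a few watts per high-static-pressure fan
pcie_switch  = 8 * 15     # active riser / PLX switch cards, if present
base_systems = 2 * 100    # two hosts: CPU, RAM, drives, NICs
psu_overhead = 1.10       # ~10% conversion loss at low load, which is optimistic

total = (gpus + fans + pcie_switch + base_systems) * psu_overhead
print(f"~{total:.0f} W at idle")   # ≈ 1270 W
```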
No_Algae1753@reddit
Why even bother with them when the idle draw is so high? Like, 20 watts is insane to me just for one card alone.
FullstackSensei@reddit
Because I don't leave them to idle for long? My P40s and 3090s idle at 9W and I don't leave them on when not in use either. I genuinely don't understand why people feel the need to keep these machines on 24/7 when they use them only for a fraction of that.
Idle power doesn't concern me at all, because they're off more than 2/3rd of the time if you count weekends, and when they're on they spend way more than half that time crunching or generating tokens, and when I'm done I just shut it down. I run server boards with IPMI, so powering up any or even all is a couple of taps away on my phone.
No_Algae1753@reddit
Well, there are some of us out here who use LLMs a lot (I’m one of them). It would be annoying to me to always have to start up and shut down just for a single prompt, for example. But I guess use cases vary.
FullstackSensei@reddit
Have a secondary small machine for that. I have a Jetson agx xavier 64GB on all the time. Idles at 7W and runs 36W under full load. I keep the latest 30-40B MoE models there at Q8. It's a lot cheaper than keeping 400B models on all the time to answer small questions
No_Algae1753@reddit
Makes sense
reacusn@reddit
Don't those idle at 20w as well? All the 3090s I've seen idle at 20-25w vs the 3090 tis at less than 10w.
FullstackSensei@reddit
Huh? s as in plural. There was no super in Ampere.
cantgetthistowork@reddit
My rig is never idle. Even while I’m sleeping I’ll give it a massive to-do list.
rorowhat@reddit
Set the clocks to a lower point for idle; I think you can save a lot more power.
zsydeepsky@reddit
It would be the ideal system to run in the Arctic, lol.
DeepOrangeSky@reddit
Lol, very nice.
A few questions pertinent to your setup:
What is the average response time of the firefighters in your town?
Are you on good terms with all the firefighters in your town?
Have you ever slept with the ex (or non-ex) wives of any of the firefighters in your town? If so, did any of the firefighters find out about it?
How well maintained is the engine and transmission and starter-motor of the fire truck of the nearest station to your house?
ai-infos@reddit (OP)
lol nice one! (No idea on any of those questions btw, but this monster is actually off for now; I turned it on only for the photo. I wanted to show you guys that someday you can load AGI locally if you add up a bunch of consumer GPUs, but at the cost of high energy requirements.)
starkruzr@reddit
that is an insanely low TG number for 32 goddamn cards, jfc
ProfessionalSpend589@reddit
These are 2 computers connected over 10G Ethernet. Latency over the network will limit things.
ai-infos@reddit (OP)
Actually, not really: it’s TP16 PP2, with each pipeline stage on its own node.
The 10G link would become the bottleneck if I did large-batch prompting with very big prompts; before that, the bottleneck is inside the node (PCIe bandwidth/latency, the 26.5 TFLOPS FP16 of each card, etc.).
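Rough numbers on why I say that (active-parameter count and achievable utilization are assumptions):

```python
# Compute-only ceiling for prefill: even at a modest MFU, the cards' FP16 throughput
# allows far more than the ~264 tok/s measured, so the PCIe links (and the low MFU
# actually achievable on gfx906) are what limit PP. All numbers are assumed.
TFLOPS_FP16 = 26.5e12   # per MI50
N_GPUS = 32
ACTIVE_PARAMS = 32e9    # assumed active params per token
MFU = 0.20              # assumed achievable utilization

flops_per_token = 2 * ACTIVE_PARAMS                         # ~2 FLOPs per param per token
pp_ceiling = N_GPUS * TFLOPS_FP16 * MFU / flops_per_token
print(f"compute-only PP ceiling: ~{pp_ceiling:.0f} tok/s")  # ~2650 tok/s
```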
Jumpy_Fuel_1060@reddit
It is my understanding that TG doesn't really scale up across cards. It's more that it maintains the same rate based on memory bandwidth. If a single layer was on a slower RAM speed then TG would be slower.
This is analogous to wiring batteries up in parallel: same voltage (TG), super amperage (model size).
MikeRoz@reddit
It's like 0.4 tok/s faster than I get running all but one layer of the same model at the same precision in DDR5.
(Also, it's hard to imagine this model replying with only 136 tokens. Probably has thinking off.)
ai-infos@reddit (OP)
If you add an RTX 6000 Pro or a 5090 to your setup, you’ll get 700-150 tok/s PP (with llama.cpp).
(The 136-token output was from a simple prompt, "many thanks", after I had sent it a code review. To me, compared to Qwen3.5, the model doesn’t overthink too much; it depends on the context and what you ask.)
koibKop4@reddit
This is super impressive! ❤️❤️❤️
Adventurous-Paper566@reddit
😲
Kulqieqi@reddit
Seriously, what is wrong with all of these rigs with cards hanging around, like literally hanging on some cord? Is it that hard to 3D print a rack for it so it takes less space, looks good, and doesn’t risk snapping?
beryugyo619@reddit
done is better than perfect, ain't stupid if it works
FullstackSensei@reddit
Cabling is a nightmare. You want as much space as you can get, because you’ll possibly spend days before everything shows up in the OS and runs without cards randomly dropping out or the whole thing crashing.
Risers and multi-PSU setups are really hard to, well, set up.
Kulqieqi@reddit
excuses excuses
MachineZer0@reddit
Glorious.
Xylend@reddit
640W, 4800W.... I felt a great disturbance in the Grid, as if millions of transformers suddenly cried out in terror and were suddenly silenced
More seriously, respect! Pulling such an infrastructure is not an easy task.
Jumpy_Fuel_1060@reddit
This is so cool! Thanks for sharing. What a madman xD
sn2006gy@reddit
Appreciate the honesty
sloptimizer@reddit
You go, you glorious bastard!
MotokoAGI@reddit
You don't need all those GPUs, can I get 10 please?
Legal-Ad-3901@reddit
still impressive af
LegacyRemaster@reddit
respect!
ghgi_@reddit
Just ask kimi on how to steal electricity from your neighbors then I think this build will be complete /s