GLM 4.6 UD-Q6_K_XL running llama.cpp RPC across two nodes and 12 AMD MI50 32GB
Posted by MachineZer0@reddit | LocalLLaMA | View on Reddit | 63 comments
Finally got another six MI50 32GB. Removed my old Nvidia Titan Vs from my 2nd HP DL580 Gen9.
Here we go.
running on secondary host:
~/llama.cpp.20251012/build/bin/rpc-server --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 4: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 5: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Starting RPC server v3.0.0
endpoint : 0.0.0.0:50052
local cache : n/a
Devices:
ROCm0: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm1: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm2: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm3: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm4: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm5: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
Accepted client connection
Then on primary host:
~/llama.cpp/build/bin/llama-server --model ~//models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 94 --temp 0.6 --ctx-size 131072 --host 0.0.0.0 --rpc 192.168.1.xxx:50052 --alias GLM-4.6_RPC
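For reference, the --rpc flag takes a comma-separated list of endpoints, so additional RPC hosts can be stacked into the same run. A minimal sketch with a hypothetical extra node (192.168.1.yyy is a placeholder):
# hedged sketch: attach a second RPC backend to the same llama-server instance (second IP is hypothetical)
~/llama.cpp/build/bin/llama-server --model ~/models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf \
  --n-gpu-layers 94 --ctx-size 131072 --host 0.0.0.0 \
  --rpc 192.168.1.xxx:50052,192.168.1.yyy:50052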
Observations (vs single node, 6x MI50 32GB with GLM 4.6 Q3_K_S):
- Prompt processing is about the same on smaller prompts: 62-65 tok/s
- Text generation: 7.5 tok/s vs 8.5 tok/s (UD-Q6_K_XL vs Q3_K_S)
- Each server idles at ~350W. During inference, 1-2 GPUs at a time round-robin across the 12 GPUs at 100-170W, while the rest (10-11 GPUs) sit at ~20W.
Prior experiment:
https://www.reddit.com/r/LocalLLaMA/comments/1nxv7x6/performance_of_glm_46_q3_k_s_on_6x_mi50/
serige@reddit
How are your 2 nodes connected? If the secondary host doesn't have access to the model, how long does it take to transfer the necessary parts of the model before you can run your first prompt?
MachineZer0@reddit (OP)
They are connected over 10GbE SFP+.
I rsync'ed the files over before launching llama-server, but it still took quite some time to start serving, though less time than the rsync itself.
Curious if it transferred the GGUFs straight into the RPC server's GPU VRAM.
CheatCodesOfLife@reddit
It used to do that, and it was a real pain to start large models / tweak the -ot regex (a 5 minute wait after each OOM).
Earlier in the year they added a -c flag that stores tensors in ~/.cache/llama.cpp which made it faster to re-load models.
I haven't tried it since the big update last week where you don't need a separate rpc server per GPU.
MachineZer0@reddit (OP)
The -c flag did make the RPC server cache tensors (see above). However there is no noticeable difference in speed to load weights.
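For reference, a minimal sketch of the cached variant of the secondary host's command, assuming the same build and defaults as above (per the comment, -c stores tensors under ~/.cache/llama.cpp):
# hedged sketch: run the RPC server with the local tensor cache enabled so model reloads can skip the network copy
~/llama.cpp.20251012/build/bin/rpc-server --host 0.0.0.0 -c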
__E8__@reddit
Excellent setup for some real science!
Have you tried row vs layer split modes in llama.cpp? I suppose this probably still needs work, but a little test can't hurt. MLDataScientist showed row splitting (tensor parallel) gets quite a bit of perf with vLLM. Though I suppose for your setup you'd want to do TP within each node and stack the nodes by layers. Dunno if llama.cpp can do it like that.
But what I've been pondering, and which your warhorse can answer, is: how well does speculative decoding work under such conditions? Normally, on small numbers of MI50s there isn't enough spare compute to let spec dec shine. But with all the latency from the RPC business, there might be enough spare pipeline cycles for spec dec to matter.
MachineZer0@reddit (OP)
Shockingly, speculative decoding had worse performance. It lost 15-18 tok/s PP and 1 tok/s TG.
Maybe because a 0.6B draft model is not a match for a 357B?
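For anyone wanting to reproduce the experiment, a minimal sketch of how a draft model would be attached to the same llama-server launch; the draft GGUF path is hypothetical and exact flag names can differ between llama.cpp builds:
# hedged sketch: speculative decoding with a small draft model (draft path is a placeholder; flags may vary by build)
~/llama.cpp/build/bin/llama-server \
  --model ~/models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf \
  --model-draft ~/models/qwen3-0.6b-q8_0.gguf \
  --gpu-layers-draft 99 --draft-max 16 --draft-min 1 \
  --n-gpu-layers 94 --ctx-size 131072 --host 0.0.0.0 \
  --rpc 192.168.1.xxx:50052 --alias GLM-4.6_RPC_specdec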
segmond@reddit
GLM is a complex model that's more taxing to infer. Although DeepSeek is bigger, I can infer DeepSeek faster on the same hardware. Kimi K2 is bigger than DeepSeek and GLM, and it even infers faster than both. So the story is not just about the total size of the model, but the complexity of the model.
__E8__@reddit
Interesting. What are the most complex models in your opinion? Least? Where does Gemma lie on your spectrum? Gemma's time to first token is usually way faster than most models, so TTFT might be a proxy for model complexity?
Have you ever seen spec dec work really well (like +25%)? 10% more tok/s is the best I've personally seen, and it amounts to a .2 to 5 tok/s improvement. Not worth the trouble in my experiments thus far (normal chat & overnight batch jobs).
CheatCodesOfLife@reddit
With MoEs it's mostly about active parameters. Kimi-K2 has fewer active parameters than DeepSeek-R1. All the Gemma-3's will be faster than both, especially since you can easily offload them fully to VRAM.
__E8__@reddit
I think your draft choice is fine. I use the same one for my GLM 4.5 experiments.
That sounds like what I measure too. For smaller models: +/- 10% on 2x MI50, 0-10% on 2x 3090. And 0-10% running GLM 4.5 Q4_K_XL on 2x 3090 + NVMe.
fallingdowndizzyvr@reddit
It's not shocking at all. My experience with spec decoding is along the same lines.
cantgetthistowork@reddit
Your PP speeds are worse than a DDR5 rig's.
MachineZer0@reddit (OP)
Each server is worth $500-700. GPUs are about $225 each.
Reproducible for about $3,900.
How much is a DDR5 setup?
cantgetthistowork@reddit
About the same for 768GB + 1x3090
CheatCodesOfLife@reddit
Yeah they're pretty shit for MoEs, but for dense models they're pretty good bang for buck.
egomarker@reddit
I'm always interested in the cost per million tokens for these kinds of rigs.
MachineZer0@reddit (OP)
Best case I'm doing 60 x 60 x 7.5 = 27,000 output tokens per hour, so it would take about 37 hours to generate 1M output tokens. My setup draws about 850W during inference, at $0.28/kWh.
37 hours x 0.85 kW x $0.28/kWh = $8.80 per million output tokens.
On smaller contexts, prompt processing takes 4-6 seconds versus about 10 mins for very verbose output with thinking tokens; the ratio is about 1:100. So in theory add another 8 cents for input, but that's only about 10k tokens.
Definitely not worth it unless you have < $0.07/kWh or an utmost need for privacy and can't pay the upfront cost of $65k for a quad Blackwell workstation.
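A quick way to sanity-check that estimate, using only the figures quoted above (7.5 tok/s, ~850W, $0.28/kWh):
# hedged sketch: recompute the cost per million output tokens from the numbers in the comment above
awk 'BEGIN {
  tps   = 7.5        # output tokens per second
  kw    = 0.85       # average draw during inference, kW
  price = 0.28       # electricity cost, $/kWh
  hours = 1e6 / (tps * 3600)                       # ~37 hours per 1M output tokens
  printf "%.1f hours, $%.2f per 1M output tokens\n", hours, hours * kw * price
}'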
aetherec@reddit
With so many MI50s, llama.cpp is not the way to go.
Use vLLM or SGLang with tensor parallel. Not sure if SGLang works, but I know vLLM gfx906 will be a lot better at least.
_hypochonder_@reddit
Dense models are faster with vLLM gfx906 but MoE models aren't optimized.
>https://www.reddit.com/r/LocalLLaMA/comments/1nme5xy/4x_mi50_32gb_reach_22_ts_with_qwen3_235ba22b_and/
>Qwen3-235B-A22B-AWQ (TP 4) - TG 22t/s; PP 290t/s
Qwen3-235B-A22B-Instruct-2507-MXFP4_MOE.gguf also runs at tg128 = 21 t/s with llama.cpp on my machine (4x AMD MI50).
Chromix_@reddit
There were some recent reports that KV cache quantization reduced speed a lot with the GPT-OSS MoE models. Maybe it's worth a try here to run without KV quant and halve the context size to still fit in VRAM. The current 8 tps inference speed seems rather slow given the relatively fast VRAM on the MI50s. Maybe it's just RPC overhead though.
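A minimal sketch of that test, reusing the OP's launch command but dropping the q8_0 cache types (so the KV cache stays f16) and halving the context:
# hedged sketch: same launch as the OP's, without KV cache quantization and with half the context
~/llama.cpp/build/bin/llama-server --model ~/models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf \
  --n-gpu-layers 94 --temp 0.6 --ctx-size 65536 --host 0.0.0.0 \
  --rpc 192.168.1.xxx:50052 --alias GLM-4.6_RPC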
fallingdowndizzyvr@reddit
Hm... no. I went through this with someone in the last week or so. Here are some results both with and without KV quanting. While it's a tad slower at low context, at high context KV quanting is quite a bit faster for PP.
Chromix_@reddit
Maybe the issue is specific to GPT-OSS then, or the RPC overhead masks it.
fallingdowndizzyvr@reddit
Those numbers are from GPT-OSS. That's what it means when it says "gpt-oss".
MachineZer0@reddit (OP)
(Before/after screenshots.)
Performance about the same, pp: 65 tok/s, tg: ~7.5 tok/s
Chromix_@reddit
Thanks, was worth a try. There must be some other - hopefully solvable - performance bottleneck then.
jacek2023@reddit
finally an RPC example on r/LocalLLaMA, this should be saved for later guys :)
fallingdowndizzyvr@reddit
Finally? I posted about it when it first hit and have pretty much continually posted about it ever since.
https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacpp_now_supports_distributed_inference/
jacek2023@reddit
Yes, I upvoted your post a year ago; I don't see your other posts (your account is probably hidden)
CheatCodesOfLife@reddit
I've got 2 of his old posts bookmarked for that reason lol
fallingdowndizzyvr@reddit
Well then, it wasn't "finally", was it? Since you upvoted my post a year ago, you already knew that.
Posts not showing up in a profile doesn't mean they aren't visible. Like this post. You're seeing it right now.
random-tomato@reddit
It feels like RPC is basically abandoned in llama.cpp so I'm relieved that someone has tested it out and it still works haha.
fallingdowndizzyvr@reddit
What? I posted about it all the time. Like all the time. I think I posted about it yesterday.
jacek2023@reddit
yes, but I was searching for this week's posts
jacek2023@reddit
abandoned?
https://github.com/ggml-org/llama.cpp/pull/16441
https://github.com/ggml-org/llama.cpp/pull/16276
plus many more
Long_comment_san@reddit
Imagine we had a sub-$1000 card with 96GB of VRAM, with CUDA and driver support.
fallingdowndizzyvr@reddit
Why stop there, imagine if we had 192GB of VRAM for $10.
Long_comment_san@reddit
What I said is quite realistic though. 1GB of LPDDR is way under $10 nowadays, more like the $3-7 range. And a 3060-4060 class GPU costs less than $200 for sure.
fallingdowndizzyvr@reddit
Well, don't we already have that then? It's called a Max+ 395. That's 3060-4060 class. If you factor in the pretty decent CPU and other incidentals like a SSD, case, power supply, whatever. All that is worth $700. So you get the GPU and 96GB for that $1000 you are talking about. You have to put a GPU card into something anyways.
Long_comment_san@reddit
It's not a GPU at all. And it's not $700, it's almost $1700 on sale. The best you can do at $700 is 32GB currently. And there's a bit of an issue that it's usually thermally limited to oblivion.
fallingdowndizzyvr@reddit
It is a GPU. The only difference between an iGPU and a dGPU is the "i" and the "d". "I" meaning it's integrated, "d" meaning is discrete. None of that changes whether it's a GPU or not.
As for system RAM versus VRAM, the only thing that matters is speed. And the Max+ 395 system RAM is comparable to 4060 VRAM.
Who said it was $700? I didn't. Why are you saying it?
"If you factor in the pretty decent CPU and other incidentals like a SSD, case, power supply, whatever. All that is worth $700."
Yeah, that includes the "decent CPU and other incidentals like a SSD, case, power supply, whatever." that's worth $700. So $1700 - $700 = $1000 for the GPU component.
Except it's not. I've shown that over and over and over and over again.
That cost a lot more. Like a lot more. I thought you were all about it being cheap.
No. It won't. Run a large dense model and the Max+ 395 will leave the 5090 + RAM in the dust, as AMD marketing made a point of showing. People said that was unfair, since of course it would beat down a 5090 when the entire model doesn't fit and system RAM makes it crawl.
MachineZer0@reddit (OP)
We'll wait ~3 years until used Blackwell hits that level.
Long_comment_san@reddit
I know, and it kind of sucks because we'll get the RAM but not the GPU tech.
exaknight21@reddit
Bro, it would be WILD to see a 3060 32GB sub-$500.
Long_comment_san@reddit
Yeah.
AllYouNeedIsVTSAX@reddit
Could you give us a build spec? Real curious about this.
MachineZer0@reddit (OP)
2x HP DL580 Gen9, each with:
- 4x E7 v4 processors
- 576GB DDR4-2400
- 1TB SSD
- 6x MI50 32GB
- built-in dual 10GbE
a_beautiful_rhind@reddit
And here I thought that my 290w idle with model loaded was bad.
panchovix@reddit
250W on my PC with a loaded model, 7 GPUs + a 9900X.
Life is suffering when electricity is $0.25 per kWh (Chile). I just keep it powered off most of the time, as I can't go lower than that.
a_beautiful_rhind@reddit
I did the total cost with the fees and it comes out to 18-20c for me. Going to have to get in the habit of unloading the models and doing suspend/resume on the driver. Or maybe Nvidia fixes the driver one day and the 3090s can idle at 5W like the 2080 Ti.
MachineZer0@reddit (OP)
But sir, for the science.. ;)
LagOps91@reddit
That's... honestly not that impressive? Maybe 2x the speed of a consumer PC with a mix of VRAM and RAM for Q3_K_S. I don't quite have enough RAM+VRAM, but on a 10GB smaller quant I get about 5 t/s at 4k context and 3.5 t/s at 16-32k context.
woahdudee2a@reddit
might be because RPC itself is slow
llama-impersonator@reddit
this setup basically uses 1 out of the 12 gpus at a time, it is going to be super compute limited
LagOps91@reddit
well, no. they did run the Q3 version on a single cluster and it wasn't that much faster.
soshulmedia@reddit
I get 10 tok/s @ IQ2_XXS over 5x MI50 32GiB with a short prompt / smallish context in a low-bandwidth, low-lane, low-CPU rig. Maybe something worth trying as an alternative?
Sidenote for anyone struggling with similar setups: 'pci=realloc,nocrs' on the kernel command line worked wonders for me to solve all the PCI address range and BAR/ReBAR allocation errors.
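For anyone wanting to apply that, a sketch of the GRUB edit; the file path and update-grub tool are Debian/Ubuntu-style assumptions, adjust for your distro:
# hedged sketch: prepend the PCI parameters to the default kernel command line, then regenerate the GRUB config and reboot
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&pci=realloc,nocrs /' /etc/default/grub
sudo update-grub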
ortegaalfredo@reddit
Making llama.cpp RPC not crash is an achievement on the level of the invention of Transformers.
nomorebuttsplz@reddit
what's the total power draw? 350ish*12?
MachineZer0@reddit (OP)
Idle: 350W x 2 = 700W.
Inference: (350W x 2) + (170W x 2) = 1040W max, but probably closer to 850W.
Servers each have 4 CPUs, 576GB via 16GB DIMMs, and 4 power supplies. Could probably optimize on a different model with 2 CPUs, 4 DIMMs, and 1 power supply and halve the idle power.
nomorebuttsplz@reddit
This is pretty good performance overall. Does inference or PP slow down at higher contexts?
MachineZer0@reddit (OP)
Yes, it slows down. On Q3_K_S, a 10k context took about 20 mins of PP. I think this will be similar.