GLM 4.6 UD-Q6_K_XL running llama.cpp RPC across two nodes and 12 AMD MI50 32GB
Posted by MachineZer0@reddit | LocalLLaMA | View on Reddit | 63 comments
Finally got another six MI50 32GB. Removed my old Nvidia Titan Vs from my 2nd HP DL580 Gen9.
Here we go.
running on secondary host:
~/llama.cpp.20251012/build/bin/rpc-server --host 0.0.0.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 4: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 5: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Starting RPC server v3.0.0
endpoint : 0.0.0.0:50052
local cache : n/a
Devices:
ROCm0: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm1: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm2: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm3: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm4: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
ROCm5: AMD Radeon Graphics (32752 MiB, 32694 MiB free)
Accepted client connection
Then on primary host:
~/llama.cpp/build/bin/llama-server --model ~//models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 94 --temp 0.6 --ctx-size 131072 --host 0.0.0.0 --rpc 192.168.1.xxx:50052 --alias GLM-4.6_RPC
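For reference, the --rpc flag takes a comma-separated list of endpoints, so additional RPC hosts can be stacked into the same run. A minimal sketch with a hypothetical extra node (192.168.1.yyy is a placeholder):
# hedged sketch: attach a second RPC backend to the same llama-server instance (second IP is hypothetical)
~/llama.cpp/build/bin/llama-server --model ~/models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf \
  --n-gpu-layers 94 --ctx-size 131072 --host 0.0.0.0 \
  --rpc 192.168.1.xxx:50052,192.168.1.yyy:50052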
Observations (vs single node, 6x MI50 32GB with GLM 4.6 Q3_K_S):
- Prompt processing is about the same on smaller prompts: 62-65 tok/s
- Text generation: 7.5 tok/s vs 8.5 tok/s (UD-Q6_K_XL vs Q3_K_S)
- Each server idles at ~350W. During inference, 1-2 GPUs at a time round-robin across the 12 GPUs at 100-170W, while the rest (10-11 GPUs) sit at ~20W.
Prior experiment:
https://www.reddit.com/r/LocalLLaMA/comments/1nxv7x6/performance_of_glm_46_q3_k_s_on_6x_mi50/
serige@reddit
How are your 2 nodes connected? If the secondary host doesn't have access to the model, how long does it take to transfer the necessary parts of the model before you can run your first prompt?
MachineZer0@reddit (OP)
They are connected over 10GbE SFP+.
I rsync'ed the files over before launching llama-server, but it still took quite some time to start serving, though less time than the rsync itself.
Curious if it transferred the GGUFs straight into the RPC server's GPU VRAM.
CheatCodesOfLife@reddit
It used to do that, and it was a real pain to start large models / tweak the -ot regex (a 5 minute wait after each OOM).
Earlier in the year they added a -c flag that stores tensors in ~/.cache/llama.cpp which made it faster to re-load models.
I haven't tried it since the big update last week where you don't need a separate rpc server per GPU.
MachineZer0@reddit (OP)
The -c flag did make the RPC server cache tensors (see above). However there is no noticeable difference in speed to load weights.
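For reference, a minimal sketch of the cached variant of the secondary host's command, assuming the same build and defaults as above (per the comment, -c stores tensors under ~/.cache/llama.cpp):
# hedged sketch: run the RPC server with the local tensor cache enabled so model reloads can skip the network copy
~/llama.cpp.20251012/build/bin/rpc-server --host 0.0.0.0 -c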
__E8__@reddit
Excellent setup for some real science!
Have you tried row vs layer split modes in llama.cpp? I suppose this probably still needs work, but a little test can't hurt. MLDataScientist showed row splitting (tensor parallel) gets quite a bit of perf with vLLM. Though I suppose for your setup you'd want to do TP within each node and stack the nodes by layers. Dunno if llama.cpp can do it like that.
But what I've been pondering, and which your warhorse can answer, is: how well does speculative decoding work under such conditions? Normally, on small numbers of MI50s there isn't enough spare compute to let spec dec shine. But with all the latency from the RPC business, there might be enough spare pipeline cycles for spec dec to matter.
MachineZer0@reddit (OP)
Shockingly, speculative decoding had worse performance. It lost 15-18 tok/s PP and 1 tok/s TG.
Maybe because a 0.6B draft model is not a match for a 357B?
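For anyone wanting to reproduce the experiment, a minimal sketch of how a draft model would be attached to the same llama-server launch; the draft GGUF path is hypothetical and exact flag names can differ between llama.cpp builds:
# hedged sketch: speculative decoding with a small draft model (draft path is a placeholder; flags may vary by build)
~/llama.cpp/build/bin/llama-server \
  --model ~/models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf \
  --model-draft ~/models/qwen3-0.6b-q8_0.gguf \
  --gpu-layers-draft 99 --draft-max 16 --draft-min 1 \
  --n-gpu-layers 94 --ctx-size 131072 --host 0.0.0.0 \
  --rpc 192.168.1.xxx:50052 --alias GLM-4.6_RPC_specdec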
segmond@reddit
GLM is a complex model that's more taxing to infer. Although DeepSeek is bigger, I can infer DeepSeek faster on the same hardware. Kimi K2 is bigger than DeepSeek and GLM, and it even infers faster than both. So the story is not just about the total size of the model, but the complexity of the model.
__E8__@reddit
Interesting. What are the most complex models in your opinion? Least? Where does Gemma lie on your spectrum? Gemma's time to first token is usually way faster than most models, so TTFT might be a proxy for model complexity?
Have you ever seen spec dec work really well (like +25%)? 10% more tok/s is the best I've personally seen, and it amounts to a .2 to 5 tok/s improvement. Not worth the trouble in my experiments thus far (normal chat & overnight batch jobs).
CheatCodesOfLife@reddit
With MoEs it's mostly about active parameters. Kimi-K2 has fewer active parameters than DeepSeek-R1. All the Gemma-3's will be faster than both, especially since you can easily offload them fully to VRAM.
__E8__@reddit
I think your draft choice is fine. I use the same one for my GLM 4.5 experiments.
That sounds like what I measure too. For smaller models: +/- 10% on 2x MI50, 0-10% on 2x 3090. And 0-10% running GLM 4.5 Q4_K_XL on 2x 3090 + NVMe.
fallingdowndizzyvr@reddit
It's not shocking at all. My experience with spec decoding is along the same lines.
cantgetthistowork@reddit
Your PP speeds are worse than a DDR5 rig's.
MachineZer0@reddit (OP)
Each server is worth $500-700. GPUs are about $225 each.
Reproducible for about $3,900.
How much is a DDR5 setup?
cantgetthistowork@reddit
About the same for 768GB + 1x3090
CheatCodesOfLife@reddit
Yeah they're pretty shit for MoEs, but for dense models they're pretty good bang for buck.
egomarker@reddit
I'm always interested in the cost per million tokens for these kinds of rigs.
MachineZer0@reddit (OP)
Best case I'm doing 60 x 60 x 7.5 = 27,000 output tokens per hour, so it would take about 37 hours to generate 1M output tokens. My setup draws about 850W during inference, at $0.28/kWh.
37 hours x 0.85 kW x $0.28/kWh = $8.80 per million output tokens.
On smaller contexts, prompt processing takes 4-6 seconds versus about 10 mins for very verbose output with thinking tokens; the ratio is about 1:100. So in theory add another 8 cents for input, but that's only about 10k tokens.
Definitely not worth it unless you have < $0.07/kWh or an utmost need for privacy and can't pay the upfront cost of $65k for a quad Blackwell workstation.
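A quick way to sanity-check that estimate, using only the figures quoted above (7.5 tok/s, ~850W, $0.28/kWh):
# hedged sketch: recompute the cost per million output tokens from the numbers in the comment above
awk 'BEGIN {
  tps   = 7.5        # output tokens per second
  kw    = 0.85       # average draw during inference, kW
  price = 0.28       # electricity cost, $/kWh
  hours = 1e6 / (tps * 3600)                       # ~37 hours per 1M output tokens
  printf "%.1f hours, $%.2f per 1M output tokens\n", hours, hours * kw * price
}'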
aetherec@reddit
With so many MI50s, llama.cpp is not the way to go.
Use vLLM or SGLang with tensor parallel. Not sure if SGLang works, but I know vLLM gfx906 will be a lot better at least.
_hypochonder_@reddit
Dense models are faster with vLLM gfx906 but MoE models aren't optimized.
>https://www.reddit.com/r/LocalLLaMA/comments/1nme5xy/4x_mi50_32gb_reach_22_ts_with_qwen3_235ba22b_and/
>Qwen3-235B-A22B-AWQ (TP 4) - TG 22t/s; PP 290t/s
Qwen3-235B-A22B-Instruct-2507-MXFP4_MOE.gguf also runs at tg128 = 21 t/s with llama.cpp on my machine (4x AMD MI50).
Chromix_@reddit
There were some recent reports that KV cache quantization reduced speed a lot with the GPT-OSS MoE models. Maybe it's worth a try here to run without KV quant and halve the context size to still fit in VRAM. The current 8 tps inference speed seems rather slow given the relatively fast VRAM on the MI50s. Maybe it's just RPC overhead though.
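A minimal sketch of that test, reusing the OP's launch command but dropping the q8_0 cache types (so the KV cache stays f16) and halving the context:
# hedged sketch: same launch as the OP's, without KV cache quantization and with half the context
~/llama.cpp/build/bin/llama-server --model ~/models/GLM-4.6-UD-Q6_K_XL-00001-of-00006.gguf \
  --n-gpu-layers 94 --temp 0.6 --ctx-size 65536 --host 0.0.0.0 \
  --rpc 192.168.1.xxx:50052 --alias GLM-4.6_RPC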
fallingdowndizzyvr@reddit
Hm... no. I went through this with someone in the last week or so. Here are some results both with and without KV quanting. While it's a tad slower at low context, at high context KV quanting is quite a bit faster for PP.
Chromix_@reddit
Maybe the issue is specific to GPT-OSS then, or the RPC overhead masks it.
fallingdowndizzyvr@reddit
Those numbers are from GPT-OSS. That's what it means when it says "gpt-oss".
MachineZer0@reddit (OP)
(Before/after screenshots.)
Performance about the same, pp: 65 tok/s, tg: ~7.5 tok/s
Chromix_@reddit
Thanks, was worth a try. There must be some other - hopefully solvable - performance bottleneck then.
jacek2023@reddit
finally an RPC example on r/LocalLLaMA, this should be saved for later guys :)
fallingdowndizzyvr@reddit
Finally? I posted about it when it first hit and have pretty much continually posted about it ever since.
https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacpp_now_supports_distributed_inference/
jacek2023@reddit
Yes, I upvoted your post a year ago; I don't see your other posts (your account is probably hidden)
CheatCodesOfLife@reddit
I've got 2 of his old posts bookmarked for that reason lol
fallingdowndizzyvr@reddit
Well then, it wasn't "finally", was it? Since you upvoted my post a year ago, you already knew that.
Posts not showing up in a profile doesn't mean they aren't visible. Like this post. You're seeing it right now.
random-tomato@reddit
It feels like RPC is basically abandoned in llama.cpp so I'm relieved that someone has tested it out and it still works haha.
fallingdowndizzyvr@reddit
What? I posted about it all the time. Like all the time. I think I posted about it yesterday.
jacek2023@reddit
yes, but I was searching for this week's posts
jacek2023@reddit
abandoned?
https://github.com/ggml-org/llama.cpp/pull/16441
https://github.com/ggml-org/llama.cpp/pull/16276
plus many more
Long_comment_san@reddit
Imagine we had a sub-$1000 card with 96GB of VRAM, with CUDA and driver support.
fallingdowndizzyvr@reddit
Why stop there, imagine if we had 192GB of VRAM for $10.
Long_comment_san@reddit
What I said is quite realistic though. 1GB of LPDDR is way under $10 nowadays, more like the $3-7 range. And a 3060-4060 class GPU costs less than $200 for sure.
fallingdowndizzyvr@reddit
Well, don't we already have that then? It's called a Max+ 395. That's 3060-4060 class. If you factor in the pretty decent CPU and other incidentals like a SSD, case, power supply, whatever. All that is worth $700. So you get the GPU and 96GB for that $1000 you are talking about. You have to put a GPU card into something anyways.
Long_comment_san@reddit
It's not a GPU at all. And it's not $700, it's almost $1700 on sale. The best you can do at $700 is 32GB currently. And there's a bit of an issue that it's usually thermally limited to oblivion.
fallingdowndizzyvr@reddit
It is a GPU. The only difference between an iGPU and a dGPU is the "i" and the "d". "I" meaning it's integrated, "d" meaning is discrete. None of that changes whether it's a GPU or not.
As for system RAM versus VRAM, the only thing that matters is speed. And the Max+ 395 system RAM is comparable to 4060 VRAM.
Who said it was $700? I didn't. Why are you saying it?
"If you factor in the pretty decent CPU and other incidentals like a SSD, case, power supply, whatever. All that is worth $700."
Yeah, that includes the "decent CPU and other incidentals like a SSD, case, power supply, whatever." that's worth $700. So $1700 - $700 = $1000 for the GPU component.
Except it's not. I've shown that over and over and over and over again.
That cost a lot more. Like a lot more. I thought you were all about it being cheap.
No. It won't. Run a large dense model and the Max+ 395 will leave the 5090 + RAM in the dust, as AMD marketing made a point of showing. People said that was unfair, since of course it would beat down a 5090 when the entire model doesn't fit and system RAM makes it crawl.
MachineZer0@reddit (OP)
We'll wait ~3 years until used Blackwell hits that level.
Long_comment_san@reddit
I know, and it kind of sucks because we'll get the RAM but not the GPU tech.
exaknight21@reddit
Bro, it would be WILD to see a 3060 32GB sub-$500.
Long_comment_san@reddit
Yeah.
AllYouNeedIsVTSAX@reddit
Could you give us a build spec? Real curious about this.
MachineZer0@reddit (OP)
2x HP DL580 Gen9, each with:
- 4x E7 v4 processors
- 576GB DDR4-2400
- 1TB SSD
- 6x MI50 32GB
- built-in dual 10GbE
a_beautiful_rhind@reddit
And here I thought that my 290w idle with model loaded was bad.
panchovix@reddit
250W on my PC with a loaded model, 7 GPUs + a 9900X.
Life is suffering when electricity is $0.25 per kWh (Chile). I just keep it powered off most of the time, as I can't go lower than that.
a_beautiful_rhind@reddit
I did the total cost with the fees and it comes out to 18-20c for me. Going to have to get in the habit of unloading the models and doing suspend/resume on the driver. Or maybe Nvidia fixes the driver one day and the 3090s can idle at 5W like the 2080 Ti.
MachineZer0@reddit (OP)
But sir, for the science.. ;)
LagOps91@reddit
That's... honestly not that impressive? Maybe 2x the speed of a consumer PC with a mix of VRAM and RAM for Q3_K_S. I don't quite have enough RAM+VRAM, but on a 10GB smaller quant I get about 5 t/s at 4k context and 3.5 t/s at 16-32k context.
woahdudee2a@reddit
might be because RPC itself is slow
llama-impersonator@reddit
this setup basically uses 1 out of the 12 gpus at a time, it is going to be super compute limited
LagOps91@reddit
well, no. they did run the Q3 version on a single cluster and it wasn't that much faster.
soshulmedia@reddit
I get 10 tok/s @ IQ2_XXS over 5x MI50 32GiB with a short prompt / smallish context in a low-bandwidth, low-lane, low-CPU rig. Maybe something worth trying as an alternative?
Sidenote for anyone struggling with similar setups: 'pci=realloc,nocrs' on the kernel command line worked wonders for me to solve all the PCI address range and BAR/ReBAR allocation errors.
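For anyone wanting to apply that, a sketch of the GRUB edit; the file path and update-grub tool are Debian/Ubuntu-style assumptions, adjust for your distro:
# hedged sketch: prepend the PCI parameters to the default kernel command line, then regenerate the GRUB config and reboot
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&pci=realloc,nocrs /' /etc/default/grub
sudo update-grub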
ortegaalfredo@reddit
Making llama.cpp RPC not crash is an achievement on the level of the invention of Transformers.
nomorebuttsplz@reddit
what's the total power draw? 350ish*12?
MachineZer0@reddit (OP)
Idle: 350W x 2 = 700W.
Inference: (350W x 2) + (170W x 2) = 1040W max, but probably closer to 850W.
Servers each have 4 CPUs, 576GB via 16GB DIMMs, and 4 power supplies. Could probably optimize on a different model with 2 CPUs, 4 DIMMs, and 1 power supply and halve the idle power.
nomorebuttsplz@reddit
This is pretty good performance overall. Does inference or PP slow down at higher contexts?
MachineZer0@reddit (OP)
Yes, it slows down. On Q3_K_S, a 10k context took about 20 mins of PP. I think this will be similar.