Ran some llama.cpp RPC tests to see if it's worth it, and if 10Gbe is needed.

Posted by lemondrops9@reddit | LocalLLaMA | 32 comments

Let me first say I am not doing anything with parallelism, so if that's what you're after, these benchmarks and tests are not for you.

That said, if you're a hobbyist like me who is left wondering whether you can use the GPUs in your other PCs, then I have some answers, but I'm still learning. There is probably a better config for llama.cpp, but I haven't seen any huge gains; in fact flash attention seems to slow things down a bit, so I didn't test with it on. Also, I'm sure someone with better than consumer-level networking could get their latency down more, which should improve things. I just don't have that kind of hardware.

My main AI PC (see gpu details below) is the main for these tests. The 2nd PC has a 5070 and a 3080; I tested this PC on Windows 11, WSL, and native Linux. And for fun, one go-around with a 3rd PC with a 5060ti 16gb. Here are the results.

I did double check to be sure the RPC server was in fact being used on each run.

Starting off with the main PC only, as a control to see how well RPC works. You can see my config and the hardware used. For some reason I didn't need to rearrange my gpu order for llama-bench to work well. In all my tests this PC is the main and is running Linux Mint with Nvidia driver 590.48.0.1 and Cuda toolkit 13.1 on a 2.5gbe connection.

[Control](

Here the 2nd PC is running native Linux on a 2.5gbe connection.

[2nd PC is running 5070 & 3080](

Next is the same setup but with a 1gbe connection.

Now for Windows 11 where things get a lot slower.

[2nd PC is running 5070 & 3080](

WSL with Nvidia 595, Cuda toolkit 13.1. 2.5gbe connection

[5070 & 3080](

Same as above but used a 1gbe connection.

Still using WSL, back on 2.5gbe but using only the 3080.

[3080 only](

Same specs but only the 5070 this time around.

[5070 only](

Same as above but on a 1gbe connection.

[5070 only - 1gbe connection](

Finally, I thought I would throw a 3rd PC into the mix. The 2nd PC is running both gpus in native Linux for this test. The 3rd PC is running Windows 11 with a 5060ti 16gb on a 2.5gbe connection.

I don't know if the Windows issue is because the 3080 is running as the primary gpu for Windows, but I've had a lot of weird issues with Windows. The main takeaway after testing is that RPC is quite viable, at least with a smaller context, and a lot better when both machines are running Linux. I'm waiting for some parts so I can add the 5060ti to the 2nd PC for larger context, and I'm curious how it might scale up from here.

Oh, and on a side note, I did have an issue with Linux because it installed a generic network driver. I was getting pings around 1.5-3ms, but this was fixed before the tests.

[-]

segmond@reddit

I did this test a long time ago and posted, more than a year ago. RPC doesn't really help much with MoE. You will see solid improvements with dense models.

[-]

lemondrops9@reddit (OP)

?? Yeah, it helps run bigger models and/or larger context.

[-]

segmond@reddit

Rubbish. An MoE will run faster on one rig partially offloaded to GPU than on all GPUs across multiple rigs. I did this with deepseekv3.1 and kimik2. It was better to keep it on one rig and partially offload than to get it all on GPU, and I had fast ethernet too.

however, llama3-405b ran 4x faster distributed across all GPUs than partial offload to system ram.

[-]

lemondrops9@reddit (OP)

Are you a bot? Who the f$&k is running llama3-405b?

Also, it matters what you're offloading to, as I'm on DDR4.

[-]

CheatCodesOfLife@reddit

He's incorrect about that or maybe using AMD GPUs.

I tested GLM-4.7 IQ2_M on the latest main branch of ik_llama.cpp

full offload 6 GPUs:

prompt eval time =    1970.16 ms /  1732 tokens (    1.14 ms per token,   879.12 tokens per second)
eval time =   14273.39 ms /   661 tokens (   21.59 ms per token,    46.31 tokens per second)
total time =   16243.55 ms /  2393 tokens

full offload to 5 GPUs + 1 GPU on a remote rig

prompt eval time =    3414.84 ms /  1596 tokens (    2.14 ms per token,   467.37 tokens per second)
eval time =   24796.95 ms /   691 tokens (   35.89 ms per token,    27.87 tokens per second)
total time =   28211.79 ms /  2287 tokens

cmoe

prompt eval time =   18514.10 ms /  1732 tokens (   10.69 ms per token,    93.55 tokens per second)
eval time =   59151.19 ms /   839 tokens (   70.50 ms per token,    14.18 tokens per second)
total time =   77665.28 ms /  2571 tokens

5 GPUs with 10 ffn_exps.* on CPU:

prompt eval time =    6742.43 ms /  1732 tokens (    3.89 ms per token,   256.88 tokens per second)
eval time =   44515.14 ms /   893 tokens (   49.85 ms per token,    20.06 tokens per second)
total time =   51257.57 ms /  2625 tokens
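
To put the four runs above side by side, here is a small sketch that expresses each configuration's speed as a percentage of the 6-GPU full-offload run. The tokens-per-second figures are copied from the logs above; the run labels and the function name `relative_speed` are my own, and the prompt lengths differ slightly between runs, so treat the result as a rough comparison.

```python
# Tokens per second copied from the logs above: (prefill, generation).
runs = {
    "6 GPUs local": (879.12, 46.31),
    "5 GPUs local + 1 remote (RPC)": (467.37, 27.87),
    "cmoe": (93.55, 14.18),
    "5 GPUs + 10 ffn_exps on CPU": (256.88, 20.06),
}


def relative_speed(runs, baseline):
    """Return {name: (prefill %, generation %)} relative to the baseline run."""
    base_pp, base_tg = runs[baseline]
    return {
        name: (100.0 * pp / base_pp, 100.0 * tg / base_tg)
        for name, (pp, tg) in runs.items()
    }


rel = relative_speed(runs, "6 GPUs local")
for name, (pp, tg) in rel.items():
    print(f"{name:32s} prefill {pp:5.1f}%  generation {tg:5.1f}%")
```

On these numbers the RPC run keeps roughly 53% of the prefill speed and 60% of the generation speed, which is still ahead of both CPU-offload runs.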

[-]

segmond@reddit

1 gpu on a remote rig is minimal, i was offloading to 10-16 remote GPUs. go view my profile.

[-]

lemondrops9@reddit (OP)

Is your remote PC running Windows?

[-]

CheatCodesOfLife@reddit

Nope, Arch -> Ubuntu-server

[-]

lemondrops9@reddit (OP)

I've only seen it that much slower when the 2nd PC is Windows. But I have not played with Arch at all.

[-]

CheatCodesOfLife@reddit

It's not likely to be an Arch vs Ubuntu/other-distro thing. Probably just that I can't use graph-split / nccl over rpc.

Mainline llama.cpp is probably similar speed with 6 local GPUs vs 5 local, 1 rpc?

[-]

lemondrops9@reddit (OP)

It should be getting close to the same speed, but less for prefill.

I've tested this with a bunch of models and each time it's around 5% slower, and 20-30% slower for prefill.

But when I add a Windows PC with a 5000 series card it's 15% less speed, and when I used the 3080 it's a loss of 35%. I haven't gone crazy trying to figure out why, as my 2nd PC is running Linux now.

I didn't use graph-split for any testing.

[-]

lemondrops9@reddit (OP)

Thanks for posting your results. I haven't tried offloading to the cpu much because it ends up running a lot slower.

I only have dual Channel ddr4 3200 ram.

[-]

segmond@reddit

silly, i'm not running llama405b anymore, i ran it when it first came out, can't you read past tense?

[-]

lemondrops9@reddit (OP)

maybe if you wrote it in past tense...
Benchmarks? System Specs?

[-]

TheCityzens@reddit

Llamacpp rpc tests showed decent speed for my setup too. Worth it if local privacy matters to you. Hardware limits still show on bigger models.

[-]

NigaTroubles@reddit

Try only linux then

[-]

lemondrops9@reddit (OP)

That's my conclusion. Maybe someone out there has gotten WSL to work just as well, but I'd rather take the time to get Linux running.

[-]

ItilityMSP@reddit

WSL2 has a whole Hyper-V network switch that you can't really control, plus other network abstractions. It was never optimized for performance, and is designed to use the default network card. Skip it, it's a mess for serious tests.

[-]

lemondrops9@reddit (OP)

I find Windows overall a pain to get things working smoothly. I was more curious how Windows and WSL would compare to Linux.

[-]

ItilityMSP@reddit

WSL2 is great for what it is: it allows you to develop and run inference on Linux via driver passthrough if Windows is your daily driver for other reasons.

That's the way I use it, because I want my software to be Linux native. But networking is where you need to be careful: WSL2 network serving behavior is not native Linux behavior. localhost (127.0.0.1) is fine, but for anything more than that you need to forward the service to expose it. So as soon as software needs more than localhost, it goes on a dedicated Linux dev machine, and that's also where the agents playground is, with no internet.
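
A quick way to see whether a service is actually exposed beyond localhost is to try a plain TCP connection from the other machine. The sketch below is a minimal check, not anything llama.cpp provides: the function name `rpc_port_reachable` is made up for illustration, and the default port 50052 is an assumption about how the RPC server was started, so substitute whatever port you used.

```python
import socket


def rpc_port_reachable(host, port=50052, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only shows that something is listening on that port; it does not
    confirm that the listener is an RPC server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from the main PC, a call such as `rpc_port_reachable("192.168.1.20")` (placeholder address) returning `False` while the same check succeeds on the remote machine against `127.0.0.1` would point at the forwarding problem described above.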

[-]

NigaTroubles@reddit

Also, I forgot to say thanks for your work.

I hope it works out for you. I will try it too sometime.

[-]

lemondrops9@reddit (OP)

Thank you!

[-]

ArtfulGenie69@reddit

I'm on Linux and I use RPC constantly. I have two PCs with 2x3090 and they run qwen3.5 122b q4 @ 800t/s prefill and 55t/s tg. The only slowish part, 60-90s, is the first load, which can be mitigated by turning on the RPC cache on the slave machine. It's a great in-between, and I'm also on 2.5gb ethernet. Without MTP, btw.
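
For anyone wanting to try the cache, here is a rough sketch of the two command lines involved, built in Python so they are easy to script. It assumes the `rpc-server` options `-H`, `-p`, and `-c` (local cache) and the `--rpc` option of `llama-bench` as I understand them from the llama.cpp RPC documentation; check `--help` on your build. The helper names, the address, and the model path are placeholders.

```python
import shlex


def rpc_server_cmd(host="0.0.0.0", port=50052, cache=True):
    """Build the argv for rpc-server on the remote machine."""
    argv = ["rpc-server", "-H", host, "-p", str(port)]
    if cache:
        # Assumed flag: keep received tensors on local disk so that later
        # loads of the same model skip the network transfer.
        argv.append("-c")
    return argv


def bench_cmd(model_path, rpc_hosts, port=50052):
    """Build the argv for llama-bench on the main machine."""
    servers = ",".join(f"{host}:{port}" for host in rpc_hosts)
    return ["llama-bench", "-m", model_path, "--rpc", servers]


print(shlex.join(rpc_server_cmd()))
print(shlex.join(bench_cmd("model.gguf", ["192.168.1.20"])))
```

Binding to `0.0.0.0` exposes the server on every interface, so this only makes sense on a trusted home network.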

[-]

ikkiho@reddit

yeah I got the same shape on llama.cpp rpc. with layer split decode barely uses the wire since you're shipping per-token activations not weights, so even 1gbe was fine for me. fwiw the one place where 10gbe would have actually mattered was when I tried tensor-parallel via -sm row, bandwidth saturated immediately even on 2.5gbe. with plain split the bigger bottleneck for me was usually pcie on the rpc host, the lan never broke a sweat.
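
A back-of-envelope sketch of why layer split barely touches the wire during decode: per generated token, roughly one hidden-state vector crosses each split boundary. The hidden size of 4096, the 4 bytes per value, and the 2048-token prompt are assumptions for illustration, not figures from this thread, and real RPC traffic carries extra protocol overhead that is ignored here.

```python
def activation_bytes(n_tokens, hidden_size, bytes_per_value=4):
    """Bytes needed to ship the hidden states of n_tokens across one split boundary."""
    return n_tokens * hidden_size * bytes_per_value


def transfer_ms(n_bytes, link_gbit_per_s):
    """Milliseconds to move n_bytes at the link's nominal line rate."""
    return n_bytes * 8 / (link_gbit_per_s * 1e9) * 1e3


HIDDEN = 4096  # assumed hidden size, not taken from the thread

for link in (1.0, 2.5, 10.0):
    decode = transfer_ms(activation_bytes(1, HIDDEN), link)
    prefill = transfer_ms(activation_bytes(2048, HIDDEN), link)
    print(f"{link:>4} GbE: decode {decode:.3f} ms/token, 2048-token prompt {prefill:.1f} ms")
```

Under these assumptions a decoded token costs about 0.13 ms of wire time on 1gbe, well under the 1.5-3ms pings mentioned in the post, while a 2048-token prompt costs a few hundred milliseconds per boundary. That would be consistent with prefill taking the bigger hit over RPC and with latency mattering more than bandwidth for generation.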

[-]

lemondrops9@reddit (OP)

Do you mind sending me the config you're using for tensor-parallel? I tried -sm tensor, which gave me 1/2 the speed, and -sm row, which was 1/4 the original speed, when testing on my dual 3090s using Qwen3.6 35B A3B.

[-]

lemondrops9@reddit (OP)

Totally makes sense that tensor parallel would saturate the bandwidth. My main is running 3 of the gpus on PCIe 3.0 x1; one of those is in a wifi socket. I haven't seen much data go over the PCIe bus other than when loading the model.

I plan on getting into tensor-parallel, I'm just not sure how great it will be with my setup. So my focus has been more on increasing the total VRAM with decent speeds for large models.

[-]

JockY@reddit

What type of parallelism are you using? E.g. tensor, split, row, etc. There should be massive differences in network saturation between, say, tensor parallel and row parallel.

[-]

lemondrops9@reddit (OP)

Yes, I could see network saturation if using tensor parallel, but like I said at the start, I didn't test any of these. With a mix of gpus and 3 of them on PCIe 3.0 x1, it's been lower on my list to get going. More so with the new tech coming out, like MTP, that I'd like to try.

[-]

Appropriate_Purpose2@reddit

Interesting tests on distributed inference with llama.cpp. For production deployments, ensuring high-bandwidth, low-latency networking is key. Runcrate's platform offers optimized networking for distributed GPU workloads, which could help you scale beyond consumer hardware limitations.

[-]