Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild

Posted by Scared-Biscotti2287@reddit | LocalLLaMA | 27 comments

Been following the infrastructure side of AI more lately and stumbled on this from Zai. They upgraded the network architecture on a thousand-GPU cluster running GLM-5.1 coding inference from the standard ROFT setup to something they built called ZCube, developed with Tsinghua University and HarnetsAI

The numbers from production:

- Switch and optical module costs down 33%

- GPU inference throughput up 15%

- P99 tail latency on first token dropped 40.6%

Same GPUs, same software stack, same model. Just the network architecture changed

The actual problem they were solving is interesting. With Prefill-Decode (PD) disaggregated inference, KV Cache transfers create highly asymmetric traffic between nodes. The ROFT topology handles training workloads fine, but with PD disaggregation the traffic patterns don't match the static rail mapping, so you get hotspots on specific Leaf switches and PFC backpressure building up
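A toy tally of how that kind of hotspot can form, as a rough sketch only. It assumes a rail-style mapping where the GPU with local rank r on every server attaches to Leaf r; the flow list is invented for illustration and is not from Zai's measurements, and `leaf_for` and `leaf_load` are hypothetical helper names.

```python
from collections import Counter

def leaf_for(gpu_local_rank):
    # Assumed static rail mapping: local rank r on every server -> Leaf r.
    return gpu_local_rank

def leaf_load(flows):
    """Sum traffic touching each Leaf.

    flows: iterable of (src_local_rank, dst_local_rank, gigabytes).
    A flow is charged to its source Leaf, and also to its destination
    Leaf when the two differ.
    """
    load = Counter()
    for src, dst, gb in flows:
        load[leaf_for(src)] += gb
        if leaf_for(dst) != leaf_for(src):
            load[leaf_for(dst)] += gb
    return load

# Hypothetical KV Cache transfers: most of them originate on rank 0.
flows = [(0, 0, 4), (0, 1, 4), (0, 2, 4), (1, 3, 4), (2, 2, 1)]
load = leaf_load(flows)
hottest_leaf, hottest_gb = load.most_common(1)[0]
print(hottest_leaf, hottest_gb, dict(load))
```

With a fixed mapping, skewed senders pile load onto one Leaf while others sit mostly idle, which is the imbalance the post describes.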

ZCube addresses it by going fully flattened: it removes the Spine layer entirely and uses a complete bipartite interconnect between two switch groups. That eliminates a whole category of congestion that ROFT can't avoid by design
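To make "complete bipartite interconnect between two switch groups" concrete, here is a minimal sketch. The group sizes and switch names are made up, and how GPUs attach to the switches in the real ZCube design is not covered here; this only models the switch-to-switch links.

```python
from itertools import product

def complete_bipartite(group_a, group_b):
    """Every switch in group_a links to every switch in group_b."""
    return {(a, b) for a, b in product(group_a, group_b)}

def switch_hops(src, dst, links):
    """Switch-to-switch links on the shortest path (no Spine layer)."""
    if src == dst:
        return 0
    if (src, dst) in links or (dst, src) in links:
        return 1  # opposite groups: directly connected
    # Same group: relay through any switch in the opposite group.
    # Valid because the interconnect is complete and both groups are non-empty.
    return 2

# Hypothetical sizes: 4 switches per group.
group_a = [f"A{i}" for i in range(4)]
group_b = [f"B{i}" for i in range(4)]
links = complete_bipartite(group_a, group_b)

print(len(links))                     # total inter-group links
print(switch_hops("A0", "B3", links)) # cross-group path
print(switch_hops("A0", "A1", links)) # same-group path
```

In this model, two switches in the same group have as many distinct two-link paths between them as there are switches in the other group, so traffic has several relays to spread across instead of a fixed uplink.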

The cost reduction while getting better performance is the part that stands out. Usually you pay more for better network hardware. Here they cut hardware costs by a third and got 15% more throughput out of the same GPUs

kevinlch@reddit

I mean, they could have kept it secret, but instead they published it. I wish OpenAI would publish more papers like this and not just ads

Smile_Clown@reddit

I think the disconnect here is pretty wide. OpenAI is a for-profit company. OpenAI does its own research and development, which is more than likely more advanced than this.

Handing this or any other development over to a competitor is not wise as it would instantly level the playing field driving costs down, driving access down.

Redditors, randoms, think this is a great idea, lower costs for everyone! Except GPUs, datacenters, and infrastructure are not cheap, and if they cannot get the revenue they need to operate, YOUR access and capabilities go down.

We all seem to think OpenAI is just stuffing bank accounts with billions in profit. That is not the case. In addition, these open models are all building off of output from the big players, without them, we would not have this.

To serve their customer base, OpenAI needs a LOT of operating capital. They have to be the best to continue to charge the premium to continue to invest in infrastructure. It's not like they are just accumulating cash (not right now anyway).

The biggest problem with Americans (especially) and redditors/social media in general, and how they critique these kinds of things, is that they start from the greedy old white man burning hundred dollar bills in a fireplace in the middle of summer while giggling, and never really consider the bigger picture.

DeathGuppie@reddit

Used to be you could do anything over the phone, now you need an internet connection. Once every bit of your information exists on the cloud and they have locked it in, those prices for access will be whatever they decide.

This isn't a "helping mankind" not for profit venture, they see the money clearly ahead of them. When they control both the hardware and the software we will be left with the bill or be left out.

pipizich@reddit

OpenAI but with closed source, closed weight, closed paper

dbenc@reddit

open to making mad profits from investor hype

thrownawaymane@reddit

Z.ai probably learned this from Facebook.

Obviously FB isn't really doing the open model thing anymore, but in terms of writing the datacenter bible for people to read, you do in fact have to give it to them

s2k4ever@reddit

even if they do, there will be ads in them

jwpbe@reddit

occasionally on this sub you read something that makes you feel like the dumbest person on earth

Zeioth@reddit

For the ignorant: What am I looking at? I understand that it's a multi-layer load balancer?

stackfullofdreams@reddit

EVPN for the win!! If you know, you know.

Haiku-575@reddit

I believe it. I've been daily-driving GLM-5.1 as my primary coding model for a month, and have only managed to spend $5 USD because so many tokens end up being cache hits (and I only use it an hour a day or less). But a month and a half ago you could only do one request at a time (!!) and would constantly time out, and everything was slow. Some time in the last few weeks everything magically started working well and everything sped up dramatically. I thought maybe pi.dev changed the way they interact with GLM, but this (and probably other improvements on GLM's side) makes a lot more sense.

viper33m@reddit

Eli5 pls

Fluid_Protection_337@reddit

33% switch cost reduction plus 15% throughput gain on the same hardware is the kind of result that gets infrastructure teams interested fast. Real production numbers, not just benchmarks

Jumpy-Possibility754@reddit

The bottleneck keeps moving lower in the stack.

allinasecond@reddit

Is there a danger that the hardware stacks and bottlenecks keep adapting to something that might not exist in 2 years?

SIGCOMM ’25, September 8–11, 2025 ???

HavenTerminal_com@reddit

pay less, get more usually needs a footnote. the footnote is that ROFT was the problem.

Peyzxc@reddit

33% switch cost reduction plus 15% throughput gain on the same hardware is the kind of result that gets infrastructure teams interested fast. Real production numbers not just benchmarks

PetersOdyssey@reddit

1000 tok/s Deepseek 4 Pro Wen

bblankuser@reddit

when GPT 6 open sources

AnticitizenPrime@reddit

Does anyone know how to replicate their agent mode UI on their site locally? Apparently it's some sort of modified OpenWebUI, but modified how, I don't know (some plugins)?

I'm talking about how it organizes the to-dos and whatnot in the left pane, and code previews/project files on the right.

Screenshot (Discord link, hope it works)

Limp_Classroom_2645@reddit

Would be great to attach a source for this

Scared-Biscotti2287@reddit (OP)

Yes thanks for the note.

Switch and optical module costs down 33%
GPU inference throughput up 15%

=> 70% more cost-effective
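
A quick check of that figure, as a sketch: it takes the two percentages from the post at face value and treats "cost-effective" as throughput per unit of switch and optical module spend. That definition is an assumption, and it says nothing about GPU cost or total cluster cost.

```python
# Ratios relative to the old ROFT setup, using the numbers quoted in the post.
network_cost_ratio = 1 - 0.33   # switch + optical module cost down 33%
throughput_ratio = 1 + 0.15     # GPU inference throughput up 15%

# Throughput per unit of network hardware spend, relative to before.
gain = throughput_ratio / network_cost_ratio - 1
print(f"{gain:.1%} more throughput per unit of network hardware cost")
```

So "70%" is a fair round-down of about 1.15 / 0.67, but only for the network hardware line item.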