Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild
Posted by Scared-Biscotti2287@reddit | LocalLLaMA | 27 comments
Been following the infrastructure side of AI more lately and stumbled on this from Zai. They upgraded the network architecture on a thousand-GPU cluster running GLM-5.1 coding inference from the standard ROFT setup to something they built called ZCube, developed with Tsinghua University and HarnetsAI
The numbers from production:
- Switch and optical module costs down 33%
- GPU inference throughput up 15%
- P99 tail latency on first token dropped 40.6%
Same GPUs, same software stack, same model. Just the network architecture changed
The actual problem they were solving is interesting. With Prefill-Decode (PD) disaggregated inference, KV Cache transfers create highly asymmetric traffic between nodes. ROFT topology handles training workloads fine, but with PD disaggregation the traffic patterns don't match the static rail mapping, so you get hotspots on specific Leaf switches and PFC backpressure building up
ZCube addresses it by going fully flattened: it removes the Spine layer entirely and uses a complete bipartite interconnect between two switch groups. That eliminates a whole category of congestion that ROFT can't avoid by design
The cost reduction while getting better performance is the part that stands out. Usually you pay more for better network hardware. Here they cut hardware costs by a third and got 15% more throughput out of the same GPUs
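To make "complete bipartite interconnect between two switch groups" concrete, here is a minimal sketch, not ZCube's actual design. The group sizes (4 and 4), the switch names, and the helper functions are all hypothetical, and the post does not say how GPUs attach to the switches, so that part is left out. It only shows the graph property: every switch in one group links directly to every switch in the other, so there is no Spine tier to cross.

```python
from collections import deque
from itertools import product


def complete_bipartite(group_a, group_b):
    """Adjacency map where every switch in group_a links to every switch in group_b."""
    adj = {switch: set() for switch in (*group_a, *group_b)}
    for a, b in product(group_a, group_b):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops(adj, src, dst):
    """Breadth-first search: number of switch-to-switch links on a shortest path."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # unreachable


# Hypothetical sizes: 4 switches per group (the real cluster is far larger).
group_a = [f"A{i}" for i in range(4)]
group_b = [f"B{i}" for i in range(4)]
fabric = complete_bipartite(group_a, group_b)

link_count = sum(len(peers) for peers in fabric.values()) // 2
cross_group_hops = hops(fabric, "A0", "B3")       # direct link
same_group_hops = hops(fabric, "A0", "A3")        # via any switch in group B
parallel_paths = len(fabric["A0"] & fabric["A3"])  # distinct 2-hop paths A0 -> A3

print(link_count, cross_group_hops, same_group_hops, parallel_paths)
```

In this toy graph, two switches in the same group are always two hops apart and have as many equal-length paths between them as there are switches in the other group, which is the kind of path diversity that helps spread asymmetric traffic.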
jwpbe@reddit
occasionally on this sub you read something that makes you feel like the dumbest person on earth
kevinlch@reddit
I mean, they could have kept it secret, but instead they published it for the public. I wish OpenAI would publish more papers like this and not just ads
Smile_Clown@reddit
I think the disconnect here is pretty wide. OpenAI is a for-profit company. OpenAI does its own research and development that is more than likely more advanced than this.
Handing this or any other development over to a competitor is not wise as it would instantly level the playing field driving costs down, driving access down.
Redditors, randoms, think this is a great idea, lower costs for everyone! Except GPUs, datacenters and infrastructure are not cheap, and if they cannot get the revenue they need to operate, YOUR access and capabilities go down.
We all seem to think OpenAI is just stuffing bank accounts with billions in profit. That is not the case. In addition, these open models are all building off of output from the big players, without them, we would not have this.
To serve their customer base, OpenAI needs a LOT of operating capital. They have to be the best to continue to charge the premium to continue to invest in infrastructure. It's not like they are just accumulating cash (not right now anyway)
The biggest problem with how Americans (especially) and redditors/social media in general critique these kinds of things is that they start from the image of a greedy old white man burning hundred dollar bills in a fireplace in the middle of summer while giggling, and never really consider the bigger picture.
DeathGuppie@reddit
Used to be you could do anything over the phone, now you need an internet connection. Once every bit of your information exists on the cloud, and they have locked it, those prices for access will be whatever they decide.
This isn't a "helping mankind" not for profit venture, they see the money clearly ahead of them. When they control both the hardware and the software we will be left with the bill or be left out.
pipizich@reddit
OpenAI but with closed source, closed weight, closed paper
dbenc@reddit
open to making mad profits from investor hype
thrownawaymane@reddit
Z.ai probably learned this from Facebook.
Obviously FB isn't really doing the open model thing anymore, but in terms of writing the datacenter bible for people to read, you do, in fact, have to give it to them
s2k4ever@reddit
even if they do, there will be ads in them
Zeioth@reddit
For the ignorant: What am I looking at? I understand that's a multi-layer load balancer?
stackfullofdreams@reddit
EVPN for the win!! If you know, you know.
Haiku-575@reddit
I believe it. I've been daily-driving GLM-5.1 as my primary coding model for a month, and have only managed to spend $5 USD because so many tokens end up being cache hits (and I only use it an hour a day or less). But a month and a half ago you could only do one request at a time (!!) and would constantly time out, and everything was slow. Some time in the last few weeks everything magically started working well and everything sped up dramatically. I thought maybe pi.dev changed the way they interact with GLM, but this (and probably other improvements on GLM's side) makes a lot more sense.
viper33m@reddit
Eli5 pls
Fluid_Protection_337@reddit
33% switch cost reduction plus 15% throughput gain on the same hardware is the kind of result that gets infrastructure teams interested fast. Real production numbers, not just benchmarks
Jumpy-Possibility754@reddit
The bottleneck keeps moving lower in the stack.
allinasecond@reddit
Is there a danger that the hardware stacks and bottlenecks keep adapting to something that might not exist in 2 years?
s2k4ever@reddit
SIGCOMM ’25, September 8–11, 2025 ???
HavenTerminal_com@reddit
pay less, get more usually needs a footnote. the footnote is that ROFT was the problem.
Peyzxc@reddit
33% switch cost reduction plus 15% throughput gain on the same hardware is the kind of result that gets infrastructure teams interested fast. Real production numbers not just benchmarks
PetersOdyssey@reddit
1000 tok/s Deepseek 4 Pro Wen
bblankuser@reddit
when GPT 6 open sources
AnticitizenPrime@reddit
Does anyone know how to replicate their agent mode UI on their site locally? Apparently it's some sort of modified OpenWebUI, but modified how, I don't know (some plugins)?
I'm talking about how it organizes the to-dos and whatnot in the left pane, and code previews/project files on the right.
Screenshot (Discord link, hope it works)
Limp_Classroom_2645@reddit
Would be great to attach a source for this
Scared-Biscotti2287@reddit (OP)
Yes thanks for the note.
CrafAir1220@reddit
The PFC backpressure problem with ROFT in PD disaggregated setups has been a known pain point. Good to see someone actually solving it at the architecture layer instead of just throwing more bandwidth at it
sudochmod@reddit
Wut
LeTanLoc98@reddit
Switch and optical module costs down 33%
GPU inference throughput up 15%
=> 70% more cost-effective
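The roughly 70% figure above can be checked with a quick calculation. This is a sketch under one loud assumption: "cost-effective" here means GPU throughput per unit of switch and optical module spend only, since those are the only costs the post quantifies. GPU, power, and other costs are not included, so the gain in total cost-effectiveness would be smaller.

```python
# Figures quoted in the post.
network_cost_reduction = 0.33   # switch and optical module costs down 33%
throughput_increase = 0.15      # GPU inference throughput up 15%

# Assumption: cost-effectiveness = throughput / network hardware cost,
# both expressed relative to the old ROFT setup (old = 1.0).
relative_cost = 1 - network_cost_reduction
relative_throughput = 1 + throughput_increase
gain = relative_throughput / relative_cost - 1

print(f"Throughput per unit of network spend: +{gain:.1%}")
```

Under that assumption the ratio works out to 1.15 / 0.67, so about 72% more throughput per unit of network spend, in line with the rounded 70% above.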