Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild

Posted by Scared-Biscotti2287@reddit | LocalLLaMA

Been following the infrastructure side of AI more lately and stumbled on this from Zai. They moved a thousand-GPU cluster running GLM-5.1 coding inference off the standard ROFT (rail-optimized fat-tree) network setup and onto something they built called ZCube, developed with Tsinghua University and HarnetsAI.

The numbers from production:

- Switch and optical module costs down 33%

- GPU inference throughput up 15%

- P99 tail latency on first token dropped 40.6%

Same GPUs, same software stack, same model. Only the network architecture changed.

The actual problem they were solving is interesting. With Prefill-Decode (PD) disaggregated inference, KV Cache transfers create highly asymmetric traffic between nodes. ROFT topology handles training workloads fine, but with PD disaggregation the traffic patterns don't match the static rail mapping, so you get hotspots on specific Leaf switches and PFC backpressure building up behind them.
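To make the hotspot idea concrete, here's a toy sketch of my own (not Zai's code or data). It assumes the usual rail-optimized convention where the GPU with local index i always lands on Leaf switch i, and the flow list is made up to mimic a few GPUs sending most of the KV Cache bytes. It ignores routing, PFC and everything else real.

```python
from collections import Counter

def leaf_loads_static(flows, num_rails):
    """Sum bytes per Leaf when GPU index i is pinned to Leaf (i % num_rails).

    flows: list of (src_gpu_index, bytes). Both the pinning rule and the
    flow list are illustrative assumptions, not measurements.
    """
    loads = Counter()
    for src_gpu, size in flows:
        loads[src_gpu % num_rails] += size
    return [loads[leaf] for leaf in range(num_rails)]

def imbalance(loads):
    """Ratio of the busiest Leaf to the average Leaf (1.0 means perfectly even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

NUM_RAILS = 8
# Hypothetical skewed traffic: a few sources carry most of the bytes.
flows = [(0, 400), (1, 400), (0, 400), (8, 400)] + [(g, 50) for g in range(2, 8)]

static_loads = leaf_loads_static(flows, NUM_RAILS)
print(static_loads, round(imbalance(static_loads), 2))
```

With this made-up input, Leaf 0 ends up carrying several times the average load while most other Leaves sit nearly idle, which is the kind of skew the post is describing.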

ZCube addresses it by going fully flattened: it removes the Spine layer entirely and uses a complete bipartite interconnect between two switch groups. That eliminates a whole category of congestion that ROFT can't avoid by design.
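If "complete bipartite interconnect" sounds abstract, here's a minimal sketch of just that graph property. The group sizes and switch names are hypothetical, and how GPUs actually attach to the switches in ZCube isn't covered here, so treat it as an illustration of the topology term only.

```python
from collections import deque
from itertools import product

def complete_bipartite(group_a, group_b):
    """Link every switch in group_a to every switch in group_b (no Spine tier)."""
    adj = {s: set() for s in (*group_a, *group_b)}
    for a, b in product(group_a, group_b):
        adj[a].add(b)
        adj[b].add(a)
    return adj

def hops(adj, src, dst):
    """Breadth-first search: number of switch-to-switch links on a shortest path."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical sizes: 4 switches per group.
group_a = [f"A{i}" for i in range(4)]
group_b = [f"B{i}" for i in range(4)]
adj = complete_bipartite(group_a, group_b)

link_count = sum(len(peers) for peers in adj.values()) // 2
print(link_count, hops(adj, "A0", "B3"), hops(adj, "A0", "A1"))
```

In this graph any switch in one group reaches any switch in the other over a single link, and two switches in the same group are two links apart through any switch of the opposite group, so there are as many alternative paths as there are switches on the other side.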

The cost reduction while getting better performance is the part that stands out. Usually you pay more for better network hardware. Here they cut switch and optical module costs by a third and got 15% more throughput out of the same GPUs.
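Quick back-of-envelope on how those two numbers combine. This is my own arithmetic, and it assumes the 33% and the 15% both apply to the same cluster. It only covers switch and optical module cost, not GPUs or anything else.

```python
# Figures quoted in the post, expressed as ratios against the ROFT baseline.
network_cost_ratio = 1 - 0.33   # switch + optical module cost
throughput_ratio = 1 + 0.15     # GPU inference throughput

# Network cost per unit of inference throughput, relative to the baseline.
cost_per_throughput = network_cost_ratio / throughput_ratio
reduction_pct = (1 - cost_per_throughput) * 100

print(f"network cost per unit throughput: {reduction_pct:.1f}% lower")
```

So under that assumption, network hardware cost per unit of throughput drops by roughly 42%, more than either headline number on its own.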