GLM 5.1 Locally: 40tps, 2000+ pp/s

Posted by val_in_tech@reddit | LocalLLaMA

After some sglang patching and countless experiments, I managed to get the REAP-ed NVFP4 version running stable and FAST on 4x RTX 6000 Pro cards (power-limited to 350 W). Very happy with the performance and quality. Inference software is still under-optimized for these cards. I think we will see their true potential unfold later this year or early next year.

## Throughput by Context Depth

| Prefilled | PP@4096 (tok/s) | TG@512 (tok/s) | TTFR (ms) |
| --------- | --------------- | -------------- | --------- |
| 0 | 2229.0 | 42.03 | 1834 |
| 4K | 1943.6 | 41.41 | 4201 |
| 16K | 1558.9 | 39.72 | 13098 |
| 32K | 1234.2 | 38.19 | 29823 |
| 64K | 863.5 | 35.87 | 80490 |
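A quick sanity check on how the TTFR figures relate to the PP@4096 numbers in the table above. This is only a sketch under my own assumptions: TTFR is in milliseconds, "4K" means 4096 tokens (and so on), and TTFR covers the prefilled context plus the 4096-token prompt. The helper names are made up for illustration. Under those assumptions the estimate lands within 1% of every measured TTFR, which suggests PP@4096 is the effective prefill rate over the whole context at that depth.

```python
# Rows from the table above: (prefilled tokens, PP@4096 in tok/s, measured TTFR).
# Assumption: TTFR is in milliseconds and "4K" means 4096 tokens, etc.
ROWS = [
    (0, 2229.0, 1834),
    (4096, 1943.6, 4201),
    (16384, 1558.9, 13098),
    (32768, 1234.2, 29823),
    (65536, 863.5, 80490),
]

PROMPT_TOKENS = 4096  # the "@4096" in PP@4096


def estimated_ttfr_ms(prefilled, pp_tok_per_s, prompt_tokens=PROMPT_TOKENS):
    """Estimate TTFR as (prefilled + prompt tokens) / prefill rate, in ms.

    Assumption: the whole context is processed at the reported PP rate.
    """
    return (prefilled + prompt_tokens) / pp_tok_per_s * 1000.0


def relative_error(estimate, measured):
    """Absolute relative difference between an estimate and a measurement."""
    return abs(estimate - measured) / measured


if __name__ == "__main__":
    for prefilled, pp, measured in ROWS:
        est = estimated_ttfr_ms(prefilled, pp)
        err = relative_error(est, measured)
        print(f"depth={prefilled:>6}  est={est:9.0f} ms  "
              f"measured={measured:>6} ms  err={err:.2%}")
```

In practice this means that at 64K depth you wait roughly 80 seconds before the first token, even though generation itself only slows from about 42 to about 36 tok/s.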

## TG Peak (burst throughput)

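| Prefilled | 0 | 4K | 16K | 32K | 64K |
| --------- | - | -- | --- | --- | --- |
| TG peak (tok/s) | 43.00 | 42.00 | 40.00 | 39.00 | 37.00 |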

Overall experience with opencode is pretty close to Sonnet + Claude Code.

Will play with different concurrency settings this weekend.

Anyone seen better performance on this hardware?