GLM 4.6 on 128 GB RAM with llama.cpp

Posted by ilintar@reddit | LocalLLaMA

Recently I got my hands on a new box at work with 128 GB RAM and 32 GB VRAM (it's a semi-budget option, with 2x5070, but it performs really well). I decided to try a few of the bigger models. Obviously, a very good model to run on this is GPT-OSS-120B, and it's been the default model, but I set my sights on the big ones. The GLM 4.6 REAP was a bit overwhelming, but then I thought: "what if I could get my hands on a good low quant that fits?"

So, with the help of https://huggingface.co/AesSedai I've obtained a really nice mixed quant: https://huggingface.co/AesSedai/GLM-4.6-GGUF/tree/main/llama.cpp/GLM-4.6-Q6_K-IQ2_XS-IQ2_XS-IQ3_S - it's tuned to *just barely* fit in 128 GB. What's surprising is how much quality it retains even at such low quant sizes - here's its analysis when I fed it the `modeling_kimi.py` file from Kimi Linear: https://gist.github.com/pwilkin/7ee5672422bd30afdb47d3898680626b
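
If you want to reproduce this, here's roughly how you'd pull the quant and serve it. Treat this as a sketch, not my exact command line: the context size, paths and the expert-offload regex are assumptions you'll have to tune to your own RAM/VRAM split, but the flags themselves are standard llama-server options.

```bash
# Fetch the mixed quant from the repo subfolder (sharded GGUF).
huggingface-cli download AesSedai/GLM-4.6-GGUF \
  --include "llama.cpp/GLM-4.6-Q6_K-IQ2_XS-IQ2_XS-IQ3_S/*" \
  --local-dir ./GLM-4.6-GGUF

# Placeholder: point this at the first .gguf shard you downloaded.
MODEL=./GLM-4.6-GGUF/llama.cpp/GLM-4.6-Q6_K-IQ2_XS-IQ2_XS-IQ3_S/first-shard.gguf

# Offload all layers to the GPUs, but keep the MoE expert tensors in system RAM
# (-ot = --override-tensor). This is the usual trick for fitting a huge MoE model
# when VRAM, not RAM, is the bottleneck. Context size is an example value.
./llama-server \
  -m "$MODEL" \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 32768 \
  --jinja \
  --host 127.0.0.1 --port 8080
```

The reason the expert tensors go to CPU is that in a MoE model they make up the bulk of the weights but only a few experts are active per token, so parking them in RAM costs far less speed than it saves VRAM.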

And on top of that, llama.cpp just merged the result of a few weeks of hard work by new contributor hksdpc255 on XML-based tool calling, including support for GLM 4.6: https://github.com/ggml-org/llama.cpp/commit/1920345c3bcec451421bb6abc4981678cc721154
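
Since llama-server speaks the OpenAI-compatible chat completions API, a quick way to sanity-check the new tool calling is a curl against `/v1/chat/completions`. The `get_weather` function below is a made-up example, not anything shipped with llama.cpp - it's just there to see whether the model emits a proper tool call:

```bash
# Minimal tool-calling smoke test against a local llama-server started with --jinja
# (the chat template has to be applied for the tool-call format to work).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the weather in Warsaw right now?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'
# If the template parsing works, the response should contain a tool_calls entry
# for get_weather with {"city": "Warsaw"} instead of a plain text answer.
```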

Feel free to give it a try - on my box I'm getting around 40 t/s prompt processing and about 5 t/s generation, which is not lightning fast, but still a HUGE upgrade over the 5 t/s pp and 3 t/s tg I got when I tried just a slightly bigger quant.