GLM 4.6 on 128 GB RAM with llama.cpp
Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 20 comments
Recently I got my hands on a new box at work with 128 GB RAM and 32 GB VRAM (it's a semi-budget option, with 2x5070, but it performs really well). I decided I'm going to try a few of the bigger models. Obviously, a very good model to run on this is GPT-OSS-120B and it's been the default model, but I've set my eyes on the big ones. The GLM 4.6 REAP was a bit overwhelming, but then I thought "what if I could get my hands on a good low quant that fits?"
So, with the help of https://huggingface.co/AesSedai I've obtained a really nice mixed quant: https://huggingface.co/AesSedai/GLM-4.6-GGUF/tree/main/llama.cpp/GLM-4.6-Q6_K-IQ2_XS-IQ2_XS-IQ3_S - it's tuned to *just barely* fit in 128GB. What's surprising is how much quality it retains even at such low quant sizes - here's its analysis when I fed it the `modeling_kimi.py` file from Kimi Linear: https://gist.github.com/pwilkin/7ee5672422bd30afdb47d3898680626b
And on top of that, llama.cpp just merged the results of a few weeks of hard work of new contributor hksdpc255 on XML tool calling, including GLM 4.6: https://github.com/ggml-org/llama.cpp/commit/1920345c3bcec451421bb6abc4981678cc721154
Feel free to give it a try - on my box it's getting around 40 t/s prompt processing and about 5 t/s generation, which is not lightning fast, but still a HUGE upgrade from the 5 t/s pp and 3 t/s tg when I tried just a slightly bigger quant.
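For anyone who wants to try something similar, a launch along these lines should be a reasonable starting point (the model path, context size and the `--n-cpu-moe` count below are illustrative placeholders to tune for your own hardware, not my exact command):

```
# Illustrative llama-server launch for a big MoE quant on 32 GB VRAM + 128 GB RAM.
# -ngl 99 offloads all layers, then --n-cpu-moe keeps the MoE expert tensors of the
# first N layers in system RAM so the dense weights and KV cache fit on the GPUs.
# --jinja enables the model's chat template, which the new XML tool calling needs.
./llama-server \
  -m ./GLM-4.6-mixed-quant.gguf \
  -c 32768 -ngl 99 --n-cpu-moe 90 \
  --jinja --host 127.0.0.1 --port 8080
```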
Jealous-Astronaut457@reddit
This one is really nice:

> And on top of that, llama.cpp just merged the results of a few weeks of hard work of new contributor hksdpc255 on XML tool calling, including GLM 4.6: https://github.com/ggml-org/llama.cpp/commit/1920345c3bcec451421bb6abc4981678cc721154
wishstudio@reddit
Congrats! 5 t/s tg is good but 40 t/s pp looks like something isn't right.
ilintar@reddit (OP)
Might still be memory-constrained.
Sorry_Ad191@reddit
You might want to try ik_llama.cpp, a fork of llama.cpp; it gave me a huge boost for GPU/CPU inference, especially for prompt processing. ubergarm on HF makes special quants for it that might be even faster, but it works with regular llama.cpp GGUFs as well.
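Rough sketch of what trying the fork looks like, assuming the usual llama.cpp-style CMake build (double-check the repo's README for the exact options):

```
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON    # CUDA switch assumed to match mainline; verify in the README
cmake --build build --config Release -j
```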
ilintar@reddit (OP)
Yeah, the problem with ik_llama is that I have no idea what the tool-calling status is there at the moment (a few months ago it didn't work at all; I know they've forked the mainline chat templates since then, but I don't know exactly where GLM stands).
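For what it's worth, a quick way to check is to hit the server's OpenAI-compatible chat endpoint with a dummy tool definition; the port, model name and the toy `get_weather` function below are made up for the test:

```
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "What is the weather in Warsaw right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
# If tool calling works, the response should contain a tool_calls entry instead of plain text.
```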
notdba@reddit
So hksdpc255 actually opened the same pull request at https://github.com/ikawrakow/ik_llama.cpp/pull/958, and it got merged the same day. It works great ❤️
With `-b 4096 -ub 4096`, you should be able to get 120–480 t/s for PP, depending on PCIe speed, with both ik and mainline, when the prompt is large enough. For small prompts, ik has much better PP with the CPU.
Speed aside, I prefer ik over mainline since the IQK quants are really good. For SOTA quants, I recommend mixing one with https://github.com/Thireus/GGUF-Tool-Suite
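Concretely, that would be something like the following on top of whatever launch command you already use (only `-b`/`-ub` come from this comment; the rest is a placeholder):

```
./llama-server -m ./GLM-4.6-mixed-quant.gguf \
  -c 32768 -ngl 99 --n-cpu-moe 90 --jinja \
  -b 4096 -ub 4096   # bigger logical/physical batches mainly help prompt processing
```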
ExcessiveEscargot@reddit
Spotted the Wheel of Time fan!
yazoniak@reddit
Nice try but it is just useless.
ilintar@reddit (OP)
Why useless? I agree it's not a super-comfortable use case, but I tried really hard to *actually* make it useful - a slow but still usable processing speed, a context that can fit real use cases and a quantization that doesn't gimp the model too much. It's not one of those "4096 context size with 0.1 t/s" experiments, I actually wanted to have something I could at least try out for real-life stuff.
Sorry_Ad191@reddit
Keep it up, it's far from useless.
Academic-Lead-5771@reddit
oh that's gore of GLM 4.6
seriously though, why are we stuffing comically low... actually, 6-bit is fine? but other weights are at 3 and 2? oh well. not the most heinous quant I guess.
some guy was trying to stuff a 1-bit GGUF of this model into his 3090+RAM and was posting threads wondering why it was misbehaving. hilarious.
anyway how's the output quality?
I love local LLMs but at a certain point it's worth paying a few bucks for near full precision on demand versus this silliness.
blankboy2022@reddit
I don't know much about LLMs; is that actually too low? 1-bit seems too weird, but what would be an acceptable quant?
No-Refrigerator-1672@reddit
A quick rule of thumb: Q4 is within 3-5% of the original model's benchmark scores; Q3 will lose around 10%, and Q1 around 30%. Some authors manage to get Q4 nearly identical to the base model by strategically quantizing some of the layers to Q6 (e.g. the Unsloth Dynamic series), but overall, Q3 and below should be avoided; those quants are more of an academic exercise than a useful product.
Sorry_Ad191@reddit
Yes, and it should be stated that even something like a smol_iq3_xxs can be 3.75 bpw, because the imatrix calibration keeps many layers at 32, 16, 8, 6, 5, 4 bit etc. All good quants today are like this.
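A quick way to sanity-check the effective bits per weight of any quant is file size in bits divided by total parameter count; the numbers below are made up purely for illustration:

```
# e.g. ~150 GiB of GGUF shards for a ~355B-parameter model (illustrative numbers)
echo "scale=2; 150 * 1024^3 * 8 / 355000000000" | bc
# => 3.62 effective bpw, noticeably above the nominal ~3 bpw of the headline quant type
```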
Sorry_Ad191@reddit
It's not too low. For good models like GLM, DeepSeek, Kimi K2 they are all good. Once you get to a good 3-bit quant, the gains up to FP8 are small; 3-bit and below has a nice curve. For example, full FP8 DeepSeek 3.1 scores 74 on a benchmark. A 3-bit quant might score 74 as well, but might also do a little worse. Once you go below 3-bit, though, it will be 68-65, and 1-bit is down closer to 50. So 1-bit is probably only for the big MoEs, 2-bit for big MoEs is pretty good, and at 3-bit and above you are 99% there.
Sorry_Ad191@reddit
Low-bit quants are really good these days. I'm halfway through the full polyglot with smol_iq3_xxs for Kimi K2 Thinking and it's beating Claude, o3, Grok 4, and is on par with GPT-5.
Progress: [########################            ] 48% (108/225)

=== FULL STATS ===
- dirname: 2025-11-18-10-02-47--smol_iq3_ks
  test_cases: 108
  model: openai/moonshotai/Kimi-K2-Thinking
  edit_format: diff
  commit_hash: c74f5ef
  pass_rate_1: 45.4
  pass_rate_2: 80.6
  pass_num_1: 49
  pass_num_2: 87
  percent_cases_well_formed: 95.4

Front_Eagle739@reddit
I've been using the GLM 4.6 IQ2_XXS Unsloth quant on my 128 GB Mac. Best local model I've used by far for coding and creative writing. It follows the prompt extremely well, feels as smart as the OpenRouter DeepSeek R1/V3 or better, and pretty much never wanders off topic at long context. Once in a long while it drops a Chinese character, but other than that I wouldn't know it's heavily quantised at all.
Tried GLM 4.5 Air, MiniMax M2, DeepSeek, all the Qwens, OSS 120, etc. None of them come close for my use cases.
blbd@reddit
Have you tried the unsloth dynamic quants?
TheActualStudy@reddit
Can you run this on a box with a DE or does that "just barely fits in 128GB" not really allow for running it on your desktop?
ilintar@reddit (OP)
You mean with a desktop environment? The box has a Wayland session with Xrdp running in the background, does that qualify? 😃