GLM 4.6 on 128 GB RAM with llama.cpp
Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 20 comments
Recently I got my hands on a new box at work with 128 GB RAM and 32 GB VRAM (it's a semi-budget option, with 2x5070, but it performs really well). I decided I'm going to try a few of the bigger models. Obviously, a very good model to run on this is GPT-OSS-120B and it's been the default model, but I've set my eyes on the big ones. The GLM 4.6 REAP was a bit overwhelming, but then I thought "what if I could get my hands on a good low quant that fits?"
So, with the help of https://huggingface.co/AesSedai I've obtained a really nice mixed quant: https://huggingface.co/AesSedai/GLM-4.6-GGUF/tree/main/llama.cpp/GLM-4.6-Q6_K-IQ2_XS-IQ2_XS-IQ3_S - it's tuned to *just barely* fit in 128GB. What's surprising is how much quality it retains even at such low quant sizes - here's its analysis when I fed it the `modeling_kimi.py` file from Kimi Linear: https://gist.github.com/pwilkin/7ee5672422bd30afdb47d3898680626b
And on top of that, llama.cpp just merged the results of a few weeks of hard work of new contributor hksdpc255 on XML tool calling, including GLM 4.6: https://github.com/ggml-org/llama.cpp/commit/1920345c3bcec451421bb6abc4981678cc721154
Feel free to give it a try - on my box it's getting around 40 t/s prompt processing and about 5 t/s generation, which is not lightning fast, but still a HUGE upgrade from the 5 t/s pp and 3 t/s tg when I tried just a slightly bigger quant.
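For anyone who wants to try something similar, a launch along these lines should be a reasonable starting point (the model path, context size and the `--n-cpu-moe` count below are illustrative placeholders to tune for your own hardware, not my exact command):

```
# Illustrative llama-server launch for a big MoE quant on 32 GB VRAM + 128 GB RAM.
# -ngl 99 offloads all layers, then --n-cpu-moe keeps the MoE expert tensors of the
# first N layers in system RAM so the dense weights and KV cache fit on the GPUs.
# --jinja enables the model's chat template, which the new XML tool calling needs.
./llama-server \
  -m ./GLM-4.6-mixed-quant.gguf \
  -c 32768 -ngl 99 --n-cpu-moe 90 \
  --jinja --host 127.0.0.1 --port 8080
```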
Jealous-Astronaut457@reddit
This one is really nice:

> And on top of that, llama.cpp just merged the results of a few weeks of hard work of new contributor hksdpc255 on XML tool calling, including GLM 4.6: https://github.com/ggml-org/llama.cpp/commit/1920345c3bcec451421bb6abc4981678cc721154
wishstudio@reddit
Congrats! 5 t/s tg is good but 40 t/s pp looks like something isn't right.
ilintar@reddit (OP)
Might still be memory-constrained.
Sorry_Ad191@reddit
You might want to try ik_llama.cpp, a fork of llama.cpp; it gave me a huge boost for GPU/CPU inference, especially for prompt processing. ubergarm on HF makes special quants for it that might be even faster, but it works with regular llama.cpp GGUFs as well.
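Rough sketch of what trying the fork looks like, assuming the usual llama.cpp-style CMake build (double-check the repo's README for the exact options):

```
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON    # CUDA switch assumed to match mainline; verify in the README
cmake --build build --config Release -j
```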
ilintar@reddit (OP)
Yeah, the problem with ik_llama is that I have no idea what the tool-calling status is there at the moment (a few months ago it didn't work at all; I know they've forked the mainline chat templates since then, but I don't know exactly where GLM stands).
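For what it's worth, a quick way to check is to hit the server's OpenAI-compatible chat endpoint with a dummy tool definition; the port, model name and the toy `get_weather` function below are made up for the test:

```
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6",
    "messages": [{"role": "user", "content": "What is the weather in Warsaw right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
# If tool calling works, the response should contain a tool_calls entry instead of plain text.
```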
notdba@reddit
So hksdpc255 actually opened the same pull request at https://github.com/ikawrakow/ik_llama.cpp/pull/958, and it got merged the same day. It works great ❤️
With `-b 4096 -ub 4096`, you should be able to get 120–480 t/s for PP, depending on PCIe speed, with both ik and mainline, when the prompt is large enough. For small prompts, ik has much better PP with the CPU.
Speed aside, I prefer ik over mainline since the IQK quants are really good. For SOTA quants, I recommend mixing one with https://github.com/Thireus/GGUF-Tool-Suite
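Concretely, that would be something like the following on top of whatever launch command you already use (only `-b`/`-ub` come from this comment; the rest is a placeholder):

```
./llama-server -m ./GLM-4.6-mixed-quant.gguf \
  -c 32768 -ngl 99 --n-cpu-moe 90 --jinja \
  -b 4096 -ub 4096   # bigger logical/physical batches mainly help prompt processing
```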
ExcessiveEscargot@reddit
Spotted the Wheel of Time fan!
yazoniak@reddit
Nice try but it is just useless.
ilintar@reddit (OP)
Why useless? I agree it's not a super-comfortable use case, but I tried really hard to *actually* make it useful - a slow but still usable processing speed, a context that can fit real use cases and a quantization that doesn't gimp the model too much. It's not one of those "4096 context size with 0.1 t/s" experiments, I actually wanted to have something I could at least try out for real-life stuff.
Sorry_Ad191@reddit
Keep it up, it's far from useless.
Academic-Lead-5771@reddit
oh that's gore of GLM 4.6
seriously though, why are we stuffing comically low... actually, 6-bit is fine? but other weights are at 3 and 2? oh well. not the most heinous quant I guess.
some guy was trying to stuff a 1-bit GGUF of this model into his 3090+RAM and was posting threads wondering why it was misbehaving. hilarious.
anyway how's the output quality?
I love local LLMs but at a certain point it's worth paying a few bucks for near full precision on demand versus this silliness.
blankboy2022@reddit
I don't know much about LLMs; is that actually too low? 1-bit seems too weird, but what would be an acceptable quant?
No-Refrigerator-1672@reddit
A quick rule of thumb: Q4 is within 3-5% of the original model's benchmark scores; Q3 will lose around 10%, and Q1 around 30%. Some authors manage to get Q4 nearly identical to the base model by strategically quantizing some of the layers to Q6 (e.g. the Unsloth Dynamic series), but overall, Q3 and below should be avoided; those quants are more of an academic exercise than a useful product.
Sorry_Ad191@reddit
Yes, and it should be stated that even something like a smol_iq3_xxs can be 3.75 bpw, because the imatrix calibration keeps many layers at 32, 16, 8, 6, 5, 4 bit etc. All good quants today are like this.
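A quick way to sanity-check the effective bits per weight of any quant is file size in bits divided by total parameter count; the numbers below are made up purely for illustration:

```
# e.g. ~150 GiB of GGUF shards for a ~355B-parameter model (illustrative numbers)
echo "scale=2; 150 * 1024^3 * 8 / 355000000000" | bc
# => 3.62 effective bpw, noticeably above the nominal ~3 bpw of the headline quant type
```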
Sorry_Ad191@reddit
It's not too low. For good models like GLM, DeepSeek, Kimi K2 they are all good. Once you get to a good 3-bit quant, the gains up to FP8 are small; 3-bit and below has a nice curve. For example, full FP8 DeepSeek 3.1 scores 74 on a benchmark. A 3-bit quant might score 74 as well, but might also do a little worse. Once you go below 3-bit, though, it will be 68-65, and 1-bit is down closer to 50. So 1-bit is probably only for the big MoEs, 2-bit for big MoEs is pretty good, and at 3-bit and above you are 99% there.
Sorry_Ad191@reddit
Low-bit quants are really good these days. I'm halfway through the full polyglot with smol_iq3_xxs for Kimi K2 Thinking and it's beating Claude, o3, Grok 4, and is on par with GPT-5.
Progress: [########################            ] 48% (108/225)

=== FULL STATS ===
- dirname: 2025-11-18-10-02-47--smol_iq3_ks
  test_cases: 108
  model: openai/moonshotai/Kimi-K2-Thinking
  edit_format: diff
  commit_hash: c74f5ef
  pass_rate_1: 45.4
  pass_rate_2: 80.6
  pass_num_1: 49
  pass_num_2: 87
  percent_cases_well_formed: 95.4

Front_Eagle739@reddit
I've been using the GLM 4.6 IQ2_XXS Unsloth quant on my 128 GB Mac. Best local model I've used by far for coding and creative writing. It follows the prompt extremely well, feels as smart as the OpenRouter DeepSeek R1/V3 or better, and pretty much never wanders off topic at long context. Once in a long while it drops a Chinese character, but other than that I wouldn't know it's heavily quantised at all.
Tried GLM 4.5 Air, MiniMax M2, DeepSeek, all the Qwens, OSS 120, etc. None of them come close for my use cases.
blbd@reddit
Have you tried the unsloth dynamic quants?
TheActualStudy@reddit
Can you run this on a box with a DE or does that "just barely fits in 128GB" not really allow for running it on your desktop?
ilintar@reddit (OP)
You mean with a desktop environment? The box has a Wayland session with Xrdp running in the background, does that qualify? 😃