GLM-4.5-Air llama.cpp experiences?

Posted by DorphinPack@reddit | LocalLLaMA | View on Reddit | 41 comments

ik_llama.cpp too! I’d love to hear how people are running it (hardware, CLI flags, use case, etc.). Bandwidth constraints and having a single 3090 are giving me a bit of analysis paralysis choosing a quant to start. I’m a patient hybrid inference gal, as long as it’s not seconds per token 😂. Workload is usually long context document work and coding (still looking for a local Roo/aider setup to go steady with). From what I’ve seen, ~70GB for Q4 would be a good fit with the typical MoE CPU/GPU setup, as I have >70GB of RAM to play with. I’m afraid to go too low with so few active parameters. Or is that guiding principle more bound to total parameters? I’m surprised I haven’t seen more yet, but with gpt-oss dropping the same morning the GLM GGUFs did, I get why.

kironlau@reddit

If you wanna try ik\_llama.cpp, just download a quant from ubergarm: [ubergarm/GLM-4.5-Air-GGUF · Hugging Face](https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF). In fact, all you need is RAM+VRAM > \~70GB (some layers can be offloaded to the GPU, the rest stay on CPU/RAM).

DorphinPack@reddit (OP)

I’ve got my eye on a few quants and both llama.cpps set up and working with other models. What I don’t have is the bandwidth this month to try one quant, get disappointed and then download another 70GB.

Paradigmind@reddit

Maybe you can use free WiFi from McDonalds, a coffee shop, hotel, train station or something? Or an internet cafe.

DorphinPack@reddit (OP)

I thought about it but it already takes a while to pull these on my full ~800Mbps down pipe. I also would put money on McDonalds-level free WiFi QoSing the traffic to hell if their firewalls are configured to encourage light use while eating your Big Mac. I ended up being able to visit someone who didn’t mind me pulling a couple hundred gigs over their usual usage. Good ol’ sneakernet.

Paradigmind@reddit

Nice. Which quant did you settle on? And how many t/s do you get? I have a 3090 as well and I tried the Q5 quant but only got 1.5 t/s.

DorphinPack@reddit (OP)

Gah, I really don't like the quality I'm getting at <4bpw (shocker). I like how this model handles code without much nudging, so I'll be trying to invest some time into getting ik\_llama.cpp tuned for as much speed as I can get with IQ4\_KSS, even if it means settling for a smaller context size. Workload is automation scripts so nothing crazy, but I try to give it real world problems. I've seen my model downloader script torn apart and put back together a lot at this point lol.

I got \~40-60 pp t/s and \~6-10 tg t/s with the first ten layers on GPU and then all the FFN expert tensors from layer 11 up on CPU. But I'm not sure it's the absolute best solution even if it's working okay for me. Testing is just a little slow and painful but I'm getting there. About to graduate from poking at llama-server with bash and actually using the damn tools in the repo.

If you'd like me to be thinking about your setup as I go, drop your specs. The key details are PCIe version, DDR version, DDR speed, RAM, thread count and some idea of how zippy those threads are (including if you're on a platform with extra goodies like AVX512).

The current approach is how llama.cpp does `--cpu-moe` but it's far from universal and I don't even know if it's best for me. I saw a comment about it potentially looking best when you are CPU bound and the PCIe bus is underutilized. Splitting layers (some tensors on layer N on CPU, some on GPU) means data goes over PCIe during the process of creating a token. Keeping whole, contiguous layers of tensors with clean boundaries on slow interconnects should minimize overhead on the bus. It matters a lot more for parallel workloads so it seems likely that gaming-workstation users don't realize it won't generalize. Hybrid inference isn't for serious professionals, or so they say.

Paradigmind@reddit

I like how you gradually optimize your setup. My rig: RTX 3090 on a PCIe 5 board ([X670E Tomahawk](https://de.msi.com/Motherboard/MAG-X670E-TOMAHAWK-WIFI)), [Ryzen 7 9700X](https://www.amd.com/de/products/processors/desktops/ryzen/9000-series/amd-ryzen-7-9700x.html) (8 cores, 16 threads, AVX512), 96GB DDR5 RAM @ 5200MHz.

CryptoCryst828282@reddit

I have found Intel to be much better for my use case on LLMs, though others may disagree with me. I find most Z890 boards have far more M.2 slots through the chipset that I can use with OCuLink adapters. I have one that has 5 M.2 slots, 2 Thunderbolt ports and 4 PCIe slots, with one of them bifurcated 4x4 into 4 more OCuLink ports.

DorphinPack@reddit (OP)

Nice! I'll dump a bit of the context I have on the implications, so forgive me if you know any of it. This is part of my learning process and keeps me motivated better than studying alone these days. I prefer this when I have the option! Just remember not to trust me too much heh, like I said, kid with a circular saw vibes.

- Consumer DDR5 boards often don't get full bandwidth on four DIMMs, but it's a little complex so it's just worth looking into. I tried peeking and all I found was that there are people who have talked about memory controller issues on your board (not always a bad sign, good to have data points, didn't find any about bandwidth-limited workloads though).
- I **just** found out why you may not want to give every thread to your CPU backend. Most setups (especially my DDR4) will saturate the memory bandwidth before all the cores are fully occupied. [See here for more](https://www.reddit.com/r/LocalLLaMA/comments/1gdxmz2/comment/lu5z5v3/), but I think you may actually be close to being able to use all your cores with 5200 DDR5. It depends on the memory controller, though. And you'll want to find a way to observe it saturating, then experiment with backing off to do that properly. A little blind testing wouldn't hurt until then.
- You also have an edge in PCIe bandwidth on gen 4 (the highest the 3090 supports), but seeing as both of us are probably CPU bound running a 40-60GB model on 24GB+RAM I doubt it'll factor in here.
- With AVX512 you should def give ik\_llama.cpp a look. Happy to help anyone get it compiled and get a feel for how to find info in the repo, which is optimized for ikawrakow's developer workflow to align with the purpose of the fork.
- Attention max-batch in particular should be useful, as I think you'll be able to push your batch size with those 512-wide SIMD instructions on the CPU, thereby increasing your PCIe utilization (underutilized when CPU bound, remember) by sending more data per batch.
- When you're tight on VRAM that's important because it affects the size of the compute buffer and can be tuned for higher perf. The higher batch sizes will blow that compute buffer up without it and then you lose speed again by offloading more to CPU... there has to be a sweet spot.
- There is obviously a level of understanding which algorithms are going to run going on when he does this, but at [the end of this comment](https://github.com/ikawrakow/ik_llama.cpp/pull/237#issuecomment-2691104270) the author does some optimization on Deepseek Mini that opened my eyes. Obviously calculating the size of the matrix based on tokens per expert and the size of the token embeddings is doable, but I don't yet know if I could determine the optimal size for my hardware without testing. If you want to do that math, the values you need to plug in are prefixed with `glm4moe` in the model details.
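
The point about saturating memory bandwidth can be put into rough numbers. This is a minimal sketch, assuming token generation is purely memory-bound (every active weight is streamed from RAM once per token), roughly 12B active parameters for GLM-4.5-Air, an average of about 4.5 bits per weight, and dual-channel memory. All of those values are assumptions for illustration, not measurements from this thread.

```python
def peak_bandwidth_gbs(mt_per_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s: transfers/s x bytes per transfer x channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


def tg_ceiling(active_params: float, bpw: float, bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    bytes_per_token = active_params * bpw / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


ACTIVE_PARAMS = 12e9  # assumed active parameter count
BPW = 4.5             # assumed average bits per weight

for name, rate in [("DDR4-2666", 2666), ("DDR5-5200", 5200)]:
    bw = peak_bandwidth_gbs(rate)
    ceiling = tg_ceiling(ACTIVE_PARAMS, BPW, bw)
    print(f"{name}: {bw:.1f} GB/s peak, ~{ceiling:.1f} t/s ceiling")
```

Real numbers will differ because part of the active weights sit in VRAM and sustained bandwidth is below the theoretical peak. The useful takeaway is that once a few threads can pull the full bandwidth, adding more threads cannot raise this ceiling.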

DorphinPack@reddit (OP)

I tried ubergarm IQ2\_KL again and I think it is my sweet spot for now after all. That's what I'll be benchmarking? I guess? I really need to reinstall this setup on bare metal. I am so clearly CPU bound I doubt my data would be helpful. So that's probably what I'll do tomorrow.

Paradigmind@reddit

What I think is odd is that I seem to get the exact same speed with Q6, Q5 and Q4 quants. Just 1.55 t/s. I use Koboldcpp. I would like to have 5-8 t/s, so slow reading speed. Is the IQ2 still smart enough?

DorphinPack@reddit (OP)

Bottlenecking does funny things! Add in the way modern GGUFs use drastically different quantizations for different layers and I can see that happening. Like once it's bad it won't get worse until you have to start mmap'ing the weights off storage. Picture a bucket with a tiny hole and then a bucket 10/20/30 percent larger with the same sized hole, but don't get too attached to the analogy in detail. It breaks down.

I can vouch for the ubergarm IQ2\_KL being better than the UD-IQ3\_K\_XL from unsloth for my workload. I wouldn't say it's fair to stack up against some of the other IQ2 quants I've seen (just comparing the layer sacrifices).

My gut says we can totally bump those numbers up with your specs. If you like koboldcpp check out [croco.cpp](https://github.com/Nexesenex/croco.cpp) which used to be the "frankenfork". It's the ik\_llama.cpp equivalent down to supporting (currently) ik-only quants. You'll need to figure out which you can use but as CPU bound hybrid inference-ers we need all the CPU optimizations we can get -- that's ik's killer feature.

There's also a guy who doesn't post quants, he posts a collection of all the layers quantized in all the ways that make sense. Then, using his tool, you download them and build a custom GGUF. I think ultimately this is going to be THE way to squeeze power out of frugal hardware. Current quants are jack of all trades.

Oh and THANK YOU. I just re-ran with 8 threads to try to give you a datapoint and it's up to 10t/s tg. Looks like that stuff about memory bandwidth per core really matters. This is potentially very interesting if you can disable cores to save power as that often leads to MORE SPEED as you'll throttle less in long workloads.

DorphinPack@reddit (OP)

Welp, I started down a few rabbit holes while running tests. I wanted to get a decent baseline before I switch it all up tomorrow. This is all with <20% GPU utilization -- CPU is blocking things HARD. I have this box tuned down aggressively and that needs to change along with disabling hyperthreading. I will probably test that again before I put Linux on here. I've gotta wait for backups to sync anyway. But anyway...

The first thing I did was try to work out [the attention max-batch math](https://github.com/ikawrakow/ik_llama.cpp/pull/237) to keep... the matrices... well fed? I'm sure once I run actual benchmarks I'll know more but the settings I found seem to work well for this model. I don't think it'll change too much with just how drastically CPU bound I am but it'll certainly be easier to test once the whole rig is running faster. In case the flags are new to anyone, you're looking for batch size 2048, micro-batch (ubatch) size 1024 and attention max-batch size 512 in the command below.

IQ4\_KSS is it for me on this setup, I think. I actually figured a lot out trying to push past a consistent 8t/s+ on IQ2\_KL when I stumbled across the actual [PR for -fmoe](http://github.com/ikawrakow/ik_llama.cpp/pull/229) and realized how powerful `"ffn_down_exps=CPU"` is. It's not enough on its own but it gives you a really good base to start offloading up and gate expert FFNs by layer. Keeping up+gate together on the GPU sounds like the best way to take advantage of those operations being fused in the CUDA backend. If down is added to the fusion this setup will OOM without adjustment but it's working right now -- 48t/s pp, 8.9t/s tg. Perfectly usable for the patient, although I was in a hurry so context hovered around 2K when finished. It will slow down and maybe once I'm benchmarking with prefill I'll get to feel like an idiot all over again! I fear I've caught the bug and will be much more scientific going forward.

```
ik_llama.cpp/build/bin/llama-server \
  -m models/GLM-4.5-Air/ubergarm_ik_gguf/IQ4_KSS/GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
  --alias GLM-4.5-Air --alias code \
  --no-mmap -b 2048 -ub 1024 -amb 512 \
  -fmoe -rtr -ngl 99 --threads 16 \
  -ot '\.ffn_down_exps=CPU' \
  -ot '[2-4][0-9]\.ffn_.*_exps=CPU' \
  -c 32768 -fa
```
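
To see what the two `-ot` overrides in this command select, here is a minimal sketch that applies the same regexes to a few tensor names. It assumes GGUF tensor names of the form `blk.<layer>.<tensor>.weight`, that the first matching override wins, and that everything unmatched lands on the GPU because of `-ngl 99`. The sample names and the `placement` helper are illustrative and not read from the model file.

```python
import re

# The two override patterns from the command, in the order they are given.
OVERRIDES = [
    (r"\.ffn_down_exps", "CPU"),
    (r"[2-4][0-9]\.ffn_.*_exps", "CPU"),
]


def placement(tensor_name: str, default: str = "GPU") -> str:
    """Return the target for a tensor: first matching override, else the default."""
    for pattern, target in OVERRIDES:
        if re.search(pattern, tensor_name):
            return target
    return default


samples = [
    "blk.5.attn_q.weight",
    "blk.5.ffn_up_exps.weight",
    "blk.5.ffn_down_exps.weight",
    "blk.25.ffn_gate_exps.weight",
    "blk.25.ffn_up_shexp.weight",
]
for name in samples:
    print(f"{name:32s} -> {placement(name)}")
```

Under these assumptions, the down-projection experts of every layer go to CPU, layers 20 and up send all routed experts to CPU, and the lower layers keep their up and gate experts on the GPU along with attention and shared-expert tensors.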

Paradigmind@reddit

In the meantime, although it is about offloading a different model, [this thread](https://www.reddit.com/r/LocalLLaMA/s/k6yBsW7nXK) might have some good ideas about offloading in general. In one of the posts I linked to another thread where another user suggests another offloading method. Unfortunately I'm too nooby to try all these approaches myself. But maybe you'll get some ideas out of them. :D

DorphinPack@reddit (OP)

Ah thanks! The llama.cpp `--cpu-moe` flag is funnily enough just “ffn_.*_exps=CPU”, and if you give it a number of layers (`--n-cpu-moe N`) it just limits that to the first N layers. Not a terrible solution but also just a starting point!

Paradigmind@reddit

Ah okay I see! I'm looking forward to your findings and advancement.

Paradigmind@reddit

Wow this looks amazing! Thank you so much for keeping me up-to-date. I will reply if I'm home later. I'm a slow phone typer, lol.

DorphinPack@reddit (OP)

Here’s an easy and free one. They are not kidding about turning off hyperthreading (SMT for me on AMD). Holy shit. Night and day difference. Whole new baseline before I rebuild the OS.

DorphinPack@reddit (OP)

I talked myself down to IQ4_KSS and even that was a bit of an overshoot. I’m evaluating the UD IQ3_K_XL and ubergarm’s IQ2_KL (which is 3.41 bpw in reality). Not sure WHY but it is a lot heavier than I expected. I have thoughts but nothing solid yet.
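
For a feel of why a quant can be heavier than its name suggests: file size tracks the average bits per weight across all tensors, not the label. This is a rough sketch, assuming about 106B total parameters for GLM-4.5-Air (a figure not stated in this thread) and ignoring metadata overhead. The 2.5 and 4.5 bpw rows are hypothetical comparison points.

```python
def quant_size_gb(total_params: float, bpw: float) -> float:
    """Approximate model size in GB (1e9 bytes) for a given average bits per weight."""
    return total_params * bpw / 8 / 1e9


TOTAL_PARAMS = 106e9  # assumed total parameter count

rows = [
    ("hypothetical 2.5 bpw", 2.5),
    ("IQ2_KL at the reported 3.41 bpw", 3.41),
    ("hypothetical 4.5 bpw", 4.5),
]
for label, bpw in rows:
    print(f"{label}: ~{quant_size_gb(TOTAL_PARAMS, bpw):.1f} GB")
```

The gap between the first two rows is the kind of difference that shows up when a recipe keeps attention and shared tensors at higher precision than the routed experts.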

kironlau@reddit

Sad to hear that (I live in a place where broadband usage is unlimited). FYI, ubergarm is one of the best quant makers for MoE (as the author of ik\_llama.cpp said). ik\_llama.cpp has some new SOTA quants that perform 5\~20% better than mainline's (with test results you can find on the ik\_llama.cpp GitHub page). If you really want to find the best quant of a particular MoE model, ubergarm is a good starting point (he always tests the perplexity and compares it with the Q8 and FP16 of the original model).

DorphinPack@reddit (OP)

Checking back in to say I only have an n of one but this seems like the right choice 🏎️💨 A slight adjustment to the author’s invocation (including turning off amb which just ended up doing its best r/ggggg impression) has me right in the pocket VRAM-wise at 32K context in the same speed ballpark I was getting from the smaller Qwen3 coder with 92K context.

kironlau@reddit

Good to see you're getting satisfying results. You mean using the GLM-4.5-Air GGUF (from ubergarm)? Are you on DDR4 or DDR5? I may consider upgrading my RAM to 64GB+.

DorphinPack@reddit (OP)

Yes, correct. DDR4 2666.

kironlau@reddit

Good, and if you can OC it, maybe to 3200MHz+ (the max I could get is 3733, which AMD officially advises as the maximum), you could see a 30\~50% gain in speed (compared to 2666).
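
As a sanity check on the expected gain: if generation is limited purely by memory bandwidth, speed can scale at best linearly with the memory transfer rate. A quick sketch under that assumption (real gains also depend on timings and the memory controller):

```python
def max_speedup_pct(base_mt_s: float, new_mt_s: float) -> float:
    """Best-case speedup in percent when throughput scales linearly with memory rate."""
    return (new_mt_s / base_mt_s - 1) * 100


for target in (3200, 3733):
    gain = max_speedup_pct(2666, target)
    print(f"2666 -> {target}: up to ~{gain:.0f}% faster")
```

Under this simple model the ceiling is about 20% at 3200 and about 40% at 3733, so the upper end of the quoted range would need help from something other than raw transfer rate, such as tighter timings.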

DorphinPack@reddit (OP)

Hah yeah I should. Put that extra ECC bit to work! First I have to stop running in a VM and finally install Linux on the bare metal 🫠

kironlau@reddit

Haha... Win11 eats up 10\~14.5GB of my RAM, even when I stop all unnecessary software from launching at startup.

DorphinPack@reddit (OP)

Woah 10+ idling right after a fresh startup???

kironlau@reddit

I dunno, let me have a full check of my system @@

DorphinPack@reddit (OP)

I can't say that isn't high btw. Switched away before I had to switch to 11 so maybe that is just how it is?

kironlau@reddit

I think if I stop WSL+Docker from starting, Windows should take about 8GB of RAM (maybe it's my poor settings). Right now, with just Edge open (8 tabs), WSL+Docker running and a screen capture tool, 15GB of RAM is gone.

DorphinPack@reddit (OP)

Ah thanks!

No-Statement-0001@reddit

Here is my configuration right out of llama-swap. I have 2x3090 and 2xP40, so enough VRAM to load the entire model plus a decent amount of context. I spent some time on performance optimizations with `--override-tensor` today. For me, using `--tensor-split` to move more layers to the 3090s made the biggest performance improvement (10%). Basically, loading as many layers as possible onto the fastest GPUs equals more performance.

```
macros:
  "server-latest": |
    /path/to/llama-server/llama-server-latest
    --host 127.0.0.1 --port ${PORT}
    --flash-attn -ngl 999 -ngld 999
    --no-mmap --no-warmup

  # quantize KV cache to Q8, increases context but
  # has a small effect on perplexity
  # https://github.com/ggml-org/llama.cpp/pull/7412#issuecomment-2120427347
  "q8-kv": |
    --cache-type-k q8_0 --cache-type-v q8_0

models:
  # ~30 to 33tok/sec
  # Perf Tweaking notes:
  # - keep attn/exp tensors in layers together (same GPU)
  # - put as much as possible on the faster GPUs (3090s)
  # - not worth effort to tweak offloads, wound up making it slower
  # - using tensor-split was fine.
  glm-air:
    cmd: |
      ${server-latest}
      ${q8-kv}
      --model /path/to/models/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf

      # 80K context
      --ctx-size 81920

      # put more layers on the 3090s (~10% faster)
      # balancing tweaks for VRAM with #P40s
      # devices: 3090, 3090, P40, P40
      --tensor-split 28,30,22,20
```
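
The `--tensor-split` values are relative proportions across devices. Here is a small sketch of how `28,30,22,20` maps to per-GPU shares, assuming the values are simply normalized by their sum. The layer count of 47 is a hypothetical number for illustration only, and the actual assignment inside llama.cpp may round differently.

```python
def split_shares(weights: list[float]) -> list[float]:
    """Normalize tensor-split weights into fractions that sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


devices = ["3090", "3090", "P40", "P40"]  # device order from the config comment
weights = [28, 30, 22, 20]
n_layers = 47  # hypothetical layer count, for illustration only

for dev, share in zip(devices, split_shares(weights)):
    print(f"{dev}: {share:.0%} of the model, ~{share * n_layers:.1f} layers")
```

With these weights the two 3090s carry 58% of the model between them, which matches the stated goal of putting as much as possible on the faster cards.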

Glittering-Call8746@reddit

What's the tg ?

relmny@reddit

Sorry to ask, but with the gpt hype it's hard to find posts about glm-4.5. I can't find values for temperature, top-k, top-p, etc. Are they not needed, or are they the same as for glm-4?

Dundell@reddit

I didn't really consider a P40, but I do have an extra I could slap in and see how it helps. Maybe even get more context out of it at a better rate.

DorphinPack@reddit (OP)

This logic has me wanting to grab a P40 even though I was going to save for a second 3090 (or, more likely, stick it out until more interesting options open up). I’m doing way better with hybrid inference than I expected with 16T and DDR4 2666 (in a VM!) so even 24 more “slow” GB of VRAM would probably just be a nice boost.

a_postgres_situation@reddit

One datapoint: I've been playing with GLM-4.5-Air-IQ4_NL.gguf. GLM thinks forever, and I had to raise context to at least 32768 because of that, but the result had no errors, only some warnings, and it worked. Qwen3-Coder-30B-A3B-Instruct-Q8_0 output was worse (but that model is also half the size).

Dundell@reddit

I run 5x RTX 3060 12GB (60GB VRAM) + 64GB DDR4 on X99, running the Unsloth Q3 UD XL GLM 4.5 Air GGUF with 64k context and a Q4_1 cache. I get 175~125 t/s read and 16~10~5 t/s write as context grows through the 0~10k~35k I've been pushing.

So far it passes a lot of my normal tests. A few are Python scripts, and one was working on one of my projects to fix an issue with a YAML file and a JS file that interacts with it. That went well, but obviously slower than I normally like.

It definitely FEELS better than Flash 2.5, and I have some use cases where I can use it without the worry of free daily limits, which is good.

Additionally, I ran 3 tests with my personal automated report builder project, which searches and scrapes 20~50 articles for the LLM to summarize and then processes all of them into a PDF report. It was better than Flash 2.5 in format and creativity when comparing the PDFs, including a static comparison against my 5 example articles and controlled prompts.

DorphinPack@reddit (OP)

Wow it handled coding tasks well with a Q4_1 cache? That’s really compelling.

MatterMean5176@reddit

It's fast so I would lean towards whatever larger quant you're considering.

DorphinPack@reddit (OP)

Yeah I have 180GB of RAM atm and a bit more freeing up next time I get a whole day to get my server set up right finally. I think with the active parameters being so low I’m tempted to go big for coding quality and try Q4 next month.
