GLM-4.5-Air llama.cpp experiences?

Posted by DorphinPack@reddit | LocalLLaMA | 41 comments

ik_llama.cpp too! I’d love to hear how people are running it (hardware, CLI flags, use case, etc.). Bandwidth constraints and having a single 3090 are giving me a bit of analysis paralysis choosing a quant to start with. I’m a patient hybrid inference gal, as long as it’s not seconds per token 😂. Workload is usually long-context document work and coding (still looking for a local Roo/aider setup to go steady with). From what I’ve seen, ~70GB for Q4 would be a good fit with the typical MoE CPU/GPU setup, as I have >70GB of RAM to play with. I’m afraid to go too low on the quant with so few active parameters. Or is that guiding principle more bound to total parameters? I’m surprised I haven’t seen more yet, but with gpt-oss dropping the same morning the GLM GGUFs did, I get why.
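
For anyone who wants to sanity-check the fit question, here is a rough back-of-envelope sketch in Python. It only estimates weight size as total parameters times average bits per weight, then checks that against combined VRAM and RAM. The 106B total parameter count, the 70 GB RAM figure, the 10 GB reserve for KV cache and the OS, and the bits-per-weight values are all assumptions for illustration (real GGUF files mix tensor types, so actual sizes will differ), and the function names are made up.

```python
# Back-of-envelope memory estimate for a quantized MoE model split across GPU and CPU.
# All numbers below are assumptions for illustration, not measured file sizes.

def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1e9 bytes) for a model with
    total_params_b billion parameters at an average bits_per_weight."""
    return total_params_b * bits_per_weight / 8


def fits_hybrid(model_gb: float, vram_gb: float, ram_gb: float, reserve_gb: float) -> bool:
    """True if the weights plus a reserve (KV cache, OS, buffers) fit in VRAM + RAM."""
    return model_gb + reserve_gb <= vram_gb + ram_gb


TOTAL_PARAMS_B = 106   # assumed total parameter count, in billions
VRAM_GB = 24           # single RTX 3090
RAM_GB = 70            # assumed; the post only says ">70GB"
RESERVE_GB = 10        # hypothetical headroom for context and the OS

# Illustrative average bits-per-weight values, not tied to specific quant files.
for bpw in (3.5, 4.5, 5.5, 6.5, 8.5):
    size = weights_gb(TOTAL_PARAMS_B, bpw)
    ok = fits_hybrid(size, VRAM_GB, RAM_GB, RESERVE_GB)
    print(f"{bpw:.1f} bpw: ~{size:.0f} GB weights, fits={ok}")
```

Note that this uses total parameters, since every expert has to live somewhere in memory; the active parameter count matters for speed per token rather than for whether the model fits. Long-context work will need a bigger reserve than the placeholder above.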