GLM-4.5 Air Q8 vs GLM-4.5 IQ2_XXS
Posted by therealAtten@reddit | LocalLLaMA | 29 comments
Lowest of lows post, but in all seriousness, both quants are virtually the same size:
GLM-4.5 Air Q8 = 117.5 GB
GLM-4.5 IQ2_XXS = 115.8 GB
I can't be the only one with 128 GB RAM who has asked themselves this question. While GLM-4.5 Air Q6_K_XL is downloading: has anyone by any chance tried both quants and can compare their outputs for your use cases? I am so curious to know whether there is a sweet spot in quality for a given RAM capacity that is not necessarily the largest model you can fit... Thank you for any insights!
StrictSpite3206@reddit
Thanks a lot for the report.
YouDontSeemRight@reddit
Oh, there's a bug with llama-server and Open WebUI! Using Unsloth's IQ4 quant I believe, I found that after my chat query ended, llama-server continued to run inference on the model. I think it's a repeat-query issue, and it might be processing multiple requests concurrently. I tried the default llama.cpp web UI chat and it doesn't happen there. The combination of Air, llama-server, and Open WebUI is causing it. It wouldn't surprise me if it was llama.cpp itself, even though the issue seemed to follow Open WebUI.
Oh! Also, streaming didn't seem to work with one of the quants I tried. Super weird. I think it was the Q4 UD.
Mushoz@reddit
That is Open WebUI generating a title and follow-up questions. It happens with all models & backends. You can disable it in the Open WebUI settings.
YouDontSeemRight@reddit
I like the thought. It ran like that for hours though.
TKGaming_11@reddit
This is probably because the local model is set as the task model and is being used to generate tags and a title
YouDontSeemRight@reddit
I like your thinking. Something's still not right though, and it seems to be stuck in an infinite loop?
pseudonerv@reddit
I would very much love to know what your experience is comparing these! Please do share!
therealAtten@reddit (OP)
Hi, I updated the post, I apologise if the findings are underwhelming. I highly appreciate any prompts to try out!
therealAtten@reddit (OP)
Thanks! I am not sure if I will download both. I just got the Air Q6_K_XL because I thought it would likely perform similarly to the Q8 at 101.5 vs 117.5 GB; that's 16 GB I have left for Chrome in the background lol... I will have to update my runtime and test it a little.
I could go with the GLM-4.5 IQ1_M at 107.6 GB for a comparably-sized quant, or directly try the IQ2_XXS to avoid doing the base model an unnecessary disservice.
hainesk@reddit
Don't forget room for context.
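For a ballpark: KV cache grows linearly with context length and layer count, so it eats GBs fast. A quick sketch of the arithmetic; the dimensions below are hypothetical placeholders, not GLM-4.5's actual config:

```python
# Back-of-envelope KV-cache size:
# 2 (K and V) * layers * kv_heads * head_dim * context length * bytes per element.
n_layers, n_kv_heads, head_dim = 92, 8, 128  # hypothetical dims, not GLM-4.5's real config
ctx_len = 32768                              # tokens of context you want to keep
bytes_per_elt = 2                            # fp16 KV cache

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt
print(f"KV cache: {kv_bytes / 1024**3:.1f} GiB")  # ~11.5 GiB with these numbers
```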
mukz_mckz@reddit
+1. Been burned by this.
East-Cauliflower-150@reddit
I can’t imagine either is better than Unsloth's Q3_K_XL of Qwen3 235B at ~100GB, but worth a try…
therealAtten@reddit (OP)
I have been running both 235B IQ4_XS models so far, but they leave no room for anything else... I have yet to use GLM more extensively to come up with an opinion.
DamiaHeavyIndustries@reddit
I'm curious about the performance you get out of them in terms of output quality. I have 128 GB RAM and I'm downloading the 235B Qwen now.
East-Cauliflower-150@reddit
Q3_K_XL performs very well and I run it daily on my MacBook Pro 128GB. To leave more room for context I run my Streamlit chatbot on my Mac mini, and the MacBook is just an LM Studio-based server. I then use it on my phone anywhere over Tailscale for the perfect setup…
mukz_mckz@reddit
Keep us posted on what you feel about it! It would be a good test of the geo-mean law for MoEs (rough sketch below).
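For reference, the rule of thumb in question estimates a MoE's dense-equivalent capacity as the geometric mean of total and active parameters. A quick sketch using the published sizes (355B-A32B for GLM-4.5, 106B-A12B for Air), with the usual caveat that it's only a heuristic:

```python
from math import sqrt

# Geometric-mean rule of thumb: dense-equivalent params ~ sqrt(total * active)
glm_45  = sqrt(355e9 * 32e9) / 1e9   # ~107B dense-equivalent
glm_air = sqrt(106e9 * 12e9) / 1e9   # ~36B dense-equivalent
print(f"GLM-4.5: ~{glm_45:.0f}B, Air: ~{glm_air:.0f}B")
```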
MLDataScientist@reddit
You should run the perplexity benchmark on both models and see what you get. I know it is not a real knowledge test, but it still gives you an idea.
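If you want to try it, llama.cpp ships a perplexity tool. A minimal wrapper sketch; the model and dataset paths are hypothetical, and you would add offloading flags for your hardware:

```python
import subprocess

# Runs llama.cpp's perplexity tool over a raw text file (e.g. wikitext-2).
# Assumes llama-perplexity is on PATH; paths below are placeholders.
def run_ppl(model_path: str, text_file: str) -> None:
    subprocess.run(
        ["llama-perplexity", "-m", model_path, "-f", text_file],
        check=True,
    )

run_ppl("GLM-4.5-Air-Q8_0.gguf", "wiki.test.raw")
run_ppl("GLM-4.5-IQ2_XXS.gguf", "wiki.test.raw")
```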
kiselsa@reddit
Running perplexity benchmarks on quants of two different models won't tell you anything about which one you should pick, though.
It only has meaning when comparing various quants of one model.
MLDataScientist@reddit
That is true. But it is also true that these are the same family of models (GLM and GLM Air). I think a better measurement would be KL divergence: https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs
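To sketch the idea: the usual recipe (as in the Unsloth write-up) compares a quant's next-token distribution against its own full-precision model's, token by token, which sidesteps the cross-model problem above. A toy example with hypothetical probabilities:

```python
import numpy as np

# Mean KL divergence between a reference (full-precision) model's next-token
# probabilities and a quant's, over the same tokens: sum p_ref * log(p_ref / p_q).
def mean_kl(p_ref: np.ndarray, p_quant: np.ndarray, eps: float = 1e-10) -> float:
    p_ref = np.clip(p_ref, eps, 1.0)
    p_quant = np.clip(p_quant, eps, 1.0)
    return float(np.mean(np.sum(p_ref * np.log(p_ref / p_quant), axis=-1)))

# Toy example: two token positions over a 4-token vocabulary.
p_ref = np.array([[0.7, 0.2, 0.05, 0.05], [0.4, 0.4, 0.1, 0.1]])
p_q   = np.array([[0.6, 0.25, 0.1, 0.05], [0.35, 0.45, 0.1, 0.1]])
print(mean_kl(p_ref, p_q))  # lower = closer to full precision
```

If I remember right, llama.cpp's perplexity tool can also compute this directly via its --kl-divergence mode, so you don't have to extract logits by hand.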
FalseMap1582@reddit
I am also very curious to know which one of the two would perform better. I suspect the lower-quant version would have more world knowledge, whilst the Air version would be more reliable in agentic/coding tasks.
Admirable-Star7088@reddit
You've captured my interest in this gigantic model. I also have 128GB RAM, plus 16GB VRAM, for 144GB of total memory.
The larger IQ3_XXS quant is 144.9GB, just ~1GB more than my total memory. By offloading that single gigabyte to my hard drive with mmap, I could potentially get this running at a somewhat bearable speed, maybe... perhaps.
fallingdowndizzyvr@reddit
You still need room for context. That's GBs.
Admirable-Star7088@reddit
Too late, I downloaded the IQ3_XXS, lol. I'm getting ~1.5 t/s with 8k context. Reasoning is a pain though, but with /nothink it's quite bearable. Practical? Maybe not very much. Fun? Hell yeah!
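Side note for anyone replicating: llama.cpp memory-maps the GGUF by default, so weights that don't fit in RAM just get paged from disk by the OS (at the cost of speed). A minimal llama-cpp-python sketch; the model path and context size are placeholders:

```python
from llama_cpp import Llama

# mmap is the default; the OS pages weights in and out of RAM on demand,
# so a model slightly larger than RAM can still load and run (slowly).
llm = Llama(
    model_path="GLM-4.5-IQ3_XXS.gguf",  # hypothetical local path
    n_ctx=8192,                         # 8k context, as above
    use_mmap=True,                      # keep weights memory-mapped
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```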
a_beautiful_rhind@reddit
Big 4.5 is better... the active parameters on Air are low. It compares to Gemma and makes similar mistakes. I have used both on API and at ~4bit.
The brain damage from IQ2 hurts less than Air's fewer active parameters and smaller total size do. It will be slower, though.
Thireus@reddit
See perplexity graphs:
https://github.com/Thireus/GGUF-Tool-Suite/blob/main/ppl_graphs/GLM-4.5-Air.svg
https://github.com/Thireus/GGUF-Tool-Suite/blob/main/ppl_graphs/GLM-4.5.svg
GLM-4.5 Air Q8 is 4.5798, while GLM-4.5 IQ2_XXS should be in the 3.5 region.
therealAtten@reddit (OP)
Thanks so much for the resource! OpenAI broke my ability to read charts correctly, but what I see in those SVGs is that GLM-4.5 at 2 bits has a value of 5.3, whereas the Air at 6 or 8 bits has a value of 4.6.
What am I missing? I unfortunately have no coding background and happily use local LLMs in LM Studio on Windows, so I haven't tried running any perplexity benchmarks myself. Most notably though, TG is twice as fast on the Air compared to the full one, and the Air is the same speed as my 235B IQ4_XS.
Thireus@reddit
You can find them here: https://github.com/Thireus/GGUF-Tool-Suite/blob/main/ppl_from_others.db
fallingdowndizzyvr@reddit
I've only been using a Q2 quant of GLM-4.5 for a day or so, but offhand my impression is that it's better than the Q6 quant of GLM-4.5 Air I had been using. It's more... well... thoughtful. It's not that Air is bad. It's really good. But even the Q2 quant of the big one just seems better.
therealAtten@reddit (OP)
...y'all convinced me to download the GLM-4.5 IQ2_XXS in addition to the Air Q6 I was already happy with.