How to run the GLM-4.7 model locally on your own device (guide)
Posted by Dear-Success-1441@reddit | LocalLLaMA | 61 comments
- GLM-4.7 is Z.ai’s latest thinking model, delivering stronger coding, agent, and chat performance than GLM-4.6.
- It achieves SOTA performance on SWE-bench (73.8%, +5.8), SWE-bench Multilingual (66.7%, +12.9), and Terminal Bench 2.0 (41.0%, +16.5).
- The full 355B-parameter model requires 400GB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces that to 134GB (about a 66% reduction).
Unsloth guide - https://docs.unsloth.ai/models/glm-4.7
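For anyone who wants to script the download, here is a minimal sketch using huggingface_hub (the repo id comes from the links in this thread; the allow_patterns glob is an assumed naming convention, so adjust it to whichever quant you actually want):

```python
# Sketch: fetch only the Unsloth Dynamic 2-bit shards of GLM-4.7.
# Repo id is from this thread; the glob is an assumed naming
# convention -- swap in e.g. "*UD-TQ1_0*" for the 1-bit quant.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-4.7-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # ~134GB of GGUF shards
    local_dir="GLM-4.7-GGUF",
)
```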
Barkalow@reddit
Is it really worth running the model in 1- or 2-bit vs something that hasn't been possibly lobotomized by quantization?
InfinityApproach@reddit
I've been playing around with 4.7 IQ2-S all day and I am seriously impressed. It has passed all the logic, world knowledge, and philosophy tests that I usually throw at new models. It's now my favorite model I can run. I just have to wait a long time at 3 tps.
crantob@reddit
Hello? Am I supposed to know what hardware you run?
Hello?
You forgot to mention what hardware you get 3 tps with...
Good_Roll@reddit
I hope you don't normally speak to people this way
crantob@reddit
Depends on what the person just said.
Impossible_Hour5036@reddit
Hello? Just ask a normal question like a normal human. No one is under any sort of obligation to predict when you'll be reading a thread and prepare all of the info for you.
"Hey would you mind telling me what hardware you're using?" <- Normal human ✅ [Your response] <- Someone who thinks people owe them something ❌
jeffwadsworth@reddit
Well, I have run the Kimi K2 Thinking 1-bit version since Unsloth dropped it, and it does amazingly well considering its almost complete lobotomy on paper. I use the 4-bit version of GLM 4.6 and it codes things even better than the website version for some reason. Temperature is super important, so I go with 1.0 for writing and 0.4 for coding tasks.
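If you want to reproduce that per-task temperature split with llama-cpp-python, a rough sketch looks like this (the model path, context size, and helper are illustrative, and the temperatures are just the values above, not an official recommendation):

```python
# Rough sketch: one loaded model, a different sampling temperature per
# task, per the comment above. Paths and values are examples only.
from llama_cpp import Llama

llm = Llama(model_path="GLM-4.7-UD-Q2_K_XL.gguf", n_ctx=16384, n_gpu_layers=-1)

def ask(prompt: str, task: str = "coding") -> str:
    temp = 0.4 if task == "coding" else 1.0  # 0.4 for coding, 1.0 for writing
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
    )
    return out["choices"][0]["message"]["content"]
```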
ortegaalfredo@reddit
Yes, it happened to me too; for some reason the Q4 version is better than the web version. It must be heavily quantized on the web.
Particular-Way7271@reddit
Who knows what quant the website gives you...
a_beautiful_rhind@reddit
Better than not running it at all. Expect more mistakes. EXL3 can even squeeze it into 96GB.
Barkalow@reddit
For sure, it was an honest question. I always operated under the assumption that a smaller model that's less quantized would outperform a larger one that's been reduced so much
Vusiwe@reddit
I think I'm finally moving on from Llama 3.3 70b Q8, to running GLM 4.7 Q2. It's a large step up.
IrisColt@reddit
It's the opposite.
a_beautiful_rhind@reddit
Yea that gets really fuzzy these days. Officially it was the opposite.
Allseeing_Argos@reddit
I'm running mostly GLM 4.6 Q2 and it's my favorite chat model by far.
yoracale@reddit
Thank you for testing and your feedback! 🙏🥰
Pristine-Woodpecker@reddit
It needs testing. It was true for DeepSeek; nobody seems to have tested it for this one.
jeffwadsworth@reddit
I use DS 3.1 Terminus with temperature 0.4 for coding tasks and wow. That model can cook.
cliffninja@reddit
This sounds very cool, I'm curious to try this model with an ollama cc endpoint for coding.
I'm on an M4 Max MacBook Pro with 64GB, any suggestions on a quantized model?
I'll likely use it for smaller tasks, saving my Anthropic tokens, long ralph loops etc
Impossible_Hour5036@reddit
The z.ai Christmas deal is still going on: Max plan for 1 year = $288. Billions and billions of those sweet mediocre GLM tokens. A conservative estimate is 20x faster TPS; I haven't measured.
Infinite100p@reddit
Does anyone find 5 t/s usable?
For what??
WyattTheSkid@reddit
Good enough for running batch inference to make a dataset to distill glm 4.7 into a smaller model
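A hedged sketch of what that batch job can look like: point the OpenAI client at a local llama.cpp server (llama-server exposes an OpenAI-compatible /v1 API) and append prompt/response pairs to a JSONL file. The endpoint, model name, and file paths are all placeholders:

```python
# Sketch: build a distillation dataset from a locally served GLM-4.7.
# Assumes an OpenAI-compatible endpoint (e.g. llama-server) on :8080;
# prompts.txt, the model name, and the output path are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("prompts.txt") as f, open("distill.jsonl", "a") as out:
    for prompt in (line.strip() for line in f if line.strip()):
        resp = client.chat.completions.create(
            model="glm-4.7",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.4,
        )
        out.write(json.dumps({
            "prompt": prompt,
            "response": resp.choices[0].message.content,
        }) + "\n")
```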
Impossible_Hour5036@reddit
Are most people doing this as a hobby, for professional research, etc? Of course, it's hard to know what "most people" do at all, but I'm just curious as to what, if any, practical purposes doing this might have. I completely get doing it purely for the enjoyment of learning and experimenting and tweaking, and I'm 1000% in support of it.
But for someone who does their learning and experimenting in different channels rather than running AI locally, is there any actual practical reason to invest money in running GLM locally (beyond privacy obviously)?
Opening_Move_6570@reddit
Why is DDR5 so expensive now?
blbd@reddit
I suspect that for most of us this will be "seconds per token" not "tokens per second".
amjadmh73@reddit
😂
cosicic@reddit
y'all think it will run on my macbook air? Q1_XXXXXXXXXXS 🙏
GotBanned3rdTime@reddit
💀
joexner@reddit
Is there a chance I could run some quant of GLM-4.7 on my 48GB M4 pro MBP? I'm sure it'd be slow as molasses, but can I replace my GH Copilot subscription yet if I'm willing to wait for it to cook?
Whole-Assignment6240@reddit
Does quantization impact the model's reasoning abilities significantly?
Proud_Fox_684@reddit
Yes, without a doubt. But it depends on how low you go, and how well it has been quantized.
Sophia7Inches@reddit
Can I run it if I have a GPU with 24GB VRAM and 64GB of System RAM?
FullOf_Bad_Ideas@reddit
It would spill into your swap/disk, so it would be uneven in speed and very slow overall, probably around 0.1-0.5 t/s. If that qualifies as "running" in your dictionary, then yes, it will run. But it won't run at usable speeds.
With 48GB VRAM and 128GB RAM I got about 4 t/s TG speed on the IQ3_XXS quant of GLM 4.6 at low context.
crantob@reddit
Thank you, that is useful information.
yoracale@reddit
Yes it's possible if you use the 1-bit one: https://huggingface.co/unsloth/GLM-4.7-GGUF?show_file_info=GLM-4.7-UD-TQ1_0.gguf
Admirable-Star7088@reddit
No, you need at least 128GB RAM.
PopularKnowledge69@reddit
How can I run it on a configuration of 2x48 GB GPU + 64 GB RAM?
jeffwadsworth@reddit
If you can, grab LM Studio and check the Unsloth GLM models. On the right they will list the size. You must have at least that much memory just to hold the model, plus more for any amount of context. For example, I use the 4-bit GLM 4.7 model and it is a 203GB model. So, for adequate performance, you will need something like 300GB to run that baby. In your case, you could try to run the 1-bit or 2-bit GLM 4.7 with llama.cpp.
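The back-of-the-envelope version of that memory math, using the 203GB figure above (the KV-cache and overhead numbers are rough assumptions that vary with the model and cache quantization):

```python
# Rough fit check: model file + KV cache + overhead vs. available memory.
# Only model_gb comes from the thread; the rest are ballpark assumptions.
model_gb = 203            # 4-bit GLM-4.7 GGUF, per the comment above
kv_gb_per_8k_ctx = 4.0    # assumed; depends on model and KV cache type
ctx_tokens = 32768
overhead_gb = 8           # runtime buffers and OS headroom (assumed)

needed = model_gb + kv_gb_per_8k_ctx * ctx_tokens / 8192 + overhead_gb
print(f"~{needed:.0f} GB needed")  # ~227 GB for this example setup
```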
zipzapbloop@reddit
what are you running it on if you don't mind me asking?
jeffwadsworth@reddit
HP Z8 G4 with dual Xeons and 1.5TB of RAM.
RaGE_Syria@reddit
that... is a shitton of RAM...
jeffwadsworth@reddit
3.2 t/s with the 4-bit GLM 4.7 unsloth. Quite usable for me considering it is a coding wizard.
New-Yogurtcloset1984@reddit
I would go as far as to say it is a metric fuckton of ram.
RazzmatazzReal4129@reddit
you can't
PopularKnowledge69@reddit
Why is it possible with way less VRAM?
random-tomato@reddit
You have 96GB VRAM + 64 GB RAM = 160 GB of memory total. Definitely more than enough to run Q2_K_XL!!!
Excellent-Sense7244@reddit
I hate being GPU-miserable
Nobby_Binks@reddit
FWIW, I can run Q3_K_XL with 64K context at ~7 tps on 4x3090s and an old EPYC DDR4 system. I may be able to eke out a bit more, but my llama.cpp tweaking skills are not that good yet.
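For reference, the usual llama.cpp tweak for big MoE models on a GPU-plus-DDR4 box is to keep all layers "on GPU" but override the expert FFN tensors to CPU RAM. A hedged sketch, here wrapped in Python (the shard filename is a placeholder, the -ot pattern is the commonly used one for these MoE GGUFs, and flag support varies by llama.cpp build):

```python
# Sketch: launch llama-server with MoE expert tensors kept in CPU RAM.
# Verify flags and the tensor-name regex against your build's --help
# before relying on this; treat it as a starting point, not gospel.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "GLM-4.7-UD-Q3_K_XL-00001-of-00003.gguf",  # placeholder name
    "-c", "65536",                 # 64K context, as in the comment above
    "--n-gpu-layers", "999",       # put all layers on the GPUs...
    "-ot", r".ffn_.*_exps.=CPU",   # ...then push expert FFNs to CPU RAM
])
```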
lolwutdo@reddit
Oh damn, didn't realize 4.7 is a bigger model; I thought it was the same size as 4.5 and 4.6
mikael110@reddit
It isn't, it's 355B total parameters which is exactly the same as 4.6 and 4.5.
lolwutdo@reddit
oh yeahh you're right, I got confused with Air. lol
random-tomato@reddit
I'm 99% sure GLM 4.7 is the exact same size as 4.5 and 4.6
jeffwadsworth@reddit
Love it so far. It has some sass to it.
jeffwadsworth@reddit
Grabbing the 4-bit unsloth. I would love to see the difference in coding tasks between it and the 1-bit/2-bit versions. But I am usually happy with half-precision.
Healthy-Nebula-3603@reddit
GGUF Q2 models are nothing more than a gimmick.
yoracale@reddit
Actually, if you look at our third-party Aider benchmarks, you can see the 2-bit DeepSeek-V3.1 quant is only slightly worse than full-precision DeepSeek-R1-0528. GLM-4.7 should see similar accuracy recovery.
3-bit is definitely the sweet spot.
Also, let's not forget: if you don't want to run the quantized versions, that's totally fine, you can run the full-precision version, which we also uploaded.
Healthy-Nebula-3603@reddit
That's 3-bit, not 2-bit.
yoracale@reddit
It's 3-bit, 2-bit and 1-bit.
Pristine-Woodpecker@reddit
"Should" is very load-bearing here.
This is, for example, absolutely not true for Qwen3-235B. Without testing, you don't know if it's true for GLM.
yoracale@reddit
We tested it and it works great actually; we just haven't benchmarked it since it's very resource-intensive.
If you don't want to use 2-bit, like I said, that's fine; there are always the bigger quants available to use and run!
Admirable-Star7088@reddit
What do you mean? I have used UD-Q2_K_XL quants of GLM 4.5 and 4.6, and I'm testing 4.7 right now. They are the smartest local models I've ever run, way smarter than other, smaller models such as GLM 4.5 Air at Q8 or Qwen3-235B at Q4.
Maybe it's true that Q2 is often too aggressive a quant for most models, but GLM 4.x is definitely an exception.