How to run the GLM-4.7 model locally on your own device (guide)
Posted by Dear-Success-1441@reddit | LocalLLaMA | 61 comments
- GLM-4.7 is Z.ai’s latest thinking model, delivering stronger coding, agent, and chat performance than GLM-4.6.
- It achieves SOTA performance on SWE-bench (73.8%, +5.8), SWE-bench Multilingual (66.7%, +12.9), and Terminal Bench 2.0 (41.0%, +16.5).
- The full 355B-parameter model requires 400GB of disk space, while the Unsloth Dynamic 2-bit GGUF reduces that to 134GB (about a 66% reduction).
Unsloth guide - https://docs.unsloth.ai/models/glm-4.7
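For anyone who wants to script the download, here is a minimal sketch using huggingface_hub (the repo id comes from the links in this thread; the allow_patterns glob is an assumed naming convention, so adjust it to whichever quant you actually want):

```python
# Sketch: fetch only the Unsloth Dynamic 2-bit shards of GLM-4.7.
# Repo id is from this thread; the glob is an assumed naming
# convention -- swap in e.g. "*UD-TQ1_0*" for the 1-bit quant.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-4.7-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # ~134GB of GGUF shards
    local_dir="GLM-4.7-GGUF",
)
```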
Barkalow@reddit
Is it really worth running the model in 1- or 2-bit vs something that hasn't been possibly lobotomized by quantization?
InfinityApproach@reddit
I've been playing around with 4.7 IQ2-S all day and I am seriously impressed. It has passed all the logic, world knowledge, and philosophy tests that I usually throw at new models. It's now my favorite model I can run. I just have to wait a long time at 3 tps.
crantob@reddit
Hello? Am I supposed to know what hardware you run?
Hello?
You forgot to mention what hardware you get 3 tps with...
Good_Roll@reddit
I hope you don't normally speak to people this way
crantob@reddit
Depends on what the person just said.
Impossible_Hour5036@reddit
Hello? Just ask a normal question like a normal human. No one is under any sort of obligation to predict when you'll be reading a thread and prepare all of the info for you.
"Hey would you mind telling me what hardware you're using?" <- Normal human ✅ [Your response] <- Someone who thinks people owe them something ❌
jeffwadsworth@reddit
Well, I have run the Kimi K2 Thinking 1-bit version since Unsloth dropped it, and it does amazingly well considering its almost complete lobotomy on paper. I use the 4-bit version of GLM 4.6 and it codes things even better than the website version for some reason. Temperature is super important, so I go with 1.0 for writing and 0.4 for coding tasks.
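If you want to reproduce that per-task temperature split with llama-cpp-python, a rough sketch looks like this (the model path, context size, and helper are illustrative, and the temperatures are just the values above, not an official recommendation):

```python
# Rough sketch: one loaded model, a different sampling temperature per
# task, per the comment above. Paths and values are examples only.
from llama_cpp import Llama

llm = Llama(model_path="GLM-4.7-UD-Q2_K_XL.gguf", n_ctx=16384, n_gpu_layers=-1)

def ask(prompt: str, task: str = "coding") -> str:
    temp = 0.4 if task == "coding" else 1.0  # 0.4 for coding, 1.0 for writing
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
    )
    return out["choices"][0]["message"]["content"]
```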
ortegaalfredo@reddit
Yes, it happened to me too; for some reason the Q4 version is better than the web version. It must be heavily quantized on the web.
Particular-Way7271@reddit
Who knows what quant the website gives you...
a_beautiful_rhind@reddit
Better than not running it at all. Expect more mistakes. EXL3 can even squeeze it into 96GB.
Barkalow@reddit
For sure, it was an honest question. I always operated under the assumption that a smaller model that's less quantized would outperform a larger one that's been reduced so much
Vusiwe@reddit
I think I'm finally moving on from Llama 3.3 70b Q8, to running GLM 4.7 Q2. It's a large step up.
IrisColt@reddit
It's the opposite.
a_beautiful_rhind@reddit
Yea that gets really fuzzy these days. Officially it was the opposite.
Allseeing_Argos@reddit
I'm running mostly GLM 4.6 Q2 and it's my favorite chat model by far.
yoracale@reddit
Thank you for testing and your feedback! 🙏🥰
Pristine-Woodpecker@reddit
It needs testing. It was true for DeepSeek; nobody seems to have tested it for this one.
jeffwadsworth@reddit
I use DS 3.1 Terminus with temperature 0.4 for coding tasks and wow. That model can cook.
cliffninja@reddit
This sounds very cool, I'm curious to try this model with an ollama cc endpoint for coding.
I'm on an M4 Max MacBook Pro with 64GB, any suggestions on a quantized model?
I'll likely use it for smaller tasks, saving my Anthropic tokens, long ralph loops etc
Impossible_Hour5036@reddit
The z.ai Christmas deal is still going on: Max plan for 1 year = $288. Billions and billions of those sweet mediocre GLM tokens. A conservative estimate is 20x faster TPS; I haven't measured.
Infinite100p@reddit
Does anyone find 5 t/s usable?
For what??
WyattTheSkid@reddit
Good enough for running batch inference to make a dataset to distill glm 4.7 into a smaller model
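A hedged sketch of what that batch job can look like: point the OpenAI client at a local llama.cpp server (llama-server exposes an OpenAI-compatible /v1 API) and append prompt/response pairs to a JSONL file. The endpoint, model name, and file paths are all placeholders:

```python
# Sketch: build a distillation dataset from a locally served GLM-4.7.
# Assumes an OpenAI-compatible endpoint (e.g. llama-server) on :8080;
# prompts.txt, the model name, and the output path are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("prompts.txt") as f, open("distill.jsonl", "a") as out:
    for prompt in (line.strip() for line in f if line.strip()):
        resp = client.chat.completions.create(
            model="glm-4.7",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.4,
        )
        out.write(json.dumps({
            "prompt": prompt,
            "response": resp.choices[0].message.content,
        }) + "\n")
```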
Impossible_Hour5036@reddit
Are most people doing this as a hobby, for professional research, etc? Of course, it's hard to know what "most people" do at all, but I'm just curious as to what, if any, practical purposes doing this might have. I completely get doing it purely for the enjoyment of learning and experimenting and tweaking, and I'm 1000% in support of it.
But for someone who does their learning and experimenting in different channels rather than running AI locally, is there any actual practical reason to invest money in running GLM locally (beyond privacy obviously)?
Opening_Move_6570@reddit
Why is DDR5 so expensive now?
blbd@reddit
I suspect that for most of us this will be "seconds per token" not "tokens per second".
amjadmh73@reddit
😂
cosicic@reddit
y'all think it will run on my macbook air? Q1_XXXXXXXXXXS 🙏
GotBanned3rdTime@reddit
💀
joexner@reddit
Is there a chance I could run some quant of GLM-4.7 on my 48GB M4 pro MBP? I'm sure it'd be slow as molasses, but can I replace my GH Copilot subscription yet if I'm willing to wait for it to cook?
Whole-Assignment6240@reddit
Does quantization impact the model's reasoning abilities significantly?
Proud_Fox_684@reddit
Yes, without a doubt. But it depends on how low you go, and how well it has been quantized.
Sophia7Inches@reddit
Can I run it if I have a GPU with 24GB VRAM and 64GB of System RAM?
FullOf_Bad_Ideas@reddit
It would spill into your swap/disk, so it would be uneven in speed and very slow overall, probably around 0.1-0.5 t/s. If that qualifies as "running" in your dictionary, then yes, it will run. But it won't run at usable speeds.
With 48GB VRAM and 128GB RAM I got about 4 t/s TG speed on the IQ3_XXS quant of GLM 4.6 at low context.
crantob@reddit
Thank you, that is useful information.
yoracale@reddit
Yes it's possible if you use the 1-bit one: https://huggingface.co/unsloth/GLM-4.7-GGUF?show_file_info=GLM-4.7-UD-TQ1_0.gguf
Admirable-Star7088@reddit
No, you need at least 128GB RAM.
PopularKnowledge69@reddit
How can I run it on a configuration of 2x48 GB GPU + 64 GB RAM?
jeffwadsworth@reddit
If you can, grab LM Studio and check the Unsloth GLM models. On the right they will list the size. You must have at least that much memory just to hold the model, plus more for any amount of context. For example, I use the 4-bit GLM 4.7 model and it is a 203GB model. So, for adequate performance, you will need something like 300GB to run that baby. In your case, you could try to run the 1-bit or 2-bit GLM 4.7 with llama.cpp.
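The back-of-the-envelope version of that memory math, using the 203GB figure above (the KV-cache and overhead numbers are rough assumptions that vary with the model and cache quantization):

```python
# Rough fit check: model file + KV cache + overhead vs. available memory.
# Only model_gb comes from the thread; the rest are ballpark assumptions.
model_gb = 203            # 4-bit GLM-4.7 GGUF, per the comment above
kv_gb_per_8k_ctx = 4.0    # assumed; depends on model and KV cache type
ctx_tokens = 32768
overhead_gb = 8           # runtime buffers and OS headroom (assumed)

needed = model_gb + kv_gb_per_8k_ctx * ctx_tokens / 8192 + overhead_gb
print(f"~{needed:.0f} GB needed")  # ~227 GB for this example setup
```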
zipzapbloop@reddit
what are you running it on if you don't mind me asking?
jeffwadsworth@reddit
HP Z8 G4 with dual Xeons and 1.5TB of RAM.
RaGE_Syria@reddit
that... is a shitton of RAM...
jeffwadsworth@reddit
3.2 t/s with the 4-bit GLM 4.7 unsloth. Quite usable for me considering it is a coding wizard.
New-Yogurtcloset1984@reddit
I would go as far as to say it is a metric fuckton of ram.
RazzmatazzReal4129@reddit
you can't
PopularKnowledge69@reddit
Why is it possible with way less VRAM?
random-tomato@reddit
You have 96GB VRAM + 64 GB RAM = 160 GB of memory total. Definitely more than enough to run Q2_K_XL!!!
Excellent-Sense7244@reddit
I hate being GPU-miserable
Nobby_Binks@reddit
FWIW, I can run Q3_K_XL with 64K context at ~7 tps on 4x3090s and an old EPYC DDR4 system. I may be able to eke out a bit more, but my llama.cpp tweaking skills are not that good yet.
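For reference, the usual llama.cpp tweak for big MoE models on a GPU-plus-DDR4 box is to keep all layers "on GPU" but override the expert FFN tensors to CPU RAM. A hedged sketch, here wrapped in Python (the shard filename is a placeholder, the -ot pattern is the commonly used one for these MoE GGUFs, and flag support varies by llama.cpp build):

```python
# Sketch: launch llama-server with MoE expert tensors kept in CPU RAM.
# Verify flags and the tensor-name regex against your build's --help
# before relying on this; treat it as a starting point, not gospel.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "GLM-4.7-UD-Q3_K_XL-00001-of-00003.gguf",  # placeholder name
    "-c", "65536",                 # 64K context, as in the comment above
    "--n-gpu-layers", "999",       # put all layers on the GPUs...
    "-ot", r".ffn_.*_exps.=CPU",   # ...then push expert FFNs to CPU RAM
])
```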
lolwutdo@reddit
Oh damn, didn't realize 4.7 is a bigger model; I thought it was the same size as 4.5 and 4.6
mikael110@reddit
It isn't, it's 355B total parameters which is exactly the same as 4.6 and 4.5.
lolwutdo@reddit
oh yeahh you're right, I got confused with Air. lol
random-tomato@reddit
I'm 99% sure GLM 4.7 is the exact same size as 4.5 and 4.6
jeffwadsworth@reddit
Love it so far. It has some sass to it.
jeffwadsworth@reddit
Grabbing the 4-bit unsloth. I would love to see the difference in coding tasks between it and the 1-bit/2-bit versions. But I am usually happy with half-precision.
Healthy-Nebula-3603@reddit
GGUF Q2 models are nothing more than a gimmick.
yoracale@reddit
Actually, if you look at our third-party Aider benchmarks, you can see the 2-bit DeepSeek-V3.1 quant is only slightly worse than full-precision DeepSeek-R1-0528. GLM-4.7 should see similar accuracy recovery.
3-bit is definitely the sweet spot.
Also, let's not forget: if you don't want to run the quantized versions, that's totally fine, you can run the full-precision version, which we also uploaded.
Healthy-Nebula-3603@reddit
That's 3-bit, not 2-bit.
yoracale@reddit
It's 3-bit, 2-bit and 1-bit.
Pristine-Woodpecker@reddit
"Should" is very load-bearing here.
This is, for example, absolutely not true for Qwen3-235B. Without testing, you don't know if it's true for GLM.
yoracale@reddit
We tested it and it works great actually; we just haven't benchmarked it since it's very resource-intensive.
If you don't want to use 2-bit, like I said, that's fine; there are always the bigger quants available to use and run!
Admirable-Star7088@reddit
What do you mean? I have used UD-Q2_K_XL quants of GLM 4.5 and 4.6, and I'm testing 4.7 right now. They are the smartest local models I've ever run, way smarter than other, smaller models such as GLM 4.5 Air at Q8 or Qwen3-235B at Q4.
Maybe it's true that Q2 is often too aggressive a quant for most models, but GLM 4.x is definitely an exception.