TheaterFire

GLM-4.7-REAP-50-W4A16: 50% Expert-Pruned + INT4 Quantized GLM-4 (179B params, ~92GB)

Posted by Maxious@reddit | LocalLLaMA | View on Reddit | 74 comments

Revolutionary-Tip821@reddit

Using

    vllm serve /home/xxxx/Docker/xxx/GLM-4.7-REAP-40-W4A16 \
      --served-model-name local/GLM-4.7-REAP-local \
      --host 0.0.0.0 --port 8888 \
      --tensor-parallel-size 2 --pipeline-parallel-size 3 \
      --quantization gptq \
      --max-model-len 14000 \
      --gpu-memory-utilization 0.96 \
      --block-size 32 \
      --max-num-seqs 8 \
      --max-num-batched-tokens 8192 \
      --enable-expert-parallel \
      --enable-prefix-caching \
      --enable-chunked-prefill \
      --disable-custom-all-reduce \
      --disable-log-requests \
      --tool-call-parser glm47 \
      --reasoning-parser glm45 \
      --enable-auto-tool-choice \
      --trust-remote-code

on 6 RTX 4090s, it starts generating and then falls into repeating the same word endlessly. Has anyone had the same experience?
View on Reddit #74936925

Sero_x@reddit

The repeating is a pipeline bug that happens with this model
View on Reddit #74962188

Revolutionary-Tip821@reddit

I also tried --tensor-parallel-size 4, but it still gets stuck repeating the same word, so this model is not usable in this state
View on Reddit #75010547

Sero_x@reddit

Brother in Christ, I have used all the models for the last 24 hours for coding, deep research, etc. Your inference layer is busted.
View on Reddit #75030849

One-Macaron6752@reddit

Care to comment? I am using exactly your model card's vLLM usage / invocation pattern and I am also stuck with it repeating the same word! :(
View on Reddit #75201953

Phaelon74@reddit

Why use pipeline parallelism? Just run TP 6 and rock and roll. Additionally, with the reasoning parser I have seen what you are seeing. I don't use it and only use expert-parallel.
View on Reddit #74937719

Hisma@reddit

you can't do TP on 6 GPUs. It needs to be powers of 2. 2/4/8 GPUs is typically what's used for TP.
View on Reddit #74942754

Phaelon74@reddit

You can do TP on ANY number of GPUs, but vLLM and SGLang don't want to do the hard math to make it work, so you can't on their stuff. EXL3 and tabby api can do TP6.
View on Reddit #74942881
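
Whether a given tensor-parallel degree is usable often comes down to divisibility rather than powers of 2: engines that shard attention by head generally need the head count to split evenly across GPUs. A minimal sketch of that check, assuming hypothetical head counts (the values below are illustrative, not GLM-4.7's actual configuration) and a hypothetical helper `valid_tp_sizes`. It only looks at attention heads; real engines may impose further constraints.

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Return TP degrees up to max_gpus that split the attention heads evenly.

    Illustrative only: this mirrors a common constraint of head-sharded
    engines, not any specific engine's full validation logic.
    """
    return [tp for tp in range(1, max_gpus + 1) if num_attention_heads % tp == 0]


# Hypothetical head counts, chosen only to show both outcomes.
print(valid_tp_sizes(num_attention_heads=96, max_gpus=8))  # 6 divides 96
print(valid_tp_sizes(num_attention_heads=64, max_gpus=8))  # 6 does not divide 64
```

With 96 heads a 6-way split passes this check, while with 64 heads only 1, 2, 4 and 8 do, which is one reason powers of 2 are the usual advice.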

fungnoth@reddit

I'm curious about the low VRAM + OK system RAM situation with moe offloading
View on Reddit #74926328

jhnnassky@reddit

Do you already have good ones?
View on Reddit #74927952

fungnoth@reddit

I sometimes use the GLM Air REAP. 10 layers on GPU and 38 layers of MoE on CPU. Usable with 12GB VRAM and 64GB RAM
View on Reddit #75160464

a_beautiful_rhind@reddit

You can run 2.0bpw exl3 GLM and it's around 90gb. Comparison here would be interesting. When I tried previous 4.6 REAP, about 3 of them, the EXL was better subjectively.

>Calibrated on code/agentic tasks; may have reduced performance on other domains

All those other reap forgot how to talk outside such domains. It's interesting how nobody has deviated from the codeslop datasets cerebras used. My theory is a more rounded english only dataset would preserve much more performance. Then someone could do chinese only, etc.
View on Reddit #74932260

One-Macaron6752@reddit

I can second your opinion. I have also tried 2.65bpw exl3 quants and they felt worlds better than the REAP. For me, the REAP version was: 1) full of hallucinations in places I’d never expected them 2) full of Chinese & Arabic characters dropping in almost everywhere…
View on Reddit #74962261

Sero_x@reddit

These sound like inference layer errors to me.
View on Reddit #75030774

projectmus3@reddit

You’re the person who does roleplay with LLMs and talk to fictional characters right? Yeah maybe you should create a calibration dataset for roleplay and use that to REAP instead. The REAP models from Cerebras focus on coding, tool calling and agentic workloads, and they’ve been doing amazing for me.
View on Reddit #74950923

a_beautiful_rhind@reddit

Really only thing stopping me is the massive download. I've heard mixed results from people coding with it tho and if you do a perplexity test, usually it's double digit. The REAPS I tried would forget who presidents were and other basic facts. Left me a bit skeptical to invest big effort into it.
View on Reddit #74951561
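
For context on the perplexity figure mentioned above: perplexity is the exponential of the mean negative log-likelihood per token, so "double digit" means the model is, on average, about as uncertain as choosing among ten or more equally likely tokens. A minimal sketch, assuming you already have per-token natural-log probabilities from whatever runtime you use (the sample values are made up, and `perplexity` is a hypothetical helper):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the given tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Made-up probabilities the model assigned to the correct next token.
sample = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(perplexity(sample))
```

Comparing this number for the pruned and unpruned model on the same text is more informative than either value alone.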

Guilty_Nothing_2858@reddit

I'd like to know how the performance is. Faster but with a poor satisfaction rate? I saw a lot of comments from the Chinese dev community saying the GLM-4.7 cloud service runs a quantised version and that the answers are not good
View on Reddit #75020666

LegacyRemaster@reddit

https://preview.redd.it/nse8fr8mzabg1.png?width=2013&format=png&auto=webp&s=4d86c31bb4db3967d06dc05a7bf3a589395fc70b Super quick test. glm-4.7-reap-40p IQ3_S - 94.57 gb. Fits on 96gb with 4k context. Will test more.
View on Reddit #75011057

Goghor@reddit

!remindme 7 days
View on Reddit #74996733

Revolutionalredstone@reddit

Next please do nanbeige, this thing is a beast but needs prune + int4! https://old.reddit.com/r/LocalLLaMA/comments/1q2p2wa/nanbeige4_is_an_incredible_model_for_running/
View on Reddit #74952236

LocoMod@reddit

It's a 3B model that fits on a lemon. What's the point?
View on Reddit #74966843

Revolutionalredstone@reddit

You'd be surprised! I've got plenty of portable devices with 2GB vram and the diff between 3B partial and 2B fully offloaded is HUGE. Not so much about being ABLE to run, but being able to run FAST!
View on Reddit #74967232

LocoMod@reddit

Fair. I ordered the new Arduino recently. I wonder if a quant would run on that.
View on Reddit #74985112

SlowFail2433@reddit

Edge AI is a thing, often very small chips
View on Reddit #74969067

thejoyofcraig@reddit

Nanbeige is a 3b model. What are you hoping to prune it down to??
View on Reddit #74966707

Revolutionalredstone@reddit

TBH I'd take a 500m or 250m param version with very big excitement! The other models pruned to this size, like Gemma and granite, were absolute bangers! And this one has a lot more junk in the trunk, so to speak. Ultra nano models can be VERY useful even if they can barely speak ;D
View on Reddit #74967370

SlowFail2433@reddit

If you go small enough it stops getting much faster, especially at high batch sizes
View on Reddit #74969048

Revolutionalredstone@reddit

Agreed, once you're fully offloaded to GPU you're usually good to go! The other advantage of ultra small models is model load-up time. It's pretty glorious when your task can be done with a TINY model so the whole process from starting the program to getting a prompt is short! ta
View on Reddit #74969370

SlowFail2433@reddit

Yes true I love using 7B and below on any hardware for that fast load
View on Reddit #74972601

sampdoria_supporter@reddit

I am completely ignorant of this model and REAP as a method but I'm hoping to hell this means running it on strix halo is possible
View on Reddit #74963689

fallingdowndizzyvr@reddit

You can run 4.7 on Strix Halo without this.
View on Reddit #74973008

GreatAlmonds@reddit

How? Unless you're running 1bit quants
View on Reddit #74974638

jacek2023@reddit

I need Q3, anyone working on GGUF?
View on Reddit #74928220

Kamal965@reddit

He just finished uploading them: [https://huggingface.co/0xSero/GLM-4.7-REAP-50-GGUF](https://huggingface.co/0xSero/GLM-4.7-REAP-50-GGUF)
View on Reddit #74940986

fallingdowndizzyvr@reddit

"404 Sorry, we can't find the page you are looking for."
View on Reddit #74967973

jacek2023@reddit

thank you!!!
View on Reddit #74941008

Kamal965@reddit

Update: https://preview.redd.it/4yd0psb205bg1.png?width=1381&format=png&auto=webp&s=1e10c322d224366331326dfaa7e9a7fb77b55979 The math ain't mathing, right?
View on Reddit #74941202

Kamal965@reddit

Np! By "just now," I literally mean just now. I refreshed the page 5 minutes ago and the repo was empty, lol. So maybe wait a few more minutes because he might be uploading more!
View on Reddit #74941076

noctrex@reddit

Let's try I guess
View on Reddit #74937821

Position_Emergency@reddit

Can see on the Huggingface page you're in the process of doing benchmarks 💯 Will be interested to see the results! Have you considered doing a similar size version of MiniMax M2.1? (and therefore a less aggressive REAP as it is a 220B model)
View on Reddit #74928603

SillyLilBear@reddit

M2.1 fits in similar RAM without REAPing, just at 4-bit.
View on Reddit #74958589

colin_colout@reddit

Minimax models are ~130gb at 4bits. If that can get under 90gb, it can fit in 128gb unified memory systems like my strix halo (though not sure if the format is even supported... yay rocm)
View on Reddit #74964686
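
The size figures traded in this thread can be sanity-checked with back-of-envelope arithmetic: raw weight storage is parameters times bits per weight divided by 8. A rough sketch using the 179B / INT4 numbers from the post title; the helper names are hypothetical, and the estimate ignores quantization scales, layers kept at higher precision, KV cache and runtime overhead, so real files come out somewhat larger.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8.0


def max_params_billion(budget_gb: float, bits_per_weight: float) -> float:
    """Largest parameter count (in billions) whose raw weights fit the budget."""
    return budget_gb * 8.0 / bits_per_weight


print(weight_size_gb(179, 4))     # 179B parameters at 4 bits per weight
print(max_params_billion(90, 4))  # parameter budget for 90 GB at 4 bits
```

The first call gives 89.5 GB, close to the ~92GB quoted in the title; the second shows that a 90 GB budget at 4 bits leaves room for roughly 180B parameters before any overhead.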

dtdisapointingresult@reddit

He should've done diverse benchmarks before uploading lobotomyslop if you ask me.
View on Reddit #74931123

Position_Emergency@reddit

In the land of the blind the one eyed man is king.
View on Reddit #74932609

Murgatroyd314@reddit

In the land of the blind, the one eyed man is in an asylum for his delusions of having a fifth sense.
View on Reddit #74950430

dtdisapointingresult@reddit

That's the nicest thing that's been said about me in months. 2026 off to a good start!
View on Reddit #74932765

Position_Emergency@reddit

Sorry to ruin your 2026, but OP is the one eyed King. The blind are the lobotomyslop uploaders that ignore my polite requests for benchmarks :)
View on Reddit #74933273

Phaelon74@reddit

Again, people quanting AWQs (W4A16) need to provide details on what they did to make sure all experts were activated during calibration. Until OP comes out and provides that, if you see this model act poorly, it's because the calibration data did not activate all experts and it's been partially lobotomized.
View on Reddit #74937846

One-Macaron6752@reddit

At minimum, a good disclosure normally includes:

• Calibration dataset description
• Number of tokens / sequences
• Observed expert routing frequencies
• Whether forced routing was used
• Whether rare experts were targeted

… this is / should be becoming best practice in papers & repos! ;)
View on Reddit #74961177

Kamal965@reddit

I mean, I agree in general that it's very frustrating to see AWQ quants that don't say what dataset, or domain, they used for calibration. But in this case, it is explicitly mentioned on the repo. The README.md shows the full steps on how to recreate that quant. The W4A16 calibration dataset used was [The Pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) and the REAP calibration dataset (and I think this is the more important one to know) was listed as ["glm47-reap-calibration-v2"](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2) which is a dataset on the same author's HF page. He has 4 different REAP calibration datasets there, interestingly enough... but there are no actual descriptions of what the datasets contain. You'd have to look through each one to see, welp.
View on Reddit #74941934

Phaelon74@reddit

Right, but by default, GLM does not have a modeling file in, say, LLM_Compressor. So if he first made the quant in llm_compressor and then reaped it, experts would be missing based on not being activated by his dataset, etc. That's more what I am alluding to. People doing AWQs need to explicitly say "And I did X, Y, Z, to make sure all experts were activated during dataset calibration."
View on Reddit #74942059

Kamal965@reddit

Wait, do you mean that, in the case of quantizing first and then pruning, if someone uses a subpar calibration dataset for quantization then the wrong experts might get pruned? Although the uploader explicitly says they pruned it first btw: https://preview.redd.it/4urxats535bg1.png?width=687&format=png&auto=webp&s=8b70c461c9e70625ede9246180075510463640ae
View on Reddit #74942431

Phaelon74@reddit

Here's what we know about AWQs right now:

1). Datasets matter immensely. All AWQ quants should be using specialized datasets, meant for what the model is meant for. Coding model, use a coding dataset, etc. (Using Ultra-chat or wikitext on a model meant for writing/RP or coding, we can see visible degradation in quant quality. I baked KLD and PPL into a version of vLLM and I can see degradation on the order of single-digit percentages.)

2). llm_compressor has modeling files that make sure, for MoEs, we activate all experts during dataset calibration. A GLM modeling file is not present in llm_compressor. I have a PR to add it, but what it means is that if a line from your dataset does not activate all experts, the experts it misses will disappear from the quant, which means you're losing intelligence.

TLDR: While the poster REAPed before they quanted, for the second-phase quanting we need confirmation that the AWQ method used either a modeling file or a loop within the main one_shot that activated all experts by force, instead of letting the dataset activate them.
View on Reddit #74942918

Kamal965@reddit

First of all, thank you very much for this explanation! I appreciate it. I didn't know llm_compressor can prune models. There's just one thing, and I'm wondering if you can verify: Based on a bit of research I just did, llm-compressor can prune models, and it contains AutoRound as one of multiple quantization backends/options. But AutoRound was used as a standalone quantization method here, and AutoRound doesn't prune. It's a weight-only PTQ method. I just reviewed their Github repo and couldn't find the word prune anywhere in the files or the README.md. See: https://github.com/intel/auto-round, so no experts could have been pruned during the AutoRound quantization, only during the REAP stage.
View on Reddit #74945649

Phaelon74@reddit

LLM_Compressor does not prune. LLM_Compressor only quants. AutoRound and AWQ all work with datasets. These datasets are used to quantize the model. With MoE models, not all experts are activated. Without activating all experts for each sample during the calibration and smoothing phases, intelligence will be lost. Don't get caught up on the pruning phase, it's irrelevant for what we're specifically talking about here. During quantization, you MUST run each sample through ALL experts to make sure the model is properly quantized. Today, llm_compressor does not do that for GLM, because it does not, by default, have a GLM modeling file that forces it to run a sample through all experts. See this link: https://www.reddit.com/r/LocalLLaMA/comments/1q2pons/comment/nxfnxyf/ All the OP needs to do is add an additional line in the AutoRound script to make sure it activates all experts during quantization.
View on Reddit #74947047
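
The concern raised here, that a calibration set may leave some experts untouched, can at least be measured before quantizing. The following is a toy sketch of counting top-k routing decisions per expert and listing the ones that never fire. It assumes you can obtain per-token router scores from your own tooling; the function names and numbers are hypothetical, and this is not the llm_compressor or AutoRound API.

```python
from collections import Counter


def top_k_experts(router_scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


def expert_coverage(token_scores: list[list[float]], num_experts: int, k: int):
    """Count calibration tokens routed to each expert.

    Returns (counts, silent), where silent lists experts that saw no tokens.
    """
    counts = Counter()
    for scores in token_scores:
        counts.update(top_k_experts(scores, k))
    silent = [e for e in range(num_experts) if counts[e] == 0]
    return counts, silent


# Made-up router scores for 3 tokens over 4 experts.
tokens = [
    [0.9, 0.1, 0.0, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.7, 0.1, 0.0],
]
counts, silent = expert_coverage(tokens, num_experts=4, k=2)
print(dict(counts), silent)
```

In this toy data experts 2 and 3 never receive a token, which is the situation a forced-routing pass during calibration is meant to avoid.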

Position_Emergency@reddit

"Do not be deceived: God cannot be mocked. A man quants what he reaps."
View on Reddit #74945123

Impressive_Chain6039@reddit

yeah: different dataset for different scope. Coding? Optimize REAP for coding with the right dataset.
View on Reddit #74943029

Position_Emergency@reddit

u/Maxious The quant_config looks like it defaulted to "pile-10k" for the AutoRound pass? Since you already did the hard work creating "glm47-reap-calibration-v2" to select the best experts, wouldn't it be better to reuse that dataset for quantization? Pile-10k probably won't trigger those specific code/agent experts you preserved, leaving them uncalibrated (Silent Expert problem). It should be a 1-line swap in the AutoRound script to fix.
View on Reddit #74940741

Kamal965@reddit

That's actually a great question! I'm curious to know about that too. As far as I can tell, using the same calibration dataset for both pruning and quantization logically makes sense... am I missing something that makes it not a good idea?
View on Reddit #74942025

Velocita84@reddit

Ok, but REAP'd for what? It's my understanding that REAP prunes experts based on how often they're activated during inference of a calibration set, so what task(s) was it calibrated for?
View on Reddit #74939962
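
The understanding described above, keeping the experts a calibration set uses most, can be sketched as a simple ranking. This is a toy frequency-based selection with made-up counts; the actual REAP criterion may weigh more than raw activation counts, and `experts_to_keep` is a hypothetical helper rather than anything from the REAP code.

```python
def experts_to_keep(activation_counts: dict[int, int], keep_fraction: float) -> list[int]:
    """Keep the most frequently activated experts (ties broken by lower index)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(activation_counts) * keep_fraction))
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    return sorted(ranked[:n_keep])


# Made-up activation counts for 8 experts on some calibration set.
counts = {0: 120, 1: 5, 2: 80, 3: 0, 4: 45, 5: 45, 6: 2, 7: 300}
print(experts_to_keep(counts, keep_fraction=0.5))
```

Swapping in counts from a different calibration set changes which experts survive, which is why the choice of dataset matters so much for what the pruned model can still do.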

Kamal965@reddit

The W4A16 calibration dataset used was [The Pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) and the REAP calibration dataset was listed as ["glm47-reap-calibration-v2"](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2) which is a dataset on the same author's HF page. Idk what's actually in the dataset because there's no description and I haven't read through it.
View on Reddit #74942102

Murgatroyd314@reddit

A quick glance at a few bits of the calibration data set finds a lot of programming, several logic/math puzzles, and a bit of trivia.
View on Reddit #74949852

Enottin@reddit

RemindMe! 7 days
View on Reddit #74948615

Enottin@reddit

RemindMe! 1 day
View on Reddit #74948731

Enottin@reddit

!RemindMe 7 days
View on Reddit #74948543

LegacyRemaster@reddit

fit on 6000 96g ... let me try
View on Reddit #74937482

Dany0@reddit

Just barely doesn't fit on 64gb ram + 32gb vram :( Q3_KS managed to load once but OOM'd immediately during prompt processing
View on Reddit #74929093

ApartmentEither4838@reddit

Can this work on an A100 80GB?
View on Reddit #74936214

Dany0@reddit

I wonder if one could REAP + distill with the larger model to get better results
View on Reddit #74929154

Odd-Ordinary-5922@reddit

can someone try pruning gpt oss 120b? Ik there is already one but I think he messed up something. Much appreciated
View on Reddit #74932523

Steus_au@reddit

what’s the best way to test/compare it to the full-size one?
View on Reddit #74927250
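
One common way to compare a pruned or quantized model against the full one is to feed both the same text and measure the KL divergence between their next-token distributions, alongside task benchmarks. A minimal sketch for a single position, assuming you can extract raw logits from both models; the logits below are made up and the helper names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits: list[float], test_logits: list[float]) -> float:
    """KL(ref || test) in nats for one token position.

    Toy version: assumes the test distribution never underflows to zero.
    """
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up logits: full model vs. a pruned model that lost its preference.
full_model = [2.0, 0.0]
pruned_model = [0.0, 0.0]
print(kl_divergence(full_model, pruned_model))
```

Averaging this over many positions gives a single number where 0 means the smaller model reproduces the full model's distribution exactly and larger values mean more drift.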

DesignerTruth9054@reddit

Cool. Excited to try it out.
View on Reddit #74926426