GLM-4.7-REAP-50-W4A16: 50% Expert-Pruned + INT4 Quantized GLM-4 (179B params, ~92GB)

[-]

Revolutionary-Tip821@reddit

Using

    vllm serve /home/xxxx/Docker/xxx/GLM-4.7-REAP-40-W4A16 \
      --served-model-name local/GLM-4.7-REAP-local \
      --host 0.0.0.0 --port 8888 \
      --tensor-parallel-size 2 --pipeline-parallel-size 3 \
      --quantization gptq \
      --max-model-len 14000 \
      --gpu-memory-utilization 0.96 \
      --block-size 32 \
      --max-num-seqs 8 \
      --max-num-batched-tokens 8192 \
      --enable-expert-parallel \
      --enable-prefix-caching \
      --enable-chunked-prefill \
      --disable-custom-all-reduce \
      --disable-log-requests \
      --tool-call-parser glm47 \
      --reasoning-parser glm45 \
      --enable-auto-tool-choice \
      --trust-remote-code

on 6 RTX 4090s, it starts generating and then falls into repeating the same word endlessly. Has anyone else had the same experience?

Reply

[-]

Sero_x@reddit

The repeating is a pipeline bug that happens with this model.

Reply

[-]

Revolutionary-Tip821@reddit

I also tried with --tensor-parallel-size 4, but it still gets stuck repeating the same word, so this model is not usable in this state.

Reply

[-]

Sero_x@reddit

Brother in Christ, I have used all the models for the last 24 hours to code, do deep research, etc. Your inference layer is busted.

Reply

[-]

One-Macaron6752@reddit

Care to comment? I am using exactly the vLLM usage / invocation pattern from your model card and I am also stuck with it repeating the same word! :(

Reply

[-]

Phaelon74@reddit

Why do pipelines? Just use TP 6 and rock and roll. Additionally, with the reasoning parser I have seen what you are seeing. I don't use it and only use expert-parallel.

Reply

[-]

Hisma@reddit

You can't do TP on 6 GPUs. It needs to be a power of 2; 2/4/8 GPUs is typically what's used for TP.

Reply

[-]

Phaelon74@reddit

You can do TP on ANY number of GPUs, but vLLM and SGLang don't want to do the hard math to make it work, so you can't on their stuff. EXL3 and TabbyAPI can do TP6.
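
A minimal sketch of the layout arithmetic being argued about here. It assumes two common constraints: TP × PP has to equal the number of GPUs in use, and the tensor-parallel size has to divide the attention head count evenly. `check_layout` is an illustrative helper, not a vLLM API, and the head count is a hypothetical placeholder, not GLM-4.7's real config.

```python
def check_layout(num_gpus: int, tp: int, pp: int, num_heads: int) -> list:
    """Return a list of problems with a TP/PP layout (empty list means it looks fine)."""
    problems = []
    # Each pipeline stage holds one tensor-parallel group, so the product must match.
    if tp * pp != num_gpus:
        problems.append(f"TP*PP = {tp * pp}, but {num_gpus} GPUs are in use")
    # Attention heads are split across the tensor-parallel ranks.
    if num_heads % tp != 0:
        problems.append(f"{num_heads} attention heads are not divisible by TP={tp}")
    return problems


HYPOTHETICAL_HEADS = 64  # placeholder value, NOT taken from the GLM-4.7 config

# (num_gpus, tp, pp) combinations mentioned in the thread
for gpus, tp, pp in [(6, 2, 3), (6, 6, 1), (4, 4, 1)]:
    print(gpus, tp, pp, check_layout(gpus, tp, pp, HYPOTHETICAL_HEADS) or "ok")
```

Whether TP=6 works in practice depends on the real head counts of the model and on what the inference engine supports, so treat this only as a quick sanity check.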

Reply

[-]

fungnoth@reddit

I'm curious about the low VRAM + OK system RAM situation with moe offloading

Reply

[-]

jhnnassky@reddit

Do you already have good ones?

Reply

[-]

fungnoth@reddit

I sometimes use the GLM Air REAP, with 10 layers on GPU and the MoE weights of 38 layers on CPU. Usable with 12GB VRAM and 64GB RAM.

Reply

[-]

a_beautiful_rhind@reddit

You can run 2.0bpw exl3 GLM and it's around 90gb. A comparison here would be interesting. When I tried the previous 4.6 REAPs, about 3 of them, the EXL was subjectively better.

> Calibrated on code/agentic tasks; may have reduced performance on other domains

All those other REAPs forgot how to talk outside such domains. It's interesting how nobody has deviated from the codeslop datasets Cerebras used. My theory is that a more rounded English-only dataset would preserve much more performance. Then someone could do Chinese-only, etc.

Reply

[-]

One-Macaron6752@reddit

I can second your opinion. I have also tried 2.65bpw exl3 quants and they felt worlds better than the REAP. For me, the REAP version was: 1) full of hallucinations in places I'd never have expected them, 2) full of Chinese & Arabic characters dropping in almost everywhere…

Reply

[-]

Sero_x@reddit

These sound like inference layer errors to me.

Reply

[-]

projectmus3@reddit

You’re the person who does roleplay with LLMs and talks to fictional characters, right? Yeah, maybe you should create a calibration dataset for roleplay and use that to REAP instead. The REAP models from Cerebras focus on coding, tool calling and agentic workloads, and they’ve been doing amazing for me.

Reply

[-]

a_beautiful_rhind@reddit

Really, the only thing stopping me is the massive download. I've heard mixed results from people coding with it though, and if you do a perplexity test, it's usually double digit. The REAPs I tried would forget who presidents were and other basic facts. That left me a bit skeptical about investing big effort into it.
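
For anyone unfamiliar with the perplexity test mentioned above, here is a minimal sketch of the arithmetic: perplexity is the exponential of the mean negative log-likelihood per token. The log-probabilities below are made up for illustration; in a real test they would come from running the model over a held-out text.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), using natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up numbers for illustration only, not measurements of any model.
confident = [-0.5, -1.0, -0.2, -0.8]
unsure = [-2.5, -3.0, -2.0, -2.7]

print(round(perplexity(confident), 2))
print(round(perplexity(unsure), 2))
```

Lower is better; a "double digit" result means the model is, on average, about as uncertain as picking among ten or more equally likely tokens.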

Reply

[-]

Guilty_Nothing_2858@reddit

How is the performance? Faster, but with a poorer satisfaction rate? I saw a lot of comments from the Chinese dev community saying that the GLM-4.7 cloud service runs a quantized version and that the answers are not good.

Reply

[-]

LegacyRemaster@reddit

https://preview.redd.it/nse8fr8mzabg1.png?width=2013&format=png&auto=webp&s=4d86c31bb4db3967d06dc05a7bf3a589395fc70b

Super quick test: glm-4.7-reap-40p IQ3_S, 94.57 GB. Fits in 96 GB with 4k context. Will test more.

Reply

[-]

Goghor@reddit

!remindme 7 days

Reply

[-]

Revolutionalredstone@reddit

Next please do nanbeige, this thing is a beast but needs prune + int4! https://old.reddit.com/r/LocalLLaMA/comments/1q2p2wa/nanbeige4_is_an_incredible_model_for_running/

Reply

[-]

LocoMod@reddit

It's a 3B model that fits on a lemon. What's the point?

Reply

[-]

Revolutionalredstone@reddit

You'd be surprised! I've got plenty of portable devices with 2GB vram and the diff between 3B partial and 2B fully offloaded is HUGE. Not so much about being ABLE to run, but being able to run FAST!

Reply

[-]

LocoMod@reddit

Fair. I ordered the new Arduino recently. I wonder if a quant would run on that.

Reply

[-]

SlowFail2433@reddit

Edge AI is a thing, often very small chips

Reply

[-]

thejoyofcraig@reddit

Nanbeige is a 3b model. What are you hoping to prune it down to??

Reply

[-]

Revolutionalredstone@reddit

TBH I'd take 500M and 250M param versions with very big excitement! The other models pruned to this size, like Gemma and Granite, were absolute bangers! And this one has a lot more junk in the trunk, so to speak. Ultra nano models can be VERY useful even if they can barely speak ;D

Reply

[-]

SlowFail2433@reddit

If you go small enough it stops getting much faster, especially at high batch sizes.

Reply

[-]

Revolutionalredstone@reddit

Agreed, once you're fully offloaded to GPU you're usually good to go! The other advantage of ultra small models is model load-up time. It's pretty glorious when your task can be done with a TINY model, so the whole process from starting the program to getting a prompt is short! Ta

Reply

[-]

SlowFail2433@reddit

Yes true I love using 7B and below on any hardware for that fast load

Reply

[-]

sampdoria_supporter@reddit

I am completely ignorant of this model and REAP as a method but I'm hoping to hell this means running it on strix halo is possible

Reply

[-]

fallingdowndizzyvr@reddit

You can run 4.7 on Strix Halo without this.

Reply

[-]

GreatAlmonds@reddit

How? Unless you're running 1bit quants

Reply

[-]

jacek2023@reddit

I need Q3, anyone working on GGUF?

Reply

[-]

Kamal965@reddit

He just finished uploading them: [https://huggingface.co/0xSero/GLM-4.7-REAP-50-GGUF](https://huggingface.co/0xSero/GLM-4.7-REAP-50-GGUF)

Reply

[-]

fallingdowndizzyvr@reddit

"404 Sorry, we can't find the page you are looking for."

Reply

[-]

jacek2023@reddit

thank you!!!

Reply

[-]

Kamal965@reddit

Update: https://preview.redd.it/4yd0psb205bg1.png?width=1381&format=png&auto=webp&s=1e10c322d224366331326dfaa7e9a7fb77b55979 The math ain't mathing, right?

Reply

[-]

Kamal965@reddit

Np! By "just now," I literally mean just now. I refreshed the page 5 minutes ago and the repo was empty, lol. So maybe wait a few more minutes because he might be uploading more!

Reply

[-]

noctrex@reddit

Let's try I guess

Reply

[-]

Position_Emergency@reddit

Can see on the Huggingface page you're in the process of doing benchmarks 💯 Will be interested to see the results! Have you considered doing a similar size version of MiniMax M2.1? (and therefore a less aggressive REAP as it is a 220B model)

Reply

[-]

SillyLilBear@reddit

M2.1 fits in similar RAM without reaping, just at 4-bit.

Reply

[-]

colin_colout@reddit

Minimax models are ~130gb at 4bits. If that can get under 90gb, it can fit in 128gb unified memory systems like my strix halo (though not sure if the format is even supported... yay rocm)
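
A rough back-of-the-envelope sketch for these size estimates: weight size is approximately parameter count × bits per weight ÷ 8. This ignores quantization scales, layers kept at higher precision, and KV cache, so real files run larger. The parameter counts used are the figures quoted in this thread (179B from the post title, 220B for MiniMax M2.1), not verified numbers.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def max_bits_per_weight(params_billion: float, budget_gb: float) -> float:
    """Largest average bits-per-weight that fits the given weight budget."""
    return budget_gb * 8 / params_billion


# 179B parameters at 4 bits per weight (figures from the post title)
print(weight_size_gb(179, 4))

# Average bits per weight needed to squeeze a 220B model under 90 GB
print(round(max_bits_per_weight(220, 90), 2))
```

The gap between such an estimate and an actual download size is presumably the overhead listed above.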

Reply

[-]

dtdisapointingresult@reddit

He should've done diverse benchmarks before uploading lobotomyslop if you ask me.

Reply

[-]

Position_Emergency@reddit

In the land of the blind the one eyed man is king.

Reply

[-]

Murgatroyd314@reddit

In the land of the blind, the one eyed man is in an asylum for his delusions of having a fifth sense.

Reply

[-]

dtdisapointingresult@reddit

That's the nicest thing that's been said about me in months. 2026 off to a good start!

Reply

[-]

Position_Emergency@reddit

Sorry to ruin your 2026, but OP is the one eyed King. The blind are the lobotomyslop uploaders that ignore my polite requests for benchmarks :)

Reply

[-]

Phaelon74@reddit

Again, people quanting AWQs (W4A16) need to provide details on what they did to make sure all experts were activated during calibration. Until OP comes out and provides that, if you see this model act poorly, it's because the calibration data did not activate all experts and it's been partially lobotomized.

Reply

[-]

One-Macaron6752@reddit

At minimum, a good disclosure normally includes:

• Calibration dataset description
• Number of tokens / sequences
• Observed expert routing frequencies
• Whether forced routing was used
• Whether rare experts were targeted

… this is / should be becoming best practice in papers & repos! ;)
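
As an illustration of the "observed expert routing frequencies" item, here is a minimal sketch of how such a report could be computed. It assumes you have already logged, for each calibration token, the indices of the experts the router selected; `routing_report` and the toy data are hypothetical, not part of any quantization tool.

```python
from collections import Counter


def routing_report(routed_expert_ids, num_experts):
    """Return (share of routing slots per expert, list of never-selected experts)."""
    counts = Counter(e for token_experts in routed_expert_ids for e in token_experts)
    total = sum(counts.values())
    freqs = {e: (counts[e] / total if total else 0.0) for e in range(num_experts)}
    silent = [e for e in range(num_experts) if counts[e] == 0]
    return freqs, silent


# Toy log: 4 tokens, top-2 routing, 4 experts; expert 3 is never selected.
toy_log = [(0, 1), (0, 2), (1, 0), (0, 1)]
freqs, silent = routing_report(toy_log, num_experts=4)
print(freqs)
print("never activated:", silent)
```

A published table like this, per layer, would let readers see at a glance which experts the calibration data never touched.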

Reply

[-]

Kamal965@reddit

I mean, I agree in general that it's very frustrating to see AWQ quants that don't say what dataset, or domain, they used for calibration. But in this case, it is explicitly mentioned on the repo. The README.md shows the full steps on how to recreate that quant. The W4A16 calibration dataset used was [The Pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) and the REAP calibration dataset (and I think this is the more important one to know) was listed as ["glm47-reap-calibration-v2"](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2), which is a dataset on the same author's HF page. He has 4 different REAP calibration datasets there, interestingly enough... but there are no actual descriptions of what the datasets contain. You'd have to look through each one to see, welp.

Reply

[-]

Phaelon74@reddit

Right, but by default, GLM does not have a modeling file in, say, LLM_Compressor. So if he first made the quant in llm_compressor and then reaped it, experts would be missing based on not being activated by his dataset, etc. That's more what I am alluding to. People doing AWQs need to explicitly say "And I did X, Y, Z to make sure all experts were activated during dataset calibration."

Reply

[-]

Kamal965@reddit

Wait, do you mean that, in the case of quantizing first and then pruning, if someone uses a subpar calibration dataset for quantization then the wrong experts might get pruned? Although the uploader explicitly says they pruned it first btw: https://preview.redd.it/4urxats535bg1.png?width=687&format=png&auto=webp&s=8b70c461c9e70625ede9246180075510463640ae

Reply

[-]

Phaelon74@reddit

Here's what we know about AWQs right now:

1. Datasets matter immensely. All AWQ quants should be using specialized datasets, meant for what the model is meant for. Coding model, use a coding dataset, etc. (Using Ultra-chat or wikitext on a model meant for writing/RP or coding, we can see visible degradation in quant quality. I baked KLD and PPL into a version of vLLM and I can see degradation on the order of single-digit percentages.)

2. llm_compressor has modeling files that make sure, for MoEs, we activate all experts during dataset calibration. A GLM modeling file is not present in llm_compressor. I have a PR to add it, but what it means is that if a line from your dataset does not activate all experts, the experts it misses will disappear from the quant, which means you're losing intelligence.

TL;DR: While the poster reaped before they quanted, for the second phase (quanting) we need confirmation that the AWQ method used either a modeling file or a loop within the main one_shot that activated all experts by force, instead of letting the dataset activate them.
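
For context on the KLD measurement mentioned in point 1, here is a minimal sketch of the metric itself: the KL divergence between the reference model's next-token distribution and the quantized model's, averaged over positions. The logits are made-up toy values and the helper names are illustrative; this is not the vLLM patch described above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(ref_batch, quant_batch):
    """Average KL divergence over a batch of token positions."""
    pairs = list(zip(ref_batch, quant_batch))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)


# Toy logits over a 4-token vocabulary at two positions (illustrative only).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quant = [[1.8, 1.1, 0.3, -0.9], [0.4, 0.9, 2.5, 0.1]]
print(mean_kld(ref, quant))
```

A value of 0 means the quantized model reproduces the reference distribution exactly; larger values mean more drift.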

Reply

[-]

Kamal965@reddit

First of all, thank you very much for this explanation! I appreciate it. I didn't know llm_compressor can prune models. There's just one thing, and I'm wondering if you can verify: Based on a bit of research I just did, llm-compressor can prune models, and it contains AutoRound as one of multiple quantization backends/options. But AutoRound was used as a standalone quantization method here, and AutoRound doesn't prune. It's a weight-only PTQ method. I just reviewed their GitHub repo and couldn't find the word "prune" anywhere in the files or the README.md. See: [https://github.com/intel/auto-round](https://github.com/intel/auto-round). So no experts could have been pruned during the AutoRound quantization, only during the REAP stage.

Reply

[-]

Phaelon74@reddit

LLM_Compressor does not prune. LLM_Compressor only quants. Auto-Round and AWQ all work with datasets. These datasets are used to quantize the model. With MoE models, not all experts are activated. Without activating all experts for each sample during the calibration and smoothing phases, intelligence will be lost. Don't get caught up on the pruning phase, it's irrelevant for what we're specifically talking about here. During quantization, you MUST run each sample through ALL experts to make sure the model is properly quantized. Today, llm_compressor does not do that for GLM, because it does not, by default, have a GLM modeling file that forces it to run a sample through all experts. See this link: [https://www.reddit.com/r/LocalLLaMA/comments/1q2pons/comment/nxfnxyf/](https://www.reddit.com/r/LocalLLaMA/comments/1q2pons/comment/nxfnxyf/) All the OP needs to do is add an additional line in the AutoRound script to make sure it activates all experts during quantization.
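
The idea of running every sample through all experts can be sketched with a toy MoE layer. This is an assumption-laden illustration, not llm_compressor or AutoRound code: `ToyExpert`, `moe_forward`, and the `calibrate_all` flag are hypothetical names, and `tokens_seen` stands in for whatever calibration statistics a real quantizer collects.

```python
class ToyExpert:
    """A stand-in expert that scales its input and counts the tokens it sees."""

    def __init__(self, scale):
        self.scale = scale
        self.tokens_seen = 0  # stands in for a calibration observer

    def __call__(self, x):
        self.tokens_seen += 1
        return self.scale * x


def moe_forward(tokens, routes, experts, calibrate_all=False):
    """routes[i] maps expert index -> gate weight for token i."""
    outputs = []
    for x, route in zip(tokens, routes):
        y = 0.0
        for idx, expert in enumerate(experts):
            gate = route.get(idx)
            if gate is None and not calibrate_all:
                continue  # normal routing: unselected experts never see the token
            out = expert(x)  # in calibrate_all mode every expert observes the token
            if gate is not None:
                y += gate * out  # only routed experts contribute to the output
        outputs.append(y)
    return outputs


def make_experts():
    return [ToyExpert(1.0), ToyExpert(2.0), ToyExpert(3.0)]


tokens = [1.0, 2.0, 3.0]
routes = [{0: 0.7, 1: 0.3}, {0: 0.6, 1: 0.4}, {0: 0.5, 1: 0.5}]  # expert 2 is never routed

normal = make_experts()
out_normal = moe_forward(tokens, routes, normal)
forced = make_experts()
out_forced = moe_forward(tokens, routes, forced, calibrate_all=True)

print([e.tokens_seen for e in normal])
print([e.tokens_seen for e in forced])
print(out_normal == out_forced)
```

The design point is that the layer output stays the same in both modes; only the statistics gathered for the otherwise silent expert change.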

Reply

[-]

Position_Emergency@reddit

"Do not be deceived: God cannot be mocked. A man quants what he reaps."

Reply

[-]

Impressive_Chain6039@reddit

Yeah: a different dataset for a different scope. Coding? Optimize REAP for coding with the right dataset.

Reply

[-]

Position_Emergency@reddit

u/Maxious The quant_config looks like it defaulted to "pile-10k" for the AutoRound pass? Since you already did the hard work creating "glm47-reap-calibration-v2" to select the best experts, wouldn't it be better to reuse that dataset for quantization? Pile-10k probably won't trigger those specific code/agent experts you preserved, leaving them uncalibrated (Silent Expert problem). It should be a 1-line swap in the AutoRound script to fix.

Reply

[-]

Kamal965@reddit

That's actually a great question! I'm curious to know about that too. As far as I can tell, using the same calibration dataset for both pruning and quantization logically makes sense... am I missing something that makes it not a good idea?

Reply

[-]

Velocita84@reddit

Ok, but REAP'd for what? It's my understanding that REAP prunes experts based on how often they're activated during inference of a calibration set, so what task(s) was it calibrated for?
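
To make the description in this comment concrete, here is a minimal sketch of pruning by activation count: keep the half of the experts that the calibration set selected most often. This follows the commenter's simplified description only; it is not the actual REAP implementation, whose scoring criterion may differ, and the counts are made up.

```python
def experts_to_keep(activation_counts, keep_fraction=0.5):
    """Return sorted indices of the most frequently activated experts."""
    n_keep = max(1, int(len(activation_counts) * keep_fraction))
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda e: activation_counts[e],
        reverse=True,
    )
    return sorted(ranked[:n_keep])


# Made-up activation counts for 8 experts in one layer.
counts = [120, 3, 57, 0, 98, 41, 7, 66]
print(experts_to_keep(counts))
```

Because the ranking depends entirely on the calibration data, a coding-heavy set would keep different experts than a general-chat set, which is the concern raised in the question.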

Reply

[-]

Kamal965@reddit

The W4A16 calibration dataset used was [The Pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) and the REAP calibration dataset was listed as ["glm47-reap-calibration-v2"](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2) which is a dataset on the same author's HF page. Idk what's actually in the dataset because there's no description and I haven't read through it.

Reply

[-]

Murgatroyd314@reddit

A quick glance at a few bits of the calibration data set finds a lot of programming, several logic/math puzzles, and a bit of trivia.

Reply

[-]

Enottin@reddit

RemindMe! 7 days

Reply

[-]

Enottin@reddit

RemindMe! 1 day

Reply

[-]

Enottin@reddit

!RemindMe 7 days

Reply

[-]

LegacyRemaster@reddit

fit on 6000 96g ... let me try

Reply

[-]

Dany0@reddit

Just barely doesn't fit on 64gb ram + 32gb vram :( Q3_KS managed to load once but OOM'd immediately during prompt processing

Reply

[-]

ApartmentEither4838@reddit

Can this work on an A100 80GB?

Reply

[-]

Dany0@reddit

I wonder if one could REAP + distill with the larger model to get better results

Reply

[-]

Odd-Ordinary-5922@reddit

Can someone try pruning gpt oss 120b? I know there is already one, but I think he messed something up. Much appreciated.

Reply

[-]

Steus_au@reddit

What’s the best way to test/compare it to the full-size one?

Reply

[-]

DesignerTruth9054@reddit

Cool. Excited to try out

Reply

74 Comments