Why do pipelines? Just run TP6 and rock and roll. As for the reasoning parser, I have seen what you are seeing. I don't use it and only use expert-parallel.
You can do TP on ANY number of GPUs, but vLLM and SGLang don't want to do the hard math to make it work, so you can't on their stuff.
EXL3 and tabby api can do TP6.
You can run 2.0bpw exl3 GLM and it's around 90gb. Comparison here would be interesting.
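For anyone wanting to sanity-check the "around 90gb" figure, here is a rough back-of-the-envelope sketch. It assumes the unpruned model has roughly 355B total parameters (my assumption, borrowed from the GLM-4.5/4.6 family, not stated in this thread) and ignores embeddings kept at higher precision, KV cache and runtime overhead. `quant_size_gb` is a made-up helper name.

```python
def quant_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB: params * bits / 8 bytes per param."""
    return total_params * bits_per_weight / 8 / 1e9

# ASSUMPTION: ~355e9 total parameters for the unpruned GLM model.
GLM_PARAMS = 355e9

size_2bpw = quant_size_gb(GLM_PARAMS, 2.0)
size_265bpw = quant_size_gb(GLM_PARAMS, 2.65)

print(f"2.00 bpw: {size_2bpw:.1f} GB")
print(f"2.65 bpw: {size_265bpw:.1f} GB")
```

Under that assumption the 2.0bpw estimate lands just under 90 GB, which lines up with the number quoted above. Real files will differ a bit because not every tensor is stored at the same bit width.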
When I tried previous 4.6 REAP, about 3 of them, the EXL was better subjectively.
>Calibrated on code/agentic tasks; may have reduced performance on other domains
All those other REAPs forgot how to talk outside such domains. It's interesting how nobody has deviated from the codeslop datasets Cerebras used. My theory is a more rounded English-only dataset would preserve much more performance. Then someone could do Chinese-only, etc.
I can second your opinion. I have also tried 2.65bpw exl3 quants and they felt worlds better than the REAP. For me, the REAP version was: 1) full of hallucinations in places I’d never expected them 2) full of Chinese & Arabic characters dropping almost everywhere…
You’re the person who does roleplay with LLMs and talk to fictional characters right? Yeah maybe you should create a calibration dataset for roleplay and use that to REAP instead.
The REAP models from Cerebras focus on coding, tool calling and agentic workloads, and they’ve been doing amazing for me.
Really only thing stopping me is the massive download.
I've heard mixed results from people coding with it tho and if you do a perplexity test, usually it's double digit.
The REAPS I tried would forget who presidents were and other basic facts. Left me a bit skeptical to invest big effort into it.
I want to know how the performance is. Faster, but with a poor satisfaction rate? I saw a lot of comments from the Chinese dev community saying the GLM4.7 cloud version is quantized and the answers are not good.
https://preview.redd.it/nse8fr8mzabg1.png?width=2013&format=png&auto=webp&s=4d86c31bb4db3967d06dc05a7bf3a589395fc70b
Super quick test. glm-4.7-reap-40p IQ3\_S - 94.57 gb. Fit on 96gb with 4k context. Will test more.
Next please do nanbeige, this thing is a beast but needs prune + int4!
https://old.reddit.com/r/LocalLLaMA/comments/1q2p2wa/nanbeige4_is_an_incredible_model_for_running/
You'd be surprised! I've got plenty of portable devices with 2GB vram and the diff between 3B partial and 2B fully offloaded is HUGE.
Not so much about being ABLE to run, but being able to run FAST!
TBH I'd take 500M and 250M param versions with very big excitement!
The other models pruned to this size, like Gemma and Granite, were absolute bangers!
And this one has a lot more junk in the trunk per se.
Ultra nano models can be VERY useful even if they can barely speak ;D
Agreed, once you're fully offloaded to GPU you're usually good to go!
The other advantage of ultra small models is model load-up time.
It's pretty glorious when your task can be done with a TINY model so the whole process from starting program to getting prompt is short !
ta
Update:
https://preview.redd.it/4yd0psb205bg1.png?width=1381&format=png&auto=webp&s=1e10c322d224366331326dfaa7e9a7fb77b55979
The math ain't mathing, right?
Np! By "just now," I literally mean just now. I refreshed the page 5 minutes ago and the repo was empty, lol. So maybe wait a few more minutes because he might be uploading more!
Can see on the Huggingface page you're in the process of doing benchmarks 💯
Will be interested to see the results!
Have you considered doing a similar size version of MiniMax M2.1? (and therefore a less aggressive REAP as it is a 220B model)
Minimax models are ~130gb at 4bits. If that can get under 90gb, it can fit in 128gb unified memory systems like my strix halo (though not sure if the format is even supported... yay rocm)
Again, people quanting AWQs (W4A16) need to provide details on what they did to make sure all experts were activated during calibration. Until OP comes out and provides that, if you see this model act poorly, it's because the calibration data did not activate all experts and it's been partially lobotomized.
At minimum, a good disclosure normally includes:
• Calibration dataset description
• Number of tokens / sequences
• Observed expert routing frequencies
• Whether forced routing was used
• Whether rare experts were targeted
… this is / should be becoming best practice in papers & repos! ;)
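The "observed expert routing frequencies" item in the list above could be collected with something along these lines. This is a toy sketch, not anyone's actual tooling: it assumes you can dump per-token router logits from the model, and `top_k_experts` and `routing_report` are hypothetical helper names.

```python
from collections import Counter

def top_k_experts(router_logits, k):
    """Return the indices of the k largest router logits for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]

def routing_report(per_token_logits, num_experts, k):
    """Count how often each expert is selected over a calibration set."""
    counts = Counter()
    for logits in per_token_logits:
        counts.update(top_k_experts(logits, k))
    total = len(per_token_logits) * k
    freqs = {e: counts[e] / total for e in range(num_experts)}
    # Experts that no calibration token was ever routed to.
    silent = [e for e in range(num_experts) if counts[e] == 0]
    return freqs, silent

# Toy data (made up): 3 tokens, 4 experts, top-2 routing.
toy_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 0.9, -2.0],
    [0.3, 2.2, 1.1, -0.5],
]
freqs, silent = routing_report(toy_logits, num_experts=4, k=2)
print("routing frequencies:", freqs)
print("never activated:", silent)
```

Publishing the `silent` list alongside the calibration dataset description would answer most of the disclosure questions in one go.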
I mean, I agree in general that it's very frustrating to see AWQ quants that don't say what dataset, or domain, they used for calibration. But in this case, it is explicitly mentioned on the repo. The README.md shows the full steps on how to recreate that quant. The W4A16 calibration dataset used was [The Pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) and the REAP calibration dataset (and I think this is the more important one to know) was listed as ["glm47-reap-calibration-v2"](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2) which is a dataset on the same author's HF page. He has 4 different REAP calibration datasets there, interestingly enough... but there are no actual descriptions of what the datasets contain. You'd have to look through each one to see, welp.
Right, but by default, GLM does not have a modeling file in, say, LLM\_Compressor. So if he first made the quant in llm\_compressor and then reaped it, experts would be missing based on not being activated by his dataset, etc. That's more what I am alluding to. People doing AWQs need to explicitly say "And I did X, Y, Z, to make sure all experts were activated during dataset calibration."
Wait, do you mean that, in the case of quantizing first and then pruning, if someone uses a subpar calibration dataset for quantization then the wrong experts might get pruned? Although the uploader explicitly says they pruned it first btw:
https://preview.redd.it/4urxats535bg1.png?width=687&format=png&auto=webp&s=8b70c461c9e70625ede9246180075510463640ae
Here's what we know about AWQs right now:
1). Datasets matter immensely. All AWQ quants should be using specialized datasets, meant for what the model is meant for. Coding model, use a coding dataset, etc. (Using Ultra-chat or wikitext on a model meant for writing/RP or coding, we can see visible degradation in quant quality. I baked KLD and PPL into a version of vLLM and I can see degradation on the order of single-digit percentages.)
2). llm\_compressor has modeling files that make sure, for MoEs, we activate all experts during dataset calibration. A GLM modeling file is not present in llm\_compressor. I have a PR to add it, but what it means is that if your dataset samples do not activate all experts, the experts that never fire never get calibrated and effectively disappear from the quant, which means you're losing intelligence.
TL;DR: While the poster reaped before they quanted, in the second phase (quanting) we need confirmation that the AWQ quanting method used either a modeling file or a loop within the main one\_shot that activated all experts by force, instead of letting the dataset activate them.
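For readers who haven't used KLD and PPL to compare a quant against the original model, here is a minimal pure-Python sketch of both metrics. It is not the vLLM patch mentioned above; the logits are made-up toy numbers and the function names are my own. It assumes you can get logits for the same token positions from both models.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for one token position, in nats."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perplexity(logits_per_position, target_ids):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        nll -= math.log(softmax(logits)[target])
    return math.exp(nll / len(target_ids))

# Toy logits (made up): 2 positions, vocabulary of 3 tokens.
ref_logits = [[2.0, 0.5, 0.1], [0.2, 1.7, 0.3]]
quant_logits = [[1.8, 0.7, 0.1], [0.4, 1.5, 0.3]]
targets = [0, 1]

mean_kld = sum(kl_divergence(r, q)
               for r, q in zip(ref_logits, quant_logits)) / len(targets)
ppl_ref = perplexity(ref_logits, targets)
ppl_quant = perplexity(quant_logits, targets)
print(f"mean KLD: {mean_kld:.4f} nats")
print(f"PPL ref: {ppl_ref:.3f}  PPL quant: {ppl_quant:.3f}")
```

KLD compares the full output distributions token by token, so it tends to pick up quantization damage that a small PPL change can hide.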
First of all, thank you very much for this explanation! I appreciate it. I didn't know llm\_compressor can prune models. There's just one thing, and I'm wondering if you can verify: Based on a bit of research I just did, llm-compressor can prune models, and it contains AutoRound as one of multiple quantization backends/options. But AutoRound was used as a standalone quantization method here, and AutoRound doesn't prune. It's a weight-only PTQ method. I just reviewed their Github repo and couldn't find the word prune anywhere in the files or the README.md. See: [https://github.com/intel/auto-round](https://github.com/intel/auto-round) \- so no experts could have been pruned during the AutoRound quantization, only during the REAP stage.
LLM\_Compressor does not prune. LLM\_Compressor only quants. Auto-Round, AWQ, all work with datasets. These datasets are used to quantize the model. With MoE models, not all experts are activated. Without activating all experts for each sample during the calibration and smoothing phases, intelligence will be lost.
Don't get caught up on the pruning phase, it's irrelevant for what we're specifically talking about here. During quantization, you MUST run each sample through ALL experts to make sure the model is properly quantized. Today, llm\_compressor does not do that for GLM, because it does not, by default, have a GLM modeling file that forces it to run a sample through all experts.
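To make "run each sample through ALL experts" concrete, here is a toy sketch of the idea. It is not llm\_compressor's or AutoRound's actual code: `ToyMoE` and the `calibrate_all_experts` flag are hypothetical names, and the experts are scalar multipliers so the example stays self-contained. The point it illustrates is that every expert gets to observe every calibration token, while the layer output is still mixed from only the top-k routed experts.

```python
import math

class ToyMoE:
    def __init__(self, expert_weights, top_k):
        self.expert_weights = expert_weights          # one scalar "expert" each
        self.top_k = top_k
        self.seen_tokens = [0] * len(expert_weights)  # stand-in for calibration observers

    def _expert(self, idx, x):
        self.seen_tokens[idx] += 1  # a real quantizer would record activation stats here
        return self.expert_weights[idx] * x

    def forward(self, x, router_logits, calibrate_all_experts=False):
        n = len(self.expert_weights)
        ranked = sorted(range(n), key=lambda i: router_logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Forced mode: every expert sees the token, not just the routed ones.
        to_run = range(n) if calibrate_all_experts else chosen
        outputs = {i: self._expert(i, x) for i in to_run}
        # The output still uses only the routed experts, softmax-weighted.
        m = max(router_logits[i] for i in chosen)
        weights = [math.exp(router_logits[i] - m) for i in chosen]
        z = sum(weights)
        return sum(w / z * outputs[i] for w, i in zip(weights, chosen))

# Toy calibration batch (made up): (token value, router logits).
TOY_BATCH = [
    (1.0, [3.0, 2.0, 0.0, -1.0]),
    (2.0, [2.5, 0.1, 1.0, -3.0]),
]

def run(calibrate_all_experts):
    moe = ToyMoE(expert_weights=[0.5, -1.0, 2.0, 1.5], top_k=2)
    outs = [moe.forward(x, logits, calibrate_all_experts) for x, logits in TOY_BATCH]
    return outs, moe.seen_tokens

routed_out, routed_seen = run(False)
forced_out, forced_seen = run(True)
print("routed only:", routed_seen)
print("forced all: ", forced_seen)
```

With plain routing, the last expert in this toy never sees a single token, so a quantizer would have no statistics for it. With the flag on, all experts are observed and the model outputs are unchanged.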
See this link: [https://www.reddit.com/r/LocalLLaMA/comments/1q2pons/comment/nxfnxyf/](https://www.reddit.com/r/LocalLLaMA/comments/1q2pons/comment/nxfnxyf/) All the OP needs to do, is add an additional line in the AutoRound script, to make sure it activates all experts during quantization.
u/Maxious The quant\_config looks like it defaulted to "pile-10k" for the AutoRound pass?
Since you already did the hard work creating "glm47-reap-calibration-v2" to select the best experts, wouldn't it be better to reuse that dataset for quantization?
Pile-10k probably won't trigger those specific code/agent experts you preserved, leaving them uncalibrated (Silent Expert problem).
It should be a 1-line swap in the AutoRound script to fix.
That's actually a great question! I'm curious to know about that too. As far as I can tell, using the same calibration dataset for both pruning and quantization logically makes sense... am I missing something that makes it not a good idea?
Ok, but REAP'd for what? It's my understanding that REAP prunes experts based on how often they're activated during inference of a calibration set, so what task(s) was it calibrated for?
The W4A16 calibration dataset used was [The Pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) and the REAP calibration dataset was listed as ["glm47-reap-calibration-v2"](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2) which is a dataset on the same author's HF page. Idk what's actually in the dataset because there's no description and I haven't read through it.
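On how a calibration set ends up deciding which experts survive, here is a toy sketch. It assumes each expert is scored by the average, over the tokens routed to it, of router gate weight times the norm of the expert's output, which is my reading of the "router-weighted expert activation" idea behind the REAP name. The real Cerebras implementation may differ, and the function names and numbers below are made up.

```python
def saliency_scores(records, num_experts):
    """records: (expert_idx, gate_weight, output_norm) for each routed token."""
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert, gate, norm in records:
        totals[expert] += gate * norm
        counts[expert] += 1
    # Experts the calibration set never reached get a score of 0.
    return [totals[e] / counts[e] if counts[e] else 0.0
            for e in range(num_experts)]

def experts_to_keep(scores, prune_fraction):
    """Keep the highest-scoring experts after dropping prune_fraction of them."""
    n_keep = len(scores) - int(len(scores) * prune_fraction)
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:n_keep])

# Toy calibration records (made up): 4 experts, expert 3 is never routed to.
records = [
    (0, 0.7, 2.0),
    (1, 0.3, 1.0),
    (0, 0.6, 1.5),
    (2, 0.4, 3.0),
    (1, 0.2, 0.5),
]
scores = saliency_scores(records, num_experts=4)
kept = experts_to_keep(scores, prune_fraction=0.5)
print("scores:", scores)
print("kept experts:", kept)
```

Whatever the exact scoring rule, the takeaway is the same as the question above: an expert the calibration data rarely or never reaches scores low and gets pruned, so the dataset's domain effectively picks what the model remembers.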
74 Comments