Cerebras REAPs: MiniMax-M2 (25, 30, 40%), Kimi-Linear 30%, more on the way!
Posted by ilzrvch@reddit | LocalLLaMA | 22 comments
Hey everyone, we just dropped REAP'd MiniMax-M2 in 3 sizes:
https://hf.co/cerebras/MiniMax-M2-REAP-172B-A10B
https://hf.co/cerebras/MiniMax-M2-REAP-162B-A10B
https://hf.co/cerebras/MiniMax-M2-REAP-139B-A10B
We're running more agentic benchmarks for the MiniMax-M2 REAPs; so far we're seeing good accuracy retention, especially at 25 and 30% compression.
We also recently released a Kimi-Linear REAP@30% and it works well for coding and for long-context QA:
https://hf.co/cerebras/Kimi-Linear-REAP-35B-A3B-Instruct
We're also working to get a Kimi-K2-Think REAP out, so stay tuned. Enjoy!
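If you want to spin one of these up quickly, something along these lines should work with a recent vLLM build that supports the MiniMax-M2 architecture (a sketch only; the flag values are illustrative, not tuned, so adjust the TP size and context length to your hardware):
# serve the smallest REAP checkpoint across 4 GPUs (example values, not tuned)
vllm serve cerebras/MiniMax-M2-REAP-139B-A10B \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --trust-remote-code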
arjunainfinity@reddit
Thank you, folks. Inspired by you, we remixed it as THRIFT.
random-tomato@reddit
What's the relative brain damage like for REAP'd models compared to the original? I've heard that it makes them pretty much unusable for non-coding use cases, but I haven't yet tested them myself.
noctrex@reddit
Well, they are primarily coding models, so I would guess the pruning just removes the stuff not needed for coding.
MiniMax-M2 original description:
TomLucidor@reddit
Now I wonder if there is a good balance between RP and IF/FC/code (cross-note to the other request) https://www.reddit.com/r/LocalLLaMA/comments/1or46rv/comment/npht3kg/
ScoreUnique@reddit
I have tested a few of the REAP models and always end up having repetition issues.
No_Conversation9561@reddit
Someone said it impacts agentic ability
Bird476Shed@reddit
I'd also like to know: what is the experience like comparing a Q6 of the original model to a Q8 of a REAP, both being approximately the same size?
Zc5Gwu@reddit
I’d be curious. Coding and tool calling tend to degrade for low quants but I wonder how that compares with pruning.
notdba@reddit
Any plan for Kimi K2 Instruct? The 50% REAP looks great in the blog post and paper, and many of us do hybrid inference, which can be quite slow with Kimi K2 Think.
Shrimpin4Lyfe@reddit
Can anyone comment on the performance of the GLM 4.6 at Q4?
That seems like the perfect size for 4x 3090 and 128GB RAM, with some weights in system RAM!
tensorparty@reddit
Hi u/ilzrvch & team,
I highly appreciate your efforts to get MiniMax M2 running with lower VRAM requirements and higher inference speeds. Thanks a lot! Many people (like me) are probably waiting for exactly that.
Unfortunately, it's still just beyond the sweet spot of what's possible for many of us. I feel like getting it compressed well below 96GB of VRAM would make it take off in the community. For example, MiniMax-M2-REAP-162B-A10B and MiniMax-M2-REAP-139B-A10B at 4-bit (int4/fp4; GGUF/NVFP4/AWQ) would really make it fly here, while still leaving enough room for a decent context window.
People with 4x 3090/4090/7900XTX, 2x 4090D 48GB mods, 3x 5090, or an RTX Pro 6000 still can't get it running. Maybe I'm asking for too much, and if so, apologies!
In any case kudos for enabling the community! Well deserved!
suicidaleggroll@reddit
It's an MoE, so you can run it just fine at good speed with GPU/CPU hybrid inference. I'm running the normal MiniMax-M2 Q4_K_XL on a single RTX Pro 6000 at 55 t/s generation (550 t/s prompt processing), which is good enough for coding IMO.
tensorparty@reddit
Hi u/suicidaleggroll
thanks for your reply; I partially agree with your suggestion. Out of curiosity I'd be happy to test your inference setup, so could you kindly share your inference parameters?
Despite that option, many local enthusiasts here are running 4x 3090/4090. Those cards are just not as capable as a Pro 6000, they have to communicate over PCIe during inference (an additional performance penalty), and they usually need a power cap so the whole system doesn't produce unreasonable noise, power consumption, and heat (another performance penalty). As a result, 55 t/s tg and 500 t/s pp are likely not achievable (I would be very happy if someone with such a setup could run some benchmarks so we all learn more).
On top of that, building a 4x consumer setup means going through a lot of challenges, since the whole system gets far more complex: riser cables, multiple PSUs, a custom case, more involved cooling. Once it's successfully built, you really want to run those four consumer GPUs in parallel to get the most speed out of your effort, instead of being held back by the GPUs processing sequentially. What you want is vLLM with --tensor-parallel-size 4; only that delivers the performance you worked so hard for. I know this from experience, since before moving to a single-card setup I ran 8x 7900XTX.
Still looking forward to learning from your inference parameter setup. Thanks.
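To make the tensor-parallel point concrete, this is the kind of launch I have in mind. It's only a sketch, and the repo name is hypothetical, since as far as I know there is no 4-bit vLLM-ready quant of the REAP checkpoints yet:
# hypothetical 4-bit quant of the 139B REAP spread across 4x 3090 (repo name is made up)
vllm serve someone/MiniMax-M2-REAP-139B-A10B-AWQ \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768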
DefNattyBoii@reddit
Great insight. So is a 4x GPU setup unlikely to hit more than 500-1000 t/s prompt processing due to lack of bandwidth (I assume no Epyc/TR motherboards)? Good prompt processing speed is one of the main hallmarks of running local agents and is something a lot of people here want.
suicidaleggroll@reddit
With ik_llama.cpp in llama-swap:
/app/llama-server --port ${PORT} --metrics --jinja --model /models/MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf --temp 1.0 --top-p 0.95 --top-k 40 --ctx-size 131072 --n-gpu-layers 99 --n-cpu-moe 32 --attention-max-batch 512
knownboyofno@reddit
How are you running it? I'm guessing llama.cpp. if so, what is your command?
suicidaleggroll@reddit
I recently switched from llama.cpp to ik_llama.cpp. Token generation speed didn’t improve much, about 10%, but prompt processing speed more than doubled.
In llama-swap:
/app/llama-server --port ${PORT} --metrics --jinja --model /models/MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf --temp 1.0 --top-p 0.95 --top-k 40 --ctx-size 131072 --n-gpu-layers 99 --n-cpu-moe 32 --attention-max-batch 512
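Rough breakdown of what the flags are doing, in case it helps (same command, just split up; double-check the ik_llama.cpp docs for the exact semantics):
# llama-swap injects ${PORT}; --metrics exposes a metrics endpoint and --jinja applies the model's own chat template
# --temp/--top-p/--top-k are the sampler settings; --ctx-size 131072 allocates the full 128K context
# --n-gpu-layers 99 offloads every layer, then --n-cpu-moe 32 keeps the MoE expert tensors of the first 32 layers in system RAM
# --attention-max-batch is an ik_llama.cpp-specific knob that limits the attention compute batch size to save VRAM (my understanding)
/app/llama-server --port ${PORT} --metrics --jinja \
  --model /models/MiniMax-M2-UD-Q4_K_XL-00001-of-00003.gguf \
  --temp 1.0 --top-p 0.95 --top-k 40 \
  --ctx-size 131072 \
  --n-gpu-layers 99 --n-cpu-moe 32 \
  --attention-max-batch 512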
noctrex@reddit
For anyone interested, I made some straight MXFP4 quants from the snipped MiniMaxes:
noctrex/MiniMax-M2-REAP-172B-A10B-MXFP4_MOE-GGUF
noctrex/MiniMax-M2-REAP-162B-A10B-MXFP4_MOE-GGUF
noctrex/MiniMax-M2-REAP-139B-A10B-MXFP4_MOE-GGUF
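To try one of them, something like this should work (a sketch: the shard filename is left as a placeholder since it depends on the repo, and --n-cpu-moe should be tuned to your VRAM):
# download the repo, then point llama-server at the first .gguf shard in that directory
huggingface-cli download noctrex/MiniMax-M2-REAP-139B-A10B-MXFP4_MOE-GGUF --local-dir ./minimax-m2-reap-139b
llama-server --model ./minimax-m2-reap-139b/<first-shard>.gguf \
  --ctx-size 65536 --n-gpu-layers 99 --n-cpu-moe 24 --jinja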
lumos675@reddit
God bless you and the Cerebras team. Huge thanks to you guys.
tensorparty@reddit
That was fast! Congratz! Checking it out.
____vladrad@reddit
Have you all considered pruning an AWQ model? Like quantizing the original GLM 4.6 to AWQ and then pruning that? Curious.
____vladrad@reddit
You all are so disgusting. Thank you for your hard work. ❤️