I made a 35% REAP of 397B with potentially usable quality in 96GB GPU

[-]

grumd@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1s9mkm1/benchmarked_18_models_that_i_can_run_on_my_rtx/

I've added your REAP to my post (at IQ2_XS_Gv2). This model takes a bit more RAM than 122B:Q4_K_XL but didn't perform well unfortunately

[-]

Goldkoron@reddit (OP)

Thanks for testing. The unreaped IQ1 should be up within an hour or 2 here: https://huggingface.co/Goldkoron/Qwen3.5-397B-A17B

I'll be curious at least if it's competitive with the Bartowski quant.

[-]

grumd@reddit

Weird. Downloaded your Q1 quant and it simply fails at tool calling on almost every test. Same as vanilla Qwen3.5-9B

[-]

Goldkoron@reddit (OP)

Huh, that one was giving me solid writing results and was even able to answer fact and math logic questions I gave it. Didn't expect it would be broken for tool calls.

[-]

Yeah maybe tool calling got unreliable due to how low the quant is, I tried a few different hyperparameters but couldn't make it complete that specific benchmark. It did fine on my favorite "give me a carbonara recipe" test lol

[-]

Goldkoron@reddit (OP)

Do you have the vram to try the IQ2_XXS_G or IQ2_XS_G quants? Those might be a little more reliable.

I am also testing a new method which might give even better results but it wouldn't work at the IQ1 level. Instead of boosting sensitive tensors by +1 to reach bpw budget, also downgrade the least sensitive tensors by -1 quant level in order to get more budget to boost more of the high value tensors.

[-]

grumd@reddit

Nope, I'm at 16GB VRAM + 96GB RAM, so ~90GB models is the most I can do

[-]

Goldkoron@reddit (OP)

Alright, challenge accepted. I will ping you if I manage to make a higher scoring quant around 1.95bpw (same as current IQ1_S_G)

[-]

grumd@reddit

Alright! But honestly I think 397B is a lost cause at Q1, the compression introduces just too much inaccuracy. I'd love you more if you made a good ~80GB quant of 122B, it would be around Q4_K, but if with your techniques it reaches ~Q5 level of quality it would be a real killer of a model. I'm using unsloth Q4_K_XL with a ton of context usually these days, it's the best performer out of all models that I could fit on my system

[-]

Goldkoron@reddit (OP)

Just base 122b?

The KLD sensitivity scan wouldn't more than a day if I did it for that. I had just never bothered touching 122b because people said dense 27B was better.

[-]

grumd@reddit

Yeah, just base 122B. I've tried multiple REAPs and distills over the last few weeks and base Qwen models always come out on top in my benchmarks. 27B from my benchmarks is not better than 122B, rather similar quality, but the funny thing is that 122B can be offloaded to RAM and used with a 16GB GPU like mine, 27B needs to be fully on the GPU, so it's a bit more restrictive for normal consumer PCs.

[-]

Goldkoron@reddit (OP)

I am downloading most of the unsloth UD quants right now as a benchmark and will see if I can make better ones in smaller size. If my quants get better kld scores but worse results in your benchmarks then that will be telling whether KL divergence is a flawed baseline.

[-]

grumd@reddit

<3 You're the best

[-]

Goldkoron@reddit (OP)

Current status - https://imgur.com/a/ZkLeEHe

[-]

FoxiPanda@reddit

I am interested in trying the 122B_A10B Gutenberg quantization method out too - particularly the K_G_4.50 you listed on the model card (here) would be of interest to me. It seems like a great fit for the memory bandwidth capabilities of a Mac Studio M3 Ultra -- would still generate a solid tok/s number and be a genuinely useful local model for agentic tasks.

[-]

Goldkoron@reddit (OP)

Hi, thanks for your interest. The models are currently updating but it might take a day before they're all up. 6.00 looks like it will take another couple hours to upload, then 5.00, then 4.50 would start. Though xet might speed it up as more quants go up.

[-]

grumd@reddit

Looks good! Can't wait. I was running the full set of Aider polyglot benchmarks on UD-Q4_K_XL overnight, it did 76/225 tests so far and pass@2 is 76.3%. Unsloth's IQ3_XXS had a ~68% result. I'll finish the Q4 over the next few days and then run this on your 5bpw quant as well

[-]

tnhnyc@reddit

It'd be also interesting to compare your IQ1 with Unsloth's TQ1.

[-]

Necessary-Summer-348@reddit

What quantization method are you using? 35% REAP sounds aggressive even for Q2 - curious if you're seeing coherence issues past 4k context or if it's actually holding up for longer inference tasks.

[-]

Goldkoron@reddit (OP)

I describe my quantization method a bit in my comment and on huggingface.

Coherence was stable in storywriting around 35k context, though some story events (time and place) were being misremembered.

I have been starting to realize since last night that deeper quantizations on the unreaped model are actually better quality than this, the REAP is subtly breaking too much.

[-]

Necessary-Summer-348@reddit

Interesting — so the degradation is more distributed than just context coherence, it's in the base representation itself. Makes sense that aggressive pruning would compound with quantization errors. Is Q4 unreaped holding up noticeably better than Q2 REAP, or are you going even deeper?

[-]

Goldkoron@reddit (OP)

There's certain errors I am finding on REAP that happen even at lower percentages like 25% or less. So I suspect it's putting holes in the routing by reaping any experts.

Issues like the model thinking "feline" is a typo for "feline", etc. A lot of subtle token confusion.

[-]

FoxiPanda@reddit

I've downloaded Qwen3.5-397B-A17B-REAP35-IQ2_XS_Gv2.gguf and I'll give it a shot tomorrow and report back for my use cases as I have enough VRAM to run that and it'll be curious to see what kind of speed / accuracy / usefulness I can get out of it.

It's certainly an interesting idea. Thanks for sharing.

[-]

TomLucidor@reddit

Would love to know more about this and see if quants are useful

[-]

FoxiPanda@reddit

As promised, I loaded this up and ran it. I used b8660 llama-server and these parameters

llama-server \
  --model ~/models/Qwen3.5-397B-A17B-REAP35-IQ2_XS_Gv2.gguf \
  --mmproj ~/models/Qwen3.5-397B-A17B-mmproj-F32.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --threads 16 \
  --parallel 1 \
  --batch-size 1024 \
  --ubatch-size 1024 \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  --flash-attn on \
  --jinja \
  --reasoning off \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --metrics \
  --mlock \
  --host 0.0.0.0 \
  --port 8091

I don't have a bunch of fancy benchmark numbers or whatnot (yet at least) but I will say it gets about ~25-30tok/s out on my M3 Ultra 512GB system and uses about 100GB~ of VRAM.

It can use tools and use vision and generally is coherent in conversation. Honestly? Not bad so far. I'll poke at it some more and report back...I think it's still a bit too slow for my "daily driver" needs (I like 50tok/s or better) but it isn't bad which is a great thing for a model that had some serious brain surgery.

[-]

TomLucidor@reddit

Check prefill, if it is faster than 2K tps (or even 5K tps) I would be a little surprised (and remember MTP exists to add more speed on top)

[-]

FoxiPanda@reddit

Hmm, I tried to get MTP/speculative decoding working in llama.cpp for this model and I have to admit I couldn't figure out it out. /u/Goldkoron have you messed around with MTP on this model to speed up inference? I messed with it for like 20 minutes trying to get it to use Qwen3.5-4B as a draft model or enable native MTP but I just couldn't get it to load/work.

[-]

Goldkoron@reddit (OP)

I have not messed around with MTP/speculative decoding at all, sorry

[-]

FoxiPanda@reddit

No worries. It seems like perhaps the REAP killed support for it or I'm just doing it wrong (skill issue lol)... I end up with this error no matter what I try with this model:

common_speculative_is_compat: the target context does not support partial sequence removal srv load_model: speculative decoding not supported by this context

[-]

Goldkoron@reddit (OP)

I wouldn't be surprised if it broke it, I actually don't recommend the REAP model anymore anyway (even though I just posted this yesterday!)

I applied the K_G quantization method to the full model and even at deep quantization I was getting better outputs than the REAP was.https://huggingface.co/Goldkoron/Qwen3.5-397B-A17B

[-]

TomLucidor@reddit

Please try ternary quantization next!

[-]

FoxiPanda@reddit

Neat, I will take a look at it.

[-]

a_beautiful_rhind@reddit

What did you reap it on though? The previous attempts destroyed everything outside of coding. Model forgets how to write and that made me give up on this method since I want a generalist.

EXL3 can also compress something like this pretty small and has those hadamard rotations when making the quant, unlike gguf.

[-]

Goldkoron@reddit (OP)

My initial attempt at REAP is just like you describe. The model basically struggled to even speak english correctly anymore.

The first method I tried that worked without breaking it's ability to output simple language was using the imatrix activation data to prune the experts that were least used.

However, I think it's still subtly breaking too much, and I am finding that deeper quantizations (IQ1_S_G is 90gib) are better than the REAP35 at IQ2_XS

[-]

a_beautiful_rhind@reddit

Yea I didn't have a good time with all the reap I downloaded. You probably need a robust dataset, imatrix is a step up but guess not enough. I thought about trying whichever way on a small model until something good shakes out.

[-]

FoxiPanda@reddit

As promised, I loaded this up and ran it. I used b8660 llama-server and these parameters

llama-server \ --model ~/models/Qwen3.5-397B-A17B-REAP35-IQ2_XS_Gv2.gguf \ --ctx-size 131072 \ --n-gpu-layers 999 \ --threads 16 \ --parallel 1 \ --batch-size 1024 \ --ubatch-size 1024 \ --cache-type-k bf16 \ --cache-type-v bf16 \ --flash-attn on \ --jinja \ --temp 0.7 \ --top-k 20 \ --top-p 0.95 \ --min-p 0.0 \ --presence-penalty 0.0 \ --repeat-penalty 1.0 \ --metrics \ --mlock \ --host 0.0.0.0 \ --port 8091

I don't have a bunch of fancy benchmark numbers or whatnot (yet at least) but I will say it gets about ~25-30tok/s out on my M3 Ultra 512GB system and uses about 100GB~ of VRAM.

It can use tools and use vision and generally is coherent in conversation. Honestly? Not bad so far. I'll poke at it some more and report back...I think it's still a bit too slow for my "daily driver" needs (I like 50tok/s or better) but it isn't bad which is a great thing for a model that had some serious brain surgery.

[-]

Goldkoron@reddit (OP)

Hi everyone, I am new to making model quantizations but I thought the results I have gotten are worth sharing if anyone wants to help test or tear apart my method.

I took Qwen3.5-397B and used the imatrix activation data from Unsloth to REAP the bottom 35% used experts across all layers, cutting the model size down to 261B~ parameters. After a lot of testing, I settled on 35% being the most I can REAP this model using this method before noticeable brain damage occurs. I am not sure how much dumber it is than the base model, but the output quality does not feel dumb for my usecases.

Second improvement is I came up with a new quantization strategy. Yes I am using Claude Code to help with my tool scripts but look, I am writing this entire post by hand, as well as all the methodic testing I did.

I tested each tensor group in the model to find the most impactful per GB using KL Divergence (KLD) data compared to the Q8 source. My conclusion was to leave every tensor untouched except for the 180 down/gate/up expert tensors. So everything else is in Q8_0 or F32 as seen in the Q8_0 model. I then did a sensitivity scan of 180 tensors—180 models created and benchmarked with swapped tensors to rate each tensor by importance to KLD.

For each K_G quantization level, experts all start at the base quant and are upgraded by +1 quant level in order of highest value until the BPW(bits-per-weight) match a standard K_M quant in size.

I am not going to make big claims like "This method achieves quality 1-2 quant levels higher than normal" without presenting the data I have to back it up:

Quant	Size	BPW	Mean KLD	Same Top Token
Q5_K_M	173 GiB	5.69	0.00642	95.18%
Q4_K_G	148 GiB	4.86	0.00751	94.26%
Q4_K_M	148 GiB	4.86	0.01242	93.67%
Q3_K_G	116 GiB	3.83	0.00932	94.68%
Q3_K_M	116 GiB	3.83	0.03797	89.36%
IQ2_XS_G	87 GiB	2.86	0.02150	92.55%
Q2_K	89 GiB	2.93	0.10118	82.63%

I have not tested this model for coding, but I would like to hear from others how it compares to unreaped Qwen3.5 397B. I only have ~200GB of VRAM to work with so the largest quant I can use on the base model is Q3_K territory. For creative writing—I use LLMs for story writing mostly—the quality is quite good from my admittedly biased observance.

[-]

legit_split_@reddit

What about the IQ2_XS from ubergarm? Fits in 128gb.

https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF

[-]

sautdepage@reddit

Nice effort, thanks!

I've been using unsloth Q3-XL (166GB) with impressive results. I gave your Q4_M_Gv2 a try, vibe-feeling it:

- simonw's "draw a pelican" was worse.
- Asked to translate k-pop lyrics: didn't pick up the right band in 3 attempts: endless loop trying to figure it out / wrong name / didn't mention it. Q3-XL recognized it right away yesterday.
- Code review: Got in a git diff hell but they all do sometimes.
- Code task: Actual work in an existing project. Doing okay, no tool call errors, decent code. In architect mode seems a bit less "creative/knowledgeable" than Q3-XL and mostly stick to what I say -- not necessarily a bad thing and maybe just due to that particular task. Wrote the code, fixed build errors, made correct tests. Followed agents.md instructions for how to run tests correctly.

So far it's pretty decent, not sure it's better than Q3-XL though.

I'll use it some more. The nice thing about a good Q4 REAP is unlike Q3 we could envision a safetensors version that works with vllm to get the extra performance out of a 192GB VRAM budget. Llama/ik_llama are awesome but quite slow at PP and concurrency.

[-]

Goldkoron@reddit (OP)

Yeah I think my quantization strategy is more robust than the REAP strategy which was kind of just a "This is the first thing that kinda worked within my limited setup". I am uploading a Q3_K_G quant for full model which is 177gib, and might upload some smaller ones, especially if they're better than Q4_K on the REAP.

[-]

grumd@reddit

The quantization strategy looks very interesting. Would you be doing 122B? Way more people can run smaller models and 122B even fits in consumer GPUs like my 5080 with experts offloaded to RAM. People with 64GB RAM and 16GB GPUs can basically only run IQ3_XS of 122B, so making an efficient quant with your method would be very valuable

[-]

sautdepage@reddit

No worries! As long as people don't claim to have solved fusion power, experimentation is a good thing.

[-]

Goldkoron@reddit (OP)

I am grateful you tried it. I suspect REAP breaks the model routing in ways that are not immediately obvious on some tasks. I need to determine next if it's possible to REAP this many experts without catastrophic failures like thinking loops.

A KLD based method to REAP more safely might be possible but I suspect it would require individually testing over a thousand model combinations which might take a week or 2.

[-]

TomLucidor@reddit

Please SWE-Bench (or Live-Benching) these quants

[-]

HeyEmpase@reddit

Nice work - 35% REAP on 397B is impressive for a 96GB GPU! (makes me wonder how REAP’s pruning affects model reliability)

[-]

Technical-Earth-3254@reddit

Cool project! It may be useful to compare your reap'd model to a model of native similar size like Minimax 2.x or Step 3.5 Flash? Just to see if it's actually viable for this size range

[-]

Goldkoron@reddit (OP)

Yeah I would like to hear people's thoughts. I personally like it better than Qwen3.5 122B for its writing quality.

[-]

lakySK@reddit

Nice work! Would some kind of Autoresearch approach work with REAPs and quants?

Specify target size, metric to maximise (KLD or some benchmark) and let Claude Code go wild. Anyone tried that?

[-]

constructrurl@reddit

Wait, you squeezed a 397B model down to 96GB and it still has usable quality? That's the kind of dark magic we actually need, not another frontier model that needs a datacenter.

[-]

Goldkoron@reddit (OP)

It's a little more like I dumbed down a 397B model to 262B, then squeezed. I wouldn't try it with the expectation of getting the full 397B experience, but it does produce coherent output.

Like all Qwen3.5 models though, it is sometimes susceptible to thinking loops at start of a chat when reasoning is turned on.

[-]