Qwen3.5-27B Q4 Quantization Comparison
Posted by TitwitMuffbiscuit@reddit | LocalLLaMA | View on Reddit | 116 comments
This is a Q4 quantization sweep across all major community GGUF quants of Qwen3.5-27B (those available as of 03/03/2026), comparing mean KLD to the BF16 baseline across different quantizers and recipes.
The goal is to give people a data-driven basis for picking a file rather than just grabbing whatever is available.
KLD (KL Divergence): "Faithfulness." It shows how much the quantized model's probability distribution drifts from a baseline (the probability distribution of the original weights). Lower = closer.
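Roughly, the metric works like this (an illustrative sketch, not the llama.cpp implementation; it assumes you already have per-token logits from both models):

```python
# Illustrative sketch of the mean-KLD metric. llama-perplexity computes this
# internally from saved baseline logits; shapes here are (n_tokens, vocab_size).
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(baseline_logits, quant_logits):
    """KL(baseline || quant), averaged over all evaluated token positions."""
    p = softmax(baseline_logits)   # reference distribution (BF16)
    q = softmax(quant_logits)      # distribution from the quantized model
    kld = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kld.mean())

# print(mean_kld(bf16_logits, q4_logits))
```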
KLD Results — Custom Chat Dataset
Evaluated on titwitMuffbiscuit-v03-full.txt — chat-wrapped corpus (Qwen3.5 ChatML format), 2502 blocks, 47 chunks at context 4096.
Content: Science & engineering, Medicine, Philosophy, History, Finance, Culture, multilingual content and code snippets.
kld_plot_Qwen3.5-27B
Wikitext2 + Custom Dataset Comparison
Evaluated on wikitext2_test.txt, 72 chunks at context 4096 (plain text).
The dumbbell plot shows both datasets side by side — solid circle = chat corpus (primary), semi-transparent diamond = wikitext2 (secondary).
dumbbell_Qwen3.5-27B
lmstudio-community's and mradermacher's standard Q4_K_M are identical files, hence the stacking/blending visible on the dumbbell plot.
Sorted by KLD — Custom Dataset
lmstudio-community Q4_K_M excluded — identical file to mradermacher Q4_K_M.
| Rank | Quantization | Size (GiB) | PPL | KLD |
|---|---|---|---|---|
| 1 | unsloth_Qwen3.5-27B-UD-Q4_K_XL | 16.411 | 5.8901 | 0.005087 |
| 2 | bartowski_Qwen3.5-27B-Q4_K_M | 15.952 | 5.8882 | 0.005633 |
| 3 | unsloth_Qwen3.5-27B-Q4_K_M | 15.591 | 5.8948 | 0.006193 |
| 4 | ubergarm_Qwen3.5-27B-smol-IQ4_NL | 15.415 | 5.9026 | 0.006371 |
| 5 | mradermacher_Qwen3.5-27B.i1-Q4_K_M | 15.404 | 5.9059 | 0.006469 |
| 6 | bartowski_Qwen3.5-27B-Q4_K_S | 14.985 | 5.8984 | 0.006720 |
| 7 | bartowski_Qwen3.5-27B-IQ4_XS | 14.130 | 5.9017 | 0.007062 |
| 8 | bartowski_Qwen3.5-27B-IQ4_NL | 14.851 | 5.9091 | 0.007233 |
| 9 | unsloth_Qwen3.5-27B-Q4_K_S | 14.686 | 5.9083 | 0.007449 |
| 10 | unsloth_Qwen3.5-27B-IQ4_NL | 14.610 | 5.9147 | 0.007461 |
| 11 | mradermacher_Qwen3.5-27B.i1-IQ4_XS | 13.680 | 5.9129 | 0.007569 |
| 12 | unsloth_Qwen3.5-27B-IQ4_XS | 13.949 | 5.9179 | 0.007677 |
| 13 | mradermacher_Qwen3.5-27B.i1-Q4_K_S | 14.499 | 5.9209 | 0.007937 |
| 14 | mradermacher_Qwen3.5-27B.Q4_K_M | 15.404 | 5.9028 | 0.009201 |
| 15 | mradermacher_Qwen3.5-27B.IQ4_XS | 13.784 | 5.9342 | 0.011463 |
| 16 | steampunque_Qwen3.5-27B.Q4_K_H | 14.864 | 5.9050 | 0.012091 |
| 17 | mradermacher_Qwen3.5-27B.Q4_K_S | 14.499 | 5.9293 | 0.012364 |
Most Efficient Quantization — Custom Dataset
Efficiency Score: √(normalized size² + normalized KLD²) — lower is better.
| Rank | Quantization | Size (GiB) | KLD | Eff. Score |
|---|---|---|---|---|
| 1 | bartowski_Qwen3.5-27B-IQ4_XS | 14.130 | 0.007062 | 0.317506 |
| 2 | mradermacher_Qwen3.5-27B.i1-IQ4_XS | 13.680 | 0.007569 | 0.341075 |
| 3 | unsloth_Qwen3.5-27B-IQ4_XS | 13.949 | 0.007677 | 0.369294 |
| 4 | unsloth_Qwen3.5-27B-IQ4_NL | 14.610 | 0.007461 | 0.471585 |
| 5 | unsloth_Qwen3.5-27B-Q4_K_S | 14.686 | 0.007449 | 0.490965 |
| 6 | mradermacher_Qwen3.5-27B.i1-Q4_K_S | 14.499 | 0.007937 | 0.493275 |
| 7 | bartowski_Qwen3.5-27B-IQ4_NL | 14.851 | 0.007233 | 0.520404 |
| 8 | bartowski_Qwen3.5-27B-Q4_K_S | 14.985 | 0.006720 | 0.527916 |
| 9 | mradermacher_Qwen3.5-27B.i1-Q4_K_M | 15.404 | 0.006469 | 0.659219 |
| 10 | ubergarm_Qwen3.5-27B-smol-IQ4_NL | 15.415 | 0.006371 | 0.659346 |
| 11 | unsloth_Qwen3.5-27B-Q4_K_M | 15.591 | 0.006193 | 0.716059 |
| 12 | bartowski_Qwen3.5-27B-Q4_K_M | 15.952 | 0.005633 | 0.835306 |
| 13 | mradermacher_Qwen3.5-27B.Q4_K_M | 15.404 | 0.009201 | 0.847417 |
| 14 | mradermacher_Qwen3.5-27B.IQ4_XS | 13.784 | 0.011463 | 0.877012 |
| 15 | unsloth_Qwen3.5-27B-UD-Q4_K_XL | 16.411 | 0.005087 | 1.000000 |
| 16 | mradermacher_Qwen3.5-27B.Q4_K_S | 14.499 | 0.012364 | 1.043999 |
| 17 | steampunque_Qwen3.5-27B.Q4_K_H | 14.864 | 0.012091 | 1.055620 |
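For reference, a minimal sketch of how the efficiency score above can be computed, assuming min-max normalization of size and KLD across the compared quants (illustrative; the example values are a subset of the table, so they won't reproduce the exact scores, which are normalized over all 17 entries):

```python
# Sketch of the efficiency score: sqrt(norm_size^2 + norm_kld^2), lower is better.
# Assumes min-max normalization across the set of compared quants.

def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def efficiency_scores(sizes_gib, klds):
    ns, nk = normalize(sizes_gib), normalize(klds)
    return [(s * s + k * k) ** 0.5 for s, k in zip(ns, nk)]

# Three entries from the tables above (bartowski IQ4_XS, unsloth UD-Q4_K_XL,
# mradermacher i1-IQ4_XS):
sizes = [14.130, 16.411, 13.680]
klds = [0.007062, 0.005087, 0.007569]
print(efficiency_scores(sizes, klds))
```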
Hardware: i3-12100F — 64GB DDR4-3200 — RTX 3060 12GB
Evaluation tool: llama.cpp (mainline) version: 8189 (4d828bd1a)
rorowhat@reddit
How are you running these?
TitwitMuffbiscuit@reddit (OP)
Very very slowly.
Like 2 tokens/s for the ones that can't fit in VRAM (and those that do fit are too low bpw for my taste).
I don't use this model obviously.
rorowhat@reddit
I mean what harness are you using?
TitwitMuffbiscuit@reddit (OP)
Oh, I'm using little scripts like the one at the bottom of the post. They're constantly updated and not extensively tested. They should be multiplatform but require Python.
https://github.com/cmhamiche/kld-sweep
python kld_sweep.py --exe /path/to/llama-perplexity --baseline /path/to/baseline/Qwen3.5-0.8B-BF16.gguf --quants /path/to/quants_folder --dataset /path/to/eval_dataset_260426-0239.txt --output /path/to/output_folder --args="-t 7 -c 512 -ngl 99" --model-name Qwen3.5-0.8B
For the dataset you can use https://github.com/cmhamiche/kld-sweep-dataset. It's interactive (built from eaddario/imatrix-calibration, though I still need to add Southeast Asian languages), so just run the script as is. It's randomized by default so I can share the dataset after a test, but you can set a seed to avoid this.
rorowhat@reddit
Haha, I'm reading this on my phone and read the post as "kid-sweep" and thought it was a redditor. Makes sense now. Are you running these with thinking enabled? And how many shots are you running? Is this the average of a few? Sorry for all the questions, but I have some extra PCs lying around not doing anything and this sounds like an interesting endeavor.
TitwitMuffbiscuit@reddit (OP)
No, it's not that type of eval; those are just KLD figures: how "faithful" the quant is to the baseline in terms of probability distribution, i.e. the chance of producing similar output compared to f16/bf16, to make it short.
For the evals you are talking about I've used https://github.com/EleutherAI/lm-evaluation-harness in the past with llama-server.
I did translate gsm8k (platinum) into my native language a while back but it's probably completely saturated with the latest models.
It depends on the type of eval of course, but I tend to favor regex extraction over LLM-as-a-judge evals and use like 3 shots, because I want a quick assessment of particular tasks I'm interested in, not to publish a paper.
I don't think LiveCodeBench is available tho.
If I ever do those types of evals I'd produce a thorough methodology for reproducibility (but would probably get some flak anyway, because internet).
rorowhat@reddit
I tried this eval harness before and it didn't play well with llama.cpp for me. Had some issues with the chat template and some other things missing. It was very frustrating.
TitwitMuffbiscuit@reddit (OP)
Evaluation is not simple for sure. Even when it works, there's some subjectivity, some wiggle room, and definitely not many confidence intervals displayed on those tables.
https://www.reddit.com/r/LocalLLaMA/comments/1sl59qq/comment/ogc0sv4/
ASYMT0TIC@reddit
Can you add nvfp4 to this?
TitwitMuffbiscuit@reddit (OP)
I don't have the hardware, nor had the community provided NVFP4 quantized weights at that time.
The only recent native NVFP4 release is NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.
Someone made a GGUF if you want to try it out: jdziat/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4-GGUF
PaMRxR@reddit
I made a bit different plot of the first table showing quantization size vs. KLD. Quantizations under or close to the best fit line should be preferable I suppose.
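Roughly something like this, if anyone wants to reproduce it (an illustrative sketch; the sizes/KLDs are a subset of OP's table):

```python
# Size-vs-KLD scatter with a least-squares fit line; points below the line
# get more fidelity than their size alone "predicts".
import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([16.411, 15.952, 15.591, 14.130, 13.680])           # GiB
klds = np.array([0.005087, 0.005633, 0.006193, 0.007062, 0.007569])  # mean KLD

slope, intercept = np.polyfit(sizes, klds, 1)
xs = np.linspace(sizes.min(), sizes.max(), 100)

plt.scatter(sizes, klds)
plt.plot(xs, slope * xs + intercept, linestyle="--", label="best fit")
plt.xlabel("Size (GiB)")
plt.ylabel("Mean KLD")
plt.legend()
plt.show()
```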
TitwitMuffbiscuit@reddit (OP)
Yeah, behind each quant there is a recipe, and you never know what trade-offs have been made and how the model will behave. Sometimes bigger =/= better.
PaMRxR@reddit
I've been trying to run kld-sweep myself and have a short suggestion for improvement. In addition to --args it could support --args-quants. I find I need different parameters for the full bf16 model, since it doesn't fit in VRAM for me, than for the quants.
TitwitMuffbiscuit@reddit (OP)
It's updated. I've opted for --baseline to replace -bf16 since it was ambiguous (q8_0 could also be used as a baseline), and I added an optional --args-baseline.
Also, the version I uploaded had an option, on by default, to shut down my PC since I was running it overnight. I removed that option, sorry about that.
PaMRxR@reddit
That was fast! Thanks for sharing this great work mate.
TitwitMuffbiscuit@reddit (OP)
Well, thanks for pinging me, otherwise people on Windows would have had their PC shut down at the end of the script. Awkward...
PaMRxR@reddit
Phew good thing I run Linux! Otherwise it would've been a pain as I connect remotely to my machine some 10km away.
TitwitMuffbiscuit@reddit (OP)
Phew indeed.
TitwitMuffbiscuit@reddit (OP)
Btw if you need a new dataset, there's this tool for both KLD eval and imatrix calibration: https://github.com/cmhamiche/kld-sweep-dataset
Category + language group + target chunk count, with the option to wrap chunks in the model's chat template taken from the GGUF's metadata.
TitwitMuffbiscuit@reddit (OP)
Yeah, if you're doing the whole sweep with the only arguments that fit bf16, it won't be optimal. You're right. I generated the bf16 logits beforehand.
I'm not home right now but I'll make the changes you suggested, thank you.
InternationalNebula7@reddit
This is very helpful. Here's my question: Are you able to fit these quants on your RTX 3060 12GB or are you spilling over to CPU and taking the performance hit?
Perhaps I should try a Q4 on my 16 GB VRAM.
TitwitMuffbiscuit@reddit (OP)
It runs at a crawling 4.5 t/s with -ngl 36, and it gets worse from there.
Iory1998@reddit
Just offload KV cache to RAM and increase the layers offloaded to GPU.
TitwitMuffbiscuit@reddit (OP)
Let me try with -nkvo, I'll report back in a sec.
Galahad56@reddit
Any luck speeding it up? What was your final config if you had success? Thanks mate
TitwitMuffbiscuit@reddit (OP)
Nope, I'm running 122B-A10B now (at less than 8 t/s):
.\llama-server.exe -dio -t 7 -np 1 -fitt 1 -fitc 65536 --temp 1.0 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -m ..\models\Qwen3.5-122B-A10B.gguf --mmproj ..\models\mmproj-BF16.gguf
Galahad56@reddit
This space moves so fast we might only have to wait 1 week to get it racing!
TitwitMuffbiscuit@reddit (OP)
True it's definitely hard to keep up. Papers all the time, new labs, new models plus very active projects.
It's a bit overwhelming.
wisepal_app@reddit
i have 16 gb vram and 96 gb ddr5 ram. which quant do you suggest and with which flags?
TitwitMuffbiscuit@reddit (OP)
The smallest Q4 I guess. Idk if Q3 is viable considering the number of parameters (27B).
Far-Low-4705@reddit
I think a UD_IQ3 quant would be worth it if you can fully offload to GPU.
I-quants tend to preserve performance better for STEM/coding, so it depends on your use case.
TitwitMuffbiscuit@reddit (OP)
Well, there is the Qwen3.5-122B-A10B.
wisepal_app@reddit
Ok. You mentioned the -nkvo flag. First time I've heard of it. What does it do and how do you use it? One last question: someone said to use headless mode to save 1-2 GB. Is that VRAM or normal RAM savings?
pmttyji@reddit
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
-kvo, --kv-offload, -nkvo, --no-kv-offload
wisepal_app@reddit
Okay, I found out now that -nkvo is the abbreviation of --no-kv-offload
InternationalNebula7@reddit
Any luck at getting all layers to GPU for an RTX5080?
Iory1998@reddit
From my testing, KV Cache offloaded to CPU is bad when you use MoE but helpful when using dense models with layers offloaded to CPU.
3spky5u-oss@reddit
You'll lose about 1-2gb to OS if you aren't running headless.
Nice thing is the Qwen3.5 arch is very efficient on context, so your KV cache won't be huge.
InternationalNebula7@reddit
I'm running headless, thankfully.
zelkovamoon@reddit
incredible work here
TitwitMuffbiscuit@reddit (OP)
Thank you !
sig_kill@reddit
This is excellent. In a sea of different options, this truly helps!
TitwitMuffbiscuit@reddit (OP)
Then I'm happy.
DistanceSolar1449@reddit
Yeah, this data (from a third party neutral source) is useful.
The data here validates the “common sense” of “go with unsloth Q4_K_XL or bartowski Q4_K_M unless you’re limited on space, in which case go with an IQ4 quant”
kaisurniwurer@reddit
And, most importantly (to me), imatrix quants from mradermacher are better for more niche models (like most heretic models).
overand@reddit
I was wondering about that - whether people have preferences for the folks providing quants!
TitwitMuffbiscuit@reddit (OP)
Well, in terms of visibility, some repos clearly stand out when you sort by downloads, for sure.
But there's also a good chunk of Hugging Face users who are willing to test experimental quants, because the website's filter feature is pretty well made.
kaisurniwurer@reddit
I don't really have a strong preference, but quite often mradermacher is the only option since they are providing quants for EVERY SINGLE MODEL IN EXISTENCE.
--Tintin@reddit
I would add lm studio
TitwitMuffbiscuit@reddit (OP)
lmstudio-community's Q4_K_M is not shown but it has been tested too; I haven't included it because it's bit-for-bit identical to mradermacher's Q4_K_M. It's the only Q4 they have released.
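If you want to check that kind of thing yourself, hashing both files is enough; a quick sketch (paths are illustrative):

```python
# Confirm two GGUF files are bit-for-bit identical by comparing SHA-256 hashes.
import hashlib

def sha256(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# print(sha256("lmstudio-community/Qwen3.5-27B-Q4_K_M.gguf") ==
#       sha256("mradermacher/Qwen3.5-27B.Q4_K_M.gguf"))
```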
DistanceSolar1449@reddit
Lmstudio-community is bartowski
https://huggingface.co/posts/yagilb/424754955629621
TitwitMuffbiscuit@reddit (OP)
Well not this one:
https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF
Qwen_Qwen3.5-27B-Q4_K_M.gguf 17.1 GB
https://huggingface.co/lmstudio-community/Qwen3.5-27B-GGUF
Qwen3.5-27B-Q4_K_M.gguf 16.5 GB
erubim@reddit
Why? I don't get it. It seems to me the first table is evidence that the naive strategy really works well: just get the biggest unsloth quant that fits (they keep getting better and seem like the most reliable quants).
But what would you do with the efficiency score? It is likely dataset-specific, so you did well comparing wikitext and a closed custom dataset.
DistanceSolar1449@reddit
Efficiency matters more if you’re limited by context size for whatever reason.
You’re right though, “get the biggest/best model that fits on your gpu” is generally the right move.
TitwitMuffbiscuit@reddit (OP)
Well, bigger equals better is the trend, but with imatrix quants and various recipes you never know; look at this previous test for example:
https://www.reddit.com/r/LocalLLaMA/s/M1hbehDQGf
--Tintin@reddit
But going bigger also means going slower, right? I do have 128GB of unified RAM available, but for some applications I still go for the Q4-Q6 versions of these kinds of models.
overand@reddit
I love this work you did. I wish your scatterplot used different shapes, though - it's very hard for me to tell some of those apart on my display, and I'm not even colour/colorblind.
TitwitMuffbiscuit@reddit (OP)
For the dumbbell plot? Yeah, I could have used crosses.
The bottom series is the custom corpus that uses the chat template; that lowers the PPL floor because <|im_end|><|im_start|>user etc. are pretty boring (easy to predict) for the model, even though the dataset is pretty diverse. The top series is wikitext2, plain text in English.
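To illustrate what "chat-wrapped" means here, each block is roughly wrapped like this before evaluation (a hypothetical sketch, not the exact wrapping kld-sweep-dataset does):

```python
# Rough illustration of wrapping a text block in Qwen-style ChatML before eval.
# The fixed template tokens recur in every block, so they are nearly free to
# predict, which pulls the average PPL down.
def chatml_wrap(text, system="You are a helpful assistant."):
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_wrap("Explain KL divergence in one paragraph."))
```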
overand@reddit
This is what the graph looks like to about 1 in 50 males. About 8% of XY males have some form of colourblindness, versus about 0.5% of XX females.
(Why the weird language? It's a medical thing, so it's one of the very, very few things where "it's about the chromosomes." [Which, like, we have a lot of chromosomes that have nothing to do with male/female, just like there are a lot of pronouns like I, You, What, That, etc, but language and whatnot.])
TitwitMuffbiscuit@reddit (OP)
Thanks for the explanation and for raising awareness of color blindness; I've never actually given it much thought even though I had a friend who couldn't distinguish red from green.
Anyway, lmstudio-community has only one quant, Q4_K_M, bit-for-bit identical to mradermacher's Q4_K_M, and tbh it's almost the same color as
I'll be more attentive next time.
overand@reddit
I actually mean the first graph; I'm looking at it on a different screen - from what I can tell, it has bartowski and steampunque (if there's only one), but I can't find lmstudio (which may be because it's not there, or because I can't tell some of the colors apart). I actually don't understand the dumbbell graph - but that's likely because I'm a dumbbell! (In other words, don't worry about explaining it to me - we only have so many hours in our lives!)
TitwitMuffbiscuit@reddit (OP)
lmstudio-community's Q4_K_M is not shown but it has been tested too; I haven't included it because it's bit-for-bit identical to mradermacher's Q4_K_M. It's the only Q4 they have released.
TheCTRL@reddit
I really love your research! It would be very useful for the community to also check other models and maybe put the results on a website.
Because I use and love qwen3-coder-next, can you please repeat the process with this model?
If you cannot, it would be useful to have some sort of script to evaluate model quantizations!
Thanks!
TitwitMuffbiscuit@reddit (OP)
Here we go, it has not been tested extensively, beware! https://github.com/cmhamiche/kld-sweep
TitwitMuffbiscuit@reddit (OP)
Well, I won't test qwen3-coder myself because I mostly do those tests for my own use and I don't use that model, but I can share the Windows scripts if I tidy them up a bit and provide a readme.
To be fair it's nothing out of the ordinary, nothing the man page (or --help) of llama.cpp wouldn't explain with a bit of help from an LLM. There are countless discussions about this process on llama.cpp's GitHub.
Personally, I'm way too lazy to play with regex, and while I manage in bash, PowerShell is completely unknown to me.
Maybe I'll think about a crude UI, it shouldn't be too complicated. No promises.
About the webpage, the thing is, those tests are more of a snapshot than anything; maybe everything will be requantized tomorrow because of a bug found in the template or a new llama.cpp feature, and it will be completely outdated.
I don't think I can manage versioning/revisions/a database, impose a verification measure for the scores, do the PR around the project etc. without ending up utterly bored.
I'll keep you updated if I come up with something easy to run, I'm not a coder in the first place but I'm sure the internet will provide constructive criticism if the stuff is not up to par.
Gueleric@reddit
Thanks for the work! How come for models like bartowski_Qwen3.5-27B-IQ4_XS you show a 14.1GB size when huggingface shows 15.2?
TitwitMuffbiscuit@reddit (OP)
Good question. Hugging Face shows GB while I reported GiB. 15,172,208,160 bytes ÷ 1,073,741,824 = 14.13 GiB
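Same arithmetic in code, if you want to double-check other entries:

```python
# GB (decimal) vs GiB (binary) for the same file size in bytes.
size_bytes = 15_172_208_160
print(f"{size_bytes / 1e9:.2f} GB")     # ~15.17 GB (what Hugging Face shows)
print(f"{size_bytes / 2**30:.2f} GiB")  # ~14.13 GiB (what the tables report)
```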
DistanceSolar1449@reddit
GiB is kind of a bad choice when VRAM is measured in GB
TitwitMuffbiscuit@reddit (OP)
You're getting downvoted but you're making a good point.
I've just reported the size as given by llama.cpp. I'll do a new table later today.
anotheruser323@reddit
Hardware manufacturers use GB of 1024MB, like it should be. That is what you should use, like you did I guess, because that is what matters.
Making base 10 chips is impractical.
TitwitMuffbiscuit@reddit (OP)
Oof, less work to do then. Thanks.
Succubus-Empress@reddit
my meat brain can not do this advanced meth stuff.
Gueleric@reddit
Ah, the classic. Thanks for the reply
Iory1998@reddit
If you download any model from HF, you'll see its size is a bit smaller on your disk.
dtdisapointingresult@reddit
Can you clarify what you mean by this? The quant has identical speed and accuracy as Q8_0? The speed of a Q8_0 but the accuracy of a Q4? What?
I've seen tons of dense models quantized to MXFP4 on HF, are you saying it's all a waste of time? What about NVFP4, is that also a waste of time on dense models?
TitwitMuffbiscuit@reddit (OP)
Both the q8_0 and the mxfp4 files are the same. I don't know the technical reasons for the upcast by llama-quantize, but I've tried it and quantizing a dense model to mxfp4 results in q8_0.
https://huggingface.co/DevQuasar/Qwen.Qwen3.5-27B-GGUF/blob/main/Qwen.Qwen3.5-27B.MXFP4_MOE.gguf
SHA256: 1e7678bbc144226f5c5078a952b412fb323c5f91227234cf2dc8c1139c19490e
Size of remote file: 28.6 GB
blk.0.attn_gate.weight [5120, 6144] Q8_0
blk.0.attn_norm.weight [5120] F32
blk.0.attn_qkv.weight [5120, 10240] Q8_0
blk.0.ffn_down.weight [17408, 5120] Q8_0
blk.0.ffn_gate.weight [5120, 17408] Q8_0
blk.0.ffn_up.weight [5120, 17408] Q8_0
blk.0.post_attention_norm.weight [5120] F32
blk.0.ssm_a [48] F32
blk.0.ssm_alpha.weight [5120, 48] Q8_0
blk.0.ssm_beta.weight [5120, 48] Q8_0
blk.0.ssm_conv1d.weight [4, 10240] F32
blk.0.ssm_dt.bias [48] F32
blk.0.ssm_norm.weight [128] F32
blk.0.ssm_out.weight [6144, 5120] Q8_0
https://huggingface.co/DevQuasar/Qwen.Qwen3.5-27B-GGUF/blob/main/Qwen.Qwen3.5-27B.Q8_0.gguf
SHA256: 98f26008eb136ac8f3b8bc7d6afd8aa0397158b84a2a9f39c247d75deb2dd9db
Size of remote file: 28.6 GB
blk.0.attn_gate.weight [5120, 6144] Q8_0
blk.0.attn_norm.weight [5120] F32
blk.0.attn_qkv.weight [5120, 10240] Q8_0
blk.0.ffn_down.weight [17408, 5120] Q8_0
blk.0.ffn_gate.weight [5120, 17408] Q8_0
blk.0.ffn_up.weight [5120, 17408] Q8_0
blk.0.post_attention_norm.weight [5120] F32
blk.0.ssm_a [48] F32
blk.0.ssm_alpha.weight [5120, 48] Q8_0
blk.0.ssm_beta.weight [5120, 48] Q8_0
blk.0.ssm_conv1d.weight [4, 10240] F32
blk.0.ssm_dt.bias [48] F32
blk.0.ssm_norm.weight [128] F32
blk.0.ssm_out.weight [6144, 5120] Q8_0
dtdisapointingresult@reddit
What you say can't be the general rule:
I've done a KLD test once on Nemotron Nano 3; Noctrex's MXFP4 had the lowest divergence compared to other 4-bit quants. AFAIK that is a standard bf16 model.
TitwitMuffbiscuit@reddit (OP)
I don't understand, you've just mentioned an MoE model.
And no, NVIDIA-Nemotron-3-Nano-30B-A3B-MXFP4_MOE.gguf is not a "standard" bf16 model, look:
Maybe I don't get what you're trying to say.
dionisioalcaraz@reddit
I found that some smaller Q4 quants have slower tg than some bigger ones, which I didn't expect. If you could add a table of relative speeds alongside KLD, it would be awesome as another benchmark to take into account when choosing a Q4 quant. Amazing work anyway, thanks a lot!
Steuern_Runter@reddit
How big is the difference to Q5 and Q6?
metigue@reddit
Love these analyses. Did AesSedai not quant the 27B? I recall his IQ4 being the best for the 35B model.
Digger412@reddit
Hi, no I haven't because I've focused mostly on MoE models. I've gotten a few requests to quant this model but I'm not sure it'll have the same benefits as MoE's do, since this is a dense model. Quantizing the ffns so much works well with a sparsely activated model and I will need to test to see if the same is true for the dense ones.
It's kind of been lower priority though since I've been working on a few other things.
TitwitMuffbiscuit@reddit (OP)
Happy 🍰 day !
Digger412@reddit
Thank you!
TitwitMuffbiscuit@reddit (OP)
Not that I know of.
dinerburgeryum@reddit
Yea guilty. I kept the attention, output and embedding tensors in Q8 (and ssm_out in bf16) since I’m on a 24+16G build and often do long horizon work. Still, I’ll experiment with mradermacher’s Q4 based on your efficiency chart. Thanks as always for putting this together!
TitwitMuffbiscuit@reddit (OP)
I was like, wait a minute... Anyway, thanks for experimenting.
dinerburgeryum@reddit
Actually, sorry to double post here, but I think it's worth highlighting: mradermacher_Qwen3.5-27B.i1-IQ4_XS contains heavily quantized SSM layers, which I've gotta admit I've never known to perform well in downstream tasks. I think it really breaks down these hybrid models to quantize the ssm_alpha and ssm_beta layers. I dunno what this means in terms of benchmarking, but I'm starting to think KLD might not be the perplexity replacement we were hoping for.
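If anyone wants to check which tensors a given file quantizes heavily, the gguf Python package that ships with llama.cpp can list per-tensor types; a quick sketch (file path is illustrative):

```python
# List the quantization type of the SSM tensors in a GGUF file.
# Assumes the `gguf` Python package (bundled with llama.cpp) is installed.
from gguf import GGUFReader

reader = GGUFReader("Qwen3.5-27B.i1-IQ4_XS.gguf")
for tensor in reader.tensors:
    if "ssm_" in tensor.name:
        print(tensor.name, tensor.tensor_type.name)
```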
TitwitMuffbiscuit@reddit (OP)
Feel free to ramble all day long.
I think I might be able to run some different benchmarks on the 9B without spending two days on this. I'll try later this week (or the next) and check different recipes.
Something new like https://github.com/scienceetonnante/eleusis-llm-benchmark
Unless someone else is willing to do 27B and include your quant...
dinerburgeryum@reddit
Huh. Yeah I’m game, that sounds fun. Sounds like a good, interesting way to flex long horizon reasoning too. Let me know if you end up running the bench suite against it I’ll run it as well!
TitwitMuffbiscuit@reddit (OP)
For sure. That will be interesting
dinerburgeryum@reddit
Yeah I’m excited to throw some of these slimmer quants at my current task set. Hopefully ik will fix the current mmproj issues with 3.5 I wanna come home dude haha.
pmttyji@reddit
Once again, thanks for posting detailed threads like this. Glad to see IQ4_XS (my favorite quant due to lower VRAM use) is not at the bottom of those tables.
Long live IQ4_XS!
TitwitMuffbiscuit@reddit (OP)
Long live IQ4_XS! Any lower and I'd be asking myself if I shouldn't rename the model "flash" or "broke_edition".
tarruda@reddit
Also love the IQ4_XS. Seems like this one could work on a 16G VRAM card, curious to see how well it performs on a RTX 5060 Ti.
pmttyji@reddit
It should. Use the -fit flags & set the KV cache to Q8.
munkiemagik@reddit
You're a gem mate. some of us really need to see stuff like this. Thanks.
This might be just the post I needed to jump-start me back into figuring out how to run similar comparative tests. I started looking into this casually several months back but got distracted and never went back to it. What I'd love to be able to do is get qualitative comparisons across a range of different parameters with different quantisation levels.
Unfortunately you often find tests for the specific model you are interested in, but it's only pp/tg being reported; or if it is a more qualitative comparison of model vs model, it's never the model variant you can fit, it's always the full or 'wrong' weights.
Though it looks like I need to immerse myself a bit more in LLM academia first to get a handle on some of the principles you were talking about. For example, I've come to accept that I'm looking for lower KL divergence, but what does that actually mean? I couldn't explain it properly to someone because I still can't really explain it to myself. I'm still at the 'number bigger or smaller' level of comprehension.
TitwitMuffbiscuit@reddit (OP)
It is a rabbit hole, and it's worse with benchmarks. Like, which one is not completely saturated by recent models and is representative of the type of tasks I run, is it qualitative or are there bad/vague questions in the dataset, which is the latest, which is the quickest to run. Eval is hard.
PaMRxR@reddit
Do different sampling parameters (temp, top-p/min-p) have an effect on these benchmarks? It would be great if you also published the parameters you used.
TitwitMuffbiscuit@reddit (OP)
You can't change those settings with llama-perplexity.
https://manpages.debian.org/unstable/llama.cpp-tools-extra/llama-perplexity.1.en.html
Yeah, I wanted to keep it short, but you're not wrong; I'm on Windows but I could have uploaded some logs to GitHub and linked them at the end of the post. I'll keep that in mind.
I'll get you the script I used as soon as I'm at the computer.
Ok-Measurement-1575@reddit
Did you really do all this work on a 3060?
Fairplay!
TitwitMuffbiscuit@reddit (OP)
Yeah, I've been waiting for the results for ages... In the meantime Qwen released 3 other models and fired their employees.
moahmo88@reddit
Great! Thanks!
LetterRip@reddit
Any particular reason for your efficiency score formula? They seem mostly similar in size so there seems little hope for fitting more layers or a speed boost from the marginally smaller models.
Gringe8@reddit
If you have a 16GB card you won't be able to fit the Q4_K_M size, but you could fit the iq4ks with decent context. Also, even a GB or two saved with Qwen 3.5 can get you a lot of extra context.
Tasty-Butterscotch52@reddit
I am running it on a 3090 and it's a bit slow. The VRAM usage goes up to 22GB... I am still playing with the settings in OpenWebUI trying to get it to be a bit more efficient. Also, I'm struggling with websearch... the model refuses to use websearch. All other models, such as gemma3, will use websearch just fine...
Gringe8@reddit
I get 2000 pp and 28 tg with 48gb vram on q8. Maybe some of it is spilling into your system ram.
Gringe8@reddit
If you can't hold the whole thing in VRAM, I'd try the 35B or one of the smaller dense models.
TitwitMuffbiscuit@reddit (OP)
Yeah, it's definitely more relevant for quants twice the size apart; it's more an assessment of the recipe used to quantize. It's also useful for spotting outliers, since people might think that bigger = better, which is not always the case.
Carbonite1@reddit
These are SUCH high quality posts, good data and presented really well, helping us all make good choices. Thank you!!
naxneri@reddit
I really liked this one: sokann/Qwen3.5-27B-GGUF-4.165bpw, 13.6GB, 39 t/s with 18k context and 22 t/s with 20k~24k.
-_Apollo-_@reddit
Can you check some of the opus 4.6 distills too?
-_Apollo-_@reddit
Wow, thank you
Gringe8@reddit
Thanks for this. Hopefully it translates similarly to the 122B model. I was torn between Q4_K_M and iq4ks since the latter is faster for me. Now I know the quality isn't much different.
TitwitMuffbiscuit@reddit (OP)
Unfortunately, it's really not generalizable, it's for this model and those quants specifically.
CATLLM@reddit
Thank you 🙏 I find these tests very useful.