Who is your favourite quant publisher and why?
Posted by No_Algae1753@reddit | LocalLLaMA | 64 comments
Hey everyone,
I’ve been a big fan of Unsloth for several reasons:
- They publish models ASAP after release.
- They usually offer the lowest PPL.
- Their website has tons of helpful tutorials and documentation.
Recently, I stumbled upon this Reddit thread suggesting trying out Mudler's Apex MoE quants instead:
👉 https://www.reddit.com/r/LocalLLaMA/comments/1t3n6jo/apex_moe_quants_update_25_new_models_since_the/
So I decided to test it myself. I tried running Qwen3.5 122B IQuality, which is roughly the same size as Qwen3.5 122B Q4_K_XL. So far, I haven't noticed a difference in output quality between the two on real-world tasks, so I ran a single GSM8K benchmark, and Unsloth was slightly better.
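If anyone wants to reproduce this kind of comparison, here's roughly what the perplexity side looks like - a minimal sketch assuming you have llama.cpp built and both GGUFs downloaded (paths are placeholders, and the exact output format can vary by llama.cpp version):

```python
# Minimal sketch: compare two GGUF quants by perplexity with llama.cpp's
# llama-perplexity tool. Model paths and the test file are placeholders.
import re
import subprocess

MODELS = {
    "unsloth_Q4_K_XL": "models/Qwen3.5-122B-Q4_K_XL.gguf",
    "apex_IQuality": "models/Qwen3.5-122B-IQuality.gguf",
}
TEST_FILE = "wiki.test.raw"  # any representative text corpus

for name, path in MODELS.items():
    out = subprocess.run(
        ["llama-perplexity", "-m", path, "-f", TEST_FILE],
        capture_output=True, text=True,
    )
    # llama-perplexity prints a final "PPL = <number>" estimate when done
    matches = re.findall(r"PPL\s*=\s*([0-9.]+)", out.stdout + out.stderr)
    print(name, "PPL:", matches[-1] if matches else "not found")
```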
So I'm asking you now: who is your favourite publisher, and why?
QuickExpert@reddit
In my case, Bartowski was roughly 15% faster than Unsloth. The quality differences are indistinguishable.
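For what it's worth, this is roughly how I'd measure that kind of difference - a quick tokens-per-second timing with llama-cpp-python (the model path is a placeholder; llama.cpp's llama-bench tool does the same with less code):

```python
# Rough sketch: measure generation speed of a GGUF with llama-cpp-python.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/some-quant.gguf", n_ctx=4096, verbose=False)

start = time.perf_counter()
result = llm("Explain how binary search works.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = result["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```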
marscarsrars@reddit
Wait fr
dampflokfreund@reddit
Gemma 4 26B and the 35B MoE also take less VRAM than their Unsloth counterparts. IDK what's up with that.
VoiceApprehensive893@reddit
Depends on the quant recipe, I guess; the Unsloth IQ3_XXS is smaller.
marscarsrars@reddit
Doesn't the precision and accuracy loss below Q4 make these quants unusable?
VoiceApprehensive893@reddit
Sometimes a Q3 is the smartest model you can fit.
Q3s are small enough to justify the damage.
relmny@reddit
Unsloth (for your same reasons), Bartowski, and Ubergarm (for the biggest models with ik_llama.cpp).
Kahvana@reddit
I've been liking Unsloth models less, and Bartowski models more, over the past few months.
I like that Bartowski's imatrix data is (mostly) public, and there is a speed difference between the quants on my weaker hardware. Bartowski also still provides Q1 quants without removing them after release.
Total_Activity_7550@reddit
Bartowski never rushes a release to be the first upload and reap download counts. Unsloth always does that, and so they often release "fixes" later (without admitting that they let garbage out in the first place).
danielhanchen@reddit
We again want to clear up multiple misunderstandings around our GGUF updates. Some people have said we re-upload often because of our own mistakes - this is false - the majority of issues originate with the labs themselves or the implementation backing them.
We just like to publicize them, so it "seems" like it's our problem, but it's NOT.
1) Mistral Medium 3.5 Fix
We worked with Mistral to fix a config.json RoPE parsing bug in Mistral Medium 3.5 - EVERYONE had this issue, so everyone had to re-convert - but we were the ones who collaborated with Mistral to fix it. See https://www.reddit.com/r/LocalLLaMA/comments/1t1itn1/unsloth_solved_bug_in_mistral_medium_35/
2) MiniMax 2.7 NaNs
We found NaNs in 38% of Bartowski's quants (10/26) and 22% of ours (5/23). We identified a fix and patched ours ages ago - see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ - while Bartowski has not uploaded fixed quants yet, he is actively working on it. (A rough sketch of how to scan a GGUF for NaNs yourself is at the end of this comment.)
3) Gemma 4 was re-uploaded 5 times
Three were due to about 10 to 20 llama.cpp bug fixes, some of which we helped investigate and contribute fixes for. The fourth was an official Gemma chat template improvement from Google. Every provider had to update, not just us. See the llama.cpp PRs, which show ~30 fixes/improvements for Gemma 4.
See https://www.reddit.com/r/LocalLLaMA/comments/1sqrl1l/gemma_4_26ba4b_gguf_benchmarks/ for new benchmarks
4) Qwen3.5 SSM issues
We shared 7TB of research artifacts showing which layers should not be quantized. The issue was not that providers’ quants were broken, but that they were not optimal - mainly around `ssm_out` and `ssm_*` tensors. We have since improved ours and now lead on KLD vs. disk space for Qwen3.5 as well.
Most if not all quant providers then took our findings and updated their quants. We talked about our analysis and research at https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwen3535ba3b_unsloth_dynamic_ggufs_benchmarks/ and https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/final_qwen35_unsloth_gguf_update/
5) CUDA 13.2 is actually broken
This causes some low-bit quants of all models to output gibberish. Some people have dismissed it as a non-issue, but NVIDIA has confirmed it's a problem and a fix is coming in CUDA 13.3. See Unsloth Issue 4849, llama.cpp issue 21255, and issue 21371.
As a temporary workaround, use CUDA 13.1. See https://github.com/ggml-org/llama.cpp/issues/21255#issuecomment-4248403175 (quote from https://github.com/johnnynunez)
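As mentioned under (2), here's a rough sketch of how you can scan a GGUF for NaNs yourself using the gguf Python package (pip install gguf; the dequantize helper assumes a recent gguf-py, and this is a starting point, not our exact internal tooling):

```python
# Rough sketch: scan a GGUF file for NaNs by dequantizing each tensor.
import numpy as np
from gguf import GGUFReader
from gguf.quants import dequantize

reader = GGUFReader("models/some-quant.gguf")  # placeholder path
for tensor in reader.tensors:
    # tensor.data is the raw (possibly quantized) block data;
    # dequantize() expands it to float32 so NaNs become visible
    values = dequantize(tensor.data, tensor.tensor_type)
    if np.isnan(values).any():
        print("NaNs in", tensor.name)
```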
Total_Activity_7550@reddit
This doesn't address the fact that you always rush releases without testing; people then download tens of gigabytes of data, only to have to redownload everything. I stopped making that mistake. I'm not against your work - it's great, and I guess your compute resources are also great - but the strategy isn't nice. You trade being on the "Trending" list on Hugging Face for reliability.
RobotRobotWhatDoUSee@reddit
Huge fan, keep it up!
Digger412@reddit
Did you happen to spot the MiMo-V2.5 (non-Pro) layer 47 `ffn_down_exps` issue at Q4/Q5? I had to quantize that tensor to Q6_K; otherwise I was getting NaNs on those quants in my Q4_K_M.
BraceletGrolf@reddit
I for one love your work, so please keep it going.
danielhanchen@reddit
Thanks :)
Kahvana@reddit
I mean, it's unavoidable, with how complex LLMs and llama.cpp are, that there will be issues on day 1. It's nice that there is an option for those who want to run models on day 1.
With the recent release cycle of Qwen3.5, Gemma 4, and Mistral Medium 3.5, I've realized that stability is more appealing to me, so now I'd rather wait a week or two before trying a model.
voyager256@reddit
That's quite unfair - it's mostly not their fault, as these are various teething issues unrelated to Unsloth or any other provider. You could argue they could put a big warning/disclaimer on the model page for the first couple of weeks and track a history of fixes. OK, but they already do a lot, and for free, so…
danielhanchen@reddit
There is a weird misunderstanding going on that these issues are only Unsloth-related or that somehow we caused them - it's because we publicize them that it looks like it's "our" fault - see https://www.reddit.com/r/LocalLLaMA/comments/1tc588v/comment/olnpx8h/
No_Algae1753@reddit (OP)
Yes, this approach has pros and cons. I'd still rather have models that aren't 100% working than none at all, and oftentimes it's also llama.cpp or the model publishers themselves (e.g. Mistral).
danielhanchen@reddit
We don't publish Q1 quants for dense models because they're not useful and loop after a few turns - KLD and PPL are parabolic at these bit rates - so we decided not to provide them, depending on the model. We specifically do not want folks to use useless quants. We still do it for the large ones like https://huggingface.co/unsloth/MiMo-V2.5-Pro-GGUF
We publish our imatrix final weights - e.g. https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/blob/main/imatrix_unsloth.gguf_file - so you can plug this into any quantization process.
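A minimal sketch of plugging a published imatrix into llama.cpp's llama-quantize (file names are placeholders):

```python
# Minimal sketch: requantize a full-precision GGUF using a downloaded
# importance matrix with llama.cpp's llama-quantize tool.
import subprocess

subprocess.run([
    "llama-quantize",
    "--imatrix", "imatrix_unsloth.gguf",  # the published importance matrix
    "model-bf16.gguf",                    # full-precision GGUF input
    "model-Q4_K_M.gguf",                  # quantized output
    "Q4_K_M",                             # target quant type
], check=True)
```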
We're the best in KLD and PPL for Qwen 3.6, Gemma 4, and many other quants - see https://www.reddit.com/r/LocalLLaMA/comments/1sqrl1l/gemma_4_26ba4b_gguf_benchmarks/ and https://www.reddit.com/r/LocalLLaMA/comments/1so5nrl/qwen36_gguf_benchmarks/
We recently helped Mistral fix a bug in Mistral Medium 3.5 - see https://huggingface.co/mistralai/Mistral-Medium-3.5-128B/discussions/18
MiniMax 2.7 - We found NaNs in 38% of Bartowski's quants (10/26). Bartowski STILL hasn't uploaded fixed quants - so 38% are still broken - while we already fixed ours ages ago - see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/
There is a weird misunderstanding going on that these issues are only Unsloth related or somehow we caused them - it's because we publicize them that it looks like it's "our" fault - see https://www.reddit.com/r/LocalLLaMA/comments/1tc588v/comment/olnpx8h/ for more details
Kahvana@reddit
Your comment is oddly defensive. I simply stated why I've preferred Bartowski these past few months. Did I say something mean to you? If I did, I sincerely apologize.
Don't get me wrong, I'm happy to receive the work you've provided for free and understand it's a luxury position to be in after having made quants myself.
Public_Umpire_1099@reddit
Bartowski is awesome for keeping the imatrix data out in the open. Not only that, but he also keeps the datasets for making your own imatrix out in the open. I recently needed to quantize a fine-tune of Qwen 3.6 35B and make it MTP compatible. Between his datasets and some of the community ones out there that avoid destroying the MTP heads, I was able to produce a Q8, an IQ4, and a Q4_K_M with relative ease; they work fantastically and are extremely accurate. Legitimately the best quantized MTP Qwen 3.6 35B I've used so far, and not just out of bias.
Constant-Simple-1234@reddit
ByteShape for speed and quality, but they take their time and publish only a few selected models. Then Unsloth for size, quality, docs, and transparency (and all the work).
TheGlobinKing@reddit
AesSedai and Bartowski
ttkciar@reddit
Bartowski. He doesn't quant all of the models I'd like, but he quants the must-haves / best of them, and his quants are fairly reliable.
If he hasn't quantized a must-have LLM yet, it's probably because he's waiting for llama.cpp to iron out some support issues. Sometimes he has to requant because of upstream bugs, but usually not.
I especially like that Bartowski frequently (though not always) publishes a bf16 GGUF, which I like to download against the eventuality of llama.cpp regaining its native training feature. Also, once upon a time you used to be able to convert bf16 GGUFs back to safetensors, but the script for doing that broke about a year ago due to GGUF format changes, and I don't know if it's feasible to try to fix it.
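The core idea is simple enough to sketch, though this only handles tensors gguf-py hands back as plain float arrays, and it skips the GGUF-to-HF tensor-name remapping a real converter would need - so treat it as a starting point, not a working replacement for that script:

```python
# Rough sketch of the GGUF -> safetensors direction. Quantized/BF16 tensors
# and HF tensor-name remapping are left out.
import numpy as np
from gguf import GGUFReader
from safetensors.numpy import save_file

reader = GGUFReader("model-f16.gguf")  # placeholder path
tensors = {}
for t in reader.tensors:
    data = np.asarray(t.data)
    if not np.issubdtype(data.dtype, np.floating):
        raise RuntimeError(f"{t.name}: needs dequantizing first")
    # GGUF records dimensions in reverse order relative to numpy convention
    tensors[t.name] = data.reshape(tuple(reversed(t.shape.tolist())))

save_file(tensors, "model.safetensors")
```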
If a model is too niche for Bartowski to quantize, I go to mradermacher.
meganoob1337@reddit
cyankiwi makes very good AWQ quants, and is pretty fast with new models!
If you read this, thank you!
rainbyte@reddit
Here I'm using Qwen3.6 AWQ quants by Cyankiwi, thank you so much Cyankiwi 🤗
No_War_8891@reddit
QuantTrio for vLLM stuff
meganoob1337@reddit
cyankiwi is also pretty nice
rainbyte@reddit
Yeah, I like both :)
No_War_8891@reddit
true 😎
Hipponomics@reddit
Liking the practices of one quant publisher above another is fine, but you should be aware of the context they work in.
All these quants are built on the old quantization technology that Iwan Kawrakow made a few years ago. He forked llama.cpp and has been making new quant types since then. ubergarm and a few others publish the newer IQn_K quants, which are considerably better than any quants that work with mainline llama.cpp - this includes both Unsloth's and Mudler's quants.
The fact that Iwan left llama.cpp to make his own fork is one of the biggest losses to happen to local LLM inference. We would all be using his latest IQ_KS and IQ_KT quants if it weren't for that. If someone figures out a way to resolve the conflict between Georgi and him, they would be doing the community an enormous service!
RobotRobotWhatDoUSee@reddit
Unsloth and bartowski.
Wrt Unsloth, it was amazing seeing papers about dynamic quants, thinking "someday we will get high-quality low-bit quants," and then Unsloth operationalized high-quality low-bit quants much earlier than I expected.
The first time I ran llama 4 scout on a laptop and it almost passed my personal code test with an unsloth UD2 quant, it felt like a magical glimpse into the future. Incredible.
Ok_Mine189@reddit
The Bloke.
roosterfareye@reddit
What happened to The Bloke? I see the name on many aging GGUFs...
Equivalent-Repair488@reddit
He was the pioneer - the mradermacher+bartowski+unsloth before they even existed, the one and only quant provider. When he departed, many like the above attempted to fill the void.
He stopped around Feb 2024, I believe, because he has his own company to run now. But back then he was the only mass quant provider for GGUFs and other formats.
o0genesis0o@reddit
I download models from The Bloke, Bartowski, and Unsloth.
But sometimes I also download the official quants from the llama.cpp team as ground truth.
mantafloppy@reddit
Anyone but Unsloth.
I don't like having to redownload my GGUFs 6 times.
Yorn2@reddit
lukealonso has made one of the best MiniMax M2.7 quants for my current use case. mratsim made one before that for M2.5. Aes Sedai as well. Basically any of the guys making models in the RTX6kPRO Discord are genius-level model creators.
VoiceApprehensive893@reddit
Been using Unsloth models, though I don't really see a difference between them and something like Bartowski or mradermacher quants.
grumd@reddit
Unsloth did some KLD benchmarks, and the APEX quants came out worse than other quants.
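(For anyone unfamiliar: KLD here is the KL divergence between the quant's next-token distribution and the full-precision model's, averaged over tokens. A toy numpy sketch of the metric itself - real comparisons typically get the distributions from llama.cpp's perplexity tooling:)

```python
# Toy sketch of the KLD metric: mean KL divergence between the full-precision
# model's next-token distributions and the quant's, over the same tokens.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(logits_fp, logits_q):
    """Both arrays have shape (n_tokens, vocab_size), from identical prompts."""
    p = softmax(logits_fp)  # reference (full-precision) distribution
    q = softmax(logits_q)   # quantized model's distribution
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return kl.mean()

# Stand-in random logits just to show the shape of the computation
rng = np.random.default_rng(0)
fp = rng.normal(size=(4, 32000))
print(mean_kld(fp, fp + rng.normal(scale=0.1, size=fp.shape)))
```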
Makers7886@reddit
Lately I look for AesSedai on day 0-ish (maybe biased, as I'm a fan of the WoT books), then ubergarm a bit later when I'm after the most capability per GB and not using vLLM.
Digger412@reddit
AesSedai here - yep, I'm a fan of the books too ;)
Thanks for checking out my quants!
json12@reddit
Your quants are awesome. One suggestion I'd like to make: could you also do lower quants (Q1s, Q2s, etc.)?
Digger412@reddit
Usually I point people to Ubergarm at that range, since his ik_llama quants provide better quality at that bpw. But I can look at adding an IQ2 range to my lineup for people who want to stick to mainline llama.cpp, sure!
itssethc@reddit
Bartowski first, Unsloth if needed.
xaocon@reddit
It's not that hard to make your own quants with your own imatrix dataset; you don't need a ton of data to do a good job. If you really don't want to do that, you may need to try a few publishers to figure out the best. Unsloth and Bartowski are good options to start with if you want a GGUF.
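The whole pipeline is basically two llama.cpp commands; here's a rough sketch (file names are placeholders, and calibration.txt is whatever sample text matches your use case):

```python
# Rough sketch of rolling your own imatrix quant with llama.cpp's tools.
import subprocess

# 1) Build an importance matrix from your own calibration text
subprocess.run([
    "llama-imatrix",
    "-m", "model-bf16.gguf",
    "-f", "calibration.txt",
    "-o", "imatrix.dat",
], check=True)

# 2) Quantize using that imatrix
subprocess.run([
    "llama-quantize",
    "--imatrix", "imatrix.dat",
    "model-bf16.gguf",
    "model-IQ4_XS.gguf",
    "IQ4_XS",
], check=True)
```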
Yes_but_I_think@reddit
Both are great people; do you have an ulterior motive in choosing and comparing them?
Kat-@reddit
Just AesSedai. I take a bf16 abliterated version of a model I want from p-e-w or llmfan46 and quantize it to GGUF using the scheme AesSedai creates for the same model.
Past-Economist7732@reddit
Ubergarm and Aesedai are my goats
Digger412@reddit
<3
a_beautiful_rhind@reddit
ubergarm, aesedai, bartowski and mradermacher
Probably misspelled some of the names.
Mordimer86@reddit
DavidAU has some decent work. Some pretty weird builds, but also a way better quant of Qwen3.6 27B than Unsloth's. Compared to the Unsloth IQ4_XS, his doesn't get stuck in a loop with Opencode right at the end of a task.
I've also checked perplexity, and his quants seem to be better than Unsloth's too.
astope909@reddit
All of them except Unsloth. Unsloth are sleazy and shitty and constantly go around making advertisement posts about nothing. They are clearly trying to build some brand they can sell when they have nothing. Their MoE quants are ass because they do shit like quant the shared experts, which are used on every token. Always use Aes or Ubergarm over them. Their regular quants aren't special; Bartowski's are just as good. They claim they have "special sauce dynamic" quants, but there's only one thing they do that Aes, Bartowski, and Ubergarm don't: Unsloth does not tell you their quanting recipes. Even IK shit on their claims about "dynamic" quants in an ik_llama help thread asking for Unsloth support. They constantly feel the need to tell everyone about the "fixes" they push, which are 99% of the time tiny chat template updates that anyone could have done. Aes has PR'd llama.cpp for model support, like Kimi 2.5's vision encoder, and he just added MiMo 2.5 support with vision. What the hell has Unsloth done other than make some quants early and slap names on them to make them look fancy? And they use this subreddit, so I half expect them to find this comment and shove a chart in my face that's probably misleading like their others, or one that shows flat out that Aes's quants are better.
RAZA_2666R@reddit
Honestly, hard to beat Unsloth for speed and stability. Their documentation alone saves so much headache. I’ve noticed similar things with GSM8k results on other quants too; Unsloth just tends to hold up better on logic tasks.
No-Juggernaut-9832@reddit
Black sheep AI for Apple MLX
Mr_Moonsilver@reddit
Cyankiwi for his AWQ models, and QuantTrio.
Bulky-Priority6824@reddit
I like Apex, but for 3.6 it passes think tags through to Frigate GenAI regardless of any attempt to prevent it, so that's a dealbreaker. For 3.6 I've only tried Unsloth and it's been superb out of the box: 3.6 35B A3B Q4_K_XL.
0-0x0@reddit
This guy's video made me use the Qwen3.6 Q4_K_M from LM Studio. Although I could run the Q8_K_XL from Unsloth reliably, for this model I'm just playing around with the Q4_K_M, since it's obviously faster and doesn't give up much on quality, tbh. https://www.youtube.com/watch?v=ONQcX9s6_co
Hot_Turnip_3309@reddit
Sometimes Unsloth hits it out of the park, but I swear they don't test very well; for example, their 3.6 35B isn't very good. The Bartowski one is good. But I am using APEX for this model.
OutrageousMinimum191@reddit
Generally I prefer to download and store the original transformers models and quantize them myself, except for very big models; in that case I prefer downloading Unsloth's low-bit GGUFs.
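My usual flow is just llama.cpp's converter plus the quantizer - roughly like this sketch (paths are placeholders):

```python
# Rough sketch of the quantize-it-yourself flow: convert the original
# transformers checkpoint to GGUF, then quantize.
import subprocess

# 1) Convert the HF/transformers checkpoint to a full-precision GGUF
#    (convert_hf_to_gguf.py ships in the llama.cpp repo)
subprocess.run([
    "python", "convert_hf_to_gguf.py", "path/to/hf-model",
    "--outfile", "model-f16.gguf", "--outtype", "f16",
], check=True)

# 2) Quantize down to the size you want
subprocess.run([
    "llama-quantize", "model-f16.gguf", "model-Q5_K_M.gguf", "Q5_K_M",
], check=True)
```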
nickm_27@reddit
From what I understand, the realistic situation is that there's going to be very little, if any, difference in actual output behavior/quality between publishers at the same quant level. Some prioritize speed while trying to maintain as much quality as possible, and others prioritize quality at a given model size.
Personally, I have been using Unsloth, as they provide recommended llama.cpp parameters that usually work well for me, and I haven't had a good enough reason to try another publisher, since it seems like at the end of the day the results will at best be very similar.
JLeonsarmiento@reddit
This guy here:
https://huggingface.co/leonsarmiento
Solid quants for Apple.