Are Unsloth models as good as I read?
Posted by denis-craciun@reddit | LocalLLaMA | View on Reddit | 215 comments
Has anybody done any comparisons between the models that Unsloth offers and their counterparts?
For example: I've been using qwen3.6:35b-a3b Q4_K_M, and on my MBP 64GB I get around 39 t/s
Using Unsloth Studio, unsloth/qwen3.6:35b-a3b UD-Q4_K_XL I get around 57 t/s
The difference in speed is significant. From what I've understood, Unsloth runs a per-layer sensitivity analysis and assigns different quantization levels depending on how "important" each layer is. This obviously makes the model smaller, and from what I've been reading, the model should even perform better.
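If you want to sanity-check the per-layer mix yourself, here's a rough sketch. It assumes the `gguf` Python package that ships with llama.cpp, and the filenames are just placeholders for whatever you downloaded:

```python
# Rough sketch, not Unsloth's actual recipe: just counts the per-tensor quant
# types inside two GGUF files so you can see the mix for yourself.
# Assumes the `gguf` package from llama.cpp (pip install gguf);
# filenames are placeholders.
from collections import Counter
from gguf import GGUFReader

def tensor_type_histogram(path: str) -> Counter:
    reader = GGUFReader(path)
    return Counter(t.tensor_type.name for t in reader.tensors)

print(tensor_type_histogram("qwen3.6-35b-a3b-UD-Q4_K_XL.gguf"))
print(tensor_type_histogram("qwen3.6-35b-a3b-Q4_K_M.gguf"))
```

On a dynamic quant you'd typically see a spread of tensor types rather than one uniform level.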
What are your experiences?
emprahsFury@reddit
No they're not better, they're just different. A q4 quant is really just a q4 quant. Every quant maker (bc everyone uses llama-quantize...) does "per layer" quants. And uses an imatrix and blah blah blah. Everyone is doing what unsloth does. What you see and hear is the parasocial relationship people think they have with the Unsloth creators because they are active in this subreddit, and of course Unsloth's full court press marketing on this sub.
DeepOrangeSky@reddit
But, as for how high or low you quantize the different layers or portions of the models (and not just the main weights, but also some of those other half dozen to a dozen or so other things), the different quant makers often use vastly different quantization levels from one another on the different portions. Which can potentially make a huge difference in how the models perform, especially for long-context scenarios.
Personally I would actually prefer if more quant-makers defaulted to using 16-bit/full precision for the really small portions of the models that make up like 1% of their overall file-size or less, rather than quanting some of those parts down to q3 or whatever when making a q3 quant (since the size benefits are so minuscule, but who knows how much damage is being done when quantizing the miscellaneous small or tiny pieces so hard, rather than just the main weights), even if it means the model is 1 or 2 percent bigger in the end. It is very difficult to find any major quantmakers that do that, though. I think Bartowski seems to try to keep the fragile pieces at Q8, and AesSedai seems to try to keep them at Q6 or something (both of which are higher than some other main quantmakers that sometimes bring some of those parts down to just the overall quant level, like Q3 for a Q3 or Q2 for a Q2 or whatever, meaning that Bartowski and Aes both seem relatively careful about it compared to some of the other quantmakers, just to be clear), from the models of theirs whose quant info I looked at on huggingface, so, at least there's that.
But, yea, it seems to be an awkward tug of war where on the one hand the main quant-makers want to be able to show that they have the smallest PPL or KLD per size of a quant, on those charts that show and compare them at that, but this can potentially come at the sacrifice of long-context real-world performance, depending on the model and on exactly which things get quanted down to exactly what levels.
Ideally there would be some needle-in-the-haystack tests done for these main quants, at decent sample sizes, for the most popular models, and a main quant-maker like UD could maybe make a "needle optimized" quant of the main models that emphasized maxxing out long-context performance rather than just trying to get the best KLD score per file size. They could of course still release the regular ones too, but having one "Needle Edition" quant at the main numerical quant level(s) (i.e. at Q2 and/or Q3 and/or Q4, etc) (not as important for Q6 quants and above, since the whole thing is quanted at Q6 and above anyway) would be pretty cool, imo.
I guess I will ping u/noneabove1182 (Bartowski) and u/Digger412 (AesSedai) to see what they think about making at least one "Needle Edition" quant (especially at like Q3 or Q4) for the main models.
Doing it for every quant-level of every model ever would probably be pretty annoying for them, and use up more time, money, resources, clutter, etc, but maybe for the half dozen or so most popular models in existence (Gemma4, Qwen3.6, etc), at their most high-priority Q3 and/or Q4 level, it could be worth having one "needlefinder-optimized quant" with the most potentially-sensitive little pieces kept at 16-bit/full precision, since those pieces are small enough that it barely affects overall size, while quantizing them has who-knows-what consequences for long-context use in the real world. I know for the really tiny pieces, everyone keeps some of those things at 16-bit or 32-bit, but I'm talking about the slightly more in-between (small, but not ultra-small) pieces that are often put all the way down to Q3 for a Q3 quant or Q2 for a Q2 quant by even some main quantmakers, when it wouldn't make the model much bigger to keep them at 16-bit or whatever.
This way they could still have their main quants that perform great on the KLD-vs-size charts, business as usual, but also have one weird "needle edition" quant that ignored that stuff and just took the most better-safe-than-sorry approach possible to all the non-main-weight quantization levels.
emprahsFury@reddit
this is what I would label one of the parasocial relationship seeking comments. Why are you dragging those two into our conversation? Let them participate in the sub on their own, and as they want to. Maybe you are friends with them and calling their name is an artifact of an actual personal relationship, but I doubt it.
DeepOrangeSky@reddit
Bartowski is the quant-maker whose quants I use the most so far, and I also use Aes' quants sometimes. When some of these medium-small portions of the quants get compressed it can genuinely affect long-context performance in real-world use, so I was genuinely curious to ask them whether a maxxed-out quant that kept some of those parts at full precision, rather than even Q8, was something they'd be willing to try. If I knew how to make quants myself, I would have just done it myself, but I don't know how to make quants yet.
I can see why it looked that way to you, but that is my honest reason I wanted to ask them about it (especially Bartowski, since I use his quants the most).
My bad if it came off weird
Digger412@reddit
Hi, I usually keep the majority of the model in Q8 (or FP16/BF16 if it's something SSM-related, since those are super duper sensitive) unless I'm doing a Q2 or very low Q3 quant. As you say, the majority of the weights are in the routed expert FFNs and they make up the overwhelming majority of the model by weight.
This paradigm (keeping non-FFN tensors in high bpw quantizations) breaks down on non-MoE models though, I've tried it on a couple of dense models in the 30B range and it underperforms the baseline normal quantization recipes for the size. The sparsity of MoE models is what makes this trick work I think.
RE: needle-optimized quants, I do want to do more benchmarking with my quants. I've recently upgraded my system to 8x RTX 6000 Pros so I have a lot more bandwidth for research/experimentation/benchmarking now. As I mentioned at the start, usually it's only my ~Q2-level quants that don't have Q8_0 as the default type though, and I usually post the recipe "mixture" for `Default Type / FFN Up / FFN Gate / FFN Down` on the page. So I don't know if there's much more juice to squeeze there basically.
The other possibility is moving to more varied quant levels for the FFNs, some people like Goldkoron / Thireus / Eaddario have been doing work on measuring per-tensor quantization error in the hopes of squeezing out more quality per bit via preserving important tensors in higher bpw too.
DeepOrangeSky@reddit
Thx for the reply. As for using Q8 rather than 16 for some of the medium-small stuff (I think Bartowski does this as well, and then most others go lower, other than Diner and maybe Noctrex, for various quants), are you pretty sure the jump from 8 to 16 won't do much for some of these things? Some of them are small enough it seems it might be worth trying in the "needle optimized" one. The only thing that makes me wonder about if it would ruin things is, I'm a pretty big noob so maybe I misunderstood, but when I was asking Diner about it a while back, I think he mentioned that if you use 16 bit for some of the things, the quant itself might not run on like llama.cpp or something because it isn't set up right for it or something.
Also, when I asked some AIs about how the SSM_BA thing even happened (the one Diner made that thread about a while back for the qwen coder quants), like, why such a sensitive piece of the model was being quantized so heavily in the default recipe or whatever it's called, they said it was basically llama.cpp's fault: when new architectures come out, the default quant-making settings don't always get updated, so it ended up treating SSM_BA.weight as if it was a main weight, instead of protecting it at like F16 or whatever. So the best way to solve this overall issue would be to talk to whoever maintains llama.cpp's default quantization settings and ask them to be more careful not to crush the small pieces so much by default. This way, when the main quant-makers on huggingface quickly make their quants of new models on release day, we wouldn't see things like SSM_BA at Q3 or whatever, and models falling apart in long context. But I'm just some random noob who doesn't even know how to use a CLI much yet, let alone anything about making quants or using llama.cpp etc, so I don't even know where to begin with who or where to ask around about that. I do know from using LLMs that it clearly matters, since it actually does seem to affect performance even in my casual noob use cases, and it seemed crazy to me that stuff like that was actually happening, and millions of people were downloading basically half-destroyed quants that could've been way better if they were just like 1% bigger, lol. It still kind of blows my mind, even as a noob, but anyway, I'm not really sure how to help out yet since I'm too much of a noob to know what to do about it.
Digger412@reddit
You might be interested in this discussion: https://github.com/ggml-org/llama.cpp/discussions/20522
This is essentially the main entrypoint for quant level selection: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-quant.cpp#L411 and there already is consideration for things like "is it a 1D tensor? Just use F32" and other things like that. There was something in there about SSM weights but I don't recall where offhand.
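Very loosely paraphrased (this is NOT the real llama-quant.cpp code, just the flavor of per-tensor override rule that function layers on top of the requested quant level):

```python
# Loose paraphrase only -- not the actual llama-quant.cpp logic, just the
# shape of the per-tensor rules applied on top of the requested level.
def pick_tensor_type(name: str, n_dims: int, requested: str) -> str:
    if n_dims == 1:
        return "F32"        # norms, biases, etc. are tiny; keep full precision
    if name == "output.weight":
        return "Q6_K"       # output head commonly gets bumped above the base level
    # ... many more per-tensor special cases live in the real function ...
    return requested        # everything else: the level you asked for
```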
DeepOrangeSky@reddit
Thanks, I'll check it out. Also, thanks for making the quants that you make, much appreciated!
wouldveshouldvebot@reddit
Hi! I just wanted to let you know that you said "could of" when the correct spelling is "could've" which is actually a contraction of "could have." Hope this helps!
I am a bot, and this action was performed automatically.
DeepOrangeSky@reddit
Thanks. But even more usefully, maybe tell your creator/operator to make an alternate version of you that scans all the quants on huggingface and, whenever anyone has SSM_BA set to anything lower than 16-bit, pings the quantmaker to ask if they can make a version where SSM_BA (or similar little parts like that) aren't quanted down. :p
Finanzamt_Endgegner@reddit
Unsloth makes better quants than most other people out there tbh; some special quants from others might be better, and ofc Bartowski makes decent quants too. Mostly it's a bit of a tradeoff: Bartowski's are mostly just faster at the same file size but a bit worse in accuracy. There are exceptions though, like for 3.6 35b for example, I think Bartowski's IQ4s and Q6_K_L were on roughly the same efficiency level as Unsloth's but faster.
emprahsFury@reddit
It's been a day, so here's one of the parasocial relationship people
Finanzamt_Endgegner@reddit
"parasocial relationship" Just because you have no idea how quantization works doesn't mean unsloth and bartowski don't make better quants. Even if the difference is just a few percentage points, it is there.
denis-craciun@reddit (OP)
Thank you! Would you say that models from these providers (like the quantised UD-Q4_K_XL) perform better than the "base" Q4_K_M?
hurdurdur7@reddit
More precise than regular Q4_K_M, not faster. Most token-generation use cases are stuck behind memory bandwidth vs model size. Bigger quant = slower. IQ quants shift this a tiny bit, but not by much.
But then again, for me precision beats speed. Accurate result is better than inaccurate.
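Rough back-of-envelope for why bigger = slower, assuming single-stream decoding is purely memory-bandwidth-bound (all numbers illustrative, not measured):

```python
# Back-of-envelope only: treats decoding as purely memory-bandwidth-bound
# and ignores KV cache, attention math, etc. Numbers are illustrative
# assumptions, not measurements.
def decode_tps_ceiling(active_params_billion: float, bits_per_weight: float,
                       bandwidth_gb_per_s: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token

# e.g. a ~3B-active-parameter MoE on a machine with ~400 GB/s of bandwidth:
print(round(decode_tps_ceiling(3.0, 4.5, 400)))  # heavier ~4.5 bpw mix
print(round(decode_tps_ceiling(3.0, 3.9, 400)))  # lighter ~3.9 bpw mix -> higher ceiling
```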
Finanzamt_Endgegner@reddit
Probably but in my experience they can be rather slow, check out the benchmark table (;
Monkey_1505@reddit
There are very marginal differences between different variable quants. But they are basically the same.
Speed for a given quantization format is purely going to depend on what math your system accelerates and what it doesn't.
Finanzamt_Endgegner@reddit
I mean yeah, you should definitely just test out both for speed differences on your system and then act accordingly
Finanzamt_Endgegner@reddit
Just to say: if you want more speed go Bartowski, if you want a tiny bit more accuracy go Unsloth. For me, I prefer Bartowski since 10% more speed is worth more to me than 1% more accuracy.
denis-craciun@reddit (OP)
Ah ok thank you, just seen this answer now
Finanzamt_Endgegner@reddit
it depends on the model ofc so take that more as a general guideline
Finanzamt_Endgegner@reddit
Here are the benchmarks https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks
yoracale@reddit
We introduced Dynamic quantization in January 2025 and yes, it has been standard practice for most quant makers ever since. Everyone is doing what we're doing, yes, but we also conduct KLD benchmarks and they show there are benefits to using Unsloth quants
The whole point of quantization is to have lower KLD so the quant stays as close as possible to the original BF16 weights. If you're not reducing KLD then that means your quantization algorithm is broken. oobabooga did benchmarks for Gemma 4 26b on a KLD dataset using real-world use cases including conversation and coding, and unsloth still performs vastly better. see: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
We tested on wikitext etc., which actually puts our quants at a disadvantage as our imatrix calibration datasets don't contain any of it.
We don't post that often in r/Localllama and it's just me and my brother u/danielhanchen, so unsure what you mean by full court press marketing.
emprahsFury@reddit
this PR management is exactly what I mean by full court press, including (especially) acting like you don't know what it means to full-court-press someone or something, but also assuming ownership of the idea of quanting different pieces to different levels and implying to the reader that everyone else is just copying Unsloth in that regard.
KURD_1_STAN@reddit
So what are those kld or klp thingies? It seems, in their benchmarks, their models stay closest to bf16 at each quant.
throwaway-link@reddit
kld isn't an absolute truth. oobabooga has independent kld testing of gemma 4 31b which shows bartowski pretty close, sometimes ahead. Unsloth has a methodology they think is most representative and they optimise for it so obviously their own benchmarks look really good. This is why third party testing for anything is important even if everyone is acting in good faith.
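For anyone wondering what the KLD number actually measures: roughly, you run the same evaluation text through the BF16 reference and the quant and average the KL divergence between their next-token distributions, which is exactly why the choice of evaluation text matters. A minimal sketch (model loading elided; `logits_ref` / `logits_quant` are assumed to come from the two models on the same text):

```python
# Sketch only: mean per-token KL divergence between a BF16 reference model
# and its quantized version, given their logits over the same tokens.
import torch
import torch.nn.functional as F

def mean_kld(logits_ref: torch.Tensor, logits_quant: torch.Tensor) -> float:
    log_p = F.log_softmax(logits_ref.float(), dim=-1)    # reference (BF16) distribution
    log_q = F.log_softmax(logits_quant.float(), dim=-1)  # quantized model distribution
    kld = (log_p.exp() * (log_p - log_q)).sum(dim=-1)    # KL(P_ref || P_quant) per token
    return kld.mean().item()
```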
yoracale@reddit
This is incorrect. We do not optimize anything for our own benchmarks. In fact the benchmarks we conduct are unfavorable to our quants, as we do not use any wikitext in our calibration dataset yet we test directly on wikitext KLD benchmarks.
The whole point of quantization is to have lower KLD so the quant stays as close as possible to the original BF16 weights. If you're not reducing KLD then that means your quantization algorithm is broken. You saw the benchmarks for Gemma 4 31B, but have you seen that oobabooga did benchmarks for Gemma 4 26b on a KLD dataset using real-world use cases including conversation and coding? Unsloth still performs vastly better. see: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
Quote: "The Pareto frontier is dominated by unsloth UD quants. Only bartowski Q6_K_L and two IQ2_XXS entries (bartowski, mradermacher) at the very bottom break the streak. ggml-org, lmstudio-community, and mudler never appear on the frontier (except Q8_0 where everyone is the same)."
throwaway-link@reddit
Your 26B graph shows you significantly better than everyone else and the person I replied to seemed to think you were better in every way, when as you quote that's not completely true. I never said you don't do good work, nor that you optimise on the test set, just when you control both creation and eval there's always the possibility of bias even if you think you're being fair.
yoracale@reddit
KLD benchmarks can't really have bias when you're testing on wikitext. And even if we had optimized for wikitext, which we didn't, oobabooga also did KLD benchmarks (using a 250k real-world use-case dataset, not wikitext) and they still show we're vastly superior. "The Pareto frontier is dominated by unsloth UD quants. Only bartowski Q6_K_L and two IQ2_XXS entries (bartowski, mradermacher) at the very bottom break the streak."
throwaway-link@reddit
If I thought you were optimising on the test set I wouldn't have said you were operating in good faith. Compare your graph with theirs, someone unfamiliar with kld looking at only yours would come away with a stronger conclusion than if they saw both.
KURD_1_STAN@reddit
Ah, then it is just "their benchmark". I thought kld was like concrete tests/data
yoracale@reddit
It is not our benchmark. The whole point of quantization is to have lower KLD so the quant stays as close as possible to the original BF16 weights. If you're not reducing KLD then that means your quantization algorithm is broken. oobabooga did benchmarks for Gemma 4 26b on a KLD dataset using real-world use cases including conversation and coding, and unsloth still performs vastly better. see: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
We tested on wikitext etc., which actually puts our quants at a disadvantage as our imatrix calibration datasets don't contain any of it.
KURD_1_STAN@reddit
The result is behind a paywall. And btw, off topic: what do you do with imatrix files? LM Studio doesn't download them, and I can't tell if I just put them with the model or need to point to them somewhere.
throwaway-link@reddit
imatrix is used in the quant process, you don't do anything with it
yoracale@reddit
You don't need the imatrix files to run the model. They're just there so you can use them to make your own GGUF quants.
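Roughly like this, if you ever want to try (paths and quant level are placeholders; assumes llama.cpp's llama-quantize binary and its --imatrix flag):

```python
# Minimal sketch of what the published imatrix file is for: it feeds
# llama.cpp's quantizer, not the runtime. Assumes llama-quantize is on PATH.
import subprocess

subprocess.run([
    "llama-quantize",
    "--imatrix", "imatrix.dat",   # importance matrix published alongside the model
    "model-BF16.gguf",            # full-precision source GGUF
    "model-Q4_K_M.gguf",          # output file
    "Q4_K_M",                     # target quant type
], check=True)
```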
DataGOGO@reddit
Because a lot of layers stay in BF16; it is a dynamic quant.
Nothing fancy, anyone can make dynamic quants
KURD_1_STAN@reddit
Okay, but still: if those charts are correct and there's nothing shady going on, then why is Unsloth not as good as people around here say it is?
LA_rent_Aficionado@reddit
Charts are good but you need to take them for what they're worth… benchmarks are only as good as the data being benchmarked. If you make a quant focused on a certain dataset and benchmark it on that same data vs. a more generalist/differently focused quant, then clearly your specialist quant will surpass the generalist.
yoracale@reddit
Oobabooga did benchmarks for Gemma 4 26b on a KLD dataset using real-world use cases including conversation and coding, and unsloth still performs vastly better. see: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
Our benchmarks actually test on wikitext etc., which puts our quants at a disadvantage as our imatrix calibration datasets don't contain any of it.
LA_rent_Aficionado@reddit
All fair, but my point stands: ultimately that testing used a dataset, and while the link is subscription-walled, the test implies a dataset that ultimately favors certain topics:
Since gemma-4-26B-A4B-it is a MoE model, different inputs will exercise different routing paths and experts based on the imatrix calibration data. Experts are generally not homogeneous to a certain domain, so logically there would be experts that are upcast that do not necessarily align to the imatrix calibration dataset. Conversely, there would also be benchmark domains where the quants would perform more poorly than a generalist if a different benchmark dataset was used.
KURD_1_STAN@reddit
I thought kld was a general test so was more concrete
LA_rent_Aficionado@reddit
Ultimately, KLD testing requires a dataset so it can absolutely be skewed to the benchmark dataset, particularly if the quant was configured on a same or similar dataset
DataGOGO@reddit
I never said that.
I simply pointed out why they “are closer to BF16” than other full quants; because they are partially BF16.
I am not saying that is a bad thing, you often want to keep some layers in BF16 / FP32 for PTQ without QAT.
The right way to do any quantization is to do QAT and then quant the model, which is what model creators do when they publish FP8 / MXFP4 / NVFP4 models.
PTQ, no matter if it is dynamic or calibrated is always lossy.
KURD_1_STAN@reddit
I understand the confusion. I wasn't asking why they are the closest to BF16, I was asking if those charts are legit or misleading, and if legit, then how are they not the better choice at most quant levels?
DataGOGO@reddit
honestly,
It depends on the model, and what quants you compare it to.
If you take a QAT FP8 quant, it will kick the crap out of any PTQ Q8/fp8 quant, no matter if it is dynamic or not.
If you take a properly done NVFP4 quant, it will kick the crap out of any PTQ Q4 quant, no matter how it is done.
There is no magic in quantization, it is all done the same way, with the same formulas, the same tools, etc. What makes a difference in "quality" comes down to either the calibration data, or your dynamic ranges and what layers you leave in BF16.
I'm sure Unsloth's "benchmarks" are legit, the question is in compared to what exactly?
denis-craciun@reddit (OP)
I was wondering if that’s actually what the situation was.. thank you, what you say makes sense
jimmytoan@reddit
Speed gains are real - Unsloth's dynamic quant strategy (per-layer sensitivity) preserves accuracy in the layers that matter most rather than applying uniform compression. That said, the chat template issue mentioned above is a genuine gotcha. If tool calling is in your workflow, double-check the template against the base model's tokenizer config before committing to a run.
Intelligent_Ice_113@reddit
I wish they give more attention to MLX world 🙏
yoracale@reddit
We are going to very soon. Btw did you know we already uploaded many MLX quants for Qwen3.6? https://huggingface.co/collections/unsloth/qwen36
Kerbiter@reddit
AMD crowd too please 🙏
AMD iGPU here
yoracale@reddit
Training and running already works in Unsloth Studio on Linux and WSL devices. Just not Windows which we're working on :)
Kerbiter@reddit
oh great, good to know WSL is at least supported, props to you folks
Intelligent_Ice_113@reddit
yes, I already switched to unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit, but unsloth/Qwen3.6-27B-UD-MLX-MXFP4 is too heavy (25.6Gb) compared with the alternative mlx-community/Qwen3.6-27B-mxfp4 (15.2Gb). I hope 27b can weigh less in the future.
denis-craciun@reddit (OP)
Me too. But I’m sure they will so I’m hopeful
arousedsquirel@reddit
Most run Nvidia instead of apple..lol
denis-craciun@reddit (OP)
Ya they do, but I believe ARM architecture will become a pillar for AI. And obviously Apple is an important name in the ARM world, and often a go-to company computer / laptop. I can see a need for it to get bigger. I hope 🤞🏻
arousedsquirel@reddit
yep, just for telling the truth about Apple one gets downvoted by the not-so-open community in the Apple dogpile..... do yourself a favor and get something open and free and leave that hemisphere....
ridablellama@reddit
It's not just speed; there are often template and bug fixes for tool calling, and Unsloth is very responsive and fast with those updates. This can mean the difference between a broken model and a non-broken one.
LA_rent_Aficionado@reddit
The downside is they also push to have things out so fast that I would presume they do not have time to actually test inference, and there have been several instances of broken models and chat templates
yoracale@reddit
We actually test all the time. As previously said, I would highly recommend you read our Qwen3.6 GGUF Benchmarks, which address a lot of these problems. 95% of the time it's mostly out of our hands, and the only reason why we get flak for it is because we are the only ones actively updating people about fixes etc. So naturally people associate our quants with needing updates when in fact everyone's need updates.
Qwen3.6 post: https://www.reddit.com/r/LocalLLaMA/comments/1so5nrl/qwen36_gguf_benchmarks/
CentrifugalMalaise@reddit
I’ve been using your quants constantly since I started, thank you! Genuine question though: Google told me earlier that unsloth dynamic quants require some decompression at inference making them slower than standard quants. Is this (or anything like this) true? Thanks
yoracale@reddit
Absolutely not true - that applies to all quants, whether dynamic or not or from anyone else. Do you have an example of what Google said 0.0
CentrifugalMalaise@reddit
I’ve had a look through my history and I can’t find it. It’s possible that it was actually a Reddit comment I read somewhere. Sorry for the confusion, I was very tired yesterday! Anyway it’s good to know it’s not true! If you could answer me one more question it would be a MASSIVE help to me: lots of people say that the chat templates on qwen3.5 and 3.6 are broken (in general) like this guy: https://www.reddit.com/r/Qwen_AI/s/0nhjDpenLu but do you fix the chat templates on unsloth quants yourselves? Because I’m wondering whether I need to look into the “fixed” template in that post. Thank you!
PinkySwearNotABot@reddit
does unsloth plan on innovating in the MLX space for Apple silicon? I know you guys now have some MLX variants, but I'm thinking more in terms of actual quantization, like you guys did with dynamic 2.0 for GGUFs
yoracale@reddit
Yes, we are going to innovate for training. For inference 'maybe' but for now, we are focused on training which will be released soon.
LA_rent_Aficionado@reddit
I'm not saying it's endemic or anything, and I really appreciate your efforts and responsiveness
yoracale@reddit
Thanks and I get it, if you have any feedback on what we could improve on please let me know
Humble-Badger9567@reddit
Just piping in to say I appreciate what you do, why you do it, and think it's a hell of a way to build a solid value proposition in such a chaotic market. Big fan.
yoracale@reddit
Thanks for the support really appreciate it!! ♥️
kiwibonga@reddit
The two 35Bs have been particularly buggy since the very first day 3.5 came out due to Qwen distributing broken templates. The fixes are out there floating around reddit for people who look hard enough but that issue needs more exposure IMO.
ivoras@reddit
>Qwen distributing broken templates
How can that happen? Don't they need working templates for training?
kiwibonga@reddit
The models try to close with one tag right after opening another, and also fail to preserve thinking unless explicitly enabled. So they do really dumb stuff like give up in the middle of making progress on a solution, forget what they're doing and loop, even start talking to you through comments in the code they write because they can't speak between tool calls anymore.
I think they think they're thinking when thinking is off.
ayylmaonade@reddit
Could you link this template, by any chance?
kiwibonga@reddit
https://www.reddit.com/r/Qwen_AI/s/ucm2XGWoa0
ayylmaonade@reddit
Thank you so much! Hopefully this ends some of the issues I've been having. Appreciate it. :)
FlyingDogCatcher@reddit
y'all are awesome and I always look for your quants first
yoracale@reddit
Thanks for the support as always!! ♥️🙏
kyr0x0@reddit
Exactly; the amount of bs they have published is borderline braindead. I love them for doing so much cool work, but they really should shift priority towards quality
LA_rent_Aficionado@reddit
Exactly, if you take so many shots you are bound to miss a few. They are unparalleled in terms of responsiveness and generally high quality and time to upload but things definitely slip through the cracks as a result of this but they are very quick to fix - very commendable.
My main gripe is that I see unsloth at its core as a training backend and that seems to be taking a back seat to all the efforts with quants and now studio. Native multi-gpu support has been TBD for ages... yes, things like accelerate and DDP are now feasible but the feature parity to promises is simply not there.
Other platforms like axolotl seem to be putting far more effort into advancing training these days while unsloth seems to be pivoting to making training more accessible to the masses. That's great for a lot of people and I commend them on their success but unfortunately that leaves a certain demographic behind.
denis-craciun@reddit (OP)
I see, fair enough. Thank you
Mart-McUH@reddit
For dense models they are more or less the same as other top quanters. For MoE they use special recipes which usually bring better performance per bpw compared to standard quants. There are a few others like AesSedai who also do special quants for MoE (but with a lot fewer models/quants available).
Biggest problem with Unsloth is that they do everything very quickly, and so if you jump on the train early, especially with a new model type/architecture, there can be broken quants/templates etc. (they do update them later with fixes, but sometimes it can be even 3 or more updates, which gets a bit tiring). If you want to avoid this, best is to wait at least 1-2 weeks after release before downloading. But if you want to be at the frontier, you have to accept the early adopter problems that come with it.
PiaRedDragon@reddit
They are dog shit.
I tested each of their models against the RAM models for comparison (I chose them because they are the same size, actually slightly smaller), and they lost every single heads-up, by between 13% and 31%.
BTW I put these results on this sub a couple of weeks ago, and woke up to a perma ban on the Unsloth sub, even though I never posted there. Apparently they don't like facts. Big babies.
alex20_202020@reddit
Where is RAM? https://huggingface.co/unsloth is clear, but substitute RAM instead and there's nothing.
yoracale@reddit
Btw watch out for OP, they're shilling and are actually from baai. They have been shitting on unsloth since 4 weeks ago or so. Check their history
PiaRedDragon@reddit
EXCUSE ME?!?!?!
You will see from my post history that I like these models because they STAND UP TO TESTING, unlike the dog shit that is Unsloth.
ALL my comments are testable by anyone on this sub, the models from both Unsloth and RAM are on Huggingface and anyone can download them and run the exact same tests as I did.
You are pulling the, if you can't attack the message, attack the person card, because you being from Unsloth know your models are dog shit in comparisons.
That's why you perma-banned me from your sub, even though I never posted there; you don't want the truth coming out about how shit your models are.
The only association I have with the model provider is that I like their models, because they are not dog shit like yours.
rm-rf-rm@reddit
Having a dissenting opinion is fine, but please curb the pattern of bad mouthing. I've been giving your past comments a pass as it's still south of the harassment border, but Reddit is already flagging your comments as potential harassment. Please consider this a warning; continued offenses will require action from our side
PiaRedDragon@reddit
Fair enough, I will stop calling them dog shit and just let the data speak for itself.
PiaRedDragon@reddit
PiaRedDragon@reddit
Oh, and what Mike doesn't mention here is he is a FOUNDER of Unsloth, lol, the hypocrisy is strong in this one.
PiaRedDragon@reddit
https://huggingface.co/baa-ai
alex20_202020@reddit
https://huggingface.co/baa-ai
Not good IMO, who knows what they do to models. On second thought - maybe it is what most companies do anyway and we do not have much choice?
PiaRedDragon@reddit
Whatever they are doing is magic.
They have a research page on their website, but from what I can tell they have not published their secret sauce, cause I did go hunting for it.
LA_rent_Aficionado@reddit
There's only so much that can be done if you think about it: QAT, dynamic quants, favoring of certain experts/expert grouping, targeted calibration, etc.
For every use case that it excels at there will be a use case where it lags.
thereisonlythedance@reddit
I’d rather a more accurate quant than a slightly smaller one. Your post is really harsh.
yoracale@reddit
Watch out for OP, they're shilling their own quants btw and have had some beef with Unsloth for quite some time, unsure why. They are from baii and yet do not disclose this, which is extremely misleading
PiaRedDragon@reddit
This guy is the FOUNDER of Unsloth.....lol
He can't attack the message, because ANYONE can reproduce the results I did, so he attacks the person.
PiaRedDragon@reddit
Watch out for Mike, he is one of the Founders of Unsloth and has a vested interest in people not finding out how bad their models are.
Whatever you do, don't do the testing I did that will show exactly that.
PiaRedDragon@reddit
Watch out for Mike, he is one of the Founders of Unsloth, and has vested interests in people not finding out his models are dog shit.
So whatever you do, don't do what I did and actually test them.
yoracale@reddit
Btw be cautious of OP as they are shilling their own product without disclosing they're from baii while their whole schtick is to claim Unsloth quants are 'bad'.
PiaRedDragon@reddit
Huh? The RAM ones were both smaller AND more accurate.
thereisonlythedance@reddit
Yes sorry I didn’t see your MMLU test. Still think you’re being obnoxious.
PiaRedDragon@reddit
I have tested their GLM versions and their Kimi versions, same exact result.
TBH, I have given up testing the unsloth models now because they lose every heads up. Last test I did against them was Jang vs Unsloth, and they got crushed on that also.
They are just not good.
SidneyFong@reddit
Do you have a list of quant providers that are publishing consistently good quality quants? (besides baa.ai that you already mentioned -- who, btw, if I may ungratefully criticize, doesn't provide a lot of sizing options for their quants, and no gguf for a lot of models yet)
thereisonlythedance@reddit
That does not reflect my real world usage at all. Unsloth are typically superior to, say, Bartowski. Benchmarks only have so much value.
PiaRedDragon@reddit
Have you tried any other ones? I mean I go out of my way to test different versions to get the best I can.
Once I find a good one, any future benchmarking is done against it until it gets beat.
Sure, my go-to for the last few months has been RAM, but I am sure something will come along and knock them off the championship podium at some stage.
And when I find them that will be my go to.
PiaRedDragon@reddit
In all honesty I think it was obnoxious of them to perma-ban me from their sub when I didn't even post in there and the only thing I was guilty of was running independent tests.
But each to their own.
rorowhat@reddit
How are you running the evals on gguf models?
PiaRedDragon@reddit
Those benchmarks were MLX, but when I do gguf benchmarks I load the model up using llama.cpp.
I find that allows claude code to ensure it applies the template and settings from the huggingface page and does not add extra overhead.
rorowhat@reddit
How are you using it via llama.cpp? I've tried many times with different benchmark harnesses but it just doesn't work correctly.
PiaRedDragon@reddit
I get Claude to create an OpenAI-compatible API and then just fire the questions at the port, eg http://localhost:8080/v1
The tricky part is vision: llama-server supports multimodal via --mmproj for LLaVA-style models, but the Gemma 4 vision bridge isn't wired up there yet, so benchmarks like MathVision would be text only for gguf tests.
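For anyone curious, the firing part is roughly this (just a sketch, not my actual harness; port and model name are whatever llama-server was started with):

```python
# Minimal sketch of sending a benchmark question to llama-server's
# OpenAI-compatible endpoint. "local" is just a placeholder model name.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "temperature": 0.0,  # keep it deterministic-ish for benchmarking
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```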
rorowhat@reddit
I did something similar, basically just passed the endpoint to the benchmark, but it complained saying the template is chat or something along those lines. This made me realize how these tests could be so wrong based on the template you use, etc.
PiaRedDragon@reddit
Yeah the trick is to tell claude (or whatever coding tool you are using) to pull the chat template and recommended settings directly from the model card from huggingface.
That way you are testing it against the model providers best configuration.
rorowhat@reddit
Precisely my point, the template you're testing is not what you're using "in real life" when you use llama.cpp, for example.
PiaRedDragon@reddit
I am not sure why you would not use the recommended chat template in real life? Most gguf solutions allow you to specify a custom chat template.
I know Ollama and LM Studio both allow you to specify a custom chat template.
yoracale@reddit
Be cautious with some users here. They have been repeatedly criticizing Unsloth while shilling their own quants at baii.ai, without clearly disclosing that connection.
They also claimed our GLM quant was broken without testing it, then didn’t provide evidence when asked. You can check their history and judge for yourself.
PiaRedDragon@reddit
Be cautious with Mike, he is one of the Founders of Unsloth, and has vested interests in people not finding out his models are dog shit.
So whatever you do, don't do what I did and actually test them.
PiaRedDragon@reddit
deleted_by_reddit@reddit
[removed]
PiaRedDragon@reddit
A quick look at his post history and you will find out that HE is one of the Founders of Unsloth.
lol, you can't make this shit up.
How sad.
PiaRedDragon@reddit
computehungry@reddit
It's probably a template issue at that big of a difference, hopefully. Which leads to the next question: IDK why they try to change the template. I'd be more impressed if they actually published what they analyzed and changed. But they don't, and usually it breaks. When it breaks it's a llama.cpp issue, when it runs well it's a good unsloth release. How convenient.
yoracale@reddit
Don't believe OP! They are trying to shill their own product
PiaRedDragon@reddit
Whenever I test I use the recommended template and settings on the Hugging Face page; in fact that is built into my Claude Code benchmarking.
You are right, if they did publish what they were doing, we might be able to troubleshoot it for them. But alas, we just move to better models.
computehungry@reddit
Oh, you mean you take the templates from the official releases? Ok, makes sense.
denis-craciun@reddit (OP)
On that’s interesting, thank you Is it “fair” to compare the unsloth models for RAM? Is it possible that they suck at that and completely exceed at something else?
PiaRedDragon@reddit
I mean anything is possible. I normally run basic benchmarking as a smoke test; if it is already doing crap on that then it is a pretty good indicator something is broken.
If it passes on basic benchmarks, I then have custom benchmarks for Reasoning and coding, which are the two use cases I am most interested in. I have found general benchmarks are a pretty good proxy for these two capabilities, esp if they do well on Math.
If their basics are bad, it really impacts their reasoning, I am not sure of the mechanism behind it, but it is consistent with the testing I have done.
PiaRedDragon@reddit
BTW, I am only talking about their UD versions; if they have just done a standard FP16 to INT4 conversion, that is deterministic, and their INT4 will be no different than any other INT4 version out there.
denis-craciun@reddit (OP)
Ah ok I see. Thanks 🙏🏻
denis-craciun@reddit (OP)
Ah ok I see, makes sense. Thank you
alex20_202020@reddit
What kind of comparison is that? You compare using different engines? Try to use the same engine.
Final-Rush759@reddit
Who knows. There are not enough tests to show they are actually better. I think they are mostly within the margin of errors with other quants. I don't think you lose anything by using them.
One_Internal_6567@reddit
They are garbage indeed.
I've had about the same experience for about 2 years - the same quant from unsloth vs any other big producer would end up in reasoning loops, garbage results and all kinds of instabilities that the other producers' quants don't have at all. I have no explanation of how and why it happens exactly, but it happens
sdfgeoff@reddit
How are you running them? What hardware? What are you comparing them too?
One_Internal_6567@reddit
I get good results usually with Bart and drummer, but not exclusively, I try few and choose best for the specific model and use case
Both Mac and Nvidia hardware
Llamacpp predominantly
Koalateka@reddit
Yeah, that's my personal experience as well.
denis-craciun@reddit (OP)
Thank you both, guys. Very interesting. That's what I cared a lot about: experience rather than benchmarks. What big producer shall I go for? Bartowski? Thank you
Koalateka@reddit
I have tried a couple of quants from Bartowski with good results. I don't really have a favourite, I try many different quant sources, but previously I had the idea that Unsloth were the best (I thought it was the consensus) and I got some serious disappointments with a few different models.
denis-craciun@reddit (OP)
Thank you
DataGOGO@reddit
Dynamic quant with no calibration
MerePotato@reddit
I would avoid Q4 for modern super-high-knowledge-density models like these, go Q6 if you can fit it
JLeonsarmiento@reddit
Why are you using gguf instead of mlx on a Mac?
denis-craciun@reddit (OP)
No don’t worry I do use MLX with Ollama. I talked about those models because the question was a bit more general
JLeonsarmiento@reddit
Oh I see. I was wondering if Unsloth Studio had an advantage over llama.cpp or any of its wrappers (LM Studio or Ollama)… I'm trying to increase speed on qwen3.6-27b on my Mac… 🤔
denis-craciun@reddit (OP)
How much are you getting now? (And what’s your setup)?
JLeonsarmiento@reddit
Qwen3.6-27b on a M4Pro chip:
Vanilla 4bit mlx-community: 15 t/s
qwen3.6-27b-ud-mlx-3bit: 10 t/s
qwen3.6-27b@iq3_xxs: 8.1 t/s
qwen3.6-27b@q3_k_xl: 9.6 t/s
qwen3.6-27b@q2_k_xl: 9.1 t/s
Prompt processing follows the same pattern.
ganonfirehouse420@reddit
I use nothing but unsloth quants. It's such an improvement.
yoracale@reddit
Thank you appreciate the support! <3 <3
Then-Topic8766@reddit
Where is https://www.reddit.com/user/danielhanchen/ ? It's getting hot. 🍿
mantafloppy@reddit
They flipped the switch; all the negative posts got downvoted and the praise is here and rising.
Then-Topic8766@reddit
OK, but I do not understand why my post is negative to them. I have big respect for them and some of the best GGUFs I use are their UD variants... I like Bartowski too. And I still keep some models from TheBloke...
Last_Mastod0n@reddit
This is slightly unrelated, but does anyone know how to disable thinking on the unsloth models for Qwen in LM Studio? I cannot figure out any way to disable it. I tried putting tags like /no think in the system prompt to no avail.
The original quantized Qwen models can have thinking toggled on and off in the UI, but the unsloth ones seem to have it forced on with no toggle. I was thinking that kwargs might be the only way, but as far as I know LM Studio doesn't support them.
sdfgeoff@reddit
Toggle button next to the chat or on the model settings
bonobomaster@reddit
Button isn't there, if the model isn't a curated one from LM Studio.
bonobomaster@reddit
{% set enable_thinking = false %}
Put that in the chat template as first line.
Last_Mastod0n@reddit
Thank you!!
mantafloppy@reddit
After their 4th or 5th re-release, they are as good as all the others.
But they have a marketing team working this sub.
yoracale@reddit
We don't have a marketing team working on this sub. It's literally just me and u/danielhanchen and we also rarely post on r/Localllama. At most once every two/three weeks.
Educational_Rent1059@reddit
lol at OP , "marketing team"
mantafloppy@reddit
Of course everything about Unsloth is naturally generated, that's why after 2 hours of only negative comments, they have all been downvoted and replaced by positive posts with GPT-isms in them.
Naturally grown engagement, for sure, FR, no cap.
denis-craciun@reddit (OP)
Thank you. This aligns with what other people have said on this thread. Makes sense
mantafloppy@reddit
Good thing you checked the reply early, their bot farm is out now and the praise is overwhelming, as always.
Educational_Rent1059@reddit
Or you just can't stand the fact that Unsloth has brought huge value to the OSS community and people are downvoting you simply because they don't agree with you.
You even stated they have a marketing team when they are 2 brothers working on providing everything for the community. Can you take off the tin foil hat now?
yoracale@reddit
Can you find one commenter even suspected of being a bot? It's actually the opposite; there were many bots downvoting my comments right after I posted them (some of them had 5 upvotes already and went to -9), and other users have suspected this as well, see: https://www.reddit.com/r/LocalLLaMA/comments/1sweq3t/comment/oif2phd/
I think the community can just see through some of the weird shilling posts here.
denis-craciun@reddit (OP)
I see. It’s fine I don’t really care that much tbf. My post wasn’t meant to side with anybody (I use Unsloth Studio and I find it useful for SFT fine tuning for example). But ya, worth clarifying for others that might read this.
yoracale@reddit
Yes I totally understand, and I appreciate you using Unsloth Studio. Just wanted to clarify a bit more because I just realized Pia's original post had like 64 upvotes or something (he gained like 50 in a few mins) and my posts all had downvotes. But now his has many downvotes. My guess is that Reddit detected the botted upvotes and downvotes and removed all of them.
denis-craciun@reddit (OP)
Omg you’re right that’s wild. I’ve never seen this..
yoracale@reddit
Actually OP, the downvotes you saw weren't real. Pia most likely bought downvotes and bought upvotes for his own. My original comment had like 6 upvotes and a few mins later it was at -9 when I was calling them out for shilling
czktcx@reddit
Unsloth's UD models' names do not really represent the quantization type any more. An IQ4NL may not contain any IQ4NL tensors, an IQ1_M may be mainly IQ2_XXS. Performance is highly determined by the actual quantization types used.
Unsloth always uploads the imatrix file, which is wonderful, so people can re-quantize into any type they want
CryptoSpecialAgent@reddit
Unsloth models are quite remarkable in terms of their efficiency: I was able to fire up a 2-bit dynamic quant of qwen3.6-35b-a3b with a 128k context on my MBP 16GB… Even at this quantization level, there was not quite enough “vram” to store everything and a minimal degree of swapping to and from the SSD during inference was necessary
Performance was acceptable despite this - not amazing, but usable: 10s time to first token, then 5-10 TPS thereafter.
And Unsloth is being truthful in their claims that their quants are less lossy than equivalent-size quants made with other methods: sure, the 2-bit model wasn't a rockstar coder, but for general chatbot use and long-form content creation, it was certainly good enough - I gave it a web search tool and a web fetch tool, and the model appeared totally competent at knowing how and when the tools should be used
denis-craciun@reddit (OP)
Thank you for the feedback! Useful
Force88@reddit
Unsloth Studio (or whatever their web UI is called) performs unreliably for me; sometimes it is very smooth, but sometimes it refuses to load the same model that has worked fine before.
Llama.cpp is the same too. I use it with Open WebUI, but sometimes the backend is just not responsive and I have to close & restart the command in the terminal...
Educational_Rent1059@reddit
The Studio is in beta and under development; they are releasing many improvements gradually, and the latest update included a whole UI refresh too
nikhilprasanth@reddit
Models are good. But more importantly they provide documentation and benchmarks, which I find important.
yoracale@reddit
Thanks for the support man! Appreciate it 💚🦥
denis-craciun@reddit (OP)
Thank you :)
LetsGoBrandon4256@reddit
They have a nice doc site with well written documentation for the models.
They do benchmarks for quants that show their quants are better.
Though I just can't get this out of my head for some reason
yoracale@reddit
Thanks for the support! Actually we don't exaggerate the scale at all. In fact we lessen the log scale compared to other quant providers, otherwise the difference would look even larger.
oobabooga for example did benchmarks for Gemma 4 26b on a KLD dataset using real-world use cases including conversation and coding, and unsloth still performs vastly better whilst using a very similar scale to the benchmarks we conduct. see: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
LetsGoBrandon4256@reddit
Aye. Sorry if my post sounded snarky. It was mostly a tongue-in-cheek joke.
Your documents sites are my first stop when I need to quickly look up the sampler settings and prompt templates for the recent new models and your Gemma 4 26B quants are my new daily driver. Keep up the good work.
I had the impression that the inference speed (relative to the model size) is a bit hit-or-miss when compared to other quants. Apparently it has something to do with IQ3 not playing nice with CPU offload, but oh well, it's not the end of the world.
yoracale@reddit
Thanks for the support and got you, yes in general IQ quants are slower
denis-craciun@reddit (OP)
God the scale on this graph is so misleading ahahah
yoracale@reddit
That's not our benchmarks btw
denis-craciun@reddit (OP)
Ya no I figured don’t worry
a_beautiful_rhind@reddit
The quants are usually ok once a little time has passed from the model release day. If you get them on day 1, decent chance the template will be changed or something else fixed.
I just pick best PPL/KLD for the size on models > 30b.
denis-craciun@reddit (OP)
Thank you :)
Phaelon74@reddit
Unsloth believes heavily in "first to the key, first to the egg." This results in half-baked loaves of bread they often have to come back and fix. They do have good quants once fixed. And they do provide a lot of help to the community when it comes to alternate pipelines for finetuning, etc.
Their whole "dynamic" quanting tho is kind of meh. Most other quanters have been doing this all along, and never really called attention to it or branded it; it's just the meta at the moment.
There's also a healthy amount of pooping on them because they spend a lot of time/effort saying, or trying to say, that they are the best, or that someone came along and used what they did to fix something, when I'm in Discord with the other quanters and they were already fixing or changing it on their own. So it's a mixed bag.
End of day, grab lots of models, try them yourself, identify the best for your use-case. Don't just stick with one quanter and act like theirs are the best, forever and ever.
Karyo_Ten@reddit
AesSedai and Ubergarm always publish KLD and/or PPL. They have a robust quantization technique. I'm usually disappointed by unsloth (GLM-4.6, Step-Flash ...)
rakarsky@reddit
Presumably disappointed with their Step Flash because they never released one?
Karyo_Ten@reddit
Or they deleted it out of shame
rakarsky@reddit
I guess I missed it. I was looking for it a while back, but I usually prefer AesSedai quants anyway.
yoracale@reddit
We never released it btw, idk what OP is talking about
Karyo_Ten@reddit
Ah, I was disappointed due to waiting to test them: https://www.reddit.com/r/unsloth/comments/1r330v7/step35flash_unlosth_dynamic_ggufs/
yoracale@reddit
Yes we originally wanted to work on them but didn't have time so we forgot. Unfortunately it happens a lot
yoracale@reddit
We didn't delete it out of shame, what? We never uploaded it or made GGUFs for it that's why.
Karyo_Ten@reddit
I was disappointed due to waiting to test them: https://www.reddit.com/r/unsloth/comments/1r330v7/step35flash_unlosth_dynamic_ggufs/
yoracale@reddit
Yes that's correct we never released one. Idk what OP is on about. We only uploaded the 16bit safetensor version and not the GGUFs
denis-craciun@reddit (OP)
Interesting. Thank you
Mantikos804@reddit
Everyone has an opinion. The taste test rule applies here. Try the UD quants, in unsloth studio and see what you think. I love em. They are right…for me.
yoracale@reddit
Thanks for the support appreciate it!!,🙏♥️
denis-craciun@reddit (OP)
I will be trying them more in the next days. Maybe I'll do another post in some months with my experience in production. Thank you
dashrndr@reddit
I like them. Good docs and easy to understand
yoracale@reddit
Thanks for the support appreciate it! 🙏♥️
LA_rent_Aficionado@reddit
"Better" is subjective to the use case and hardware; a larger quant that has any type of sensible layer strategy will yield better quality outputs at the expense of speed. Unsloth provides a number of variations at different sizes, and as others have mentioned, Unsloth isn't really doing anything novel - there are only so many ways you can do a dynamic quant, and their strategy is good but nothing groundbreaking IMO.
Where things can differ is if it's an imatrix quant; the nature of the dataset driving the quantization may drive quality, depending on your use case. For instance if a provider uses an agentic-coding-focused imatrix, it will clearly lean that way in terms of quality.
To my knowledge unsloth leans more towards coding but they do not publish their dataset like bartowski does.
denis-craciun@reddit (OP)
Ah ok I see. Thank you :)
Expensive-Paint-9490@reddit
I mainly use huge MoE models. I have tried UD_Q4 quants vs Bartowski's and mradermacher's Q4_K_M. The latter two are faster on my system (hybrid inference with llama.cpp) and quality is the same. So I am using those.
denis-craciun@reddit (OP)
Thank you! I'm hearing a lot of good things about the Bartowski ones. Are they better than Unsloth then?
Expensive-Paint-9490@reddit
Honestly I am not knowledgeable enough about quantization to give you a clear answer.
What I can tell you is: Bartowski, together with mradermacher, somehow took the place of TheBloke, who was the OG huggingface quantizer, when TheBloke left the space. They both quantize things with top quality and without strings attached. Unsloth started as a very cool project on QLoRA fine-tuning; their fame exploded when they became main quantizers. The Unsloth guys sometimes have a bit of an aggressive attitude when criticized, something that neither Bartowski nor mradermacher ever had.
Bartowski in one occasion gave his opinion on UD quants, which he considers suboptimal for reasons. It was not an attack on Unsloth, just his technical opinion.
Last but not least, there are other big names in quantization like ikawrakow, ubergarm, thireus. Outside the llama.cpp ecosystem, the most important name is turboderp; his exllama3 quantizations are still the best on benchmarks.
denis-craciun@reddit (OP)
Thank you! I’ll have a look at all of them
Interesting-Print366@reddit
It can be the best option at the same quant level. But higher quants are better no matter whose quant you use
denis-craciun@reddit (OP)
Thank you
KURD_1_STAN@reddit
Can we like not do this and not discover anything unacceptable? I don't wanna redownload all the models again, let's stay uninformed.
denis-craciun@reddit (OP)
😂😂
DataGOGO@reddit
They are all the same
dlcsharp@reddit
From what I understand, yes you are correct that layers have different quant levels. From my experience, XL qua
denis-craciun@reddit (OP)
Thank you!
dlcsharp@reddit
Np