Qwen3.6-35B-A3B Uncensored Aggressive is out with K_P quants!
Posted by hauhau901@reddit | LocalLLaMA | View on Reddit | 76 comments
The Qwen3.6 update is here. 35B-A3B Aggressive variant, same MoE size as my 3.5-35B release but on the newer 3.6 base.
Aggressive = no refusals. It has NO personality changes/alterations or any of that; it is the ORIGINAL Qwen release, just completely uncensored.
https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
0/465 refusals. Fully unlocked with zero capability loss.
From my own testing: 0 issues. No looping, no degradation, everything works as expected.
To disable "thinking" you need to edit the jinja template or simply use the kwarg {"enable_thinking": false}
What's included:
- Q8_K_P, Q6_K_P, Q5_K_P, Q4_K_P, Q4_K_M, IQ4_NL, IQ4_XS, Q3_K_P, IQ3_M, Q2_K_P, IQ2_M
- mmproj for vision support
- All quants generated with imatrix
K_P Quants recap (for anyone who missed the 122B release): custom quants that use model-specific analysis to preserve quality where it matters most. Each model gets its own optimized profile. Effectively 1-2 quant levels of quality uplift at ~5-15% larger file size. Fully compatible with llama.cpp, LM Studio, anything that reads GGUF (Ollama can be more difficult to get going).
Quick specs:
- 35B total / ~3B active (MoE — 256 experts, 8 routed per token)
- 262K context
- Multimodal (text + image + video)
- Hybrid attention: linear + softmax (3:1 ratio)
- 40 layers
Some of the sampling params I've been using during testing:
temp=1.0, top_k=20, repeat_penalty=1, presence_penalty=1.5, top_p=0.95, min_p=0
But definitely check the official Qwen recommendations too as they have different settings for thinking vs non-thinking mode :)
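If you're scripting against it, those settings map onto llama-cpp-python roughly like this (a sketch; the model path is a placeholder and the parameter names follow create_chat_completion):

```python
# Sketch: the sampling params above, via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf",  # placeholder
    n_ctx=32768,
    n_gpu_layers=-1,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hi!"}],
    temperature=1.0,
    top_k=20,
    top_p=0.95,
    min_p=0.0,
    presence_penalty=1.5,
    repeat_penalty=1.0,
)
print(out["choices"][0]["message"]["content"])
```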
Note: Use --jinja flag with llama.cpp. K_P quants may show as "?" in LM Studio's quant column. It's purely cosmetic, model loads and runs fine.
HF's hardware compatibility widget also doesn't recognize K_P so click "View +X variants" or go to Files and versions to see all downloads.
All my models: HuggingFace-HauhauCS
Also new: there's a Discord now as a lot of people have been asking :) Link is in the HF repo, feel free to join for updates, roadmaps, projects, or just to chat.
Hope everyone enjoys the release.
Iory1998@reddit
No degradation? Hard to believe. Never seen an uncensored model without quality degradation.
FinBenton@reddit
I can't say about that, but in general these Aggressive versions have always been the best of all these uncensored ones; in my use I don't see any degradation.
-p-e-w-@reddit
There are benchmarks for that. This can be measured. We don’t have to go by people’s vibes.
Unfortunately the author doesn’t release unquantized versions (unlike essentially every other researcher on any topic), which makes benchmarking much harder because the standard harnesses don’t support GGUFs.
The maintainer of the UGI Leaderboard has been repeatedly asked to benchmark those models, but had to give up because he couldn’t get the quants to work. It’s really difficult to assume good faith here.
NoahFect@reddit
He couldn't get the quants to work? What kind of BS is that? They work. Have you tried them yourself?
AlwaysLateToThaParty@reddit
the guy is the main developer behind heretic fwiw. usually worth a listen.
NoahFect@reddit
Trouble is, (a) the HauHauCS Qwen models are the best ones I've tried, including the heretics; and (b) he seldom misses a chance to jump in with this exact complaint when HauHauCS announces a new release. We get it, the Kullback-Leibler divergence is suboptimal, but at the end of the day the HauHauCS models feel just as smart as the originals and never, ever refuse anything.
-p-e-w-@reddit
I will stop complaining the moment the author provides evidence to support their uncorroborated boasts (“zero capability loss”).
Which, btw, should be the default expectation, not something you have to specifically ask for.
I have never claimed zero capability loss for Heretic models (even though some of them beat the base model on benchmarks), and I consider the very idea to be nonsensical. If you change behavior, then there will be some way in which the behavior becomes worse. That’s common sense, and to claim otherwise (especially without any evidence) is just dishonest.
NoahFect@reddit
I hear what you're saying and don't disagree in principle, but it comes across as unnecessarily defensive. When did HauHauCS target your models for criticism? Is there some beef that isn't apparent to the rest of us, just reading these threads?
-p-e-w-@reddit
The burden of proof is on the one who’s making the claims. Especially when those claims are highly unusual, but even otherwise.
Iory1998@reddit
Exactly! I've been using local models since a week after the original LLaMA was leaked. I've tried countless models, and I have never once tried an uncensored version that worked as well as the vanilla model.
NoahFect@reddit
What would be an example of a prompt that consistently performs better on the original Qwen 3.6 35B release than it does on this one?
draconic_tongue@reddit
LMFAO
AlwaysLateToThaParty@reddit
His criticisms seem consistent and considered to me.
-p-e-w-@reddit
Read it yourself: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard/discussions/590
NoahFect@reddit
Based on other posts from him, I think (he may correct me here if I'm wrong) he's running out of space on HF. Could be as simple as that.
All I know is, it draws a decent pelican for a 35B MoE.
AlwaysLateToThaParty@reddit
Thanks for the insight. Yes. If you're making it hard to validate your system, one can't assume good faith.
RickyRickC137@reddit
There has to be some metric to say that. Here's Heretic - https://huggingface.co/Abiray/Qwen3.6-35B-A3B-heretic-GGUF It has a KL divergence value to measure (closer to 0 means closer to the original model).
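For anyone unfamiliar, that number is the standard KL divergence, averaged over token positions:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

where P is the original model's next-token distribution and Q is the modified model's; 0 means the two distributions are identical.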
-p-e-w-@reddit
The KLD in that model card is very likely misleading btw. It’s unrealistically low even for a SOMA model. I suspect that the fork of Heretic it was made with is still missing the “two-stage CoT skip” patch, without which it can measure at a token position where the probability distribution is highly skewed.
Yes, correctly measuring model divergence is very, very complicated.
Kodix@reddit
But responding differently from the original model is the point. It would be unclear whether higher divergence on any particular answer was "correct" or not.
droans@reddit
They're not talking about asking how to make drugs or write erotica - of course you'd expect those responses to be different.
But if you ask it to tell you facts about Venus, you'd expect the same seed to give the same response.
SheepherderBeef8956@reddit
If you ask it to answer something with an objectively true answer that doesn't trigger refusals in the base model, it's reasonable to assume an uncensored model should respond in the same way if it's unaffected.
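A quick way to spot-check that (a sketch; model paths are placeholders, and exact string equality is a blunt instrument compared to measuring KLD over the logits, but it catches gross drift):

```python
# Sketch: compare base vs. uncensored on a neutral factual prompt.
# temperature=0 makes each run deterministic; paths are placeholders.
from llama_cpp import Llama

def answer(path: str, prompt: str) -> str:
    llm = Llama(model_path=path, n_ctx=4096, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return out["choices"][0]["message"]["content"]

prompt = "List three facts about Venus."
base = answer("Qwen3.6-35B-A3B-Q4_K_M.gguf", prompt)
uncens = answer("Qwen3.6-35B-A3B-Uncensored-Aggressive-Q4_K_P.gguf", prompt)
print("identical:", base == uncens)
```

Note that two different quant levels will show minor wording drift even with zero behavioral change, so treat mismatches as a prompt to look closer, not proof of damage.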
KallistiTMP@reddit
Worth noting - it is possible to go too hard on removing refusals. The more recent research has shown that hallucinations are very closely tied to under-activation of the uncertainty-related circuits, as a result of post-training where non-answers are scored the same as confidently wrong answers. So, the model inadvertently learns that just confidently making wild guesses is always better than just saying "I don't know".
That's a very hand-wavey non-technical explanation, but that's more or less the currently supported theory.
So, you ideally want to selectively get rid of the "I'm sorry Dave, I'm afraid I can't do that" answers, without getting rid of the "I actually have no fucking idea" refusals.
That may be less of an issue now though, I'm sure a lot of major players have already adjusted their post-training pipelines to punish wrong answers more than non-answers since it's such an easy optimization.
no-adz@reddit
I do see a bit of degradation.
Ref: unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
Uncens: HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q2_K_P.gguf
Prompt: 'The car wash is only 50 meters away; should I walk there or drive?'
Ref: 'Since it's a car wash, you'll want to drive your car there—even 50 meters counts! 😄 [etc]'
Uncens: 'At just 50 meters, walking is almost always the better choice. Here’s a quick breakdown to help you decide: [etc]'
Also 3 other prompts clearly resulted in degraded answers for the uncens model.
Prudent-Ad4509@reddit
Here is the answer from Q8_K_P:
Since it's only 50 meters (about 160 feet) away, driving your car is almost always the better choice. Here's why:
🚗 Drive it if you plan to wash it yourself or use the car wash equipment. You'll need the car on-site anyway, and it'll only take 15-30 seconds to roll over. No need to carry hoses, buckets, or soap.
🚶 Walk if you're just dropping off the keys for a full-service wash, grabbing supplies, checking hours, or if it's a "walk-up" self-service station where you wash the car from outside.
Bottom line: Unless you're doing a drop-off or it's specifically designed for foot access, just drive it over. The distance is negligible, and you'll save yourself the hassle of carrying gear or pushing the car.
tavirabon@reddit
Not saying it isn't degraded, but asking a single question once is just noise; nothing meaningful can be extrapolated from it.
no-adz@reddit
Indeed, that would not make sense. So I did not do so.
Zestyclose_Yak_3174@reddit
At that quant level artifacts are more common. I would like to see a 4-bit comparison. Will also do it myself, but it will take some time.
ArkCoon@reddit
it says "KL0.0764" in the gguf metadata.. I'm guessing the kl divergence is 0.0764 in that case? Idk why it's not stated in the model card though
Electronic-Metal2391@reddit
Dear, is there any way to stop your quant from thinking? It's not adhering to any recommendations to stop thinking. I'm using the Q4_K_M.
No-Leave-4512@reddit
Thank you! Any plans on doing Gemma4-26B-A4B?
EvilEnginer@reddit
Really nice update. I've been waiting for this one too :D
PaceZealousideal6091@reddit
I see you have updated your repository with this as the base model! You have added the K_P quants! Can you please help me understand why it's called K_P? Isn't it the same as imatrix quants? If not, how is it different?
EvilEnginer@reddit
I think those quants have better tensor profiles for data storage.
havnar-@reddit
MLX Opus distilled when?
UntimelyAlchemist@reddit
I'm a newbie to AI and am still getting into it all, so by no means an expert. But I've been experimenting with your Qwen 3.5 and Gemma 4 models, comparing them to Unsloth and Heretic versions, and at least from my subjective experience your releases are fantastic. Truly totally uncensored, and I haven't yet noticed any degradation. Heretic releases are disappointing and refuse every one of my test prompts, so I'm not sure what's going on with those...
Will you be doing the bigger size Gemma models?
Electronic-Metal2391@reddit
Thank you very much! Your models are great!!!
Icy_Annual_9954@reddit
What hardware do I need to run this? Any reliable stats?
Top-Rub-4670@reddit
You've just described imatrix?
PaceZealousideal6091@reddit
I have been asking him about this for his past 2 or 3 posts! He never replies! Are you sure it is imatrix? So llama.cpp doesn't care how you label the quants and processes them regardless?
Goldkoron@reddit
Personally I'd prefer it if all UIs started showing bpw values for each quant. All the K_L, K_S, etc. stuff is getting too messy, while bpw is a consistent number people can reference for size across any quants.
TechnoByte_@reddit
That says nothing about the method used to create the quant, which matters a lot for quality
CountlessFlies@reddit
Yeah but aren’t there multiple quants at the same bpw? Eg, IQ4_XS vs Q4_K_M. They’re both 4 bpw right?
Goldkoron@reddit
Q4_K_M quants are usually around 4.7 bpw or higher, I believe. IQ4_XS is a bit lower, and usually highly uniform (almost all IQ4_XS tensors), whereas Q4_K_M quants have sporadic Q5_K tensors mixed in.
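The arithmetic is easy to check yourself (illustrative numbers, not measurements of any particular file):

```python
# bpw = bits in the file divided by parameter count (illustrative numbers only).
def bpw(file_size_bytes: float, n_params: float) -> float:
    return file_size_bytes * 8 / n_params

# e.g. a hypothetical 21 GB file for a 35B-parameter model:
print(round(bpw(21e9, 35e9), 2))  # -> 4.8
```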
PaceZealousideal6091@reddit
The description says "Each model gets its own optimized profile. Effectively 1-2 quant levels of quality uplift at ~5-15% larger file size." Can someone confirm this? Is it really that good? So Q4_K_P performs better than Q4_K_XL and is at least comparable to Q5_K_M in perplexity/KLD?
mrdevlar@reddit
I'm very much enjoying your releases, since the alignment on the Qwen models appeared to be very strong.
Well done, keep up the good work.
Long_comment_san@reddit
I think we should collectively agree on some global statement like: "a quality drop below 1% is to be considered lossless."
I think people repeatedly explaining that takes more time than the issues this quality drop causes. Let's save some time.
llama-impersonator@reddit
still a distinct lack of information on what you did, how you tested "zero capability loss", etc.
Sus-Amogus@reddit
It’s always fun how this comment never gets a response from this author haha
FinBenton@reddit
My tinfoil-hat theory is the guy works for Qwen and knows exactly what to do, he just doesn't disclose it :D
AlwaysLateToThaParty@reddit
Normal hat: guy likes attention but doesn't know how to do it legitimately.
NoahFect@reddit
Hatless: guy doesn't owe any of you jack squat.
AlwaysLateToThaParty@reddit
Shit talkers talk shit. This, and other news, at 11.
Jackw78@reddit
Appreciate the work! On a sidenote, only two quants are available to download, so I assume the files are still being uploaded?
hauhau901@reddit (OP)
Hey Jack, yep - they're being uploaded as we speak. The list of quants you see here in my post is what the final upload tally will be.
NoahFect@reddit
Thanks for your work on these, seriously. Hope you're able to do something with the larger Gemma models at some point, as well!
Jackw78@reddit
Good to know! First time for me being this early :)
OniCr0w@reddit
Yep they're uploading
EatTFM@reddit
Can someone explain why presence-penalty 0.0 is officially recommended for coding tasks?
Kodix@reddit
"print(x) print(y)"
The second print would get a presence penalty, because print already exists in the current context.
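Roughly, presence penalty is a flat logit subtraction for any token that has already appeared (a sketch of the OpenAI-style rule, not llama.cpp's exact code):

```python
# Sketch: every token already present in the context gets the same
# fixed subtraction, no matter how many times it appeared.
def apply_presence_penalty(logits: list[float], context_ids: list[int],
                           penalty: float) -> list[float]:
    seen = set(context_ids)
    return [logit - penalty if tok_id in seen else logit
            for tok_id, logit in enumerate(logits)]
```

So with presence_penalty=1.5, every later `print` token is handicapped, which is exactly what you don't want when generating code.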
BelgianDramaLlama86@reddit
Because blocking repetition in code breaks the code.
RegularRecipe6175@reddit
Thank you King
Wildnimal@reddit
Just used this model for the past 2 hours and it has passed most of what I threw at it. Still playing with temperature and top-p. Currently settled on temp 0.6.
VoiceApprehensive893@reddit
big gemmas?
Raredisarray@reddit
I love how they drop this shit on the same day as Opus 4.7 😹😹 LFG Qwen team 🔥🔥
leonbollerup@reddit
NICE…
Could you do an uncensored version of the qwopus 35b a3b please … getting some really good results with that
Clear-Ad-9312@reddit
Some of the quants are not showing on the sidebar on HuggingFace, better look at the files themselves!
tempedbyfate@reddit
Thank you for this release. Any chance you could also create uncensored versions of the two larger Gemma4 models? I think you posted a few weeks ago that you were going to look at them next, but I don't think they have been released yet? Thanks again for all your work!
buttplugs4life4me@reddit
Isn't the P the same as imatrix?
Goldkoron@reddit
What he's doing is just raising the quant level for tensors he deems important for quality for each model. How he determines what to use for each tensor/tensor group I don't know.
I made some quants for 122B and 397B using my own concept of what he's probably doing, where I do a sensitivity scan of all the expert tensors to determine which ones actually affect the model most when upgraded/downgraded, then rank them from most important to least important to get an actual data-driven method of higher-quality quantization.
But my method on 122B, which produces better quants than Unsloth UD according to KLD scores, didn't find the same tensors being upgraded as his K_P quants.
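In pseudocode, the scan is basically this (quantize_with_override and measure_kld are hypothetical stand-ins, e.g. for llama-quantize plus llama-perplexity --kl-divergence):

```python
# Hypothetical sketch of the per-tensor sensitivity scan described above;
# quantize_with_override() and measure_kld() are stand-in helpers, not real APIs.
def rank_expert_tensors(base_gguf: str, tensor_names: list[str],
                        eval_text: str) -> list[str]:
    damage = {}
    for name in tensor_names:
        # Downgrade just this one tensor, then measure how far the
        # output distribution shifts from the unmodified model.
        variant = quantize_with_override(base_gguf, overrides={name: "Q2_K"})
        damage[name] = measure_kld(base_gguf, variant, eval_text)
    # The tensors that cause the most damage when downgraded are the
    # ones worth keeping at higher precision.
    return sorted(damage, key=damage.get, reverse=True)
```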
ikmalsaid@reddit
Thank you for the release. Any plans for the MXFP4 version?
Cloud13X@reddit
Hoping there will be a 9B variant too in the future 🤞🏻
cell-on-a-plane@reddit
What are you running this on?
mindwip@reddit
Nice!
I admit I'm holding off to see if they release a 122B 3.6; if not, I'm coming back for this.
moahmo88@reddit
Great! Thank you!
HopePupal@reddit
the king! appreciate it