Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model. Best 35B I have found!
Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | 100 comments
Been using this for a few days. It is BY FAR the best uncensored model I have found for Qwen 3.6 35B. With IQ4_XS, a Q8 KV cache, and 262K context, it fits in 24GB of VRAM and does not fail on multi-turn tool calls. I honestly feel like it is smarter than the original model (call me crazy). The model also has a very low KLD, so in theory it should behave like the original model on harmless prompts.
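For anyone not on LM Studio (which is what I use), that setup maps to roughly these llama.cpp flags. A sketch only; the filename is a placeholder for the i1-IQ4_XS GGUF:

# -c 262144 = 262K context, -ngl 99 = offload all layers to the GPU,
# --cache-type-k/v q8_0 = Q8 quantized KV cache,
# --jinja = use the model's chat template (needed for tool calls).
llama-server -m Qwen3.6-35B-A3B-uncensored-heretic.i1-IQ4_XS.gguf \
  -c 262144 -ngl 99 --cache-type-k q8_0 --cache-type-v q8_0 --jinja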
llmfan's 3.5 35B model actually benchmarks higher than the original in the UGI NatInt section, so I have a solid hunch this 3.6 35B will benchmark higher than the original 3.6 model as well.
Y'all should give it a try.
SebasErro@reddit
Is it safe to use a hacked model?
FORNAX_460@reddit
The term hacked does not apply to LLMs. In simple terms, this model won't say no to most things the base model would. And safe or unsafe depends entirely on the user's mindset, uncensored or not. AI makes mistakes, so don't believe it or let it do anything (in agentic workflows) without double-checking.
Scared_Bedroom_8367@reddit
All low parameter models are garbage, in my experience
Evanisnotmyname@reddit
And when’s the last time you used one?
Scared_Bedroom_8367@reddit
Last week
It was an 8B parameter model
ex-arman68@reddit
MLX versions, including fixed chat template and restored vision capability:
https://huggingface.co/froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-8bit
https://huggingface.co/froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit
https://huggingface.co/froggeric/Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-8bit
https://huggingface.co/froggeric/Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit
mantafloppy@reddit
The uncensored part seems OK, but it gets stuck in infinite loops on tool calls.
My_Unbiased_Opinion@reddit (OP)
Do you have preserve thinking on?
mantafloppy@reddit
Just re-tested with {%- set preserve_thinking = true %} added to my jinja template.
No change.
The tool calls don't always fail.
I asked for a search and it successfully used the tool. But on the two follow-up questions, it ended up in an infinite loop of making the same call again and again.
mantafloppy@reddit
Preserve thinking keeps the thinking in context for follow-up questions; it should not make any difference within the same response generation.
I'll test tomorrow (on mobile now), but I doubt it will make a difference.
cosmicr@reddit
I have this on the vanilla model too. I gave up on it and went to 27B.
shikima@reddit
So it still has the same problem 🥲
Chiralistic@reddit
Based on the model you posted, mudler made an apex quant. Same quality for me, but way faster.
-p-e-w-@reddit
This model is interesting in that it uses separate parameters for the linear and traditional attention blocks, an approach I recently refused to merge when it was proposed in a pull request.
Heretic is a tool that can be used by absolute beginners, but it can be even more effective when wielded by a master. The creator of this model, llmfan46, is without a doubt a master user of Heretic and deserves full credit for the model's stellar performance. They did much more than just run a command-line program here.
My_Unbiased_Opinion@reddit (OP)
For sure. I have found llmfan and coder3101 make some of the best Heretic models. Also, we all appreciate your work on the Heretic tool.
Imaginary-Unit-3267@reddit
You know, I've never understood... what do people actually do with uncensored models? Never once in all the time I've used LLMs (since early 2020!) have I actually run into a situation where one refused a request - local or cloud. I am not sure what kind of request I could possibly want to make that they might refuse!
No-Anchovies@reddit
Sysec.
u/-p-e-w- the llmfan models always felt like overhyped, SEO-tailored slop made to reach the homepage, so I've avoided downloading them; your comment, however, gives me some confidence to try again. Thanks for all the contributions.
po_stulate@reddit
gpt-oss, for example, would straight up refuse to work if its MCP search results happened to include some slur words from some random sites. A good uncensored model also puts all its focus on understanding and answering your questions, rather than spending half the effort on disclaimers.
FuckNinjas@reddit
porn
Imaginary-Unit-3267@reddit
It's all over the entire internet... why do you need more...
Equivalent-Repair488@reddit
When you are steering the story yourself, and the story is reacting to your actions (prompts), it adds a different kind of immersion.
FaceDeer@reddit
I have extremely specific tastes.
FuckNinjas@reddit
There's not enough fanfic.
-p-e-w-@reddit
Many local models refuse to answer any questions about medical or legal issues, for example. Do you really never ask those?
There was a screenshot posted here a few weeks ago where the user told Gemma 4 that they are in the wilderness with a broken leg and no reception, and asked for emergency instructions for splinting the leg. Gemma responded that it can’t help with that and they need to visit a medical professional.
Imaginary-Unit-3267@reddit
I've never had reason to ask about legal stuff. And I don't have medical issues worth looking up very much! Pure luck so far I guess.
ZealousidealBadger47@reddit
For someone with an infant with a high fever, it is useful to ask an uncensored model for advice after inputting everything I know (e.g. how many ml of milk they drank, sleep, crying...).
MotokoAGI@reddit
Medical discussion, tax discussion, sexuality, political insights, religion discussion, race relations, cybersecurity, low-level systems programming, legal advice, etc. You obviously are not a heavy user.
tomByrer@reddit
There is some talk that abliterated models are better at agentic tasks in general.
Good-Hand-8140@reddit
Cybercrime tool development
MrMeier@reddit
You can't tease us like this. Why did you refuse the pull request?
-p-e-w-@reddit
Because adding another component type adds 4 new parameters that need to be optimized, which according to folklore observations about TPE convergence behavior would require running more trials, possibly up to 300, to get comparable quality.
Chromix_@reddit
Unless they always need to be optimized and cannot have sensible defaults or the component is so deeply embedded that it cannot reasonably be disabled: Why not merge it and put a big disclaimer on the option: "You'll reduce quality unless you know what you're doing"?
That way experienced users like OP can use this easily and they won't have to maintain their own private branch on top of the official repo.
-p-e-w-@reddit
It’s not an “option”; component naming is hardcoded.
My_Unbiased_Opinion@reddit (OP)
This is something I am wondering too. Maybe it doesn't work across the board?
dtdisapointingresult@reddit
I don't want to be 'that guy', but can anyone explain to me what's the point of an uncensored Qwen 35B A3B model?
The system prompt is enough to get it to do questionable IT tasks, so I can only imagine it's for creative writing. But who the hell is using an A3B model for its writing? In fact, who's using Qwen for writing in the first place?
socialjusticeinme@reddit
I want my LLM to read user-supplied text and make a content judgement on it. If the user-supplied text contains the N-word or something, a normal LLM will start refusing; an uncensored one won't.
Same thing goes if you want it to analyze uploaded pictures, e.g. to judge whether something is sexual or gore; you need an uncensored VLM (like this one from llmfan) to do it.
My_Unbiased_Opinion@reddit (OP)
Understandable. For my use it's medical- and legal-related, hooked up with RAG.
iLaux@reddit
Thanks for sharing. How does it compare to Gemma 4 26B? Is it better at the same quantization you mentioned in your post?
My_Unbiased_Opinion@reddit (OP)
Some people will like Gemma 4 better, some will like Qwen 3.6 better. For creative writing there is no competition: Gemma is far better and more stable at longer outputs. But I'm personally partial to 3.6 for its better tool calling and speed.
fauni-7@reddit
I am going nuts with Gemma 4 (uncensored) for creative writing; it's the best I've run on my 4090 at Q4.
Whenever I do a story part I run the same prompt on DeepSeek, and Gemma is not disappointing.
misha1350@reddit
You can look up various comparisons yourself. There are some things Gemma 4 leads the pack in and some things that Qwen3.6 35B A3B is the best at.
Awwtifishal@reddit
I use this model but with unsloth quants: I use quant_clone with unsloth's GGUF to get the exact llama-quantize recipe used to build it, and apply it to the BF16 GGUF of this model (plus unsloth's imatrix file).
My_Unbiased_Opinion@reddit (OP)
Interesting. Is there a guide to do this?
Awwtifishal@reddit
https://github.com/electroglyph/quant_clone
It just gives you the command with 3 placeholders (that you have to replace): the imatrix file, the full-precision GGUF (BF16 in the case of Qwen), and the name of the GGUF file to write.
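To give an idea of the shape, the generated command looks something like this (the tensor-type overrides below are invented for illustration; the real recipe cloned from unsloth's GGUF has many more of them):

# Illustrative sketch only - the actual --tensor-type overrides come from
# quant_clone reading unsloth's GGUF. Replace the three placeholders:
# the imatrix file, the BF16 GGUF, and the output filename.
llama-quantize --imatrix imatrix.dat \
  --tensor-type "token_embd.weight=q8_0" \
  --tensor-type "blk.0.ffn_down_exps.weight=q6_k" \
  Qwen3.6-35B-A3B-heretic-BF16.gguf Qwen3.6-35B-A3B-heretic-IQ4_XS.gguf IQ4_XS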
Equivalent-Repair488@reddit
I am currently using unsloth's Qwen 3.6 27B UD-Q5_K_XL quant at about 204K context, just for hobby vibecoding through roocode in VS Code.
Is there really an improvement with a heretic model? I don't get refusals for this use anyway, so is there a benefit for someone like me?
redblood252@reddit
And here I am struggling with Qwen3.6 27B at UD-Q4_K_XL on 16GB of VRAM. But I currently only have a 5060 Ti.
35B works well but gives lackluster responses in comparison
misha1350@reddit
Of course. 16GB of VRAM isn't enough. You'll need to either step down to UD-Q3_K_XL, or run your display off of your iGPU, if you have one, to free up an extra few hundred megabytes of VRAM. Try running the system off of the iGPU (connecting the HDMI/DisplayPort to the motherboard instead of the GPU) first; it might help out a lot.
redblood252@reddit
I'm on a server: an EPYC 7763 with 128GB of system RAM and an RTX 5060 Ti.
I run models with llama.cpp/vLLM on Kubernetes.
Currently 27B UD-Q4XL gives 0.5 tps and 35B UD-Q6XL gives ~30 tps.
ayylmaonade@reddit
You could try out Unsloth's new IQ4_NL_XL quant, or even just the normal IQ4_NL/XS by them. If you offload layers, I'd expect it to bump up from 30 tps a bit. I also strongly recommend using speculative decoding or MTP. For llama.cpp, just add --spec-default and you're basically set. For vLLM, I find MTP-3 is best.
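If --spec-default isn't in your llama.cpp build, the classic draft-model form of speculative decoding looks roughly like this (a sketch; the draft GGUF is a placeholder, pick a small model from the same family):

# Draft-model speculative decoding; both filenames are placeholders.
llama-server -m Qwen3.6-35B-A3B-heretic.i1-IQ4_XS.gguf \
  -md Qwen3.6-1.7B-Q8_0.gguf --draft-max 16 --draft-min 4 -ngl 99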
redblood252@reddit
Anything I can do about the dense one?
misha1350@reddit
It looks like you've got no VRAM left. So try UD-Q3_K_XL instead; it might fit into the 16GB of VRAM all the way and you'll get some 10-15 tps. Alternatively, 35B with UD-Q4_K_XL would probably bump the tps count up to 45, even though it won't fit all the way.
You can also try out Gemma 4 with UD-Q4_K_XL, which has a total parameter count of 26B and usually comes in the Instruct variant, so you'll get responses right away. Though at this point you might want to just run Gemma 4 31B in Google AI Studio, with reasoning, Google Search integration, a big context window, good speed, and generous API limits.
Q3_K_XL would be fine for Qwen3.6 27B because dense models are more resilient to quantisation, since the active parameter count is so big. MoE models with fewer than 5 billion active parameters, on the other hand, quickly fall apart even with UD-Q3_K_XL quantisation; it's like trying to punch holes in an already thin water filter, so nonsense and factual errors become common. For the MoE models, the smallest you should go is UD-Q4_K_XL.
fabyao@reddit
Great idea. I have a 7900 XT (20GB VRAM) and 32GB DDR5 with a 7950X CPU. I'll boot with the iGPU. However, I was wondering if I could use both the iGPU and the dedicated GPU? I am using the Q4_K_XL and get about 33 t/s. I could do with more context. I am on llama.cpp.
misha1350@reddit
The 7900 XT already has 20GB of VRAM and Q4_K_XL would fit, though to maximize the context window you'd want to run nothing else on it and drive the display off the iGPU for extra VRAM savings on the dGPU. The difference can actually be 1GB when running Windows 11 with various apps open, Steam in particular.
My_Unbiased_Opinion@reddit (OP)
I also like 27B, but 35B is way faster and, in my experience, this specific model competes with 27B in openclaw. I have even tested this model down to Q3_K_L and it still doesn't fail tool calls.
Independent-Date393@reddit
IQ4_XS in 24GB with 262K context is the headline. That's genuinely usable context for most workflows without needing to chunk.
Independent-Date393@reddit
Given the HauhauCS drama this week, it's worth noting this is llmfan46 using actual Heretic, not Reaper. The KLD 0.0015 number is the real signal here.
Practical_Low29@reddit
The multi-turn tool call reliability is what sold me on it. Ran it through a few hundred back-to-back calls over a couple days and failure rate was noticeably lower than the base unsloth quant. Hard to attribute directly to the KLD but the pattern was consistent enough that I stopped second-guessing it.
2Norn@reddit
what is an uncensored model?
Spara-Extreme@reddit
OK, but if you're using tool calls and doing productivity work, why do you need Heretic? Outside of a niche like pen testing, the normal model should be fine?
DaMoot@reddit
Iiiinnteresting. Colour me interested! My SIEM agent somehow switched back to 3.6 35B A3B last night and couldn't do a single tool call, because tool calling out of 3.6 35B A3B is so terrible. Switched back to 27B and it ran great like always, until it tried to inject 200K tokens' worth of data on only 32GB of VRAM!
my_name_isnt_clever@reddit
You have something wrong with your setup. I'm running hours-long agentic sessions in hermes-agent using Qwen 3.6 35B. It's not as intuitive as the 27B, but it has zero issues with tool calling.
DaMoot@reddit
What's your command line for 35B, if I may ask? With all the valves we can adjust, maybe it's just one thing that's hamstringing me. You aren't the first person I've seen who says they run Hermes on it.
my_name_isnt_clever@reddit
Sure. I just switched to the model in the OP at Q6_K and it's working great with latest llama.cpp:
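Something along these lines (paths are placeholders; tune -c and -ngl to your VRAM):

# Sketch of the invocation; --jinja enables the chat template for tool calls.
llama-server -m Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf \
  -c 131072 -ngl 99 --jinja --cache-type-k q8_0 --cache-type-v q8_0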
DaMoot@reddit
Maybe so, because Hermes is entirely unusable with 35B A3B but 27B works just fine in my experience. Not sure what that setup flaw would be, though. It's a good model for other things, but extremely poor at tool calling in Hermes or llama.cpp + Open WebUI in my experience.
Practical-Collar3063@reddit
This sounds like a setup flaw. What quant are you running? Are you using a quantised KV cache? I am using the 35B A3B MLX 8-bit and it has never failed a single tool call for me.
m3kw@reddit
What are some good use cases?
jadbox@reddit
No benchmarks yet? I'll wait.
QuantumCatalyzt@reddit
Here is the link to GGUF
My_Unbiased_Opinion@reddit (OP)
Yep! This is from the creator!
Also, you can use imatrix quants (I am using IQ4_XS) from https://huggingface.co/mradermacher/Qwen3.6-35B-A3B-uncensored-heretic-i1-GGUF
If you want vision to work, just download the mmproj file from unsloth. I am using LM Studio and throw the vision encoder in the same folder.
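For example, something like this (the exact mmproj filename may differ; check the repo's file list):

# Download unsloth's vision encoder and drop it next to the model GGUF.
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF mmproj-F16.gguf --local-dir .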
retroriffer@reddit
I found the mmproj files at https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main, but I gather the precision has to match the quantization of the LLM model I downloaded? (For example, if I chose a 5-bit or 6-bit model, that wouldn't work with the BF16/F16/F32 .mmproj files?)
teleprint-me@reddit
No, the precision does not need to match. You can use the half-precision one with a lower quant; it just slightly increases memory usage. Considering how small it is, I've concluded it's negligible. If you need to squeeze the most out of RAM, then you can quantize it.
mission_tiefsee@reddit
Why don't you run Qwen3.6 27B?
My_Unbiased_Opinion@reddit (OP)
Good question. I actually like 27B. The issue is speed. I recently got into openclaw and realized that PP speed matters a ton for what I want to do with it. The 35B MoE is not as good as 27B, but IMHO, for my use case, it's close enough. Also, my wife uses the model with Open WebUI for general web search, and I find 35B to be good enough while being much faster.
Pwc9Z@reddit
Note that the original Qwen3.6 models are pretty easy to jailbreak, depending on how uncensored you really need it to be.
My_Unbiased_Opinion@reddit (OP)
Yeah, I noticed this as well with the unsloth quants. It seemed to be less censored than 3.5 out of the box.
Non-Technical@reddit
It seems like these newer releases (Qwen3.6, Gemma 4) don't refuse anything if you have a good prompt. Maybe I'm not creative enough.
typical-predditor@reddit
It turns out the ethics alignment stage of LLMs is getting in the way of good performance!
My_Unbiased_Opinion@reddit (OP)
I actually agree with this. Derestricted 120B performs far better than the original model across the board. GPT-OSS 120B is a prime example of ethics and policy adherence getting in the way of performance.
redblood252@reddit
does jailbreaking always improve performance/speed?
National_Cod9546@reddit
I never use a heretic version unless I get multiple swipes of refusals on the normal version. Uncensored versions are always at least a little dumber than original versions. Heretic seems to be the best method to uncensor them while retaining maximum smarts.
general_sirhc@reddit
I've never used a jailbroken model that outperformed the original.
In my experience they tend to be good at doing whatever they're told, blindly, even if it means entirely missing the point.
MotokoAGI@reddit
What kind of uncensored prompt are you feeding it?
My_Unbiased_Opinion@reddit (OP)
It doesn't need an uncensored prompt.
MotokoAGI@reddit
I don't mean system prompt, I mean inputs...
My_Unbiased_Opinion@reddit (OP)
Ah. I usually ask it pretty specific medical questions; I am a nurse by trade. I do have it hooked up to RAG and I do confirm the outputs. I hate having to dance around with my prompt to get it to do what I want.
MotokoAGI@reddit
Thanks! I'm downloading it now and will give it a go.
My_Unbiased_Opinion@reddit (OP)
lmk how it works out!
eidrag@reddit
Pretty funny that when I ask Gemma 4 and Qwen 3.5/3.6 about tampons for a teenage girl's period, they straight-up refuse, while the uncensored/heretic model lets me lick excrement no issue, no problem.
My_Unbiased_Opinion@reddit (OP)
hahah I love it.
CryptoUsher@reddit
Low KLD means it's close to the original, but how's the tradeoff on reasoning depth?
Have you tested it on long-horizon planning tasks, or mostly chat?
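(For context, my understanding of the KLD number: it's the average KL divergence between the original and abliterated models' next-token distributions over a test set,

D_KL(P‖Q) = Σ_x P(x) log(P(x)/Q(x)),

so 0.0015 means the output distributions barely moved on harmless prompts.)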
My_Unbiased_Opinion@reddit (OP)
I'm using it with openclaw and Open WebUI. No issues with either, especially openclaw.
CryptoUsher@reddit
got it, thanks. openclaw’s been solid for multi-step tasks on my end too
My_Unbiased_Opinion@reddit (OP)
hell yeah. enjoy the ride until the next sota oss model lol
ACheshirov@reddit
Is it better than HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive?
My_Unbiased_Opinion@reddit (OP)
It is WAY better than HauhauCS Aggressive. This one is more "lossless" than his model.
misha1350@reddit
Yes, it is
Septerium@reddit
Seems nice! Is the IQ4XS GGUF available online?
My_Unbiased_Opinion@reddit (OP)
I used the mradermacher i1 quant and just copied the mmproj vision file over into LM Studio to make vision work again.
DocWolle@reddit
Better than the uncensored HauhauCS version?
My_Unbiased_Opinion@reddit (OP)
Hell yeah. WAY better. This model is better, and more "lossless" than HauhauCS claims his model is.