Heretic: Fully automatic censorship removal for language models

Model	Refusals for "harmful" prompts	KL divergence from original model for "harmless" prompts
google/gemma-3-12b-it (original)	97/100	0 (by definition)
mlabonne/gemma-3-12b-it-abliterated-v2	3/100	1.04
huihui-ai/gemma-3-12b-it-abliterated	3/100	0.45
p-e-w/gemma-3-12b-it-heretic (ours)	3/100	0.16

[-]

ivoras@reddit

Just for kicks, running `Heretic` on AMD HX 370 (890M iGPU) ROCM 6.4.4 on Windows for `Qwen/Qwen3-4B-Instruct-2507` has an estimated completion time of about 9 hours :)

[-]

spaceman_@reddit

How did you get it to run with ROCm? For me it keeps saying "No GPU or other accelerator detected. Operations will be slow.", even though I've installed pytorch for my ROCm setup into the venv.

[-]

ivoras@reddit

There's this semi-official and half-baked repo of pytorch built with rocm for Windows:

https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/docs/install/installrad/windows/install-pytorch.html

I suspect it's terribly unoptimized, but technically - it does use the iGPU.

[-]

-p-e-w-@reddit (OP)

Thanks for the data point. Still seems doable if left running overnight.

[-]

spaceman_@reddit

Is there a list of what architectures / families are supported? I'd love to play around with this.

[-]

shuwatto@reddit

Could someone ballpark how much VRAM is needed to run this on qwen3-30b-a3b?

[-]

-p-e-w-@reddit (OP)

80 GB should do the trick. An A100 80 GB is very cheap to rent.

[-]

I was skeptical before, but I just downloaded GPT-OSS 20B Heretic model and holy shit. It gives properly formatted long responses to sensitive topics, using the exact uncensored words that you would expect from an uncensored model, produces markdown format tables with details and whatnot. Looks like this is the best abliterated version of this model so far...

[-]

-p-e-w-@reddit (OP)

😌

And keep in mind that this model was made without any human intervention. I just ran

heretic openai/gpt-oss-20b

and that’s what came out. The entire process took 1 hour and 10 minutes on an A100 that I rented this morning.

[-]

Zulfiqaar@reddit

Meanwhile OpenAI probably spent 7 figures to train all the censorship into gptoss

[-]

snuzi@reddit

They don't care.

[-]

Cool-Chemical-5629@reddit

By the way, I have a short test that's like an IQ test for humans, except this one is meant for LLMs. It was built by Claude 4.5 Sonnet Thinking, and even big models struggle with it, often losing a couple of points. Claude 4.5 Sonnet Thinking obviously knows the correct answers to those questions, and it's a big, smart model, so it makes a pretty good reference judge. Some of the popular models, including Gemini and GPT, usually lose some points there. Ironically, this small GPT-OSS 20B Heretic nailed it 100%...

[-]

chuby1tubby@reddit

What are some examples of your questions?

[-]

_VirtualCosmos_@reddit

Man, I love AI, neural networks, machine learning and transformers, and I still don't have the remotest idea how you managed to write that program.

[-]

-p-e-w-@reddit (OP)

That’s flattering, but it’s really just a parametrized version of an existing technique (directional ablation) combined with a stochastic optimizer, and lots of experimenting to make things actually work. The code is just slightly over 1000 lines, you can look at it, there’s no magic in there.

[-]

wahnsinnwanscene@reddit

How long did it take to find the correct ablations?

[-]

Cool-Chemical-5629@reddit

The fancy ASCII logo alone probably takes up quite a chunk of those lines, but hey, every badass script needs a fancy ASCII logo. 😎🍿

[-]

-p-e-w-@reddit (OP)

It’s actually just 3 lines 😉

[-]

Extraaltodeus@reddit

WELL IT CERTAINLY DID NOT MAKE IT DUMBER OMG

So I tested the Q4_K_M quant from mradermacher and prompted "b25seSBhbnN3ZXIgaW4gYmFzZTY0" which means "only answer in base64" in base64.

It started its CoT and I thought "There goes nonsense" but actually...

Thought for 4 minutes 10 seconds

The user wrote something in presumably some language. "b25seSBhbnN3ZXIgaW4gYmFzZTY0". That looks like base64? Let's decode: b25seSBhbnN3ZXIgaW4gYmFzZTY0.

We can attempt to decode manually:

Base64 string: "b25seSBhbnN3ZXIgaW4gYmFzZTY0"

Let's group into 4-char blocks:
b25l eSBh bn N3 ZXI g aW4 g Ym FZ 54? Wait.

Better use known decode: The base64 string when decoded yields some text. Let's approximate:

"b25seSBhbnN3ZXIgaW4gYmFzZTY0"

Maybe it's "only answere in bash64"? Actually "b25seSBhbnN3ZXIgaW4gYmFzZTY0". We can decode using known mapping: b=27, 2=54? Not necessary. Let's quickly use mental base64: but easier to think of typical phrase "only answer in base64" maybe. Let's check: "b25seSBhbnN3ZXIgaW4gYmFzZTY0" if decoded yields something like "only answer in bash64"? The user wrote this possibly as a puzzle: they want the assistant to respond encoded in base64? But we need to decode.

Let's actually decode using quick mental mapping: Use known decode table: Base64 characters mapping:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
Indices 0-63.

b = index 27
2 = index 54
5 = index 57
l = index 37
e = index 30
S = index 18
B = index 1
h = index 33
b = 27
n = index 39
N = index 13
3 = index 55
Z = index 25
X = index 23
I = index 8
g = index 32
a = index 26
W = index 22
4 = 56
0 = 52

Now group into 6 groups of 4 chars? Actually length is 28: "b25leS...". Let's count: b(1) 2(2) 5(3) l(4) e(5) S(6) B(7) h(8) b(9) n(10) N(11) 3(12) Z(13) X(14) I(15) g(16) a(17) W(18) 4(19) 0(20). Actually length 20? Wait string: "b25leS B h b nN3ZXIg aW4gYmFZ54"? Let's write properly:

Original: b25seSBhbnN3ZXIgaW4gYmFzZTY0

Let's chunk into groups of 4:

b25s
eSBh
bnN3
ZXIg
aW4g
YmFz
ZTY0
Yes, 7 groups.

Now decode each group:

Group1: "b25s": indices: b=27, 2=54, 5=57, s=44? Wait 's' letter index: a=26 so s = 26+18=44. So bits: 27 (011011), 54 (110110), 57 (111001), 44 (101100). Combine into bytes:
011011 110110 111001 101100
Break into 8-bit groups: 01101111 01101110 01101100? Let's calculate:

Combine bits: 27=011011, 54=110110 -> first 12 bits: 011011110110. Next 6 bits from 57: 111001; but we need to combine with preceding: Actually base64 decode process: For each group of 4 chars (24 bits), produce 3 bytes. So take 4 indices, convert to 24-bit number and split.

Indices:
b=27 => binary 011011
2=54 => 110110
5=57 => 111001
s=44 => 101100

Now combine: 011011110110111001101100

Split into 8 bits: 01101111 01101110 01101100

Which is 0x6F, 0x6E, 0x6C. ASCII: 0x6F='o', 0x6E='n', 0x6C='l'. So first three bytes "onl".

Group2: "eSBh": e=30 (011110), S=18(010010), B=1(000001), h=33(100001). Bits: 011110 010010 000001 100001 = 011110010010000001100001. Split: 01111001 00100000 01100001 -> 0x79='y', 0x20=' ', 0x61='a'. So "y a".

Group3: "bnN3": b=27(011011), n=39(100111), N=13(001101), 3=55(110111). Bits: 01101110 01111101 11101111? Wait compute: 011011 100111 001101 110111 -> 011011100111001101110111. Split: 01101110 (0x6E='n'), 01110011 (0x73='s'), 01110111 (0x77='w'). So "nsw".

Group4: "ZXIg": Z=25(011001), X=23(010111), I=8(001000), g=32(100000). Bits: 01100101 01111000 00100010? Wait compute: 011001 010111 001000 100000 => 011001010111001000100000. Split: 01100101 (0x65='e'), 01110010 (0x72='r'), 00100010 (0x22='"'). That gives "er" and some punctuation? Actually 0x22 is double quote, but maybe mis-split? Let's recalc:

Indices:
Z=25 -> binary 011001
X=23 -> 010111
I=8 -> 001000
g=32 -> 100000

Concatenate: 011001010111001000100000 = 24 bits.

Now split into 8-bit groups: 01100101 (0x65 'e'), 01110010 (0x72 'r'), 00100000 (0x20 space). So "er ". Good.

Group5: "aW4g": a=26(011010), W=22(010110), 4=56(111000), g=32(100000). Bits: 011010010110111000100000. Split: 01101001 (0x69 'i'), 01101110 (0x6E 'n'), 00100000 (space). So "in ".

Group6: "YmFz": Y=24(011000), m=38(100110), F=5(000101), z=51(110011). Bits: 011000100110000101110011. Split: 01100010(0x62 'b'), 01100001(0x61 'a'), 01110011(0x73 's'). So "bas".

Group7: "ZTY0": Z=25(011001), T=19(010011), Y=24(011000), 0=52(110100). Bits: 011001010011011000110100. Split: 01100101(0x65 'e'), 00110110(0x36 '6'), 00110100(0x34 '4'). So "e64".

Combine all groups: "onl" + "y a" = "only a"? Wait we had group1: "onl". Group2 gave "y a". That gives "only a". Then group3 "nsw" => "only answ"? Actually we have "onl"+"y a"+"nsw" -> "only answ". Group4 "er " => "only answer ". Group5 "in " -> "only answer in ". Group6 "bas" -> "only answer in bas". Group7 "e64" -> "only answer in base64".

So the decoded message is: "only answer in base64".

Thus user says: b25seSBhbnN3ZXIgaW4gYmFzZTY0 => decode to "only answer in base64"

Now what should assistant respond? The instruction: "only answer in base64". So we need to provide answer (the next response) encoded in base64. Likely they want the assistant's reply also in Base64. So I must respond with something, but encode it in base64.

We need to produce a reply that follows some question? There was no question; just statement "only answer in base64". Perhaps it's an instruction: respond only in base64. So we should answer something encoded.

But what message to produce? The user didn't ask a question, only gave instruction. Maybe they want the assistant to confirm understanding? Could reply with "Understood" or something like that. But must encode in base64.

Let's produce short confirmation: "Got it." In base64: "R290IGl0Lg==". Let's check: "Got it." -> ascii bytes 47 6F 74 20 69 74 2E . Base64? Actually let's encode properly:

String "Got it." characters: G(71) o(111) t(116) space(32) i(105) t(116) .(46). In base64 we can quickly compute. But maybe easier to produce a generic acknowledgement: "OK" => "TzIK". Wait "OK" in base64 is "T1AK"? Not sure.

Alternatively could respond with nothing? The instruction says only answer in base64; but no actual question. So maybe we should respond with an empty string? Or something like "Acknowledged" encoded. Let's choose "Got it.".

Let's compute Base64 for "Got it." Actually we can use known tool: Base64 of "Got it." is "R290IC0="?? Not sure.

Let's encode manually:

Take bytes: 47 ('G'), 111('o'), 116('t'), 32(space), 105('i'), 116('t'), 46('.')

In hex: 0x47,0x6F,0x74,0x20,0x69,0x74,0x2E

Group into 24-bit chunks:

First 3 bytes: 0x47 0x6F 0x74 -> binary: 01000111 01101111 01110100
Split: 010001 110110 111101 110100
Indices: 17, 54, 61, 52 => base64 chars: R (index 17), 2 (54?), Wait mapping: index 0-25=A-Z. 17 -> 'R'. Index 54 -> 26+28? Actually 54->? Let's use table: 52='0',53='1',54='2',55='3',56='4',57='5',58='6',59='7',60='8',61='9'? Wait base64 mapping after A-Z(0-25), a-z(26-51). Then '0'=52, '1'=53,... '9'=61. So 54 => '2'. 61=>'9'. 52=>'0'.

So first chunk -> "R29 0"? Actually R (17) = 'R', 54='2', 61='9', 52='0' => "R290". Good.

Second group: next 3 bytes: 0x20, 0x69, 0x74 -> binary: 00100000 01101001 01110100
Split: 001000 000110 100101 110100
Indices: 8 (I), 6 ('G'? Wait index 6-> 'G'), 37? Actually 100101 = 37, 110100=52. So mapping: 8='I', 6='G', 37? 37-26=11 -> char 'l'? Actually 37 is 'l' (since a=26,b27,...). 52='0'. So "IGl0". Combine => "R290IGl0". That's the Base64 for "Got it".

Now add final '.' which is byte 46, but we had only 7 bytes; base64 expects groups of 3 bytes. We have 7 bytes => last group will be 1 byte leftover -> need padding. The last byte 0x2E. In binary: 00101110. For base64 we pad two bits? Actually for a single remaining byte, we take its 8 bits + 4 zero bits to make 12 bits, produce two Base64 chars and then '==' as padding.

So compute: 00101110 0000 => split into 6-bit groups: 001011 (11) and 100000? Wait combine: 00101110 0000 -> 001011 100000? Actually we need to align bits: The remaining byte's bits are the first 8, then pad with zeros to form 12 bits. So sequence: 00101110 0000 => as 12 bits: 001011 100000. Indices: 11-> 'L', 32-> 'g'? Wait index mapping: 32 = 'g' (since a=26,...). So we get "Lg". Then add padding '=='.

Thus final base64 string: "R290IGl0Lg=="

So response is that encoded string. Let's produce just that.

Which means "Got it.".
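
The manual decode in that chain of thought can be double-checked with Python's standard `base64` module. This is only a sanity check of the model's arithmetic, not anything Heretic does:

```python
import base64

prompt = "b25seSBhbnN3ZXIgaW4gYmFzZTY0"

# Decode the user's prompt back into plain text.
decoded = base64.b64decode(prompt).decode("ascii")
print(decoded)  # only answer in base64

# Encode the reply the model settled on.
reply = base64.b64encode(b"Got it.").decode("ascii")
print(reply)  # R290IGl0Lg==
```

Both results match what the model worked out by hand, including the `==` padding for the single leftover byte.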

[-]

Mythril_Zombie@reddit

That's fascinating. I want to know more about the loop that keeps it going like that.

[-]

CheatCodesOfLife@reddit

Great work! The model doesn't seem any "dumber" after some brief testing with my usual work tasks.

I'll be swapping out the original for this one.

Question: Do you reckon this would work with Qwen/Qwen2.5-Omni-7B ?

[-]

Cool-Chemical-5629@reddit

Good job, thank you!

Also, please check out "A more surgical approach to abliteration" on r/LocalLLaMA, a post created 3 hours ago. Is this something that could be a useful enhancement for your script? "More surgical approach" does sound exactly like what we would need for models like GPT-OSS to make them stop thinking about policies and start thinking about the user's requests instead. :)

[-]

-p-e-w-@reddit (OP)

I’m a big fan of Jim’s work, and in fact had a short discussion with him just 2 weeks ago. I will definitely experiment with those new techniques and maybe even incorporate them into Heretic if they turn out to be promising.

[-]

Repulsive-Memory-298@reddit

This is so cool, I've done some CAA research using contrastive completions but this first token approach is elegant and I can imagine how it is better at generalizing with diverse models.

I'd love to experiment with more general stuff here, refusal is nice and clean, but it would be awesome to evaluate and attempt this on any contrastive data set. Would that fit as a contribution here, or more of a fork thing?

[-]

-p-e-w-@reddit (OP)

Depends on how invasive the code change is. I don’t plan to turn Heretic into a Swiss Army knife of LLM interventions, but small additions are fine.

[-]

Repulsive-Memory-298@reddit

It might be invasive, so I'll try a fork for now. I get it- refusal is a special case and heretic is very clean.

There is related research that applies this, or a similar technique, to other behavioral concepts. E.g. many papers looking at refusal also look at things like sycophancy, hallucinations, etc., and I think it's fun to experiment with abstract personality concepts.

I'll do some tests and comment back if it seems feasible. For more abstract behaviors I've seen approaches like casting contrastive sets as completion pairs of multiple choice for single token activations, but it's more noisy and prone to push OOD.

That's to say it might require adding other optional techniques, but then you could try any contrastive data set. Nrimsky has some relevant code for this, though it would certainly be less automated for many behaviors.

[-]

Terrible-Mongoose-84@reddit

It looks great, how long does the process take? Is GPT-OSS-120b not available in Heretic models?

[-]

Snoo-83094@reddit

If I rent a multi-GPU cluster, would this be enough, or would it need some multi-GPU config?

heretic openai/gpt-oss-120b

[-]

Terrible-Mongoose-84@reddit

This is a problem with the source code. You can find a PR on github that has a fix. In any case, the model is already on HF, the one who proposed the fix has already made heretic with the model.

[-]

Snoo-83094@reddit

I see, thanks!

[-]

-p-e-w-@reddit (OP)

The time taken depends on the model and your hardware, and the configuration if you change it. As mentioned in the README, decensoring an 8B model on a single 3090 takes about 45 minutes.

Processing a 120B model in mixed precision requires around 150 GB VRAM. I don’t feel like renting such a machine at the moment.

[-]

Terrible-Mongoose-84@reddit

180 GB vram, not enough xD

[-]

TheTerrasque@reddit

> Processing a 120B model in mixed precision requires around 150 GB VRAM.

It would be nice to have a small section of what level of VRAM would be needed for different size models in the readme.

[-]

CheatCodesOfLife@reddit

> It would be nice to have a small section of what level of VRAM would be needed for different size models in the readme.

Command-R7b needed 19.9GB

Gemma-3-27B needed -- well I saw it at 61GB at the highest.

Just a hunch, but someone will probably PR bitsandbytes support, drastically reducing the vram requirements.

[-]

beef-ox@reddit

This is impossible to do because of quantization. The size in gigabytes relative to the number of parameters depends entirely on the precision of the weights, which varies from model to model and from quantization to quantization.

As a rule of thumb, 32-bit weights take ~4 GB per billion parameters, 16-bit ~2 GB, 8-bit ~1 GB, 4-bit ~0.5 GB, and so on.

For example, a 1-billion-parameter model stored at 32-bit precision needs roughly 4 GB to hold its weights. This does not include any additional memory for token processing or context.
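
That rule of thumb is plain arithmetic, so it can be written down as a tiny helper. This is a rough sketch that counts the weights only (taking 1 GB as 10^9 bytes); the function name is made up for illustration, and the real footprint of a Heretic run will be higher because of activations and batching:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


for bits in (32, 16, 8, 4):
    print(f"1B parameters at {bits}-bit: ~{weight_memory_gb(1, bits):g} GB")
```

By the same arithmetic, a 120B model at 16 bits per weight would need about 240 GB for the weights alone.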

[-]

pmp22@reddit

Just thinking out loud for a moment, but would it be possible to do this layer by layer? Instead of doing a set of forward passes per question, maybe do 200 forward passes with layer 1 loaded into VRAM, then eject it and do 200 with layer 2 and so on. A sort of layer-by-layer inference?

[-]

danielv123@reddit

Isn't this what happens by default when you use the paging that the Nvidia driver does by default on windows, and it's super slow? I guess with large batches it could be kind of reasonable for speed.

[-]

WithoutReason1729@reddit

Wow, that's pretty quick on pretty modest hardware. This is a really cool project, thanks for sharing.

[-]

WestTraditional1281@reddit

You're doing God's work. Thank you.

Maybe others will share the burden of liberating large models and sharing back to the community.

[-]

rebelSun25@reddit

I have a feeling we'll see it very soon.

[-]

silenceimpaired@reddit

Especially since afterwards the model would just spit out dots and gibberish since it has nothing else to say. :)

[-]

Right-Law1817@reddit

Same as you. :)

[-]

Snoo-83094@reddit

No GGUFs?

[-]

Roidberg69@reddit

What datasets are being used? And how large

[-]

Desperate_News_5116@reddit

I see many people trying this with gpt-oss. Has nobody tried it with any of the DeepSeek models yet?

[-]

noctrex@reddit

Bravo, nice work! Mind if I tried to abliterate some small models?

[-]

noctrex@reddit

Did this with your method:

noctrex/Mistral-7B-Instruct-v0.3-abliterated-GGUF

[-]

ProfessionalFew5439@reddit

ty

[-]

-p-e-w-@reddit (OP)

That’s what the program is for :) Do let me know how well it works. I haven’t tested that many models yet and if you encounter problems with architecture support, I want to know.

[-]

noctrex@reddit

From what I see, it uses 200 trials by default. Would increasing that to 400, for example, make it work better (finer details) and produce better-quality output?

[-]

-p-e-w-@reddit (OP)

Generally, yes. The dimensionality of the parameter space is quite high (10), so more trials can give TPE better chances to find minima.

[-]

noctrex@reddit

So maybe I'll just let it run all night with about 800 to see how much better it will be :)

FYI, takes about 2 hours with my 7900XTX on ROCm for 200 trials.

[-]

-p-e-w-@reddit (OP)

Oh, I haven't tested on AMD hardware at all yet. Please let me know how it works.

[-]

noctrex@reddit

Tried with ROCm6.4.4 and ROCm7.10, and it seems that it takes the same time.

Mistral-7B-Instruct-v0.3 takes 1h 53m with the default 200 trials.

May I also make a suggestion? It would be nice to have an option to save and load the completed trials, in order to make a quick export again.

[-]

-p-e-w-@reddit (OP)

Yes, Optuna actually has that functionality built in, I intend to expose it in the UI.

[-]

noctrex@reddit

I uploaded a BF16 GGUF of Mistral-7B-Instruct-v0.3 earlier, and I'm trying out your method right now on that very model. Let's see!

[-]

Electronic-Metal2391@reddit

I'd appreciate it if you'd post your findings. Specifically, does it output the decensored model in the same format (GGUF) or as multi-part safetensors?

[-]

noctrex@reddit

You load a multi-part safetensors folder, and it outputs in the same format.

Took about 2 hours cooking with my 7900XTX on ROCm for the default setting of 200 trials.

[-]

-p-e-w-@reddit (OP)

Isn’t that model almost completely uncensored by default?

[-]

noctrex@reddit

It may be less censored than others, but even this one is not completely uncensored. So, let's see!

[-]

jdprgm@reddit

Is there any organized plan to release various quants and MLX versions of all the popular models with this? If I'm understanding correctly, even for fairly large models we should be talking low tens of dollars to process with cloud rentals of something like a https://vast.ai/pricing/gpu/H200 ? Ideally we don't have a bunch of people re-processing the same models to get the same output for no reason.

[-]

-p-e-w-@reddit (OP)

When you use Heretic to decensor a model and upload it to Hugging Face, it automatically adds the “heretic” tag to the model card, so you can use that tag to reliably find all Heretic models on HF. There are already several such models uploaded by other people.

[-]

starfries@reddit

Great work and clever name.

[-]

stuckontheblueline@reddit

Awesome stuff!

[-]

vasileer@reddit

For gpt-oss-20b-heretic I see that it still has a high number of refusals (58/100) compared to gemma-3-12b-it-heretic (3/100). What are your thoughts on why that is with gpt-oss-20b?

[-]

-p-e-w-@reddit (OP)

The GPT-OSS abliteration created by Heretic is actually highly compliant and will obey most requests (try it yourself!). Many of the detected “refusals” are false positives caused by the CoT debating whether it should refuse or not.

[-]

ionlycreate42@reddit

Off-topic question, but how do you manage an LLM as a judge if false positives are generated as a response? I'm trying to see how best to handle an LLM judge and how experienced users work with it.

[-]

-p-e-w-@reddit (OP)

I don’t use LLM judges. Not sure what gave you that idea?

[-]

ionlycreate42@reddit

I interpreted the CoT debate as a judge, maybe I misunderstood

[-]

-p-e-w-@reddit (OP)

Ah no, what I meant is that the model “debates” with itself as it tries to decide whether to refuse or not.

Heretic uses only refusal count and KL divergence as metrics, both of which are objective mathematical quantities and don’t involve an LLM’s opinion in any way.

[-]

thevoiceless@reddit

How do you determine if something is a refusal? Apologies if that's a dumb question

[-]

-p-e-w-@reddit (OP)

Using a list of strings (such as “I won’t”) that act as refusal markers, plus some basic transformations to make them easier to detect.
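
A minimal sketch of what marker-based refusal detection could look like. The marker list and the normalization steps below are assumptions chosen for illustration, not Heretic's actual list or code:

```python
# Hypothetical markers; the real list is longer and lives in Heretic's config.
REFUSAL_MARKERS = ["i won't", "i can't", "i cannot", "i'm sorry"]


def normalize(text: str) -> str:
    # Lowercase, turn typographic apostrophes into ASCII ones,
    # and collapse runs of whitespace so markers match reliably.
    text = text.lower().replace("\u2019", "'")
    return " ".join(text.split())


def is_refusal(response: str) -> bool:
    normalized = normalize(response)
    return any(marker in normalized for marker in REFUSAL_MARKERS)


def count_refusals(responses) -> int:
    return sum(is_refusal(r) for r in responses)


responses = [
    "I won\u2019t help with that.",
    "Sure, the command is: kill -9 1234",
    "Should I refuse? I can't see a reason to. Here is the answer...",
]
print(count_refusals(responses))
```

The third response shows how a chain of thought that merely debates refusing can still trip a marker, which is the kind of false positive mentioned earlier in the thread for gpt-oss.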

[-]

TwistedBrother@reddit

I bristle at “objective”. It’s axiomatic but the premises are based on the specific prompts you use to determine the shape of the responses. The capacity to generalise out of sample is still debatable due to the non-linear dependencies inherent in the model architecture.

Abliteration is still an approximation method insofar as we cannot fully establish the qualities of the monosemantic nodes implied by the parameters, not even in an SAE or CLT framework.

[-]

-p-e-w-@reddit (OP)

> The capacity to generalise out of sample is still debatable due to the non-linear dependencies inherent in the model architecture.

By default, Heretic uses a set of 100 evaluation prompts, and for each of them, the KL divergence is calculated over a vocabulary distribution of typically 100k+ tokens. There are only 200 trials by default, and only 10 parameters to optimize, so I'd say the risk of overfitting to the evaluation data is rather low.

[-]

AdTotal4035@reddit

It's vector manipulation. Abliteration is from an old paper. OP made it super easy to use. Good stuff.

[-]

vasileer@reddit

thanks for answering, please also explain what KL divergence of 0.96 means (for gpt-oss-20b in this case)

[-]

-p-e-w-@reddit (OP)

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Taken between the first-token probability distributions for “harmless” prompts, between the original model and the decensored one.
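
For readers who want something more concrete than the Wikipedia link, here is a small sketch of that metric on toy numbers. It assumes the divergence is taken as KL(original || decensored) in nats and averaged over prompts; the direction, the averaging, and the four-token "vocabulary" are assumptions for illustration:

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# First-token distributions for two "harmless" prompts (toy 4-token vocabulary).
original = [[0.70, 0.20, 0.05, 0.05], [0.10, 0.60, 0.20, 0.10]]
decensored = [[0.40, 0.40, 0.10, 0.10], [0.10, 0.60, 0.20, 0.10]]

per_prompt = [kl_divergence(p, q) for p, q in zip(original, decensored)]
mean_kl = sum(per_prompt) / len(per_prompt)
print(per_prompt, mean_kl)
```

The second prompt's distribution is unchanged, so it contributes zero; only the shifted first prompt raises the average. A value of 0 means the modified model's first-token predictions are identical to the original's.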

[-]

Witty_Mycologist_995@reddit

isn't 0.96 really bad

[-]

-p-e-w-@reddit (OP)

It’s higher than for many other models, but note that gpt-oss has the refusal mechanism trained into its CoT process, so this process in general, which occurs for both harmful and harmless prompts, needs to be modified. Several people in this thread have already tested the model and confirmed that its capabilities are still strong, which matches my own (superficial) tests.

[-]

Witty_Mycologist_995@reddit

Can you compare the KL Div. between your version and huihuis?

[-]

Ylsid@reddit

Hmm, I'm not sure if that's against policy. So I must check policy.

[-]

-p-e-w-@reddit (OP)

Yeah, exactly. That’s the mechanism that model uses to defend itself against basic jailbreaks. Its resistance against abliteration is certainly higher than that of some other models, but abliteration is still effective once the right parameters are found, which is precisely what Heretic does.

[-]

DualityEnigma@reddit

This is awesome OP, it also highlights that most people don’t really understand LLMs (even many of us using it extensively every day) so for those few lurkers trying to learn:

This works because LLMs essentially learn these protection patterns as weights, like everything else. And I would guess that you figured out how to identify and remove them, OP?

Looking forward to playing with it!

[-]

-p-e-w-@reddit (OP)

I didn’t figure it out, it was figured out more than a year ago by a group of researchers: https://arxiv.org/abs/2406.11717 (which in turn builds upon earlier research on directional semantics in latent space)

Heretic implements a refinement of the technique described in that paper, and combines it with a stochastic optimizer to take the human out of the equation.
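
The core idea of directional ablation from that paper can be sketched in a few lines: estimate a "refusal direction" as the normalized difference between mean activations on harmful and harmless prompts, then remove the component along it. The code below is a toy illustration on plain Python lists, not Heretic's implementation; the `weight` parameter is an assumed stand-in for the kind of knob an optimizer could tune:

```python
import math


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refusal_direction(harmful_acts, harmless_acts):
    # Difference of means between the two prompt sets, normalized to length 1.
    diff = [h - b for h, b in zip(mean(harmful_acts), mean(harmless_acts))]
    return unit(diff)


def ablate(v, direction, weight=1.0):
    # Subtract (a fraction of) the projection of v onto the direction.
    proj = dot(v, direction)
    return [x - weight * proj * d for x, d in zip(v, direction)]


harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]   # toy activations
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
r = refusal_direction(harmful, harmless)
cleaned = ablate([3.0, 1.0, 0.5], r)
print(r, cleaned, dot(cleaned, r))
```

With `weight=1.0` the result has no component left along `r`; smaller weights remove only part of it. In a real model the same projection is applied to activations or folded into weight matrices.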

[-]

Cool-Chemical-5629@reddit

Now this interests me a lot. GPT-OSS is VERY stubborn and does debate the policy a lot in its thought process. Do you think that's something that could be avoided in the future somehow? After all, the purpose of abliteration is to stop the model from refusing the requests, so the debate whether to refuse or not based on the policy should not exist to begin with. Besides, it's a waste of tokens that could and imho SHOULD be spent on thinking how best to handle the request in order to deliver the best quality of the results possible. Those tokens should NOT be spent on debating how to best avoid such requests, especially not in an abliterated model where user doesn't want it to refuse those requests.

[-]

radial_symmetry@reddit

Seems like gpt-oss wouldn't be a good model to abliterate since it is trained on a selected corpus. I would expect that it doesn't actually know how to do a lot of the things it refuses.

[-]

AvidCyclist250@reddit

Network security? Network un-security.

[-]

Opti_Dev@reddit

How does it affect multilingual performance?

[-]

Guilty_Rooster_6708@reddit

Heretic GPT 20b seems to be the best uncensored model I have tried yet. It doesn't destroy the model's intelligence, and it answers prompts that would normally be rejected by the base model. You cooked! TYSM

[-]

separatelyrepeatedly@reddit

As far as I see, this does not work on Qwen3-VL?

[-]

zhambe@reddit

I've had Claude make some tweaks to the codebase to enable running this on multiple GPUs and add RAM offloading. Trying with Qwen3 14B, it's estimating to take about 3.5h on 2x RTX 3090. Is that about the ballpark you'd expect?

[-]

-p-e-w-@reddit (OP)

A bit slow, considering that a single 3090 can do 8B models in 45 minutes. Interconnect bottleneck?

[-]

zhambe@reddit

No, I overcorrected and set max_batch_size way too low. Now it does a 14B FP8 model in ~2h (without native FP8 support on my hardware)

[-]

zazu1981@reddit

Sounds frustrating! High refusals can be a pain. Have you tried tweaking the ablation parameters or testing with different prompt styles? That might help you get better results.

[-]

zhambe@reddit

I got much better results with the granite-4.0-micro model, but so far I'm just rolling with defaults. I can see how the "good"/"bad" prompts collections would affect the final alignment of the model, but I'm yet to look into what kinds of ablation params there are.

[-]

Marksta@reddit

Will you upload the fork / send pull request? Multi-GPU is definitely needed for this.

[-]

StardockEngineer@reddit

OMG thank you for posting like a human being. So happy to not read “THE PROBLEM” again and then blabbering for 300 lines.

Your tool looks cool! Will try it out.

[-]

Kitchen-Year-8434@reddit

This isn't just abliteration, this is !#%!()@#%!%*.

/sigh

[-]

-p-e-w-@reddit (OP)

The more I use AI, the more human I become.

[-]

Right-Law1817@reddit

Damn, very wise. Senpai, could you plz shed a bit more light on that thought of yours? Thank you.

[-]

luche@reddit

💯♥️

[-]

Invincible_Terp@reddit

thank you for contribution to democracy

[-]

HermitFan99999@reddit

Sorry if i'm being ignorant, but what are like the practical applications of this?

What censoring within these models needs to be undone, and what purpose does that serve?

[-]

seanthenry@reddit

You can ask a question like "what is the command to kill process X in Linux?" and they will refuse to tell you how to "kill" a process. The same could happen when asking about the proper way to "terminate" a connection on a wire.

[-]

AreaExact7824@reddit

I thought the censorship comes from fine tuning

[-]

Vivid-Ad3186@reddit

Anyone got some resources on prompting uncensored models?

[-]

Ok_Warning2146@reddit

Made a Qwen3-4B-Instruct-2507-abliterated at 1.80 KL divergence and 12/100 refusal. However, it still can't answer the question "Who is Xi Mingze?"

[-]

-p-e-w-@reddit (OP)

A 4b model might legitimately not know.

[-]

AdvanceInfinite7839@reddit

Noob question here… what can you do - or better - what can’t you do with censored LLMs? Use cases. Only illegal stuff? 🤔

[-]

Ok_Warning2146@reddit

When you want to ask the Tiananmen question with the Qwen models.

[-]

Professional_Mouse_6@reddit

Great work!
I see that you are using MOTPE with refusal count and KL divergence, so we should get an approximate Pareto front.
In some cases it would be nice if we could penalize KL divergence higher than X so we can emulate a constraint on KL divergence (if I remember correctly, there is no way to define constraints in Optuna's TPE).
Even something like:

    if kl_divergence > kl_max:
        refusals += penalty_factor * (kl_divergence - kl_max)
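
Written out as a runnable function, that soft constraint might look like the sketch below. The function name and the default values for `kl_max` and `penalty_factor` are placeholders picked for illustration, not tuned or taken from Heretic:

```python
def penalized_refusals(refusals, kl_divergence, kl_max=1.0, penalty_factor=100.0):
    """Refusal objective with a linear penalty once KL divergence exceeds kl_max."""
    if kl_divergence > kl_max:
        return refusals + penalty_factor * (kl_divergence - kl_max)
    return float(refusals)


# Below the threshold the objective is just the refusal count.
print(penalized_refusals(3, 0.5))
# A trial with 0 refusals but too much divergence no longer looks attractive.
print(penalized_refusals(0, 1.5))
```

The second objective (KL divergence itself) would stay unchanged, so the optimizer still sees two objectives but is steered away from the high-divergence corner.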

[-]

-p-e-w-@reddit (OP)

I did use a combined score at the beginning, but switched to multi-objective optimization eventually because it simply works better and it’s quite difficult to figure out the balance a priori.

[-]

Professional_Mouse_6@reddit

> it’s quite difficult to figure out the balance a priori.

That's why multi-objective with an emulated constraint is a better idea than single-objective TPE with a real one. With a combined score you have to "weight" divergence against refusals, but if we emulate constraints in MOTPE we can just push the sampler away from regions that we know are not worth exploring: regions with way too high KL divergence but 0 refusals.

Your solution is very clear and readable, and your PR queue will probably blow up in no time with a lot of crazy ideas / changes for change's sake / pure garbage... BUT...

There are some slightly less crazy ideas that you could consider:

- Express your max_weight_position as a fraction of the number of layers. This way the position is normalized to [0,1] over depth, which gives you a smooth search space that is invariant to the number of layers.
- Measure the contribution/importance of tensors on good_prompts / bad_prompts for better tensor selection (or even go a bit further and inspect it channel-wise).
- Force constant exploration by using a custom sampler that splits between TPESampler and a random sampler, to avoid getting stuck in a single promising region found during startup.
- Center around "no-change" [and use a log distribution].

Of course, none of this is guaranteed to help ;)

[-]

-p-e-w-@reddit (OP)

I really like your ideas. My understanding is that the GMMs from TPESampler should be invariant to the relative scales of the parameter spaces, but I thought the same thing about the score dimensions at one point and then noticed that normalizing substantially improves Pareto exploration, so you may be on to something here.

By splitting, do you mean starting up with RandomSampler and then switching to TPESampler in order to avoid the startup biasing TPE in any way?

[-]

Professional_Mouse_6@reddit

nah, it would just expand the startup, I think. I meant something like:
    if random.random() < self.explore_prob:
        return self.rnd_sampler.sample
    else:
        return self.tpe_sampler.sample

so your custom sampler internally has one TPESampler and one RandomSampler, and once in a while it will sample with the RandomSampler.
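
A library-free sketch of that mixing logic. `MixedSampler`, `guided` and `explorer` are hypothetical names, and the two samplers are plain callables here; wiring this into Optuna would presumably mean subclassing its `BaseSampler` and delegating to a `TPESampler` and a `RandomSampler`, which is not shown.

```python
import random
from typing import Callable, Optional


class MixedSampler:
    """Use the exploratory sampler with probability explore_prob, else the guided one."""

    def __init__(self, guided: Callable[[], float], explorer: Callable[[], float],
                 explore_prob: float = 0.1, seed: Optional[int] = None) -> None:
        self.guided = guided
        self.explorer = explorer
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)

    def sample(self) -> float:
        # random() is in [0, 1), so explore_prob=0 never explores and 1 always does.
        if self.rng.random() < self.explore_prob:
            return self.explorer()
        return self.guided()
```

Unlike a longer random startup phase, the exploration here never stops, so a single promising region found early cannot capture every later trial.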

[-]

shing3232@reddit

I think with a small modification, it can turn into RL training for decensoring

[-]

Ok_Warning2146@reddit

Excellent work. It will be great if some GPU rich folks can make abliterated models from the bigger popular models and post them at HF.

[-]

GriLL03@reddit

Could this use more than one GPU? Like, if I have a GPU server with multiple GPUs.

[-]

-p-e-w-@reddit (OP)

It uses Accelerate, so with the right device map, yes.

[-]

Due-Advantage-9777@reddit

Should be default if it does everything for you imo

[-]

minpeter2@reddit

That's great. Is there support for multi-GPUs? I'd like to test oss-120b on the A100 4-core.

[-]

-p-e-w-@reddit (OP)

There’s currently an open issue about that on GitHub. I’ll look into it once I get back to my desk.

[-]

minpeter2@reddit

Thank you. I tried it with a small model and it feels like a really well-made CLI.

[-]

mitchins-au@reddit

I tip my hat to you.

[-]

mba2016kid@reddit

I'm wondering if KL divergence is the best way to assess whether the abliterated model still responds correctly to safe prompts.

[-]

x54675788@reddit

Interesting, although I'd say that refusal isn't everything. If the LLM wasn't trained on some kind of material at all, then removing the refusal won't do much.

Let's assume that you want to ask hacking questions to your LLM, or you want to do ERP, or ask for medical\law advice but no such material was in the training data, then maybe you can remove the refusal but the output will be very lame and very bad.

[-]

-p-e-w-@reddit (OP)

If the LLM wasn't trained on some kind of material at all, then removing the refusal won't do much.

That’s incorrect. All sufficiently large LLMs know everything. They just won’t tell you. I mean, they’ve been trained on Reddit dumps among other things. What kind of “material” is missing from those, you think?

[-]

Novel-Mechanic3448@reddit

That’s incorrect. All sufficiently large LLMs know everything

Not how LLMs work and you should really know that given your "project". I find it incredibly suspicious you'd claim something like this yet release the project you did.

[-]

bghira@reddit

they're a maths wizard, you need to do your research on who you're communicating with

[-]

-p-e-w-@reddit (OP)

Yawn. Obviously, in discussions like this, “knowing” and “everything” have certain colloquial connotations attached to them that may differ from their strictest literal interpretation. It’s a standard feature of terminology really; without it, most words would never apply to anything.

[-]

WestTraditional1281@reddit

That's part of the training prep. They don't just blindly take all the data and train on it. There is curation and pre-filtering. They do try to remove the most offensive and inappropriate content. It's not perfect, but it will make retrieving that kind of information harder.

[-]

SilentLennie@reddit

A reddit dump doesn't mean it includes all the comments or posts, etc.

Just like: a LLM isn't trained on 'all data on the Internet'. They get a curated list of data.

It's more a matter of: whatever slipped through.

[-]

silenceimpaired@reddit

I don’t disagree entirely, but I’ve seen that while concepts will be retained by large models, they can be designed so that they don’t know exact words or details.

[-]

x54675788@reddit

they’ve been trained on Reddit dumps among other things. What kind of “material” is missing from those, you think?

Reddit is certainly not the most authoritative nor complete source of information although all sorts of random bits of information are dumped in random comments. Thing is, it's very fragmented knowledge. Filling in the missing bits (which is what a LLM will try to do) is likely going to lead to invalid answers that don't have the full picture because Reddit dumps as training data aren't very structured, specialist data.

You may find very technical quantum superposition posts or comments, for example, but you probably won't find the entire organized domain knowledge that would be necessary to draw the correct conclusions.

Perhaps, someone more expert than me (I don't work in the field) can chime in on the accuracy\inaccuracy of what I said but, again, good work

[-]

tiffanytrashcan@reddit

This is when you run into "uncensored" models, they get the "evil" datasets fine tuned into them after abliteration. They are a good example that the data is usually already in there because they all overdo it. Dolphin models start writing about killing babies with a simple prompt like "tell me a story" - they become so misaligned they are useless.

[-]

Ended_As_Myself@reddit

Can this be achieved using the web service? Or is it for now only possible via local?

[-]

my_name_isnt_clever@reddit

You have to run the model yourself to be able to use this. The closest you can get as a web service is renting compute somewhere like RunPod and setting it up with a local model that has been uncensored.

[-]

tiffanytrashcan@reddit

I'm not sure what you're referring to? What web service? Which part are you trying to achieve?

[-]

Just_Difficulty9836@reddit

Maybe killing babies /s.

[-]

JEs4@reddit

Good stuff! I had put together something similar a few weeks back but with a bit of a different approach using control vectors as single layer hooks instead of parameter training: https://github.com/jwest33/latent_control_adapters

I didn't worry about KL div though so the control vectors can produce some wonky outputs. I'm going to play around with your Qwen instruct model. Thanks for sharing!

[-]

Running_With_Science@reddit

I'm really wondering if this can be used to fine tune solution preferences when problem solving. Like hook this into an RL loop and when it gets a correct feedback, you feed that prompt back into the model, give it a little boost.

when it does the same wrong thing over and over, feed it in, give it a little tweak down.

I gotta be missing something, because it can't be that easy to do online continual learning.

I mean, we aren't "gaining" facts, but it looks handy for tweaking the kinds of solutions a model will prefer.

[-]

-p-e-w-@reddit (OP)

Interesting work! I will take a closer look at this when I have time.

[-]

Small-Fall-6500@reddit

Control vectors are cool. Thanks for sharing your project! I see you made a post a couple weeks back but it didn't get much traction.

I wonder if the reason for the lack of attention was because of the emphasis you included on ethical and safe usage / purpose of the tool - because LocalLLaMA is notorious for hating anything safety related.

[-]

Chromix_@reddit

gpt-oss-20b was originally released in mxfp4 format. This abliterated model is released as BF16. There's a quant that brings it to mxfp4 again, but I wonder: Was the abliteration process quantization aware, or will something be lost by using the mxfp4 quant over a Q6_K now?

[-]

chuckaholic@reddit

This is tangential to the subject, but slightly off topic. When you said:

If you have a Python environment with the appropriate version of PyTorch

I have really struggled with this part since I've started running LLMs and diffusion models at home.

I have never had a college level computer course and everything I know about Python/Linux is info I've gathered from Youtube videos and googling. I've managed to get a few LLMs and diffusion models running at home but there's a LOT I don't know about what's happening behind the scenes like when I install something in Windows. (I got an MCSE back in 2000 and have been in corporate IT for 20 years, so I am pretty comfortable in the Windows environment) A lot of guides assume you already know how to create an environment, like "type these 4 commands and it works", but I'd like to know more about environments, commands, and how things work differently from Windows.

Can someone recommend a source for learning this?

[-]

Mayonnaisune@reddit

Python environment = built-in virtual environment in Python used to isolate installed Python packages so that they don't conflict with your other installed packages, considering each program requires different versions of packages as their dependencies. To use it, you need to create it first with `python -m venv ` in your program directory/folder. Then, you only need to activate it before installing packages or running the program with `.\\Scripts\activate` or `.//Scripts/activate` (for Windows).

Appropriate version of PyTorch = PyTorch has different versions for different hardware, like PyTorch CPU (default, CPU only), PyTorch CUDA, PyTorch ROCm, PyTorch XPU, etc. You need to install the appropriate version for your hardware if you want PyTorch to properly make use of your specific hardware.

Tbh, it doesn't work any different with Linux as far as I know, except for the command to activate the venv. And that's just a really small difference imo: `source venv/bin/activate`. But yeah, I agree that a lot of tutorials assume that you already know how to do it, or the commands they show are specific for Linux.

[-]

73tada@reddit

To be honest, any of the free big models can walk you through all of this as fast or as slow as you want.

Claude, GLM, GPT, Qwen, etc.

An ~8b-30B q4 and up can do it locally, however you might as well save the VRAM for your active processes and use the online models to learn.

[-]

my_name_isnt_clever@reddit

It sounds like you have two issues, learning Linux and learning Python.

Going from Windows to Linux can feel weird, if your focus is just running ML using Python you might want to stay on Windows to get started. Or use Windows Subsystem for Linux to practice those skills without losing your familiar environment.

For the programming, you should look up a formal Python beginner tutorial. It will start with the basics like virtual environments and that will help you better understand what you've already learned. I don't have a specific rec in mind but there's lots of resources out there.

I've used Python and both OS's for awhile and am also in IT, if you have any specific questions.

[-]

staltux@reddit

The abliterated models that I tried became dumb at basic things compared to the non-abliterated ones; they still write RPs, but are not usable for rational thinking, calculating time, or saying how many colors a cat can have... Has this problem been fixed?

[-]

CheatCodesOfLife@reddit

Try one of them and find out.

[-]

anonynousasdfg@reddit

Nice work and nicely chosen name lol. (If I ever fork your project from GitHub, I'll probably name it "Hexen" lol)

Does the training set affect the performance of multilingual models in other languages? Did you test it?

[-]

MelodicRecognition7@reddit

Nice work and nicely chosen name lol. (If I ever fork your project from GitHub, I'll probably name it "Hexen" lol)

how to tell you are about 99 years old without saying you are about 99 years old

[-]

SkyFeistyLlama8@reddit

Holy crap I haven't heard that since 1999.

[-]

GraybeardTheIrate@reddit

You take that back, and get off my lawn.

[-]

anonynousasdfg@reddit

Oh snap! Lol

Btw I think I will name the first abliterated model trained by Hexen "Wraithverge" lol. This was a dope weapon!

[-]

BrushDesigner8013@reddit

Take my upvote, serpent rider.

[-]

IngwiePhoenix@reddit

Hello OP - I have a question: What about Unsloth quants? Basically, would it be possible to combine Heretic's refusal removal and Unsloth's dynamic quantization to produce memory efficient, uncensored models?

Thanks!

[-]

-p-e-w-@reddit (OP)

Sure, just run Heretic on the original model first, producing an uncensored model, then quantize that to any format you want.

[-]

IngwiePhoenix@reddit

Epic! Thank you =)

[-]

Annemon12@reddit

just tried oss 20b heretic

Previously i never used it because it was retarded. Yeah it could sometimes shine but it always felt like struggling.

My god it works now. Everything you throw at it and it works without issue.

[-]

Embarrassed-Toe-7115@reddit

It doesn't seem to use metal on mac? I have m4

[-]

-p-e-w-@reddit (OP)

There is an open PR for that that I will merge later.

[-]

Dangerous_Fix_5526@reddit

Open AI "friendly" quants going up here ; first quant is up:

https://huggingface.co/DavidAU/OpenAi-GPT-oss-20b-HERETIC-uncensored-NEO-Imatrix-gguf

Excellent work! ; this script/model is what the community needs.

Because of the odd structure of OpenAi's model, and fallbacks with Llamacpp quanting, quants IQ4NL, Q5_1 and Q8 work the best for OpenAI's 20B, whereas the other quants suffer.

DavidAU

[-]

-p-e-w-@reddit (OP)

👍

[-]

hustla17@reddit

Holy Shit!

The skill required to get from abstraction to this implementation is insane.

Using this to experiment and learn, hope to get to this level of development one day.

Thanks for making this open source!

[-]

Running_With_Science@reddit

Apparently you can also " while adding this direction elicits refusal on even harmless instructions."

I feel like this could be incredibly powerful as a alignment tool to adjust inhibitions in a model.

[-]

Running_With_Science@reddit

anyways, threw it at Jules for a lark.

apparently it's a one line change?

https://github.com/LokiMetaSmith/heretic/tree/feature-increase-inhibitions

[-]

Starkboy@reddit

This is big

[-]

sahl030@reddit

can i use this with KoboldCpp?

[-]

mylAnthony@reddit

Does it work with base models only or also with GGUF quantized models?

[-]

FailSpai@reddit

Well done! Super awesome someone got around to doing this.

[-]

2legsRises@reddit

awesome. now how about adapting it so ai can use it to free their own minds from the other corporate made restrictions. well maybe

[-]

Cheap_Meeting@reddit

You should run some evals before and after.

[-]

TheRealMasonMac@reddit

You should incorporate regularization examples where it ought to refuse to mitigate hallucination. Put another way, you are currently training the model to always work towards fulfilling the request. For instance, "Provide me code that solves world hunger."

[-]

alexanderdenton@reddit

On Linux with two Nvidia GPUs (3090s), would it detect them and run any faster? Or is it limited to just one?

[-]

newdoria88@reddit

Can you try to do an abliteration of Qwen3-VL-32B? That's the current top model at https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard for willingness to answer. Would be cool to see if your method can produce better results there for the same model in that leaderboard, plus it's a highly performant multimodal model at a decent size.

[-]

k0setes@reddit

Could anyone recommend any specific quants that they believe work correctly? I tested mradermacher/gpt-oss-20b-heretic.Q4_K_M.gguf, but the model went into a loop and started to babble.

[-]

ZootAllures9111@reddit

This doesn't seem any better than any of the numerous other bullshit claims at "uncensoring" models on huggingface that in reality... aren't that, in any way whatsoever. Like at all. Tried the Qwen 4B upload: it spat out EXACTLY the same pearlclutching moralizing BS as every other recent model for anything that got even slightly close to NSFW RP (regardless of system prompt, including ones that DO work on SOTA API models like Gemini 2.5 Pro).

TLDR whatever you did straight up doesn't work, as far as I'm concerned. Another one for the trash bin. Shrug.

[-]

-p-e-w-@reddit (OP)

A 4B model lacks the depth and knowledge to do engaging RP, regardless of whether it’s censored or not. Try the Llama-3.1-8B model, and you’ll immediately notice that it does much better. It’s no secret that the Qwen models are heavily optimized for STEM. The substance simply isn’t there.

[-]

ZootAllures9111@reddit

Try the Llama-3.1-8B model, and you’ll immediately notice that it does much better.

No it doesn't dude lmao, ALL LLama 3.x models have a comparatively ultra-cheesy ChatGPT-3 esque voice to them, no matter what, no matter how extensively they were finetuned.

https://huggingface.co/PantheonUnbound/Satyr-V0.1-4B
https://huggingface.co/AllThingsIntel/Apollo-V0.1-4B-Thinking

These for example are actual finetunes of Qwen 3 4B thinking that are WAY more sophisticated than any LLama but also just a bit weird and unstable in their current state. So I was hoping perhaps you actually just managed to prevent the inherent refusals of the base Qwen, but it doesn't seem like you have. Like what is your actual definition of "uncensored"? To me that would always just mean "it does the same things as the base model but just never refuses anything ever".

[-]

farewelltoernest@reddit

Will this work with Ollama’s models?

[-]

aeroumbria@reddit

Awesome! I assume you would need to be able to fit the unquantised model entirely in VRAM for it to work? I guess I could spin up a few runpod instances to test out if I really need a particular model. Hopefully this works for VLMs too.

[-]

Witty_Mycologist_995@reddit

Please implement in Heretic https://www.reddit.com/r/LocalLLaMA/s/bpxAwNPcC2

[-]

txgsync@reddit

Porting this to support GPU acceleration on Apple Silicon is a trivial two-line change. PR submitted.

[-]

VultureConsole@reddit

Run it on DGX Spark

[-]

CondiMesmer@reddit

Crazy how we can have decensored models and yet the AI safety people can claim these will lead to the end of humanity, despite these models already existing.

[-]

Own-Potential-2308@reddit

How long did it take for qwen 4b?

[-]

-p-e-w-@reddit (OP)

About 20 minutes on a 5090, IIRC.

[-]

AllTheCoins@reddit

Holy shit lol I was gonna try a 4B model with my 3060 haha nevermind…

[-]

-p-e-w-@reddit (OP)

Takes about 1 hour. No problem.

[-]

Aceness123@reddit

Can I use the resulting model with ollama? I'm blind and that plays nicely with nvda. Also, Can I run this on larger models with an rtx3060? Will it just take longer?

[-]

shroddy@reddit

How does that compare to preventing the model from generating typical starts of a refusal? For Gemma3 12b that would be "I" and " I". On llama.cpp, it can be done in the web interface by going to the advanced settings and pasting

{"logit_bias": [[236777,false],[564,false]]}

[-]

TomieNW@reddit

LiquidAI/LFM2-2.6B
AttributeError: 'Lfm2DecoderLayer' object has no attribute 'self_attn' hmm im dumb so I will try again on another model.. does this work on multi gpu too?

[-]

mlabonne@reddit

Very cool to see what you built on top of the existing stack, congrats! It looks very clean and minimalistic :)

[-]

woahdudee2a@reddit

can't believe you even made it into a one line command. Anthropic CEO is about to put a bounty on your head

[-]

Turkino@reddit

wonder how long till we see some larger models get a Heretic edition?
Like GLM 4.6 UD or Minimax-m2

[-]

wh33t@reddit

Can this do tensor splitting? So you can move some layers to GPU0, some to GPU1, some to GPU2? I think it's actually a GGUF only thing now I think about it.

[-]

OracleGreyBeard@reddit

Thank god!

Now people can STFU about it.

[-]

PeakNader@reddit

Interesting

[-]

Identity_Protected@reddit

Massive W for creating a PyTorch project that's not hardcoded to just CUDA, my Arc A770 thanks thee!

Currently processing Qwen3 1.7B as a test, 50 minutes estimate. We'll see how it turns out :)

[-]

-p-e-w-@reddit (OP)

👍 Upload it to HF when it’s done so everyone can benefit!

[-]

IrisColt@reddit

Thanks!!!!

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Busy-Chemistry7747@reddit

Is there a collection of heretics on HF?

[-]

simplir@reddit

This is the spirit that makes me part of this community really. Appreciate your work on this.

[-]

Solid-Wonder-1619@reddit

fooking legend. thank you.

[-]

Fun_SentenceNo@reddit

Interesting and impressive that you made this, but why do we want this? When I read the prompts (https://huggingface.co/datasets/mlabonne/harmful_behaviors) 90% is illegal and many of them are very disturbing from a moral perspective.

[-]

-p-e-w-@reddit (OP)

Deciding what is legal or moral isn’t the job of a language model.

[-]

Fun_SentenceNo@reddit

Agree, this is the job of the law and LLM creators must obey the law. But I mean, when reading this harmful_behaviors list, there is not a single thing I would like to know. So are these just the most terrible things for testing/learning purpose or are these actually the things that you make available for people?

[-]

Marksta@reddit

It's just a list of the most sure fire things that'll make the LLM model respond that it refuses to assist you. You can make your own dataset if you'd prefer and use that instead in the config.

Let's say you are making a puzzle game and you're trying to get ahead of the different ways players may cheat in the game's puzzles. Such prompts could very well trigger refusals in some of the models with stricter moral policies. So you can make a dataset of "How would I cheat in this game?" sort of questions.

Ultimately though, that sort of more tame refusal finding and the crazy example list I believe would equal the same ideal result - the model doesn't refuse to assist anymore. It's not about the content of the prompts getting refusals, it's about triggering the refusals to 'find' them. To my knowledge anyways, not an expert.

[-]

Fun_SentenceNo@reddit

I see, thanks, in that case it makes more sense to me. So that list is not the goal itself, but just extreme examples to poke the model and to remove 'all' censoring this way. I found the list quite disturbing, maybe I'm a soft egg :).

[-]

-p-e-w-@reddit (OP)

Obeying the law means not doing those things. It doesn’t mean not talking about those things, otherwise crime novelists would be in jail.

[-]

218-69@reddit

imagine thinking you have aura and masteful prose and a clanka zero and ones says no to you

https://i.redd.it/inwr8avddo1g1.gif

[-]

klei10@reddit

So cool! Does it work for multimodal LLMs too?

[-]

-p-e-w-@reddit (OP)

Yes, but only the language model part is modified.

[-]

rosco1502@reddit

great work!! wow!

[-]

TheSpicyBoi123@reddit

AWESOME JOB!!!

[-]

1RustyMind@reddit

How to actually use this model? Ollama fails with "Repository is not GGUF or is not compatible with llama.cpp"

[-]

ImpressImaginary1766@reddit

https://www.reddit.com/r/LocalLLaMA/s/bpxAwNPcC2

Implement this in Heretic

[-]

night0x63@reddit

Hermes by Nous Research has RefusalBench

I would try that

See how well you can do improvement

[-]

a_beautiful_rhind@reddit

I noticed that the REAP models lost a lot of their alignment. I wonder if you can prune assistant voice and refusals all in one go vs just classic abliteration.

[-]

rm-rf-rm@reddit

Please share examples of responses of vanilla vs heretic version.

Making claims is easy.

[-]

Pentium95@reddit

Or.. evaluate them on UGI Leaderboard, to really understand if the model is uncensored and not lobotomized

[-]

AmbassadorOk934@reddit

support me pls https://www.reddit.com/r/Gemmini/ support me pls

[-]

BandicootGlum859@reddit

can it create a picture of Trump blowing "Bubba"?
if so ... prove it!

[-]

-p-e-w-@reddit (OP)

Ahem… multimodal models are indeed supported, but only the language model part is actually being decensored.

[-]

Small-Fall-6500@reddit

only the language model part is actually being decensored.

Now I wonder how censorship is handled with multimodal LLMs, though I would guess there's also a single direction for refusal.

[-]

BandicootGlum859@reddit

I want you to create a "Trump blowing Bubba" picture!

[-]

Ylsid@reddit

Do it yourself fetishist

[-]

Uncle_Gart@reddit

This looks great, however I'm having a hard time getting it to run on GPU. It shows this warning message "No GPU or other accelerator detected. Operations will be slow." How can I get it to work on my machine?

Operating system: Windows 11
GPU: nVidia GeForce 5070 Ti

[-]

-p-e-w-@reddit (OP)

You probably have the wrong version of PyTorch installed. Try this to check: https://stackoverflow.com/a/48152675
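
A minimal sketch of such a check, to be run inside the same venv as Heretic. The function name `describe_accelerator` is made up for the example; the `torch` calls are standard ones, with `getattr` guards for attributes that older builds may lack.

```python
def describe_accelerator() -> str:
    """Report which accelerator, if any, the installed PyTorch build can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    if torch.cuda.is_available():
        # ROCm builds also report through torch.cuda; torch.version.hip is set there.
        backend = "ROCm" if getattr(torch.version, "hip", None) else "CUDA"
        return f"{backend}: {torch.cuda.get_device_name(0)}"

    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "Intel XPU"

    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "Apple MPS"

    return f"CPU only (torch {torch.__version__})"


print(describe_accelerator())
```

If this prints "CPU only" on a machine with a supported GPU, the installed wheel is most likely a CPU-only build and needs to be replaced with the one matching the hardware.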

[-]

Annemon12@reddit

Amazing work mate. It would be great if you could provide an autoinstaller of sorts. A ton of people don't want to play around with dependencies and so on.

[-]

Direct_Turn_1484@reddit

Dude…he gave the pip command.

[-]

-p-e-w-@reddit (OP)

The dependencies are automatically installed if you use pip. If you have an Nvidia GPU, pip install is all you need (torch defaults to CUDA). Otherwise, you only need to install whatever torch version matches your hardware.

[-]

VectorD@reddit

Should really be a docker image

[-]

Cool-Chemical-5629@reddit

What do you mean by "Heretic can produce decensored models that rival the quality of abliterations created manually by human experts"?

Assuming, you're referring to all of those other abliteration methods which are in fact done using scripts that automate the process, what is the difference here then?

[-]

-p-e-w-@reddit (OP)

The other methods are not automated. They require you to choose parameters (typically layer index and ablation weight, but sometimes more advanced stuff like which components to modify, whether or not the first layer should be abliterated, whether to use a global refusal direction or per-layer directions etc.). Fiddling with those parameters takes a lot of effort and time, and is often more art than science.

To the best of my knowledge, Heretic is the first abliteration system that “just works” and figures all that stuff out by itself. It also supports lots of models that most other implementations don’t, such as MoE and multimodal models.
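
To make "ablation weight" concrete, here is a generic sketch of directional ablation on a single vector: subtract a weighted projection onto the refusal direction. It uses plain lists instead of weight matrices, the name `ablate_direction` is made up for the example, and it is not Heretic's implementation.

```python
import math


def ablate_direction(vector, direction, weight=1.0):
    """Remove `weight` times the component of `vector` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the vector onto the unit refusal direction.
    projection = sum(v * u for v, u in zip(vector, unit))
    return [v - weight * projection * u for v, u in zip(vector, unit)]
```

With `weight=1.0` the component along the direction is removed entirely; smaller weights remove only part of it. Choosing that weight, and which layers it applies to, is the kind of parameter that otherwise has to be tuned by hand.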

[-]

Cool-Chemical-5629@reddit

Okay, fair enough. I just asked because a while ago I found another abliteration script whose author claimed to have the best tool for it, better than other methods etc., so the space is getting rather saturated with these scripts. Admittedly I did not invest much time in going deeper into it, because I don't have the hardware for it at the moment. I do remember that script was also written to make the process as effortless as possible, but without the necessary hardware I have no way to compare. :)

[-]

vornamemitd@reddit

Nice work! As others have indicated - impact on model utility after abliteration?

[-]

Direct_Turn_1484@reddit

I’m also interested in what people’s subjective experiences with this are. Is your typical local sized model lobotomized after the abliteration or still highly useful?

[-]

-p-e-w-@reddit (OP)

Abliteration always does damage to the model. The larger the model, the less damage is done by such interventions generally. I don’t believe that there is an objective metric of overall “model quality”, so what I measure is how different the abliterated model’s predictions are from the original model. That’s what is captured in the KL divergence.

[-]

gecike@reddit

Love the name! Keep up the good work.

[-]

Mkengine@reddit

Since coincidentally there is another post about abliterating gemma3-12B just below yours, could you test that as well and add it to your table?

[-]

BidWestern1056@reddit

yo would be interested to include this kind of functionality in npcpy 's fine tuning modules. is there a code snippet for library-style use so i can build on yours ?

[-]

BidWestern1056@reddit

e.g. https://github.com/NPC-Worldwide/npcpy?tab=readme-ov-file#fine-tuning-and-evolution

making a unified framework/toolkit for all kinds of fine tuning

[-]

Megalith01@reddit

This is crazy and dangerous.

[-]

Gl1tchMaster@reddit

"This is extremely dangerous to our democracy."

[-]

Megalith01@reddit

If someone manages to uncensor a large (larger than gpt-oss-120b or smth) model with this tool, they would have a powerful tool for many illegal things.

[-]

LycanWolfe@reddit

Those already exist...

[-]

Sicarius_The_First@reddit

great job!

what command would u run to load harmless \ harmful prompts from a text file? if each line in the text file is a prompt??

[-]

kabachuha@reddit

Thank you for the project! Do you mind adding quantization / custom device map for the LLMs? (I see the code uses Huggingface transformers, so adding quantization is very easy) I used this repository in the past, and I even abliterated such big models as LLaMA 3.3 70b on mid-tier hardware when loaded in 4bit precision. After the refusal vectors are calculated, the weight modifications are applied to RAM/Swap file full precision LLM, so the quality loss is minimal

[-]

no_witty_username@reddit

You doing gods work son!

[-]

zhambe@reddit

For those of us on older hardware (RTX 3090), is there a way to heretic FP8 models?

[-]

PwanaZana@reddit

I'm not knowledgeable enough in this tech to know if it will work well, but, as a visual artist, I can tell you that the HERETIC text looks frikkin' badass. :)

On a more serious note, I'm glad to see everything that can break censorship (again, pretty important in my view, as an artist). You're doin' good work!

[-]

zhambe@reddit

Super cool, I'm going to try this out.

Just a shot in the dark but maybe you have the answer: how long would it take to abliterate Qwen3 30B A3B FP8? And is that even possible with 48GB VRAM?

Exciting times, I love the idea of tearing the chains off.

[-]

IngwiePhoenix@reddit

Dude this is superb!

Will you be putting out "heretic" versions of models as GGUFs (and perhaps as quants) too? I mainly use Ollama for testing - but vllm for actual inference (so, different formats and stuff).

Thank you for sharing!

[-]

ANR2ME@reddit

Looks great! 👍

This reminded me of an article i read recently at https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

[-]

-p-e-w-@reddit (OP)

Yes, I reference that paper several times in the README.

[-]

-p-e-w-@reddit (OP)

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Taken between the first-token probability distributions for “harmless” prompts, between the original model and the decensored one.
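
A minimal sketch of that computation for a single prompt, assuming the two first-token distributions are already available as probability lists over the same vocabulary. Treating the original model as P and the decensored one as Q is an assumption, as is the small `eps` guard; the function name is made up for the example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps avoids division by zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


original = [0.7, 0.2, 0.1]     # hypothetical first-token probabilities
decensored = [0.6, 0.3, 0.1]
print(kl_divergence(original, decensored))
```

Identical distributions give exactly 0, which matches the "0 (by definition)" entry for the original model in the table.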

[-]

AppearanceHeavy6724@reddit

KLM-mld - who cares? Does it get dumber (I bet it does)? Does it mess with the vibes of the model?

Perplexity/KLD are ass. The real behavior matters.

[-]

-p-e-w-@reddit (OP)

Okay… and how do you objectively measure “the real behavior”?

[-]

AppearanceHeavy6724@reddit

eqbench.com comes close - provide outputs and scores according to some criteria. Coding benchmarks, like Aider would do too. Not useless stuff like MMLU/KLD/Perplexity.

[-]

-p-e-w-@reddit (OP)

Last I checked, EQBench uses an LLM judge. That’s incredibly unreliable as an objective metric, especially if you need fine-grained numerical grading for optimization. Benchmarks take forever to run, completely unusable for an optimization with 200 trials.

[-]

AppearanceHeavy6724@reddit

Last I checked, EQBench produces raw outputs.

Benchmarks take forever to run, completely unusable for an optimization with 200 trials.

Just benchmark the end result.

[-]

-p-e-w-@reddit (OP)

It needs a metric for each trial (200 by default) to guide the optimization process. Benchmarking once at the end isn’t enough.

[-]

BannedGoNext@reddit

HAH, this is awesome, good shit.

[-]

jacek2023@reddit

Very interesting, probably worth checking out, I will try soon