Uncensoring models. Maybe dumb ideas on that topic, but you never know.
Posted by Blizado@reddit | LocalLLaMA | View on Reddit | 14 comments
We all know that uncensoring LLMs the way Huihui and Heretic do leads to quality loss, enough that you can notice it.
I have some thoughts about this:
-
What if we make a compromise? The goal is not to get the most uncensored model possible; the goal is to keep the quality loss as close to zero as possible, with maybe only moderate uncensoring. A simple one-line jailbreak handles the rest, which may be enough.
-
And this may be a dumb one because of my lack of information. What if we uncensor models only just enough to weaken the censorship rules, so that a simple one-liner jailbreak works reliably?
-
Adding to 2: is there maybe untapped potential in the dataset used to uncensor models, to raise the quality of uncensored finetunes?
Maybe this was all discussed before; I'm not sure these ideas are fresh, but sometimes when you work on such solutions you overlook things. And ideas left unspoken because you assume others have already had them are missed chances.
BannedGoNext@reddit
I don't really understand how abliteration works, but I don't notice a degradation in quality on the ones that arliai does at all. IDK what the difference is between derestricting and abliteration, or if it's just another name for the same thing, but I like the derestricted models a lot more.
nickless07@reddit
It is just removal of the refusal layers. Most LLMs nowadays come with internal guardrails: "Sorry, I can't help you with that, as an AI..." They get triggered when harmful or 'forbidden' content is requested. Some of them are creative with refusals and hallucinate things that don't exist, and so on.
An LLM has layers (the kind you offload to VRAM), and the refusal is often located in only a couple of layers; by 'removing' those specific layers it becomes uncensored. One method is to manually prompt it with 'forbidden' content and see which layers get activated, to find out where to make the cut; another way is to do that automatically (the heretic script). In general, all the methods have a similar effect: remove the blocking and allow the LLM to answer the prompt.
Depending on how good that is done and how deep the safety layers are ingrained they can lose quality or not.
llama-impersonator@reddit
that is not how it works at all.
nickless07@reddit
Okay, then explain https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction better. At least that is how I understood it so far.
llama-impersonator@reddit
yeah, you are conflating layers and refusal mechanisms. no one is removing layers here. the original method targets every single layer. what happens is a number of prompts have their activations collected - a batch of prompts unlikely to cause refusals, and a batch of prompts that are likely to cause refusals. then the mean is calculated for each group, and the refusal direction is calculated by taking the difference of the two groups. then a projection of that direction onto either some or all of the weight matrices in the model is done, and that projection is subtracted from the corresponding weights, which zeros out the output of the matrix on the new residual stream along the refusal direction.
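the procedure described above can be sketched in numpy. this is a toy illustration only: the hidden size, the fake "activations", and the single weight matrix are all made up for demonstration, not taken from any actual abliteration codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# toy residual-stream activations for the two prompt batches:
# one batch unlikely to cause refusals, one batch likely to cause them
harmless_acts = rng.normal(size=(100, d))
refusal_acts = rng.normal(size=(100, d)) + 3.0 * np.eye(d)[0]  # shifted along dim 0

# refusal direction = difference of the two group means, normalized
r = refusal_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)

# a toy weight matrix whose output lives in the residual stream
W = rng.normal(size=(d, d))

# subtract the projection of W's output onto r:
# W' = (I - r r^T) W, so W' x has zero component along r for every input x
W_abliterated = W - np.outer(r, r @ W)

# the modified matrix can no longer write anything along the refusal direction
x = rng.normal(size=d)
print(abs(r @ (W_abliterated @ x)))  # ~0 up to float error
```

in a real model this projection is applied to some or all of the matrices that write into the residual stream, which is why the quality impact depends on how cleanly refusal is captured by that one direction.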
nickless07@reddit
Ohh yeah, right, it was the weights, not the layers. I'm not that deep into all of this, but I'm learning. Thanks, that makes much more sense now.
BannedGoNext@reddit
But from what I understand, the derestriction method is not removing the safety layer; it is redirecting it to be helpful. So rather than just bypassing the restriction, it tries to find a way to help with the request.
nickless07@reddit
That is finetuning: the safety layers stay mostly intact (how to build a bomb, how to cook meth, and so on will still be refused), but other things like ERP or political and medical discussions are allowed by shifting the weights toward them. Kinda 'overwriting' the basics instead of removing them entirely.
BannedGoNext@reddit
That's pretty cool! I know it's not a measurement, but the few models that have had this sort of derestriction done to them just feel smarter.
nickless07@reddit
Kinda similar to Gemma 3 27B and MedGemma 27B. Same knowledge, but weighted differently. They get smarter in that domain, but don't lose their overall knowledge; it just gets dimmed.
Blizado@reddit (OP)
Yeah, it also depends on what you use the LLM for. For some tasks you may not notice the quality loss at all. And as you said, abliteration is not always the same: different sources produce such models with different results. I have also seen two Heretic versions of the same LLM, one with less censoring but more quality loss, and one with low quality loss but less uncensoring. At least we now often have numbers to see the difference, instead of guessing how good an uncensored model may be.
a_beautiful_rhind@reddit
For #1 you can select a model with less KLD despite it still having some refusals. A lot of decensoring can be done with samplers and system prompt without modifying the model.
llama-impersonator@reddit
yeah, i look for low KLD and like half the refusals. that's usually sufficient for whatever i need, you can convince the model instead of it just stonewalling you with refusals. models that have the refusals totally killed produce a number of side effects i don't really care for.
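the KLD number mentioned here is the KL divergence between the original and the modified model's next-token distributions (it's the metric heretic reports); lower means the modification changed the model's behavior less. a minimal sketch of the idea, with made-up toy logits standing in for real model outputs:

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions given as raw logits."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
base = rng.normal(size=32000)                 # toy vocab-sized logits, base model
light = base + 0.01 * rng.normal(size=32000)  # lightly modified model
heavy = base + 1.0 * rng.normal(size=32000)   # heavily modified model

# the lighter modification stays closer to the base distribution
print(kl_divergence(base, light) < kl_divergence(base, heavy))  # True
```

in practice this is averaged over the next-token distributions of many prompts; picking the variant with low KLD and roughly half the refusals gone is the trade-off described above.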