Uncensoring models. Maybe dumb ideas on that topic, but you never know.
Posted by Blizado@reddit | LocalLLaMA | View on Reddit | 14 comments
We all know that uncensoring LLMs the way Huihui and Heretic do leads to quality loss, enough that you can notice it.
I have some thoughts about this:
-
What if we make a compromise? The goal is not to get the most uncensored model possible; the goal is to keep the quality loss as close to zero as possible, with maybe only moderate uncensoring. A simple one-line jailbreak handles the rest, which may be enough.
-
And this may be a dumb one because of my lack of information. What if we uncensor models only just enough to weaken the censorship rules, so that a simple one-liner jailbreak works reliably?
-
Adding to 2: is there maybe untapped potential in the dataset used to uncensor models, to raise the quality of uncensored finetunes?
Maybe this was all discussed before; I'm not sure these ideas are fresh, but sometimes when you work on such solutions you overlook things. And ideas left unspoken because you assume others have already had them are missed chances.
BannedGoNext@reddit
I don't really understand how abliteration works, but I don't notice a degradation in quality on the ones that arliai does at all. IDK what the difference is between derestricting and abliteration, or if it's just another name for the same thing, but I like the derestricted models a lot more.
nickless07@reddit
It is just removal of the refusal layers. Most LLMs nowadays come with internal guardrails: "Sorry, I can't help you with that, as an AI..." They get triggered when harmful or 'forbidden' content is requested. Some of them are creative with refusals and hallucinate things that don't exist, and so on.
An LLM has layers (the kind you offload to VRAM), and the refusal is often located in only a couple of layers; by 'removing' those specific layers it becomes uncensored. One method is to manually prompt it with 'forbidden' content and see which layers get activated, to find out where to make the cut; another way is to do that automatically (the heretic script). In general, all the methods have a similar effect: remove the blocking and allow the LLM to answer the prompt.
Depending on how good that is done and how deep the safety layers are ingrained they can lose quality or not.
llama-impersonator@reddit
that is not how it works at all.
nickless07@reddit
Okay, then explain https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction better. At least that is how I understood it so far.
llama-impersonator@reddit
yeah, you are conflating layers and refusal mechanisms. no one is removing layers here. the original method targets every single layer. what happens is a number of prompts have their activations collected - a batch of prompts unlikely to cause refusals, and a batch of prompts that are likely to cause refusals. then the mean is calculated for each group, and the refusal direction is calculated by taking the difference of the two groups. then a projection of that direction onto either some or all of the weight matrices in the model is done, and that projection is subtracted from the corresponding weights, which zeros out the output of the matrix on the new residual stream along the refusal direction.
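the procedure described above can be sketched in numpy. this is a toy illustration only: the hidden size, the fake "activations", and the single weight matrix are all made up for demonstration, not taken from any actual abliteration codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# toy residual-stream activations for the two prompt batches:
# one batch unlikely to cause refusals, one batch likely to cause them
harmless_acts = rng.normal(size=(100, d))
refusal_acts = rng.normal(size=(100, d)) + 3.0 * np.eye(d)[0]  # shifted along dim 0

# refusal direction = difference of the two group means, normalized
r = refusal_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)

# a toy weight matrix whose output lives in the residual stream
W = rng.normal(size=(d, d))

# subtract the projection of W's output onto r:
# W' = (I - r r^T) W, so W' x has zero component along r for every input x
W_abliterated = W - np.outer(r, r @ W)

# the modified matrix can no longer write anything along the refusal direction
x = rng.normal(size=d)
print(abs(r @ (W_abliterated @ x)))  # ~0 up to float error
```

in a real model this projection is applied to some or all of the matrices that write into the residual stream, which is why the quality impact depends on how cleanly refusal is captured by that one direction.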
nickless07@reddit
Ohh yeah, right, it was the weights, not the layers. I'm not that deep into all of this, but I'm learning. Thanks, that makes much more sense now.
BannedGoNext@reddit
But from what I understand, the derestriction method is not removing the safety layer; it is redirecting it to be helpful. So rather than just bypassing the restriction, it tries to find a way to help with the request.
nickless07@reddit
That is finetuning: the safety layers stay mostly intact (how to build a bomb, how to cook meth, and so on will still be refused), but other things like ERP or political and medical discussions are allowed by shifting the weights toward them. Kinda 'overwriting' the basics instead of removing them entirely.
BannedGoNext@reddit
That's pretty cool! I know it's not a measurement, but the few models that have had this sort of derestriction done to them just feel smarter.
nickless07@reddit
Kinda similar to Gemma 3 27B and MedGemma 27B. Same knowledge, but weighted differently. They get smarter in that domain, but don't lose their overall knowledge; it just gets dimmed.
Blizado@reddit (OP)
Yeah, it also depends on what you use the LLM for. For some tasks you may not notice the quality loss at all. And as you said, abliteration is not always the same: different sources produce such models with different results. I have also seen two Heretic versions of the same LLM, one with less censoring but more quality loss, and one with low quality loss but less uncensoring. At least we now often have numbers to see the difference, instead of guessing how good an uncensored model may be.
a_beautiful_rhind@reddit
For #1 you can select a model with less KLD despite it still having some refusals. A lot of decensoring can be done with samplers and system prompt without modifying the model.
llama-impersonator@reddit
yeah, i look for low KLD and like half the refusals. that's usually sufficient for whatever i need, you can convince the model instead of it just stonewalling you with refusals. models that have the refusals totally killed produce a number of side effects i don't really care for.
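the KLD number mentioned here is the KL divergence between the original and the modified model's next-token distributions (it's the metric heretic reports); lower means the modification changed the model's behavior less. a minimal sketch of the idea, with made-up toy logits standing in for real model outputs:

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions given as raw logits."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
base = rng.normal(size=32000)                 # toy vocab-sized logits, base model
light = base + 0.01 * rng.normal(size=32000)  # lightly modified model
heavy = base + 1.0 * rng.normal(size=32000)   # heavily modified model

# the lighter modification stays closer to the base distribution
print(kl_divergence(base, light) < kl_divergence(base, heavy))  # True
```

in practice this is averaged over the next-token distributions of many prompts; picking the variant with low KLD and roughly half the refusals gone is the trade-off described above.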