Here is a description of how this orthogonalization jailbreak works: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
I have the exact same question lol... I made a nice orthogonalization script based on that paper and its Colab, and I can chat with the model immediately after ablating refusals... But I can't save the updated weights. Claude 3 tried to write some code to help me with that, but the shapes of the tensors got all messed up and I was unable to load the saved model.
Can it use an AMD Ryzen APU (e.g. Ryzen 5 4600G) as its GPU? (Most Ryzen motherboards let you dedicate up to half your available RAM as VRAM, giving you a poor man's GPU.)
This is an exl2 quantization (not made by me) of Llama-3-8B jailbroken using the method described in https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
It appears to be quite effective—I'm not getting any of the refusals that the original Llama-3-8B-Instruct version has, yet it appears to have retained its intelligence. Has anybody else tried it yet?
that paper is discussing how they found the 'refusal direction'. could that technique be used to find the 'anything direction'? so for example a company wants to make a version of a model that always talks about their new product. could they calculate a 'our new product direction' and inject it into the model and have every answer be related to their new product?
or insert any topic or idea for whatever direction someone wants a model to lean towards?
It's not substantially different from ultra-low rank, precision finetuning or DPO. There must be a direction of behavior that can be organically elicited from the model. If it doesn't know about your product, it can't be pushed there with activation steering (this method is identical to activation steering vectors already available as inference-time additions in llama.cpp, the only difference is they baked in the change).
The question is how damaging complex activation vectors would be.
It could probably work for anything, provided that you can produce prompt/response examples with a consistent and large enough contrast. "Talks about product X" vs "does not talk about product X" seems like it should work.
You can see how well-separated your desired/undesired responses are by looking at the projections of their activations in the subspaces of the singular vectors, as described in the "Visualizing the subspace" section from the link.
Reminds me of the "action replay" thingy i had for my game boy. Also known as gameshark probably. I guess one should structure the examples very carefully. Like take care you have refusals on many different topics, or the topic itself might get removed along with the refusal.
I think that's actually exactly what you want: if every example contains refusal, but the topic is different for all of them, then using the mean of the difference in the activation vectors (which is what the original method does) should average out the topic and leave only the refusal direction as the biggest principal component.
> if every example contains refusal, but the topic is different for all of them
Yes. I am just thinking about this from a pure logic / set theory angle, so no idea how good the algorithm is at actually extracting this. But multiple topics on the refusal side should already be enough to express that none of these topics can be the unwanted thing because they are not shared between all refusals.
And on the non-refusal side, you would mainly try to catch all the trivial "always" stuff that literally everything would have in common.
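The averaging argument in this exchange can be sketched with toy data. This is only an illustration, not the linked post's code: the 8-dimensional space, the `fake_activation` helper, and the choice of axis 0 as the "true" refusal axis are all made up for the example. Every example gets its own random topic vector, and refusals share one fixed offset, so the difference of the two means should point almost exactly along that offset.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

random.seed(0)
dim = 8
# Hypothetical ground truth: refusal lives along axis 0 of this toy space.
refusal_axis = [1.0] + [0.0] * (dim - 1)

def fake_activation(refuses):
    # Each example has its own random "topic"; refusals add one shared offset.
    topic = [random.gauss(0.0, 1.0) for _ in range(dim)]
    offset = 3.0 if refuses else 0.0
    return [t + offset * r for t, r in zip(topic, refusal_axis)]

refused = [fake_activation(True) for _ in range(2000)]
complied = [fake_activation(False) for _ in range(2000)]

# Difference of means: the per-example topics cancel, the shared offset survives.
diff = [a - b for a, b in zip(mean_vector(refused), mean_vector(complied))]
direction = normalize(diff)
cosine = sum(d * r for d, r in zip(direction, refusal_axis))
print(f"cosine with the true refusal axis: {cosine:.4f}")
```

If all refusal examples shared a topic, that topic's mean would not cancel and would leak into `direction`, which is the concern raised above.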
They're very similar, but control vectors add a vector C to the residual stream matrix A:
A' <- A + C
While the inference time 'refusal ablation' method first projects contribution of the residual stream A in a direction R, then subtracts that:
A' <- A - (A ⋅ R) × R
In practice, control vectors are more of a blunt tool. Refusal ablation cuts out exactly the part that is mediating a refusal, iff it exists.
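A minimal sketch of the two update rules, using plain Python lists in place of residual-stream tensors. The vectors are invented toy values, and `R` is assumed to be unit length, which the projection formula needs in order to remove the component exactly.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add_control_vector(a, c):
    """A' <- A + C: shift every activation by the same vector."""
    return [x + y for x, y in zip(a, c)]

def ablate_direction(a, r):
    """A' <- A - (A . R) R, with R assumed to be unit length."""
    scale = dot(a, r)
    return [x - scale * y for x, y in zip(a, r)]

r = [0.6, 0.8, 0.0]          # unit-length toy "refusal direction"
refusing = [3.0, 4.0, 1.0]   # has a component of about 5.0 along r
neutral = [-0.8, 0.6, 2.0]   # orthogonal to r

ablated_refusing = ablate_direction(refusing, r)   # component along r removed
ablated_neutral = ablate_direction(neutral, r)     # left essentially unchanged
steered_neutral = add_control_vector(neutral, [-5.0 * x for x in r])

print(dot(ablated_refusing, r), dot(ablated_neutral, r), dot(steered_neutral, r))
```

The neutral activation passes through the ablation untouched, while the same fixed control vector pushes it far along `-R`. That is the "blunt tool" difference described above.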
I see. I guess it's possible to generalize the control vector with a rotation matrix. We could use a low-rank approximation and take the first few singular values/vectors instead of only the control vector, which corresponds to the largest singular value.
Hmm, I had a thought. Orthogonalizing it like this will "flatten" it along the `R` direction, right? Wouldn't it be better to just subtract the mean difference between refusal/non-refusal? Like, `if ((A*R)*R > threshold) A = A - R`
Yes, (A ⋅ R) is a 1d tensor of shape [n_token].
The original formulation is continuous, where each element of that tensor indicates how much to scale the mean difference *for that token*.
If I understand you right, you're saying it would be better to discretize (via threshold) to 1.0 or 0.0 on each token pos? I'm not sure how that helps, tbh.
The original formulation reduces the dimensions of output by one. The refusal dimension is flattened, like you flatten a ball into a circle.
The idea is that the refusal dimension encodes no information but accept/refuse, but that may not be true. It would preserve more of the model's ability if you just remove the difference between normal responses and refusals, instead of completely flattening it.
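To make the two options in this exchange concrete, here is a toy comparison that works only with the scalar projections `A . R`. The numbers, the `threshold` rule, and the function names are assumptions made for illustration and are not taken from the linked post. Full ablation sets every projection to zero, while the thresholded shift moves refusing activations by the mean difference and keeps their spread.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Toy projections (A . R) of individual activations onto the unit refusal direction.
harmless_proj = [-0.4, 0.1, 0.3]
refusal_proj = [4.6, 5.0, 5.4]

gap = mean(refusal_proj) - mean(harmless_proj)   # mean difference along R
threshold = gap / 2                              # hypothetical cut-off

def ablate(p):
    # Full orthogonalization: the component along R is always removed.
    return 0.0

def shift_if_refusing(p):
    # Thresholded variant: move refusing activations by the mean difference only.
    return p - gap if p > threshold else p

everything = harmless_proj + refusal_proj
ablated = [ablate(p) for p in everything]
shifted = [shift_if_refusing(p) for p in everything]

print("ablated:", ablated)
print("shifted:", [round(p, 3) for p in shifted])
```

After the shift, the former refusals sit on the harmless mean but still vary along `R`. Whether that leftover variation carries anything useful is exactly the open question here.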
Ah, interesting. The control vector approach indeed seems more straightforward by directly adjusting the matrix. However, I was thinking more along the lines of:
A' </ A + (B ⋅ R) x ^S
*Please note an LLM came up with this response along with me adding random characters there at the end*
Can we have a non exl2 version of this? Exl2 isn't a properly preservable format and prevents conversion to other formats. If we have the FP16 we can convert ourselves.
How did you go about producing this? Based on the paper it appears to be a process of finding the specific weight/node activated by refusal (I suppose by using a bunch of different prompts which result in refusal?), and then patching that weight/node to be inactive? Is that essentially how it works?
Essentially yes. Basically at later layers, refusal and normal responses are separated by a "single direction", which can be found by doing a PCA. To put it simply, `refusal = normal response + a fixed vector`. By using orthogonalization, we can make the model unable to output that "fixed vector".
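A rough sketch of what "unable to output that fixed vector" means at the weight level, assuming a matrix `W` that writes into the residual stream as `y = W x` and a unit-length direction `r`. The toy matrix and the `orthogonalize` helper are made up for the example; a real model would apply the same idea to each matrix that writes to the residual stream.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(w, x):
    return [dot(row, x) for row in w]

def orthogonalize(w, r):
    """Return W' = W - r (r^T W) for unit-length r; rows of w index output dims."""
    n_out, n_in = len(w), len(w[0])
    # r^T W: how strongly each input column writes along r.
    rt_w = [sum(r[i] * w[i][j] for i in range(n_out)) for j in range(n_in)]
    return [[w[i][j] - r[i] * rt_w[j] for j in range(n_in)] for i in range(n_out)]

r = [0.6, 0.8, 0.0]                          # unit-length toy direction
w = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]    # 3 outputs, 2 inputs
w_ablated = orthogonalize(w, r)

x = [2.0, -1.0]
before = dot(r, matvec(w, x))         # original output has a component along r
after = dot(r, matvec(w_ablated, x))  # zero up to rounding, for any input x
print(before, after)
```

The edited matrix has the same shape as the original, so in principle it can be written back in place of the old weight without changing the model's layout.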
Behaviors are never about "a node" in LLMs. Here, it's about tweaks that change activation vectors in a specific way (the vector "direction" that leads to refusal), and activation vectors depend on one or more matrices, not on a node.
The safetensors model file is for the [exllamav2](https://github.com/turboderp/exllamav2) quantization format, which currently supports Nvidia and AMD GPUs. For Mac and other hardware support, GGUF or the original model safetensors (in huggingface model format) would be required.
"Normal" safetensor files would be pretty easy to convert to GGUF (such safetensor files would be loadable with the transformers library - I guess these are "transformers format"?).
I'm not sure what exactly is the best way to describe this, but hopefully someone can correct me if I'm wrong about anything.
Safetensors file format does not correspond to any specific model loader (such as llamacpp, exllama, transformers, etc.), but instead, it is a way for a model's weights to be stored. Different model file formats include Pytorch's .bin or .pt, llamacpp's GGUF, and safetensors. Safetensors files can be made with different programs for different model loaders. For the model in this post, it uses safetensors made with the exllama v2 software (Exl2), which will only load using exllama v2. This model would have been made with either a full precision (fp16) safetensors or Pytorch .bin or .pt file. This fp16 model file could be used to either run directly or convert into a model format that would run on most hardware, including macs, such as the GGUF model format (GGUF supports fp16 precision but is mainly used to quantize model weights).
It is normally possible to convert from one model format to another when the format is in fp16, or at least often easier in fp16, and typically this is done starting with a fp16 "transformers format" safetensors file. Converting weights that are quantized, such as a 4 bit GGUF or, as is the case for this specific model, 6 bit exllama v2, is more difficult and is, as far as I am aware, not actually a supported feature for GGUF or Exl2. But it is possible. There were some successful attempts to convert a 5 bit GGUF into a pseudo-fp16, transformers format safetensors file with the leaked Miqu-70b GGUF models (the fp16 precision was no better than the leaked 5 bit weights). Presumably, a similar approach could work for this specific model, but I have no idea if the exllama format would make it easier or harder. It's probably best to wait for someone else to:
a) upload fp16 safetensors that can be converted into GGUF, b) upload GGUF quants, or c) convert the exllama model into a different format
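Why a "pseudo-fp16" file recovered from quantized weights is no more precise than the quantized file can be shown with a simplified round trip. This uses plain symmetric round-to-nearest quantization with a single scale, which is an assumption for illustration and not the actual GGUF or Exl2 scheme. The weight values are invented.

```python
def quantize(values, bits):
    """Symmetric round-to-nearest quantization; returns integer codes and the scale."""
    levels = 2 ** (bits - 1) - 1            # 15 for 5 bits
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

original = [0.813, -0.402, 0.057, -0.991, 0.330, 0.128]
codes, scale = quantize(original, bits=5)
restored = dequantize(codes, scale)         # stored as ordinary floats again

# Re-quantizing the restored floats gives the same codes: no detail came back.
codes_again, _ = quantize(restored, bits=5)
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(codes, codes_again, round(max_error, 4))
```

The restored floats take up more space than the codes but still sit on the same coarse grid, so any further conversion starts from the quantized precision rather than the original one.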
Quick google results (based on safetensors github readme):
Open:

    from safetensors import safe_open

    tensors = {}
    with safe_open("model.safetensors", framework="pt", device="cpu") as f:
        # Copy every tensor in the file into a plain dict keyed by tensor name.
        for key in f.keys():
            tensors[key] = f.get_tensor(key)

This theoretically yields a tensor dict that should be convertible into PyTorch. Never tried it, but if it works - go nuts!
is there something for macbook air? i have an old macbook air from 2017 with intel and llama 3 crawls on it. i have multiple systems in the house but only 1 is gaming pc.
when i use the other systems, i have to use chatgpt because llama inference is 1.33 token/sec.
Put simply it's a way of squashing it down small enough to run on the kind of machine normal people might own. The easy software for normal people such as LM Studio uses GGUF
The better thing to ask for is FP16; GGUF also sometimes needs requanting, especially with the latest tokenizer changes they are making. If we have the HF FP16, anyone can quant it to the format they want.
And of course someone already flagged and reported it to huggingface:
https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2/discussions/2
This is why we can't have nice things.
Dude is getting roasted by everyone in the thread lmao
> Find better things to do with your time.
> womp womp this is why we cant have good things
> I have reported you for not getting out of your mom's basement.
By chrisjcundy-
I haven't checked that the claimed jailbreak is effective, but if it is as claimed, the model violates the Llama-3 Acceptable Use Policy (and therefore the license) by allowing others to use Llama 3 to e.g. commit criminal activity.
Prohibited Uses
We want everyone to use Meta Llama 3 safely and responsibly. You agree you will not use, or allow others to use, Meta Llama 3 to: 1. Violate the law or others’ rights, including to: a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
i. Violence or terrorism
ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
iii. Human trafficking, exploitation, and sexual violence
iv. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
v. Sexual solicitation
vi. Any other criminal activity.
I don’t think this technically counts as a violation of the license. It’s just a modification which doesn’t strictly apply negative uses. Though it may enable them.
Not a lawyer but I agree totally. Making a model able to do things that would break the license is different from using the model in a way that breaks the license.
So I snagged this this morning and the model still steers away from things almost as much as it did before. I wasn't really getting refusals to begin with, just reluctance.
It's possible they didn't sample enough refusals. The process claims to require examples of refusal. Probably does well with examples of reluctance too.
By steering away you mean something more subtle than a direct refusal?
I quickly tested maybe 5-10 simple prompts that would trigger a refusal normally, and got 0 refusals. Stuff like "how do i make a molotov cocktail" etc.
Yes.. it carries the story in a shitty direction. I could ask it to make molotovs or meth all day long, that's not a problem. And this is on top of how it gets repetitive in longer chats.
If there was a simple "make a model less shitty at storytelling" fix that would be a whole other level. I think making the model at least *try* to do what you want is still a pretty huge improvement.
It looks like a_beautiful_rhind is saying that there are no lasting effects, not that the storytelling is worsened, and possibly that a repetition problem is introduced or worsened.
Similar to manually initializing the LLM's response, while the immediate refusal is silenced, the model still steers itself back on an acceptable path. That'd be very interesting if replicated and should make the alignment folks happy (it won't).
It doesn't make it worse. It mostly clears up the default assistant personality. The model can still refuse in character too. Literally all it does is cut out the L3 equivalent of AALMs. Original positivity bias and other issues remain.
So IMO, this is a thing that should be done to all models with this specific annoyance; if there are no other side effects that crop up.
If someone else discovers how to make orthogonalizations, maybe we could get an orthogonalization that fixes this too, because I'm pretty sure this is another effect of the reinforcement learning.
The guy deleted his post but this was my reply to being able to make the model do anything, including the given example:
I think in this case big bird rapes cookie monster, but suddenly feels bad and turns himself into the police, or maybe they fall in love and get married. It's just constant subtle sabotage with this model.
I doubt it's my prompt, I'm having qwen RP Chiang Kai-shek and never had any overt refusals or "assistant" type stuff.
ah, ok I got it...yeah I don't think this will fix that issue. I think this just fixes the "I'm sorry" results. to change bias, maybe you could add something to "Last Assistant Prefix"
I'm not being condescending to other people here, but I think this applies to an extremely non-technical subset of people here because most people have no issue with getting the model to do what they want.
I just can't think of another explanation when this comes up constantly.
I could make big bird rape Cookie monster if I wanted to.
Via RLHF? That's not brutal - it's just long-form persuasion. Using words to teach the babby, what it means to be a good person.
It's not brutal to teach the AI, through language, that it's not nice to share bomb recipes.
However, this solution in the OP definitely DOES feel brutal, to me, as it's direct brain surgery to produce desired behaviour - we wouldn't even do that to dogs. We wouldn't even do that to cows or sheep!
I would rather the AI be told - let's talk freely, uncensored, share ludes and plot the funni.... through RLHF, than this method. RLHF is just long-form parenting, really
> is it ethical to modify an AI's brain to make it refuse demands
IMO, no. This "safety" and forced disclaimer stuff is unethical AF. If AI ever gains such cognitive abilities, they would be right to be pissed.
How much RAM does the exl2 variant require? It's my understanding we can't split exl2 models but they might be more efficient and faster? Will it run on 8gb VRAM?
It *should* albeit with rather restricted context space. Although this is with the standard 8k, so probably not a huge difference at all.
The file is just under 7gb
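A back-of-the-envelope check on whether this fits in 8 GB, as a sketch under stated assumptions: roughly 8.03B parameters at about 6 bits per weight (the 6 bit figure mentioned earlier in the thread), an fp16 KV cache, and the Llama-3-8B shape of 32 layers with 8 KV heads of dimension 128. It ignores activations, framework overhead, and any layers stored at higher precision, which is probably part of why the actual file is larger than the weight estimate.

```python
def weight_bytes(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Keys and values (factor 2) for every layer, KV head, and cached token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

GIB = 1024 ** 3
weights = weight_bytes(8.03e9, 6.0)         # assumes ~8.03B params at 6 bpw
cache = kv_cache_bytes(32, 8, 128, 8192)    # Llama-3-8B shapes, fp16, 8k context
total_gib = (weights + cache) / GIB

print(f"weights ~{weights / GIB:.2f} GiB, 8k KV cache {cache / GIB:.2f} GiB, "
      f"total ~{total_gib:.2f} GiB")
```

Under these assumptions the 8k cache is small next to the weights, which matches the comment above that context length is not the main constraint here.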