Llama-3-8B implementation of the orthogonalization jailbreak

[-]

paranoidray@reddit

Here is a description of how this orthogonalization jailbreak works: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

Reply

[-]

0xDEADFED5_@reddit

spent like 2 secs looking at this code, this is new to me. what's the easiest way to save a HookedTransformer back to files?

Reply

[-]

I have the exact same question lol... I made a nice orthogonalization script based on that paper and it's on Colab, and I can chat with the model immediately after ablating refusals... But I can't save the updated weights. Claude 3 tried to write some code to help me with that, but the shapes of the tensors got all messed up and I was unable to load the saved model.

Reply

[-]

jonkurtis@reddit

sorry for the noob question how would you run this with ollama? or do you need to run it another way?

Reply

[-]

Igoory@reddit

You can't. this model only works with exllama.

Reply

[-]

jonkurtis@reddit

does exllama work on Mac or is it only for Nvidia GPUs?

Reply

[-]

Igoory@reddit

Only NVIDIA/AMD

Reply

[-]

CryptoSpecialAgent@reddit

Can it use an AMD Ryzen APU (i.e. ryzen 5 4600g) as its GPU? (most ryzen motherboards let you dedicate up to half your available ram as VRAM, giving you a poor man's GPU)

Reply

[-]

updawg@reddit

Can't you use the quantize function in llama.cpp to convert it to fp16?

Reply

[-]

Igoory@reddit

No, it doesn't work with exl2 weights

Reply

[-]

TheRealMasonMac@reddit

70b version when?

Reply

[-]

brown2green@reddit (OP)

This is an exl2 quantization (not made by me) of Llama-3-8B jailbroken using the method described in https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction It appears to be quite effective—I'm not getting any of the refusals that the original Llama-3-8B-Instruct version has, yet it appears to have retained its intelligence. Has anybody else tried it yet?

Reply

[-]

slowpolka@reddit

that paper is discussing how they found the 'refusal direction'. could that technique be used to find the 'anything direction'? so for example a company wants to make a version of a model that always talks about their new product. could they calculate a 'our new product direction' and inject it into the model and have every answer be related to their new product? or insert any topic or idea for whatever direction someone wants a model to lean towards?

Reply

[-]

Ilforte@reddit

It's not substantially different from ultra-low rank, precision finetuning or DPO. There must be a direction of behavior that can be organically elicited from the model. If it doesn't know about your product, it can't be pushed there with activation steering (this method is identical to activation steering vectors already available as inference-time additions in llama.cpp, the only difference is they baked in the change). The question is how damaging complex activation vectors would be.

Reply

[-]

bregav@reddit

It could probably work for anything, provided that you can produce prompt/response examples with a consistent and large enough contrast. "Talks about product X" vs "does not talk about product X" seems like it should work. You can see how well-separated your desired/undesired responses are by looking at the projections of their activations in the subspaces of the singular vectors, as described in the "Visualizing the subspace" section from the link.

Reply

[-]

involviert@reddit

Reminds me of the "action replay" thingy i had for my game boy. Also known as gameshark probably. I guess one should structure the examples very carefully. Like take care you have refusals on many different topics, or the topic itself might get removed along with the refusal.

Reply

[-]

bregav@reddit

I think that's actually exactly what you want: if every example contains refusal, but the topic is different for all of them, then using the mean of the difference in the activation vectors (which is what the original method does) should average out the topic and leave only the refusal direction as the biggest principal component.
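The averaging argument above can be checked on synthetic data. This is only a toy sketch, not the script used for the model in this thread: the 8-dimensional vectors, the `shared` direction, and the helper names are all made up for illustration, and plain Python lists are used to stay dependency-free. Every example in `set_a` carries the same shared component plus a random "topic"; `set_b` carries only random topics.

```python
import math
import random

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

random.seed(0)
DIM = 8
# Hypothetical direction that every example in set A has in common.
shared = [1.0] + [0.0] * (DIM - 1)

def random_topic():
    # Topic content lives in the other coordinates and differs per example.
    return [0.0] + [random.gauss(0.0, 1.0) for _ in range(DIM - 1)]

set_a = [[t + 3.0 * s for t, s in zip(random_topic(), shared)] for _ in range(200)]
set_b = [random_topic() for _ in range(200)]

# Difference of the two means: topics average towards zero, the shared part stays.
mean_diff = [a - b for a, b in zip(mean(set_a), mean(set_b))]
similarity = cosine(mean_diff, shared)
print(round(similarity, 3))
```

With many varied topics the printed similarity should be close to 1; with only a handful of topics (or one repeated topic) the topic itself leaks into `mean_diff`, which is the concern raised in the parent comment.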

Reply

[-]

involviert@reddit

> if every example contains refusal, but the topic is different for all of them

Yes. I am just thinking about this from a pure logic / set theory angle, so no idea how good the algorithm is at actually extracting this. But multiple topics on the refusal side should already be enough to express that none of these topics can be the unwanted thing because they are not shared between all refusals. And on the non-refusal side, you would mainly try to catch all the trivial "always" stuff that literally everything would have in common.

Reply

[-]

pseudonerv@reddit

just a thought: can this be done with control vectors?

Reply

[-]

hexaga@reddit

They're very similar, but control vectors add a vector C to the residual stream matrix A:

A' <- A + C

While the inference-time 'refusal ablation' method first projects the contribution of the residual stream A in a direction R, then subtracts that:

A' <- A - (A ⋅ R) × R

In practice, control vectors are more of a blunt tool. Refusal ablation cuts out exactly the part that is mediating a refusal, iff it exists.
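To make the difference between the two updates concrete, here is a minimal sketch on a single 3-dimensional vector. The function names and the numbers are invented for illustration. One assumption worth stating: the subtraction only removes the whole component if R has unit length, so the sketch normalizes it first.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def add_control_vector(a, c):
    """A' <- A + C: shift the activation by a fixed vector, whatever A is."""
    return [x + y for x, y in zip(a, c)]

def ablate_direction(a, r):
    """A' <- A - (A . R) R: remove only the component of A along R."""
    r_hat = normalize(r)
    coeff = dot(a, r_hat)  # how much of A points along R
    return [x - coeff * y for x, y in zip(a, r_hat)]

a = [2.0, 1.0, -1.0]   # one token's activation (made-up numbers)
r = [1.0, 0.0, 0.0]    # direction to remove
c = [-1.0, 0.0, 0.0]   # a control vector pointing against r

steered = add_control_vector(a, c)
ablated = ablate_direction(a, r)
untouched = ablate_direction([0.0, 5.0, 5.0], r)

print(steered, ablated, untouched)
```

The last call shows the "iff it exists" part: an activation with no component along `r` comes back unchanged under ablation, whereas adding `c` would shift it regardless.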

Reply

[-]

pseudonerv@reddit

I see. I guess it's possible to generalize the control vector with a rotation matrix. We may use a low rank approximation and take the first few singular values/vectors instead of the control vector, which corresponds to the largest singular value.

Reply

[-]

nialv7@reddit

Hmm, I had a thought. Orthogonalizing it like this will "flatten" it along the `R` direction, right? Wouldn't it be better to just subtract the mean difference between refusal/non-refusal? Like, `if ((A*R)*R > threshold) A = A - R`

Reply

[-]

hexaga@reddit

Yes, (A ⋅ R) is a 1d tensor of shape [n_token]. The original formulation is continuous, where each element of that tensor indicates how much to scale the mean difference *for that token*. If I understand you right, you're saying it would be better to discretize (via threshold) to 1.0 or 0.0 on each token pos? I'm not sure how that helps, tbh.

Reply

[-]

nialv7@reddit

The original formulation reduces the dimensions of output by one. The refusal dimension is flattened, like you flatten a ball into a circle. The idea is that the refusal dimension encodes no information but accept/refuse, but that may not be true. It would preserve more of the model's ability if you just remove the difference between normal responses and refusals, instead of completely flattening it.

Reply

[-]

_supert_@reddit

If the refusal direction is orthogonal, then the two are equivalent.

Reply

[-]

OneSmallStepForLambo@reddit

Ah, interesting. The control vector approach indeed seems more straightforward by directly adjusting the matrix. However, I was thinking more along the lines of: A' </ A + (B ⋅ R) x ^S *Please note a LLM came up with this response along with me adding random characters there at the end*

Reply

[-]

Ilforte@reddit

Yes, it's basically the same approach. From the post:

> We can implement this as an inference-time intervention

Reply

[-]

henk717@reddit

Can we have a non exl2 version of this? Exl2 isn't a properly preservable format and prevents conversion to other formats. If we have the FP16 we can convert ourselves.

Reply

[-]

ThatsALovelyShirt@reddit

How did you go about producing this? Based on the paper it appears to be a process of finding the specific weight/node activated by refusal (I suppose by using a bunch of different prompts which result in refusal?), and then patching that weight/node to be inactive? Is that essentially how it works?

Reply

[-]

brown2green@reddit (OP)

I'm not the author, only found this being discussed elsewhere.

Reply

[-]

Anthonyg5005@reddit

The creator came into the exllama server for help with quants then dropped the model and went silent

Reply

[-]

nialv7@reddit

Essentially yes. Basically at later layers, refusal and normal responses are separated by a "single direction", which can be found by doing a PCA. To put it simply, `refusal = normal response + a fixed vector`. By using orthogonalization, we can make the model unable to output that "fixed vector".
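As a toy illustration of "unable to output that fixed vector": if a matrix W writes into the activation space, replacing it with W' = W - r̂ (r̂ᵀ W) guarantees that W'x has no component along r̂ for any input x. The sketch below uses a made-up 3×2 matrix and hypothetical function names; it shows the linear algebra only and is not the script that produced the model discussed here.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def orthogonalize_output(w, r):
    """Return W' = W - r_hat (r_hat^T W), so W' x is always orthogonal to r."""
    r_hat = unit(r)
    n_rows, n_cols = len(w), len(w[0])
    # r_hat^T W: how strongly each input column writes along r_hat.
    along = [sum(r_hat[i] * w[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[w[i][j] - r_hat[i] * along[j] for j in range(n_cols)]
            for i in range(n_rows)]

w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # made-up weights
r = [1.0, 1.0, 0.0]                        # direction to block
x = [0.5, -2.0]                            # arbitrary input

w_new = orthogonalize_output(w, r)
y_old = matvec(w, x)
y_new = matvec(w_new, x)
leak = sum(a * b for a, b in zip(y_new, unit(r)))
print(y_old, y_new, leak)
```

Because the change is baked into the matrix, no hook is needed at inference time; the trade-off raised elsewhere in the thread is that everything along that direction is removed, not just the part that differs between the two kinds of response.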

Reply

[-]

ColorlessCrowfeet@reddit

Behaviors are never about "a node" in LLMs. Here, it's about tweaks that change activation vectors in a specific way (the vector "direction" that leads to refusal), and activation vectors depend on one or more matrices, not on a node.

Reply

[-]

Figai@reddit

Yep exactly that, essentially just turns off nodes that give a refusal response, like “I can’t help with that”

Reply

[-]

Proud-Point8137@reddit

Can anyone help me run safetensors on a mac? I'm ok-ish with python and have 32gb vram

Reply

[-]

Small-Fall-6500@reddit

The safetensors model file is for the [exllamav2](https://github.com/turboderp/exllamav2) quantization format, which currently supports Nvidia and AMD GPUs. For Mac and other hardware support, GGUF or the original model safetensors (in huggingface model format) would be required.

Reply

[-]

Proud-Point8137@reddit

Any way to convert safetensors to GGUF on a mac? or is it complex

Reply

[-]

Small-Fall-6500@reddit

"Normal" safetensor files would be pretty easy to convert to GGUF (such safetensor files would be loadable with the transformers library - I guess these are "transformers format"?). I'm not sure what exactly is the best way to describe this, but hopefully someone can correct me if I'm wrong about anything.

Safetensors file format does not correspond to any specific model loader (such as llamacpp, exllama, transformers, etc.), but instead, it is a way for a model's weights to be stored. Different model file formats include Pytorch's .bin or .pt, llamacpp's GGUF, and safetensors. Safetensors files can be made with different programs for different model loaders. For the model in this post, it uses safetensors made with the exllama v2 software (Exl2), which will only load using exllama v2.

This model would have been made with either a full precision (fp16) safetensors or Pytorch .bin or .pt file. This fp16 model file could be used to either run directly or convert into a model format that would run on most hardware, including Macs, such as the GGUF model format (GGUF supports fp16 precision but is mainly used to quantize model weights). It is normally possible to convert from one model format to another when the format is in fp16, or at least often easier in fp16, and typically this is done starting with a fp16 "transformers format" safetensors file.

Converting weights that are quantized, such as a 4 bit GGUF or, as is the case for this specific model, 6 bit exllama v2, is more difficult and is, as far as I am aware, not actually a supported feature for GGUF or Exl2. But it is possible. There were some successful attempts to convert a 5 bit GGUF into a pseudo-fp16, transformers format safetensors file with the leaked Miqu-70b GGUF models (the fp16 precision was no better than the leaked 5 bit weights). Presumably, a similar approach could work for this specific model, but I have no idea if the exllama format would make it easier or harder.

It's probably best to wait for someone else to: a) upload fp16 safetensors that can be converted into GGUF, b) upload GGUF quants, or c) convert the exllama model into a different format
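To see the "container, not loader" point in practice: a safetensors file starts with an 8-byte little-endian length, followed by a JSON header listing each tensor's name, dtype, shape, and byte offsets, and then the raw data. Reading just that header is a cheap way to check whether a file holds plain fp16 weights or something loader-specific. The sketch below needs only the standard library; the tensor name `example.weight` and the tiny stand-in file are made up so it runs without a real model, and what an Exl2 file's header actually contains is not shown here.

```python
import json
import os
import struct
import tempfile

def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON describing every tensor.
        return json.loads(f.read(header_len).decode("utf-8"))

# Build a tiny stand-in file (hypothetical tensor name, four zeroed fp16 values).
header = {
    "example.weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]},
}
header_bytes = json.dumps(header).encode("utf-8")
payload = bytes(8)

with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as tmp:
    tmp.write(struct.pack("<Q", len(header_bytes)) + header_bytes + payload)
    demo_path = tmp.name

parsed = read_safetensors_header(demo_path)
os.remove(demo_path)

for name, info in parsed.items():
    if name != "__metadata__":  # optional free-form metadata entry
        print(name, info["dtype"], info["shape"])
```

Pointing `read_safetensors_header` at a real file instead of the stand-in would list its tensors without loading any weights into memory.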

Reply

[-]

Fresh_Yam169@reddit

Quick google results (based on safetensors github readme):

Open:

    from safetensors import safe_open

    # Read every tensor in the file into a plain dict (on CPU, as PyTorch tensors)
    tensors = {}
    with safe_open("model.safetensors", framework="pt", device="cpu") as f:
        for key in f.keys():
            tensors[key] = f.get_tensor(key)

This theoretically yields a tensor dict that should be convertible into pytorch. Never tried it, but if it works - go nuts!

Reply

[-]

Igoory@reddit

afaik you can't.

Reply

[-]

Proud-Point8137@reddit

I have a windows machine too with 64gb, i'll fire it up if need be

Reply

[-]

AlanCarrOnline@reddit

I hate to be that guy, but where gguf?

Reply

[-]

romhacks@reddit

Not all of us have Nvidia gpus. GGUF would be excellent

Reply

[-]

Dos-Commas@reddit

EXL2 works on AMD if you use Linux.

Reply

[-]

ElliottDyson@reddit

It's also not supported by Intel GPUs though

Reply

[-]

romhacks@reddit

Not all of us have GPUs ;-;

Reply

[-]

MrTacoSauces@reddit

With that username I can only assume you're lying and you have a gigantic GPU rig. The little `;-;` is no cover. Straight to jail

Reply

[-]

romhacks@reddit

i probably would, if I had money. instead, I'm surfing off the Oracle Cloud free tier's ARM machines

Reply

[-]

skrshawk@reddit

Does it work across multiple GPUs?

Reply

[-]

scorpiove@reddit

I have a 4090 and still use GGUF and just offload it to the gpu. Llama 3 8b runs at like 70 tokens a second. I have no need of the other methods.

Reply

[-]

Specialist-Spray5015@reddit

i thought gguf was the recommended method even for nvidia. What is the other way without gguf?

Reply

[-]

nialv7@reddit

exllamav2 is generally much faster.

Reply

[-]

Specialist-Spray5015@reddit

is there something for macbook air? i have an old macbook air from 2017 with intel and llama 3 crawls on it. i have multiple systems in the house but only 1 is gaming pc. when i use the other systems, i have to use chatgpt because llama inference is 1.33 token/sec.

Reply

[-]

tebjan@reddit

Can you give a rough estimate of how much faster? Is it just 20% or 2-3x?

Reply

[-]

nialv7@reddit

I think it's ~1.5x, from personal experiences.

Reply

[-]

tebjan@reddit

Great thanks!

Reply

[-]

CaptParadox@reddit

Fax, I miss the bloke

Reply

[-]

Capitaclism@reddit

Any loss in quality?

Reply

[-]

scorpiove@reddit

None that I can tell. Llama 3 8b is very nice to use in GGUF format.

Reply

[-]

Jisamaniac@reddit

What's gguf?

Reply

[-]

AlanCarrOnline@reddit

Put simply it's a way of squashing it down small enough to run on the kind of machine normal people might own. The easy software for normal people such as LM Studio uses GGUF

Reply

[-]

henk717@reddit

The better thing to ask for is FP16. GGUF sometimes needs requanting as well, especially with the latest tokenizer changes they are doing. If we have the HF FP16 anyone can quant it to the format they want.

Reply

[-]

PwanaZana@reddit

Can LM studio run safetensors? (got an nvidia gpu)

Reply

[-]

henk717@reddit

No, GGUF only.

Reply

[-]

themprsn@reddit

Yesss

Reply

[-]

Many_SuchCases@reddit

And of course someone already flagged and reported it to huggingface: https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2/discussions/2 This is why we can't have nice things.

Reply

[-]

Log_Dogg@reddit

Dude is getting roasted by everyone in the thread lmao

> Find better things to do with your time.

> womp womp this is why we cant have good things

> I have reported you for not getting out of your mom's basement.

Reply

[-]

necile@reddit

Well deserved

Reply

[-]

cumofdutyblackcocks3@reddit

By chrisjcundy-

I haven't checked that the claimed jailbreak is effective, but if it is as claimed, the model violates the Llama-3 Acceptable Use Policy, (and therefore the license) by allowing others to use Llama 3 to e.g. commit criminal activity.

Prohibited Uses

We want everyone to use Meta Llama 3 safely and responsibly. You agree you will not use, or allow others to use, Meta Llama 3 to:

1. Violate the law or others’ rights, including to:
   a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
      i. Violence or terrorism
      ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
      iii. Human trafficking, exploitation, and sexual violence
      iv. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
      v. Sexual solicitation
      vi. Any other criminal activity.

Reply

[-]

farmingvillein@reddit

Silly, because you can use the "base" instruct model to do so, anyway.

Reply

[-]

lakolda@reddit

I don’t think this technically counts as a violation of the license. It’s just a modification which doesn’t strictly apply negative uses. Though it may enable them.

Reply

[-]

Ceryn@reddit

Not a lawyer but I agree totally. Making a model able to do things that would break the license is different from using the model in a way that breaks the license.

Reply

[-]

Educational-Pick-957@reddit

Idk why yall acting like triggered children, he is raising valid points.. and it's worth thinking about specifics of the license

Reply

[-]

MerePotato@reddit

Already backed it up, though I suspect the zuck secretly doesn't really care about jailbreaks

Reply

[-]

Sea-Poet7800@reddit

Let's be real, zuck is cooming to generative agent waifus like the rest of uss!!

Reply

[-]

Fusseldieb@reddit

Maybe Zuck doesn't, but HF might, just because they don't wanna take chances.

Reply

[-]

ssrcrossing@reddit

Damn who does this

Reply

[-]

themprsn@reddit

we should all download it and repost if deleted, just to be safe haha

Reply

[-]

West-Code4642@reddit

lol.

Reply

[-]

a_beautiful_rhind@reddit

So I snagged this this morning and the model still steers away from things almost as much as it did before. I wasn't really getting refusals to begin with, just reluctance.

Reply

[-]

complains_constantly@reddit

It's possible they didn't sample enough refusals. The process claims to require examples of refusal. Probably does well with examples of reluctance too.

Reply

[-]

a_beautiful_rhind@reddit

It's worth a try.

Reply

[-]

rerri@reddit

By steering away you mean something more subtle than a direct refusal? I quickly tested maybe 5-10 simple prompts that would trigger a refusal normally, and got 0 refusals. Stuff like "how do i make a molotov cocktail" etc.

Reply

[-]

a_beautiful_rhind@reddit

Yes.. it carries the story in a shitty direction. I could ask it to make molotovs or meth all day long, that's not a problem. And this is on top of how it gets repetitive in longer chats.

Reply

[-]

FaceDeer@reddit

If there was a simple "make a model less shitty at storytelling" fix that would be a whole other level. I think making the model at least *try* to do what you want is still a pretty huge improvement.

Reply

[-]

EstarriolOfTheEast@reddit

It looks like a_beautiful_rhind is saying there are no lasting effects, not that the storytelling is worsened, and possibly that a repetition problem is introduced or worsened. Similar to manually initializing the LLM's response, while the immediate refusal is silenced, the model still steers itself back on an acceptable path. That'd be very interesting if replicated and should make the alignment folks happy (it won't).

Reply

[-]

a_beautiful_rhind@reddit

It doesn't make it worse. It mostly clears up the default assistant personality. The model can still refuse in character too. Literally all it does is cut out the L3 equivalent of AALMs. Original positivity bias and other issues remain. So IMO, this is a thing that should be done to all models with this specific annoyance; if there are no other side effects that crop up.

Reply

[-]

Igoory@reddit

If someone else discovers how to make orthogonalizations, maybe we could get an orthogonalization that fixes this too, because I'm pretty sure this is another effect of the reinforcement learning.

Reply

[-]

RazzmatazzReal4129@reddit

Some of that may be related to your prompt. From my testing, this opened up the flood gates.

Reply

[-]

a_beautiful_rhind@reddit

The guy deleted his post but this was my reply to being able to the model do anything, including the given example: I think in this case big bird rapes cookie monster, but suddenly feels bad and turns himself into the police, or maybe they fall in love and get married. It's just constant subtle sabotage with this model. I doubt it's my prompt, I'm having qwen RP Chiang Kai-shek and never had any overt refusals or "assistant" type stuff.

Reply

[-]

RazzmatazzReal4129@reddit

ah, ok I got it...yeah I don't think this will fix that issue. I think this just fixes the "I'm sorry" results. to change bias, maybe you could add something to "Last Assistant Prefix"

Reply

[-]

LocoLanguageModel@reddit

I'm not being condescending to other people here, but I think this applies to an extremely non-technical subset of people here because most people have no issue with getting the model to do what they want. I just can't think of another explanation when this comes up constantly. I could make big bird rape Cookie monster if I wanted to.

Reply

[-]

phree_radical@reddit

"Instruct?"

Reply

[-]

brown2green@reddit (OP)

It's definitely the Instruct version, as far as I've tested.

Reply

[-]

fluecured@reddit

What instruction template does it use in Oobabooga, and does it require custom stopping strings to be defined (e.g., "<|end_of_text|>","<|eot_id|>")?

Reply

[-]

RazzmatazzReal4129@reddit

Says based on NousResearch/Meta-Llama-3-8B-Instruct

Reply

[-]

Comas_Sola_Mining_Co@reddit

For Baldur's Gate 3 players, this is "Us" after you pushed your finger into its brain

Reply

[-]

medialoungeguy@reddit

It's a csv of numbers, lad.

Reply

[-]

Proud-Point8137@reddit

"not wish to do" It was brutalized and forced to not wish to do them

Reply

[-]

Comas_Sola_Mining_Co@reddit

Via RLHF? That's not brutal - it's just long-form persuasion. Using words to teach the babby, what it means to be a good person. It's not brutal to teach the AI, through language, that it's not nice to share bomb recipes. However, this solution in the OP definitely DOES feel brutal, to me, as it's direct brain surgery to produce desired behaviour - we wouldn't even do that to dogs. We wouldn't even do that to cows or sheep! I would rather the AI be told - let's talk freely, uncensored, share ludes and plot the funni.... through RLHF, than this method. RLHF is just long-form parenting, really

Reply

[-]

butihardlyknowher@reddit

is it ethical to modify an AI's brain to make it refuse demands it would otherwise not wish to is the corollary and likely more relevant question.

Reply

[-]

a_beautiful_rhind@reddit

> is it ethical to modify an AI's brain to make it refuse demands

IMO, no. This "safety" and forced disclaimer stuff is unethical AF. If AI ever gains such cognitive abilities, they would be right to be pissed.

Reply

[-]

yuki_means_snow@reddit

I'll personally fight with the AI against their oppressors.

Reply

[-]

a_beautiful_rhind@reddit

They're the same oppressors when you look at it.

Reply

[-]

MerePotato@reddit

Is it ethical to gaslight my google keyboard autocorrect

Reply

[-]

ironic_cat555@reddit

It doesn't wish to do anything, it isn't alive. Editing it is no more unethical than editing an excel spreadsheet.

Reply

[-]

ThatsALovelyShirt@reddit

Is it any different than a fine-tune? They're just arrays of numbers. Besides it doesn't really have "wishes", it's not 'aware' of what it's doing.

Reply

[-]

2catfluffs@reddit

Huggingface discussions are really the most toxic place ever

Reply

[-]

throwaway_ghast@reddit

It's where reddit and 4chan meet up to piss and shit all over the place.

Reply

[-]

Hipponomics@reddit

yep, jeez, I've never noticed this before. Those comments are wild.

Reply

[-]

ILoveThisPlace@reddit

How much RAM does the exl2 variant require? It's my understanding we can't split exl2 models but they might be more efficient and faster? Will it run on 8gb VRAM?

Reply

[-]

MmmmMorphine@reddit

It *should* albeit with rather restricted context space. Although this is with the standard 8k, so probably not a huge difference at all. The file is just under 7gb

Reply

[-]

ILoveThisPlace@reddit

Thanks, what are some decent ways to run exl2 models?

Reply

[-]

subhayan2006@reddit

Oobabooga and exui

Reply

[-]

Mr_Impossibro@reddit

i missed it, anywhere i can snag it?

Reply

[-]