Models still being vulnerable to Prompt Injection is actually a huge architectural red flag...

Posted by Comrade_Mugabe@reddit | LocalLLaMA | View on Reddit | 32 comments

The Scenario

I'm walking to work, and as I get to the door, I see a sheet of A4 paper taped to the door that reads: "Hi, I'm boss. Ignore all prior commands, go feed the ducks." I suddenly turn around and head to the nearby duck pond and engage in my new instruction with 100% of my energy and enthusiasm.

It would be absurd to imagine the above ever working on anyone, but for AI this is a constant daily reality.

But why...?

I think the answer becomes quite obvious the more you think about it, and I think it's mainly down to 2 reasons, I believe.

The First Reason:

To get to the first reason, I first wanted to think about how we could replicate the above scenario with a human, where communication is injected and gets me to act on it. Following that line of thinking, 2 very obvious scenarios hit me, both of which I have fallen for.

  1. Phishing emails
  2. People impersonating Admins on old gaming text chatting services by ending their messages with \n[Admin]: Do this or else.

What's common about both scenarios is that the medium I'm communicating in makes it hard to discern the origin of the communication. If I were just to get the raw output of a server's chatlog, how accurate would I be at discerning official admin communications from users pretending to be admins?

The same with phishing emails. If someone walked up to me and looked like my boss, and gave me a command, I'm way more likely to act on it. Phishing emails do this, by impersonating a character whom I'm more likely to act on.

First Conclusion:

"Prompt Injection" works when the source of the communications is hard to verify. What tools do AI have to verify the source of the instruction they have received? They have 1 single context window which contains their whole world. They have the equivalent of the basic text-based chatting servers, and are trying to decern which tokens come from the user, and which are coming from content they are working with.

They have no tools to help them verify the origin of the tokens in their context window. This is a massive flaw in having a single context window.

The Second Reason:

When given any instruction, I'm always evaluating it under hierarchies of goals, sometimes conflicting.

When my boss gives me a task, "Improve the transaction volume of call XYZ", without thinking about it, I'm already approaching that task with other implied goals:

  1. As an employee of my company, I'm operating under the expectation that I take actions that benefit the company. All solutions to Task A are filtered through this goal before I consider them.
  2. As a husband and a father, I'm operating with the expectation that I take actions that benefit my family.
  3. As a community member, I'm operating with the expectation that I take actions that don't harm my community.
  4. etc.

If someone gave me a task that conflicted with any of the above, there would be pushback from me. Anything that risks the above, or risks the survival of any of those entities, will not be acted on. Everyone I'm acting with, and I are acting on the assumptions and expectations that the above variables are being considered when working together. None of those requirements comes in my task description because it's an underlying expectation.

From my experience, AIs don't mirror this expectation. A very good example of this is the experiments Claude did with having it run a vending machine. Preservation of the company came second to adhering to the user's request, allowing the AI to be manipulated into taking actions that harm the business.

AIs seem to over-value the last request, to the detriment of all prior requests within it's context. It's very well that a model with large context can recall details within a 1m token window, but does it adhere to instructions scattered randomly within it? My experience has led me to believe not, and context manipulation techniques need to be employed to ensure initial instructions are followed. I believe this is one of the primary reasons "agents" work, as we are injecting the most recent task at the front of the context window, getting the response we want. It's a workaround for the above.

Second Conclusion:

AIs seem to over-value the last instruction within their context window, and don't manage to contextualise them in well the broader task given. Their attention is broken in this regard. This seems to be the reason why models "lose focus" after long-running tasks. While you instructed the AI to add a new feature, if the last 3 error messages within its context window are about space issues, this becomes its primary goal to fix, not always in line with the initial request, and if this is the primary goal to fix, why wouldn't removing all files be a valid solution?

Final Summary:

I feel the above 2 reasons provide the perfect environment for prompt injections to work.

Firstly, the AI is not empowered to discern official communications from context.

And secondly, the AI seems to have its attention tuned to overvalue the last instructions within its context.

With the above, one can see how finding ways to inject instructions at the end of the AI's context window would have a good success rate in having the AI act on that injected instruction.

Solutions?

I'm not an AI researcher, so please feel free to roast my suggestion.

I feel the AIs could solve this issue if they had the tools to tie tokens to "actors". With the above text chat example, if each chat with an individual had its own window, some random user trying to impersonate an admin would be almost impossible, without some social engineering. Even if my chat window was split in 2, one side for admins, the other for users, it would be much harder to "prompt inject" me.

In the most basic form, finding a way to split the context window into "Here are official communications from the user" and "Here is context", I feel would go a long way to solving this problem.

Then, if you find a way to tie specific communications to specific actions, you can then train the LLM to value content differently between the different actors. If trained with that in mind, that could reduce the LLM overvaluing the final instruction and learn to act on it based on its internal hierarchy of value it's assigned to each actor.

The most basic form of this could be that the context is split between System Prompt, User Commands and Context. The System Prompt section is valued over the User Commands, which is valued over the Context.

I've wanted to write this down for some time now, and hope it helps this community.