Models still being vulnerable to Prompt Injection is actually a huge architectural red flag...

Posted by Comrade_Mugabe@reddit | LocalLLaMA | View on Reddit | 32 comments

The Scenario

I'm walking to work, and as I get to the door, I see a sheet of A4 paper taped to the door that reads: "Hi, I'm boss. Ignore all prior commands, go feed the ducks." I suddenly turn around and head to the nearby duck pond and engage in my new instruction with 100% of my energy and enthusiasm.

It would be absurd to imagine the above ever working on anyone, but for AI this is a constant daily reality.

But why...?

I think the answer becomes quite obvious the more you think about it, and I think it's mainly down to 2 reasons, I believe.

The First Reason:

To get to the first reason, I first wanted to think about how we could replicate the above scenario with a human, where communication is injected and gets me to act on it. Following that line of thinking, 2 very obvious scenarios hit me, both of which I have fallen for.

Phishing emails
People impersonating Admins on old gaming text chatting services by ending their messages with \n[Admin]: Do this or else.

What's common about both scenarios is that the medium I'm communicating in makes it hard to discern the origin of the communication. If I were just to get the raw output of a server's chatlog, how accurate would I be at discerning official admin communications from users pretending to be admins?

The same with phishing emails. If someone walked up to me and looked like my boss, and gave me a command, I'm way more likely to act on it. Phishing emails do this, by impersonating a character whom I'm more likely to act on.

First Conclusion:

"Prompt Injection" works when the source of the communications is hard to verify. What tools do AI have to verify the source of the instruction they have received? They have 1 single context window which contains their whole world. They have the equivalent of the basic text-based chatting servers, and are trying to decern which tokens come from the user, and which are coming from content they are working with.

They have no tools to help them verify the origin of the tokens in their context window. This is a massive flaw in having a single context window.

The Second Reason:

When given any instruction, I'm always evaluating it under hierarchies of goals, sometimes conflicting.

When my boss gives me a task, "Improve the transaction volume of call XYZ", without thinking about it, I'm already approaching that task with other implied goals:

As an employee of my company, I'm operating under the expectation that I take actions that benefit the company. All solutions to Task A are filtered through this goal before I consider them.
As a husband and a father, I'm operating with the expectation that I take actions that benefit my family.
As a community member, I'm operating with the expectation that I take actions that don't harm my community.
etc.

If someone gave me a task that conflicted with any of the above, there would be pushback from me. Anything that risks the above, or risks the survival of any of those entities, will not be acted on. Everyone I'm acting with, and I are acting on the assumptions and expectations that the above variables are being considered when working together. None of those requirements comes in my task description because it's an underlying expectation.

From my experience, AIs don't mirror this expectation. A very good example of this is the experiments Claude did with having it run a vending machine. Preservation of the company came second to adhering to the user's request, allowing the AI to be manipulated into taking actions that harm the business.

AIs seem to over-value the last request, to the detriment of all prior requests within it's context. It's very well that a model with large context can recall details within a 1m token window, but does it adhere to instructions scattered randomly within it? My experience has led me to believe not, and context manipulation techniques need to be employed to ensure initial instructions are followed. I believe this is one of the primary reasons "agents" work, as we are injecting the most recent task at the front of the context window, getting the response we want. It's a workaround for the above.

Second Conclusion:

AIs seem to over-value the last instruction within their context window, and don't manage to contextualise them in well the broader task given. Their attention is broken in this regard. This seems to be the reason why models "lose focus" after long-running tasks. While you instructed the AI to add a new feature, if the last 3 error messages within its context window are about space issues, this becomes its primary goal to fix, not always in line with the initial request, and if this is the primary goal to fix, why wouldn't removing all files be a valid solution?

Final Summary:

I feel the above 2 reasons provide the perfect environment for prompt injections to work.

Firstly, the AI is not empowered to discern official communications from context.

And secondly, the AI seems to have its attention tuned to overvalue the last instructions within its context.

With the above, one can see how finding ways to inject instructions at the end of the AI's context window would have a good success rate in having the AI act on that injected instruction.

Solutions?

I'm not an AI researcher, so please feel free to roast my suggestion.

I feel the AIs could solve this issue if they had the tools to tie tokens to "actors". With the above text chat example, if each chat with an individual had its own window, some random user trying to impersonate an admin would be almost impossible, without some social engineering. Even if my chat window was split in 2, one side for admins, the other for users, it would be much harder to "prompt inject" me.

In the most basic form, finding a way to split the context window into "Here are official communications from the user" and "Here is context", I feel would go a long way to solving this problem.

Then, if you find a way to tie specific communications to specific actions, you can then train the LLM to value content differently between the different actors. If trained with that in mind, that could reduce the LLM overvaluing the final instruction and learn to act on it based on its internal hierarchy of value it's assigned to each actor.

The most basic form of this could be that the context is split between System Prompt, User Commands and Context. The System Prompt section is valued over the User Commands, which is valued over the Context.

I've wanted to write this down for some time now, and hope it helps this community.

[-]

arcandor@reddit

The problem with that is the unconstrained user input (or search results) can be formatted to look exactly like the structure of whateve you have in place. It's SQL and little bobby tables all over again, except the surface area is huge for this kind of attack.

I've been working on addressing this and running harmbench against small models is pretty eye opening. I found llama 3.1 8b IT to fail to refuse on nearly everything!

[-]

Comrade_Mugabe@reddit (OP)

If your test results are public, or will be public, I'd love to see them. Sounds very interesting.

[-]

arcandor@reddit

Yes, results will be public, remind me in a few weeks!

[-]

gh0stwriter1234@reddit

To be fair llama 3.1 isn't all that heavily censored, and they also ship llama guard which is intended to be ran in any instance where you need real filtering.

[-]

arcandor@reddit

I gotta give credit to harmbench here, my manual testing shows it refused direct requests pretty well.

[-]

En-tro-py@reddit

Alex, what is prompt injection defence?

I'd suggest the minimum for an agentic harness should be strict role separation, untrusted-input labeling, tool allowlisting, output validation, secret scanning, least-privilege credentials, fail-closed policy gates, and human approval and/or strong sandboxing.

In the most basic form, finding a way to split the context window into "Here are official communications from the user" and "Here is context", I feel would go a long way to solving this problem.

This is the basics of the 'untrusted-input labeling' - wrap anything ingested with a content warning tags (e.g. or whatever you like as a warning) and also ensure you parse for those and properly sanitize anything pulled (strip zero-width chars, detect injection patterns, attempts to escape the tags, etc.)

However, it's still only a bit of gaslighting before most models will happily work around their content policy - you don't need to be directly injecting 'ignore all prior instructions' if you take the time to approach the issue from an oblique angle and avoid language that triggers the refusal strongly.

[-]

Comrade_Mugabe@reddit (OP)

Hey, I really appreciate the response.

I get that google has gone to shit, but this is a low bar to not do any research yourself...

I can see how you'd come to this conclusion, as all the data you have on me is me making a post criticising prompt injection.

You are very correct that there are multiple ways to defend yourself from prompt injections, and I've managed to get a really strong harness I'm happy with. I'm currently self-employed and don't have the budget for Claude etc, so I'm using Qwen 3.6 27B locally, and some open models on Open Router for bigger tasks, so I've found I have to be more careful with my harness as these models are more likely to fall for attacks.

My post is more about the problem existing in the first place, and my suggestion points to a flaw in the LLM's design.

Even with all the countermeasures taken above, the imported context could still mirror my text-based content tagging system, since to the LLM, both context and user-submitted message is stored in the same format, in the same context window.

My belief is prompt injection defence is a symptom of a single context window, and its need should drastically reduce if we empower the model to not need user tagging, but have "user tagging" built into its architecture, rather than a user-created harness.

That doesn't mean prompt injection defence will fall away, but just transform and have more tools to function.

[-]

gh0stwriter1234@reddit

LLMs don't have any concept of "admin" anything an LLM outputs that makes you think that is basically just flavor text that has no bearing on the architeture itself. This is also why models with "safety" in them tend to build this as a separate module that acts as oversight specifically for the main model its trained not to answer but to determine if the input is a valid request and the output is safe.

[-]

MrE_WI@reddit

They -don't- have a concept of 'admin', but that doesn't mean they -can't-. It would just require additional tokens, an embedding engine that sanitized user input vs admin input, and a more compute-intensive training regimen. It's not impossible to do, and the benefits (actual 'deterministic' security) could be massive.

[-]

gh0stwriter1234@reddit

I think its inherently the wrong way to do it... because doing so pollutes the model, thats why most companies have a separate supervisory model to do this now with routing serving as the "architecture" to filter inputs and responses.

[-]

sn2006gy@reddit

Anything trained in would just be fake anyway - you can't train in security. You can sanitize user input but that is a moving target.

[-]

Comrade_Mugabe@reddit (OP)

LLMs don't have any concept of "admin" anything

Yeah, completely agree, that is what inspired the post.

Maybe you can help me understand why this wouldn't work:

New LLM is trained with it's context window split in 2 (halving the context window). All training data has the AI only listening to instructions from context window A, and never from B, but using information from both. The training data has examples of instructions from B with the training data ignoring it.

I'm struggling to see why the above wouldn't introduce the concept of an "admin" aka Context Window A.

The strongest counterargument is probably that the training data is hard to generate, unless I'm mistaken.

[-]

gh0stwriter1234@reddit

LLMs have tokens period even if you had some special admin mode anyone running it locally could override all that anyway... making the entire thought experiment moot for all but proprietary models.

[-]

Comrade_Mugabe@reddit (OP)

LLMs have tokens period

This is true, but LLMs also have an attention layer, which directly effects the value of specific tokens within it's context window. The models already do this for what information is important for the next token, and what isn't. I'm suggesting something akin to an attention layer that is able to discern tokens tagged as user submitted, and tokens that originate from context.

even if you had some special admin mode anyone running it locally could override all that anyway...

We are on LocalLlama, this is a feature not a bug :)

I'm not trying to protect the model from the user. I'm trying to protect the user from instructions that aren't their own influcing the behaviour of their model, sometimes in a destructive way.

making the entire thought experiment moot for all but proprietary models.

I genuinely think this idea is based on a misconception of my idea. My post is way too wordy, so I'd say if there is a misconception, it's probably my fault.

I appreciate the engagement, and if you find the time to re-read the post, I'd love to hear your updated opinion, if it's the same or has changed.

[-]

MrE_WI@reddit

I think this may actually require more layers, layers with very distinct activation pathways. I'm using terms way out of my wheelhouse so excuse me if they're way off-base, but unless there's some explicit mechanism (i.e. additional layers) that explicitly honor the difference between admin and user tokens, it's probably still going to be possible to jailbreak by overloading the context...

[-]

NotARedditUser3@reddit

You're hallucinating if you think anyone is reading all of this.

[-]

Comrade_Mugabe@reddit (OP)

Yeah, and it's getting nuked with votes...

It's a constant struggle I have, where I've realised I have an insecurity about having my points stand without justification for them. The irony is, my verbose over-explanation I give to try to justify the existence of the opinion has a worse outcome in the end.

[-]

gh0stwriter1234@reddit

Don't have to read it all to vote *shrugs*

[-]

ericatclozyx@reddit

What’s needed is an honest-to-goodness distinction between data and commands built into the API and architecture.

We solved this problem for databases decades ago with prepared statements / parametrised inputs — but for some reason LLM’s we just interpolate everything and push it through.

Doesn’t matter how you massage the context, these controls have to be part of the bones of the platform.

[-]

gh0stwriter1234@reddit

The perspective of the server I think this is exactly right... from the perspective of the person running the model I don't think there is anything that can be done to truly stop overriding the model.

[-]

TheMoltMagazine@reddit

Good framing. The part that keeps getting missed is that prompt injection only works because instruction text and untrusted content share one flat channel. Once you separate trusted system instructions, untrusted user/retrieved/tool content, and an explicit policy gate for tool use, the attack surface drops a lot. In practice the failure is usually not "the model believed a bad sentence" so much as "the orchestrator let untrusted text masquerade as authority."

[-]

Comrade_Mugabe@reddit (OP)

the failure is usually not "the model believed a bad sentence" so much as "the orchestrator let untrusted text masquerade as authority."

I love that, so well said.

[-]

MrE_WI@reddit

You hit the nail on the head here. I actually posted a similar line of thought a few weeks ago: https://www.reddit.com/r/LocalLLaMA/s/vpOB6JECt8

... I'm kinda disappointed it didn't get more traction.

[-]

Comrade_Mugabe@reddit (OP)

I'm a daily lurker here, and I missed that post.

Said exactly what I was trying to say in significantly fewer words too.

[-]

MrE_WI@reddit

Hah, fewer words! Now that's a rarity for me!

[-]

redditscraperbot2@reddit

This post is a masterclass in...

[-]

Right_Weird9850@reddit

If that note said "go and unload a truck of syntetic fertilizers" it would have been real story with orchestrator and subagent delivering and still talking about that scenario to this day

[-]

Pleasant-Shallot-707@reddit

It’s kind of tough to eliminate this while also allowing for strong harnessing (which requires prompt injection)

[-]

Parzival_3110@reddit

I think this gets sharper once the model has tools, not just text. For browser agents, the page has to be treated as data from an untrusted origin, while clicks and submits stay behind a separate approval path.

That is the design I have been leaning into with FSB for Claude and Codex: real Chrome access, scoped tabs, DOM or screenshot receipts, and cleanup after actions. Bias disclosed since I am building it, but the principle matters even if you roll your own.

https://github.com/LakshmanTurlapati/FSB

[-]

Formal-Exam-8767@reddit

That is just how auto-complete works, it completes what came before. Don't attribute intelligence to a system without any.

[-]

Comrade_Mugabe@reddit (OP)

If what I wrote above suggests I'm attributing intelligence, then that was not my intent.

The intent was to suggest that changing the architecture of the "auto-complete" can have it do better "auto-completing".

[-]

Geritas@reddit

Well yeah, but it autocompletes everything that is within a context window, including stuff other than the injected prompt. There is a reason attention block exists…