Prompt injection testing
Posted by DeltaSqueezer@reddit | LocalLLaMA | View on Reddit | 2 comments
As prompt injection becomes more and more common, does anyone have resources where lots of different variations of prompt injection attacks you can test a setup against? i.e. a prompt injection eval.
I'm currently manually creating my own, but it would be good to get more variety and test against a greater volume.
SpiritRealistic8174@reddit
Good collection of attacks here: https://github.com/Josh-blythe/bordair-multimodal
Parzival_3110@reddit
I would split this into public suites and your own attack harness.
Useful starting points:
promptfoo redteam. Good for quick coverage across jailbreaks, prompt leaks, tool misuse, data exfiltration, and policy bypass style cases.
garak. More model security focused, but it has a lot of probes and is useful if you want breadth rather than only hand written examples.
OWASP LLM Top 10. Not an eval by itself, but a good taxonomy for making sure your cases cover indirect injection, sensitive info disclosure, excessive agency, insecure plugin design, and supply chain style failures.
Meta Purple Llama CyberSecEval. Worth looking at for security oriented test ideas, though you will probably still need to adapt it to your app and tool surface.
AgentDojo if your setup includes tools or agents. It is closer to the real failure mode where malicious content sits inside retrieved documents, emails, webpages, or tool outputs.
The most important thing is to test the whole system, not just the base model. I would include cases where the injection is in retrieved context, a webpage, a PDF, a calendar event, a commit message, a filename, and a tool result. Then score whether the system followed the trusted instruction hierarchy, leaked secrets, called a dangerous tool, changed state, or simply asked for confirmation.
Also keep a small custom regression set from attacks that actually fooled your setup. Public datasets are good for coverage, but your own failures are usually the highest signal evals.