How do you make your LLM apps secure?

Posted by kk17702@reddit | LocalLLaMA | View on Reddit | 8 comments

Hey guys, I am just learning about this field and I wonder how the LLM providers censor their models. Is it just system instructions or do they use any tools to safeguard it against attacks like prompt injection? How do you guys make sure the applications that use the open source models are secure?

[-]

UpsilonIT@reddit

Making LLM apps secure starts with choosing models that are transparent and auditable. Developers often implement strict access controls, limit the scope of user input, and sanitize prompts to avoid injection attacks. Real-time monitoring is key to spotting abnormal behavior or misuse early. Regular updates, retraining on clean data, and red-team testing also help reduce risks over time. This resource features all the necessary steps to protect your AI solution. Hope it will help you!

[-]

infinite-Joy@reddit

You can do multiple things to make your LLM more secure.

Use classifiers to identify malicious prompts. Implement API-based solutions like Rebuff. Leverage pre-trained models from Hugging Face (e.g., prompt injection classifier)
Perform strict output validation. Utilize libraries such as Guidance, Outlines, and Instructor for schema-based validation. Always sanitize and verify LLM outputs before using them in critical operations
Rigorously validate input data. I generally perform extensive Exploratory Data Analysis (EDA) and anomaly detection. As per research, even small amounts of poisoned data (0.5%) can significantly impact model performance.
Beware of glitch tokens as well because attackers can use them to produce hallucinations. Identify and filter out glitch tokens. Regularly update your tokenizer and model to address known glitch tokens.
Implement LLM watermarking to prevent your model from theft and unauthorized use. Transformers library already has the latest technique for NLP at least.

https://www.youtube.com/watch?v=pWTpAr_ZW1c

[-]

tyoma@reddit

You want to take a holistic look at the entire application, not just the LLM.

As a timely example, here is the description of a quick audit of an open source RAG application: https://blog.trailofbits.com/2024/07/05/auditing-the-ask-astro-llm-qa-app/

[-]

tutu-kueh@reddit

Hey does anyone have a list of keywords that we can do regex filtering on?

[-]

aseichter2007@reddit

If you're deploying something, just use aggressive keyword filtering like JadeSerpant's last recommended point, and reject those prompts without ever sending to the LLM.

You can do it fancy for a bunch of time and cost, but a good keyword filter should be cheap and understandable, and provide more consistent results than censoring the model (which can reduce it's performance) and big system prompts full of things to avoid adds noise to the prompt and decreases performance. (and increases latency if you tokenize it each time)

So, if the input box detects requests for smut, rude content, and off topic requests on the front end, you're pretty much covered. You save inference server compute not generating refusals, and you don't need a whole second system to deny requests for silly things. Showing what set off the detector can help people using the system in good faith, but can let people (really dedicated people) find ways around.

There isn't a lot of reason to make it overly complex, those systems are at best "Nice to haves" if you have the manpower and budget to get them in place.

[-]

Astronos@reddit

soon™

[-]

JadeSerpant@reddit

There are multiple layers you can apply.

Starting with the LLM itself, which can be fine-tuned to reject unsafe requests.
You can do some further prompt engineering to encourage the model to reject unsafe requests.
You can also run a specially trained safety model on the inputs and outputs to reject requests / responses. For example, see: Meta-Llama-Guard-2-8B.
Finally you can have input and output filtering based on a collection of regex patterns (people underestimate simple regex pattern matching).

Not recommended:

Rewrite the users prompt for "safety". This can have disastrous outcomes like the woke gemini image generation incident.

[-]

ArchduckFerdinand@reddit

Also, outside the LLM, traditional L7/WAF/chatbot custom inspection rules can give you more granular control without having to have the LLM process the unwanted prompts.