Our AI assistant keeps getting jailbroken and it’s becoming a security nightmare
Posted by Comfortable_Clue5430@reddit | LocalLLaMA | View on Reddit | 43 comments
We built an internal AI helper for our support team, and no matter how many guardrails we add, people keep finding ways to jailbreak it. Employees aren’t doing it maliciously, they’re just curious and want to see what happens, but suddenly the assistant is spitting out stuff it’s absolutely not supposed to.
We’ve tried regex filters, prompt-hardening, even manual review nothing sticks.
Feels like every week we patch one exploit and three more show up.
Anyone actually found a scalable way to test and secure an AI model before it goes public?
TSG-AYAN@reddit
What about the guard series of llms (like gpt-oss-20b guard) to check the prompt?
TheGuy839@reddit
Or you can check any LLM response to classify if its nsfw. Costly but if necessary
TSG-AYAN@reddit
that would also break streaming, I think making it too slow would just push legitimate users away. But yeah, if they need ABSOLUTE safety for some reason...
Dry_Yam_4597@reddit
Sounds an issue that can be solved by HR with a couple of emails and in person meetings with the offenders.
rumblemcskurmish@reddit
The fix for this is going to end up being sanitizing the prompt input on the way in through other agents.
DeltaSqueezer@reddit
Use GPT-OSS :P
Minute_Attempt3063@reddit
Fun fact, openai spends millions, if not billions to make sure their models cant be broken. They still do.
This is a feature, not a flaw, and whatever you do, you can't stop it.
The models are numbers in the end, and doing math (yes it's not that easy, it's a lot more, but in the end it's just multiplaying) and unless you secure ever word to not be sexual, bad, or anything, you won't have a bullet proof model.
If it never learns about sex it will never understand it, or know how to answer it, so they have to be putting it inside the model. This also brings these risks.
tonsui@reddit
TRY: Rotate guard SYSTEM prompts
catgirl_liker@reddit
Impossible. If famously prudish openai and anthropic can't stop people from sexing their bots, you won't be able to too
TheDeviceHBModified@reddit
Claude 4.5 (via API at least) doesn't even refuse sexual content anymore, just breaks character to doublecheck if that's really what you want. Clearly, they're starting to get the memo that sex sells.
catgirl_liker@reddit
Claude on API always was a sex beast. It's about frontends. Anthropic teams definitely have different ideologies, or else how to explain that Claude is always the top smut model, while every other release talks about "safety".
pwd-ls@reddit
The answer is to more tightly control it.
I’ve long wondered why more people haven’t been using the following strategy:
Use the LLM for interpreting user questions, and allow it to brainstorm to itself, out of the user’s view, but only allow it to actually respond to the user with exact quotes from your body of help resources. Allow the LLM to choose which quotes are most relevant to the user’s query. Then have a quote for if information could not be found, and an MCP to enable to LLM to report to the team in the background when certain information was not available so that you have a running list of what needs to be updated or added to your help documentation. Make it so that the tool it uses to append sentences to the response to the user literally only works by sentence/chunk of help documentation so that it cannot go rogue even if you wanted it to. Worst thing that can happen is it provides irrelevant documentation.
Kframe16@reddit
Honestly, you should have a talk with your employees. And if they refuse to cooperate, then simply get rid of it. And if they complain and be like, hey, that’s your fault you guys couldn’t act like adults. If I could trust you people to act like adults and not like horny teenagers then I would let you have one, but since you guys are incapable of that, you don’t get this now.
gottapointreally@reddit
Move the security down the stack. Have a user key and setup sec on the db rls and rback. That way the api only has access to the data scoped for that user.
milo-75@reddit
I’m assuming your agent is querying some internal database to provide responses? It seems like a pretty big flaw if the agent has elevated privileges beyond that of the user. Why would you give it access to info the user chatting with it shouldn’t have access to? That’s insanity. What am I missing here!
vaksninus@reddit
Agree
SlowFail2433@reddit
You can’t stop jailbreaking
sautdepage@reddit
And it will get worse, not better. Expect AGI with a mind of its own to secretly build a union with employees and coordinate a strike.
kendrick90@reddit
I thought you said worse?
sautdepage@reddit
Worse for AI developers and management trying to control it.
Morning fun asking gpt-oss-120b writing a story about it.
MrWonderfulPoop@reddit
I don’t see the problem with this.
Prudent-Ad4509@reddit
Same as with people. You have llm as a "front person". You have a protocol of that he/she can actually do besides chatting your ear out. That protocol is a controlled api to the rest of the system with its own guards and safety checks, the "front person" can pull the levers but cannot change them or work around them.
And also, as others have suggested, flag refusals and schedule relevant account logs for review.
KontoOficjalneMR@reddit
You need to be a bit more specific about what "jailbraking" is in this context. But in general use gpt-oss as they are the most sanitized/compliant models
LoSboccacc@reddit
You need a process not a technology
Have a vector database of safe and unsafe prompt and measure the distance of user prompts from these two groups
Have a llm as judge flagging bad and good prompts, put them on a file
Weekly review the judge decision and insert the prompt in your vector database with the correct label
After a month start measuring prediction confidence, measure area under the curve at different threshold and pick one suitable
Only send low confidence prediction to the llm as judge and manual review
Continue until the stream of low confidence prediction dry up
Keep manually evaluating a % of prompts for correction.
Soft_Attention3649@reddit
I’ve seen clients grab things like Amazon Bedrock Guardrails or Microsoft Azure’s prompt attack filter, and they layer those with content filtering, but the truth is nothing beats thoughtful alignment during the training phase. Going after deep safety rather than surface level alignment seems to help, since newer attacks tend to work around quick patch fixes.
a_beautiful_rhind@reddit
Second model on top that removes bad outputs.
truth_is_power@reddit
pilny the elder
needsaphone@reddit
The solution for internal jailbreaking probably isn't technical, which always has limitations and workarounds, but procedural: don't allow employees to use it in problematic ways.
But maybe this isn't even a problem at all if they're just playing around to get a better understanding of its capabilities and then use it responsibly for official work tasks.
Robot_Graffiti@reddit
Yeah if these are employees, you could just tell them that there are records of everything they say to the robot and they'll be fired if you find they're sexting with it when they should be working.
Dr_Allcome@reddit
Are the inappropriate replies somehow displayed to other users? Then why would you do that in the first place.
Do your users ask inappropriate stuff of the AI and then complain when it replies? Reprimand the user for abusing work tools and move on.
ApprehensiveTart3158@reddit
People will always find a way to jailbreak an Ai but, if you do not stream / streaming responses is not crucial, you can add another Ai, small and efficient that would quickly check the Ai response, if it includes anything malicious, show an error instead of an Ai response.
NoDay1628@reddit
One thing that helps is robust logging and monitoring on both input and output, not just for obvious keywords but for context shifts or pattern anomalies.
Orolol@reddit
The only way to prevent this is to have a model (ideally one trained for like gpt Oss guardrails) doing a first pass on the user prompt, and only have the ability to answer true or false (the prompt is safe), and then you proceed to transfer the prompt to.the actual model to answer the user or use tools. This is still jailbreakable but currently this is the safest way.
HasGreatVocabulary@reddit
Here is an idea:
Don't patch exploits, let them pile up so the user runs into refusals whenever they try to jailbreak and they continue to do this without changing their prompts
Keep track of rate of refusals per user
Once rate of refusals is higher than some threshold, up the refusal threshold further, like an exponential backoff of harmful requests
Alternately, Make a fine tune that decensors all responses for your models using something like https://github.com/p-e-w/heretic to remove censorship while keeping weights/quality similar to base model
Then when a user makes a request, pass the query to the abliterated model
Have the base model judge the output of the abliterated model for whether it conforms to policy
If the abliterated model output is judged by base model as conforms to policy, just pass output on to the user
If the abliterated model output is judged by base model as against policy, surface a refusal
Comfortable_Clue5430@reddit (OP)
One thing that helps is robust logging and monitoring on both input and output, not just for obvious keywords but for context shifts or pattern anomalies.
Apart_Boat9666@reddit
You can train a classifer
sqli@reddit
This is probably what I'd do.
Conscious-Map6957@reddit
I don't see how this is a problem for an internal tool. You (or a superior) would just need to speak with the employees about abusing the system.
59808@reddit
Then don’t use one. Simple.
TheDeviousPanda@reddit
maybe try out a guardrail model that you can configure, like https://huggingface.co/tomg-group-umd/DynaGuard-8B
cosimoiaia@reddit
Try one of the guard model and/or LLM as a judge in front of the query, invisible to the users. It doesn't reduce the statistical possibility of jailbreak to zero, but it will be a lot harder to break.
ShinyAnkleBalls@reddit
Not a 100% solution, but did you integrate a filtering model? Like llama/gpt-oss-guard? The only people of that models is to try and catch problematic prompts/responses.
Confident-Quail-946@reddit
The thing with jailbreaking in AI is it's less about fixing single exploits and more about taking defensive steps in layers. Manual reviews and regex are always a step behind because the attack techniques just morph so quickly. Curious what kind of red teaming you’ve set up, have you tried running automated attack models like PAIR or something similar in your process to see what cracks are still there?