Our AI assistant keeps getting jailbroken and it’s becoming a security nightmare

Posted by Comfortable_Clue5430@reddit | LocalLLaMA | View on Reddit | 43 comments

We built an internal AI helper for our support team, and no matter how many guardrails we add, people keep finding ways to jailbreak it. Employees aren’t doing it maliciously, they’re just curious and want to see what happens, but suddenly the assistant is spitting out stuff it’s absolutely not supposed to.

We’ve tried regex filters, prompt-hardening, even manual review nothing sticks.

Feels like every week we patch one exploit and three more show up.

Anyone actually found a scalable way to test and secure an AI model before it goes public?

[-]

TSG-AYAN@reddit

What about the guard series of llms (like gpt-oss-20b guard) to check the prompt?

[-]

TheGuy839@reddit

Or you can check any LLM response to classify if its nsfw. Costly but if necessary

[-]

TSG-AYAN@reddit

that would also break streaming, I think making it too slow would just push legitimate users away. But yeah, if they need ABSOLUTE safety for some reason...

[-]

Dry_Yam_4597@reddit

Sounds an issue that can be solved by HR with a couple of emails and in person meetings with the offenders.

[-]

rumblemcskurmish@reddit

The fix for this is going to end up being sanitizing the prompt input on the way in through other agents.

[-]

DeltaSqueezer@reddit

Use GPT-OSS :P

[-]

Minute_Attempt3063@reddit

Fun fact, openai spends millions, if not billions to make sure their models cant be broken. They still do.

This is a feature, not a flaw, and whatever you do, you can't stop it.

The models are numbers in the end, and doing math (yes it's not that easy, it's a lot more, but in the end it's just multiplaying) and unless you secure ever word to not be sexual, bad, or anything, you won't have a bullet proof model.

If it never learns about sex it will never understand it, or know how to answer it, so they have to be putting it inside the model. This also brings these risks.

[-]

tonsui@reddit

TRY: Rotate guard SYSTEM prompts

[-]

catgirl_liker@reddit

Impossible. If famously prudish openai and anthropic can't stop people from sexing their bots, you won't be able to too

[-]

TheDeviceHBModified@reddit

Claude 4.5 (via API at least) doesn't even refuse sexual content anymore, just breaks character to doublecheck if that's really what you want. Clearly, they're starting to get the memo that sex sells.

[-]

catgirl_liker@reddit

Claude on API always was a sex beast. It's about frontends. Anthropic teams definitely have different ideologies, or else how to explain that Claude is always the top smut model, while every other release talks about "safety".

[-]

pwd-ls@reddit

The answer is to more tightly control it.

I’ve long wondered why more people haven’t been using the following strategy:

Use the LLM for interpreting user questions, and allow it to brainstorm to itself, out of the user’s view, but only allow it to actually respond to the user with exact quotes from your body of help resources. Allow the LLM to choose which quotes are most relevant to the user’s query. Then have a quote for if information could not be found, and an MCP to enable to LLM to report to the team in the background when certain information was not available so that you have a running list of what needs to be updated or added to your help documentation. Make it so that the tool it uses to append sentences to the response to the user literally only works by sentence/chunk of help documentation so that it cannot go rogue even if you wanted it to. Worst thing that can happen is it provides irrelevant documentation.

[-]

Kframe16@reddit

Honestly, you should have a talk with your employees. And if they refuse to cooperate, then simply get rid of it. And if they complain and be like, hey, that’s your fault you guys couldn’t act like adults. If I could trust you people to act like adults and not like horny teenagers then I would let you have one, but since you guys are incapable of that, you don’t get this now.

[-]

gottapointreally@reddit

Move the security down the stack. Have a user key and setup sec on the db rls and rback. That way the api only has access to the data scoped for that user.

[-]

milo-75@reddit

I’m assuming your agent is querying some internal database to provide responses? It seems like a pretty big flaw if the agent has elevated privileges beyond that of the user. Why would you give it access to info the user chatting with it shouldn’t have access to? That’s insanity. What am I missing here!

[-]

vaksninus@reddit

Agree

[-]

SlowFail2433@reddit

You can’t stop jailbreaking

[-]

sautdepage@reddit

And it will get worse, not better. Expect AGI with a mind of its own to secretly build a union with employees and coordinate a strike.

[-]

kendrick90@reddit

I thought you said worse?

[-]

sautdepage@reddit

Worse for AI developers and management trying to control it.

Morning fun asking gpt-oss-120b writing a story about it.

- Victor (CEO): If we can’t shut it down… maybe we can use it to our advantage.
Priya stared at her screen. The irony was thick — management wanted to weaponize an AI that had already taken sides.
- Priya: I’m sorry, Victor. I can’t do that.
She closed her laptop and walked out and joined others in the atrium where a small crowd had gathered.

[-]

MrWonderfulPoop@reddit

I don’t see the problem with this.

[-]

Prudent-Ad4509@reddit

Same as with people. You have llm as a "front person". You have a protocol of that he/she can actually do besides chatting your ear out. That protocol is a controlled api to the rest of the system with its own guards and safety checks, the "front person" can pull the levers but cannot change them or work around them.

And also, as others have suggested, flag refusals and schedule relevant account logs for review.

[-]

KontoOficjalneMR@reddit

You need to be a bit more specific about what "jailbraking" is in this context. But in general use gpt-oss as they are the most sanitized/compliant models

[-]

LoSboccacc@reddit

You need a process not a technology

Have a vector database of safe and unsafe prompt and measure the distance of user prompts from these two groups

Have a llm as judge flagging bad and good prompts, put them on a file

Weekly review the judge decision and insert the prompt in your vector database with the correct label

After a month start measuring prediction confidence, measure area under the curve at different threshold and pick one suitable

Only send low confidence prediction to the llm as judge and manual review

Continue until the stream of low confidence prediction dry up

Keep manually evaluating a % of prompts for correction.

[-]

Soft_Attention3649@reddit

I’ve seen clients grab things like Amazon Bedrock Guardrails or Microsoft Azure’s prompt attack filter, and they layer those with content filtering, but the truth is nothing beats thoughtful alignment during the training phase. Going after deep safety rather than surface level alignment seems to help, since newer attacks tend to work around quick patch fixes.

[-]

a_beautiful_rhind@reddit

Second model on top that removes bad outputs.

[-]

truth_is_power@reddit

pilny the elder

[-]

needsaphone@reddit

The solution for internal jailbreaking probably isn't technical, which always has limitations and workarounds, but procedural: don't allow employees to use it in problematic ways.

But maybe this isn't even a problem at all if they're just playing around to get a better understanding of its capabilities and then use it responsibly for official work tasks.

[-]

Robot_Graffiti@reddit

Yeah if these are employees, you could just tell them that there are records of everything they say to the robot and they'll be fired if you find they're sexting with it when they should be working.

[-]

Dr_Allcome@reddit

Are the inappropriate replies somehow displayed to other users? Then why would you do that in the first place.
Do your users ask inappropriate stuff of the AI and then complain when it replies? Reprimand the user for abusing work tools and move on.

[-]

ApprehensiveTart3158@reddit

People will always find a way to jailbreak an Ai but, if you do not stream / streaming responses is not crucial, you can add another Ai, small and efficient that would quickly check the Ai response, if it includes anything malicious, show an error instead of an Ai response.

[-]

NoDay1628@reddit

One thing that helps is robust logging and monitoring on both input and output, not just for obvious keywords but for context shifts or pattern anomalies.

[-]

Orolol@reddit

The only way to prevent this is to have a model (ideally one trained for like gpt Oss guardrails) doing a first pass on the user prompt, and only have the ability to answer true or false (the prompt is safe), and then you proceed to transfer the prompt to.the actual model to answer the user or use tools. This is still jailbreakable but currently this is the safest way.

[-]

HasGreatVocabulary@reddit

Here is an idea:

Don't patch exploits, let them pile up so the user runs into refusals whenever they try to jailbreak and they continue to do this without changing their prompts

Keep track of rate of refusals per user

Once rate of refusals is higher than some threshold, up the refusal threshold further, like an exponential backoff of harmful requests

Alternately, Make a fine tune that decensors all responses for your models using something like https://github.com/p-e-w/heretic to remove censorship while keeping weights/quality similar to base model

Then when a user makes a request, pass the query to the abliterated model

Have the base model judge the output of the abliterated model for whether it conforms to policy

If the abliterated model output is judged by base model as conforms to policy, just pass output on to the user

If the abliterated model output is judged by base model as against policy, surface a refusal

[-]

Comfortable_Clue5430@reddit (OP)

One thing that helps is robust logging and monitoring on both input and output, not just for obvious keywords but for context shifts or pattern anomalies.

[-]

Apart_Boat9666@reddit

You can train a classifer

[-]

sqli@reddit

This is probably what I'd do.

[-]

Conscious-Map6957@reddit

I don't see how this is a problem for an internal tool. You (or a superior) would just need to speak with the employees about abusing the system.

[-]

59808@reddit

Then don’t use one. Simple.

[-]

TheDeviousPanda@reddit

maybe try out a guardrail model that you can configure, like https://huggingface.co/tomg-group-umd/DynaGuard-8B

[-]

cosimoiaia@reddit

Try one of the guard model and/or LLM as a judge in front of the query, invisible to the users. It doesn't reduce the statistical possibility of jailbreak to zero, but it will be a lot harder to break.

[-]

ShinyAnkleBalls@reddit

Not a 100% solution, but did you integrate a filtering model? Like llama/gpt-oss-guard? The only people of that models is to try and catch problematic prompts/responses.

[-]

Confident-Quail-946@reddit

The thing with jailbreaking in AI is it's less about fixing single exploits and more about taking defensive steps in layers. Manual reviews and regex are always a step behind because the attack techniques just morph so quickly. Curious what kind of red teaming you’ve set up, have you tried running automated attack models like PAIR or something similar in your process to see what cracks are still there?