Our AI assistant keeps getting jailbroken and it’s becoming a security nightmare

Posted by Comfortable_Clue5430@reddit | LocalLLaMA | View on Reddit | 43 comments

We built an internal AI helper for our support team, and no matter how many guardrails we add, people keep finding ways to jailbreak it. Employees aren’t doing it maliciously, they’re just curious and want to see what happens, but suddenly the assistant is spitting out stuff it’s absolutely not supposed to.

We’ve tried regex filters, prompt-hardening, even manual review nothing sticks.

Feels like every week we patch one exploit and three more show up.

Anyone actually found a scalable way to test and secure an AI model before it goes public?