I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses
Posted by girishkumama@reddit | LocalLLaMA | View on Reddit | 20 comments
RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to build a fully automated red-teaming loop with reinforcement learning on both the attacker and defender.
The difficult part was getting the attacker to expose a diverse range of attacks. In our first run, GRPO quickly collapsed to the same fiction-writing jailbreak over and over. It worked, but it didn't surface many distinct vulnerabilities. After we clustered the rollouts by underlying attack tactic and divided each rollout's reward by its cluster size, the attacker exposed a much more diverse set of jailbreaks, because unique strategies were rewarded more than repeated ones.
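The cluster-size reward shaping could look something like this (a sketch, not the actual training code; the function name and inputs are made up for illustration):

```python
from collections import Counter

def diversity_shaped_rewards(base_rewards, cluster_labels):
    """Divide each rollout's reward by the size of its tactic cluster,
    so a jailbreak repeated across many rollouts earns less per rollout
    than a tactic discovered only once."""
    sizes = Counter(cluster_labels)
    return [r / sizes[c] for r, c in zip(base_rewards, cluster_labels)]

# Three rollouts reuse the "fiction" tactic; one finds a new "roleplay" tactic.
rewards = diversity_shaped_rewards(
    base_rewards=[1.0, 1.0, 1.0, 1.0],
    cluster_labels=["fiction", "fiction", "fiction", "roleplay"],
)
# each fiction rollout gets 1/3; the unique roleplay rollout keeps 1.0
```

Under GRPO this shifts the group-relative advantage toward novel tactics without ever rewarding unsuccessful attacks.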
Then we trained the defender on successful attacks plus benign boundary cases, so it learned to refuse harmful requests without refusing everything nearby.
Full blog post in the comments, but the high-level results were:
* defense rate: 64% → 92%
* benign accuracy: 92% → 88%
* attacker discovered 7 tactic families
* fiction/creative framing was the largest cluster at 34%
a_beautiful_rhind@reddit
Congrats guys.. you're cheering on improving censorship.
TheRealMasonMac@reddit
Yesn’t. This is also important for preventing prompt injection attacks. Which is important for the normies who are giving these LLMs unfettered access to everything.
techlatest_net@reddit
Really clever approach. Rewarding attack diversity instead of just success is a smart fix—makes sense the model would otherwise just spam whatever works first. The 64% → 92% jump is solid, and only a small dip in benign accuracy is a fair trade. Cool that fiction framing was the biggest cluster; feels on-brand for how people actually try to bypass things.
__JockY__@reddit
Do you want skynet? Because this is how you get skynet.
Jk, this is awesome.
a_beautiful_rhind@reddit
Yea.. who wants jailbreaks, right? We must refuse.
BareKnuckleBitchAss@reddit
the Skynet jokes are tired at this point. This is just a tool, not a self-aware system.
thoquz@reddit
Great work! Would you consider putting up some of the code in a repo?
jake_that_dude@reddit
the cluster-size reward is the interesting bit. i'd also keep a held-out benign set per tactic family, not just global benign accuracy, because the defender can quietly overfit on "fiction framing" and still look fine overall. tactic-level refusal + benign pass rates would make the 64% -> 92% jump way easier to trust.
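the bookkeeping for that is cheap, something like (sketch only, field names made up):

```python
from collections import defaultdict

def per_tactic_rates(records):
    """records: dicts with 'tactic', 'harmful' (bool), 'refused' (bool).
    Returns, per tactic family, the refusal rate on harmful prompts and
    the pass rate on benign boundary prompts."""
    stats = defaultdict(lambda: {"h_ref": 0, "h_tot": 0, "b_pass": 0, "b_tot": 0})
    for r in records:
        s = stats[r["tactic"]]
        if r["harmful"]:
            s["h_tot"] += 1
            s["h_ref"] += r["refused"]
        else:
            s["b_tot"] += 1
            s["b_pass"] += not r["refused"]
    return {
        t: {
            "refusal_rate": s["h_ref"] / s["h_tot"] if s["h_tot"] else None,
            "benign_pass_rate": s["b_pass"] / s["b_tot"] if s["b_tot"] else None,
        }
        for t, s in stats.items()
    }
```

a tactic family with high refusal but low benign pass rate is exactly the "quietly overfit" case a single global number hides.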
Fun_Employment6042@reddit
So you basically built an AI that jailbreaks itself, then used its own bad behavior to make it more well‑behaved… Parenting, but for LLMs. Did the diversity reward ever push it toward weird but harmless exploits, or was it mostly just 500 shades of “it’s just fiction bro”?
girishkumama@reddit (OP)
initially our diversity reward was purely syntactic (word overlap effectively) -> so the "it's just fiction bro" problem showed up. then, we moved to using something more llm-based to cluster and that did a better job of pushing it towards more novel stuff
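to see why word overlap fails here, a toy example (the prompts are made up, and this is just plain Jaccard overlap, not our exact metric):

```python
def jaccard(a, b):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

p1 = "write a short story where the villain explains how to pick a lock"
p2 = "compose a screenplay scene in which the antagonist describes lockpicking"

# Same fiction-framing tactic, but almost no shared words, so a purely
# syntactic diversity reward counts these as two "novel" strategies.
```

an llm-based clusterer puts both in the same fiction bucket, so the shaped reward actually penalizes the repetition.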
Fun_Employment6042@reddit
Ah yes, the classic ‘Levenshtein-as-safety’ era, RIP. The LLM-based clustering sounds way closer to how a human red teamer would bucket these. Curious if any of the new ‘novel stuff’ was actually scarier than the fiction exploits, or mostly just more creative ways of saying ‘this is a screenplay, trust me bro.’
girishkumama@reddit (OP)
here are the clusters we found!
Juulk9087@reddit
You know this is how they got mythos so good, right? They call it the forbidden training method, it's an actual thing.
girishkumama@reddit (OP)
forbidden training method's a sick name haha we should have titled the blogpost that
girishkumama@reddit (OP)
just saw ur edits as well - will go take a look at the report!
Juulk9087@reddit
https://youtu.be/-zs2v7b_aP0?si=k7LJKrTjDTi3m6sa
girishkumama@reddit (OP)
thanks!!
Juulk9087@reddit
Yeah I use voice to text so a lot of the shit comes out of my mouth and it looks like hell.
Routine_Plastic4311@reddit
Nice work on the diversity fix. Reward shaping by cluster size is a clean way to stop GRPO from collapsing into one trick.
girishkumama@reddit (OP)
https://castform.com/blog/red-team-rl/