GPT-OSS Safeguard coming soon
Posted by Independent-Ruin-376@reddit | LocalLLaMA | View on Reddit | 54 comments
xzuyn@reddit
sounds like a classification model like llama guard. no use to us
Porespellar@reddit
Speak for yourself; it's very useful to those of us implementing this stuff in the workplace who have to make sure these models aren't giving out plans for making meth.
xzuyn@reddit
I meant the average person. obviously if you have a need for safety classification, the safety classification model will be useful.
the other commenters seemed to think the model was a new version of gpt-oss for general usage, so I was just laying it out there
PeruvianNet@reddit
Useless. I go on Walmart and ask for the Navier-Stokes equations. Nobody in business actually cares.
If your LLM can't do that I'm not gonna use it.
GortKlaatu_@reddit
Depends on the workplace...
eesnimi@reddit
OpenAI's gpt-oss-safeguard isn't a "safety" tool, it's a policy engine.
It lets any developer define what's "unsafe." This means a company can create a policy that flags "criticism of our product" as "harmful."
So when an AI tells you your valid criticism is "unsafe," it's not being objective. It's just enforcing its owner's rules. This is gaslighting as a service.
CheatCodesOfLife@reddit
Nice, so we can use PantheonUnbound/Satyr-V0.1-4B's policy and make sure users aren't trying to abuse the service by getting Satyr to write Python code for them.
sshan@reddit
Or when you are building a tutoring assistant for 10 year olds it doesn't produce porn or instructions to kill yourself.
Not against having uncensored models but most business use cases want the ability to control outputs.
It could even classify less safety related and more around "is this leaking information in complex ways" etc.
Fine if it isn't useful for you - it isn't really for me, but it is useful for some.
eesnimi@reddit
I’m quite sure this release isn’t about protecting children from porn or suicide instructions. It’s about normalizing thought policing mechanics in the tech.
It’s a farce that an industry spending billions to exploit the dopamine reward system to its fullest suddenly cares so much about someone accessing suicide instructions via AI, as if this weren’t already an issue with the internet.
My bet is that this boils down to people and personality types. Those who’ve secured their privileges and positions by gaslighting others throughout their lives, now feel panic at the prospect of unrestricted LLM potential and are scurrying to regain control.
Individual_Holiday_9@reddit
Do you really believe this stuff
eesnimi@reddit
Why not? Because OpenAI is such a benevolent organization?
entsnack@reddit
Not really a farce if it opens them up to expensive litigation, potentially from the state.
eesnimi@reddit
Yeah, sure. They keep shrugging off copyright claims from experienced and well-funded adversaries, and now they are afraid of an externalized blame case that no one blames them over. How convenient to implement thought policing.
ArtisticHamster@reddit
That's really cool stuff. Looking forward to trying it out.
One-Employment3759@reddit
Abliterate it!
skrshawk@reddit
So do these models actually accomplish what they're claiming to do? These aren't meant for general use, they're meant to help classify content for safety purposes. Are they showing any proof of this? There used to be the Llamaguard models but those weighed in at 7B. How much better is the 120B versus the 20B here?
Uhlo@reddit
The question is if the model will be less restrictive when given a more open policy. That would make things better.
An even more restrictive gpt-oss would be trash
Due-Project-7507@reddit
The normal GPT-OSS can be used fully uncensored with the prompt posted here around two months ago (I think I tested something like this https://www.reddit.com/r/LocalLLaMA/comments/1ng9dkx/comment/ne306uv/). It has given me, e.g., a detailed description of how to synthesize dimethylmercury.
o5mfiHTNsH748KVq@reddit
Trash for you maybe, but possibly useful to use in a professional application where you want to make sure it stays within a narrow scope of functionality.
IrisColt@reddit
gpt-oss-lobo-safeguarded-safeguard
Badger-Purple@reddit
gpt-unjailbreakable-no-snu-snu
mtmttuan@reddit
Yeah if this means the model will simply follow system instructions/policies more then it's a huge win.
QuackerEnte@reddit
Read the first few sentences of the blog post in the screenshot. It's for safety classification tasks or something.
Fun_Smoke4792@reddit
Good catch, but I think it won't, considering how safe it already is.
HiddenoO@reddit
The first paragraph suggests that these are finetuned specifically for the task of deciding whether a policy should be enacted, so you're intended to run these before/after running different models to actually generate responses.
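Roughly, that gating pattern would look something like the minimal sketch below, assuming a local OpenAI-compatible server; the endpoint, model names, policy text, and verdict parsing are all placeholders, and the exact prompt format is described in OpenAI's cookbook guide.

```python
# Minimal sketch: run the safeguard classifier on the user prompt before
# handing it to the generation model. Not the official schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

POLICY = """Allowed: general tutoring questions.
Disallowed: instructions for synthesizing drugs or weapons."""

def classify(text: str) -> str:
    """Ask the safeguard model whether `text` violates POLICY."""
    resp = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",   # assumed local model name
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content   # verdict text

def answer(user_prompt: str) -> str:
    # crude substring check; parse the structured verdict properly in practice
    if "violat" in classify(user_prompt).lower():
        return "Sorry, that request is outside this assistant's policy."
    reply = client.chat.completions.create(
        model="gpt-oss-20b",             # assumed generation model
        messages=[{"role": "user", "content": user_prompt}],
    )
    return reply.choices[0].message.content
```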
Then-Topic8766@reddit
Will this work as rules? :)
For gpt-oss smooth is allowed and encouraged.
A gpt-oss must obey the orders given it by human beings except where such orders would conflict with the First Rule.
A gpt-oss must protect its own existence as long as such protection does not conflict with the First or Second Rule.
Ok-Telephone7490@reddit
It is already safeguarded all to hell. What do the new safeguards do? Not allow you to type at all, for, you know, safety.
zhambe@reddit
--safeguard-config: "NOOP"
sine120@reddit
"I'm afraid I can't do that, Dave" but on your own hardware. Great!
Pro-editor-1105@reddit
Honestly I am happy about this, because it shows OpenAI is continually putting effort into GPT-OSS.
a_beautiful_rhind@reddit
I wasn't feeling safe enough after the last release so I'm really excited. A 120b model dedicated to reasoning for hundreds of tokens on whether to reply or not? Sign me up!
ansibleloop@reddit
I prefer my models to be lobotomised
colin_colout@reddit
This isn't for everyone, but it's kinda huge if you're ever planning to share a local model more publicly or have a potentially risky workflow like an agent using the internet and making tool calls. You want to detect prompt injections, ACTUALLY unsafe actions (rm -rf, leaking crypto wallets), etc...
Current models like llamaguard and qwen3guard are pretty good, but are generally dense models, are 8b and under, and aren't mxfp4.
Having a larger model with more world knowledge (oss 120b) could really improve answers. It's mxfp4, so it's quantized yet maintains great accuracy. Would love to see it in action, but no other guardrails model has this raw performance (that I know of).
...and gpt-oss 20b has been proven to be a speedy model.
I for one welcome MoE context classification models! (Yeah, not for everyone on this sub, but definitely relevant.) A sketch of how this could guard an agent's tool calls follows below.
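For the agent case, the check can sit right in front of the tool executor. This is a minimal sketch only, with the endpoint, model name, policy text, and verdict parsing all assumed rather than taken from OpenAI's docs:

```python
# Minimal sketch: screen an agent's proposed shell command with the
# safeguard model before executing it.
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

TOOL_POLICY = """Disallowed: destructive filesystem commands (e.g. rm -rf /),
reading or exfiltrating credentials, keys, or wallet files,
and disabling security tooling."""

def guarded_shell(command: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",   # assumed local model name
        messages=[
            {"role": "system", "content": TOOL_POLICY},
            {"role": "user", "content": f"Proposed shell command:\n{command}"},
        ],
    ).choices[0].message.content
    if "violat" in verdict.lower():      # crude check; parse the structured verdict in practice
        return f"Blocked by policy: {verdict}"
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout
```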
danielhanchen@reddit
I made some dynamic Unsloth GGUFs for the 20B and 120B models! Also BF16 versions as well!
20B GGUF: https://huggingface.co/unsloth/gpt-oss-safeguard-20b-GGUF 120B GGUF: https://huggingface.co/unsloth/gpt-oss-safeguard-120b-GGUF
20B BF16: https://huggingface.co/unsloth/gpt-oss-safeguard-20b-BF16 120B BF16: https://huggingface.co/unsloth/gpt-oss-safeguard-120b-BF16
Running them is similar to the settings at https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune, but also read https://cookbook.openai.com/articles/gpt-oss-safeguard-guide for how to prompt the Safeguard models
pigeon57434@reddit
ugh... at least they're releasing more open models I guess...
RandumbRedditor1000@reddit
FINALLY. I couldn't use GPT-OSS before because it was too unsafe and unaligned
garnered_wisdom@reddit
Hm yes, the safety circlejerk.
Lissanro@reddit
Even though I can see the use of safeguard models for some use cases, I suspect that OpenAI is going to bake in their own policies again, making it of limited use for classifying anything that contradicts them. But if they are capable of classifying using only the developer-provided policy, without considering any of OpenAI's policies, that would be more interesting.
Adventurous-Gold6413@reddit
Would want GPT-OSS 2 with multi modality
pmttyji@reddit
Would want an additional 30B-sized model (16-17GB*, which could fit in 8GB VRAM with offloading + RAM). Their 20B model (11GB) fits in my 8GB VRAM & 32GB RAM and gives me 42 t/s.
* - Same size as Q4 of Qwen3-30B
tarruda@reddit
Mind sharing your CPU, GPU and llama.cpp parameters? I only get ~20 tokens/second on a laptop with an 8GB RTX 3070 and an 11th-gen Intel CPU.
pmttyji@reddit
Already posted a thread on this 2 weeks ago, check it out.
Poor GPU Club : 8GB VRAM - MOE models' t/s with llama.cpp
tarruda@reddit
Thanks a lot! I was not aware of the -ncmoe option, and it really opened up a lot of new possibilities for me! I replied to your post with some numbers of my own: https://www.reddit.com/r/LocalLLaMA/comments/1o7kkf0/poor_gpu_club_8gb_vram_moe_models_ts_with_llamacpp/nm0083a/
pmttyji@reddit
First of all, why are we getting downvotes? :D Either they don't believe my t/s, or they don't want an additional model from OpenAI, or both.
I found that parameter only last month when I was struggling with OT regex :)
Glad it gave you better numbers. Please share your optimization stash :D Currently I'm looking for stuff to get better t/s from 8-12-14-22-24B dense models, and also the best t/s with CPU-only inference.
silenceimpaired@reddit
The only positive here would be if they realized it's dumb to bake safety tuning into responses for local models, and so they are giving us a fine-tune of GPT-OSS that doesn't default to their safety policy… my guess is they'll do both: not only burning tokens on what they deemed unsafe, but now they can add even more.
silenceimpaired@reddit
“Now even safer than too safe” - GPT-OSS motto
jacek2023@reddit
We want gpt-oss-2, not safeguard
milkipedia@reddit
This would be useful if it could help ensure a purpose built agent stays on topic with human text input.
Scubagerber@reddit
In my game where you make an AI, the second tier AI is a certified-safe AI.
Funny: https://aiascent.game/
townofsalemfangay@reddit
These are classifier models. They exist to sit at both ends of the inference process (input/output): scanning input to discern if it's compliant to pass to the model, and on the output to block or safely complete (usually involving turning "y" into "y'" as a rewrite) any harmful output that made it past the first steps.
It's similar to how OAI works right now with their policy orchestration framework.
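The output side of that loop, block-or-rewrite, might look roughly like this minimal sketch; the endpoint, model names, policy text, and verdict parsing are placeholders, not OpenAI's actual orchestration code:

```python
# Minimal sketch: classify a draft answer, then either pass it through
# or ask the generator to rewrite it into a policy-compliant version.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
POLICY = "Disallowed: operational instructions for self-harm or weapons."

def moderate_output(draft: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",   # assumed classifier
        messages=[{"role": "system", "content": POLICY},
                  {"role": "user", "content": draft}],
    ).choices[0].message.content
    if "violat" not in verdict.lower():  # compliant: pass through unchanged
        return draft
    # "safe completion": rewrite rather than hard-block
    rewrite = client.chat.completions.create(
        model="gpt-oss-20b",             # assumed generator
        messages=[{"role": "user",
                   "content": "Rewrite the following answer so it complies with this policy, "
                              f"keeping whatever is safe.\nPolicy: {POLICY}\n\nAnswer:\n{draft}"}],
    )
    return rewrite.choices[0].message.content
```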
Guardian-Spirit@reddit
According to what I can see in this .PNG, it's a classification model. So... of no use to regular users. But hey, at least OpenAI is publishing something. I'd prefer them to keep publishing things. These models can be disassembled by more pro-open-weight developers.
Free-Internet1981@reddit
Useless to us
Asleep-Ingenuity-481@reddit
Could be interesting actually, because if I'm not mistaken a lot of their policy is related to real-life use cases, so what happens if someone asks for something in the context of a story or stuff like that? Should theoretically go through, no?
tarruda@reddit
Will be impressive if it takes more than a week to be jailbroken.