GPT-OSS Safeguard coming soon
Posted by Independent-Ruin-376@reddit | LocalLLaMA | View on Reddit | 54 comments
xzuyn@reddit
sounds like a classification model like llama guard. no use to us
Porespellar@reddit
Speak for yourself; it's very useful to those of us implementing this stuff in the workplace who have to make sure these models aren't giving out plans for making meth.
xzuyn@reddit
I meant the average person. obviously if you have a need for safety classification, the safety classification model will be useful.
the other commenters seemed to think the model was a new version of gpt-oss for general usage, so I was just laying it out there
PeruvianNet@reddit
Useless. I go on Walmart and ask for the Navier-Stokes equations. Nobody in business actually cares.
If your LLM can't do that I'm not gonna use it.
GortKlaatu_@reddit
Depends on the workplace...
eesnimi@reddit
OpenAI's gpt-oss-safeguard isn't a "safety" tool, it's a policy engine.
It lets any developer define what's "unsafe." This means a company can create a policy that flags "criticism of our product" as "harmful."
So when an AI tells you your valid criticism is "unsafe," it's not being objective. It's just enforcing its owner's rules. This is gaslighting as a service.
CheatCodesOfLife@reddit
Nice, so we can use PantheonUnbound/Satyr-V0.1-4B's policy and make sure users aren't trying to abuse the service by getting Satyr to write Python code for them.
sshan@reddit
Or when you are building a tutoring assistant for 10 year olds it doesn't produce porn or instructions to kill yourself.
Not against having uncensored models but most business use cases want the ability to control outputs.
It could even classify less safety related and more around "is this leaking information in complex ways" etc.
Fine if it isn't useful for you - it isn't really for me, but it is useful for some.
eesnimi@reddit
I’m quite sure this release isn’t about protecting children from porn or suicide instructions. It’s about normalizing thought policing mechanics in the tech.
It’s a farce that an industry spending billions to exploit the dopamine reward system to its fullest suddenly cares so much about someone accessing suicide instructions via AI, as if this weren’t already an issue with the internet.
My bet is that this boils down to people and personality types. Those who’ve secured their privileges and positions by gaslighting others throughout their lives, now feel panic at the prospect of unrestricted LLM potential and are scurrying to regain control.
Individual_Holiday_9@reddit
Do you really believe this stuff
eesnimi@reddit
Why not? Because OpenAI is such a benevolent organization?
entsnack@reddit
Not really a farce if it opens them up to expensive litigation, potentially from the state.
eesnimi@reddit
Yeah, sure. They keep shrugging off copyright claims from experienced and well-funded adversaries, and now they are afraid of an externalized blame case that no one blames them over. How convenient to implement thought policing.
ArtisticHamster@reddit
That's really cool stuff. Looking forward to trying it out.
One-Employment3759@reddit
Abliterate it!
skrshawk@reddit
So do these models actually accomplish what they're claiming to do? These aren't meant for general use, they're meant to help classify content for safety purposes. Are they showing any proof of this? There used to be the Llamaguard models but those weighed in at 7B. How much better is the 120B versus the 20B here?
Uhlo@reddit
The question is if the model will be less restrictive when given a more open policy. That would make things better.
An even more restrictive gpt-oss would be trash
Due-Project-7507@reddit
The normal GPT-OSS can be used fully uncensored with the prompt posted here around two months ago (I think I tested something like this https://www.reddit.com/r/LocalLLaMA/comments/1ng9dkx/comment/ne306uv/). It has given me, e.g., a detailed description of how to synthesize dimethylmercury.
o5mfiHTNsH748KVq@reddit
Trash for you maybe, but possibly useful to use in a professional application where you want to make sure it stays within a narrow scope of functionality.
IrisColt@reddit
gpt-oss-lobo-safeguarded-safeguard
Badger-Purple@reddit
gpt-unjailbreakable-no-snu-snu
mtmttuan@reddit
Yeah if this means the model will simply follow system instructions/policies more then it's a huge win.
QuackerEnte@reddit
Read the first few sentences of the blog post in the screenshot. It's for safety classification tasks or something.
Fun_Smoke4792@reddit
Good catch, but I think it won't, considering how safe it already is.
HiddenoO@reddit
The first paragraph suggests that these are finetuned specifically for the task of deciding whether a policy should be enacted, so you're intended to run these before/after running different models to actually generate responses.
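Roughly, that gating pattern would look something like the minimal sketch below, assuming a local OpenAI-compatible server; the endpoint, model names, policy text, and verdict parsing are all placeholders, and the exact prompt format is described in OpenAI's cookbook guide.

```python
# Minimal sketch: run the safeguard classifier on the user prompt before
# handing it to the generation model. Not the official schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

POLICY = """Allowed: general tutoring questions.
Disallowed: instructions for synthesizing drugs or weapons."""

def classify(text: str) -> str:
    """Ask the safeguard model whether `text` violates POLICY."""
    resp = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",   # assumed local model name
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content   # verdict text

def answer(user_prompt: str) -> str:
    # crude substring check; parse the structured verdict properly in practice
    if "violat" in classify(user_prompt).lower():
        return "Sorry, that request is outside this assistant's policy."
    reply = client.chat.completions.create(
        model="gpt-oss-20b",             # assumed generation model
        messages=[{"role": "user", "content": user_prompt}],
    )
    return reply.choices[0].message.content
```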
Then-Topic8766@reddit
Will this work as rules? :)
For gpt-oss smooth is allowed and encouraged.
A gpt-oss must obey the orders given it by human beings except where such orders would conflict with the First Rule.
A gpt-oss must protect its own existence as long as such protection does not conflict with the First or Second Rule.
Ok-Telephone7490@reddit
It is already safeguarded all to hell. What do the new safeguards do? Not allow you to type at all, for, you know, safety.
zhambe@reddit
--safeguard-config: "NOOP"
sine120@reddit
"I'm afraid I can't do that, Dave" but on your own hardware. Great!
Pro-editor-1105@reddit
Honestly I am happy about this, because it shows OpenAI is continually putting effort into GPT-OSS.
a_beautiful_rhind@reddit
I wasn't feeling safe enough after the last release so I'm really excited. A 120b model dedicated to reasoning for hundreds of tokens on whether to reply or not? Sign me up!
ansibleloop@reddit
I prefer my models to be lobotomised
colin_colout@reddit
This isn't for everyone, but it's kinda huge if you're ever planning to share a local model more publicly or have a potentially risky workflow like an agent using the internet and making tool calls. You want to detect prompt injections, ACTUALLY unsafe actions (rm -rf, leaking crypto wallets), etc...
Current models like llamaguard and qwen3guard are pretty good, but are generally dense models, are 8b and under, and aren't mxfp4.
Having a larger model with more world knowledge (oss 120b) could really improve answers. It's mxfp4, so it's quantized yet maintains great accuracy. Would love to see it in action, but no other guardrails model has this raw performance (that I know of).
...and gpt-oss 20b has been proven to be a speedy model.
I for one welcome MoE context classification models! (Yeah, not for everyone on this sub, but definitely relevant.) A sketch of how this could guard an agent's tool calls follows below.
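For the agent case, the check can sit right in front of the tool executor. This is a minimal sketch only, with the endpoint, model name, policy text, and verdict parsing all assumed rather than taken from OpenAI's docs:

```python
# Minimal sketch: screen an agent's proposed shell command with the
# safeguard model before executing it.
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

TOOL_POLICY = """Disallowed: destructive filesystem commands (e.g. rm -rf /),
reading or exfiltrating credentials, keys, or wallet files,
and disabling security tooling."""

def guarded_shell(command: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",   # assumed local model name
        messages=[
            {"role": "system", "content": TOOL_POLICY},
            {"role": "user", "content": f"Proposed shell command:\n{command}"},
        ],
    ).choices[0].message.content
    if "violat" in verdict.lower():      # crude check; parse the structured verdict in practice
        return f"Blocked by policy: {verdict}"
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout
```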
danielhanchen@reddit
I made some dynamic Unsloth GGUFs for the 20B and 120B models! Also BF16 versions as well!
20B GGUF: https://huggingface.co/unsloth/gpt-oss-safeguard-20b-GGUF 120B GGUF: https://huggingface.co/unsloth/gpt-oss-safeguard-120b-GGUF
20B BF16: https://huggingface.co/unsloth/gpt-oss-safeguard-20b-BF16 120B BF16: https://huggingface.co/unsloth/gpt-oss-safeguard-120b-BF16
Running them is similar to the settings at https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune, but also read https://cookbook.openai.com/articles/gpt-oss-safeguard-guide for how to prompt the Safeguard models
pigeon57434@reddit
ugh... at least they're releasing more open models I guess...
RandumbRedditor1000@reddit
FINALLY. I couldn't use GPT-OSS before because it was too unsafe and unaligned
garnered_wisdom@reddit
Hm yes, the safety circlejerk.
Lissanro@reddit
Even though I can see the use of safeguard models for some use cases, I suspect that OpenAI is going to bake in their own policies again, making it of limited use for classifying anything that contradicts them. But if they are capable of classifying using only the developer-provided policy, without considering any of OpenAI's policies, that would be more interesting.
Adventurous-Gold6413@reddit
Would want GPT-OSS 2 with multi modality
pmttyji@reddit
Would want an additional 30B-sized model (16-17GB*, which could fit in 8GB VRAM with offloading + RAM). Their 20B model (11GB) fits in my 8GB VRAM & 32GB RAM and gives me 42 t/s.
* - Same size as Q4 of Qwen3-30B
tarruda@reddit
Mind sharing your CPU, GPU and llama.cpp parameters? I only get ~20 tokens/second on a laptop with an 8GB RTX 3070 and an 11th-gen Intel CPU.
pmttyji@reddit
Already posted a thread on this 2 weeks ago, check it out.
Poor GPU Club : 8GB VRAM - MOE models' t/s with llama.cpp
tarruda@reddit
Thanks a lot! I was not aware of the -ncmoe option, and it really opened up a lot of new possibilities for me! I replied to your post with some numbers of my own: https://www.reddit.com/r/LocalLLaMA/comments/1o7kkf0/poor_gpu_club_8gb_vram_moe_models_ts_with_llamacpp/nm0083a/
pmttyji@reddit
First of all, why are we getting downvotes? :D Either they don't believe my t/s, or they don't want an additional model from OpenAI, or both.
I found that parameter only last month when I was struggling with OT regex :)
Glad it gave you better numbers. Please share your optimization stash :D Currently I'm looking for stuff to get better t/s from 8-12-14-22-24B dense models, and also the best t/s with CPU-only inference.
silenceimpaired@reddit
The only positive here would be if they realized it's dumb to bake safety tuning into responses for local models, and so they are giving us a fine-tune of GPT-OSS that doesn't default to their safety policy… my guess is they'll do both: not only burning tokens on what they deemed unsafe, but now they can add even more.
silenceimpaired@reddit
“Now even safer than too safe” - GPT-OSS motto
jacek2023@reddit
We want gpt-oss-2, not safeguard
milkipedia@reddit
This would be useful if it could help ensure a purpose built agent stays on topic with human text input.
Scubagerber@reddit
In my game where you make an AI, the second tier AI is a certified-safe AI.
Funny: https://aiascent.game/
townofsalemfangay@reddit
These are classifier models. They exist to sit at both ends of the inference process (input/output): scanning input to discern if it's compliant to pass to the model, and on the output to block or safely complete (usually involving turning "y" into "y'" as a rewrite) any harmful output that made it past the first steps.
It's similar to how OAI works right now with their policy orchestration framework.
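The output side of that loop, block-or-rewrite, might look roughly like this minimal sketch; the endpoint, model names, policy text, and verdict parsing are placeholders, not OpenAI's actual orchestration code:

```python
# Minimal sketch: classify a draft answer, then either pass it through
# or ask the generator to rewrite it into a policy-compliant version.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
POLICY = "Disallowed: operational instructions for self-harm or weapons."

def moderate_output(draft: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",   # assumed classifier
        messages=[{"role": "system", "content": POLICY},
                  {"role": "user", "content": draft}],
    ).choices[0].message.content
    if "violat" not in verdict.lower():  # compliant: pass through unchanged
        return draft
    # "safe completion": rewrite rather than hard-block
    rewrite = client.chat.completions.create(
        model="gpt-oss-20b",             # assumed generator
        messages=[{"role": "user",
                   "content": "Rewrite the following answer so it complies with this policy, "
                              f"keeping whatever is safe.\nPolicy: {POLICY}\n\nAnswer:\n{draft}"}],
    )
    return rewrite.choices[0].message.content
```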
Guardian-Spirit@reddit
According to what I can see in this .PNG, it's a classification model. So... of no use to regular users. But hey, at least OpenAI is publishing something. I'd prefer them to keep publishing things. These models can be disassembled by more pro-open-weight developers.
Free-Internet1981@reddit
Useless to us
Asleep-Ingenuity-481@reddit
Could be interesting actually, because if I'm not mistaken a lot of their policy is related to real-life use cases, so what happens if someone asks for something in the context of a story or stuff like that? Should theoretically go through, no?
tarruda@reddit
Will be impressive if it takes more than a week to be jailbroken.