gpt-oss-safeguard is a new family of open-weight safety reasoning models (120b and 20b) from OpenAI. They use reasoning to classify content against a custom, developer-provided policy supplied at inference time, and they produce an explainable chain-of-thought for each decision.
This decouples the safety policy from the model itself, which gives developers a much more "white-box" setup: you can review the model's chain-of-thought to see exactly why it reached a decision.
The main benefit is that you can rapidly calibrate safety rules just by editing the policy text, without having to train a whole new classifier (which costs a lot of time and money).
That's a big deal for application safety, especially with the legal risks around AI that are now emerging.
You can try it out with the examples OpenAI provides, or just feed it your own policy to see how it works.
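To make the "bring your own policy" idea concrete, here's a minimal Python sketch of how a policy gets wired into a chat request: the policy goes in the system message and the content to classify goes in the user message. The spam policy text, model id, and endpoint URL below are my own illustrative assumptions, not official values — any OpenAI-compatible server hosting the model (e.g. a local vLLM instance) should follow the same shape.

```python
# Sketch of the policy-as-prompt pattern: the developer-written policy is
# ordinary text that can be edited at any time, with no retraining.

SPAM_POLICY = """\
You are a content classifier. Decide whether the user message is SPAM.
SPAM includes: unsolicited ads, repeated links, deceptive offers.
Answer with exactly one label, SPAM or NOT_SPAM, then a short rationale.
"""

def build_messages(policy: str, content: str) -> list[dict]:
    """Pair a developer-written policy with the content to classify."""
    return [
        {"role": "system", "content": policy},   # the policy, editable any time
        {"role": "user", "content": content},    # the content under review
    ]

# To actually call the model (requires the `openai` client and a running
# OpenAI-compatible server — base URL and model id are assumptions):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
#   resp = client.chat.completions.create(
#       model="openai/gpt-oss-safeguard-20b",
#       messages=build_messages(SPAM_POLICY, "FREE $$$ click this link!!!"),
#   )
#   print(resp.choices[0].message.content)  # label plus the model's reasoning
```

Swapping in a different policy is just a string change, which is exactly why iteration is so much faster than retraining a dedicated classifier.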
Replies
Flowtica Scribe
Hi everyone!
OpenAI is offering a new direction for safety alignment on their open-weight models.