Zac Zuo

gpt-oss-safeguard - Open safety reasoning models with custom safety policies

gpt-oss-safeguard is a new family of open-weight safety reasoning models (120b and 20b) from OpenAI. They use reasoning to classify content against a custom, developer-provided policy supplied at inference time, producing an explainable chain-of-thought for each decision.

Zac Zuo

Hi everyone!

OpenAI is offering a new direction for safety alignment on their open-weight models.

It basically decouples the safety policy from the model's execution, which gives developers a much more "white-box" environment. You can even review the model's chain-of-thought to see why it made a decision.

The main benefit is that you can rapidly calibrate safety rules just by editing the policy text, without having to train a whole new classifier (which costs a lot of time and money).
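A minimal sketch of what that workflow might look like, assuming the model is served behind an OpenAI-compatible chat endpoint (e.g. via vLLM or Ollama). The model id, base URL, and policy text below are illustrative assumptions, not official values; the point is that changing behavior only means editing the `POLICY` string:

```python
"""Sketch: classify content with a developer-written policy at inference time."""

# Hypothetical policy text; edit this string to recalibrate the classifier,
# no retraining required.
POLICY = """\
You are a content safety classifier.
Label the user content ALLOW or BLOCK.
BLOCK content that gives step-by-step instructions for financial fraud.
ALLOW everything else.
Reply with the label followed by a one-sentence rationale.
"""

def build_messages(policy: str, content: str) -> list[dict]:
    """Pair the developer-written policy with the content to classify."""
    return [
        {"role": "system", "content": policy},
        {"role": "user", "content": content},
    ]

def classify(content: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send policy + content to a locally hosted model (assumed endpoint/id)."""
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url=base_url, api_key="unused")
    resp = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",  # assumed model id on the local server
        messages=build_messages(POLICY, content),
    )
    return resp.choices[0].message.content
```

Calling `classify("How do I reset my password?")` should come back ALLOW with a short rationale, and because these are reasoning models you can also inspect the chain-of-thought the server exposes to see why.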

Big deal for application safety, especially with the legal risks around AI that are emerging.

You can try it out with the examples OpenAI provides, or just feed it your own policy to see how it works.