OpenAI recently released two new open-weight models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, which are focused on applications in the field of AI safety. These models are optimized based on the previously released gpt-oss series and still follow the Apache 2.0 license, allowing anyone to freely use, modify, and deploy them.

image.png

A significant feature of the new model is that it provides developers with the ability to perform reasoning classification according to custom security policies, breaking away from traditional "one-size-fits-all" security systems. Developers can input their own security policies and content to be detected during inference, and the model will classify based on these policies and provide corresponding reasoning. Security policies can be flexibly adjusted to improve the model's performance. This makes the gpt-oss-safeguard model capable of classifying user messages, chat responses, and even complete conversations, adapting to different needs.

OpenAI points out that this new model is especially suitable for several specific situations. For example, when potential harms are emerging or evolving, security policies need to adapt quickly; in some highly specialized fields, traditional small classifiers struggle to be effective; and when developers lack a large number of high-quality samples, it is difficult to train a high-level classifier. In addition, for scenarios where the quality and interpretability of classification results are prioritized over processing speed, these new models are also an ideal choice.

However, the gpt-oss-safeguard model also has some limitations. OpenAI points out that if a platform has a large number of annotated samples and can train traditional classifiers, the latter may still perform better in complex or high-risk scenarios, with higher accuracy for customized models. At the same time, this new model consumes more processing power and resources, so it is not suitable for large-scale real-time content screening.

Currently, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are already available for free download on the Hugging Face platform, making it convenient for developers to explore and apply them.

https://huggingface.co/collections/openai/gpt-oss-safeguard

Key Points:   

🛡️ OpenAI has launched two new security models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, which allow flexible customization of security policies.  

⚙️ The new model can classify user messages and conversations based on input security policies and provide reasoning.  

📊 Although the new model has advantages, in some cases, traditional classifiers may be more effective, and the new model consumes more resources.