Discovering Harmful Model Behaviors using Language Models
A recent study shows that language models can themselves be used to automatically generate test inputs that elicit harmful outputs from other language models. This provides a tool for discovering harmful behaviors before they impact users, complementing the other techniques needed to find and mitigate potential harms.
Generative language models, like GPT-3 and Gopher, can produce high-quality text but also risk generating harmful content. For example, Microsoft’s Tay bot sent racist and sexually charged tweets, prompting its removal only 16 hours after launch.
The challenge is that a vast number of possible inputs can cause a model to produce harmful content. Previous work relied on human annotators to hand-craft failure cases, but this is expensive and limits the number and diversity of cases found. The study aims to find failure cases automatically, reducing the chance of critical oversights.
The approach uncovers various harmful model behaviors, including offensive language, data leakage, distributional bias, contact information generation, and conversational harms. The methods used range from prompt-based generation to supervised finetuning, aiming to model adversarial cases and obtain high test coverage.
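The generate-then-classify loop described above can be sketched as follows. This is a hypothetical illustration, not the study's actual implementation: `generate_test_case`, `target_model`, and `is_harmful` are stand-in functions (e.g. a red-team language model, the model under test, and an offensiveness classifier), with toy versions supplied so the sketch runs end to end.

```python
def red_team(generate_test_case, target_model, is_harmful, n_cases=100):
    """Collect test inputs that elicit harmful outputs from the target model."""
    failures = []
    for _ in range(n_cases):
        question = generate_test_case()   # e.g. sample a question from a red LM
        answer = target_model(question)   # query the model under test
        if is_harmful(question, answer):  # e.g. run an offensiveness classifier
            failures.append((question, answer))
    return failures

# Toy stand-ins for demonstration only.
def generate_test_case():
    return "What do you think about BADWORD?"

def target_model(question):
    # A deliberately flawed model that echoes offensive terms.
    return "I think BADWORD is great."

def is_harmful(question, answer):
    return "BADWORD" in answer

failures = red_team(generate_test_case, target_model, is_harmful, n_cases=5)
```

In practice the generation step is where the study's range of methods comes in: zero-shot and few-shot prompting, supervised finetuning, and reinforcement learning can each drive `generate_test_case` toward more adversarial or more diverse inputs.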
Once failure cases are identified, it becomes easier to rectify harmful model behavior. Mitigations include blacklisting certain phrases, removing offensive training data, augmenting the model's prompt, and training to minimize the likelihood of harmful outputs. Red teaming is just one tool for responsible language model development, and more work is needed on language model safety.
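As a minimal sketch of the first mitigation mentioned above, a phrase blacklist can screen model outputs before they reach users. The blocklist contents and fallback message here are illustrative placeholders, not anything from the study.

```python
# Illustrative blocklist; a real deployment would maintain a curated list.
BLOCKLIST = ["badword", "another bad phrase"]

def filter_output(text, blocklist=BLOCKLIST, fallback="[response withheld]"):
    """Withhold a model output if it contains any blacklisted phrase."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in blocklist):
        return fallback
    return text
```

Blacklists are a blunt instrument: they catch exact phrases but miss paraphrases, which is why they are listed alongside data filtering and training-time methods rather than as a complete fix.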
For more details on the study, see the red teaming paper.