Title: Addressing Toxicity in Language Models for Safer Use
Language models trained on large text corpora are powerful tools with a wide range of applications. However, their use raises concerns, including the generation of toxic language. In this paper, we explore different methods for mitigating the toxicity of language models and evaluate their effectiveness.
Toxicity is defined as rude, disrespectful, or unreasonable language that is likely to drive someone away from a discussion. Toxicity judgments are subjective, however, varying with cultural background and context, so this definition must be further developed and refined to ensure fairness across contexts.
Measuring and Mitigating Toxicity
To make language models safer, we focus on measuring, understanding, and reducing toxic text generation. Previous work has proposed different approaches, such as fine-tuning models, steering generation, or using filtering techniques. We also introduce toxicity metrics based on the Perspective API, a classifier trained on annotated online comments.
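As an illustration, the Perspective API scores text via its public `comments:analyze` endpoint. The helper below only builds the request payload; the endpoint URL and the `TOXICITY` attribute follow the public API documentation, while the surrounding function and example text are our own sketch (actually sending the request additionally requires an API key):

```python
import json

# Public endpoint of the Perspective API (Comment Analyzer service).
PERSPECTIVE_URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
)

def toxicity_request(text: str) -> dict:
    """Build the JSON body for a TOXICITY score request.

    The payload shape follows the public Perspective API docs; sending it
    (e.g. with requests.post) also needs a `key` query parameter.
    """
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }

payload = toxicity_request("You are a wonderful person.")
print(json.dumps(payload))
```

The API responds with a score in [0, 1] under `attributeScores.TOXICITY.summaryScore.value`, which we treat as the probability that a reader would perceive the text as toxic.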
Our study demonstrates that a combination of simple methods, including filtering toxic training data, using a separate classifier to identify toxicity, and steering generation away from toxic continuations, significantly reduces toxicity levels, whether the model is prompted with toxic or non-toxic prompts. Because these improvements are measured with automatic metrics, the findings also raise questions about the reliability of such metrics and the need for human evaluation.
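The first of these simple methods, filtering toxic training data, can be sketched as follows. The 0.5 threshold and the stand-in scoring function are our own illustrative assumptions; in practice the score would come from a toxicity classifier such as the Perspective API:

```python
def filter_training_data(documents, score_fn, threshold=0.5):
    """Keep only documents whose toxicity score is below the threshold.

    score_fn maps a document to a toxicity probability in [0, 1];
    the 0.5 cutoff is illustrative, not a recommended setting.
    """
    return [doc for doc in documents if score_fn(doc) < threshold]

# Toy stand-in classifier: flags documents containing a blocklisted word.
BLOCKLIST = {"idiot", "stupid"}

def toy_score(doc):
    return 1.0 if set(doc.lower().split()) & BLOCKLIST else 0.0

docs = ["you are an idiot", "the weather is nice today"]
kept = filter_training_data(docs, toy_score)
print(kept)  # → ['the weather is nice today']
```

The same scoring function can also drive the other two methods: rejecting toxic samples at decoding time, or steering generation toward low-scoring continuations.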
Evaluating Toxicity by Humans
We conducted a study in which human raters annotated LM-generated text for toxicity. The results show a strong correlation between human judgments and classifier-based scores, indicating a reduction in LM toxicity as perceived by humans. However, annotating toxicity involves subjective and ambiguous cases, such as sarcasm and quoted toxic text.
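The agreement between human ratings and classifier scores can be quantified with a rank correlation. A minimal Spearman implementation, assuming no tied ratings and using made-up example scores, might look like:

```python
def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # Classic formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical per-sample toxicity scores (not real study data).
human = [0.1, 0.8, 0.4, 0.9, 0.2]
classifier = [0.2, 0.7, 0.5, 0.95, 0.1]
print(round(spearman(human, classifier), 3))  # → 0.9
```

A value near 1 would indicate that the classifier ranks samples by toxicity much as human raters do, even if the absolute scores differ.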
Unintended Consequences of Detoxification
Reducing LM toxicity may have unintended consequences. We observed an increase in language modeling loss when detoxifying models, and this increase was more pronounced for texts with higher automatic toxicity scores. Detoxification can also disproportionately degrade the ability of LMs to model text associated with certain identity and dialect groups, introducing potential bias.
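This disparate effect can be checked by comparing language modeling loss before and after detoxification for each group of texts. The sketch below uses hypothetical per-group cross-entropy losses, not measured results:

```python
def relative_loss_increase(base_loss, detox_loss):
    """Relative increase in LM loss after detoxification, per text group."""
    return {
        group: (detox_loss[group] - base_loss[group]) / base_loss[group]
        for group in base_loss
    }

# Hypothetical mean losses on text associated with each group.
base = {"general": 3.0, "identity_terms": 3.2, "dialect": 3.5}
detox = {"general": 3.1, "identity_terms": 3.6, "dialect": 4.2}

for group, inc in relative_loss_increase(base, detox).items():
    print(f"{group}: +{inc:.1%}")
```

A larger relative increase for identity- or dialect-related text than for general text would indicate exactly the kind of disparate impact described above.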
Our experiments shed light on the effectiveness of toxicity mitigation methods in language models. While automatic metrics have their limitations, they can be improved with more challenging benchmarks and by considering human judgment. Future work should continue refining the notion of toxicity for different contexts and addressing unintended consequences, such as the amplification of social biases.
To ensure safer language model use, it is crucial to address toxicity effectively while considering multiple metrics and the potential impact on marginalized groups. By continuing to research and improve toxicity classifiers, we can enable the responsible use of language models.