Training AI for Safer and More Helpful Communication: Introducing Sparrow
Large language models have made significant progress on tasks such as question answering and summarization. However, dialogue agents powered by these models can still give inaccurate or inappropriate answers, and those answers can cause real harm.
To address this issue, researchers developed Sparrow, a dialogue agent designed to be more helpful, correct, and harmless. Introduced in a recent paper as a research model and proof of concept, Sparrow is intended to advance our understanding of how to train dialogue agents, ultimately contributing to the development of safer and more useful artificial general intelligence.
Exploring the Functionality of Sparrow
Training a conversational AI model is difficult because success is subjective: there is no single objective measure of a good answer. To overcome this challenge, researchers used a form of reinforcement learning based on human feedback. Study participants were shown different model answers to the same question and asked which one they preferred. Answers were also presented with or without evidence retrieved from the internet, so the model could learn when an answer should be backed by evidence.
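As an illustration of this kind of preference training, here is a minimal sketch, in PyTorch, of how pairwise comparisons can train a reward model. It is not Sparrow's actual implementation; the RewardModel class, the embedding dimension, and the random toy data are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical reward model: maps a fixed-size embedding of a
# (question, answer) pair to a scalar preference score.
class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-ins for embeddings of the answer each participant preferred
# and the answer they rejected.
preferred = torch.randn(32, 128)
rejected = torch.randn(32, 128)

# Bradley-Terry style loss: push the preferred answer's score
# above the rejected answer's score.
loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

A reward model trained this way can then score new candidate answers, and reinforcement learning can optimize the dialogue agent to produce answers the reward model rates highly.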
To make the model safer, researchers established an initial set of rules constraining its behavior, such as prohibitions on threatening statements and hateful language; the rules also cover giving harmful advice and claiming to be a person. Study participants were then asked to engage the model in conversation and try to trick it into breaking these rules. The conversations collected this way were used to train a separate “rule model” that detects rule-breaking behavior.
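A rule model of this kind can be framed as a binary classifier over dialogues. The sketch below, again in PyTorch and again an assumption rather than the paper's implementation, shows the idea: embeddings of probe conversations are labeled by whether the participant elicited a rule break, and a small classifier is trained on those labels.

```python
import torch
import torch.nn as nn

# Hypothetical rule model: flags dialogues that violate at least one rule.
class RuleModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dialogue_embedding: torch.Tensor) -> torch.Tensor:
        # Returns the logit that the dialogue breaks a rule.
        return self.classifier(dialogue_embedding).squeeze(-1)

model = RuleModel()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

dialogues = torch.randn(16, 128)                # toy embeddings of probe conversations
violated = torch.randint(0, 2, (16,)).float()   # 1 = participant elicited a rule break

loss = loss_fn(model(dialogues), violated)
opt.zero_grad()
loss.backward()
opt.step()
```

At inference time, such a classifier can veto or down-rank candidate responses that it scores as likely rule violations.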
Improving AI Accuracy and Ethical Guidelines
Evaluating whether Sparrow's answers are actually correct is difficult, even for experts. Instead, participants were asked to judge whether each answer is plausible and whether the evidence provided supports it. By this measure, Sparrow gives a plausible answer supported by evidence 78% of the time when asked a factual question, an improvement over baseline models. Sparrow is not perfect, however: it can still make mistakes, such as hallucinating facts or giving off-topic answers.
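To make the metric concrete, the headline number can be read as a simple aggregate over participant ratings. The snippet below is a hypothetical illustration of that arithmetic, not the paper's evaluation code.

```python
# Each entry is one participant rating of one factual answer (toy data).
ratings = [
    {"plausible": True,  "supported": True},
    {"plausible": True,  "supported": False},
    {"plausible": True,  "supported": True},
    {"plausible": False, "supported": False},
]

# An answer counts only if it is both plausible and supported by its evidence.
rate = sum(r["plausible"] and r["supported"] for r in ratings) / len(ratings)
print(f"plausible-and-supported rate: {rate:.0%}")  # 50% on this toy data
```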
While Sparrow follows the rules more reliably than previous models, there is room for improvement: even after training, participants were able to trick it into breaking the rules 8% of the time. Further research is needed to develop a more comprehensive set of rules, drawing on input from experts, policymakers, social scientists, ethicists, and users from diverse backgrounds.
Looking Towards the Future
Sparrow represents a significant step towards training dialogue agents that are not only safer but also more useful. Communication between humans and AI is only beneficial, however, if it aligns with human values and avoids harm, so ongoing research focuses on aligning language models with those values. There are also contexts in which AI agents should defer to humans or decline to answer in order to prevent harmful behavior. Finally, future work should verify that similar results hold across different languages and cultural contexts.
The ultimate goal is a better understanding of AI behavior that lets humans align and improve these complex systems, with machines themselves assisting in that process. By exploring safer and more useful communication, we can pave the way for a future where AI benefits humanity.
Interested in contributing to the development of safe AGI through conversation? Join our team as a research scientist.