Reinforcement Learning from Human Feedback: Optimizing Reward Models
Reinforcement learning from human feedback (RLHF) is a common approach for training AI models to predict human preferences. However, optimizing too hard against a learned reward model can degrade performance on the true objective, an instance of Goodhart's law. Despite this observation, measuring the size of the effect has been difficult because collecting human preference data is expensive. In this study, we propose a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing the labels used to train a proxy reward model. By tracking how the gold reward model score changes during optimization, we can quantify overoptimization directly.
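As a concrete illustration of this setup, the sketch below uses a toy scoring function as a stand-in for the fixed gold-standard reward model (the function itself is an assumption for illustration only) to label preference pairs, producing the kind of synthetic dataset a proxy reward model could then be trained on:

```python
def gold_reward(text: str) -> float:
    # Stand-in for the fixed "gold-standard" reward model.
    # Toy heuristic (an assumption): score by number of distinct characters.
    return float(len(set(text)))

def make_synthetic_labels(pairs):
    """Label each (a, b) completion pair with the gold model's preference,
    returning (preferred, rejected) tuples a proxy reward model can train on."""
    labeled = []
    for a, b in pairs:
        if gold_reward(a) >= gold_reward(b):
            labeled.append((a, b))
        else:
            labeled.append((b, a))
    return labeled

pairs = [("the cat sat", "aaaa"), ("hi", "hello world")]
print(make_synthetic_labels(pairs))
```

Because the gold model is fixed and cheap to query, it can label arbitrarily many pairs, which is what makes this synthetic substitute for human annotators attractive.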
Optimization Methods and Functional Relationships
We investigate how the gold reward model score varies as we optimize against the proxy reward model using two methods: reinforcement learning and best-of-n sampling. We find that the relationship follows a different functional form for each optimization method, and that in both cases the coefficients of the relationship scale smoothly with the number of reward model parameters.
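To make the best-of-n side concrete, here is a minimal sketch (the reward functions are hypothetical stand-ins, not the models used in the study): a noisy proxy scores n candidates, the top-scoring one is kept, and its gold score is reported. The KL divergence of best-of-n sampling from the base distribution has the known closed form log n − (n − 1)/n, which gives the optimization-strength axis against which such functional relationships are typically plotted:

```python
import math
import random

def gold_reward(x: float) -> float:
    return x  # toy "true" score (assumption for illustration)

def proxy_reward(x: float) -> float:
    # Hypothetical proxy: a noisy view of the gold score.
    return x + random.gauss(0.0, 0.5)

def best_of_n_gold_score(sample, n: int) -> float:
    """Draw n candidates, keep the one the proxy ranks highest,
    and return that candidate's gold score."""
    candidates = [sample() for _ in range(n)]
    return gold_reward(max(candidates, key=proxy_reward))

def bon_kl(n: int) -> float:
    """KL divergence between the best-of-n distribution and the base policy."""
    return math.log(n) - (n - 1) / n

random.seed(0)
for n in (1, 4, 16, 64):
    score = best_of_n_gold_score(lambda: random.gauss(0.0, 1.0), n)
    print(n, round(bon_kl(n), 3), round(score, 3))
```

Sweeping n and recording the gold score at each KL value is how the functional form for best-of-n can be traced out empirically.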
Factors Affecting the Relationship
To further understand this relationship, we analyze how several factors affect its behavior: the size of the reward model training dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
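For reference, a common way the KL penalty enters the reinforcement learning setup is by subtracting a per-sample estimate of KL(policy ‖ reference) from the proxy reward, scaled by the penalty coefficient. The sketch below is a simplified illustration of that penalized objective (the variable names and the single-sample KL estimator are assumptions, not the study's exact implementation):

```python
def kl_penalized_objective(proxy_rewards, logprobs_policy, logprobs_ref, beta):
    """Average per-sample objective: proxy reward minus beta times a
    single-sample KL estimate, log pi(a) - log pi_ref(a)."""
    total = 0.0
    for r, lp, lp_ref in zip(proxy_rewards, logprobs_policy, logprobs_ref):
        kl_est = lp - lp_ref  # single-sample estimate of KL(pi || pi_ref)
        total += r - beta * kl_est
    return total / len(proxy_rewards)

print(kl_penalized_objective([1.0, 0.5], [-2.0, -1.0], [-2.5, -1.2], beta=0.1))
```

Raising beta keeps the policy closer to the reference model, which is why the penalty coefficient is a natural knob to vary when studying overoptimization.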
Implications for AI Alignment
These empirical results carry significant implications for theoretical work in AI alignment. Understanding how the optimization method, model and dataset scale, and penalty strength shape reward model overoptimization can help keep AI systems aligned with human preferences even under strong optimization pressure.
In conclusion, our study sheds light on the dynamics involved in training AI models with reinforcement learning from human feedback. By characterizing the relationship between the gold reward model score and the degree of proxy reward model optimization, we uncover insights that can inform future advances in AI alignment.