The Relationship Between Proxy Reward Models and Gold Standard Reward Models in RL

Reinforcement Learning from Human Feedback: Optimizing Reward Models


Reinforcement learning from human feedback (RLHF) is a common approach for training AI models to match human preferences. However, optimizing too hard against a learned reward model can degrade true performance, an instance of Goodhart's law. Despite this observation, measuring the effect precisely has been difficult because collecting human preference data is expensive. In this study, we propose a synthetic setup in which a fixed "gold-standard" reward model takes on the role of human annotators, providing the labels used to train a proxy reward model. By tracking how the gold reward model's score changes as the proxy is optimized, we can gain valuable insights.
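The synthetic setup can be sketched in a few lines. Here the "gold" reward model is a stand-in scoring function that replaces human annotators when labeling preference pairs; all names and the toy scorer are illustrative, not from the paper's released code:

```python
def synthetic_preference_labels(pairs, gold_rm):
    """Label completion pairs with a fixed 'gold' reward model
    standing in for human annotators. Each pair (a, b) gets the
    completion the gold model scores higher as the preferred one."""
    labels = []
    for a, b in pairs:
        preferred = a if gold_rm(a) >= gold_rm(b) else b
        labels.append((a, b, preferred))
    return labels

# Toy gold RM for illustration: longer strings score higher.
gold_rm = len
pairs = [("short", "a longer completion"), ("xyz", "ab")]
labels = synthetic_preference_labels(pairs, gold_rm)
```

The resulting labels would then train a proxy reward model exactly as human preference labels would, which is what makes the setup cheap to scale.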

Optimization Methods and Functional Relationships

We investigate how the gold reward model score varies with optimization against the proxy reward model under two methods: reinforcement learning and best-of-n sampling. Interestingly, the relationship follows a different functional form for each method, and the coefficients of this relationship scale smoothly with the number of reward model parameters.
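Of the two methods, best-of-n sampling is the simpler to state: draw n candidate completions and keep the one the proxy reward model scores highest. A minimal sketch, where `sample_fn` and `proxy_rm` are hypothetical stand-ins for a sampler and a trained proxy model:

```python
def best_of_n(prompt, sample_fn, proxy_rm, n=16):
    """Best-of-n sampling: generate n candidates for the prompt,
    return the one with the highest proxy reward. Overoptimization
    shows up as the gold score of the winner lagging its proxy score
    as n grows."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=proxy_rm)

# Illustrative usage with canned candidates and length as the proxy.
cands = iter(["a", "abc", "ab"])
winner = best_of_n("prompt", lambda _: next(cands), len, n=3)
```

Because best-of-n never updates the policy, it gives a clean, optimizer-free way to measure how hard the proxy is being exploited, which is why the paper studies it alongside RL.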

Factors Affecting the Relationship

To further understand this relationship, we analyze how several factors affect its behavior: the size of the reward model training dataset, the number of reward model and policy parameters, and the KL penalty coefficient in the reinforcement learning setup.
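The KL penalty mentioned above discourages the policy from drifting far from a reference model while chasing proxy reward. A common formulation subtracts a scaled per-sample KL estimate from the proxy score; this is a standard sketch of that idea, not the paper's verbatim implementation, and the argument names are illustrative:

```python
def kl_penalized_reward(proxy_score, logp_policy, logp_ref, beta=0.1):
    """RLHF training signal sketch: proxy reward minus a KL penalty
    that keeps the policy close to the reference model. beta is the
    KL coefficient whose effect the study sweeps over; larger beta
    means less drift and less overoptimization."""
    kl_estimate = logp_policy - logp_ref  # sample-based KL term
    return proxy_score - beta * kl_estimate
```

For example, a proxy score of 1.0 with log-probabilities -2.0 (policy) and -2.5 (reference) gives a penalized reward of 1.0 - 0.1 * 0.5 = 0.95.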

Implications for AI Alignment

These empirical results carry implications for theoretical work in AI alignment. Understanding how the choice of optimization method, model scale, and data size shapes overoptimization can help keep AI systems aligned with human preferences even as the proxy reward is optimized hard.

In conclusion, our study sheds light on the dynamics of training AI models with reinforcement learning from human feedback. By characterizing the relationship between the gold reward model score and proxy reward model optimization, we uncover insights that can inform future work on AI alignment.
