The Challenge of Matching Human Preferences to AI Models
Aligning large pretrained models with human preferences is a challenge that has grown more prominent as these models have improved, and it is especially difficult when the training data contains undesirable behaviors. To address this issue, reinforcement learning from human feedback (RLHF) has become popular. RLHF methods use human preferences to distinguish between desirable and undesirable behaviors and to improve a learned policy.
The RLHF procedure has two stages. First, human preference data is gathered and used to train a reward model. Then, a reinforcement learning algorithm optimizes a policy against that reward model. However, recent research challenges this formulation, suggesting that human preferences are better modeled by the regret of each action under the optimal policy of the expert’s reward function. This is because human evaluators judge the optimality of actions rather than their immediate rewards.
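The first stage above is typically framed as a pairwise comparison: the reward model is trained so that the preferred segment is assigned a higher total reward than the rejected one, under a Bradley-Terry preference model. The following is a minimal sketch of that training objective; the function name and the toy reward values are illustrative, not taken from any specific implementation.

```python
import math

def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred segment wins under a
    Bradley-Terry model: P(preferred > rejected) = sigmoid(r_pref - r_rej).
    Minimizing this pushes the reward model to score preferred segments higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# Toy scalar rewards a learned reward model might assign to two segments.
loss = bradley_terry_loss(r_preferred=2.0, r_rejected=0.5)
```

The loss shrinks toward zero as the reward gap in favor of the preferred segment grows, which is what drives the reward model to separate good behavior from bad.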
Most RLHF algorithms use reinforcement learning in the second phase to optimize the learned reward function. However, RL algorithms bring optimization difficulties such as instability and high variance. As a result, earlier works have limited their scope to avoid these problems. For example, many RLHF approaches for language models assume a single-step bandit formulation, even though user interactions are multi-step and sequential.
In contrast to these approaches, researchers from Stanford University, UMass Amherst, and UT Austin propose a novel family of RLHF algorithms. Their regret-based model of preferences provides precise information on the best course of action. This removes the need for RL and allows RLHF to be applied to high-dimensional state and action spaces. They achieve this by combining the regret-based preference framework with the Maximum Entropy principle.
Their algorithm, called Contrastive Preference Learning (CPL), has several advantages over previous efforts. First, it scales well because it uses only supervised learning objectives. Second, it is completely off-policy, allowing the use of any offline data source. Lastly, CPL supports preference queries over sequential data, enabling learning on arbitrary Markov Decision Processes (MDPs).
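To make the idea concrete, here is a minimal sketch of a contrastive preference objective in the spirit of CPL: each segment is scored by a discounted sum of temperature-scaled policy log-probabilities (standing in for the regret-based advantage under the maximum-entropy framework), and a logistic contrast is applied between the preferred and rejected segments. The temperature `alpha`, the discount `gamma`, and the toy log-probability values are assumptions for illustration, not the paper's exact hyperparameters.

```python
import math

def cpl_loss(logp_preferred, logp_rejected, alpha=0.1, gamma=0.99):
    """Contrastive preference loss sketch. Scores each trajectory segment
    by a discounted sum of alpha-scaled policy log-probs, then applies
    -log sigmoid(score_pos - score_neg), a purely supervised objective."""
    def score(logps):
        return sum(alpha * (gamma ** t) * lp for t, lp in enumerate(logps))
    s_pos, s_neg = score(logp_preferred), score(logp_rejected)
    # Minimized when the preferred segment's score exceeds the rejected one's.
    return -math.log(1.0 / (1.0 + math.exp(-(s_pos - s_neg))))

# Toy per-step action log-probabilities for two short segments.
loss = cpl_loss([-0.2, -0.3], [-1.5, -1.2])
```

Note that only the policy's own log-probabilities appear in the objective: there is no separately learned reward model and no RL inner loop, which is what makes the method scale like ordinary supervised learning.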
The researchers demonstrate CPL’s performance on sequential decision-making tasks using off-policy data. They show that CPL can efficiently learn manipulation policies on the MetaWorld benchmark. By pre-training policies with supervised learning from high-dimensional image observations and then fine-tuning them using preferences, CPL achieves performance comparable to previous RL-based techniques while being faster and more parameter-efficient.
In conclusion, CPL offers a promising solution to the challenge of aligning AI models with human preferences. It eliminates the need for RL and provides precise guidance on optimal actions. The researchers showcase the effectiveness of CPL across a range of tasks and highlight its advantages over previous approaches.
If you’re interested in learning more about CPL, check out the paper.