Designing User Preferences for RL Agents Using Language-Based Reward Functions

Autonomous agents are becoming more capable with advances in computing and data. However, it is important that humans retain control over the policies these agents learn, so that agent behavior stays aligned with human goals. Today, users have two main ways to provide input: hand-designing reward functions or supplying labeled data. Both approaches come with serious challenges and are often impractical to deploy.

Hand-designing reward functions is difficult because agents tend to exploit flaws in a misspecified reward, optimizing the letter of the function rather than the user's intent, and it is hard to balance competing objectives within a single function. Training agents from labeled data, on the other hand, requires large datasets to capture the nuances of individual users' preferences, which is expensive to collect. Moreover, if the user population changes, the reward functions must be redesigned or new data gathered.

To address these challenges, researchers at Stanford University and DeepMind have developed a system that lets users communicate their preferences simply through natural language. They build on large language models (LLMs) that, having been trained on massive amounts of internet text, are adept at in-context learning from only a handful of examples and have absorbed substantial knowledge about human behavior.

The researchers propose using a prompted LLM as a stand-in for a hand-designed reward function when training RL agents. The user specifies their goal through a conversational interface, with a few examples or a single sentence. The trajectory of an RL episode, together with the user's prompt, is fed to the LLM, which outputs a score indicating whether the trajectory satisfies the user's goal; that score serves as the reward signal for training the RL agent. A key benefit is that users can express their preferences intuitively in language instead of providing large numbers of labeled examples.
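The trajectory-scoring step can be sketched as a small wrapper. Note that the `llm` callable below is a hypothetical stand-in for any text-completion API, and the prompt format is an illustrative assumption rather than the paper's exact template:

```python
def llm_proxy_reward(llm, task_description, trajectory_text):
    """Use an LLM as a proxy reward function for an RL episode.

    `llm` is any callable mapping a text prompt to a text completion
    (a hypothetical stand-in for a real LLM API). Returns a binary
    reward: 1.0 if the model judges the trajectory to satisfy the
    user's goal, else 0.0.
    """
    prompt = (
        f"User objective: {task_description}\n\n"
        f"Episode trajectory:\n{trajectory_text}\n\n"
        "Does this trajectory satisfy the user's objective? "
        "Answer Yes or No."
    )
    answer = llm(prompt).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0


# Usage with a stub model that always approves (for illustration only):
stub_llm = lambda prompt: "Yes"
reward = llm_proxy_reward(
    stub_llm,
    "split the money evenly",
    "Agent proposed a 50/50 split; partner accepted.",
)
# reward == 1.0
```

Because the model is queried once per episode on the full trajectory, the reward arrives only at episode end, which is exactly the sparse, goal-level signal the user's sentence describes.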

In a pilot study with ten participants, users reported that the proposed agent was better aligned with their goals than an agent trained with a baseline approach, even though only a few prompts were used to guide them through the process. By leveraging the LLM's knowledge of common goals, zero-shot prompting increased the proportion of objective-aligned reward signals by an average of 48% for a regular ordering of matrix-game outcomes and by 36% for a scrambled ordering.

Using LLMs as proxy reward functions lets RL agents be trained accurately even from very few examples. The agents can be steered toward common goals and receive reinforcement signals aligned with those goals. This approach is more sample-efficient than training from labeled data, because the user need only describe the desired outcome once rather than label many examples.
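As a deliberately minimal illustration of how such a trajectory-level reward plugs into training, the loop below learns a two-action preference table from any episode-scoring callable. The environment, action set, and update rule are all illustrative assumptions, not the paper's actual setup:

```python
import random

def train_with_proxy_reward(reward_fn, n_episodes=500, lr=0.1,
                            eps=0.2, seed=0):
    """Tiny RL loop driven by a trajectory-level proxy reward.

    `reward_fn` maps a trajectory (here, a one-action list) to a scalar,
    e.g. an LLM-based scorer queried once per episode. The two-action
    bandit below is an illustrative stand-in for a real environment.
    """
    rng = random.Random(seed)
    prefs = {0: 0.0, 1: 0.0}  # running value estimate per action
    for _ in range(n_episodes):
        # epsilon-greedy action selection
        if rng.random() < eps:
            action = rng.choice([0, 1])
        else:
            action = max(prefs, key=prefs.get)
        # the proxy reward is queried once, on the whole trajectory
        reward = reward_fn([action])
        prefs[action] += lr * (reward - prefs[action])
    return prefs
```

In the paper's setting `reward_fn` would wrap the prompted LLM; because the loop only assumes a callable, it can be exercised with any hand-written scorer as well.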

To learn more about this research, read the paper and visit the GitHub page. Credit for this research goes to the researchers involved in the project.
