Reward shaping, designing reward functions that guide an agent toward the desired behavior, is a long-standing challenge in reinforcement learning (RL). In practice it is usually done by hand through trial and error, which is time-consuming and often yields sub-optimal rewards. Inverse reinforcement learning (IRL) and preference learning offer alternatives, but both demand substantial expert demonstrations or preference data, and the neural network reward models they learn are hard to interpret and may not generalize well.
Researchers from The University of Hong Kong, Nanjing University, Carnegie Mellon University, Microsoft Research, and the University of Waterloo have introduced TEXT2REWARD, a framework that automatically generates dense reward code from natural-language goal descriptions. Given an RL objective, the framework prompts large language models (LLMs) to produce executable dense reward code, which is then used to train a policy with standard RL algorithms such as PPO or SAC. Unlike inverse RL, TEXT2REWARD requires no demonstration data and produces symbolic rewards that are interpretable and cover a wide range of tasks.
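To make the idea concrete, here is a minimal sketch of the kind of dense reward code an LLM might generate for a pick-and-place style manipulation task, assuming a simple observation dictionary with gripper, object, and goal positions. The field names and weighting constants are illustrative assumptions, not the framework's actual output or API.

```python
import numpy as np

# Illustrative dense reward of the kind an LLM might write from a goal
# description such as "move the cube to the target position".
def compute_dense_reward(obs):
    gripper_pos = np.asarray(obs["gripper_pos"])  # end-effector position
    obj_pos = np.asarray(obs["obj_pos"])          # object position
    goal_pos = np.asarray(obs["goal_pos"])        # target position

    # Stage 1: encourage the gripper to approach the object.
    reach_dist = np.linalg.norm(gripper_pos - obj_pos)
    reaching_reward = 1.0 - np.tanh(5.0 * reach_dist)

    # Stage 2: encourage moving the object toward the goal.
    place_dist = np.linalg.norm(obj_pos - goal_pos)
    placing_reward = 1.0 - np.tanh(5.0 * place_dist)

    # Bonus once the object is close enough to count as success.
    success_bonus = 5.0 if place_dist < 0.02 else 0.0

    return reaching_reward + placing_reward + success_bonus
```

Because the reward is plain symbolic code rather than a learned network, it can be read, debugged, and edited directly, which is what makes it more interpretable than rewards recovered by inverse RL.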
In addition, TEXT2REWARD addresses the instability of RL training and the ambiguity of natural language by keeping a human in the loop: users can describe what went wrong after training, and the reward code is refined accordingly. The researchers evaluated the approach on robotic manipulation benchmarks and locomotion environments. On most tasks, policies trained with TEXT2REWARD matched or exceeded the success rates and convergence speed of policies trained with carefully hand-designed reward code.
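The human-in-the-loop refinement described above can be sketched roughly as follows. The three steps (the LLM call, the RL training run, and the user review) are reduced to placeholder functions so the control flow is runnable; none of these helpers is the framework's real API.

```python
def generate_reward_code(task, feedback):
    # Placeholder for the LLM call that returns dense reward code as a string.
    return f"# reward code for: {task}\n# revised using feedback: {feedback!r}"

def train_policy(reward_code):
    # Placeholder for an RL run (e.g., PPO or SAC) using the generated reward.
    return {"success_rate": 0.8}

def collect_human_feedback(result):
    # Placeholder for the user reviewing rollouts and describing failures.
    if result["success_rate"] > 0.9:
        return ""
    return "the gripper drops the cube before reaching the target"

def refine_reward(task, num_rounds=3):
    feedback = ""  # no feedback on the first round
    for _ in range(num_rounds):
        reward_code = generate_reward_code(task, feedback)
        result = train_policy(reward_code)
        feedback = collect_human_feedback(result)
        if not feedback:  # user is satisfied; stop iterating
            break
    return reward_code, result

print(refine_reward("move the cube to the target position"))
```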
TEXT2REWARD also learned novel locomotion behaviors with high success rates. The policies it trains can be deployed on real-world robots, and human feedback allows the success rate of learned policies to improve iteratively.
Overall, TEXT2REWARD offers an interpretable and generalizable approach to reward shaping in RL, supporting a human-in-the-loop pipeline and covering a wide range of tasks. The researchers hope their findings will inspire further work at the intersection of reinforcement learning and code generation.
If you’re interested in learning more about TEXT2REWARD, the research paper, code, and project page are linked from the source article below.
Source:
– https://www.marktechpost.com/2023/10/04/meet-text2reward-a-data-free-framework-that-automates-the-generation-of-dense-reward-functions-based-on-large-language-models/
About the author:
Aneesh Tickoo is a consulting intern at MarktechPost. He is studying Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. Aneesh is passionate about machine learning and image processing and enjoys collaborating on interesting projects.