The Significance of Reward in Reinforcement Learning
Reward is central to reinforcement learning: it is the signal that drives a learning agent's behavior. The reward hypothesis, as articulated by Sutton and by Littman, holds that reward should be expressive enough to encode a wide range of tasks for learning agents.
To better understand this hypothesis, consider a thought experiment involving Alice, a designer, and Bob, a learning agent. Alice thinks of a task she wants Bob to learn, then translates that task into a learning signal, such as a reward, for Bob to learn from. The question we address is whether, for any task Alice might choose, there is always a reward function that conveys it to Bob.
Introducing Three Task Types
To study this question, we introduce three formalizations of "task": a set of acceptable policies (SOAP), which deems some policies acceptable and all others unacceptable; a policy order (PO), a partial ordering over policies; and a trajectory order (TO), a partial ordering over trajectories. Our goal is to determine whether a reward function can effectively capture each of these task types in finite environments, focusing specifically on Markov reward functions.
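These three notions can be written down concretely. The sketch below is one hypothetical Python encoding of the task types for finite environments; the class and field names are illustrative, not taken from any published codebase.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Illustrative encodings of the three task types (names are our own).

# A deterministic policy over a finite state space: one action index per state.
Policy = Tuple[int, ...]
# A trajectory: a finite sequence of (state, action) pairs.
Trajectory = Tuple[Tuple[int, int], ...]

@dataclass(frozen=True)
class SOAP:
    """Set Of Acceptable Policies: behave like any policy in this set."""
    acceptable: FrozenSet[Policy]

@dataclass(frozen=True)
class PolicyOrder:
    """Policy Order: a ranking over policies, best tier first."""
    tiers: Tuple[FrozenSet[Policy], ...]

@dataclass(frozen=True)
class TrajectoryOrder:
    """Trajectory Order: a ranking over trajectories, best tier first."""
    tiers: Tuple[FrozenSet[Trajectory], ...]
```

A SOAP only separates acceptable from unacceptable behavior, while the two orders carry strictly more comparative information; which of the three Alice uses changes what a reward function must encode.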
Our first main result shows that there are environment-task pairs for which no Markov reward function captures the task. For example, in a typical grid world, the task of "going all the way around the grid clockwise or counterclockwise" cannot be captured by a Markov reward function: any reward that makes both the clockwise and the counterclockwise policy optimal also makes unwanted mixtures of the two optimal.
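The same obstruction has a miniature analogue that fits in a few lines. The sketch below is our own toy construction, not the paper's grid world: an episode visits state 0 then state 1, each state offers actions "a" and "b", and a policy's value is the sum of the Markov rewards it collects. The SOAP "take the same action in both states" cannot be made strictly optimal by any Markov reward, because values are additive over state-action pairs.

```python
import random

ACTIONS = ("a", "b")

def value(policy, R):
    """Return of a deterministic policy (action for state 0, action for state 1)."""
    return R[0][policy[0]] + R[1][policy[1]]

# Task (a SOAP): "take the same action in both states".
good = [("a", "a"), ("b", "b")]
bad = [("a", "b"), ("b", "a")]

rng = random.Random(0)
for _ in range(1000):
    # Sample an arbitrary Markov reward function R[state][action].
    R = {s: {a: rng.uniform(-1, 1) for a in ACTIONS} for s in (0, 1)}
    # Additivity forces V(aa) + V(bb) == V(ab) + V(ba) for every Markov R ...
    assert abs(value(good[0], R) + value(good[1], R)
               - value(bad[0], R) - value(bad[1], R)) < 1e-9
    # ... so the two good policies can never both strictly beat both bad ones.
    assert not (min(value(p, R) for p in good) > max(value(p, R) for p in bad))
```

The identity V(aa) + V(bb) = V(ab) + V(ba) means the least good value can never exceed the greatest bad value, mirroring how the clockwise/counterclockwise task defeats Markov reward in the grid world.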
Our second main result shows that for any finite environment-task pair, there is a procedure that decides whether the task can be captured by a Markov reward function and, when it can, outputs such a reward function.
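Continuing a toy two-state episodic environment (states 0 and 1 visited in order, actions "a" and "b", policy value equal to the sum of collected rewards), the sketch below is a deliberately naive stand-in for such a procedure: it randomly searches for a witness reward. A found reward certifies that the SOAP is expressible; a failed search is merely suggestive, whereas a genuine decision procedure, which for tasks like these can be phrased as a linear feasibility problem over the reward values, would also certify nonexistence.

```python
import itertools
import random

ACTIONS = ("a", "b")
POLICIES = list(itertools.product(ACTIONS, repeat=2))  # one action per state

def value(policy, R):
    """Return of a deterministic policy under Markov reward R[state][action]."""
    return R[0][policy[0]] + R[1][policy[1]]

def search_reward(good, trials=5000, seed=0):
    """Randomly search for a Markov reward that makes every policy in `good`
    strictly better than every policy outside it; return it, or None."""
    bad = [p for p in POLICIES if p not in good]
    rng = random.Random(seed)
    for _ in range(trials):
        R = {s: {a: rng.uniform(-1, 1) for a in ACTIONS} for s in (0, 1)}
        if min(value(p, R) for p in good) > max(value(p, R) for p in bad):
            return R
    return None

# A SOAP with a single acceptable policy is easily expressible here:
assert search_reward([("a", "b")]) is not None
# The "same action in both states" SOAP resists the search; indeed no Markov
# reward exists, since V(aa) + V(bb) == V(ab) + V(ba) for every such R.
assert search_reward([("a", "a"), ("b", "b")]) is None
```

Replacing the random search with an exact feasibility check over the linear constraints "V(good) > V(bad)" is what turns this sketch into a decision procedure.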
While these results provide initial insights into the scope of the reward hypothesis, there is still much to be done to generalize these findings beyond finite environments, Markov rewards, and simple notions of “task” and “expressivity”. This work aims to offer new perspectives on reward and its role in reinforcement learning.