Reinforcement learning (RL) has made significant progress on real-world problems, and offline RL makes it even more practical: instead of interacting with the environment directly, algorithms can be trained from a single pre-recorded dataset. However, evaluating the resulting policies can still be costly and time-consuming.
For instance, when training robotic manipulators, robot time is usually scarce. Training multiple policies with offline RL on a single dataset is data-efficient compared to online RL, but evaluating each candidate policy still requires many interactions with the robot, which makes selecting the best one challenging.
To make RL more applicable to real-world domains such as robotics, a new approach called active offline policy selection (A-OPS) has been proposed. A-OPS leverages the pre-recorded dataset together with a limited number of interactions with the real environment to improve policy selection.
To minimize interactions with the real environment, A-OPS combines three key components:
1. Off-policy evaluation, such as fitted Q-evaluation (FQE), provides an initial estimate of each policy’s performance from the offline dataset alone. FQE scores have been shown to correlate well with ground-truth performance in a range of environments, including real-world robotics.
2. The returns of the policies are modeled jointly with a Gaussian process, which combines the FQE scores with a small number of episodic returns newly collected on the robot. Because the policies’ return distributions are correlated through the kernel, evaluating a single policy also refines the estimates for all the others.
3. Bayesian optimization is applied to prioritize the most promising policies for evaluation: policies with high predicted performance and large variance are evaluated first, which improves data efficiency.
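To make the first component concrete, here is a minimal tabular sketch of fitted Q-evaluation. It is not the implementation used in A-OPS (which relies on neural networks and continuous states); the tiny MDP layout, function name, and hyperparameters are illustrative assumptions. The key idea it captures is that the Bellman target bootstraps from the action the *evaluated* policy would take, not the action in the dataset:

```python
import numpy as np

def fitted_q_evaluation(transitions, policy, n_states, n_actions,
                        gamma=0.9, iters=200):
    """Tabular FQE sketch: iterate one-step Bellman regression on a fixed
    offline dataset of (state, action, reward, next_state, done) tuples."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next, done in transitions:
            target = r
            if not done:
                # Bootstrap with the action of the policy being evaluated.
                target += gamma * Q[s_next, policy[s_next]]
            targets[s, a] += target
            counts[s, a] += 1
        mask = counts > 0
        Q[mask] = targets[mask] / counts[mask]  # regression = mean in the tabular case
    return Q
```

A policy's FQE score is then `Q[s0, policy[s0]]` at the start state, which serves as the offline estimate of its return before any robot time is spent.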
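The second and third components can be sketched together: a Gaussian-process posterior over policy returns, with FQE scores as the prior mean, and an upper-confidence-bound rule that picks the next policy to run on the robot. The one-dimensional policy feature and RBF kernel below are placeholder assumptions; the actual method builds its kernel from policy similarity. A minimal numpy sketch:

```python
import numpy as np

def rbf_kernel(x, y, length_scale=1.0, variance=1.0):
    # Correlation between policies based on a scalar feature (a placeholder).
    d = x[:, None] - y[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(feats, obs_idx, obs_returns, prior_mean, noise=0.1):
    """Posterior mean/std of all policies' returns, given episodic returns
    observed for a subset of them; prior_mean holds the FQE scores."""
    K = rbf_kernel(feats, feats)
    Kxx = K[np.ix_(obs_idx, obs_idx)] + noise ** 2 * np.eye(len(obs_idx))
    Ksx = K[:, obs_idx]
    resid = np.asarray(obs_returns) - prior_mean[obs_idx]
    mean = prior_mean + Ksx @ np.linalg.solve(Kxx, resid)
    cov = K - Ksx @ np.linalg.solve(Kxx, Ksx.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def next_policy_to_evaluate(mean, std, beta=2.0):
    """UCB acquisition: prefer high predicted return and high uncertainty."""
    return int(np.argmax(mean + beta * std))
```

In use, the loop alternates: pick a policy with `next_policy_to_evaluate`, run one episode on the robot, append its return to the observations, and recompute the posterior. Because the kernel correlates policies, each new return shrinks the uncertainty of similar, not-yet-evaluated policies as well.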
The effectiveness of A-OPS has been demonstrated across a range of environments, including dm-control, Atari, and both simulated and real robotics. With A-OPS, regret decreases rapidly, and the best policy can be identified after a moderate number of evaluations.
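The regret here is typically *simple regret*: the gap between the true return of the best available policy and that of the policy ultimately selected. A one-line sketch (names are illustrative):

```python
def simple_regret(true_returns, selected_idx):
    # Gap between the best available policy and the one actually selected;
    # zero means the selection procedure found the best policy.
    return max(true_returns) - true_returns[selected_idx]
```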
The results suggest that offline policy selection can be performed effectively with only a small number of environment interactions, thanks to the offline data, a specialized policy kernel, and Bayesian optimization. The code for A-OPS is openly available on GitHub, along with an example dataset for experimentation.
In summary, A-OPS offers a practical solution for offline policy selection in reinforcement learning, enabling efficient evaluation of policies using limited interactions with the real environment.