Reinforcement learning has made substantial progress on real-world problems. Offline reinforcement learning, in particular, makes training far more data-efficient by learning policies from pre-recorded datasets. However, evaluating the resulting policies in the real environment remains time-consuming and costly.
To address this issue, a new approach called active offline policy selection (A-OPS) has been proposed. A-OPS selects the best policy for deployment while requiring as few interactions with the real environment as possible.
Three key components make this process efficient. First, off-policy policy evaluation, such as fitted Q-evaluation (FQE), provides an initial estimate of each policy's performance based on the offline dataset. Second, the returns of the policies are modeled jointly with a Gaussian process, which captures correlations between different policies. Third, Bayesian optimization prioritizes the evaluation of the more promising policies, as sketched below.
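The following is a minimal sketch of how these three pieces could fit together, not the released implementation. It assumes FQE estimates and per-policy feature vectors are already available; the names `fqe_estimates`, `policy_features`, `run_episode`, and `budget` are placeholders, the RBF kernel over policy features stands in for whatever policy-similarity kernel the method actually uses, and a simple upper-confidence-bound rule stands in for the acquisition function.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def a_ops_sketch(fqe_estimates, policy_features, run_episode, budget, beta=2.0, seed=0):
    """Actively select the best of K candidate policies under a small interaction budget.

    fqe_estimates:   (K,) off-policy value estimates, used as the prior mean.
    policy_features: (K, d) embeddings whose distances induce correlations between policies.
    run_episode:     callable(policy_index) -> observed episodic return (hypothetical hook).
    budget:          number of real-environment episodes allowed.
    """
    fqe_estimates = np.asarray(fqe_estimates, dtype=float)
    policy_features = np.asarray(policy_features, dtype=float)
    rng = np.random.default_rng(seed)
    K = len(fqe_estimates)
    evaluated_idx, observed_returns = [], []

    # Gaussian process over the residual between observed returns and the FQE prior mean.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel)

    for _ in range(budget):
        if evaluated_idx:
            X = policy_features[evaluated_idx]
            y = np.asarray(observed_returns) - fqe_estimates[evaluated_idx]
            gp.fit(X, y)
            mu_res, sigma = gp.predict(policy_features, return_std=True)
        else:
            mu_res, sigma = np.zeros(K), np.ones(K)

        # Upper-confidence-bound acquisition: evaluate the most promising policy next.
        ucb = fqe_estimates + mu_res + beta * sigma
        next_policy = int(np.argmax(ucb + 1e-9 * rng.standard_normal(K)))  # random tie-break

        evaluated_idx.append(next_policy)
        observed_returns.append(run_episode(next_policy))

    # Final posterior over all policies; recommend the highest posterior mean return.
    X = policy_features[evaluated_idx]
    y = np.asarray(observed_returns) - fqe_estimates[evaluated_idx]
    gp.fit(X, y)
    posterior_mean = fqe_estimates + gp.predict(policy_features)
    return int(np.argmax(posterior_mean))
```

Because the Gaussian process shares information across correlated policies, a single observed return also sharpens the estimates of policies that were never evaluated, which is what allows the budget to stay small.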
The effectiveness of this approach has been demonstrated in various environments, including dm-control, Atari, and real robotics. The results show that A-OPS rapidly reduces regret and identifies the best policy after only a small number of environment interactions.
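As a concrete illustration of the metric, simple regret is the gap between the true value of the best candidate policy and the true value of the policy currently recommended; the numbers below are purely hypothetical.

```python
import numpy as np

true_values = np.array([0.62, 0.81, 0.74, 0.90, 0.55])  # hypothetical per-policy returns
recommended = 2                                          # index chosen by the selector
simple_regret = true_values.max() - true_values[recommended]
print(simple_regret)  # 0.16, and it shrinks toward 0 as more promising policies get evaluated
```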
The open-source code for A-OPS is available on GitHub, along with an example dataset to experiment with. This approach shows promise in making offline policy selection more effective and efficient by combining offline data, a kernel that captures similarity between policies, and Bayesian optimization.