Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices
Woo, Jiin, Shi, Laixi, Joshi, Gauri, Chi, Yuejie
Offline RL (Levine et al., 2020), also known as batch RL, addresses the challenge of learning a near-optimal policy from offline datasets collected a priori, without further interaction with the environment. Fueled by the cost-effectiveness of using pre-collected datasets compared to real-time exploration, offline RL has received increasing attention. However, because no additional interaction with the environment is allowed, the performance of offline RL crucially depends on the quality of the offline datasets, where quality is determined by how thoroughly the state-action space is explored during data collection. Encouragingly, recent research (Li et al., 2022; Rashidinejad et al., 2021; Shi et al., 2022; Xie et al., 2021b) indicates that being conservative on unseen state-action pairs, known as the principle of pessimism, enables learning a near-optimal policy even with partial coverage of the state-action space, as long as the data distribution covers the state-action pairs visited by the optimal policy. However, acquiring high-quality datasets with good coverage of the optimal policy is challenging, because it requires the state-action visitation distribution induced by the behavior policy used for data collection to be close to that of the optimal policy. Alternatively, multiple datasets can be merged into one to compensate for one another's insufficient coverage, but this may be impractical when the offline datasets are scattered across agents and cannot easily be shared due to privacy and communication constraints.
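To make the pessimism principle concrete, the sketch below shows one standard tabular instantiation: empirical Q-estimates are penalized by a count-based term so that poorly covered state-action pairs look unattractive and are avoided by the learned policy. The function name `pessimistic_value_iteration`, the specific penalty form, and all parameters are illustrative assumptions, not the algorithm analyzed in this paper.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, transitions, gamma=0.99,
                                penalty_scale=1.0, num_iters=200):
    """A hedged sketch of pessimistic value iteration on offline statistics.

    counts:      (S, A) visit counts from the offline dataset
    rewards:     (S, A) empirical mean rewards, assumed in [0, 1]
    transitions: (S, A, S) empirical transition probabilities
    """
    V = np.zeros(counts.shape[0])
    # Count-based penalty: rarely visited (s, a) pairs receive a large
    # penalty, implementing pessimism toward poorly covered pairs.
    penalty = penalty_scale / np.sqrt(np.maximum(counts, 1.0))
    for _ in range(num_iters):
        Q = rewards + gamma * (transitions @ V) - penalty
        Q = np.clip(Q, 0.0, None)  # values stay nonnegative for rewards in [0, 1]
        V = Q.max(axis=1)
    return Q, V, Q.argmax(axis=1)  # pessimistic Q, values, greedy policy

# Toy usage with random offline statistics (S=5 states, A=2 actions).
S, A = 5, 2
rng = np.random.default_rng(0)
counts = rng.integers(0, 20, size=(S, A)).astype(float)
rewards = rng.random((S, A))
transitions = rng.random((S, A, S))
transitions /= transitions.sum(axis=-1, keepdims=True)
Q, V, policy = pessimistic_value_iteration(counts, rewards, transitions)
```

Under the single-policy coverage condition described above, the state-action pairs visited by the optimal policy are well represented in `counts`, so their penalties are small and the greedy policy extracted from the penalized Q-estimates remains near-optimal even though other parts of the state-action space are barely covered.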
arXiv.org Artificial Intelligence
Feb-8-2024