Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs
Dan Qiao, Ming Yin, Yu-Xiang Wang
arXiv.org Artificial Intelligence
In many real-world reinforcement learning (RL) tasks, limited computing resources make it challenging to apply fully adaptive algorithms that continually update the exploration policy. As a surrogate, it is more cost-effective to collect data in large batches using the current policy and make changes to the policy only after the entire batch is completed. For example, in a recommendation system [Afsar et al., 2021], it is easy to gather new data quickly, but deploying a new policy takes longer, as it requires significant computing and human resources. It is therefore not feasible to switch policies based on real-time data, as typical RL algorithms would require. A practical solution is to run several experiments in parallel and decide on policy updates only after the entire batch has been completed. Similar limitations occur in other RL-based applications such as healthcare [Yu et al., 2021], robotics [Kober et al., 2013], and new material design [Zhou et al., 2019], where the agent must minimize the number of policy updates while still learning an effective policy using a similar number of trajectories as fully adaptive methods. On the theoretical side, Bai et al. [2019] introduced the notion of switching cost, which measures the number of policy updates.
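For concreteness, the switching cost can be written as a simple count. A minimal formulation, assuming $K$ total episodes and writing $\pi_k$ for the policy deployed in episode $k$ (notation chosen here for illustration, not taken from the abstract):

% Switching cost: the number of episode boundaries at which the
% deployed policy changes, out of K episodes.
\[
  N_{\mathrm{switch}} \;\triangleq\; \sum_{k=1}^{K-1} \mathbb{1}\left\{ \pi_{k+1} \neq \pi_{k} \right\}.
\]

Under this count, a fully adaptive algorithm may switch after every episode, incurring a cost as large as $K-1$, whereas the logarithmic switching cost in the title refers to algorithms that update the policy only on the order of $\log K$ times (possibly up to additional problem-dependent factors) while remaining sample-efficient.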
Feb-24-2023