Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs
Qiao, Dan, Yin, Ming, Wang, Yu-Xiang
In many real-world reinforcement learning (RL) tasks, limited computing resources make it challenging to apply fully adaptive algorithms that continually update the exploration policy. As a surrogate, it is more cost-effective to collect data in large batches using the current policy and change the policy only after the entire batch is completed. For example, in a recommendation system [Afsar et al., 2021], new data can be gathered quickly, but deploying a new policy takes longer because it requires significant computing and human resources. It is therefore not feasible to switch policies based on real-time data, as typical RL algorithms would require. A practical solution is to run several experiments in parallel and decide on policy updates only after the entire batch has been completed. Similar limitations arise in other RL-based applications such as healthcare [Yu et al., 2021], robotics [Kober et al., 2013], and new material design [Zhou et al., 2019], where the agent must minimize the number of policy updates while still learning an effective policy from a similar number of trajectories as fully adaptive methods. On the theoretical side, Bai et al. [2019] introduced the notion of switching cost, which measures the number of policy updates.
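The switching cost described above counts how many times the deployed policy changes over the K episodes. A minimal sketch of that count (the function name and policy representation are illustrative, not from either paper):

```python
def global_switching_cost(policies):
    """Count the number of policy changes across K episodes.

    `policies` is the sequence of policies deployed in episodes 1..K;
    any comparable representation (e.g. a string label) works here.
    """
    return sum(1 for prev, curr in zip(policies, policies[1:]) if prev != curr)

# Example: 6 episodes during which the policy is updated twice.
print(global_switching_cost(["pi0", "pi0", "pi1", "pi1", "pi1", "pi2"]))  # -> 2
```

A fully adaptive algorithm may incur a switching cost of up to K - 1; low-switching algorithms aim to keep this count polylogarithmic in K while retaining near-optimal regret.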
A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost
Gao, Minbo, Xie, Tianle, Du, Simon S., Yang, Lin F.
Many real-world applications, such as those in medical domains and recommendation systems, can be formulated as large-state-space reinforcement learning problems with only a small budget for the number of policy changes, i.e., low switching cost. This paper focuses on the linear Markov Decision Process (MDP) recently studied in [Yang et al. 2019, Jin et al. 2020], where linear function approximation is used for generalization over the large state space. We present the first algorithm for linear MDPs with low switching cost. Our algorithm achieves an $\widetilde{O}\left(\sqrt{d^3H^4K}\right)$ regret bound with a near-optimal $O\left(dH\log K\right)$ global switching cost, where $d$ is the feature dimension, $H$ is the planning horizon, and $K$ is the number of episodes the agent plays. Our regret bound matches that of the best existing polynomial-time algorithm of [Jin et al. 2020], and our switching cost is exponentially smaller than theirs. When specialized to tabular MDPs, our switching cost bound improves those in [Bai et al. 2019, Zhang et al. 2020]. We complement our positive result with an $\Omega\left(dH/\log d\right)$ global switching cost lower bound for any no-regret algorithm.
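An $O(dH\log K)$ switching cost typically arises from a lazy-update rule: the policy is recomputed only when the determinant of the regularized feature covariance matrix $\Lambda$ has doubled since the last update, and since $\det(\Lambda)$ can only double $O(d\log K)$ times, so can the update count. A sketch of that doubling trigger under assumed synthetic features (the function name and inputs are mine, not the paper's algorithm):

```python
import numpy as np

def lazy_update_schedule(features, lam=1.0):
    """Simulate the determinant-doubling rule behind O(d log K) policy updates.

    `features`: (K, d) array with one feature vector per episode, an
    illustrative stand-in for state-action features collected online.
    An update is triggered only when det(Lambda) has doubled since the
    last update, so updates occur O(d log K) times rather than K times.
    """
    K, d = features.shape
    Lambda = lam * np.eye(d)                     # regularized covariance
    last_logdet = np.linalg.slogdet(Lambda)[1]   # log-det at last update
    update_episodes = []
    for k in range(K):
        phi = features[k]
        Lambda += np.outer(phi, phi)             # rank-one data update
        logdet = np.linalg.slogdet(Lambda)[1]
        if logdet > last_logdet + np.log(2.0):   # determinant has doubled
            update_episodes.append(k)
            last_logdet = logdet
    return update_episodes
```

With, say, K = 1000 random Gaussian features in d = 4 dimensions, the rule fires only a few dozen times, growing like $d\log K$ rather than linearly in $K$.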