Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning
Li, Yuanzhi, Wang, Ruosong, Yang, Lin F.
arXiv.org Artificial Intelligence
Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of H steps and receives a sequence of H random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy that maximizes the expected sum of these reward values rather than any single one of them. Since decisions made at early stages can significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when H = 1, RL reduces to the contextual bandits problem, in which it suffices to act myopically to achieve optimality. Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is H and the underlying MDP has unknown, time-invariant transition probabilities and rewards.
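To make the role of the horizon concrete, here is a minimal sketch of backward induction on a small tabular MDP with time-invariant transitions and rewards. It is an illustration only, not the paper's algorithm: it assumes the model is known (the paper studies the unknown-model case), and the function name `finite_horizon_values` and the two-state toy MDP are hypothetical.

```python
import numpy as np

def finite_horizon_values(P, R, H):
    """Backward induction for a known tabular MDP (illustrative assumption).

    P: array of shape (S, A, S), P[s, a, s2] = probability of moving s -> s2.
    R: array of shape (S, A), expected immediate reward.
    H: horizon length (number of steps per episode).
    Returns the optimal expected H-step reward sum per start state and
    a time-dependent greedy policy of shape (H, S).
    """
    S, A, _ = P.shape
    V = np.zeros(S)                      # value with zero steps remaining
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                    # (S, A): reward now plus expected future value
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy

# Hypothetical toy MDP: in state 0, action 0 pays 1 and stays put, while
# action 1 pays 0 but moves to state 1, where every step pays 2.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[1.0, 0.0],
              [2.0, 2.0]])

V1, pi1 = finite_horizon_values(P, R, H=1)
V3, pi3 = finite_horizon_values(P, R, H=3)
```

With H = 1 the greedy choice in state 0 is the myopic action 0, as in a contextual bandit. With H = 3 the first action in state 0 switches to action 1, which gives up immediate reward to reach the better state.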
Oct-31-2021
- Country:
- North America > United States
- Pennsylvania > Allegheny County
- Pittsburgh (0.04)
- California > Los Angeles County
- Long Beach (0.04)
- Europe > United Kingdom
- England > Greater London > London (0.04)
- Asia > Middle East
- Jordan (0.04)
- Genre:
- Research Report > New Finding (0.46)