Efficiently Solving MDPs with Stochastic Mirror Descent
Markov decision processes (MDPs) are a fundamental mathematical abstraction for sequential decision making under uncertainty and they serve as a basic modeling tool in reinforcement learning (RL) and stochastic control [5, 24, 30]. Two prominent classes of MDPs are average-reward MDPs (AMDPs) and discounted MDPs (DMDPs). Each have been studied extensively; AMDPs are applicable to optimal control, learning automata, and various real-world reinforcement learning settings [17, 3, 22] and DMDPs have a number of nice theoretical properties including reward convergence and operator monotonicity [6]. In this paper we consider the prevalent computational learning problem of finding an approximately optimal policy of an MDP given only restricted access to the model. In particular, we consider the problem of computing an ɛ-optimal policy, i.e. a policy with an additive ɛ error in expected cumulative reward over infinite horizon, under the standard assumption of a generative model [14, 13], which allows one to sample from state-transitions given the current state-action pair. This problem is well-studied and there are multiple known upper and lower bounds on its sample complexity [4, 32, 28, 31]. In this work, we provide a unified framework based on primal-dual stochastic mirror descent (SMD) for learning an ɛ-optimal policies for both AMDPs and DMDPs with a generative model.
Aug-28-2020
- Country:
- North America > United States
- Massachusetts > Middlesex County
- Belmont (0.04)
- California > Santa Clara County
- Palo Alto (0.04)
- Massachusetts > Middlesex County
- Europe > United Kingdom
- England > Greater London > London (0.04)
- North America > United States
- Genre:
- Research Report (0.64)
- Industry:
- Education (0.48)
- Information Technology (0.34)