Structural Return Maximization for Reinforcement Learning
Joshua Joseph, Javier Velez, Nicholas Roy
Reinforcement Learning (RL) (Sutton & Barto, 1998) is a framework for sequential decision making under uncertainty, with the objective of finding a policy that maximizes the sum of rewards, or return, of an agent. A straightforward model-based approach to batch RL, where the algorithm learns a policy from a fixed set of data, is to fit a dynamics model by minimizing a form of prediction error (e.g., mean squared error) and then compute the optimal policy with respect to the learned model (Bertsekas, 2000). As discussed in Baxter & Bartlett (2001) and Joseph et al. (2013), learning a model for RL by minimizing prediction error can result in a policy that performs arbitrarily poorly for unfavorably chosen model classes. To overcome this limitation, a second approach is to forgo the model and directly learn a policy, from a fixed policy class, that explicitly maximizes an estimate of return (Meuleau et al., 2000). With limited data, however, approaches that explicitly maximize estimated return are vulnerable to learning poorly performing policies, since the return cannot be confidently estimated.
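A minimal sketch of the model-based batch RL pipeline described above: (1) fit a dynamics model to a fixed batch of transitions by minimizing squared prediction error, then (2) compute the policy that is optimal with respect to the learned model. The 1-D linear system, the quadratic reward, and all variable names are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) 1-D dynamics: s' = a*s + b*u + noise.
# The batch of transitions (s, u, s') is fixed; no further interaction.
a_true, b_true = 0.9, 0.5
states = rng.uniform(-1.0, 1.0, size=200)
actions = rng.uniform(-1.0, 1.0, size=200)
next_states = a_true * states + b_true * actions \
    + 0.01 * rng.standard_normal(200)

# Step 1: fit the dynamics model by minimizing squared prediction error.
X = np.column_stack([states, actions])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, next_states, rcond=None)

# Step 2: compute the optimal policy with respect to the *learned* model.
# With reward r(s') = -(s')^2, the greedy action drives the model's
# predicted next state to zero.
def policy(s):
    return -a_hat * s / b_hat

s = 0.8
print(f"fit: a={a_hat:.3f}, b={b_hat:.3f}, action at s=0.8: {policy(s):.3f}")
```

Note that step 2 trusts the fitted model completely; as the abstract points out, a model class that fits the data well in prediction error can still induce an arbitrarily poor policy.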
May-11-2014