Reinforcement Learning
Approximate Policy Iteration with a Policy Language Bias
Fern, Alan, Yoon, Sungwook, Givan, Robert
We explore approximate policy iteration, replacing the usual costfunction learningstep with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planningdomains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs.
Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
Tanaka, Saori C., Doya, Kenji, Okada, Go, Ueda, Kazutaka, Okamoto, Yasumasa, Yamawaki, Shigeto
To understand the brain mechanisms involved in reward prediction on different time scales, we developed a Markov decision task that requires prediction of both immediate and future rewards, and analyzed subjects'brain activities using functional MRI. We estimated the time course of reward prediction and reward prediction error on different time scales from subjects' performance data, and used them as the explanatory variables for SPM analysis. We found topographic mapsof different time scales in medial frontal cortex and striatum. The result suggests that different cortico-basal ganglia loops are specialized for reward prediction on different time scales.
A Convergent Form of Approximate Policy Iteration
Perkins, Theodore J., Precup, Doina
We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a "policy improvement operator" to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
Convergent Combinations of Reinforcement Learning with Linear Function Approximation
Schoknecht, Ralf, Merke, Artur
Convergence for iterative reinforcement learning algorithms like TD(O) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible.
Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
Wang, Xiaofeng, Sandholm, Tuomas
Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exaccerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforfcement learning involves two interrelated problems: identifying the game and learning to play.
Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function.
Learning a Forward Model of a Reflex
Porr, Bernd, Wörgötter, Florentin
We develop a systems theoretical treatment of a behavioural system that interacts with its environment in a closed loop situation such that its motor actions influence its sensor inputs. The simplest form of a feedback is a reflex. Reflexes occur always "too late"; i.e., only after a (unpleasant, painful, dangerous) reflex-eliciting sensor event has occurred. This defines an objective problem which can be solved if another sensor input exists which can predict the primary reflex and can generate an earlier reaction. In contrast to previous approaches, our linear learning algorithm allows for an analytical proof that this system learns to apply feedforward control with the result that slow feedback loops are replaced by their equivalent feed-forward controller creating a forward model. In other words, learning turns the reactive system into a proactive system. By means of a robot implementation we demonstrate the applicability of the theoretical results which can be used in a variety of different areas in physics and engineering.
Timing and Partial Observability in the Dopamine System
Daw, Nathaniel D., Courville, Aaron C., Touretzky, David S.
According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.
Convergent Combinations of Reinforcement Learning with Linear Function Approximation
Schoknecht, Ralf, Merke, Artur
Convergence for iterative reinforcement learning algorithms like TD(O) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows to analyse if a certain reinforcement learning algorithm and a certain function approximator are compatible.