Doya, Kenji
Robust Reinforcement Learning
Morimoto, Jun, Doya, Kenji
This paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular both for off-line learning by simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, often unwanted results. Based on the theory of H∞ control, we consider a differential game in which a 'disturbing' agent (disturber) tries to make the worst possible disturbance while a 'control' agent (actor) tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function.
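As a rough illustration of the min-max formulation described in the abstract (not the paper's algorithm), the sketch below runs zero-sum value iteration on a discretized toy 1-D plant: the actor picks the control that maximizes, and the disturber picks the disturbance that minimizes, the one-step return, with the reward trading off output deviation against a squared-disturbance term. The dynamics, the discretization, and the weight `gamma_sq` are assumptions made for this sketch.

```python
import numpy as np

# Illustrative only: a tiny discretized zero-sum "actor vs. disturber" value
# iteration, echoing the min-max value function described above. The plant,
# reward weights, and `gamma_sq` are assumptions, not the paper's setup.

xs = np.linspace(-2.0, 2.0, 41)          # discretized state
us = np.linspace(-1.0, 1.0, 5)           # actor's control candidates
ws = np.linspace(-0.5, 0.5, 5)           # disturber's candidates
dt, discount, gamma_sq = 0.1, 0.95, 4.0  # gamma_sq weights the disturbance norm

V = np.zeros_like(xs)

def nearest(x):
    """Index of the grid point closest to x (crude state aggregation)."""
    return np.argmin(np.abs(xs - x))

for _ in range(200):
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        best_u = -np.inf
        for u in us:                      # actor maximizes ...
            worst_w = np.inf
            for w in ws:                  # ... against the minimizing disturber
                x_next = np.clip(x + (u + w) * dt, xs[0], xs[-1])
                # penalty on deviation and control, bonus on disturbance norm
                r = -(x**2 + u**2) + gamma_sq * w**2
                worst_w = min(worst_w, r * dt + discount * V[nearest(x_next)])
            best_u = max(best_u, worst_w)
        V_new[i] = best_u
    V = V_new
```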
Efficient Nonlinear Control with Actor-Tutor Architecture
Doya, Kenji
A new reinforcement learning architecture for nonlinear control is proposed. A direct feedback controller, or the actor, is trained by a value-gradient based controller, or the tutor. This architecture enables both efficient use of the value function and simple computation for real-time implementation. Good performance was verified in multidimensional nonlinear control tasks using Gaussian softmax networks.
Efficient Nonlinear Control with Actor-Tutor Architecture
Doya, Kenji
It was demonstrated in the simulation of a pendulum swing-up task that the value-gradient based control scheme requires far fewer learning trials than the conventional "actor-critic" control scheme (Barto et al., 1983). In the actor-critic scheme, the actor, a direct feedback controller, improves its control policy stochastically using the TD error as the effective reinforcement (Figure 1a). Despite its relatively slow learning, the actor-critic architecture has the virtue of simple computation in generating the control command. In order to train a direct controller while making efficient use of the value function, we propose a new reinforcement learning scheme which we call the "actor-tutor" architecture (Figure 1b).
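A schematic sketch of the actor-tutor idea, under assumed ingredients rather than the paper's implementation: a stand-in value function on a 1-D state grid, a tanh-bounded "tutor" command derived from the value gradient, and a Gaussian-feature actor regressed toward the tutor's output. All constants and names below are assumptions for illustration.

```python
import numpy as np

# Schematic actor-tutor sketch (illustrative assumptions, not the paper's model).
xs = np.linspace(-np.pi, np.pi, 101)
V = -np.cos(xs)                      # stand-in value function (peaks at x = ±pi)
dV_dx = np.gradient(V, xs)           # value gradient along the state axis
u_max, gain, dt, lr = 1.0, 5.0, 0.02, 0.1

def tutor(x):
    """Bounded control command pointing up the value gradient."""
    g = np.interp(x, xs, dV_dx)
    return u_max * np.tanh(gain * g)

# Actor: direct controller u = w^T phi(x) with Gaussian features over the state.
centers = np.linspace(-np.pi, np.pi, 15)
phi = lambda x: np.exp(-0.5 * ((x - centers) / 0.4) ** 2)
w = np.zeros_like(centers)

x = -2.5
for step in range(2000):
    u_t = tutor(x)                   # tutor's value-gradient-based command
    u_a = w @ phi(x)                 # actor's (direct controller) command
    w += lr * (u_t - u_a) * phi(x)   # regress the actor toward the tutor
    x = np.clip(x + u_a * dt, -np.pi, np.pi)
```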
Temporal Difference Learning in Continuous Time and Space
Doya, Kenji
Elucidation of the relationship between TD learning and dynamic programming (DP) has provided good theoretical insights (Barto et al., 1995). However, conventional TD algorithms were based on discrete-time, discrete-state formulations. In applying these algorithms to control problems, time, space, and action had to be appropriately discretized using a priori knowledge or by trial and error. Furthermore, when a TD algorithm is used for neurobiological modeling, discrete-time operation is often very unnatural. There have been several attempts to extend TD-like algorithms to continuous cases. Bradtke et al. (1994) showed convergence results for DP-based algorithms for a discrete-time, continuous-state linear system with a quadratic cost. Bradtke and Duff (1995) derived TD-like algorithms for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the "advantage updating" algorithm by modifying Q-learning so that it works with arbitrarily small time steps.
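A minimal sketch of the continuous-time flavor of TD learning, under assumptions made for illustration: the value is discounted with a time constant tau, and the TD error combines the instantaneous reward, the value scaled by 1/tau, and a finite-difference estimate of dV/dt, so the update remains meaningful as the time step shrinks. The toy dynamics and constants are not taken from the paper.

```python
import numpy as np

# Illustrative continuous-time TD sketch on a 1-D toy state (assumed setup).
xs = np.linspace(-1.0, 1.0, 51)
V = np.zeros_like(xs)                    # tabular value over the state grid
tau, dt, lr = 1.0, 0.01, 0.5

def idx(x):
    return np.argmin(np.abs(xs - x))

rng = np.random.default_rng(0)
x = 0.0
for step in range(20000):
    u = rng.uniform(-1.0, 1.0)           # exploratory control input
    x_next = np.clip(x + u * dt, -1.0, 1.0)
    r = -x_next**2                       # running cost as negative reward
    # continuous-time TD error: delta = r - V/tau + dV/dt (finite difference)
    dV_dt = (V[idx(x_next)] - V[idx(x)]) / dt
    delta = r - V[idx(x)] / tau + dV_dt
    V[idx(x)] += lr * delta * dt         # update scaled by the time step
    x = x_next
```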
Dynamics of Attention as Near Saddle-Node Bifurcation Behavior
Nakahara, Hiroyuki, Doya, Kenji
Most studies of attention have focused on the selection process of incoming sensory cues (Posner et al., 1980; Koch et al., 1985; Desimone et al., 1995). Emphasis was placed on the phenomenon that the same sensory stimuli can give rise to different percepts. However, the selection of sensory input itself is not the final goal of attention. We consider attention as a means for goal-directed behavior and survival of the animal. In this view, the dynamical properties of attention are crucial. While attention has to be maintained long enough to enable a robust response to sensory input, it also has to be shifted quickly to a novel cue that is potentially important. Long-term maintenance and quick transition are critical requirements for attention dynamics.
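As a toy illustration of the dynamical property named in the title (and not the paper's model), the sketch below integrates the saddle-node normal form dx/dt = mu + x^2 just past the bifurcation: the state lingers near the "ghost" of the vanished fixed point for a long time (maintenance) and then escapes rapidly (quick transition).

```python
import numpy as np

# Toy saddle-node illustration (assumed normal form, not the paper's network).
def dwell_time(mu, x0=-1.0, x_escape=1.0, dt=1e-3, t_max=500.0):
    """Time until the state crosses x_escape under dx/dt = mu + x**2."""
    x, t = x0, 0.0
    while x < x_escape and t < t_max:
        x += (mu + x**2) * dt
        t += dt
    return t

for mu in (0.001, 0.01, 0.1):
    print(f"mu = {mu:5.3f}  escape time ~ {dwell_time(mu):6.1f}")
# The escape time grows roughly like 1/sqrt(mu) close to the bifurcation:
# long maintenance near the ghost, yet a fast switch once the input changes.
```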
A Novel Reinforcement Model of Birdsong Vocalization Learning
Doya, Kenji, Sejnowski, Terrence J.
Songbirds learn to imitate a tutor song through auditory and motor learning. We have developed a theoretical framework for song learning that accounts for response properties of neurons that have been observed in many of the nuclei that are involved in song learning. Specifically, we suggest that the anterior forebrain pathway, which is not needed for song production in the adult but is essential for song acquisition, provides synaptic perturbations and adaptive evaluations for syllable vocalization learning. A computer model based on reinforcement learning was constructed that could replicate a real zebra finch song with 90% accuracy based on a spectrographic measure.
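A minimal weight-perturbation sketch in the spirit of the framework described above, with everything concrete (dimensions, template, update rule) assumed for illustration: noise added to the motor weights plays the role of synaptic perturbation, an evaluator scores the produced output against a stored tutor template (adaptive evaluation), and perturbations that raise the score are consolidated.

```python
import numpy as np

# Illustrative perturbation-and-evaluation sketch (assumed toy model).
rng = np.random.default_rng(0)
n_syllables, n_motor = 8, 20
template = rng.normal(size=(n_syllables, n_motor))    # stand-in tutor song
W = np.zeros((n_syllables, n_motor))                  # motor weights per syllable
sigma, lr = 0.1, 2.0

def evaluate(output):
    """Score the produced output by its similarity to the tutor template."""
    return -np.mean((output - template) ** 2)

baseline = evaluate(W)
for trial in range(10000):
    noise = sigma * rng.normal(size=W.shape)          # synaptic perturbation
    score = evaluate(W + noise)                       # sing with perturbed weights
    W += lr * (score - baseline) * noise              # reinforce helpful perturbations
    baseline += 0.1 * (score - baseline)              # running evaluation baseline

print("similarity before:", evaluate(np.zeros_like(W)), " after:", evaluate(W))
```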