Neural Information Processing Systems

We thank the reviewers for the detailed comments, suggestions, and positive assessment of our work. We will correct the color schemes in all figures (R1), and we have made the figure captions cleaner (R3). We have also added a description of the setup to the paper. In the final version of our paper, we will clarify the details in Section 3 (R2) and make the intuition in the methods section much clearer. In Fig. 5 (left), DisCor actually outperforms Unif(s, a) on these environments.


Steady State Analysis of Episodic Reinforcement Learning

Huang, Bojun

arXiv.org Artificial Intelligence

This paper proves that the episodic learning environment of every finite-horizon decision task has a unique steady state under any behavior policy, and that the marginal distribution of the agent's input indeed approaches the steady-state distribution in essentially all episodic learning processes. This observation supports an interestingly reversed mindset against conventional wisdom: while steady states are usually presumed to exist in continual learning and are considered less relevant in episodic learning, it turns out they are guaranteed to exist for the latter. Based on this insight, the paper further develops connections between episodic and continual RL for several important concepts that have been treated separately in the two RL formalisms. Practically, the existence of a unique and approachable steady state enables a general, reliable, and efficient way to collect data in episodic RL tasks, which the paper applies to policy gradient algorithms as a demonstration, based on a new steady-state policy gradient theorem. The paper also proposes and empirically evaluates a perturbation method that facilitates rapid mixing in real-world tasks.
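The core observation can be illustrated with a tiny simulation. Below is a minimal sketch (the three-state task and its transition probabilities are made up for illustration, not taken from the paper): treating "episode reset" as an ordinary transition back to the start state turns an episodic process into a regular Markov chain, whose stationary distribution the empirical state-visitation frequencies then approach.

```python
import numpy as np

# Hypothetical 3-state episodic task: state 2 is "terminal".
# Modeling the reset as a transition s2 -> s0 makes the whole
# episodic process one Markov chain with transition matrix P.
P = np.array([
    [0.5, 0.5, 0.0],   # from s0: stay, or move to s1
    [0.3, 0.0, 0.7],   # from s1: back to s0, or terminate
    [1.0, 0.0, 0.0],   # "terminal" s2: reset to s0
])

# Exact steady state: left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

# Empirical check: simulate a long run and count state visits.
rng = np.random.default_rng(0)
s, counts = 0, np.zeros(3)
for _ in range(200_000):
    counts[s] += 1
    s = rng.choice(3, p=P[s])
freq = counts / counts.sum()
print(np.round(pi, 3), np.round(freq, 3))  # empirical ~ exact
```

The empirical frequencies match the eigenvector closely, which is the "unique and approachable steady state" property in miniature.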


Reinforcement Learning -- Generalisation of Off-Policy Learning

#artificialintelligence

So far, we have extended our reinforcement learning discussion from discrete states to continuous states and elaborated a bit on applying tile coding to on-policy learning, that is, learning that follows the trajectory the agent takes. Now let's talk about off-policy learning in continuous settings. While in discrete settings on-policy learning can easily be generalised to off-policy learning (say, from Sarsa to Q-learning), in continuous settings the generalisation can be a little troublesome and in some scenarios can cause divergence issues. The most prominent consequence is that off-policy learning may not necessarily converge in continuous settings. The major reason is that the distribution of updates in the off-policy case does not follow the on-policy distribution; that is, the state-action pairs used for updates may not be the ones the agent actually visits.
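This divergence is easy to reproduce. Below is a minimal sketch of the well-known two-state counterexample (in the spirit of Sutton and Barto's textbook example; the step size and discount chosen here are illustrative): two states with feature values 1 and 2 share a single weight w, so v(s1) = w and v(s2) = 2w. An off-policy update distribution that keeps updating the transition s1 -> s2 without ever updating s2 makes semi-gradient TD(0) blow up.

```python
# Semi-gradient TD(0) with linear function approximation,
# updating only the s1 -> s2 transition (reward 0).
gamma, alpha = 0.99, 0.1
w = 1.0
history = [w]
for _ in range(100):
    # TD error: delta = r + gamma * v(s2) - v(s1) = (2*gamma - 1) * w
    delta = 0.0 + gamma * (2 * w) - (1 * w)
    w += alpha * delta * 1  # feature of s1 is 1
    history.append(w)

print(history[:3], history[-1])  # w grows without bound
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1) > 1 whenever gamma > 0.5, so w diverges geometrically; under the on-policy distribution, updates at s2 would pull it back.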