

MuJoCo environment


Details

Neural Information Processing Systems

Training is stalled while the replay buffer is smaller than the minibatch size, i.e., while |B| < M. Algorithms 3 and 4 show the critic network update and the update of the actor network and uncertainty parameter sampler, respectively. Although we write the gradient-based update in the form of a mini-batch stochastic gradient update for simplicity, in practice we employ an adaptive approach such as Adam [16]. The update of p_k follows an exponential moving average with momentum 1/T_last, where T_last is the number of steps spent in the last episode (T_last is set to 1000 for the first episode). The reason behind this design choice is as follows. A short episode indicates that a bad uncertainty parameter ω was used in the last episode.
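The buffer-size check and the moving-average update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `target` value that p_k is moved toward, and the convention that the momentum 1/T_last weights the new value are all assumptions, since the excerpt does not specify them.

```python
def can_update(buffer_size: int, minibatch_size: int) -> bool:
    """Return False while |B| < M, i.e. while training is stalled."""
    return buffer_size >= minibatch_size


def ema_update(p_k: float, target: float, t_last: int) -> float:
    """Exponential moving average with momentum 1 / T_last.

    Assumption: the momentum weights the new value `target`
    (a hypothetical quantity), so a shorter last episode
    produces a larger change in p_k.
    """
    momentum = 1.0 / t_last
    return (1.0 - momentum) * p_k + momentum * target


# First episode: T_last is set to 1000, so p_k barely moves.
p_after_long = ema_update(p_k=0.5, target=1.0, t_last=1000)

# A short last episode (here 10 steps) moves p_k much more.
p_after_short = ema_update(p_k=0.5, target=1.0, t_last=10)
```

Under this convention, a short episode yields a larger step, which is consistent with the stated reasoning that a short episode signals a bad uncertainty parameter.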



Derivations

Neural Information Processing Systems

Lemma 1 (Ensemble Sample Diversity Decomposition) Given the state-action visit distribution of the ensemble policy ρ, the entropy of this distribution is H(ρ). By definition, I(ρ; z) = H(ρ) − H(ρ|z) = H(z) − H(z|ρ) (4). Since the latent variable z is selected at random, we consider H(z) to be a constant that depends on the number of values of z. Lemma 3 Let X_1, X_2, ..., X_N be an infinite sequence of i.i.d. random variables. The PDF of X_{N:N} can be derived by taking the derivative of its CDF.
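The two forms of the mutual information in Lemma 1 can be checked numerically on a small discrete example. The sketch below assumes that ρ is represented as the marginal over a finite set of state-action pairs and that z takes finitely many values; the joint table and the function names are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information_two_ways(joint):
    """joint[i][j] = P(z = i, state-action pair = j).

    Returns (H(rho) - H(rho|z), H(z) - H(z|rho)).
    """
    p_z = [sum(row) for row in joint]            # marginal over z
    p_rho = [sum(col) for col in zip(*joint)]    # marginal over pairs
    h_joint = entropy([p for row in joint for p in row])
    h_rho_given_z = h_joint - entropy(p_z)       # chain rule
    h_z_given_rho = h_joint - entropy(p_rho)     # chain rule
    return (entropy(p_rho) - h_rho_given_z,
            entropy(p_z) - h_z_given_rho)


# Hypothetical joint table: two latent values, three state-action pairs.
joint = [[0.30, 0.10, 0.10],
         [0.05, 0.15, 0.30]]
mi_a, mi_b = mutual_information_two_ways(joint)
```

Both returned values agree up to floating-point error, which is the symmetry used in equation (4).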






60cb558c40e4f18479664069d9642d5a-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the reviewers for the time and expertise invested in these reviews. A: We are sorry that some abuse of notation in the paper hinders the understanding of our method. A: Such an assumption comes from an empirical observation that in robotics control problems, some key poses in different dynamics are still alike.