Supplementary material: Enhanced Meta Reinforcement Learning using Demonstrations in Sparse Reward Environments

Neural Information Processing Systems 

We will use the well-known Performance Difference Lemma [16] in our analysis. We can obtain a performance difference lemma for the meta-policies as follows. Here, (a) follows from Assumption 3.1, from which we have P [...]

In this section, we describe all the simulation and real-world environments in detail.

B.1 Simulation Environments

Point2D Navigation: Point2D Navigation [9] is a two-dimensional goal-reaching environment with $\mathcal{S} \subset \mathbb{R}^2$, $\mathcal{A} \subset \mathbb{R}^2$, and the following dynamics:

$x_{t+1} = x_t + dx_t, \quad y_{t+1} = y_t + dy_t, \quad \text{such that } dx_t^2 + dy_t^2 \le 0.1^2,$

where $x_t$ and $y_t$ are the x and y locations of the agent, and $dx_t$ and $dy_t$ are the actions, which correspond to the displacement in the x and y directions respectively, all taken at time step $t$. The goals are located on a semicircle of radius 2, and the episode terminates when the agent reaches the goal or spends more than 100 time steps in the environment.
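To make the 2D point navigation dynamics described above concrete, the following is a minimal Python sketch, not the authors' implementation. The step bound of 0.1, the goal radius of 2, and the 100-step limit come from the text. Everything else is an assumption made for illustration: the class name `Point2DNav`, the start position at the origin, the goal tolerance `GOAL_TOL`, the choice of the upper half-plane for the semicircle, rescaling of over-long actions onto the allowed disc, reading the step limit as a cap of 100 transitions, and the helper `rollout_towards_goal`.

```python
import math
from dataclasses import dataclass

MAX_STEP = 0.1      # from the text: dx_t^2 + dy_t^2 <= 0.1^2
GOAL_RADIUS = 2.0   # from the text: goals lie on a semicircle of radius 2
HORIZON = 100       # from the text; read here as a cap of 100 transitions (assumption)
GOAL_TOL = 0.05     # ASSUMPTION: distance at which the goal counts as reached


@dataclass
class Point2DNav:
    goal_angle: float   # ASSUMPTION: an angle in [0, pi] selects the task
    x: float = 0.0      # ASSUMPTION: the agent starts at the origin
    y: float = 0.0
    t: int = 0

    @property
    def goal(self):
        """Goal position on the semicircle of radius GOAL_RADIUS."""
        return (GOAL_RADIUS * math.cos(self.goal_angle),
                GOAL_RADIUS * math.sin(self.goal_angle))

    def step(self, dx, dy):
        """Apply x_{t+1} = x_t + dx_t and y_{t+1} = y_t + dy_t."""
        # ASSUMPTION: actions outside the allowed disc are rescaled onto it.
        norm = math.hypot(dx, dy)
        if norm > MAX_STEP:
            dx, dy = dx * MAX_STEP / norm, dy * MAX_STEP / norm
        self.x += dx
        self.y += dy
        self.t += 1
        gx, gy = self.goal
        reached = math.hypot(self.x - gx, self.y - gy) <= GOAL_TOL
        done = reached or self.t >= HORIZON
        return (self.x, self.y), reached, done


def rollout_towards_goal(env):
    """Hypothetical greedy controller that always heads straight for the goal."""
    reached = done = False
    while not done:
        gx, gy = env.goal
        _, reached, done = env.step(gx - env.x, gy - env.y)
    return reached, env.t
```

Calling `rollout_towards_goal(Point2DNav(goal_angle=math.pi / 3))` drives the agent along a straight line to its goal. Because each step covers at most 0.1 and the goal is 2 away from the assumed start, an agent that knows the goal needs about 20 steps, well inside the 100-step limit.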
