

Algorithm 1 shows the pseudocode of LIAM.

Algorithm 1 Pseudocode of LIAM
for m = 1, ..., M episodes do
    Reset the hidden state of the encoder LSTM
    Sample E fixed policies from Π
    Create E parallel environments and gather initial observations

The fixed policies in predator-prey consist of a combination of heuristic and pretrained policies. We first created four heuristic policies: (i) going after the prey, (ii) going after one of the predators, (iii) going after the agent (predator or prey) that is closest, and (iv) going after the predator that is closest.

CARL has access to the trajectories of all the other agents in the environment during training, but during execution only to the local trajectory. To extract such representations, we use self-supervised learning based on recent advances in contrastive learning [Oord et al., 2018, He et al., 2020, Chen et al., 2020a,b]. During training, given a batch of episode trajectories, we construct the positive and negative pairs following Equation (4) and minimise the InfoNCE loss [Oord et al., 2018].

Following the work of Chung et al. [2015], we can write the lower bound on the log-evidence of the observed trajectories. We train LIAM-VAE similarly to LIAM.
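This lower bound has the standard recurrent-VAE form of Chung et al. [2015], reproduced below, where x_t denotes the modelled trajectory at step t and z_t the per-step latent variable; the exact conditioning variables used by LIAM-VAE are an assumption here, taken from the cited work rather than from this supplement:

\log p(x_{\le T}) \ge \mathbb{E}_{q(z_{\le T} \mid x_{\le T})}\left[ \sum_{t=1}^{T} \log p(x_t \mid z_{\le t}, x_{<t}) - \mathrm{KL}\big( q(z_t \mid x_{\le t}, z_{<t}) \,\|\, p(z_t \mid x_{<t}, z_{<t}) \big) \right]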
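Returning to Algorithm 1, the following is a minimal Python sketch of the episode set-up it describes, under the assumption of a standard parallel-environment interface; encoder, policy_pool, and make_env are hypothetical names, not the authors' implementation:

import random

def train_liam(encoder, policy_pool, make_env, num_episodes, num_envs):
    for m in range(num_episodes):
        # Reset the hidden state of the encoder LSTM.
        hidden = encoder.initial_hidden(batch_size=num_envs)
        # Sample E fixed policies from the pool Pi (with replacement here).
        fixed_policies = random.choices(policy_pool, k=num_envs)
        # Create E parallel environments, one per sampled policy,
        # and gather the initial observations.
        envs = [make_env(policy) for policy in fixed_policies]
        observations = [env.reset() for env in envs]
        # The remainder of the episode loop is truncated in the source.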
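The four heuristic predator policies admit a compact target-selection sketch; the code below assumes agents expose x/y coordinates, and all names (dist, heuristic_target) are illustrative rather than the authors' implementation:

import math

def dist(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)

def heuristic_target(kind, me, prey, predators):
    # `predators` is the list of the other predators, excluding `me`.
    if kind == "chase_prey":                 # (i) go after the prey
        return prey
    if kind == "chase_one_predator":         # (ii) go after one of the predators
        return predators[0]
    if kind == "chase_closest_agent":        # (iii) closest agent, prey or predator
        return min(predators + [prey], key=lambda a: dist(me, a))
    if kind == "chase_closest_predator":     # (iv) closest predator
        return min(predators, key=lambda a: dist(me, a))
    raise ValueError(kind)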
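For the InfoNCE loss minimised by CARL, a minimal PyTorch sketch is given below; the batch layout (the positive for each query sits on the diagonal, and all other keys in the batch act as negatives) and the temperature value are assumptions, standing in for the paper's exact pair construction in Equation (4):

import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    # queries, keys: (N, D) tensors; keys[i] is the positive for queries[i].
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature              # (N, N) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    # Cross-entropy against the diagonal implements -log softmax over
    # one positive and N - 1 negatives per query (Oord et al., 2018).
    return F.cross_entropy(logits, labels)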
