A Proofs of Theorems

Neural Information Processing Systems 

We use the agent's trajectories to construct training samples. The main differences between our method and that of Savinov et al. are: 1) we use trajectories sampled by multiple policies to construct training samples, while they only use trajectories sampled by one specific policy; 2) we use an adjacency matrix to explicitly aggregate the adjacency information and sample training pairs based on the adjacency matrix, while they sample training pairs directly from trajectories. However, it is hard for the method of Savinov et al. to handle this situation, as these two

We provide Algorithm 1 to show the training procedure of HRAC. Each episode has a maximum length of 200. The trajectory buffer B is maintained as follows:

Clear B.
for n = 1 to N do
    Reset the environment and sample the initial state s.
    Store the sampled trajectory in B.

We visualize the LLE of state embeddings and the two adjacency distance heatmaps produced by both methods in Figure 11(b) and Figure 11(c), respectively.
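The pair-sampling scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the state labels, the adjacency threshold `k`, and the helper functions `build_adjacency` and `sample_pair` are all hypothetical names introduced here. It shows the two distinguishing points: adjacency information is aggregated into one structure across trajectories from multiple policies, and training pairs are then sampled from that structure rather than from individual trajectories.

```python
import random
from collections import defaultdict

def build_adjacency(trajectories, k):
    """Aggregate adjacency over all trajectories (possibly sampled by
    different policies): mark two states adjacent if they appear within
    k steps of each other in any trajectory."""
    adj = defaultdict(set)
    for traj in trajectories:
        for i, s in enumerate(traj):
            for t in traj[i + 1 : i + 1 + k]:
                adj[s].add(t)
                adj[t].add(s)
    return adj

def sample_pair(adj, states, positive):
    """Sample one training pair from the aggregated adjacency structure:
    a positive pair (k-step adjacent) or a negative pair (non-adjacent)."""
    while True:
        s = random.choice(states)
        candidates = adj[s] if positive else set(states) - adj[s] - {s}
        if candidates:
            return s, random.choice(sorted(candidates))

# Toy usage: two trajectories collected by two different policies.
trajs = [["A", "B", "C", "D"], ["A", "E", "D"]]
adjacency = build_adjacency(trajs, k=1)
states = sorted({s for t in trajs for s in t})
pos = sample_pair(adjacency, states, positive=True)
neg = sample_pair(adjacency, states, positive=False)
```

Because adjacency is aggregated before sampling, a pair such as ("A", "E") can be drawn as positive even though it comes from only one of the two policies' trajectories, which is exactly what sampling directly from a single policy's trajectories would miss.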