
 Reinforcement Learning


Self-Imitation Learning via Generalized Lower Bound Q-learning

Neural Information Processing Systems

The naive IS estimator involves products of the form π(at | xt)/µ(at | xt) and is infeasible in practice due to high variance. To control the variance, a line of prior work has focused on operator-based estimation, which avoids full IS products and reduces the estimation procedure to repeated iterations of off-policy evaluation operators [1-3].
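The variance problem with full IS products can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a two-action setting, a uniform behaviour policy µ, a target policy π = (0.8, 0.2), and the helper names `is_weights` and `simulate_weight_variance`.

```python
import numpy as np


def is_weights(pi_probs, mu_probs):
    """Running product of per-step ratios pi(a_t | x_t) / mu(a_t | x_t)."""
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    return np.cumprod(ratios)


def simulate_weight_variance(horizon, n_traj=20000, seed=0):
    """Sample mean and variance of the full IS product over `horizon` steps.

    Assumed toy setting: actions are drawn from a uniform behaviour policy
    mu, and the target policy pi prefers action 0.
    """
    rng = np.random.default_rng(seed)
    pi = np.array([0.8, 0.2])  # hypothetical target policy
    mu = np.array([0.5, 0.5])  # hypothetical behaviour policy
    actions = rng.choice(2, size=(n_traj, horizon), p=mu)
    weights = (pi[actions] / mu[actions]).prod(axis=1)
    return weights.mean(), weights.var()


if __name__ == "__main__":
    for horizon in (1, 5, 10):
        mean, var = simulate_weight_variance(horizon)
        print(f"horizon={horizon:2d}  mean={mean:.3f}  var={var:.3f}")
```

In this toy setting the product has expectation 1 at every horizon, while its variance grows geometrically with the number of steps, which is the behaviour operator-based methods are designed to avoid.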







24662461d2194d1bc70a47b6b6771026-Paper-Conference.pdf

Neural Information Processing Systems

Existing works mainly focus on arranging the levels to explicitly form a curriculum. In this work, we take a close look at the learning process itself under multi-level training in Procgen.


Discovered Policy Optimisation

Neural Information Processing Systems

Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity.


A Local Temporal Difference Code for Distributional Reinforcement Learning

Neural Information Processing Systems

However, since this decoder effectively approximates the nth derivative of the input vector, it is very sensitive to noise. In our framework, the input is often very noisy, since it corresponds to the converging points of different learning traces. In this section we describe two linear decoders that differ from that in [35] and are more noise-resilient. A.9 and A.10 is crucial for long temporal horizons, since regularization causes the overall magnitude of the recovered τ-space to decrease as τ increases. Normalization amends the decreasing-magnitude problem by making the τ-space sum to 1 for every τ.
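The normalization step can be sketched as a per-τ rescaling. This is only an illustration under stated assumptions, not the paper's decoder: the recovered τ-space is assumed to be a non-negative 2-D array with one row per τ, and the name `normalize_per_tau` and the `eps` guard are hypothetical.

```python
import numpy as np


def normalize_per_tau(recovered, eps=1e-12):
    """Rescale each row (one row per tau) so that it sums to 1.

    Assumes `recovered` holds non-negative values with shape
    (n_tau, n_bins); `eps` guards against division by zero for
    rows that are entirely zero.
    """
    recovered = np.asarray(recovered, dtype=float)
    totals = recovered.sum(axis=1, keepdims=True)
    return recovered / np.maximum(totals, eps)


if __name__ == "__main__":
    # The second row has the same shape as the first but a smaller overall
    # magnitude, mimicking the shrinkage described for larger tau.
    tau_space = np.array([[0.2, 0.6, 0.2],
                          [0.1, 0.3, 0.1]])
    print(normalize_per_tau(tau_space))
```

After rescaling, both rows describe the same distribution, so the shrinking magnitude at larger τ no longer affects the comparison across τ.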