Reinforcement Learning
Self-Imitation Learning via Generalized Lower Bound Q-learning
The naive importance sampling (IS) estimator involves products of the form π(a_t | x_t)/µ(a_t | x_t) and is infeasible in practice due to high variance. To control the variance, a line of prior work has focused on operator-based estimation that avoids full IS products and reduces the estimation procedure to repeated iterations of off-policy evaluation operators [1-3].
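The variance problem can be seen in a small simulation. The sketch below is a toy illustration, not the estimator from the paper: it assumes a two-action problem in which both the target policy π and the behavior policy µ ignore the state, and the names `is_product`, `sample_weights`, `pi_probs`, and `mu_probs` are hypothetical. Under these assumptions the per-step ratio has mean 1 and second moment 1.64, so the variance of the full product is 1.64^h - 1 for horizon h, which grows exponentially.

```python
import random
import statistics


def is_product(pi_probs, mu_probs, actions):
    """Product of per-step ratios pi(a_t | x_t) / mu(a_t | x_t) along one trajectory.

    Assumption: both policies are state-independent, so each is just a list
    of action probabilities.
    """
    weight = 1.0
    for a in actions:
        weight *= pi_probs[a] / mu_probs[a]
    return weight


def sample_weights(horizon, n_traj, pi_probs, mu_probs, rng):
    """Sample trajectories from mu and return their full IS products."""
    weights = []
    for _ in range(n_traj):
        actions = rng.choices(range(len(mu_probs)), weights=mu_probs, k=horizon)
        weights.append(is_product(pi_probs, mu_probs, actions))
    return weights


rng = random.Random(0)
pi_probs = [0.9, 0.1]  # hypothetical target policy
mu_probs = [0.5, 0.5]  # hypothetical behavior policy

# Empirical variance of the IS product for increasing horizons.
var_by_horizon = {}
for h in (1, 5, 20):
    w = sample_weights(h, 20000, pi_probs, mu_probs, rng)
    var_by_horizon[h] = statistics.pvariance(w)

for h, v in var_by_horizon.items():
    print(f"horizon={h:2d}  empirical variance of IS product={v:.2f}")
```

For the longest horizon the empirical variance is itself a poor estimate, because it is dominated by a handful of rare trajectories with very large weights. That instability is the practical symptom the passage describes.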
A Local Temporal Difference Code for Distributional Reinforcement Learning
However, since this decoder effectively approximates the nth derivative of the input vector, it is very sensitive to noise. In our framework, the input is often very noisy, since it corresponds to the converging points of different learning traces. In this section we describe two linear decoders that differ from that in [35] and are more noise-resilient. The normalization in A.9 and A.10 is crucial for long temporal horizons, since regularization causes the overall magnitude of the recovered τ-space to decrease as τ increases.³ Normalization amends the decreasing magnitude problem by making the τ-space sum to 1 for every τ.
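Two points in this passage can be made concrete with a short sketch. It is an illustration under stated assumptions, not the decoder of [35] or the decoders of A.9 and A.10: a repeated forward difference stands in for "approximating the nth derivative", and the recovered τ-space is assumed to be stored as one row per τ. The function names and the example values are hypothetical.

```python
from math import comb


def nth_difference(values, n):
    """Apply the forward difference n times (a discrete stand-in for the nth derivative)."""
    out = list(values)
    for _ in range(n):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def noise_gain(n):
    """Variance multiplier for i.i.d. noise after an nth-order difference.

    The nth difference has coefficients (-1)^(n-k) * C(n, k), so the variance
    of i.i.d. input noise is multiplied by sum_k C(n, k)^2 = C(2n, n).
    """
    return sum(comb(n, k) ** 2 for k in range(n + 1))


def normalize_per_tau(tau_space):
    """Rescale each row (assumed layout: one row per tau) so that it sums to 1."""
    normalized = []
    for row in tau_space:
        total = sum(row)
        if total == 0:
            raise ValueError("cannot normalize a row that sums to zero")
        normalized.append([v / total for v in row])
    return normalized


# Noise amplification grows quickly with the order of the difference.
gains = [noise_gain(n) for n in (1, 2, 4, 8)]
print(gains)  # [2, 6, 70, 12870]

# Hypothetical recovered tau-space whose magnitude shrinks as tau increases.
recovered = [
    [0.10, 0.60, 0.30],  # small tau
    [0.05, 0.30, 0.15],  # larger tau, same shape, half the magnitude
    [0.01, 0.06, 0.03],  # largest tau, a tenth of the magnitude
]
normalized = normalize_per_tau(recovered)
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization the three rows have the same shape and each sums to 1, so the shrinking magnitude at large τ no longer hides the recovered profile. The guard against a zero row sum is a design choice of this sketch; the source does not say how such rows are handled.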