Implicit Distributional Reinforcement Learning: Appendix

A  Proof of Lemma 1

Denote $\mathcal{H} = \mathbb{E}_{a \sim \pi}[\log \pi(a \mid s)]$.


Additional ablation studies on Ant are shown in Figure 1a for a thorough comparison. On Ant, the performance of IDAC is on par with that of IDAC-Gaussian, which outperforms the other variants. Furthermore, to study the interaction between DGN and SIA, we run ablation studies that hold each of them in turn as a control factor; we conduct the corresponding experiments on Walker2d. From Figure 1b, we observe that removing either SIA (resulting in IDAC-Gaussian) or DGN (resulting in IDAC-noDGN) from IDAC generally degrades its performance, which echoes our motivation for integrating DGN and SIA so that they strengthen each other: (i) modeling G exploits distributional information to better estimate its mean Q (note that C51, which outperforms DQN by exploiting distributional information, also conducts its argmax operation on Q); (ii) a more flexible policy becomes more useful given a better estimated Q. In Figure 1, we also include a thorough comparison with SDPG (implemented on top of the stable-baselines codebase).
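To make point (i) concrete, the following minimal sketch (not the authors' implementation) shows how a distributional critic can produce samples of the return G and how the scalar Q used downstream is simply the sample mean of those returns, analogous to how C51 derives Q from its return distribution before taking an argmax. The class name DistributionalCritic, the noise dimension, network sizes, and sample count are illustrative assumptions rather than settings from the paper.

import torch
import torch.nn as nn


class DistributionalCritic(nn.Module):
    """Implicit distributional critic: maps (state, action, noise) to a return sample."""

    def __init__(self, state_dim, action_dim, noise_dim=5, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + noise_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def sample_returns(self, state, action, n_samples=32):
        # Draw n_samples return samples G(s, a, eps) by feeding independent noise draws.
        batch = state.shape[0]
        s = state.unsqueeze(1).expand(-1, n_samples, -1)
        a = action.unsqueeze(1).expand(-1, n_samples, -1)
        eps = torch.randn(batch, n_samples, self.noise_dim)
        g = self.net(torch.cat([s, a, eps], dim=-1)).squeeze(-1)  # (batch, n_samples)
        return g

    def q_value(self, state, action, n_samples=32):
        # Point (i): the scalar Q estimate is the mean of the sampled return distribution.
        return self.sample_returns(state, action, n_samples).mean(dim=-1)


if __name__ == "__main__":
    critic = DistributionalCritic(state_dim=17, action_dim=6)
    s = torch.randn(4, 17)
    a = torch.randn(4, 6)
    print("return samples:", critic.sample_returns(s, a).shape)  # (4, 32)
    print("Q estimates:   ", critic.q_value(s, a).shape)         # (4,)

Removing the distributional component (the ablation labeled IDAC-noDGN above) would correspond to regressing a single scalar output directly, discarding the spread information that the sampled returns carry.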