A Appendix A

Neural Information Processing Systems 

The agent (white circle) has to reach the goal (red star) avoiding the barrier on right. Instead, we develop a new objective that can learn Q while recovering state-only rewards below. Writing the new objective using Q -functions, we get the modification to Eq. 9: max Q, we can freely transform between Q and r . All experiments are repeated over 10 seeds. J (,) is concave for all 2 .

Similar Docs  Excel Report  more

TitleSimilaritySource
None found