9 Appendix

For all the following derivations, we use D
–Neural Information Processing Systems
Based on Lemma 2, we can derive an upper bound on our original objective: Theorem 1 (Surrogate Objective as the Divergence Upper Bound). We provide a proof with a counter-example. Based on Assumption 1, we have the following: Corollary 1. A similar strategy is adopted by [3], Sec. 3.2, to learn a value function v (s,s
Aug-15-2025, 03:29:48 GMT