Appendix A Proofs
Neural Information Processing Systems
The first part of the proof follows from the policy gradient theorem. This concludes the proof of Theorem 1.

Theorem 2. The second-order derivative formulations stated in Theorem 1 and Theorem 2 are both unbiased derivative estimates.

The randomly initialized neural network uses ReLU layers as the nonlinearity, followed by a linear layer at the end. To train the optimal policy in the gridworld example, we use the tabular value-iteration algorithm to learn the Q value of each state-action pair. The number of available actions is 5, while the number of available states is 5 × 5 = 25.
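The tabular value-iteration setup above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 5 × 5 grid and 5 actions match the text, but the goal cell, reward scheme, discount factor, and deterministic dynamics are all assumptions made for the example.

```python
import numpy as np

# Assumed setup: 5 x 5 grid (25 states), 5 actions (up, down, left, right,
# stay). Goal cell, reward of 1 at the goal, and gamma = 0.9 are illustrative
# choices, not taken from the paper.
N = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
GOAL = (N - 1, N - 1)
GAMMA = 0.9

def step(state, action):
    """Deterministic transition: move within bounds; reward 1 on reaching the goal."""
    if state == GOAL:
        return state, 0.0  # absorbing goal state
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def value_iteration(tol=1e-8):
    """Iterate Q(s,a) <- r + gamma * max_a' Q(s',a') until convergence."""
    Q = np.zeros((N, N, len(ACTIONS)))
    while True:
        Q_new = np.zeros_like(Q)
        for r in range(N):
            for c in range(N):
                for a in range(len(ACTIONS)):
                    nxt, rew = step((r, c), a)
                    Q_new[r, c, a] = rew + GAMMA * Q[nxt].max()
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

Q = value_iteration()
# The greedy action at the start cell (0, 0) moves toward the goal.
best = int(Q[0, 0].argmax())
```

The resulting Q table has shape (5, 5, 5), one entry per state-action pair, and the greedy policy argmax_a Q(s, a) recovers a shortest path to the assumed goal cell.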