Appendix A Proofs

Neural Information Processing Systems 

The first part of the proof follows from the policy gradient theorem. This concludes the proof of Theorem 1.

Proof of Theorem 2. The second-order derivative formulations stated in Theorem 1 and Theorem 2 are both unbiased derivative estimates.

The randomly initialized neural network uses ReLU layers as nonlinearities, followed by a linear layer at the end. To learn the optimal policy in the gridworld example, we use the tabular value-iteration algorithm to learn the Q-value of each state-action pair. The number of available actions is 5, and the number of available states is 5 × 5 = 25.
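As a minimal sketch of the tabular value-iteration step described above, the following computes Q-values over the 25 state-action pairs of a 5 × 5 gridworld with 5 actions. The action set (up/down/left/right/stay), the deterministic transitions, the goal reward, and the discount factor are all assumptions for illustration; they are not specified in the text.

```python
import numpy as np

# Tabular Q-value iteration on a 5x5 gridworld (sketch).
# Assumed, not from the paper: actions are up/down/left/right/stay,
# transitions are deterministic, reward is 1 on entering the goal
# state and 0 otherwise, and the discount factor is gamma = 0.9.
N = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # 5 actions
GOAL = (N - 1, N - 1)
gamma = 0.9

def step(state, a):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[a]
    nr = min(max(r + dr, 0), N - 1)
    nc = min(max(c + dc, 0), N - 1)
    reward = 1.0 if (nr, nc) == GOAL else 0.0
    return (nr, nc), reward

# One Q entry per state-action pair: 25 states x 5 actions.
Q = np.zeros((N, N, len(ACTIONS)))
for _ in range(100):  # value-iteration sweeps
    Q_new = np.zeros_like(Q)
    for r in range(N):
        for c in range(N):
            for a in range(len(ACTIONS)):
                (nr, nc), reward = step((r, c), a)
                # Bellman optimality backup for Q(s, a).
                Q_new[r, c, a] = reward + gamma * Q[nr, nc].max()
    if np.abs(Q_new - Q).max() < 1e-8:
        Q = Q_new
        break
    Q = Q_new

# The optimal (greedy) policy picks argmax_a Q(s, a) in each state.
policy = Q.argmax(axis=-1)
```

States nearer the goal end up with larger values, and the greedy policy at the start corner moves toward the goal, which is a quick sanity check that the backups converged.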