Neural Information Processing Systems
In this section, we derive a lower bound on the trace of the covariance of the PG estimator in environments with stochastic dynamics. Assume the initial policy π(a_i|s_i) is uniform, i.e., π(a_i = −1|s_i) = π(a_i = +1|s_i) = 1/2 for all i. The optimal policy for t, π_θ(t|s), should produce t ≤ x, since otherwise it risks ending up with a reward of ν, which is not an optimum. Since FiGAR-C is unaware of the underlying state changes, its best strategy is to shorten the duration of actions so as to be more responsive. In VPG, we do not use any variance-reduction technique such as value functions or the reward-to-go policy gradient; hence, its gradient estimator is identical to Equation (3).
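To make the VPG estimator concrete, the following is a minimal sketch of a vanilla policy-gradient estimate with no variance reduction: every log-probability gradient in a trajectory is weighted by the full trajectory return, with no baseline and no reward-to-go. The Bernoulli policy over actions {−1, +1} parameterized by a single logit θ, and the function name `vpg_gradient`, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vpg_gradient(theta, trajectories):
    """Vanilla policy-gradient estimate (no baseline, no reward-to-go).

    Illustrative policy: pi(a=+1|s) = sigmoid(theta), actions in {-1, +1}.
    Each trajectory is a (states, actions, rewards) tuple; every log-prob
    gradient is weighted by the FULL trajectory return R(tau).
    """
    p = 1.0 / (1.0 + np.exp(-theta))  # probability of action +1
    grads = []
    for states, actions, rewards in trajectories:
        R = sum(rewards)  # total return of the whole trajectory
        # d/dtheta log pi(a|s) = (1 - p) if a = +1, else -p
        glogp = sum((1.0 - p) if a == +1 else -p for a in actions)
        grads.append(glogp * R)
    return float(np.mean(grads))  # Monte Carlo average over trajectories
```

Because the whole-trajectory return multiplies every step's score, this estimator is unbiased but high-variance, which is precisely why the covariance lower bound derived in this section applies to it directly.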