Reviews: Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning
–Neural Information Processing Systems
The DiCE gradient estimator [1] allows the computation of higher-order derivatives in stochastic computation graphs. This may be useful in contexts such as multi-agent learning or meta-RL, where the proper application of methods such as MAML requires the computation of second-order derivatives. The current paper extends DiCE and derives a more general objective that allows integration of the advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) in order to control the variance while providing unbiased estimates. The advantage can be approximated, trading off variance for bias, using parametric function approximators and methods such as Generalized Advantage Estimation (GAE). Moreover, the authors propose to further control the variance of the higher-order gradients by discounting the impact of past actions on the current advantage, thus limiting the range of causal dependencies. This paper is well executed: it is well written, technically sound and potentially impactful.
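For readers unfamiliar with GAE, the following is a minimal sketch of how an advantage estimate of the kind mentioned above might be computed for a single trajectory. It is not the Loaded DiCE objective itself and is not taken from the paper; the function name `gae_advantages`, the argument layout (a `values` list with one extra bootstrap entry), and the default `gamma` and `lam` are assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (illustrative sketch).

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T + 1 value estimates V(s_0) .. V(s_T); the last entry
             is the bootstrap value (assumed 0.0 for a terminal state)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # With lam = 0 the estimate reduces to the one-step TD residual
    # (lower variance, more bias); with lam = 1 it sums all future residuals.
    print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
    print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.0))
```

The `lam` parameter is where the bias-variance trade-off enters: values closer to 0 lean more heavily on the learned value function, while values closer to 1 rely more on sampled returns.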
Jan-24-2025, 15:42:56 GMT