Review for NeurIPS paper: Differentiable Meta-Learning of Bandit Policies


As in standard policy-gradient methods, two key parameters appear to be the batch size m and the horizon n. It would be good to provide a sensitivity analysis on these parameters to better assess how the approach scales to complex problems. In particular, what is the effect of the horizon on the gradient estimation? Does the variance blow up as n grows, or is the baseline sufficient to keep it under control? In this sense, it would be valuable to have differentiable strategies that are provably efficient (e.g., with sub-linear regret) across a range of parameter values, so that whatever value of \theta is encountered during optimization does not perform poorly.
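To make the variance concern concrete, here is a minimal sketch (my own illustration, not the paper's method) of a REINFORCE-style score-function gradient estimator for a two-armed Bernoulli bandit with a sigmoid policy: it compares the empirical variance of the batch gradient estimate, at a fixed horizon n and batch size m, with and without a batch-mean baseline. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_episode(theta, means, n, rng):
    """One episode of horizon n: per-step score terms d/dtheta log pi(a_t) and rewards."""
    p = sigmoid(theta)                         # probability of pulling arm 1
    arms = (rng.random(n) < p).astype(int)
    rewards = (rng.random(n) < means[arms]).astype(float)
    scores = np.where(arms == 1, 1.0 - p, -p)  # gradient of log-policy for each chosen action
    return scores, rewards

def grad_estimate(theta, means, m, n, rng, use_baseline):
    """REINFORCE estimate of d/dtheta E[return] from a batch of m episodes."""
    batch = [run_episode(theta, means, n, rng) for _ in range(m)]
    returns = np.array([rewards.sum() for _, rewards in batch])
    # Batch-mean baseline (slightly biased within a batch; fine for illustration).
    b = returns.mean() if use_baseline else 0.0
    grads = np.array([scores.sum() * (R - b)
                      for (scores, _), R in zip(batch, returns)])
    return grads.mean()

theta, means, m, n = 0.0, np.array([0.1, 0.9]), 16, 50
reps = 200
var_no_baseline = np.var([grad_estimate(theta, means, m, n, rng, False) for _ in range(reps)])
var_baseline = np.var([grad_estimate(theta, means, m, n, rng, True) for _ in range(reps)])
print(var_no_baseline, var_baseline)
```

In this toy setting the un-baselined estimator's variance grows with the return scale (roughly with n), while the baseline removes the large constant offset; this is the kind of sensitivity-to-n experiment the paper could report directly.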