Review for NeurIPS paper: Dynamic Regret of Policy Optimization in Non-Stationary Environments

Neural Information Processing Systems 

Weaknesses: (1) The paper assumes full-information reward feedback, which is hardly a realistic assumption. It would be much appreciated if the authors considered bandit feedback, as [1] does. This is undesirable in practice. There are some recent efforts to remove such dependency [2,3]. The basic idea is to run another meta-bandit algorithm to select the optimal parameter.
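
To make the meta-bandit idea concrete, here is a minimal sketch, not the construction used in [2,3]. It assumes a finite grid of candidate parameter values and uses standard EXP3 as the meta-algorithm. The names `exp3_select_parameter`, `candidates`, and `run_block` are hypothetical, and `run_block` is assumed to run the base algorithm with the chosen parameter for one block of episodes and return a reward normalized to [0, 1].

```python
import math
import random


def exp3_select_parameter(candidates, run_block, num_blocks, gamma=0.1, seed=0):
    """Pick among candidate parameters with EXP3 (illustrative sketch).

    candidates: list of parameter values to choose from (hypothetical grid).
    run_block:  callable(param) -> reward in [0, 1] for one block of episodes
                (hypothetical interface to the base algorithm).
    Returns the number of times each candidate was selected.
    """
    rng = random.Random(seed)
    k = len(candidates)
    weights = [1.0] * k
    pulls = [0] * k

    for _ in range(num_blocks):
        total = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]

        # Only the chosen candidate's reward is observed (bandit feedback).
        reward = run_block(candidates[arm])

        # Importance-weighted reward estimate for the chosen arm.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)

        # Rescale so the largest weight is 1, which avoids overflow.
        largest = max(weights)
        weights = [w / largest for w in weights]
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    # Toy block reward: candidates closer to 0.5 earn more (assumption for the demo).
    grid = [0.1, 0.5, 0.9]
    counts = exp3_select_parameter(grid, lambda p: 1.0 - abs(p - 0.5), num_blocks=500)
    print(dict(zip(grid, counts)))
```

In this sketch the meta-algorithm treats each candidate parameter as an arm, so no prior knowledge of the best value is needed, at the cost of the extra regret incurred by the meta-level exploration.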