Review for NeurIPS paper: Dynamic Regret of Policy Optimization in Non-Stationary Environments
–Neural Information Processing Systems
Weaknesses: (1) The paper assumes full-information reward feedback, which is hardly a realistic assumption. It would be much appreciated if the authors considered bandit feedback, as [1] does. This is undesirable in practice. There are some recent efforts to remove such a dependency [2,3]. The basic idea is to run an additional meta bandit algorithm to select the optimal parameter.
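To make the meta-bandit suggestion concrete, here is a minimal sketch, assuming the unknown parameter is restricted to a finite grid of candidate values and that the reward collected with each candidate is normalized to [0, 1]. It uses standard EXP3 as the meta algorithm; this is an illustrative choice, not necessarily the construction used in [2,3], and the class name `Exp3ParameterSelector` and the parameter `gamma` are hypothetical.

```python
import math
import random


class Exp3ParameterSelector:
    """EXP3 over a finite grid of candidate parameter values (hypothetical helper)."""

    def __init__(self, candidates, gamma=0.1, seed=0):
        self.candidates = list(candidates)
        self.gamma = gamma  # exploration rate in (0, 1], an assumed tuning knob
        self.weights = [1.0] * len(self.candidates)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the normalized weights with uniform exploration.
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select(self):
        # Sample a candidate index according to the current distribution.
        probs = self.probabilities()
        idx = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return idx, self.candidates[idx]

    def update(self, idx, reward):
        # `reward` is assumed to lie in [0, 1]; only the chosen arm is observed.
        probs = self.probabilities()
        estimate = reward / probs[idx]  # importance-weighted reward estimate
        self.weights[idx] *= math.exp(self.gamma * estimate / len(self.weights))
        # Rescale so the largest weight is 1, which avoids overflow
        # without changing the induced probabilities.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]
```

In a block-wise scheme, one would call `select()` at the start of each block, run the base policy optimization algorithm with the chosen parameter for that block, and pass the normalized block reward to `update()`.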
Jan-24-2025, 03:58:48 GMT