Review for NeurIPS paper: Dynamic Regret of Policy Optimization in Non-Stationary Environments

Jan-24-2025, 03:58:48 GMT - Neural Information Processing Systems

Weaknesses: (1) The paper assumes full-information reward feedback, which is hardly a realistic assumption. It would be much appreciated if the authors instead considered bandit feedback, as [1] does. This is undesirable in practice. There have been some recent efforts to remove such a dependency [2,3]. The basic idea is to run an additional meta-level bandit algorithm that selects the optimal parameter.
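The meta-bandit idea in the last sentence can be illustrated with a minimal sketch. It assumes EXP3 as the meta-level algorithm and a finite grid of candidate parameter values; the review does not say that [2,3] use exactly this construction. The names `exp3_select_parameter`, `run_block`, and `gamma` are hypothetical and chosen only for illustration.

```python
import math
import random


def exp3_select_parameter(candidates, run_block, num_blocks, gamma=0.1, seed=0):
    """Pick among candidate parameters with EXP3 (illustrative sketch).

    candidates: list of parameter values (the arms of the meta bandit).
    run_block:  hypothetical callback; runs the base algorithm for one
                block with the given parameter and returns a reward
                scaled to [0, 1].
    Returns (pull counts per candidate, final sampling probabilities).
    """
    rng = random.Random(seed)
    k = len(candidates)
    weights = [1.0] * k
    pulls = [0] * k

    def sampling_probs():
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(weights)
        return [(1.0 - gamma) * w / total + gamma / k for w in weights]

    for _ in range(num_blocks):
        probs = sampling_probs()
        arm = rng.choices(range(k), weights=probs)[0]
        reward = run_block(candidates[arm])
        pulls[arm] += 1

        # Importance-weighted reward estimate: only the chosen arm is
        # observed (bandit feedback), so divide by its sampling probability.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)

        # Rescale so the largest weight is 1; this leaves the sampling
        # distribution unchanged and avoids floating-point overflow.
        largest = max(weights)
        weights = [w / largest for w in weights]

    return pulls, sampling_probs()
```

In this sketch, a call such as `exp3_select_parameter([0.01, 0.1, 1.0], run_block, num_blocks=2000)` would treat each candidate value as an arm, with `run_block` wrapping one block of the base policy-optimization routine. The meta level needs only the observed block reward, not prior knowledge of which parameter is best.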

algorithm, non-stationary environment, policy optimization

Technology:
- Information Technology
  - Data Science > Data Mining
    - Big Data (0.76)
  - Artificial Intelligence > Machine Learning
    - Reinforcement Learning (0.55)