Tempo Adaptation in Non-stationary Reinforcement Learning

Neural Information Processing Systems 

We first raise and tackle a ``time synchronization'' issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time ($t$) rather than episode progress ($k$), where wall-clock time signifies the actual elapsed time within the fixed duration $t \in [0, T]$. In existing works, at episode $k$, the agent rolls out a trajectory and trains a policy before transitioning to episode $k+1$. In the context of the time-desynchronized environment, however, the agent at time $t_k$ allocates $\Delta t$ for trajectory generation and training, and subsequently moves to the next episode at $t_{k+1} = t_k + \Delta t$. Despite a fixed total number of episodes ($K$), the agent accumulates different trajectories influenced by the choice of interaction times $(t_1, t_2, \ldots, t_K)$, significantly impacting the suboptimality gap of the policy.
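To make the time-desynchronized protocol concrete, the following is a minimal Python sketch of the interaction loop described above. `NonStationaryEnv`, `Agent`, the linear reward drift, and the fixed per-episode cost `delta_t` are all illustrative assumptions, not the paper's actual method.

```python
class NonStationaryEnv:
    """Toy environment whose reward drifts with wall-clock time t, not episode index k."""

    def __init__(self, horizon=10):
        self.horizon = horizon

    def rollout(self, policy, t):
        # Assumed linear drift: the same policy earns less as wall-clock time passes.
        drift = 0.1 * t
        return [(policy(state), 1.0 - drift) for state in range(self.horizon)]


class Agent:
    def __init__(self):
        self.theta = 0.0

    def policy(self, state):
        return self.theta  # trivial constant policy, for illustration only

    def train(self, trajectory):
        # One crude update from the mean reward of the collected trajectory.
        rewards = [r for _, r in trajectory]
        self.theta += 0.01 * sum(rewards) / len(rewards)


T = 100.0       # total wall-clock duration, t in [0, T]
K = 20          # fixed total number of episodes
delta_t = 2.0   # wall-clock time consumed by one rollout plus one training step

env, agent = NonStationaryEnv(), Agent()
t_k = 0.0
for k in range(K):
    if t_k > T:
        break
    # The trajectory reflects the environment at wall-clock time t_k, not at
    # episode index k, so a different interaction schedule yields different data.
    trajectory = env.rollout(agent.policy, t_k)
    agent.train(trajectory)
    t_k += delta_t  # t_{k+1} = t_k + delta_t
```

Here the interaction times happen to be uniformly spaced; the point of the abstract is that other choices of $(t_1, \ldots, t_K)$ produce different trajectories and hence a different suboptimality gap.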