Reinforcement Learning under Drift

Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu

arXiv.org Machine Learning 

Consider a discrete-time Markov decision process (MDP) in which a decision-maker (DM) interacts with a system iteratively. In each round, the DM first observes the current state of the system and then picks an available action. It then receives an instantaneous random reward, and the system transitions to the next state according to a state transition distribution. The reward distribution and the state transition distribution depend on the current state and the chosen action, but are independent of all previous states and actions. The goal of the DM is to maximize its cumulative reward under the following challenges:

- Uncertainty: the reward and state transition distributions are initially unknown to the DM.
- Non-stationarity: the environment is non-stationary, and both the reward distributions and the state transition distributions can evolve over time.
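The interaction protocol described above can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the authors' algorithm or environment. The class `DriftingMDP`, the function `run_uniform_policy`, the Bernoulli rewards, and the linear interpolation between two randomly drawn environments are all hypothetical choices made for illustration. The text only says that the distributions can evolve over time, and does not say how. The uniformly random policy is a placeholder for whatever learning rule the DM would actually use.

```python
import random


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


class DriftingMDP:
    """Tabular MDP whose rewards and transitions change with the round index.

    Assumption (not from the source): the environment drifts linearly from a
    random "start" model to a random "end" model over the horizon.
    """

    def __init__(self, n_states=3, n_actions=2, horizon=100, seed=0):
        self.rng = random.Random(seed)
        self.n_states, self.n_actions, self.horizon = n_states, n_actions, horizon
        self.r_start, self.r_end = self._reward_table(), self._reward_table()
        self.p_start, self.p_end = self._kernel(), self._kernel()
        self.t = 0       # current round
        self.state = 0   # current state, observed by the DM

    def _reward_table(self):
        # Bernoulli reward means in [0, 1), indexed by [state][action].
        return [[self.rng.random() for _ in range(self.n_actions)]
                for _ in range(self.n_states)]

    def _kernel(self):
        # Transition probabilities indexed by [state][action][next_state].
        return [[normalize([self.rng.random() + 1e-3 for _ in range(self.n_states)])
                 for _ in range(self.n_actions)]
                for _ in range(self.n_states)]

    def _mix(self, a, b):
        # Interpolation weight grows from 0 to 1 across the horizon.
        w = min(1.0, self.t / max(self.horizon - 1, 1))
        return (1 - w) * a + w * b

    def reward_mean(self, s, a):
        return self._mix(self.r_start[s][a], self.r_end[s][a])

    def transition_probs(self, s, a):
        return [self._mix(x, y) for x, y in zip(self.p_start[s][a], self.p_end[s][a])]

    def step(self, action):
        """Draw a reward and a next state that depend only on (state, action, t)."""
        s = self.state
        reward = 1.0 if self.rng.random() < self.reward_mean(s, action) else 0.0
        probs = self.transition_probs(s, action)
        self.state = self.rng.choices(range(self.n_states), weights=probs)[0]
        self.t += 1
        return reward, self.state


def run_uniform_policy(env, seed=1):
    """Placeholder DM: observe the state, pick an action uniformly at random."""
    policy_rng = random.Random(seed)
    total = 0.0
    for _ in range(env.horizon):
        _state = env.state                       # observation (unused by this policy)
        action = policy_rng.randrange(env.n_actions)
        reward, _ = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    env = DriftingMDP(horizon=50, seed=0)
    print("cumulative reward:", run_uniform_policy(env))
```

A learning DM would replace the random action choice with a rule that estimates `reward_mean` and `transition_probs` from observed samples. Because both quantities drift, older samples become less representative of the current round, which is the difficulty the non-stationarity challenge refers to.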
