Reinforcement Learning under Drift
Cheung, Wang Chi; Simchi-Levi, David; Zhu, Ruihao
Consider a discrete-time Markov decision process (MDP) in which a decision-maker (DM) interacts with a system iteratively: in each round, the DM first observes the current state of the system and then picks an available action. It then receives an instantaneous random reward, and the system transitions to the next state according to a state transition distribution. The reward distribution and the state transition distribution depend on the current state and the chosen action, but are independent of all previous states and actions. The goal of the DM is to maximize its cumulative reward under the following challenges:
- Uncertainty: the reward and state transition distributions are initially unknown to the DM.
- Non-stationarity: the environment is non-stationary, and both the reward distributions and the state transition distributions can evolve over time.
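The interaction protocol described above can be sketched as a small simulation. This is only an illustrative toy, not the authors' model or algorithm: the class `DriftingMDP`, the `drift` parameter, the linear drift schedule, and the Bernoulli rewards are all assumptions chosen to make the loop concrete.

```python
import random


class DriftingMDP:
    """Toy non-stationary MDP (hypothetical): reward means and transition
    probabilities both change with the round index t."""

    def __init__(self, n_states=2, n_actions=2, drift=0.01, seed=0):
        self.n_states = n_states
        self.n_actions = n_actions
        self.drift = drift  # assumed per-round drift rate
        self.rng = random.Random(seed)
        self.state = 0
        self.t = 0

    def _shift(self):
        # Fraction of the way from the initial to the final environment.
        return min(1.0, self.drift * self.t)

    def mean_reward(self, s, a):
        # Mean reward depends on (s, a) and drifts toward its complement.
        base = float((s + a) % 2)
        shift = self._shift()
        return base * (1.0 - shift) + (1.0 - base) * shift

    def transition_probs(self, s, a):
        # Action 0 initially tends to stay in s, action 1 tends to leave;
        # the drift gradually swaps these two behaviours.
        base_stay = 0.9 if a == 0 else 0.1
        shift = self._shift()
        stay = base_stay * (1.0 - shift) + (1.0 - base_stay) * shift
        move = (1.0 - stay) / (self.n_states - 1)
        return [stay if s2 == s else move for s2 in range(self.n_states)]

    def step(self, a):
        s = self.state
        # Bernoulli reward with the current (drifting) mean.
        reward = 1.0 if self.rng.random() < self.mean_reward(s, a) else 0.0
        probs = self.transition_probs(s, a)
        self.state = self.rng.choices(range(self.n_states), weights=probs)[0]
        self.t += 1
        return reward, self.state


def run(env, policy, horizon):
    """One episode of the observe-act-reward-transition loop."""
    total = 0.0
    s = env.state
    for _ in range(horizon):
        a = policy(s)          # DM observes the state and picks an action
        r, s = env.step(a)     # reward is drawn, then the system transitions
        total += r
    return total
```

For example, `run(DriftingMDP(), lambda s: 1 - s, 100)` plays a fixed policy for 100 rounds. Because the environment drifts, a policy that is good in early rounds can become poor later, which is the difficulty the non-stationarity challenge refers to.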
Jun-7-2019
- Country:
- Asia > Singapore (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- Genre:
- Research Report (0.40)
- Technology: