Learning and Optimization with Seasonal Patterns
Chen, Ningyuan, Wang, Chun, Wang, Longlin
Online learning, or more specifically, the multi-armed bandit (MAB) problem, focuses on the task of learning the reward distributions from an unknown environment while simultaneously optimizing cumulative rewards over a fixed time horizon T. This problem has been studied extensively when the environment (i.e., reward distributions) is stationary over time, with numerous algorithms proposed to tackle the tradeoff between exploration and exploitation when making decisions (see Bubeck et al. 2012 for a comprehensive review). While the stationarity assumption about the reward distributions greatly simplifies the analysis, it does not hold in many decision problems in OR/MS and other fields when the environment is time-varying. For example, a fashion retailer should take into account the seasonal demand shift when setting the prices for apparels, and a hospital needs to consider the variation of the patient arrival rate when scheduling the medical staff. Despite the practical relevance, it is difficult to develop a learning policy for non-stationary rewards, especially when the dynamics can change arbitrarily over time. Recent studies (Besbes et al., 2015) have considered cases in which the environment does not change fast with respect to the length of the time horizon, e.g., when a budget sublinear in T is imposed on the total variation of the underlying reward distribution.
Jun-11-2020
- Genre:
- Research Report (0.70)
- Overview (0.66)
- Industry:
- Education (0.48)
- Health & Medicine (0.34)
- Energy > Oil & Gas
- Upstream (0.48)
- Technology: