Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

Neural Information Processing Systems 

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies \pi {\text{O}} and \pi {\text{E}}: \pi {\text{O}} ( O'' for online'') interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while \pi {\text{E}} ( E'' for exploit'') exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., \pi {\text{E}} \pi {\text{O}}) for the risk-averse users. We individually consider the gap-independent vs. gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective.