Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

Wu, Jingfeng, Braverman, Vladimir, Yang, Lin F.

arXiv.org Machine Learning 

In single-objective reinforcement learning (RL), a scalar reward is pre-specified and an agent learns a policy that maximizes the long-term cumulative reward [Azar et al., 2017, Jin et al., 2018]. In many real-world applications, however, we need to optimize multiple objectives in the same (unknown) environment, even when these objectives may conflict with one another [Roijers et al., 2013]. For example, in an autonomous driving application, each passenger may have a different preference for driving style: some passengers prefer a very steady ride, while others enjoy the car's fast acceleration. The traditional single-objective RL approach therefore cannot be applied directly in such scenarios. One way to tackle this issue is multi-objective reinforcement learning (MORL) [Roijers et al., 2013, Yang et al., 2019, Natarajan and Tadepalli, 2005, Abels et al., 2018], which models the multiple objectives with a vectorized reward and an additional preference vector that specifies the relative importance of each objective. A MORL agent must find policies that optimize the cumulative preference-weighted reward under all possible preferences.
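To make the preference-weighted objective concrete, the following is a minimal Python sketch, not code from the paper: the function name `scalarize`, the two-objective (comfort, speed) reward, and the example preference vectors are all illustrative assumptions. It shows how one vectorized reward yields different scalar rewards under different passenger preferences.

```python
import numpy as np

def scalarize(reward_vec: np.ndarray, preference: np.ndarray) -> float:
    """Scalarize a vectorized reward r in R^d with a preference vector w
    (nonnegative entries summing to 1): returns <w, r>."""
    return float(np.dot(preference, reward_vec))

# Hypothetical two-objective step reward for one (state, action) pair:
# index 0 = ride comfort, index 1 = speed.
reward_vec = np.array([0.8, 0.3])

# Two hypothetical passengers with different preference vectors.
steady_passenger = np.array([0.9, 0.1])  # values comfort highly
fast_passenger = np.array([0.2, 0.8])    # values speed highly

print(scalarize(reward_vec, steady_passenger))  # 0.75
print(scalarize(reward_vec, fast_passenger))    # 0.40
```

Under this scalarization, the same action looks good to the comfort-oriented passenger but mediocre to the speed-oriented one; the MORL agent must find, for every such preference vector, a policy maximizing the expected sum of these scalarized rewards over an episode.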
