Reviews: A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation

Neural Information Processing Systems 

Summary: The paper proposes a new algorithm for learning linear preferences, which are objectives derived from a linear weighting of a vector reward function, in multi-objective reinforcement learning (MORL). The proposed algorithm achieves this by performing updates that use the convex envelope of the solution frontier to update the parameters of the action-value function, hence its name: envelop Q-learning. This is done by first defining a multi-objective version of the action-value function along with a pseudo-metric, the supremum over the states, actions, and preferences. Then, a Bellman operator is defined for the multi-objective action-value function along with an optimality filter, which together define a new optimality operator. Using all these definitions, the paper then shows three main theoretical results: 1) the optimality operator has a fixed point that maximizes the amount of reward under any given preference, 2) the optimality operator is a contraction, and 3) for any Q in the pseudo-metric space, iterative applications of the optimality operator will result in an action-value function which distance to the fixed point is equal to zero, i.e. is equivalent to the fixed point.