Reviews: Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives
Neural Information Processing Systems
Summary: This paper studies a generalization of online reinforcement learning (in the infinite-horizon undiscounted setting, with finite state and action spaces and a communicating MDP) in which the agent aims to maximize a certain class of concave functions of the rewards (extended to global concave functions in the appendix). More precisely, every time an action "a" is played in state "s", the agent receives a vector of rewards V(s,a) (instead of a scalar reward r(s,a)) and tries to maximize a concave function of the empirical average of these vectorial outcomes. This problem is very general and models a wide variety of settings, ranging from multi-objective optimization in MDPs to maximum-entropy exploration and online learning in MDPs with knapsack constraints. In Section 2 the authors introduce the necessary background and formalize the notions of "optimal gain" and "regret" in this setting. Defining the "optimal gain" (called the "offline benchmark" in the paper) is not straightforward.
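Concretely, the objective described above can be sketched as follows; the symbols f, d, and the exact form of the regret are my paraphrase of the setting, not taken verbatim from the paper:

```latex
% Vectorial feedback: playing action a_t in state s_t yields a reward
% vector V(s_t, a_t) \in \mathbb{R}^d rather than a scalar r(s_t, a_t).
% The agent seeks a policy maximizing a concave f of the empirical
% average of the outcomes:
\[
  \max_{\pi} \; f\!\left( \frac{1}{T} \sum_{t=1}^{T} V(s_t, a_t) \right),
  \qquad f : \mathbb{R}^d \to \mathbb{R} \ \text{concave}.
\]
% Regret is then measured against the offline benchmark value f^*
% (the "optimal gain" formalized in Section 2):
\[
  \mathrm{Reg}(T) \;=\; T \cdot f^{*} \;-\;
  T \cdot f\!\left( \frac{1}{T} \sum_{t=1}^{T} V(s_t, a_t) \right).
\]
```

Special cases recover familiar settings: a linear f gives standard scalar-reward RL, while an entropy-like f captures maximum-entropy exploration.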
Jan-26-2025, 03:15:57 GMT