tfw-ucrl2
We respond to the major points raised by the reviewers (for each point, we refer to the particular reviewers who raised it).
We thank all reviewers for their thoughtful feedback that can help enhance the presentation of our results. We will clarify this decision (as the reviewer recommends). PAC bound by taking the resulting mixture policy. We will add a note in the final version. The knapsack solver is provided in Appendix A.3 and is a linear program with We will discuss the additional challenges that arise in these settings and explicitly state them as future directions.
Reviews: Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives
Summary: This paper studies a generalization of online reinforcement learning (in the infinite-horizon undiscounted setting with finite state and action spaces and a communicating MDP) where the agent aims at maximizing a certain type of concave function of the rewards (extended to global concave functions in the appendix). More precisely, every time an action "a" is played in state "s", the agent receives a vector of rewards V(s,a) (instead of a scalar reward r(s,a)) and tries to maximize a concave function of the empirical average of the vectorial outcomes. This problem is very general and models a wide variety of settings, ranging from multi-objective optimization in MDPs to maximum entropy exploration and online learning in MDPs with knapsack constraints. In Section 2 the authors introduce the necessary background and formalize the notions of "optimal gain" and "regret" in this setting. Defining the "optimal gain" (called the "offline benchmark" in the paper) is not straightforward.
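To make the objective concrete, here is a minimal sketch of the maximum entropy exploration instance, under assumptions that are mine rather than the paper's: the outcome vector V(s,a) is taken to be the one-hot indicator of the visited state, the concave function is the Shannon entropy, and the helper names (`empirical_average`, `entropy`, `one_hot`) and the trajectory are hypothetical. It only evaluates the objective on a fixed trajectory; it is not the paper's algorithm.

```python
import math

def empirical_average(outcomes):
    """Component-wise mean of a list of equal-length outcome vectors."""
    n = len(outcomes)
    dim = len(outcomes[0])
    return [sum(v[i] for v in outcomes) / n for i in range(dim)]

def entropy(p):
    """Shannon entropy, a concave function of p; zero entries contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def one_hot(state, num_states):
    """Assumed outcome vector V(s, a): indicator of the visited state."""
    return [1.0 if i == state else 0.0 for i in range(num_states)]

# Hypothetical trajectory of visited states in a 4-state MDP.
visited = [0, 1, 2, 3, 0, 1, 2, 3]
outcomes = [one_hot(s, 4) for s in visited]

avg = empirical_average(outcomes)  # state-visitation frequencies
objective = entropy(avg)           # concave function of the average outcome

# A trajectory stuck in a single state has zero entropy.
stuck = entropy(empirical_average([one_hot(0, 4) for _ in range(5)]))
```

Each individual one-hot outcome has zero entropy, so the objective cannot be written as an average of per-step scalar rewards; it depends on the trajectory only through the averaged outcome vector, which is what separates this setting from standard scalar-reward reinforcement learning.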