Reinforcement Learning
Weighted importance sampling for off-policy learning with linear function approximation
Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(lambda). We show empirically that our new WIS-LSTD(lambda) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(lambda) (Yu 2010, Bertsekas & Yu 2009).
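To make the distinction between ordinary and weighted importance sampling concrete, the following is a minimal sketch of the two estimators in the simple tabular (no function approximation) setting. It is not the paper's WIS-LSTD(lambda) algorithm; the function names and the sample returns and importance ratios are hypothetical and chosen only for illustration.

```python
def ordinary_is(returns, ratios):
    """Ordinary importance sampling: divide the ratio-weighted sum by the sample count."""
    n = len(returns)
    return sum(rho * g for rho, g in zip(ratios, returns)) / n


def weighted_is(returns, ratios):
    """Weighted importance sampling: divide the ratio-weighted sum by the sum of ratios."""
    total = sum(ratios)
    if total == 0:
        # No sample has support under the target policy; return 0.0 by convention.
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / total


# Hypothetical data: returns observed under the behavior policy, and the
# importance ratios pi(a|s) / b(a|s) accumulated along each trajectory.
sample_returns = [1.0, 0.0, 2.0]
sample_ratios = [0.5, 2.0, 1.5]

ois_estimate = ordinary_is(sample_returns, sample_ratios)
wis_estimate = weighted_is(sample_returns, sample_ratios)
```

Because the weighted estimate is a convex combination of the observed returns (for non-negative ratios), it always stays within their range, which is one intuition for its lower variance; the ordinary estimate has no such guarantee.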
Non-Cooperative Inverse Reinforcement Learning
Making decisions in the presence of a strategic opponent requires one to take into account the opponent's ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper presents a new technique for solving MDPs. The new technique, presented as an alternative to approximate policy/value iteration, consists in directly minimizing the Optimal Bellman Residual (OBR). The authors first motivate their method by showing that the loss bound of OBR is often tighter than the loss bound of policy/value iteration, which is a known result [9,15]. The authors then show that an empirical estimate of OBR is consistent in the Vapnik sense, i.e., minimizing the empirical OBR is equivalent to minimizing an upper bound on the true OBR, which is unknown when the MDP model is unknown. Finally, the authors show that OBR can be decomposed into a difference of two convex functions, and that a standard Difference of Convex Functions (DC) optimization method can be used for finding a local optimum.
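As a rough illustration of the quantity being minimized, the sketch below computes an optimal Bellman residual for a state-action value function Q on a tiny MDP with a known model. It does not implement the DC optimization described in the review. The use of Q-functions, the mean absolute value as the norm, and the two-state MDP are all assumptions made for this example.

```python
def optimal_bellman_operator(Q, P, R, gamma):
    """Apply (T*Q)(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')."""
    n_states, n_actions = len(Q), len(Q[0])
    TQ = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            expected_next = sum(P[s][a][s2] * max(Q[s2]) for s2 in range(n_states))
            TQ[s][a] = R[s][a] + gamma * expected_next
    return TQ


def optimal_bellman_residual(Q, P, R, gamma):
    """Mean absolute difference between T*Q and Q (the norm is an assumption)."""
    TQ = optimal_bellman_operator(Q, P, R, gamma)
    n_states, n_actions = len(Q), len(Q[0])
    total = sum(abs(TQ[s][a] - Q[s][a])
                for s in range(n_states) for a in range(n_actions))
    return total / (n_states * n_actions)


# Hypothetical MDP: state 1 is absorbing with zero reward; in state 0,
# action 0 stays in state 0 (reward 0) and action 1 moves to state 1 (reward 1).
P = [[[1.0, 0.0], [0.0, 1.0]],   # transitions from state 0 under actions 0 and 1
     [[0.0, 1.0], [0.0, 1.0]]]   # transitions from state 1 under actions 0 and 1
R = [[0.0, 1.0],
     [0.0, 0.0]]
gamma = 0.9

Q_zero = [[0.0, 0.0], [0.0, 0.0]]   # arbitrary initial guess
Q_star = [[0.9, 1.0], [0.0, 0.0]]   # fixed point of T* for this MDP

residual_zero = optimal_bellman_residual(Q_zero, P, R, gamma)
residual_star = optimal_bellman_residual(Q_star, P, R, gamma)
```

The residual vanishes exactly at the optimal value function and is positive elsewhere, which is why it can serve as a minimization objective; when the model is unknown, the expectation over next states would have to be replaced by an empirical estimate from samples.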