 Reinforcement Learning


Improving Optimization Bounds using Machine Learning: Decision Diagrams meet Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Finding tight bounds on the optimal solution is a critical element of practical solution methods for discrete optimization problems. In the last decade, decision diagrams (DDs) have brought a new perspective on obtaining upper and lower bounds that can be significantly better than classical bounding mechanisms, such as linear relaxations. It is well known that the quality of the bound achieved through this flexible bounding method is highly reliant on the ordering of variables chosen for building the diagram, and finding an ordering that optimizes standard metrics, or even improving one, is an NP-hard problem. In this paper, we propose an innovative and generic approach based on deep reinforcement learning for obtaining an ordering that tightens the bounds obtained with relaxed and restricted DDs. We apply the approach to both the Maximum Independent Set Problem and the Maximum Cut Problem. Experimental results on synthetic instances show that the deep reinforcement learning approach, by achieving tighter objective function bounds, generally outperforms ordering methods commonly used in the literature when the distribution of instances is known. To the best of the authors' knowledge, this is the first paper to apply machine learning to directly improve relaxation bounds obtained by general-purpose bounding mechanisms for combinatorial optimization problems.
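To illustrate why the variable ordering matters, here is a minimal sketch of a width-limited (restricted) decision diagram for the unweighted Maximum Independent Set Problem. It is not the paper's method: the reinforcement learning component is omitted, the orderings are hand-picked, and the function name `restricted_dd_bound`, the state encoding (the set of still-eligible vertices), and the keep-the-best-nodes truncation rule are assumptions chosen for illustration.

```python
def restricted_dd_bound(adjacency, ordering, max_width):
    """Lower bound on the maximum independent set size from a restricted DD.

    adjacency: dict mapping each vertex to the set of its neighbours.
    ordering:  the order in which vertices are decided (one per layer).
    max_width: maximum number of nodes kept per layer.
    """
    # Each node is identified by its set of still-eligible vertices and
    # stores the best objective value of any path reaching it.
    layer = {frozenset(adjacency): 0}
    for v in ordering:
        next_layer = {}
        for eligible, value in layer.items():
            # Branch 0: leave v out of the independent set.
            skip = eligible - {v}
            next_layer[skip] = max(next_layer.get(skip, -1), value)
            # Branch 1: take v, which removes v and its neighbours.
            if v in eligible:
                take = eligible - {v} - adjacency[v]
                next_layer[take] = max(next_layer.get(take, -1), value + 1)
        # Restriction step: drop all but the max_width best nodes.
        # Every surviving path is still feasible, so the result is a lower bound.
        if len(next_layer) > max_width:
            best = sorted(next_layer.items(), key=lambda kv: kv[1], reverse=True)
            next_layer = dict(best[:max_width])
        layer = next_layer
    return max(layer.values())


# Path graph 0-1-2-3-4, whose maximum independent set {0, 2, 4} has size 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(restricted_dd_bound(path, [0, 1, 2, 3, 4], max_width=1))
print(restricted_dd_bound(path, [1, 3, 0, 2, 4], max_width=1))
```

With the same width limit, the two orderings produce different lower bounds on this small graph, which is the effect a learned ordering policy would try to exploit.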


A Multi-Agent Reinforcement Learning Method for Impression Allocation in Online Display Advertising

arXiv.org Artificial Intelligence

In online display advertising, guaranteed contracts and real-time bidding (RTB) are two major ways for a publisher to sell impressions. Despite the increasing popularity of RTB, half of online display advertising revenue is still generated from guaranteed contracts. Therefore, simultaneously selling impressions through both guaranteed contracts and RTB is a straightforward choice for a publisher to maximize its yield. However, deriving the optimal strategy to allocate impressions is not a trivial task, especially when the environment is unstable in real-world applications. In this paper, we formulate the impression allocation problem as an auction problem where each contract can submit virtual bids for individual impressions. With this formulation, we derive the optimal impression allocation strategy by solving for the optimal bidding functions for contracts. Since the bids from contracts are decided by the publisher, we propose a multi-agent reinforcement learning (MARL) approach to derive cooperative policies for the publisher to maximize its yield in an unstable environment. The proposed approach also resolves common challenges in MARL such as input dimension explosion, reward credit assignment, and non-stationary environments. Experimental evaluations on large-scale real datasets demonstrate the effectiveness of our approach.
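The auction view of allocation can be sketched as follows. This is only an illustration under assumptions not stated in the abstract: the impression goes to the contract with the highest virtual bid if that bid exceeds the best RTB bid, and otherwise it is sold through RTB. The function name `allocate_impression` and the tie-breaking rule (ties go to RTB) are hypothetical, and the learned bidding policies are not modeled.

```python
def allocate_impression(virtual_bids, rtb_bid):
    """Decide who receives one impression.

    virtual_bids: dict mapping contract name -> virtual bid for this impression.
    rtb_bid:      highest bid received from the real-time bidding market.
    Returns the winning contract name, or "RTB" if the market wins.
    """
    if virtual_bids:
        # Contract with the highest virtual bid (assumed allocation rule).
        best_contract = max(virtual_bids, key=virtual_bids.get)
        if virtual_bids[best_contract] > rtb_bid:
            return best_contract
    # No contract outbid the market, so the impression is sold via RTB.
    return "RTB"


print(allocate_impression({"contract_a": 1.2, "contract_b": 0.8}, rtb_bid=1.0))
print(allocate_impression({"contract_a": 1.2, "contract_b": 0.8}, rtb_bid=1.5))
```

Under this rule, the publisher steers allocation purely by choosing the virtual bids, which is what makes the bidding functions the object to optimize.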


A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents

arXiv.org Artificial Intelligence

This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent with the capability of behaving ethically. Our model allows the designers of RL agents to focus solely on the task to achieve, without having to worry about implementing the many trivial ethical patterns to follow. Based on the assumption that the majority of human behavior, regardless of which goals humans are pursuing, is ethical, our design integrates human policy with the RL policy to achieve the target objective with less chance of violating the ethical code that human beings normally obey.


Visualizing and Understanding Atari Agents

arXiv.org Artificial Intelligence

While deep reinforcement learning (deep RL) agents are effective at maximizing rewards, it is often unclear what strategies they use to do so. In this paper, we take a step toward explaining deep RL agents through a case study using Atari 2600 environments. In particular, we focus on using saliency maps to understand how an agent learns and executes a policy. We introduce a method for generating useful saliency maps and use it to show 1) what strong agents attend to, 2) whether agents are making decisions for the right or wrong reasons, and 3) how agents evolve during learning. We also test our method on non-expert human subjects and find that it improves their ability to reason about these agents. Overall, our results show that saliency information can provide significant insight into an RL agent's decisions and learning behavior.


Active Inverse Reward Design

arXiv.org Machine Learning

Reward design, the problem of selecting an appropriate reward function for an AI system, is both critically important, as it encodes the task the system should perform, and challenging, as it requires reasoning about and understanding the agent's environment in detail. AI practitioners often iterate on the reward function for their systems in a trial-and-error process to get their desired behavior. Inverse reward design (IRD) is a preference inference method that infers a true reward function from an observed, possibly misspecified, proxy reward function. This allows the system to determine when it should trust its observed reward function and respond appropriately. This has been shown to avoid problems in reward design such as negative side-effects (omitting a seemingly irrelevant but important aspect of the task) and reward hacking (learning to exploit unanticipated loopholes). In this paper, we actively select the set of proxy reward functions available to the designer. This improves the quality of inference and simplifies the associated reward design problem. We present two types of queries: discrete queries, where the system designer chooses from a discrete set of reward functions, and feature queries, where the system queries the designer for weights on a small set of features. We evaluate this approach with experiments in a personal shopping assistant domain and a 2D navigation domain. We find that our approach leads to reduced regret at test time compared with vanilla IRD. Our results indicate that actively selecting the set of available reward functions is a promising direction to improve the efficiency and effectiveness of reward design.


Probabilistic Prediction of Interactive Driving Behavior via Hierarchical Inverse Reinforcement Learning

arXiv.org Machine Learning

Autonomous vehicles (AVs) are on the road. To safely and efficiently interact with other road participants, AVs have to accurately predict the behavior of surrounding vehicles and plan accordingly. Such prediction should be probabilistic, to address the uncertainties in human behavior. Such prediction should also be interactive, since the distribution over all possible trajectories of the predicted vehicle depends not only on historical information, but also on future plans of other vehicles that interact with it. To achieve such interaction-aware predictions, we propose a probabilistic prediction approach based on hierarchical inverse reinforcement learning (IRL). First, we explicitly consider the hierarchical trajectory-generation process of human drivers involving both discrete and continuous driving decisions. Based on this, the distribution over all future trajectories of the predicted vehicle is formulated as a mixture of distributions partitioned by the discrete decisions. Then we apply IRL hierarchically to learn the distributions from real human demonstrations. A case study for the ramp-merging driving scenario is provided. The quantitative results show that the proposed approach can accurately predict both the discrete driving decisions, such as yield or pass, and the continuous trajectories.
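The mixture formulation described above can be written out as follows. The notation is assumed for illustration and does not come from the abstract: $\xi$ is the future trajectory of the predicted vehicle, $\xi_{\mathrm{hist}}$ the historical information, $\xi_{\mathrm{other}}$ the future plan of an interacting vehicle, and $\mathcal{D}$ the set of discrete decisions (for example, yield or pass).

```latex
p(\xi \mid \xi_{\mathrm{hist}}, \xi_{\mathrm{other}})
  = \sum_{d \in \mathcal{D}}
    P(d \mid \xi_{\mathrm{hist}}, \xi_{\mathrm{other}})\,
    p(\xi \mid d, \xi_{\mathrm{hist}}, \xi_{\mathrm{other}})
```

The first factor is the probability of each discrete decision and the second is the distribution over continuous trajectories given that decision, which matches the two levels at which IRL is applied.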


A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

arXiv.org Machine Learning

Risk management plays a central role in sequential decision-making problems, common in fields such as portfolio management [Lai et al., 2011], autonomous driving [Maurer et al., 2016], and healthcare [Parker, 2009]. A common risk measure is the variance of the expected sum of rewards/costs, and the mean-variance tradeoff function [Sobel, 1982; Mannor and Tsitsiklis, 2011] is one of the most widely used objective functions in risk-sensitive decision-making. Other risk-sensitive objectives have also been studied: for example, Borkar [2002] studied exponential utility functions, Tamar et al. [2012] experimented with the Sharpe Ratio measurement, Chow et al. [2018] studied value at risk (VaR) and mean-VaR optimization, Chow and Ghavamzadeh [2014], Tamar et al. [2015b], and Chow et al. [2018] investigated conditional value at risk (CVaR) and mean-CVaR optimization in a static setting, and Tamar et al. [2015a] investigated coherent risk for both linear and nonlinear system dynamics. Compared with other widely used performance measurements, such as the Sharpe Ratio and CVaR, the mean-variance measurement has explicit interpretability and computational advantages [Markowitz et al., 2000; Li and Ng, 2000]. For example, the Sharpe Ratio tends to lead to solutions with less mean return [Tamar et al., 2012].
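As a small illustration of the mean-variance tradeoff, the sketch below scores a batch of sampled episode returns as the mean minus a weighted variance. The exact form of the objective, the sign convention, and the names `mean_variance_objective` and `risk_weight` are assumptions for illustration and are not taken from the paper. The block coordinate ascent algorithm itself is not shown.

```python
import statistics


def mean_variance_objective(returns, risk_weight):
    """Empirical mean-variance score: mean(returns) - risk_weight * var(returns).

    returns:     list of total rewards, one per sampled episode.
    risk_weight: non-negative penalty on variance (assumed tradeoff parameter).
    """
    mean = statistics.fmean(returns)
    # Population variance around the empirical mean.
    variance = statistics.pvariance(returns, mean)
    return mean - risk_weight * variance


# Two policies with the same mean return but different spread.
steady = [2.0, 2.0]
volatile = [1.0, 3.0]
print(mean_variance_objective(steady, risk_weight=0.5))
print(mean_variance_objective(volatile, risk_weight=0.5))
```

Both policies have the same mean return, so the variance penalty alone decides which one scores higher.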


Addressing Sample Inefficiency and Reward Bias in Inverse Reinforcement Learning

arXiv.org Machine Learning

The Generative Adversarial Imitation Learning (GAIL) framework from Ho & Ermon (2016) is known for being surprisingly sample efficient in terms of demonstrations provided by an expert policy. However, the algorithm requires a significantly larger number of policy interactions with the environment in order to imitate the expert. In this work we address this problem by proposing a sample-efficient algorithm for inverse reinforcement learning that incorporates both off-policy reinforcement learning and adversarial imitation learning. We also show that GAIL has a number of biases associated with the choice of reward function, which can unintentionally encode prior knowledge of some tasks and prevent learning in others. We address these shortcomings by analyzing the issue and correcting invalid assumptions used when defining the learned reward function. We demonstrate that our algorithm achieves state-of-the-art performance for an inverse reinforcement learning framework on a variety of standard benchmark tasks, using demonstrations provided by both learned agents and human experts.


Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

arXiv.org Artificial Intelligence

Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to their use in policy gradient reinforcement learning. The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefits of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero. Experimental evaluation shows that VR-MCCFR brings an order of magnitude speedup, while the empirical variance decreases by three orders of magnitude. The decreased variance allows CFR+ to be used with sampling for the first time, increasing the speedup to two orders of magnitude.
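The role of a baseline can be illustrated with a single decision point. The sketch below uses a standard control-variate estimator, in which the sampled action's value is corrected by its baseline and importance-weighted, and every action's estimate starts from its baseline. This is a simplified, single-step illustration under assumed names (`baseline_corrected_estimates`, `mean_and_variance`), not the paper's full algorithm, and it omits bootstrapping along a trajectory.

```python
import random
import statistics


def baseline_corrected_estimates(true_values, baselines, sample_probs, rng):
    """Sample one action and return a value estimate for every action.

    estimate[a] = baseline[a] + 1[a sampled] * (value[a] - baseline[a]) / prob[a]
    Its expectation is value[a], so it is unbiased for any baseline.
    """
    actions = list(range(len(true_values)))
    sampled = rng.choices(actions, weights=sample_probs, k=1)[0]
    estimates = []
    for a in actions:
        correction = 0.0
        if a == sampled:
            correction = (true_values[a] - baselines[a]) / sample_probs[a]
        estimates.append(baselines[a] + correction)
    return estimates


def mean_and_variance(true_values, baselines, sample_probs, n=20000, seed=0):
    """Empirical mean and variance of the estimate for action 0."""
    rng = random.Random(seed)
    draws = [
        baseline_corrected_estimates(true_values, baselines, sample_probs, rng)[0]
        for _ in range(n)
    ]
    return statistics.fmean(draws), statistics.pvariance(draws)


values = [1.0, -2.0, 0.5]
probs = [0.5, 0.3, 0.2]
print(mean_and_variance(values, [0.0, 0.0, 0.0], probs))  # no baseline
print(mean_and_variance(values, values, probs))           # perfect baseline
```

With a zero baseline the estimate for action 0 fluctuates between 0 and 2, while with a perfect baseline the correction term vanishes and every draw equals the true value, so the variance is exactly zero.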


Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games

arXiv.org Artificial Intelligence

Regret minimization is a powerful tool for solving large-scale extensive-form games. State-of-the-art methods rely on minimizing regret locally at each decision point. In this work we derive a new framework for regret minimization on sequential decision problems and extensive-form games with general compact convex sets at each decision point and general convex losses, as opposed to prior work, which has been restricted to simplex decision points and linear losses. We call our framework laminar regret decomposition. It generalizes the CFR algorithm to this more general setting. Furthermore, our framework enables a new proof of CFR even in the known setting, which is derived from a perspective of decomposing polytope regret, thereby leading to an arguably simpler interpretation of the algorithm. Our generalization to convex compact sets and convex losses allows us to develop new algorithms for several problems: regularized sequential decision making, regularized Nash equilibria in extensive-form games, and computing approximate extensive-form perfect equilibria. Our generalization also leads to the first regret-minimization algorithm for computing reduced-normal-form quantal response equilibria based on minimizing local regrets. Experiments show that our framework leads to algorithms that scale at a rate comparable to the fastest variants of counterfactual regret minimization for computing Nash equilibrium, and therefore our approach leads to the first algorithm for computing quantal response equilibria in extremely large games. Finally, we show that our framework enables a new kind of scalable opponent exploitation approach.
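For context, the simplex-and-linear-loss setting that prior work addresses can be sketched with regret matching, a standard local regret minimizer used at each decision point in CFR. The code below covers only that known special case at a single decision point and is not the laminar regret decomposition framework. The function names are chosen for illustration.

```python
def regret_matching_strategy(cumulative_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive regret yet: fall back to uniform play.
        n = len(cumulative_regret)
        return [1.0 / n] * n
    return [p / total for p in positive]


def run_regret_matching(loss_sequence):
    """Run regret matching against a sequence of linear loss vectors.

    Returns the final strategy and the total expected loss incurred.
    """
    regret = [0.0] * len(loss_sequence[0])
    total_loss = 0.0
    for losses in loss_sequence:
        strategy = regret_matching_strategy(regret)
        expected = sum(p * l for p, l in zip(strategy, losses))
        total_loss += expected
        # Regret of action a grows by how much better a would have done.
        regret = [r + (expected - l) for r, l in zip(regret, losses)]
    return regret_matching_strategy(regret), total_loss


# Action 1 always has lower loss, so play should concentrate on it.
final_strategy, incurred = run_regret_matching([[1.0, 0.0]] * 50)
print(final_strategy, incurred)
```

Here the decision set is a probability simplex over two actions and each loss is linear in the strategy. The framework in the abstract replaces both with general compact convex sets and convex losses.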