Latent Bandits Revisited
Hong, Joey, Kveton, Branislav, Zaheer, Manzil, Chow, Yinlam, Ahmed, Amr, Boutilier, Craig
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state. The primary goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning---complex models can be learned offline with the agent identifying latent state online---of practical relevance in, say, recommender systems. In this work, we propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling. Our methods are contextual and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions. A comprehensive empirical study showcases the advantages of our approach.
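To make the setting concrete, here is a minimal sketch of the Thompson-sampling variant of this idea, assuming Bernoulli rewards and known per-state arm means `mu[s, a]`; the variable names and belief update are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 3, 5, 2000                         # latent states, arms, rounds
mu = rng.uniform(0.1, 0.9, size=(S, A))      # known arm means per latent state
true_s = rng.integers(S)                     # unknown latent state

belief = np.full(S, 1.0 / S)                 # posterior over latent states
for t in range(T):
    s_hat = rng.choice(S, p=belief)          # Thompson step: sample a latent state
    a = int(np.argmax(mu[s_hat]))            # act optimally for the sampled state
    r = float(rng.random() < mu[true_s, a])  # Bernoulli reward from the true state
    # Bayes update of the latent-state posterior given the observed reward
    like = np.where(r == 1.0, mu[:, a], 1.0 - mu[:, a])
    belief = belief * like
    belief /= belief.sum()
```

Once the posterior concentrates on the true latent state, the agent plays that state's best arm, which is why regret scales with the number of latent states rather than the number of actions.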
Differentiable Meta-Learning in Contextual Bandits
Kveton, Branislav, Mladenov, Martin, Hsu, Chih-Wei, Zaheer, Manzil, Szepesvari, Csaba, Boutilier, Craig
We study a contextual bandit setting where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$. The goal of the agent is to achieve high reward on average over the instances drawn from $\mathcal{P}$. This setting is of particular importance because it formalizes the offline optimization of bandit policies to perform well on average over anticipated bandit instances. The main idea in our work is to optimize differentiable bandit policies by policy gradients. We derive reward gradients that reflect the structure of our problem, and propose contextual policies that are parameterized in a differentiable way and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction, learned biases, and the practicality of our approach on a range of classification tasks.
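As a rough illustration of optimizing a differentiable bandit policy by policy gradients, the sketch below trains a softmax policy over arms with REINFORCE-style updates and baseline subtraction; the toy Beta prior, the baseline choice, and all names are assumptions for illustration, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
A, T, iters, lr = 4, 50, 300, 0.5
theta = np.zeros(A)                      # differentiable policy parameters (softmax logits)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(iters):
    mu = rng.beta(1, 1, size=A)          # sample a bandit instance from the prior P
    pi = softmax(theta)
    grad, ret = np.zeros(A), 0.0
    # Illustrative baseline: expected reward under pi (available here because
    # instances are sampled at training time); subtracting it reduces variance.
    baseline = pi @ mu
    for t in range(T):
        a = rng.choice(A, p=pi)
        r = float(rng.random() < mu[a])
        score = -pi                      # d log pi(a) / d theta for a softmax policy
        score[a] += 1.0
        grad += (r - baseline) * score
        ret += r
    theta += lr * grad / T               # REINFORCE-style ascent on average reward
```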
Differentiable Bandit Exploration
Boutilier, Craig, Hsu, Chih-Wei, Kveton, Branislav, Mladenov, Martin, Szepesvari, Csaba, Zaheer, Manzil
We learn bandit policies that maximize the average reward over bandit instances drawn from an unknown distribution $\mathcal{P}$, from a sample from $\mathcal{P}$. Our approach is an instance of meta-learning and its appeal is that the properties of $\mathcal{P}$ can be exploited without restricting it. We parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is easy to implement and pleasantly general. Then the challenge is to design effective gradient estimators and good policy classes. To make policy gradients practical, we introduce novel variance reduction techniques. We experiment with various bandit policy classes, including neural networks and a novel soft-elimination policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments highlight the versatility of our approach. We also observe that neural network policies can learn implicit biases, which are only expressed through sampled bandit instances during training.
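One way to read the soft-elimination idea: arms that look implausibly optimal under confidence intervals are not dropped outright but played with a small probability `eps`, which is exactly the kind of knob that could be meta-learned. The sketch below is a guess at the mechanism from the abstract, assuming Bernoulli rewards and standard confidence widths:

```python
import numpy as np

rng = np.random.default_rng(2)
A, T, eps = 5, 1000, 0.05                # eps is the tunable "softness" parameter
mu = rng.uniform(size=A)                 # one bandit instance

# Initialize by pulling each arm once.
n = np.ones(A)
s = np.array([float(rng.random() < mu[a]) for a in range(A)])

for t in range(A, T):
    mean = s / n
    width = np.sqrt(np.log(T) / n)       # confidence width per arm
    # Soft elimination: arms whose upper bound trails the best lower bound
    # are not eliminated, only down-weighted to probability proportional to eps.
    plausible = mean + width >= np.max(mean - width)
    probs = np.where(plausible, 1.0, eps)
    probs = probs / probs.sum()
    a = rng.choice(A, p=probs)
    r = float(rng.random() < mu[a])
    n[a] += 1.0
    s[a] += r
```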
Data center cooling using model-predictive control
Lazic, Nevena, Boutilier, Craig, Lu, Tyler, Wong, Eehern, Roy, Binz, Ryu, MK, Imwalle, Greg
Despite impressive recent advances in reinforcement learning (RL), its deployment in real-world physical systems is often complicated by unexpected events, limited data, and the potential for expensive failures. In this paper, we describe an application of RL "in the wild" to the task of regulating temperatures and airflow inside a large-scale data center (DC). Adopting a data-driven, model-based approach, we demonstrate that an RL agent with little prior knowledge is able to effectively and safely regulate conditions on a server floor after just a few hours of exploration, while improving operational efficiency relative to existing PID controllers.
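A toy stand-in for the data-driven, model-based approach is random-shooting model-predictive control over a learned linear dynamics model: sample candidate action sequences, roll them out through the model, and execute the first action of the cheapest sequence. The dynamics, cost, and dimensions below are placeholders, not the paper's DC model:

```python
import numpy as np

rng = np.random.default_rng(3)
dx, du, H, K = 4, 2, 10, 256              # state dim, action dim, horizon, candidates

# Stand-in learned dynamics x' = A x + B u (fit offline from exploration data).
A = np.eye(dx) * 0.9 + rng.normal(scale=0.01, size=(dx, dx))
B = rng.normal(scale=0.1, size=(dx, du))
x_target = np.zeros(dx)                   # e.g., temperature setpoints

def cost(xs):
    return np.sum((xs - x_target) ** 2)   # track setpoints over the horizon

def mpc_action(x, lo=-1.0, hi=1.0):
    """Random-shooting MPC: sample K action sequences, roll out the model,
    and return the first action of the cheapest sequence."""
    best_u, best_c = None, np.inf
    for _ in range(K):
        us = rng.uniform(lo, hi, size=(H, du))
        xs, xt = [], x
        for u in us:
            xt = A @ xt + B @ u
            xs.append(xt)
        c = cost(np.array(xs))
        if c < best_c:
            best_c, best_u = c, us[0]
    return best_u

u = mpc_action(rng.normal(size=dx))       # one control step; replan every step
```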
Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing
Liu, Ge, Wu, Rui, Cheng, Heng-Tze, Wang, Jing, Ooi, Jayden, Li, Lihong, Li, Ang, Li, Wai Lok Sibon, Boutilier, Craig, Chi, Ed
Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments. However, training deep RL models is challenging in real-world applications such as production-scale health-care or recommender systems, because of the expense of interaction and limited budgets at deployment. One aspect of this data inefficiency comes from the expensive hyper-parameter tuning required when optimizing deep neural networks. We propose Adaptive Behavior Policy Sharing (ABPS), a data-efficient training algorithm that allows sharing of experience collected by a behavior policy that is adaptively selected from a pool of agents trained with an ensemble of hyper-parameters. We further extend ABPS to evolve hyper-parameters during training by hybridizing ABPS with an adapted version of Population Based Training (ABPS-PBT). We conduct experiments with multiple Atari games with up to 16 hyper-parameter/architecture setups. ABPS achieves superior overall performance, reduced variance among the top 25% of agents, and performance on the best agent equivalent to conventional hyper-parameter tuning with independent training, even though ABPS requires only the same number of environment interactions as training a single agent. We also show that ABPS-PBT further improves convergence speed and reduces variance.
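A minimal sketch of the adaptive behavior-policy selection loop, assuming an epsilon-greedy selector over recent episodic returns and a shared replay buffer; the simulated returns and the selector are illustrative stand-ins for ABPS's actual mechanism:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(4)
n_agents, T, eps = 4, 500, 0.1

# Stand-in "agents": each corresponds to one hyper-parameter setup with an
# unknown mean episodic return (a real run would train a network per agent).
true_return = rng.uniform(size=n_agents)
recent = [deque(maxlen=20) for _ in range(n_agents)]   # recent returns per agent
shared_replay = []                                     # experience shared by all agents

for t in range(T):
    # Adaptive behavior-policy selection: epsilon-greedy over recent average returns.
    if rng.random() < eps or not all(recent):
        b = rng.integers(n_agents)
    else:
        b = int(np.argmax([np.mean(q) for q in recent]))
    # Run one episode with the selected behavior policy (simulated here).
    ep_return = true_return[b] + rng.normal(scale=0.1)
    recent[b].append(ep_return)
    shared_replay.append((b, ep_return))  # every agent trains from this shared pool
```

The key property is that the pool consumes only as many environment interactions as a single agent, since one behavior policy acts per episode while all agents learn off-policy from the shared experience.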
Gradient-based Optimization for Bayesian Preference Elicitation
Vendrov, Ivan, Lu, Tyler, Huang, Qingqing, Boutilier, Craig
Effective techniques for eliciting user preferences have taken on added importance as recommender systems (RSs) become increasingly interactive and conversational. A common and conceptually appealing Bayesian criterion for selecting queries is expected value of information (EVOI). Unfortunately, it is computationally prohibitive to construct queries with maximum EVOI in RSs with large item spaces. We tackle this issue by introducing a continuous formulation of EVOI as a differentiable network that can be optimized using gradient methods available in modern machine learning (ML) computational frameworks (e.g., TensorFlow, PyTorch). We exploit this to develop a novel, scalable Monte Carlo method for EVOI optimization, which is more scalable for large item spaces than methods requiring explicit enumeration of items. While we emphasize the use of this approach for pairwise (or $k$-wise) comparisons of items, we also demonstrate how our method can be adapted to queries involving subsets of item attributes or "partial items," which are often more cognitively manageable for users. Experiments show that our gradient-based EVOI technique achieves state-of-the-art performance across several domains while scaling to large item spaces.
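A small sketch of a Monte Carlo EVOI objective for a continuous pairwise query under a logistic response model, with finite differences standing in for the autodiff that TensorFlow or PyTorch would provide; the response model, posterior-sample reweighting, and all dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
d, M, n_items = 3, 200, 50
W = rng.normal(size=(M, d))              # Monte Carlo samples from the utility posterior
items = rng.normal(size=(n_items, d))    # item embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def evoi(q):
    """Monte Carlo EVOI of a continuous pairwise query q = (a, b)."""
    a, b = q[:d], q[d:]
    p = sigmoid(W @ (a - b))             # P(user prefers a | w), per posterior sample
    prior_best = np.max(np.mean(W @ items.T, axis=0))
    val = 0.0
    for resp_p in (p, 1.0 - p):          # two possible responses: "a" or "b"
        wgt = resp_p / resp_p.sum()      # reweighted posterior given the response
        val += resp_p.mean() * np.max(wgt @ (W @ items.T))
    return val - prior_best              # expected gain over acting with the prior

# Gradient ascent on the continuous query; no item enumeration over queries needed.
q = rng.normal(size=2 * d)
for _ in range(100):
    g = np.array([(evoi(q + 1e-4 * e) - evoi(q - 1e-4 * e)) / 2e-4
                  for e in np.eye(2 * d)])
    q += 0.1 * g
```

The optimized continuous query would then be mapped back to actual items or partial items, which is where the "partial item" queries in the abstract come in.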
CAQL: Continuous Action Q-Learning
Ryu, Moonkyung, Chow, Yinlam, Anderson, Ross, Tjandraatmadja, Christian, Boutilier, Craig
Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for the optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which, despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.

When the action space is finite, value-based algorithms such as Q-learning (Watkins & Dayan, 1992), which implicitly find a policy by learning the optimal value function, are often very efficient because action optimization can be done by exhaustive enumeration. By contrast, in problems with continuous action spaces (e.g., robotics (Peters & Schaal, 2006)), policy-based algorithms, such as policy gradient (PG) (Sutton et al., 2000; Silver et al., 2014) or cross-entropy policy search (CEPS) (Mannor et al., 2003; Kalashnikov et al., 2018), which directly learn a return-maximizing policy, have proven more practical. Recently, methods such as ensemble critics (Fujimoto et al., 2018) and entropy regularization (Haarnoja et al., 2018) have been developed to improve the performance of policy-based RL algorithms. Policy-based approaches require a reasonable choice of policy parameterization. In some continuous control problems, Gaussian distributions over actions conditioned on some state representation are used. However, in applications such as recommender systems (RSs), where actions often take the form of high-dimensional item-feature vectors, policies cannot typically be modeled by common action distributions.
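For intuition, the inner max-Q problem and one of the approximate plug-and-play optimizers (projected gradient ascent with random restarts) can be sketched as below; a MIP solver would replace `max_q` to solve the ReLU case exactly. The tiny Q-network and all hyper-parameters are placeholders:

```python
import numpy as np

rng = np.random.default_rng(6)
dx, du = 3, 2

# Stand-in Q-network: one ReLU hidden layer over the (state, action) pair.
W1 = rng.normal(scale=0.5, size=(16, dx + du))
b1 = rng.normal(scale=0.1, size=16)
w2 = rng.normal(scale=0.5, size=16)

def q_and_grad_a(x, a):
    """Q(x, a) and its gradient w.r.t. the action, through the ReLU layer."""
    h_pre = W1 @ np.concatenate([x, a]) + b1
    q = w2 @ np.maximum(h_pre, 0.0)
    da = ((w2 * (h_pre > 0)) @ W1)[dx:]
    return q, da

def max_q(x, steps=200, lr=0.05, lo=-1.0, hi=1.0, restarts=8):
    """Approximate max_a Q(x, a) by projected gradient ascent with restarts."""
    best_q, best_a = -np.inf, None
    for _ in range(restarts):
        a = rng.uniform(lo, hi, size=du)
        for _ in range(steps):
            _, da = q_and_grad_a(x, a)
            a = np.clip(a + lr * da, lo, hi)   # project onto the action box
        q, _ = q_and_grad_a(x, a)
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a

best_q, best_a = max_q(rng.normal(size=dx))    # used inside each Bellman backup
```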
RecSim: A Configurable Simulation Platform for Recommender Systems
Ie, Eugene, Hsu, Chih-wei, Mladenov, Martin, Jain, Vihan, Narvekar, Sanmit, Wang, Jing, Wu, Rui, Boutilier, Craig
We propose RecSim, a configurable platform for authoring simulation environments for recommender systems (RSs) that naturally supports sequential interaction with users. RecSim allows the creation of new environments that reflect particular aspects of user behavior and item structure at a level of abstraction well-suited to pushing the limits of current reinforcement learning (RL) and RS techniques in sequential interactive recommendation problems. Environments can be easily configured to vary assumptions about: user preferences and item familiarity; user latent state and its dynamics; and choice models and other user response behavior. We outline how RecSim offers value to RL and RS researchers and practitioners, and how it can serve as a vehicle for academic-industrial collaboration.
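The kind of abstraction described here can be illustrated with a generic sequential-recommendation loop: a hidden user interest vector, a multinomial-logit choice model over a slate, and simple interest dynamics. This is an illustrative sketch in the same spirit, not RecSim's actual API:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n_docs, slate_k, horizon = 4, 20, 3, 50

docs = rng.uniform(size=(n_docs, d))   # item features (e.g., topic mixtures)
user = rng.uniform(size=d)             # latent user interest vector (hidden from agent)

def choice_model(slate):
    """Multinomial-logit choice over the slate plus a 'no click' option."""
    scores = np.exp(docs[slate] @ user)
    probs = np.append(scores, 1.0) / (scores.sum() + 1.0)
    pick = rng.choice(len(slate) + 1, p=probs)
    return None if pick == len(slate) else slate[pick]

for t in range(horizon):
    slate = rng.choice(n_docs, size=slate_k, replace=False)  # agent action: a slate
    clicked = choice_model(slate)
    if clicked is not None:
        # Latent-state dynamics: interests drift toward consumed content.
        user = 0.9 * user + 0.1 * docs[clicked]
```

Swapping out the choice model, the latent-state transition, or the document model is exactly the kind of configuration the platform is designed to make easy.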
Randomized Exploration in Generalized Linear Bandits
Kveton, Branislav, Zaheer, Manzil, Szepesvari, Csaba, Li, Lihong, Ghavamzadeh, Mohammad, Boutilier, Craig
We study two randomized algorithms for generalized linear bandits, GLM-TSL and GLM-FPL. GLM-TSL samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. GLM-FPL, a new algorithm proposed in this work, fits a GLM to a randomly perturbed history of past rewards. We prove an $\tilde{O}(d \sqrt{n} + d^2)$ upper bound on the $n$-round regret of GLM-TSL, where $d$ is the number of features. This is the first regret bound of a Thompson sampling-like algorithm in GLM bandits where the leading term is $\tilde{O}(d \sqrt{n})$. We apply both GLM-TSL and GLM-FPL to logistic and neural network bandits, and show that they perform well empirically. In more complex models, GLM-FPL is significantly faster. Our results showcase the role of randomization, beyond posterior sampling, in exploration.
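GLM-FPL's perturbed-history step is simple enough to sketch directly: perturb past rewards with Gaussian noise, refit the GLM, and act greedily under the perturbed fit. The logistic model, the simple gradient-based fitting routine, and the perturbation scale `a_noise` below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)
d, K, T, a_noise = 3, 10, 500, 1.0
X = rng.normal(size=(K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm arm features
theta_star = rng.normal(size=d)                 # unknown GLM parameter

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_glm(feats, ys, steps=50, lr=0.5):
    """A few gradient-ascent steps on the L2-regularized logistic log-likelihood."""
    th = np.zeros(d)
    for _ in range(steps):
        g = feats.T @ (ys - sigmoid(feats @ th)) - th
        th += lr * g / max(len(ys), 1)
    return th

hist_x, hist_y = [], []
for t in range(T):
    if t < K:
        a = t                                    # initialize: pull each arm once
    else:
        # GLM-FPL step: perturb the reward history, refit the GLM, act greedily.
        ys = np.array(hist_y) + a_noise * rng.normal(size=len(hist_y))
        theta = fit_glm(np.array(hist_x), ys)
        a = int(np.argmax(sigmoid(X @ theta)))
    r = float(rng.random() < sigmoid(X[a] @ theta_star))
    hist_x.append(X[a])
    hist_y.append(r)
```

The perturbation plays the role that posterior sampling plays in GLM-TSL: randomness in the fitted model drives exploration, but only a point fit is required, which is why it extends cheaply to more complex models.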
Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology
Ie, Eugene, Jain, Vihan, Wang, Jing, Narvekar, Sanmit, Agarwal, Ritesh, Wu, Rui, Cheng, Heng-Tze, Lustman, Morgane, Gatto, Vince, Covington, Paul, McFadden, Jim, Chandra, Tushar, Boutilier, Craig
Recommender systems have become ubiquitous, transforming user interactions with products, services and content in a wide variety of domains. In content recommendation, recommenders generally surface relevant and/or novel personalized content based on learned models of user preferences (e.g., as in collaborative filtering [Breese et al., 1998, Konstan et al., 1997, Srebro et al., 2004, Salakhutdinov and Mnih, 2007]) or predictive models of user responses to specific recommendations. Well-known applications of recommender systems include video recommendations on YouTube [Covington et al., 2016], movie recommendations on Netflix [Gomez-Uribe and Hunt, 2016] and playlist construction on Spotify [Jacobson et al., 2016]. It is increasingly common to train deep neural networks (DNNs) [van den Oord et al., 2013, Wang et al., 2015, Covington et al., 2016, Cheng et al., 2016] to predict user responses (e.g., click-through rates, content engagement, ratings, likes) to generate, score and serve candidate recommendations. Practical recommender systems largely focus on myopic prediction---estimating a user's immediate response to a recommendation---without considering the long-term impact on subsequent user behavior. This can be limiting: modeling a recommendation's stochastic impact on the future affords opportunities to trade off user engagement in the near-term for longer-term benefit (e.g., by probing a user's interests, or improving satisfaction).