 Reinforcement Learning


Exploratory Gradient Boosting for Reinforcement Learning in Complex Domains

arXiv.org Machine Learning

High-dimensional observations and complex real-world dynamics present major challenges in reinforcement learning for both function approximation and exploration. We address both of these challenges with two complementary techniques: First, we develop a gradient-boosting style, non-parametric function approximator for learning on $Q$-function residuals. And second, we propose an exploration strategy inspired by the principles of state abstraction and information acquisition under uncertainty. We demonstrate the empirical effectiveness of these techniques, first, as a preliminary check, on two standard tasks (Blackjack and $n$-Chain), and then on two much larger and more realistic tasks with high-dimensional observation spaces. Specifically, we introduce two benchmarks built within the game Minecraft where the observations are pixel arrays of the agent's visual field. A combination of our two algorithmic techniques performs competitively on the standard reinforcement-learning tasks while consistently and substantially outperforming baselines on the two tasks with high-dimensional observation spaces. The new function approximator, exploration strategy, and evaluation benchmarks are each of independent interest in the pursuit of reinforcement-learning methods that scale to real-world domains.
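As a rough illustration of what "gradient boosting on $Q$-function residuals" can mean, the sketch below adds one weak learner per round, fitted to the current Bellman residuals. It is not the paper's algorithm: the scalar state feature, the regression stumps, the learning rate `lr`, and the names `fit_stump` and `boosted_q` are all assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on a scalar feature (hypothetical weak learner)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boosted_q(transitions, actions, gamma=0.9, rounds=20, lr=0.5):
    """Build Q(s, a) as a sum of stumps, each fitted to Bellman residuals.

    transitions: list of (state, action, reward, next_state, done),
    where state is a single float (an assumption of this sketch).
    """
    ensemble = {a: [] for a in actions}

    def q(s, a):
        return sum(lr * h(s) for h in ensemble[a])

    for _ in range(rounds):
        new_learners = {}
        for a in actions:
            batch = [(s, r, s2, d) for s, b, r, s2, d in transitions if b == a]
            if not batch:
                continue
            xs = [s for s, _, _, _ in batch]
            # Residual = bootstrapped target minus the current estimate.
            residuals = [
                r + (0.0 if d else gamma * max(q(s2, b) for b in actions)) - q(s, a)
                for s, r, s2, d in batch
            ]
            new_learners[a] = fit_stump(xs, residuals)
        # Add all new learners only after every residual has been computed.
        for a, h in new_learners.items():
            ensemble[a].append(h)
    return q
```

Each round shrinks the remaining residual by a factor of `1 - lr` on the training transitions, which is the usual boosting behaviour for a squared-error loss.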


Differentially Private Policy Evaluation

arXiv.org Machine Learning

Learning how to make decisions under uncertainty is becoming paramount in many practical applications, such as medical treatment design, energy management, adaptive user interfaces, and recommender systems. Reinforcement learning [Sutton and Barto, 1998] provides a variety of algorithms capable of handling such tasks. However, in many practical applications, aside from obtaining good predictive performance, one might also require that the data used to learn the predictor be kept confidential. This is especially true in medical applications, where patient confidentiality is very important, and in other user-centric applications (such as recommender systems). Differential privacy (DP) [Dwork, 2006] is a very active research area that originated in cryptography and has now been embraced by the machine learning community. DP is a formal model of privacy used to design mechanisms that reduce the amount of information leaked by the results of queries to a database containing sensitive information about multiple users [Dwork, 2006].
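To make the idea of a DP mechanism concrete, here is a minimal sketch of the standard Laplace mechanism applied to a bounded mean. It is a generic textbook construction, not the policy-evaluation method of the paper; the function names, the clipping bounds, and the assumption that neighbouring databases differ by replacing one record (with the number of records fixed) are choices made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Release an epsilon-DP estimate of the mean of values clipped to [lower, upper]."""
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` gives stronger privacy and a noisier answer; the noise scale is `sensitivity / epsilon`.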


Easy Monotonic Policy Iteration

arXiv.org Machine Learning

A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or $Q$-function may fail to improve performance---or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration, that generates sequences of policies with guaranteed non-decreasing returns and is easy to implement in a sample-based framework.


Continuous control with deep reinforcement learning

arXiv.org Machine Learning

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.


A Roadmap towards Machine Intelligence

arXiv.org Artificial Intelligence

A machine capable of performing complex tasks without requiring laborious programming would be tremendously useful in almost any human endeavor, from performing menial jobs for us to helping the advancement of basic and applied research. Given the current availability of powerful hardware and large amounts of machine-readable data, as well as the widespread interest in sophisticated machine learning methods, the times should be ripe for the development of intelligent machines. Still, since "solving AI" seems too complex a task to be pursued all at once, in the last decades the computational community has preferred to focus on solving relatively narrow empirical problems that are important for specific applications, but do not address the overarching goal of developing general-purpose intelligent machines. In this article, we propose an alternative approach: we first define the general characteristics we think intelligent machines should possess, and then we present a concrete roadmap to develop them in realistic, small steps that are nonetheless incrementally structured so that, jointly, they should lead us close to the ultimate goal of implementing a powerful AI.


Meta-learning within Projective Simulation

arXiv.org Machine Learning

Learning models of artificial intelligence can nowadays perform very well on a large variety of tasks. However, in practice different task environments are best handled by different learning models, rather than a single, universal, approach. Most non-trivial models thus require the adjustment of several to many learning parameters, which is often done on a case-by-case basis by an external party. Meta-learning refers to the ability of an agent to autonomously and dynamically adjust its own learning parameters, or meta-parameters. In this work we show how projective simulation, a recently developed model of artificial intelligence, can naturally be extended to account for meta-learning in reinforcement learning settings. The projective simulation approach is based on a random walk process over a network of clips. The suggested meta-learning scheme builds upon the same design and employs clip networks to monitor the agent's performance and to adjust its meta-parameters "on the fly". We distinguish between "reflexive adaptation" and "adaptation through learning", and show the utility of both approaches. In addition, a trade-off between flexibility and learning-time is addressed. The extended model is examined on three different kinds of reinforcement learning tasks, in which the agent has different optimal values of the meta-parameters, and is shown to perform well, reaching near-optimal to optimal success rates in all of them, without ever needing to manually adjust any meta-parameter.


Experimental analysis of data-driven control for a building heating system

arXiv.org Artificial Intelligence

Driven by the opportunity to harvest the flexibility related to building climate control for demand response applications, this work presents a data-driven control approach building upon recent advancements in reinforcement learning. More specifically, model-assisted batch reinforcement learning is applied to the setting of building climate control subject to dynamic pricing. The underlying sequential decision-making problem is cast as a Markov decision problem, after which the control algorithm is detailed. In this work, fitted Q-iteration is used to construct a policy from a batch of experimental tuples. In those regions of the state space where the experimental sample density is low, virtual support samples are added using an artificial neural network. Finally, the resulting policy is shaped using domain knowledge. The control approach has been evaluated quantitatively using a simulation and qualitatively in a living lab. The quantitative analysis shows that the control approach converges in approximately 20 days to a control policy with a performance within 90% of the mathematical optimum. The experimental analysis confirms that within 10 to 20 days sensible policies are obtained that can be used for different outside temperature regimes.
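For readers unfamiliar with fitted Q-iteration, the following is a minimal sketch of the batch loop: repeatedly build regression targets from the stored tuples and refit the $Q$-function. To stay self-contained it assumes discrete, hashable states and uses a per-cell average as the "regressor"; the paper's setting (continuous building states, neural-network virtual samples, policy shaping) is not reproduced, and the names `fitted_q_iteration` and `greedy_policy` are hypothetical.

```python
from collections import defaultdict


def fitted_q_iteration(batch, actions, gamma=0.95, iterations=50):
    """Fitted Q-iteration over tuples (state, action, reward, next_state, done).

    The regression step here is a table of target means per (state, action),
    standing in for whatever function approximator one would actually use.
    """
    q = defaultdict(float)
    for _ in range(iterations):
        sums, counts = defaultdict(float), defaultdict(int)
        for s, a, r, s2, done in batch:
            # Bootstrapped target computed from the previous iteration's Q.
            target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
            sums[(s, a)] += target
            counts[(s, a)] += 1
        q = defaultdict(float, {k: sums[k] / counts[k] for k in sums})
    return q


def greedy_policy(q, actions):
    """Return a policy that picks the action with the highest fitted Q-value."""
    return lambda s: max(actions, key=lambda a: q[(s, a)])
```

Because the whole batch is reused at every iteration, no further interaction with the system is needed once the tuples have been collected.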


Generalization and Exploration via Randomized Value Functions

arXiv.org Machine Learning

We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions. We explain why versions of least-squares value iteration that use Boltzmann or epsilon-greedy exploration can be highly inefficient, and we present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI. Further, we establish an upper bound on the expected regret of RLSVI that demonstrates near-optimality in a tabula rasa learning context. More broadly, our results suggest that randomized value functions offer a promising approach to tackling a critical challenge in reinforcement learning: synthesizing efficient exploration and effective generalization.


Active Information Acquisition

arXiv.org Machine Learning

We propose a general framework for sequential and dynamic acquisition of useful information in order to solve a particular task. While our goal could in principle be tackled by general reinforcement learning, our particular setting is constrained enough to allow more efficient algorithms. In this paper, we work under the Learning to Search framework and show how to formulate the goal of finding a dynamic information acquisition policy in that framework. We apply our formulation to two tasks, sentiment analysis and image recognition, and show that the learned policies exhibit good statistical performance. As an emergent byproduct, the learned policies show a tendency to focus on the most prominent parts of each instance and give harder instances more attention without explicitly being trained to do so.


Quantum machine learning with glow for episodic tasks and decision games

arXiv.org Artificial Intelligence

We consider a general class of models, where a reinforcement learning (RL) agent learns from cyclic interactions with an external environment via classical signals. Perceptual inputs are encoded as quantum states, which are subsequently transformed by a quantum channel representing the agent's memory, while the outcomes of measurements performed at the channel's output determine the agent's actions. The learning takes place via stepwise modifications of the channel properties. They are described by an update rule that is inspired by the projective simulation (PS) model and equipped with a glow mechanism that allows for a backpropagation of policy changes, analogous to the eligibility traces in RL and edge glow in PS. In this way, the model combines features of PS with the ability for generalization, offered by its physical embodiment as a quantum system. We apply the agent to various setups of an invasion game and a grid world, which serve as elementary model tasks allowing a direct comparison with a basic classical PS agent.