 Reinforcement Learning


Vicarious - General Game Playing with Schema Networks

#artificialintelligence

The success of deep reinforcement learning (deep RL) in playing games has resulted in a large amount of excitement in the AI community and beyond (Mnih et al., 2015; Mnih et al., 2016; Silver et al., 2016; Van Hasselt et al., 2016). State-of-the-art scores in many different games have now surpassed human level. But to what extent do these feats demonstrate that the AI has developed a human-like understanding of the objectives of the game? When humans play a new game, they first develop a conceptual understanding of the game. Suppose you were seeing a game like Breakout (see below) for the first time.


Deep Active Learning for Dialogue Generation

arXiv.org Artificial Intelligence

We propose an online, end-to-end, neural generative conversational model for open-domain dialogue. It is trained using a unique combination of offline two-phase supervised learning and online human-in-the-loop active learning. While most existing research proposes offline supervision or hand-crafted reward functions for online reinforcement, we devise a novel interactive learning mechanism based on hamming-diverse beam search for response generation and one-character user-feedback at each step. Experiments show that our model inherently promotes the generation of semantically relevant and interesting responses, and can be used to train agents with customized personas, moods and conversational styles.


Sorry humans, Microsoft's AI is the first to reach a perfect Ms. Pac-Man score

#artificialintelligence

At long last, the perfect score for arcade classic Ms. Pac-Man has been achieved, though not by a human. Maluuba, a deep learning team acquired by Microsoft in January, has created an AI system that has learned how to reach the game's maximum point value of 999,990 on Atari 2600, using a combination of reinforcement learning and a divide-and-conquer method. AI researchers have a documented penchant for using video games to test machine learning; within a controlled environment, such games mimic real-world chaos better than more static games like chess do. In 2015, Google's DeepMind AI was able to learn how to master 49 Atari games using reinforcement learning, which provides positive or negative feedback each time the AI attempts to solve a problem. Though AI has conquered a wealth of retro games, Ms. Pac-Man has remained elusive for years, due to the game's intentional lack of predictability.


Microsoft's AI earns perfect Ms Pac-Man score

#artificialintelligence

Some tasks are just too complex, too nuanced to tackle all at once, like beating all 256 levels of Ms. Pac-Man on the Atari 2600 while earning a perfect score of 999,990. That's why Microsoft didn't even try to train its AI to take it on in one go. Instead the company, as it announced on Wednesday, split this monumental challenge up into smaller, chomp-sized pieces and trained a hivemind of 150 AIs to accomplish it as a team. Developed by Maluuba, a Canadian AI firm that Microsoft recently acquired, the AI system relies on reinforcement learning to develop its strategy. Reinforcement learning is an AI training technique wherein the algorithm is rewarded for actions that lead to better outcomes and discouraged from those that lead to worse ones, based on previously observed outcomes.
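
To make the "team of AIs" idea concrete, here is a minimal sketch of one simple way several sub-agents could be combined into a single decision. It assumes each sub-agent reports a numeric score per action and a top-level aggregator sums those scores; the function `aggregate_action`, the three sub-agents, and their scores are hypothetical illustrations, not a description of Maluuba's actual system.

```python
from typing import Dict, List

Action = str


def aggregate_action(preferences: List[Dict[Action, float]]) -> Action:
    """Return the action with the highest summed score across all sub-agents."""
    totals: Dict[Action, float] = {}
    for agent_scores in preferences:
        for action, score in agent_scores.items():
            totals[action] = totals.get(action, 0.0) + score
    # Pick the action whose combined score is largest.
    return max(totals, key=lambda action: totals[action])


# Hypothetical sub-agents, each focused on a single object in the maze.
pellet_agent = {"left": 1.0, "right": 0.2}   # a pellet lies to the left
ghost_agent = {"left": -5.0, "right": 0.0}   # but so does a ghost
fruit_agent = {"left": 0.5, "right": 0.1}

choice = aggregate_action([pellet_agent, ghost_agent, fruit_agent])
print(choice)
```

In this toy setup the strong negative score from the ghost-focused agent outweighs the two mildly positive scores for going left, so the aggregator chooses `right`.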


Learning from Human Preferences

#artificialintelligence

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind's safety team, we've developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better. We present a learning algorithm that uses small amounts of human feedback to solve modern RL environments. Machine learning systems with human feedback have been explored before, but we've scaled up the approach to be able to work on much more complicated tasks. Our algorithm needed 900 bits of feedback from a human evaluator to learn to backflip -- a seemingly simple task which is simple to judge but challenging to specify.
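
As a rough illustration of how a "which of these two is better" judgment can become a training signal, the sketch below assumes a reward model that assigns a number to each step of a behavior, and treats the probability that a human prefers behavior A over behavior B as a logistic function of the difference in summed predicted rewards (a Bradley-Terry style model). The function names and the example numbers are assumptions made for illustration only.

```python
import math
from typing import Sequence


def preference_probability(rewards_a: Sequence[float],
                           rewards_b: Sequence[float]) -> float:
    """Probability that behavior A is preferred over behavior B,
    given the reward model's per-step predictions for each."""
    diff = sum(rewards_a) - sum(rewards_b)
    # Logistic function of the difference in total predicted reward.
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(rewards_a: Sequence[float],
                    rewards_b: Sequence[float],
                    human_prefers_a: bool) -> float:
    """Cross-entropy loss for a single human comparison."""
    p_a = preference_probability(rewards_a, rewards_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)


# Hypothetical per-step predictions for two short clips.
clip_a = [0.5, 0.5, 1.0]
clip_b = [0.0, 0.5, 0.5]
print(preference_probability(clip_a, clip_b))
print(preference_loss(clip_a, clip_b, human_prefers_a=True))
```

Minimizing this loss over many comparisons pushes the reward model to score preferred behaviors higher, and a standard RL algorithm can then optimize against the learned reward.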


Augmenting Decisions of Taxi Drivers through Reinforcement Learning for Improving Revenues

AAAI Conferences

Taxis (which include cars working with car aggregation systems such as Uber, Grab, Lyft, etc.) have become a critical component of urban transportation. While most research and applications in the context of taxis have focused on improving performance from a customer perspective, in this paper, we focus on improving performance from a taxi driver perspective. Higher revenues for taxi drivers can help bring more drivers into the system, thereby improving availability for customers in dense urban cities. Typically, when there is no customer on board, taxi drivers will cruise around to find customers either directly (on the street) or indirectly (due to a request from a nearby customer by phone or on aggregation systems). For such cruising taxis, we develop a Reinforcement Learning (RL) based system that learns from real trajectory logs of drivers to advise them on the right locations to find customers so as to maximize their revenue. There are multiple translational challenges involved in building this RL system based on real data, such as annotating the activities (e.g., roaming, going to a taxi stand, etc.) observed in trajectory logs, identifying the right features for the state and action spaces, and evaluating against real driver performance observed in the dataset. We also provide a dynamic abstraction mechanism to improve the basic learning mechanism. Finally, we provide a thorough evaluation on a real-world data set from a developed Asian city and demonstrate that an RL based system can provide significant benefits to the drivers.


Approximately-Optimal Queries for Planning in Reward-Uncertain Markov Decision Processes

AAAI Conferences

When planning actions to take on behalf of its human operator, a robot might be uncertain about its operator's reward function. We address the problem of how the robot should formulate an (approximately) optimal query to pose to the operator, given how its uncertainty affects which policies it should plan to pursue. We explain how a robot whose queries ask the operator to choose the best from among k choices can, without loss of optimality, restrict consideration to choices only over alternative policies. Further, we present a method for constructing an approximately-optimal policy query that enjoys a performance bound, where the method need not enumerate all policies. Finally, because queries posed to the operator of a robotic system are often expressed in terms of preferences over trajectories rather than policies, we show how our constructed policy query can be projected into the space of trajectory queries. Our empirical results demonstrate that our projection technique can outperform other known techniques for choosing trajectory queries, particularly when the number of trajectories the operator is asked to compare is small.


Accelerated Reinforcement Learning Algorithms with Nonparametric Function Approximation for Opportunistic Spectrum Access

arXiv.org Machine Learning

We study the problem of throughput maximization by predicting spectrum opportunities using reinforcement learning. Our kernel-based reinforcement learning approach is coupled with a sparsification technique that efficiently captures the environment states to control dimensionality and finds the best possible channel access actions based on the current state. This approach allows learning and planning over the intrinsic state-action space and extends well to large state and action spaces. For stationary Markov environments, we derive the optimal policy for channel access, its associated limiting throughput, and propose a fast online algorithm for achieving the optimal throughput. We then show that the maximum-likelihood channel prediction and access algorithm is suboptimal in general, and derive conditions under which the two algorithms are equivalent. For reactive Markov environments, we derive kernel variants of Q-learning, R-learning and propose an accelerated R-learning algorithm that achieves faster convergence. We finally test our algorithms against a generic reactive network. Simulation results are shown to validate the theory and show the performance gains over current state-of-the-art techniques.


An Alternative Softmax Operator for Reinforcement Learning

arXiv.org Artificial Intelligence

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
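
The "somewhat like a max, somewhat like an average" behavior can be seen in a small sketch. The Boltzmann operator below is the standard exponentially weighted average. The second function assumes that the differentiable operator the abstract refers to is the log-average-exp form the paper calls mellowmax; that name and formula do not appear in the excerpt above and are included here as an assumption. The parameter names `beta` and `omega` are also illustrative.

```python
import math
from typing import Sequence


def boltzmann_softmax(values: Sequence[float], beta: float) -> float:
    """Exponentially weighted average of values; beta=0 gives the mean,
    large beta approaches the max."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def mellowmax(values: Sequence[float], omega: float) -> float:
    """log(mean(exp(omega * v))) / omega, for omega > 0 (assumed form)."""
    m = max(values)  # factor out exp(omega * m) for numerical stability
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


q_values = [1.0, 2.0, 3.0]
print(boltzmann_softmax(q_values, beta=0.0))   # behaves like an average
print(boltzmann_softmax(q_values, beta=50.0))  # close to the maximum
print(mellowmax(q_values, omega=1.0))          # between mean and max
```

Both operators interpolate between the mean and the maximum as their parameter grows; the abstract's claim is that only the newer operator has the non-expansion property needed for convergence guarantees.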


[R] Deep Reinforcement Learning from Human Preferences • r/MachineLearning

@machinelearnbot

Abstract: For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time.