 Reinforcement Learning


Generalization to New Actions in Reinforcement Learning

arXiv.org Artificial Intelligence

A fundamental trait of intelligence is the ability to achieve goals in the face of novel circumstances, such as making decisions from new action choices. However, standard reinforcement learning assumes a fixed set of actions and requires expensive retraining when given a new action set. To make learning agents more adaptable, we introduce the problem of zero-shot generalization to new actions. We propose a two-stage framework where the agent first infers action representations from action information acquired separately from the task. A policy flexible to varying action sets is then trained with generalization objectives. We benchmark generalization on sequential tasks, such as selecting from an unseen tool-set to solve physical reasoning puzzles and stacking towers with novel 3D shapes. Videos and code are available at https://sites.google.com/view/action-generalization
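One way to picture a policy that is "flexible to varying action sets" is to score each available action from its representation and normalize over whatever set is offered. The sketch below is a minimal illustration under that assumption and is not the architecture from the paper: the dot-product scoring rule and the names `state_features`, `action_representations`, and `action_probabilities` are hypothetical.

```python
import math
from typing import List, Sequence


def action_probabilities(
    state_features: Sequence[float],
    action_representations: Sequence[Sequence[float]],
) -> List[float]:
    """Softmax over dot-product scores, one score per available action.

    The output length follows the size of the action set, so the same
    function handles action sets of any size, including unseen actions,
    as long as each action comes with a representation vector.
    """
    scores = [
        sum(s * a for s, a in zip(state_features, rep))
        for rep in action_representations
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because nothing in the function depends on a fixed number of actions, it can be called with two action representations at training time and five at test time without any change.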


Distributional Reinforcement Learning for mmWave Communications with Intelligent Reflectors on a UAV

arXiv.org Artificial Intelligence

In this paper, a novel communication framework that uses an unmanned aerial vehicle (UAV)-carried intelligent reflector (IR) is proposed to enhance multi-user downlink transmissions over millimeter wave (mmWave) frequencies. In order to maximize the downlink sum-rate, the optimal precoding matrix (at the base station) and reflection coefficient (at the IR) are jointly derived. Next, to address the uncertainty of mmWave channels and maintain line-of-sight links in real time, a distributional reinforcement learning approach, based on quantile regression optimization, is proposed to learn the propagation environment of mmWave communications and then optimize the location of the UAV-IR so as to maximize the long-term downlink communication capacity. Simulation results show that the proposed learning-based deployment of the UAV-IR yields a significant advantage, compared to a non-learning UAV-IR, a static IR, and a direct transmission scheme, in terms of the average data rate and the achievable line-of-sight probability of downlink mmWave communications.
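Quantile regression in distributional reinforcement learning is usually built on the pinball (quantile) loss, which penalizes over- and under-estimation asymmetrically. The sketch below shows that generic loss only; the paper may use a different variant (for example a Huber-smoothed one), and the midpoint quantile levels and the function names are assumptions made for illustration.

```python
from typing import Sequence


def pinball_loss(error: float, tau: float) -> float:
    """Quantile loss for one error (target minus prediction) at level tau."""
    indicator = 1.0 if error < 0 else 0.0
    return error * (tau - indicator)


def quantile_regression_loss(
    predicted_quantiles: Sequence[float],
    target_samples: Sequence[float],
) -> float:
    """Average pinball loss over all (quantile, sample) pairs.

    Quantile i of n is assumed to sit at the midpoint level
    tau_i = (2i + 1) / (2n); this choice is an illustrative assumption.
    """
    n = len(predicted_quantiles)
    total = 0.0
    for i, theta in enumerate(predicted_quantiles):
        tau = (2 * i + 1) / (2 * n)
        for z in target_samples:
            total += pinball_loss(z - theta, tau)
    return total / (n * len(target_samples))
```

With a single predicted quantile the level is 0.5, so the loss is minimized at the median of the target samples.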


A Variant of the Wang-Foster-Kakade Lower Bound for the Discounted Setting

arXiv.org Artificial Intelligence

Recently, Wang et al. (2020) showed a highly intriguing hardness result for batch reinforcement learning (RL) with linearly realizable value function and good feature coverage in the finite-horizon case. In this note we show that once adapted to the discounted setting, the construction can be simplified to a 2-state MDP with 1-dimensional features, such that learning is impossible even with an infinite amount of data. Wang et al. (2020) recently showed that in finite-horizon batch RL, the sample complexity of evaluating a given policy π has an information-theoretic lower bound that is exponential in the horizon, even if realizable linear features are given (i.e., ϕ: S × A → R


Ocado enters non-food retail and logistics sectors with new robot acquisitions

#artificialintelligence

As the coronavirus pandemic accelerates the automation of the retail industry, Ocado Group PLC (LON:OCDO) has stepped up its investment in robotics and machine learning. The FTSE 100 group is now buying a company that specialises in an issue that Amazon's Jeff Bezos has several times described as perhaps the most difficult and last remaining element in the race to automate the retail industry. Ocado has agreed to buy Kindred Systems Inc, a US company specialising in 'piece picking' robots, for roughly US$262mln. Using artificial intelligence (AI) and deep learning, robots from Kindred and its rivals are increasingly being used by retail and logistics companies to achieve Bezos's tricky task of picking up and moving items without breaking them. Kindred robots use AI to power their vision and motion control, while the piece-picking arms are developed using 'deep reinforcement learning', a form of AI that improves the learning process for robots handling a wide variety of large, small, hard and soft items such as in grocery.


A VR film/game with AI characters can be different every time you watch or play – MIT Technology Review

#artificialintelligence

The square-faced, three-legged alien shoves and jostles to get at the enormous plant taking over its tiny planet. But each bite just makes the forbidden fruit grow bigger. Suddenly the plant's weight flips the whole sphere upside down and all the little creatures drop into space. Reach in and catch one! Agence, a short interactive VR film from Toronto-based studio Transitional Forms and the National Film Board of Canada, won't be breaking any box office records.


NEARL: Non-Explicit Action Reinforcement Learning for Robotic Control

arXiv.org Artificial Intelligence

Traditionally, reinforcement learning methods predict the next action based on the current state. However, in many situations, directly applying actions to control systems or robots is dangerous and may lead to unexpected behaviors because actions are rather low-level. In this paper, we propose a novel hierarchical reinforcement learning framework without explicit action. Our meta policy tries to manipulate the next optimal state, and the actual action is produced by the inverse dynamics model. To stabilize the training process, we integrate adversarial learning and information bottleneck into our framework. Under our framework, widely available state-only demonstrations can be exploited effectively for imitation learning. Also, prior knowledge and constraints can be applied to the meta policy. We test our algorithm, and its combination with imitation learning, in simulation tasks. The experimental results show the reliability and robustness of our algorithms.


Instance based Generalization in Reinforcement Learning

arXiv.org Machine Learning

Agents trained via deep reinforcement learning (RL) routinely fail to generalize to unseen environments, even when these share the same underlying dynamics as the training levels. Understanding the generalization properties of RL is one of the challenges of modern machine learning. Towards this goal, we analyze policy learning in the context of Partially Observable Markov Decision Processes (POMDPs) and formalize the dynamics of training levels as instances. We prove that, independently of the exploration strategy, reusing instances introduces significant changes to the effective Markov dynamics the agent observes during training. Maximizing expected rewards impacts the learned belief state of the agent by inducing undesired instance-specific speed-running policies instead of generalizable ones, which are sub-optimal on the training set. We provide generalization bounds on the value gap between train and test environments based on the number of training instances, and use insights based on these to improve performance on unseen levels. We propose training a shared belief representation over an ensemble of specialized policies, from which we compute a consensus policy that is used for data collection, disallowing instance-specific exploitation. We experimentally validate our theory, observations, and the proposed computational solution over the CoinRun benchmark.


Sample-efficient reinforcement learning using deep Gaussian processes

arXiv.org Machine Learning

Reinforcement learning provides a framework for learning to control which actions to take towards completing a task through trial-and-error. In many applications observing interactions is costly, necessitating sample-efficient learning. In model-based reinforcement learning efficiency is improved by learning to simulate the world dynamics. The challenge is that model inaccuracies rapidly accumulate over planned trajectories. We introduce deep Gaussian processes where the depth of the compositions introduces model complexity while incorporating prior knowledge on the dynamics brings smoothness and structure. Our approach is able to sample a Bayesian posterior over trajectories. We demonstrate highly improved early sample-efficiency over competing methods. This is shown across a number of continuous control tasks, including the half-cheetah whose contact dynamics have previously posed an insurmountable problem for earlier sample-efficient Gaussian process based models.


Incorporating Rivalry in Reinforcement Learning for a Competitive Game

arXiv.org Artificial Intelligence

Recent advances in reinforcement learning with social agents have allowed us to achieve human-level performance on some interaction tasks. However, most interactive scenarios do not have performance alone as their end goal; instead, the social impact of these agents when interacting with humans is equally important and, in most cases, never explored properly. This preregistration study focuses on providing a novel learning mechanism based on a rivalry social impact. Our scenario explored different reinforcement learning-based agents playing a competitive card game against human players. Based on the concept of competitive rivalry, our analysis aims to investigate if we can change the assessment of these agents from a human perspective.


Useful Policy Invariant Shaping from Arbitrary Advice

arXiv.org Artificial Intelligence

Reinforcement learning is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward; by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential based advice (DPBA) method tackles this challenge by admitting arbitrary advice from a human or other agent and improves performance without affecting the optimal policy. The main contribution of this paper is to expose, theoretically and empirically, a flaw in DPBA. Alternatively, to achieve the ideal goals, we present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES succeeds where DPBA fails.
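The policy-invariance property of potential-based reward shaping comes from the standard shaping term F(s, s') = γΦ(s') − Φ(s), whose discounted sum telescopes along any trajectory. The sketch below illustrates only that classic form; it does not implement DPBA or PIES, and the potential function and the toy trajectory in the usage note are hypothetical.

```python
from typing import Callable, Hashable, Sequence

State = Hashable


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    potential: Callable[[State], float],
    gamma: float,
) -> float:
    """Return r + F, with F = gamma * potential(next_state) - potential(state)."""
    return reward + gamma * potential(next_state) - potential(state)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards, accumulated backwards from the last step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For a trajectory s_0, ..., s_T, the shaped return equals the original return plus γ^T Φ(s_T) − Φ(s_0). That offset depends only on the start and end states, not on the actions taken in between, which is why the ranking of policies is unchanged. For example, with states 0, 1, 2, 3, rewards 0, 0, 1, γ = 0.9, and the hypothetical potential Φ(s) = s, the original return is 0.81 and the shaped return is 0.81 + 0.9³ · 3 − 0 = 2.997.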