 Reinforcement Learning


Data-driven deep reinforcement learning

Robohub

One of the primary factors behind the success of machine learning approaches in open world settings, such as image recognition and natural language processing, has been the ability of high-capacity deep neural network function approximators to learn generalizable models from large amounts of data. Deep reinforcement learning methods, however, require active online data collection, where the model actively interacts with its environment. This makes such methods hard to scale to complex real-world problems, where active data collection means that large datasets of experience must be collected for every experiment – this can be expensive and, for systems such as autonomous vehicles or robots, potentially unsafe. In a number of domains of practical interest, such as autonomous driving, robotics, and games, there exist plentiful amounts of previously collected interaction data, which consist of informative behaviours that are a rich source of prior information. Deep RL algorithms that can utilize such prior datasets will not only scale to real-world problems, but will also lead to solutions that generalize substantially better.


RoboNet: A dataset for large-scale multi-robot learning

Robohub

Note that any GIF compression artifacts in this animation are not present in the dataset itself. After collecting a diverse dataset, we experimentally investigate how it can be used to enable general skill learning that transfers to new environments. First, we pre-train visual dynamics models on a subset of data from RoboNet, and then fine-tune them to work in an unseen test environment using a small amount of new data. The constructed test environments (one of which is visualized below) all include different lab settings, new cameras and viewpoints, held-out robots, and novel objects purchased after data collection concluded. Example test environment constructed in a new lab, with a temporary uncalibrated camera, and a new Baxter robot.


Facebook taught an AI the 'theory of mind'

#artificialintelligence

When it comes to competitive games, AI systems have already shown they can easily mop the floor with the best humanity has to offer. But life in the real world isn't a zero-sum game like poker or StarCraft, and we need AI to work with us, not against us. That's why a research team from Facebook taught an AI how to play the cooperative card game Hanabi (the Japanese word for fireworks), to gain a better understanding of how humans think. Specifically, the Facebook team set out to instill in its AI system a theory of mind. "Theory of mind is this idea of understanding the beliefs and intentions of other agents or other players or humans," Noam Brown, a researcher at Facebook AI, told Engadget.


Spectrum Management in Dynamic Spectrum Access: A Deep Reinforcement Learning Approach

#artificialintelligence

Generally, in dynamic spectrum access (DSA) networks, cooperation and centralized control are unavailable, and DSA users have to carry out wireless transmissions individually. DSA users have to learn other users' behaviors by sensing and analyzing wireless environments, so that they can adjust their parameters properly and carry out effective wireless transmissions. In this thesis, machine learning and deep learning technologies are leveraged in DSA networks to enable appropriate and intelligent spectrum management, including both spectrum access and power allocation. Accordingly, a novel spectrum management framework utilizing deep reinforcement learning is proposed, in which deep reinforcement learning is employed to accurately learn wireless environments and generate optimal spectrum management strategies that adapt to variations in those environments. Due to the model-free nature of reinforcement learning, DSA users only need to interact directly with environments to obtain optimal strategies rather than relying on accurate channel estimations.


No-Regret Exploration in Goal-Oriented Reinforcement Learning

arXiv.org Machine Learning

Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the so-called episodic setting or stochastic shortest path (SSP) problem, where an agent has to achieve a predefined goal state (e.g., the top of the hill) while maximizing the cumulative reward or minimizing the cumulative cost. Despite its popularity, most of the literature studying the exploration-exploitation dilemma either focused on different problems (i.e., fixed-horizon and infinite-horizon) or made the restrictive loop-free assumption (which implies that no same state can be visited twice during any episode). In this paper, we study the general SSP setting and introduce the algorithm UC-SSP whose regret scales as $\displaystyle \widetilde{O}(c_{\max}^{3/2} c_{\min}^{-1/2} D S \sqrt{ A D K})$ after $K$ episodes for any unknown SSP with $S$ non-terminal states, $A$ actions, an SSP-diameter of $D$ and positive costs in $[c_{\min}, c_{\max}]$. UC-SSP is thus the first learning algorithm with vanishing regret in the theoretically challenging setting of episodic RL.
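To make "vanishing regret" concrete, the following minimal sketch evaluates only the leading term of the bound quoted above, dropping the constants and logarithmic factors hidden by the $\widetilde{O}$ notation. The function name and the example parameter values are illustrative assumptions, not taken from the paper. Because the term grows as $\sqrt{K}$, the regret averaged per episode shrinks as $K$ grows.

```python
import math


def ucssp_regret_scale(K, S, A, D, c_min, c_max):
    """Leading term of the quoted bound: c_max^{3/2} c_min^{-1/2} D S sqrt(A D K).

    Hypothetical helper for illustration only: constants and log factors
    hidden by the O-tilde notation are ignored.
    """
    return c_max ** 1.5 * c_min ** -0.5 * D * S * math.sqrt(A * D * K)


def per_episode_scale(K, **problem):
    # Average regret per episode; decays like 1 / sqrt(K).
    return ucssp_regret_scale(K, **problem) / K


if __name__ == "__main__":
    # Example problem sizes chosen arbitrarily for illustration.
    problem = dict(S=10, A=4, D=5.0, c_min=0.1, c_max=1.0)
    for K in (100, 10_000, 1_000_000):
        print(K, per_episode_scale(K, **problem))
```

Quadrupling the number of episodes doubles the leading term, so the per-episode average halves.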


Hierarchical Cooperative Multi-Agent Reinforcement Learning with Skill Discovery

arXiv.org Machine Learning

Human players in professional team sports achieve high-level coordination by dynamically choosing complementary skills and executing primitive actions to perform these skills. As a step toward creating intelligent agents with this capability for fully cooperative multi-agent settings, we propose a two-level hierarchical multi-agent reinforcement learning (MARL) algorithm with unsupervised skill discovery. Agents learn useful and distinct skills at the low level via independent Q-learning, while they learn to select complementary latent skill variables at the high level via centralized multi-agent training with an extrinsic team reward. The set of low-level skills emerges from an intrinsic reward that solely promotes the decodability of latent skill variables from the trajectory of a low-level skill, without the need for hand-crafted rewards for each skill. For scalable decentralized execution, each agent independently chooses latent skill variables and primitive actions based on local observations. Our overall method enables the use of general cooperative MARL algorithms for training high-level policies and single-agent RL for training low-level skills. Experiments on a stochastic high-dimensional team game show the emergence of useful skills and cooperative team play. The interpretability of the learned skills shows the promise of the proposed method for achieving human-AI cooperation in team sports games.


Improving Network Automation and Security with Artificial Intelligence - IT Peer Network

#artificialintelligence

Communication service providers (CommSPs) are already saving money and generating revenue from network transformation investments. There is an expectation these benefits will continue to increase as NFV functions scale across the various elements of the infrastructure--enterprise, radio access network, wireless core, cable and cloud. New 5G and edge computing use cases promise to deliver new revenue along with even more data that must be moved, stored, processed and analyzed. The industry is looking to Artificial Intelligence (AI) and Machine Learning (ML) to enable CommSPs to solve problems and unlock value for their own business operations and their customers. As an example, distributed AI based on reinforcement learning will play a key role in building automated and self-managed networks.


Risk-Averse Trust Region Optimization for Reward-Volatility Reduction

arXiv.org Machine Learning

In real-world decision-making problems, for instance in the fields of finance, robotics or autonomous driving, keeping uncertainty under control is as important as maximizing expected returns. Risk aversion has been addressed in the reinforcement learning literature through risk measures related to the variance of returns. However, in many cases, risk is measured not only over the long term but also on step-wise rewards (e.g., in trading, to ensure the stability of the investment bank, it is essential to monitor the risk of portfolio positions on a daily basis). In this paper, we define a novel measure of risk, which we call reward volatility, consisting of the variance of the rewards under the state-occupancy measure. We show that the reward volatility bounds the return variance so that reducing the former also constrains the latter. We derive a policy gradient theorem with a new objective function that exploits the mean-volatility relationship, and develop an actor-only algorithm. Furthermore, thanks to the linearity of the Bellman equations defined under the new objective function, it is possible to adapt well-known policy gradient algorithms with monotonic improvement guarantees, such as TRPO, in a risk-averse manner. Finally, we test the proposed approach in two simulated financial environments.
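As a rough illustration of "the variance of the rewards under the state-occupancy measure", the sketch below estimates such a quantity from sampled reward sequences. It rests on assumptions not stated in the excerpt: the occupancy measure is taken to be the discounted one, each finite trajectory's discount weights are renormalized to sum to one, and all function names are hypothetical. It is not the authors' estimator.

```python
def discounted_weights(length, gamma):
    """Discount weights gamma^t, renormalized to sum to 1 (an assumption
    made here so that truncated trajectories still define a distribution)."""
    raw = [gamma ** t for t in range(length)]
    total = sum(raw)
    return [w / total for w in raw]


def mean_reward(trajectories, gamma):
    """Occupancy-weighted mean step reward, averaged over trajectories."""
    per_traj = [
        sum(w * r for w, r in zip(discounted_weights(len(rewards), gamma), rewards))
        for rewards in trajectories
    ]
    return sum(per_traj) / len(per_traj)


def reward_volatility(trajectories, gamma=0.99):
    """Occupancy-weighted variance of step rewards around their mean."""
    j = mean_reward(trajectories, gamma)
    per_traj = [
        sum(
            w * (r - j) ** 2
            for w, r in zip(discounted_weights(len(rewards), gamma), rewards)
        )
        for rewards in trajectories
    ]
    return sum(per_traj) / len(per_traj)


if __name__ == "__main__":
    steady = [[1.0, 1.0, 1.0, 1.0]]   # same reward every step
    bumpy = [[4.0, -2.0, 4.0, -2.0]]  # same undiscounted total, noisier steps
    print(reward_volatility(steady), reward_volatility(bumpy))
```

The two example sequences have the same undiscounted total reward, but only the second has nonzero step-wise variance, which is the kind of risk a return-level measure can miss.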


VALAN: Vision and Language Agent Navigation

arXiv.org Machine Learning

VALAN is a lightweight and scalable software framework for deep reinforcement learning based on the SEED RL architecture. The framework facilitates the development and evaluation of embodied agents for solving grounded language understanding tasks, such as Vision-and-Language Navigation and Vision-and-Dialog Navigation, in photo-realistic environments, such as Matterport3D and Google StreetView. We have added a minimal set of abstractions on top of SEED RL allowing us to generalize the architecture to solve a variety of other RL problems. In this article, we will describe VALAN's software abstraction and architecture, and also present an example of using VALAN to design agents for instruction-conditioned indoor navigation.


Making Smart Homes Smarter: Optimizing Energy Consumption with Human in the Loop

arXiv.org Artificial Intelligence

Rapid advancements in the Internet of Things (IoT) have facilitated more efficient deployment of smart environment solutions for specific user requirements. With the increase in the number of IoT devices, it has become difficult for the user to control or operate every individual smart device to achieve a desired goal such as optimized power consumption or scheduled appliance running time. Furthermore, existing solutions for automatically adapting IoT devices are not capable of incorporating user behavior. This paper presents a novel approach to accurately configure IoT devices while achieving the twin objectives of optimizing energy use and conforming to user preferences. Our work comprises unsupervised clustering of devices' data to find the states of operation for each device, followed by probabilistic analysis of user behavior to determine their preferred states. Finally, we deploy an online reinforcement learning (RL) agent to find the best device settings automatically. Results on datasets from three different smart homes show the effectiveness of our methodology. To the best of our knowledge, this is the first time that a practical approach has been adopted to achieve the above-mentioned objectives without any human interaction within the system.