AITopics

Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While it is consistent and unbiased, it can result in high variance updates to the weights for the value function. In this work, we explore a resampling strategy as an alternative to reweighting. We propose Importance Resampling (IR) for off-policy prediction, which resamples experience from a replay buffer and applies standard on-policy updates. The approach avoids using importance sampling ratios in the update, instead correcting the distribution before the update. We characterize the bias and consistency of IR, particularly compared to Weighted IS (WIS). We demonstrate in several microworlds that IR has improved sample efficiency and lower variance updates, as compared to IS and several variance-reduced IS strategies, including variants of WIS and V-trace, which clips IS ratios. We also provide a demonstration showing IR improves over IS for learning a value function from images in a racing car simulator.
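The resampling idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tabular TD(0) update, and the (state, reward, next_state) transition layout are assumptions made for illustration. The point it shows is that the importance ratios are used only to choose which transitions enter the minibatch, and the update itself is an ordinary unweighted on-policy step.

```python
import random

def importance_resample(buffer, ratios, batch_size, rng=random):
    """Draw a minibatch with probability proportional to each transition's
    importance ratio pi(a|s) / mu(a|s) (hypothetical helper)."""
    if len(buffer) != len(ratios):
        raise ValueError("need exactly one ratio per stored transition")
    # random.choices samples with replacement according to the given weights.
    return rng.choices(buffer, weights=ratios, k=batch_size)

def td0_update(values, transition, alpha=0.1, gamma=0.9):
    """Plain tabular TD(0) step; no importance ratio appears in the update."""
    state, reward, next_state = transition
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error

# Toy usage: the first transition has ratio 0, so it is never resampled.
buffer = [("s0", 0.0, "s1"), ("s1", 1.0, "s2")]
ratios = [0.0, 2.0]
values = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
for transition in importance_resample(buffer, ratios, batch_size=4):
    td0_update(values, transition)
```

By contrast, a reweighting (IS) update would sample uniformly from the buffer and multiply each TD error by its ratio, which is where the high-variance updates mentioned in the abstract come from.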

machine learning, reinforcement learning, variance

1906.04328

Country: North America > Canada > Alberta (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Luketina, Jelena, Nardelli, Nantas, Farquhar, Gregory, Foerster, Jakob, Andreas, Jacob, Grefenstette, Edward, Whiteson, Shimon, Rocktäschel, Tim

A Survey of Reinforcement Learning Informed by Natural Language

To be successful in real-world tasks, Reinforcement Learning (RL) needs to exploit the compositional, relational, and hierarchical structure of the world, and learn to transfer it to the task at hand. Recent advances in representation learning for language make it possible to build models that acquire world knowledge from text corpora and integrate this knowledge into downstream decision making problems. We thus argue that the time is right to investigate a tight integration of natural language understanding into RL in particular. We survey the state of the field, including work on instruction following, text games, and learning from textual domain knowledge. Finally, we call for the development of new environments as well as further investigation into the potential uses of recent Natural Language Processing (NLP) techniques for such tasks.

machine learning, natural language, reinforcement learning

1906.03926

Country:

North America > United States > Massachusetts (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Sweden > Skåne County > Malmö (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre:

Overview (0.46)
Research Report (0.40)

Industry:

Education (0.93)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Boosting Soft Actor-Critic: Emphasizing Recent Experience without Forgetting the Past

Wang, Che, Ross, Keith

Soft Actor-Critic (SAC) [10, 11] is an off-policy actor-critic deep reinforcement learning (DRL) algorithm based on maximum entropy reinforcement learning. By combining off-policy updates with an actor-critic formulation, SAC achieves state-of-the-art performance on a range of continuous-action benchmark tasks, outperforming prior on-policy and off-policy methods. The off-policy method employed by SAC samples data uniformly from past experience when performing parameter updates. We propose Emphasizing Recent Experience (ERE), a simple but powerful off-policy sampling technique, which emphasizes recently observed data while not forgetting the past. The ERE algorithm samples more aggressively from recent experience, and also orders the updates to ensure that updates from old data do not overwrite updates from new data. We compare vanilla SAC and SAC ERE, and show that ERE is more sample efficient than vanilla SAC for continuous-action MuJoCo tasks [31]. We also consider combining SAC with Prioritized Experience Replay (PER) [28], a scheme originally proposed for deep Q-learning which prioritizes the data based on temporal-difference (TD) error. We show that SAC PER can marginally improve the sample efficiency performance of SAC, but much less so than SAC ERE. Finally, we propose an algorithm which integrates ERE and PER and show that this hybrid algorithm can give the best results for some of the MuJoCo tasks.

artificial intelligence, machine learning, reinforcement learning

1906.04009

Country:

North America > United States > New York (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)

Genre: Research Report > New Finding (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Gaina, Raluca D., Lucas, Simon M., Perez-Liebana, Diego

Project Thyia: A Forever Gameplayer

The space of Artificial Intelligence entities is dominated by conversational bots. Some of them fit in our pockets and we take them everywhere we go, or allow them to be a part of human homes. Siri, Alexa, they are recognised as present in our world. But a lot of games research is restricted to existing in the separate realm of software. We enter different worlds when playing games, but those worlds cease to exist once we quit. Similarly, AI game-players are run once on a game (or maybe for longer periods of time, in the case of learning algorithms which need some, still limited, period for training), and they cease to exist once the game ends. But what if they didn't? What if there existed artificial game-players that continuously played games, learned from their experiences and kept getting better? What if they interacted with the real world and us, humans: live-streaming games, chatting with viewers, accepting suggestions for strategies or games to play, forming opinions on popular game titles? In this paper, we introduce the vision behind a new project called Thyia, which focuses on creating a present, continuous, 'always-on', interactive game-player.

evolutionary algorithm, machine learning, reinforcement learning

1906.04023

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Wisconsin (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)

arXiv.org Machine Learning, Jun-10-2019

Deep Reinforcement Learning with Discrete Normalized Advantage Functions for Resource Management in Network Slicing

Qi, Chen, Hua, Yuxiu, Li, Rongpeng, Zhao, Zhifeng, Zhang, Honggang

Network slicing promises to provision diversified services with distinct requirements in one infrastructure. Deep reinforcement learning (e.g., deep $\mathcal{Q}$-learning, DQL) is assumed to be an appropriate algorithm to solve the demand-aware inter-slice resource management issue in network slicing by regarding the varying demands and the allocated bandwidth as the environment state and the action, respectively. However, allocating bandwidth at a finer resolution usually implies a larger action space, and unfortunately DQL fails to converge quickly in this case. In this paper, we introduce discrete normalized advantage functions (DNAF) into DQL, by separating the $\mathcal{Q}$-value function into a state-value function term and an advantage term and exploiting a deterministic policy gradient descent (DPGD) algorithm to avoid the unnecessary calculation of the $\mathcal{Q}$-value for every state-action pair. Furthermore, as DPGD only works in a continuous action space, we embed a k-nearest neighbor algorithm into DQL to quickly find a valid action in the discrete space nearest to the DPGD output. Finally, we verify the faster convergence of the DNAF-based DQL through extensive simulations.
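The nearest-neighbor step mentioned in the abstract above can be sketched as follows. This is a hedged illustration, not the paper's code: the function name, the brute-force search, and the toy bandwidth grid are assumptions. It only shows how a continuous output vector can be mapped to the closest valid discrete actions by Euclidean distance.

```python
def k_nearest_actions(proto_action, valid_actions, k=1):
    """Return the k valid discrete actions closest (in Euclidean distance)
    to a continuous proto-action (hypothetical helper, brute-force search)."""
    def squared_distance(action):
        return sum((a - p) ** 2 for a, p in zip(action, proto_action))
    # Sorting by squared distance gives the same order as sorting by distance.
    return sorted(valid_actions, key=squared_distance)[:k]

# Toy example (assumed): split 4 bandwidth units between two slices.
valid_actions = [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
proto_action = (1.2, 2.8)  # continuous output that is not itself a valid action
nearest = k_nearest_actions(proto_action, valid_actions, k=2)
```

For large action spaces a spatial index would normally replace the full sort; the brute-force version is kept here only because it is easy to verify.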

artificial intelligence, machine learning, reinforcement learning

1906.04594

Country: Asia > China (0.29)

Genre: Research Report (0.42)

Industry: Telecommunications (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.88)

arXiv.org Machine Learning, Jun-10-2019

Exploration via Hindsight Goal Generation

Ren, Zhizhou, Dong, Kefan, Zhou, Yuan, Liu, Qiang, Peng, Jian

Goal-oriented reinforcement learning has recently been a practical framework for robotic manipulation tasks, in which an agent is required to reach a certain goal defined by a function on the state space. However, the sparsity of such reward definition makes traditional reinforcement learning algorithms very inefficient. Hindsight Experience Replay (HER), a recent advance, has greatly improved sample efficiency and practical applicability for such problems. It exploits previous replays by constructing imaginary goals in a simple heuristic way, acting like an implicit curriculum to alleviate the challenge of sparse reward signal. In this paper, we introduce Hindsight Goal Generation (HGG), a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and also have the potential to guide the agent to reach the actual goal in the long term. We have extensively evaluated our goal generation algorithm on a number of robotic manipulation tasks and demonstrated substantial improvement over the original HER in terms of sample efficiency.
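To make the "imaginary goals" of the HER baseline concrete, here is a minimal sketch of hindsight relabeling. It illustrates HER only, not the HGG goal generator proposed in the abstract above. The function name, the transition layout, the choice of the final achieved state as the substitute goal, and the 0 / -1 sparse reward convention are assumptions made for illustration.

```python
def relabel_with_final_goal(episode):
    """Relabel an episode as if its final achieved state had been the goal.

    episode: list of (state, action, next_state) tuples (assumed layout).
    Returns a list of (state, action, goal, reward, next_state) tuples.
    """
    hindsight_goal = episode[-1][2]  # state actually reached at the end
    relabeled = []
    for state, action, next_state in episode:
        # Sparse reward (assumed convention): 0 on reaching the goal, else -1.
        reward = 0.0 if next_state == hindsight_goal else -1.0
        relabeled.append((state, action, hindsight_goal, reward, next_state))
    return relabeled

# Toy usage: a two-step episode that ended in state 2 instead of the real goal.
episode = [(0, "right", 1), (1, "right", 2)]
hindsight_transitions = relabel_with_final_goal(episode)
```

The relabeled transitions contain at least one non-negative reward even when the original goal was never reached, which is what lets an off-policy learner extract signal from otherwise failed episodes.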

hindsight goal, machine learning, reinforcement learning

1906.04279

Country: North America > United States (0.68)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Rahtz, Matthew, Fang, James, Dragan, Anca D., Hadfield-Menell, Dylan

An Extensible Interactive Interface for Agent Design

arXiv.org Machine Learning, Jun-10-2019

In artificial intelligence, we often specify tasks through a reward function. While this works well in some settings, many tasks are hard to specify this way. In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging. Instead, we present an interface for specifying tasks interactively using demonstrations. Our approach defines a set of increasingly complex policies. The interface allows the user to switch between these policies at fixed intervals to generate demonstrations of novel, more complex, tasks. We train new policies based on these demonstrations and repeat the process. We present a case study of our approach in the Lunar Lander domain, and show that this simple approach can quickly learn a successful landing policy and outperforms an existing comparison-based deep RL method.

demonstration, machine learning, reinforcement learning

1906.02641

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Industry: Education (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Nachum, Ofir, Chow, Yinlam, Dai, Bo, Li, Lihong

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.

artificial intelligence, machine learning, reinforcement learning

1906.04733

Country: North America (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)

arXiv.org Artificial Intelligence, Jun-9-2019

Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning

Assran, Mahmoud, Romoff, Joshua, Ballas, Nicolas, Pineau, Joelle, Rabbat, Mike

Multi-simulator training has contributed to the recent success of Deep Reinforcement Learning by stabilizing learning and allowing for higher training throughputs. We propose Gossip-based Actor-Learner Architectures (GALA) where several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology, and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an epsilon-ball of one another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable compared to A2C, its fully-synchronous counterpart. GALA also outperforms A2C, being more robust and sample efficient. We show that we can run several loosely coupled GALA agents in parallel on a single GPU and achieve significantly higher hardware utilization and frame-rates than vanilla A2C at comparable power draws.

artificial intelligence, machine learning, reinforcement learning

1906.04585

Country: North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Lanier, John B., McAleer, Stephen, Baldi, Pierre

Curiosity-Driven Multi-Criteria Hindsight Experience Replay

arXiv.org Artificial Intelligence, Jun-9-2019

Dealing with sparse rewards is a longstanding challenge in reinforcement learning. The recent use of hindsight methods has achieved success on a variety of sparse-reward tasks, but these methods fail on complex tasks such as stacking multiple blocks with a robot arm in simulation. Curiosity-driven exploration using the prediction error of a learned dynamics model as an intrinsic reward has been shown to be effective for exploring a number of sparse-reward environments. We present a method that combines hindsight with curiosity-driven exploration and curriculum learning in order to solve the challenging sparse-reward block stacking task. We are the first to stack more than two blocks using only sparse reward without human demonstrations.

artificial intelligence, machine learning, reinforcement learning

1906.0371

Country: North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report (0.50)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)