AITopics

2006.08555

Country:

North America > United States > California > Orange County > Irvine (0.15)
Asia > Middle East > Jordan (0.04)
North America > United States > Texas (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Ning, Kun-Peng, Huang, Sheng-Jun

Reinforcement Learning with Supervision from Noisy Demonstrations

arXiv.org Machine LearningJun-14-2020

Reinforcement learning has achieved great success in various applications. To learn an effective policy for the agent, it usually requires a huge amount of data by interacting with the environment, which could be computational costly and time consuming. To overcome this challenge, the framework called Reinforcement Learning with Expert Demonstrations (RLED) was proposed to exploit the supervision from expert demonstrations. Although the RLED methods can reduce the number of learning iterations, they usually assume the demonstrations are perfect, and thus may be seriously misled by the noisy demonstrations in real applications. In this paper, we propose a novel framework to adaptively learn the policy by jointly interacting with the environment and exploiting the expert demonstrations. Specifically, for each step of the demonstration trajectory, we form an instance, and define a joint loss function to simultaneously maximize the expected reward and minimize the difference between agent behaviors and demonstrations. Most importantly, by calculating the expected gain of the value function, we assign each instance with a weight to estimate its potential utility, and thus can emphasize the more helpful demonstrations while filter out noisy ones. Experimental results in various environments with multiple popular reinforcement learning algorithms show that the proposed approach can learn robustly with noisy demonstrations, and achieve higher performance in fewer iterations.

demonstration, machine learning, reinforcement learning, (15 more...)

2006.07808

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
North America > United States > New York (0.04)

Genre:

Workflow (0.66)
Research Report (0.50)

Industry: Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceJun-14-2020

Tackling Morpion Solitaire with AlphaZero-likeRanked Reward Reinforcement Learning

Wang, Hui, Preuss, Mike, Emmerich, Michael, Plaat, Aske

Morpion Solitaire is a popular single player game, performed with paper and pencil. Due to its large state space (on the order of the game of Go) traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record, to the best of our knowledge, there has been no further progress reported, for about a decade. In this paper we take the recent impressive performance of deep self-learning reinforcement learning approaches from AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is that the state space is sparse, there are few win/loss signals. Instead, we use an approach known as ranked reward to create a reinforcement learning self-play framework for Morpion Solitaire. This enables us to find medium-quality solutions with reasonable computational effort. Our record is a 67 steps solution, which is very close to the human best (68) without any other adaptation to the problem than using ranked reward. We list many further avenues for potential improvement.

machine learning, morpion solitaire, reinforcement learning, (12 more...)

2006.0797

Country:

Europe > Netherlands > South Holland > Leiden (0.05)
Asia > China (0.05)
North America > United States > New York (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games > Go (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Papoudakis, Georgios, Christianos, Filippos, Schäfer, Lukas, Albrecht, Stefano V.

Comparative Evaluation of Multi-Agent Deep Reinforcement Learning Algorithms

arXiv.org Artificial IntelligenceJun-14-2020

Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we evaluate and compare three different classes of MARL algorithms (independent learners, centralised training with decentralised execution, and value decomposition) in a diverse range of multi-agent learning tasks. Our results show that (1) algorithm performance depends strongly on environment properties and no algorithm learns efficiently across all learning tasks; (2) independent learners often achieve equal or better performance than more complex algorithms; (3) tested algorithms struggle to solve multi-agent tasks with sparse rewards. We report detailed empirical data, including a reliability analysis, and provide insights into the limitations of the tested algorithms.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.07869

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.54)

Industry: Leisure & Entertainment > Games (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine LearningJun-14-2020

Optimistic Distributionally Robust Policy Optimization

Song, Jun, Zhao, Chaoyue

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), as the widely employed policy based reinforcement learning (RL) methods, are prone to converge to a sub-optimal solution as they limit the policy representation to a particular parametric distribution class. To address this issue, we develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRPO) algorithm, which effectively utilizes Optimistic Distributionally Robust Optimization (DRO) approach to solve the trust region constrained optimization problem without parameterizing the policies. Our algorithm improves TRPO and PPO with a higher sample efficiency and a better performance of the final policy while attaining the learning stability. Moreover, it achieves a globally optimal policy update that is not promised in the prevailing policy based RL algorithms. Experiments across tabular domains and robotic locomotion tasks demonstrate the effectiveness of our approach.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2006.07815

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > Wales (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningJun-14-2020

A Finite Time Analysis of Two Time-Scale Actor Critic Methods

Wu, Yue, Zhang, Weitong, Xu, Pan, Gu, Quanquan

Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$, with $\mathcal{\tilde{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2005.0135

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Buffet, Olivier, Pietquin, Olivier, Weng, Paul

Reinforcement Learning

arXiv.org Machine LearningJun-13-2020

Reinforcement learning (RL) is a general framework for adaptive control, which has proven to be efficient in many domains, e.g., board games, video games or autonomous vehicles. In such problems, an agent faces a sequential decision-making problem where, at every time step, it observes its state, performs an action, receives a reward and moves to a new state. An RL agent learns by trial and error a good policy (or controller) based on observations and numeric reward feedback on the previously performed action. In this chapter, we present the basic framework of RL and recall the two main families of approaches that have been developed to learn a good policy. The first one, which is value-based, consists in estimating the value of an optimal policy, value from which a policy can be recovered, while the other, called policy search, directly works in a policy space. Actor-critic methods can be seen as a policy search technique where the policy value that is learned guides the policy improvement. Besides, we give an overview of some extensions of the standard RL framework, notably when risk-averse behavior needs to be taken into account or when rewards are not available or not known.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2005.14419

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Michigan (0.04)
(4 more...)

Genre:

Overview (1.00)
Instructional Material > Course Syllabus & Notes (0.46)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Transportation (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Cowen-Rivers, Alexander I., Palenicek, Daniel, Moens, Vincent, Abdullah, Mohammed, Sootla, Aivar, Wang, Jun, Ammar, Haitham

SAMBA: Safe Model-Based & Active Reinforcement Learning

arXiv.org Artificial IntelligenceJun-12-2020

In this paper, we propose SAMBA, a novel framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics. Our method builds upon PILCO to enable active exploration using novel(semi-)metrics for out-of-sample Gaussian process evaluation optimised through a multi-objective problem that supports conditional-value-at-risk constraints. We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations. Our results show orders of magnitude reductions in samples and violations compared to state-of-the-art methods. Lastly, we provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.

artificial intelligence, null, reinforcement learning, (16 more...)

2006.09436

Country:

Europe (0.14)
North America > United States > Wisconsin (0.14)

Genre: Research Report > New Finding (0.86)

Industry: Energy > Oil & Gas (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Ammanabrolu, Prithviraj, Tien, Ethan, Hausknecht, Matthew, Riedl, Mark O.

How to Avoid Being Eaten by a Grue: Structured Exploration Strategies for Textual Worlds

arXiv.org Artificial IntelligenceJun-12-2020

Text-based games are long puzzles or quests, characterized by a sequence of sparse and potentially deceptive rewards. They provide an ideal platform to develop agents that perceive and act upon the world using a combinatorially sized natural language state-action space. Standard Reinforcement Learning agents are poorly equipped to effectively explore such spaces and often struggle to overcome bottlenecks---states that agents are unable to pass through simply because they do not see the right action sequence enough times to be sufficiently reinforced. We introduce Q*BERT, an agent that learns to build a knowledge graph of the world by answering questions, which leads to greater sample efficiency. To overcome bottlenecks, we further introduce MC!Q*BERT an agent that uses an knowledge-graph-based intrinsic motivation to detect bottlenecks and a novel exploration strategy to efficiently learn a chain of policy modules to overcome them. We present an ablation study and results demonstrating how our method outperforms the current state-of-the-art on nine text games, including the popular game, Zork, where, for the first time, a learning agent gets past the bottleneck where the player is eaten by a Grue.

machine learning, natural language, reinforcement learning, (17 more...)

2006.07409

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre:

Workflow (0.88)
Research Report (0.70)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

arXiv.org Machine LearningJun-12-2020

TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search

Gogineni, Tarun, Xu, Ziping, Punzalan, Exequiel, Jiang, Runxuan, Kammeraad, Joshua, Tewari, Ambuj, Zimmerman, Paul

Molecular geometry prediction of flexible molecules, or conformer search, is a long-standing challenge in computational chemistry. This task is of great importance for predicting structure-activity relationships for a wide variety of substances ranging from biomolecules to ubiquitous materials. Substantial computational resources are invested in Monte Carlo and Molecular Dynamics methods to generate diverse and representative conformer sets for medium to large molecules, which are yet intractable to chemoinformatic conformer search methods. We present TorsionNet, an efficient sequential conformer search technique based on reinforcement learning under the rigid rotor approximation. The model is trained via curriculum learning, whose theoretical benefit is explored in detail, to maximize a novel metric grounded in thermodynamics called the Gibbs Score. Our experimental results show that TorsionNet outperforms the highest scoring chemoinformatics method by 4x on large branched alkanes, and by several orders of magnitude on the previously unexplored biopolymer lignin, with applications in renewable energy.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2006.07078

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Michigan (0.04)
North America > Puerto Rico > San Juan > San Juan (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)