Tangkaratt, Voot
Discovering Diverse Solutions in Deep Reinforcement Learning
Osa, Takayuki, Tangkaratt, Voot, Sugiyama, Masashi
Reinforcement learning (RL) algorithms are typically limited to learning a single solution for a specified task, even though there often exist diverse solutions to a given task. Compared with learning a single solution, learning a set of diverse solutions is beneficial because diverse solutions enable robust few-shot adaptation and allow the user to select a preferred solution. Although previous studies have shown that diverse behaviors can be modeled with a policy conditioned on latent variables, an approach for modeling an infinite set of diverse solutions with continuous latent variables has not been investigated. In this study, we propose an RL method that can learn infinitely many solutions by training a policy conditioned on a continuous or discrete low-dimensional latent variable. Through continuous control tasks, we demonstrate that our method can learn diverse solutions in a data-efficient manner and that the solutions can be used for few-shot adaptation to solve unseen tasks.
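As a rough illustration of the latent-conditioned policy idea, the sketch below (in PyTorch) shows a Gaussian policy whose input is a state concatenated with a low-dimensional latent variable, so that sampling different latent values yields different behaviors. The architecture, dimensions, and latent prior are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LatentConditionedPolicy(nn.Module):
    """Gaussian policy pi(a | s, z) conditioned on a low-dimensional latent z.

    Sampling a different z yields a (potentially) different behavior, so one
    network can represent a continuum of solutions. Illustrative sketch only.
    """

    def __init__(self, state_dim: int, action_dim: int, latent_dim: int = 2):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mean = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)

    def forward(self, state: torch.Tensor, z: torch.Tensor):
        h = self.net(torch.cat([state, z], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5, 2).exp()

    def sample_latent(self, batch_size: int = 1) -> torch.Tensor:
        # Continuous latent drawn from a simple prior; each z indexes one solution.
        return torch.rand(batch_size, self.latent_dim) * 2 - 1  # Uniform(-1, 1)


# Example: two latent samples give two candidate behaviors for the same state.
policy = LatentConditionedPolicy(state_dim=8, action_dim=2)
s = torch.zeros(1, 8)
for z in (policy.sample_latent(), policy.sample_latent()):
    mean, std = policy(s, z)
```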
Robust Imitation Learning from Noisy Demonstrations
Tangkaratt, Voot, Charoenphakdee, Nontawat, Sugiyama, Masashi
The goal of sequential decision making is to learn a good policy that makes good decisions (Puterman, 1994). Imitation learning (IL) is an approach that learns a policy from demonstrations (i.e., sequences of demonstrators' decisions) (Schaal, 1999). Researchers have shown that a good policy can be learned efficiently from high-quality demonstrations collected from experts (Ng and Russell, 2000; Syed et al., 2008; Ziebart et al., 2010; Ho and Ermon, 2016; Sun et al., 2019). However, demonstrations in the real world often have lower quality due to noise or insufficient expertise of the demonstrators, especially when humans are involved in the data collection process (Mandlekar et al., 2018). This is problematic because low-quality demonstrations can reduce the efficiency of IL both in theory and in practice (Tangkaratt et al., 2020). In this paper, we theoretically and experimentally show that IL can perform well even in the presence of such noise.
Meta-Model-Based Meta-Policy Optimization
Hiraoka, Takuya, Imagawa, Takahisa, Tangkaratt, Voot, Osa, Takayuki, Onishi, Takashi, Tsuruoka, Yoshimasa
Model-based reinforcement learning (MBRL) has been applied to meta-learning settings and has demonstrated high sample efficiency. However, in previous MBRL methods for meta-learning, policies are optimized via rollouts that fully rely on a predictive model of the environment, so performance in the real environment tends to degrade when the predictive model is inaccurate. In this paper, we prove that this performance degradation can be suppressed by using branched meta-rollouts. On the basis of this theoretical analysis, we propose Meta-Model-based Meta-Policy Optimization (M3PO), in which branched meta-rollouts are used for policy optimization. We demonstrate that M3PO outperforms existing meta-reinforcement-learning methods on continuous-control benchmarks.
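The following minimal sketch illustrates what "branched" rollouts generally look like: short model-based rollouts are branched from states collected in the real environment rather than rolling the model out for full episodes, which limits how far model errors can compound. The `model` and `policy` interfaces are assumptions for illustration, not the paper's API or the exact M3PO procedure.

```python
def branched_rollouts(model, policy, real_states, horizon=5):
    """Sketch of branched model rollouts: short imagined trajectories are
    branched from states observed in the real environment.

    Assumed interfaces: model(s, a) -> (next_state, reward), policy(s) -> action.
    """
    synthetic = []
    for s in real_states:            # branch from real data
        for _ in range(horizon):     # roll the learned model only a few steps
            a = policy(s)
            s_next, r = model(s, a)
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic                 # synthetic data used for policy optimization
```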
VILD: Variational Imitation Learning with Diverse-quality Demonstrations
Tangkaratt, Voot, Han, Bo, Khan, Mohammad Emtiyaz, Sugiyama, Masashi
The goal of imitation learning (IL) is to learn a good policy from high-quality demonstrations. However, the quality of demonstrations in reality can be diverse, since it is easier and cheaper to collect demonstrations from a mix of experts and amateurs. IL in such situations can be challenging, especially when the level of demonstrators' expertise is unknown. We propose a new IL method called variational imitation learning with diverse-quality demonstrations (VILD), where we explicitly model the level of demonstrators' expertise with a probabilistic graphical model and estimate it along with a reward function. We show that a naive approach to estimation is not suitable for large state and action spaces, and fix its issues by using a variational approach that can be easily implemented with existing reinforcement learning methods. Experiments on continuous-control benchmarks demonstrate that VILD outperforms state-of-the-art methods. Our work enables scalable and data-efficient IL under more realistic settings than before.
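As a rough, hypothetical sketch of the "expertise as noise level" idea, the snippet below treats a demonstrator's actions as optimal actions corrupted by Gaussian noise whose scale reflects (lack of) expertise; the actual VILD graphical model and its variational estimation are considerably more involved.

```python
import numpy as np

def demo_log_likelihood(demo_actions, optimal_actions, sigma_k):
    """Sketch of a per-demonstrator noise model: demonstrator k's actions are
    treated as the (unknown) optimal actions corrupted by Gaussian noise with
    scale sigma_k, so a larger sigma_k means lower expertise. Illustrative
    only; VILD estimates the expertise jointly with a reward function.
    """
    diff = np.asarray(demo_actions) - np.asarray(optimal_actions)
    d = diff.shape[-1]
    return np.sum(
        -0.5 * np.sum(diff ** 2, axis=-1) / sigma_k ** 2
        - d * np.log(sigma_k)
        - 0.5 * d * np.log(2 * np.pi)
    )
```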
Imitation Learning from Imperfect Demonstration
Wu, Yueh-Hua, Charoenphakdee, Nontawat, Bao, Han, Tangkaratt, Voot, Sugiyama, Masashi
Imitation learning (IL) aims to learn an optimal policy from demonstrations. However, such demonstrations are often imperfect since collecting optimal ones is costly. To effectively learn from imperfect demonstrations, we propose a novel approach that utilizes confidence scores, which describe the quality of demonstrations. More specifically, we propose two confidence-based IL methods, namely two-step importance weighting IL (2IWIL) and generative adversarial IL with imperfect demonstration and confidence (IC-GAIL). We show, both theoretically and empirically, that confidence scores given to only a small portion of sub-optimal demonstrations significantly improve the performance of IL.
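A minimal sketch of how confidence scores can weight demonstrations, using a confidence-weighted behavioral-cloning loss as a stand-in; 2IWIL and IC-GAIL are adversarial methods with additional machinery (e.g., semi-supervised confidence estimation), so this only conveys the weighting idea.

```python
import numpy as np

def confidence_weighted_bc_loss(log_probs, confidences):
    """Sketch of confidence-weighted imitation: each demonstration's negative
    log-likelihood under the policy is weighted by a confidence score in
    [0, 1] that reflects how likely the demonstration is to be optimal.
    Illustrative assumption, not the exact 2IWIL/IC-GAIL objectives.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    w = np.asarray(confidences, dtype=float)
    w = w / (w.sum() + 1e-8)          # normalize weights over the batch
    return -np.sum(w * log_probs)     # weighted negative log-likelihood
```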
Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization
Osa, Takayuki, Tangkaratt, Voot, Sugiyama, Masashi
Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying the hierarchical policy structure that enhances the performance of RL is not a trivial task. In this paper, we propose an HRL method that learns a latent variable of a hierarchical policy using mutual information maximization. Our approach can be interpreted as a way to learn a discrete and latent representation of the state-action space. To learn option policies that correspond to modes of the advantage function, we introduce advantage-weighted importance sampling. In our HRL method, the gating policy learns to select option policies based on an option-value function, and these option policies are optimized based on the deterministic policy gradient method. This framework is derived by leveraging the analogy between a monolithic policy in standard RL and a hierarchical policy in HRL by using a deterministic option policy. Experimental results indicate that our HRL approach can learn a diversity of options and that it can enhance the performance of RL in continuous control tasks.
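As a sketch of the advantage-weighted sampling idea, the function below assigns exponentially larger weights to samples with higher advantage, so that representation learning concentrates on high-advantage modes of the state-action space; the temperature and the exact weighting scheme are assumptions, not the paper's formulation.

```python
import numpy as np

def advantage_weights(advantages, beta=1.0):
    """Sketch of advantage-weighted importance sampling: samples with higher
    advantage receive exponentially larger weight, so a latent (option)
    representation learned from the re-weighted data focuses on modes of the
    advantage function. `beta` is an assumed temperature hyperparameter.
    """
    a = np.asarray(advantages, dtype=float)
    w = np.exp(beta * (a - a.max()))   # subtract max for numerical stability
    return w / w.sum()
```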
TD-Regularized Actor-Critic Methods
Parisi, Simone, Tangkaratt, Voot, Peters, Jan, Khan, Mohammad Emtiyaz
Actor-critic methods can achieve incredible performance on difficult reinforcement learning problems, but they are also prone to instability. This is partly due to the interaction between the actor and critic during learning; for example, an inaccurate step taken by one of them might adversely affect the other and destabilize learning. To avoid such issues, we propose to regularize the learning objective of the actor by penalizing the temporal difference (TD) error of the critic. This improves stability by avoiding large steps in the actor update whenever the critic is highly inaccurate. The resulting method, which we call the TD-regularized actor-critic method, is a simple plug-and-play approach to improve the stability and overall performance of actor-critic methods. Evaluations on standard benchmarks confirm this.
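A minimal sketch of the TD-regularization idea: the actor's loss is augmented with a penalty proportional to the critic's squared TD error, so the actor takes smaller steps when the critic is inaccurate. The penalty coefficient `eta` is an assumed hyperparameter; the paper's exact objective and scheduling may differ.

```python
import torch

def td_regularized_actor_loss(actor_loss: torch.Tensor,
                              td_errors: torch.Tensor,
                              eta: float = 0.1) -> torch.Tensor:
    """Penalize the actor's objective by the critic's squared TD error.

    When the critic fits the Bellman targets poorly (large TD error), the
    penalty dominates and discourages large actor updates. Sketch only.
    """
    return actor_loss + eta * (td_errors ** 2).mean()
```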
Active Deep Q-learning with Demonstration
Chen, Si-An, Tangkaratt, Voot, Lin, Hsuan-Tien, Sugiyama, Masashi
Recent research has shown that although Reinforcement Learning (RL) can benefit from expert demonstration, it usually takes considerable effort to obtain enough demonstrations, which prevents training decent RL agents with expert demonstrations in practice. In this work, we propose Active Reinforcement Learning with Demonstration (ARLD), a new framework that reduces the demonstration effort by allowing the RL agent to actively query for demonstrations during training. Under this framework, we propose Active Deep Q-Network, a novel query strategy that adapts to the dynamically changing distributions during RL training by estimating the uncertainty of recent states. The expert demonstration data within Active DQN are then utilized by optimizing a supervised max-margin loss in addition to the temporal-difference loss of usual DQN training. We propose two methods of estimating the uncertainty based on two state-of-the-art DQN models, namely the divergence of bootstrapped DQN and the variance of noisy DQN. The empirical results validate that both methods not only learn faster than passive expert demonstration methods given the same amount of demonstration but also reach super-expert performance across four different tasks.
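A sketch of an uncertainty-driven query rule in the spirit of ARLD: the agent queries the expert when several Q-value estimates for the current state (e.g., from bootstrapped heads or sampled noisy networks) disagree too much. The disagreement statistic and threshold below are illustrative assumptions, not the paper's exact divergence or variance measures.

```python
import numpy as np

def should_query_expert(q_estimates, threshold):
    """Query the expert when Q-value estimates for the current state disagree.

    q_estimates: array of shape (num_heads, num_actions), e.g., one row per
    bootstrapped head or per sampled noisy network. Illustrative sketch only.
    """
    q = np.asarray(q_estimates, dtype=float)
    uncertainty = q.var(axis=0).mean()   # disagreement across heads/samples
    return uncertainty > threshold
```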
Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam
Khan, Mohammad Emtiyaz, Nielsen, Didrik, Tangkaratt, Voot, Lin, Wu, Gal, Yarin, Srivastava, Akash
Uncertainty computation in deep learning is essential to design robust and reliable systems. Variational inference (VI) is a promising approach for such computation, but requires more effort to implement and execute compared to maximum-likelihood methods. In this paper, we propose new natural-gradient algorithms to reduce such efforts for Gaussian mean-field VI. Our algorithms can be implemented within the Adam optimizer by perturbing the network weights during gradient evaluations, and uncertainty estimates can be cheaply obtained by using the vector that adapts the learning rate. This requires lower memory, computation, and implementation effort than existing VI methods, while obtaining uncertainty estimates of comparable quality. Our empirical results confirm this and further suggest that the weight-perturbation in our algorithm could be useful for exploration in reinforcement learning and stochastic optimization.
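A rough sketch of the weight-perturbation idea: before evaluating gradients, Gaussian noise is added to each weight with a scale derived from the same second-moment (adaptive learning rate) vector that Adam maintains, which is also what provides the uncertainty estimate. The exact scaling below is an assumption for illustration; the paper derives the precise natural-gradient update.

```python
import torch

def perturb_weights(params, exp_avg_sq, prior_prec=1.0, data_size=1000):
    """Sketch of weight perturbation for Gaussian mean-field VI.

    Each weight is perturbed by Gaussian noise whose standard deviation is
    computed from an Adam-like second-moment estimate `exp_avg_sq`; the same
    vector yields cheap uncertainty estimates. Scaling is an assumption.
    """
    noisy = []
    for p, s in zip(params, exp_avg_sq):
        std = 1.0 / torch.sqrt(data_size * s + prior_prec)
        noisy.append(p + std * torch.randn_like(p))
    return noisy
```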
Guide Actor-Critic for Continuous Control
Tangkaratt, Voot, Abdolmaleki, Abbas, Sugiyama, Masashi
Actor-critic methods solve reinforcement learning problems by updating a parameterized policy, known as an actor, in a direction that increases an estimate of the expected return, known as a critic. However, existing actor-critic methods use only the values or gradients of the critic to update the policy parameters. In this paper, we propose a novel actor-critic method called the guide actor-critic (GAC). GAC first learns a guide actor that locally maximizes the critic and then updates the policy parameters toward the guide actor by supervised learning. Our main theoretical contributions are twofold. First, we show that GAC updates the guide actor by performing second-order optimization in the action space, where the curvature matrix is based on the Hessians of the critic. Second, we show that the deterministic policy gradient method is a special case of GAC when the Hessians are ignored. Through experiments, we show that our method is a promising reinforcement learning method for continuous control.
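As a hypothetical sketch of the guide-actor step, the function below takes a damped Newton-like step in action space using the gradient and Hessian of the critic to obtain a target action with a locally higher Q-value; fitting the parameterized policy to these targets by regression is omitted, and the damping term is an assumed regularizer rather than the paper's KL-derived update.

```python
import numpy as np

def guide_action(action, critic_grad, critic_hess, reg=1.0):
    """Damped Newton-like ascent step in action space using the critic's
    gradient and Hessian, producing a target ("guide") action that locally
    increases the Q-value. Illustrative sketch; GAC derives its update from
    a constrained objective and then fits the policy by supervised learning.
    """
    H = -np.asarray(critic_hess) + reg * np.eye(len(action))  # curvature + damping
    step = np.linalg.solve(H, np.asarray(critic_grad))
    return np.asarray(action) + step
```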