 Reinforcement Learning


MushroomRL: Simplifying Reinforcement Learning Research

arXiv.org Machine Learning

MushroomRL is an open-source Python library developed to simplify the process of implementing and running Reinforcement Learning (RL) experiments. Compared to other available libraries, MushroomRL has been created with the purpose of providing a comprehensive and flexible framework to minimize the effort in implementing and testing novel RL methodologies. Indeed, the architecture of MushroomRL is built in such a way that every component of an RL problem is already provided, and most of the time users can focus solely on the implementation of their own algorithms and experiments. The result is a library from which RL researchers can significantly benefit in the critical phase of the empirical analysis of their works. MushroomRL's stable code, tutorials and documentation can be found at https://github.com/MushroomRL/mushroom-rl.


Debate Dynamics for Human-comprehensible Fact-checking on Knowledge Graphs

arXiv.org Artificial Intelligence

We propose a novel method for fact-checking on knowledge graphs based on debate dynamics. The underlying idea is to frame the task of triple classification as a debate game between two reinforcement learning agents which extract arguments -- paths in the knowledge graph -- with the goal of justifying the fact being true (thesis) or the fact being false (antithesis), respectively. Based on these arguments, a binary classifier, referred to as the judge, decides whether the fact is true or false. The two agents can be considered as sparse feature extractors that present interpretable evidence for either the thesis or the antithesis. In contrast to black-box methods, the arguments enable the user to gain an understanding of the judge's decision. Moreover, our method allows for interactive reasoning on knowledge graphs where the users can raise additional arguments or evaluate the debate taking common sense reasoning and external information into account. Such interactive systems can increase the acceptance of various AI applications based on knowledge graphs and can further lead to higher efficiency, robustness, and fairness.


A Probabilistic Simulator of Spatial Demand for Product Allocation

arXiv.org Artificial Intelligence

Connecting consumers with relevant products is a very important problem in both online and offline commerce. In physical retail, product placement is an effective way to connect consumers with products. However, selecting product locations within a store can be a tedious process. Moreover, learning important spatial patterns in offline retail is challenging due to the scarcity of data and the high cost of exploration and experimentation in the physical world. To address these challenges, we propose a stochastic model of spatial demand in physical retail. We show that the proposed model is more predictive of demand than existing baselines. We also perform a preliminary study into different automation techniques and show that an optimal product allocation policy can be learned through Deep Q-Learning.
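As a rough illustration of the learning rule behind Deep Q-Learning, the sketch below shows a single tabular Q-learning update. It is a simplified stand-in, not the paper's method: the deep variant replaces the table with a neural network, and the state and action names, `alpha`, and `gamma` used here are assumptions chosen only for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

# Q-table keyed by (state, action); unseen pairs default to 0.0.
QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    actions: Sequence[Hashable],
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Apply one Q-learning step and return the updated Q(state, action)."""
    # Greedy value of the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate reward plus discounted lookahead.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q: QTable = defaultdict(float)
    # Hypothetical allocation step: placing an item on a shelf yields a sale.
    actions = ["place_item_1", "place_item_2"]
    print(q_update(q, "shelf_A", "place_item_1", 1.0, "shelf_B", actions))
```

In a product allocation setting, a state could encode the current store layout and an action a placement decision, with observed or simulated demand as the reward; that mapping is a guess at the setup rather than something stated in the abstract.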


Population-Guided Parallel Policy Search for Reinforcement Learning

arXiv.org Artificial Intelligence

Abstract. In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). In the proposed scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and search for a good policy in collaboration with the guidance of the best policy information. The key point is that the information of the best policy is fused in a soft manner by constructing an augmented loss function for policy update to enlarge the overall search region by the multiple learners. The guidance by the previous best policy and the enlarged range enable faster and better policy search. Monotone improvement of the expected cumulative return by the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm. Numerical results show that the constructed algorithm outperforms most of the current state-of-the-art RL algorithms, and the gain is significant in the case of sparse reward environments.

With the success of RL in relatively easy tasks, more challenging tasks such as sparse reward environments (Oh et al. (2018); Zheng et al. (2018); Burda et al. (2019)) are emerging, and developing good RL algorithms for such challenging tasks is of great importance from both theoretical and practical perspectives. In this paper, we consider parallel learning, which is an important line of RL research to enhance the learning performance by having multiple learners for the same environment. In this paper, in order to enhance the learning performance, we apply parallelism to RL based on a population of policies, but the usage is different from the previous methods. One of the advantages of using a population is the capability to evaluate policies in the population. Once all policies in the population are evaluated, we can use information of the best policy to enhance the performance.
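The soft fusion of the best policy's information can be pictured with a minimal sketch. The abstract does not state the exact form of the augmented loss, so the mean squared action distance, the weight `beta`, and every function name below are assumptions made for illustration only.

```python
from typing import Callable, Sequence

# A policy here is simply a function from a scalar state to a scalar action.
Policy = Callable[[float], float]


def select_best(returns: Sequence[float]) -> int:
    """Return the index of the learner with the highest evaluated return."""
    return max(range(len(returns)), key=lambda i: returns[i])


def augmented_policy_loss(
    base_loss: float,
    policy: Policy,
    best_policy: Policy,
    states: Sequence[float],
    beta: float,
) -> float:
    """Base policy loss plus a soft penalty on the distance to the best policy.

    Assumed form: base_loss + beta * mean squared action difference,
    evaluated on a batch of states (for example, from the shared buffer).
    """
    if not states:
        raise ValueError("states must be non-empty")
    distance = sum((policy(s) - best_policy(s)) ** 2 for s in states) / len(states)
    return base_loss + beta * distance


if __name__ == "__main__":
    learners = [lambda s: 2.0 * s, lambda s: s, lambda s: -s]
    best = select_best([1.0, 3.5, 2.0])  # hypothetical evaluation returns
    print(augmented_policy_loss(1.0, learners[0], learners[best], [1.0, 2.0], beta=0.5))
```

Because the penalty is weighted rather than enforced as a hard constraint, each learner is pulled towards the best policy without collapsing onto it, which matches the abstract's description of guiding the search while keeping the search region large.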


Sample-based Distributional Policy Gradient

arXiv.org Machine Learning

Distributional reinforcement learning (DRL) is a recent reinforcement learning framework whose success has been supported by various empirical studies. It relies on the key idea of replacing the expected return with the return distribution, which captures the intrinsic randomness of the long-term rewards. Most of the existing literature on DRL focuses on problems with discrete action spaces and value-based methods. In this work, motivated by applications in robotics with continuous action space control settings, we propose the sample-based distributional policy gradient (SDPG) algorithm. It models the return distribution using samples via a reparameterization technique widely used in generative modeling and inference. We compare SDPG with the state-of-the-art policy gradient method in DRL, distributed distributional deterministic policy gradients (D4PG). We apply SDPG and D4PG to multiple OpenAI Gym environments and observe that our algorithm shows better sample efficiency as well as higher reward for most tasks.
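The idea of representing a return distribution by samples can be sketched in a few lines. This is a toy stand-in rather than the SDPG architecture: a fixed location-scale transform of standard-normal noise plays the role of the learned generator, and the names `sample_returns` and `q_value` are hypothetical.

```python
import random
from statistics import fmean
from typing import List


def sample_returns(
    mean: float, scale: float, n_samples: int, rng: random.Random
) -> List[float]:
    """Draw return samples by transforming standard-normal noise.

    This is the reparameterization idea: randomness enters only through
    the noise, so the samples are a deterministic function of (mean, scale).
    """
    return [mean + scale * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def q_value(samples: List[float]) -> float:
    """Recover the expected return (the usual Q-value) as the sample mean."""
    return fmean(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = sample_returns(mean=5.0, scale=2.0, n_samples=1000, rng=rng)
    print(q_value(samples))
```

Keeping the full set of samples, instead of only their mean, is what lets a distributional method reason about the spread of long-term rewards.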


A Nonparametric Off-Policy Policy Gradient

arXiv.org Machine Learning

Samuele Tosatto (1), João Carvalho (1), Hany Abdulsamad (1), Jan Peters (1,2). (1) Technische Universität Darmstadt, (2) Max Planck Institute for Intelligent Systems.

Abstract. Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient algorithms that perform updates using on-policy samples. The price of such inefficiency becomes evident in real world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. With nonparametric regression and density estimation methods we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function, and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate to show that it is consistent under mild smoothness assumptions and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods.

1 Introduction. Reinforcement learning has made overwhelming progress in recent years (Mnih et al., 2015; Haarnoja et al., 2018; Schulman et al., 2015). However, the vast majority of reinforcement learning approaches are on-policy algorithms with limited applicability to real world scenarios, due to high sample complexity. In contrast, off-policy techniques are theoretically more sample efficient, because they decouple the procedures ...

Figure 1: Example showing the bias of offline-DPG (left) and the variance of PWIS-G(PO)MDP (right) in the policy-parameter space of a 2d-LQR setting. Both algorithms diverge while they move away from the "on-policy" region.


EEG-based Drowsiness Estimation for Driving Safety using Deep Q-Learning

arXiv.org Machine Learning

Fatigue is the most significant factor in road fatalities, and one manifestation of fatigue during driving is drowsiness. In this paper, we propose using deep Q-learning to analyze an electroencephalogram (EEG) dataset captured during a simulated endurance driving test. By measuring the correlation between drowsiness and driving performance, this experiment represents an important brain-computer interface (BCI) paradigm, especially from an application perspective. We adapt the terminologies in the driving test to fit the reinforcement learning framework, and thus formulate the drowsiness estimation problem as an optimization of a Q-learning task. By referring to the latest deep Q-learning technologies and attending to the characteristics of EEG data, we tailor a deep Q-network for action proposition that can indirectly estimate drowsiness. Our results show that the trained model can trace the variations of mind state in a satisfactory way against the testing EEG data, which demonstrates the feasibility and practicability of this new computation paradigm. We also show that our method outperforms the supervised learning counterpart and is superior for real applications. To the best of our knowledge, we are the first to introduce the deep reinforcement learning method to this BCI scenario, and our method can be potentially generalized to other BCI cases. Fatigue is regarded as the most severe factor causing road fatalities [1]. Understanding the correlation between fatigue and driving performance, in both theory and practice, is of persistent interest for researchers.


Addressing Value Estimation Errors in Reinforcement Learning with a State-Action Return Distribution Function

arXiv.org Artificial Intelligence

In current reinforcement learning (RL) methods, function approximation errors are known to lead to overestimated or underestimated state-action values Q, which in turn lead to suboptimal policies. We show that the learning of a state-action return distribution function can be used to improve the estimation accuracy of the Q-value. We combine the distributional return function with the maximum entropy RL framework in order to develop what we call the Distributional Soft Actor-Critic algorithm, DSAC, which is an off-policy method for continuous control settings. Unlike traditional distributional Q algorithms which typically only learn a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current return distribution to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve the learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
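The truncation step can be illustrated with a small sketch. The abstract only says that the difference between the target and the current return distribution is truncated, so the scalar formulation, the symmetric `bound`, and the function name `clip_target` below are assumptions and not the DSAC implementation.

```python
def clip_target(target_return: float, current_mean: float, bound: float) -> float:
    """Limit how far a target return may lie from the current estimate.

    Assumed form: the difference (target - current) is clamped to
    [-bound, bound] before being added back to the current estimate, so a
    single extreme target cannot produce an arbitrarily large update.
    """
    if bound < 0:
        raise ValueError("bound must be non-negative")
    difference = target_return - current_mean
    clipped = max(-bound, min(bound, difference))
    return current_mean + clipped


if __name__ == "__main__":
    # An outlier target of 100 is pulled back to within 20 of the estimate.
    print(clip_target(target_return=100.0, current_mean=10.0, bound=20.0))
```

Targets that already lie within the bound pass through unchanged, so the truncation only affects the updates that would otherwise dominate the gradient.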


Reinforcement Learning is full of Manipulative Consultants

#artificialintelligence

Imagine you go to an investment consultant, and you first ask how he charges. Is it according to the profit you'll make? "The more accurate I am in my predictions of your returns, you'll pay me more. But I will be tested only on the investments you choose to make." This smells a bit fishy, and you start sniffing around for other people who are using this consultant. Turns out he recommended them all only government bonds with low return and low variability.


The Past and Present of Imitation Learning: A Citation Chain Study

arXiv.org Artificial Intelligence

Introduction. Imitation Learning is a promising area of active research. Early research in 'programming by example' began in Software Development [9] before attracting the interest of Robotics and Artificial Intelligence (AI) researchers, who began using the terms 'Learning from Demonstration' and 'Imitation Learning' to describe their line of work. Over the last 30 years, Imitation Learning has advanced significantly and been used to solve difficult tasks ranging from Autonomous Driving [12] to playing Atari games [5]. In the course of this development, different methods for performing Imitation Learning have fallen into and out of favor. In this paper, I will explore the development of these different methods and attempt to examine how the field has progressed. I will be discussing 4 landmark papers that sequentially cite and inform each other.