Goto

Collaborating Authors

 Reinforcement Learning


r/MachineLearning - [R] Benchmarking Batch Deep Reinforcement Learning Algorithms

#artificialintelligence

Abstract: Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off- policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.


Green Deep Reinforcement Learning for Radio Resource Management: Architecture, Algorithm Compression and Challenge

arXiv.org Artificial Intelligence

AI heralds a step-change in the performance and capability of wireless networks and other critical infrastructures. However, it may also cause irreversible environmental damage due to their high energy consumption. Here, we address this challenge in the context of 5G and beyond, where there is a complexity explosion in radio resource management (RRM). On the one hand, deep reinforcement learning (DRL) provides a powerful tool for scalable optimization for high dimensional RRM problems in a dynamic environment. On the other hand, DRL algorithms consume a high amount of energy over time and risk compromising progress made in green radio research. This paper reviews and analyzes how to achieve green DRL for RRM via both architecture and algorithm innovations. Architecturally, a cloud based training and distributed decision-making DRL scheme is proposed, where RRM entities can make lightweight deep local decisions whilst assisted by on-cloud training and updating. On the algorithm level, compression approaches are introduced for both deep neural networks and the underlying Markov Decision Processes, enabling accurate low-dimensional representations of challenges. To scale learning across geographic areas, a spatial transfer learning scheme is proposed to further promote the learning efficiency of distributed DRL entities by exploiting the traffic demand correlations. Together, our proposed architecture and algorithms provide a vision for green and on-demand DRL capability.


Efficient Inference and Exploration for Reinforcement Learning

arXiv.org Machine Learning

Despite an ever growing literature on reinforcement learning algorithms and applications, much less is known about their statistical inference. In this paper, we investigate the large sample behaviors of the Q-value estimates with closed-form characterizations of the asymptotic variances. This allows us to efficiently construct confidence regions for Q-value and optimal value functions, and to develop policies to minimize their estimation errors. This also leads to a policy exploration strategy that relies on estimating the relative discrepancies among the Q estimates. Numerical experiments show superior performances of our exploration strategy than other benchmark approaches.


Curiosity-Driven Recommendation Strategy for Adaptive Learning via Deep Reinforcement Learning

arXiv.org Machine Learning

The design of recommendations strategies in the adaptive learning system focuses on utilizing currently available information to provide individual-specific learning instructions for learners. As a critical motivate for human behaviors, curiosity is essentially the drive to explore knowledge and seek information. In a psychologically inspired view, we aim to incorporate the element of curiosity for guiding learners to study spontaneously. In this paper, a curiosity-driven recommendation policy is proposed under the reinforcement learning framework, allowing for a both efficient and enjoyable personalized learning mode. Given intrinsic rewards from a well-designed predictive model, we apply the actor-critic method to approximate the policy directly through neural networks. Numeric analyses with a large continuous knowledge state space and concrete learning scenarios are used to further demonstrate the power of the proposed method.


Zap Q-Learning With Nonlinear Function Approximation

arXiv.org Machine Learning

The Zap stochastic approximation (SA) algorithm was introduced recently as a means to accelerate convergence in reinforcement learning algorithms. While numerical results were impressive, stability (in the sense of boundedness of parameter estimates) was established in only a few special cases. This class of algorithms is generalized in this paper, and stability is established under very general conditions. This general result can be applied to a wide range of algorithms found in reinforcement learning. Two classes are considered in this paper: (i)The natural generalization of Watkins' algorithm is not always stable in function approximation settings. Parameter estimates may diverge to infinity even in the \textit{linear} function approximation setting with a simple finite state-action MDP. Under mild conditions, the Zap SA algorithm provides a stable algorithm, even in the case of \textit{nonlinear} function approximation. (ii) The GQ algorithm of Maei et.~al.~2010 is designed to address the stability challenge. Analysis is provided to explain why the algorithm may be very slow to converge in practice. The new Zap GQ algorithm is stable even for nonlinear function approximation.


A Simple Randomization Technique for Generalization in Deep Reinforcement Learning

arXiv.org Machine Learning

Deep reinforcement learning (RL) agents often fail to generalize to unseen environments (yet semantically similar to trained agents), particularly when they are trained on high-dimensional state spaces, such as images. In this paper, we propose a simple technique to improve a generalization ability of deep RL agents by introducing a randomized (convolutional) neural network that randomly perturbs input observations. It enables trained agents to adapt to new domains by learning robust features invariant across varied and randomized environments. Furthermore, we consider an inference method based on the Monte Carlo approximation to reduce the variance induced by this randomization. We demonstrate the superiority of our method across 2D CoinRun, 3D DeepMind Lab exploration and 3D robotics control tasks: it significantly outperforms various regularization and data augmentation methods for the same purpose.


Relation learning in a neurocomputational architecture supports cross-domain transfer

arXiv.org Artificial Intelligence

People readily generalise prior knowledge to novel situations and stimuli. Advances in machine learning and artificial intelligence have begun to approximate and even surpass human performance in specific domains, but machine learning systems struggle to generalise information to untrained situations. We present and model that demonstrates human-like extrapolatory generalisation by learning and explicitly representing an open-ended set of relations characterising regularities within the domains it is exposed to. First, when trained to play one video game (e.g., Breakout). the model generalises to a new game (e.g., Pong) with different rules, dimensions, and characteristics in a single shot. Second, the model can learn representations from a different domain (e.g., 3D shape images) that support learning a video game and generalising to a new game in one shot. By exploiting well-established principles from cognitive psychology and neuroscience, the model learns structured representations without feedback, and without requiring knowledge of the relevant relations to be given a priori. We present additional simulations showing that the representations that the model learns support cross-domain generalisation. The model's ability to generalise between different games demonstrates the flexible generalisation afforded by a capacity to learn not only statistical relations, but also other relations that are useful for characterising the domain to be learned. In turn, this kind of flexible, relational generalisation is only possible because the model is capable of representing relations explicitly, a capacity that is notably absent in extant statistical machine learning algorithms.


Toward Synergic Learning for Autonomous Manipulation of Deformable Tissues via Surgical Robots: An Approximate Q-Learning Approach

arXiv.org Artificial Intelligence

-- In this paper, we present a synergic learning algorithm to address the task of indirect manipulation of an unknown deformable tissue. Tissue manipulation is a common yet challenging task in various surgical interventions, which makes it a good candidate for robotic automation. We propose using a linear approximate Q-learning method in which human knowledge contributes to selecting useful yet simple features of tissue manipulation while the algorithm learns to take optimal actions and accomplish the task. The algorithm is implemented and evaluated on a simulation using the OpenCV and CHAI3D libraries. Successful simulation results for four different configurations which are based on realistic tissue manipulation scenarios are presented. Results indicate that with a careful selection of relatively simple and intuitive features, the developed Q-learning algorithm can successfully learn an optimal policy without any prior knowledge of tissue dynamics or camera intrinsic/extrinsic calibration parameters. Robot-Assisted Surgery (RAS) is becoming the norm of many operating room procedures, as it enables enhanced precision, dexterity, and feedback.


Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards

arXiv.org Artificial Intelligence

Hierarchical Reinforcement Learning (HRL) is a promising approach to solving long-horizon problems with sparse and delayed rewards. Many existing HRL algorithms either use pre-trained low-level skills that are unadaptable, or require domain-specific information to define low-level rewards. In this paper, we aim to adapt low-level skills to downstream tasks while maintaining the generality of reward design. We propose an HRL framework which sets auxiliary rewards for low-level skill training based on the advantage function of the high-level policy. This auxiliary reward enables efficient, simultaneous learning of the high-level policy and low-level skills without using task-specific knowledge. In addition, we also theoretically prove that optimizing low-level skills with this auxiliary reward will increase the task return for the joint policy. Experimental results show that our algorithm dramatically outperforms other state-of-the-art HRL methods in Mujoco domains. We also find both low-level and high-level policies trained by our algorithm transferable.


Infinite-horizon Off-Policy Policy Evaluation with Multiple Behavior Policies

arXiv.org Machine Learning

We consider off-policy policy evaluation when the trajectory data are generated by multiple behavior policies. Recent work has shown the key role played by the state or state-action stationary distribution corrections in the infinite horizon context for off-policy policy evaluation. We propose estimated mixture policy (EMP), a novel class of partially policy-agnostic methods to accurately estimate those quantities. With careful analysis, we show that EMP gives rise to estimates with reduced variance for estimating the state stationary distribution correction while it also offers a useful induction bias for estimating the state-action stationary distribution correction. In extensive experiments with both continuous and discrete environments, we demonstrate that our algorithm offers significantly improved accuracy compared to the state-of-the-art methods.