 Reinforcement Learning


Deep Reinforcement Learning with Stage Incentive Mechanism for Robotic Trajectory Planning

arXiv.org Artificial Intelligence

This work aims to improve the efficiency of deep reinforcement learning (DRL) based methods for robot manipulator trajectory planning in random working environments. First, a posture reward function is proposed to accelerate the learning process and yield a more reasonable trajectory by modeling distance and direction constraints, which reduces the blindness of exploration. Second, to improve stability, a stride reward function is proposed that models constraints on the distance and on the movement distance of the joints, which makes the learning process more stable. To further improve learning efficiency, we draw on the cognitive process of human behavior and propose a stage incentive mechanism, comprising a hard stage incentive reward function and a soft stage incentive reward function. Extensive experiments show that the proposed soft stage incentive reward function improves the convergence rate by up to 46.9% with state-of-the-art DRL methods. The convergence mean reward increases by 4.4% to 15.5%, and the standard deviation decreases by 21.9% to 63.2%. In the evaluation, the success rate of trajectory planning for the robot manipulator is up to 99.6%.


Watch a Robot AI Beat World-Class Curling Competitors

#artificialintelligence

Artificial intelligence still needs to bridge the "sim-to-real" gap. Deep-learning techniques that are all the rage in AI log superlative performances in mastering cerebral games, including chess and Go, both of which can be played on a computer. But translating simulations to the physical world remains a bigger challenge. A robot named Curly that uses "deep reinforcement learning"--making improvements as it corrects its own errors--came out on top in three of four games against top-ranked human opponents from South Korean teams that included a women's team and a reserve squad for the national wheelchair team. One crucial finding was that the AI system demonstrated its ability to adapt to changing ice conditions.


Continual Model-Based Reinforcement Learning with Hypernetworks

arXiv.org Artificial Intelligence

Lifelong model-based robot learning is predicated upon continual adaptation to the dynamics of new tasks. For example, robots need to learn to manipulate unseen objects with various mass distributions, walk on new types of terrains with different friction, elasticity, and other physical properties, or even learn to adapt to different tasks, such as walking, running, or climbing stairs. This presents at least two challenges for many model-based reinforcement learning (MBRL) and model-predictive control (MPC) formulations, which typically consist of a dynamics learning phase followed by a planning/policy optimization and execution phase. First, these methods are not scalable because the time required to train the dynamics model grows linearly with the size of the collected experience. Second, as the robot learner encounters and adapts to new tasks, it has to avoid catastrophic forgetting of the dynamics of old tasks, and should ideally exhibit both forward transfer (old tasks improve the learning performance on the new task) and backward transfer (the new task improves the performance on old tasks). Many MBRL and MPC methods lack this type of adaptation and positive transfer. In this work, we propose to extend the task-aware continual learning approach based on hypernetworks in [1] to adapt to changing environment dynamics and to address the scalability and positive transfer challenges mentioned above in a reinforcement learning setting.


Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games

arXiv.org Machine Learning

We show that Reinforcement Learning (RL) methods for solving Text-Based Games (TBGs) often fail to generalize to unseen games, especially in small data regimes. To address this issue, we propose Context Relevant Episodic State Truncation (CREST), which removes irrelevant tokens from the observation text to improve generalization. Our method first trains a base model using Q-learning, which typically overfits the training games. The base model's action token distribution is then used to prune the observations, removing irrelevant tokens. A second, bootstrapped model is then retrained on the pruned observation text. Our bootstrapped agent shows improved generalization in solving unseen TextWorld games, using 10x-20x fewer training games compared to previous state-of-the-art methods despite requiring fewer training episodes.


Multimodal Safety-Critical Scenarios Generation for Decision-Making Algorithms Evaluation

arXiv.org Machine Learning

Existing neural network-based autonomous systems have been shown to be vulnerable to adversarial attacks; therefore, sophisticated evaluation of their robustness is of great importance. However, evaluating robustness only under worst-case scenarios based on known attacks is not comprehensive, not to mention that some of these scenarios rarely occur in the real world. In addition, the distribution of safety-critical data is usually multimodal, while most traditional attacks and evaluation methods focus on a single modality. To address these challenges, we propose a flow-based multimodal safety-critical scenario generator for evaluating decision-making algorithms. The proposed generative model is optimized with weighted likelihood maximization, and a gradient-based sampling procedure is integrated to improve the sampling efficiency. The safety-critical scenarios are generated by querying the task algorithms, and the log-likelihood of the generated scenarios is proportional to the risk level. Experiments on a self-driving task demonstrate our advantages in terms of testing efficiency and multimodal modeling capability. We evaluate six Reinforcement Learning algorithms with our generated traffic scenarios and provide empirical conclusions about their robustness.


Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

arXiv.org Machine Learning

Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization -- an algorithmic scheme that encourages exploration -- and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain limited even for the tabular setting. This paper develops non-asymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly -- or even quadratically once it enters a local region around the optimal policy -- when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
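
As a rough illustration of the setting described above (tabular, softmax parameterization, exact policy evaluation), the sketch below runs an entropy-regularized NPG iteration on a tiny MDP. It uses the multiplicative form of the update commonly written for this setting, in which the new policy is proportional to pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta*Q(s,a)/(1-gamma)). The two-state MDP, the step size `eta`, the regularization weight `tau`, and all function names are assumptions chosen for illustration, not taken from the paper.

```python
import math

def soft_q_values(P, R, pi, gamma, tau, sweeps=200):
    """Soft (entropy-regularized) policy evaluation by fixed-point iteration."""
    n_s, n_a = len(R), len(R[0])

    def q_from(V):
        # Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        return [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
                 for a in range(n_a)] for s in range(n_s)]

    V = [0.0] * n_s
    for _ in range(sweeps):
        Q = q_from(V)
        # V(s) = E_{a ~ pi}[ Q(s,a) - tau * log pi(a|s) ]
        V = [sum(pi[s][a] * (Q[s][a] - tau * math.log(pi[s][a]))
                 for a in range(n_a)) for s in range(n_s)]
    return q_from(V)

def npg_step(pi, Q, eta, gamma, tau):
    """One update: pi_new ∝ pi^(1 - eta*tau/(1-gamma)) * exp(eta*Q/(1-gamma))."""
    keep = 1.0 - eta * tau / (1.0 - gamma)
    new_pi = []
    for s in range(len(pi)):
        logits = [keep * math.log(p) + eta * q / (1.0 - gamma)
                  for p, q in zip(pi[s], Q[s])]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        new_pi.append([w / total for w in weights])
    return new_pi

def run_npg(iters=200, gamma=0.5, tau=0.5):
    # Hypothetical toy MDP: P[s][a] is a distribution over next states.
    P = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.5, 0.5], [0.1, 0.9]]]
    R = [[1.0, 0.0], [0.0, 2.0]]
    eta = 0.5 * (1.0 - gamma) / tau  # assumed step size within (0, (1-gamma)/tau]
    pi = [[0.5, 0.5], [0.5, 0.5]]    # uniform initial policy
    for _ in range(iters):
        Q = soft_q_values(P, R, pi, gamma, tau)
        pi = npg_step(pi, Q, eta, gamma, tau)
    return pi, soft_q_values(P, R, pi, gamma, tau)
```

Calling `run_npg()` returns the final policy and its soft Q-values. At a fixed point of the update, the policy equals the softmax of Q/tau in each state, which is one simple way to check that the iteration has settled. Setting `eta = (1 - gamma) / tau` removes the dependence on the previous policy and reduces the step to soft policy iteration.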


A New Approach for Tactical Decision Making in Lane Changing: Sample Efficient Deep Q Learning with a Safety Feedback Reward

arXiv.org Artificial Intelligence

There has been a growing interest in self-driving cars by the industry since Darpa Urban Challenge [1]. Despite the great achievements in this competition, the deployment of self-driving cars into production is a quite complicated problem due to reasons such as long tail of edge cases, safety verification and the need of intelligent algorithms that are capable of negotiating with human drivers. There are already level-2 capable cars in production that autonomously control the vehicle at both the longitudinal and lateral levels. The efficient design and implementation of DRL agents involves many steps which are starting with state-action representations, balancing multi-objective reward function, tuning the hyper-parameters of the optimization algorithm, deciding the network architecture, generating rich data out of realistic scenarios and finally broad evaluation against a proper baseline methods with different seeds. Considering the aforementioned steps, [7] lacks the comparison with a fair baseline and uses a very naive simulation environment


Hierarchical Affordance Discovery using Intrinsic Motivation

arXiv.org Artificial Intelligence

To be capable of lifelong learning in a real-life environment, robots have to tackle multiple challenges. Being able to relate physical properties they may observe in their environment to possible interactions they may have is one of them. This skill, named affordance learning, is strongly related to embodiment and is mastered through each person's development: each individual learns affordances differently through their own interactions with their surroundings. Current methods for affordance learning usually either use fixed actions to learn these affordances or focus on static setups involving a robotic arm to be operated. In this article, we propose an algorithm using intrinsic motivation to guide the learning of affordances for a mobile robot. This algorithm is capable of autonomously discovering, learning and adapting interrelated affordances without pre-programmed actions. Once learned, these affordances may be used by the algorithm to plan sequences of actions in order to perform tasks of various difficulties. We then present one experiment and analyse our system before comparing it with other approaches from reinforcement learning and affordance learning.


A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints

arXiv.org Artificial Intelligence

Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an ε-optimal policy, i.e., with resulting objective value within ε of the optimal value and satisfying the constraints within ε-tolerance, with probability at least 1 − δ. S, the number of episodes needed has a linear dependence on the state and action space sizes S and A, respectively, and quadratic dependence on the time horizon H. Markov decision processes (MDPs) [1] offer a natural framework to express sequential decision-making problems and reason about autonomous system behaviors. However, the single cost objective of a traditional MDP formulation may fall short of fully capturing problems with multiple conflicting objectives and additional constraints that must be satisfied.
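
For orientation, the linear programming formulation of a finite-horizon CMDP is commonly written over occupancy measures, as sketched below. The notation is assumed for illustration and is not taken from the paper: q_h(s,a) is the probability of visiting state s and taking action a at step h, c is the objective cost, d_i and \xi_i are the i-th constraint cost and its threshold, \mu is the initial state distribution, and P is the transition kernel.

```latex
\begin{aligned}
\min_{q \ge 0} \quad & \sum_{h=1}^{H} \sum_{s,a} q_h(s,a)\, c(s,a) \\
\text{s.t.} \quad & \sum_{a} q_1(s,a) = \mu(s) && \forall s, \\
& \sum_{a} q_{h+1}(s',a) = \sum_{s,a} P(s' \mid s,a)\, q_h(s,a) && \forall s',\ h = 1,\dots,H-1, \\
& \sum_{h=1}^{H} \sum_{s,a} q_h(s,a)\, d_i(s,a) \le \xi_i && \forall i.
\end{aligned}
```

Under this formulation, a policy can be read off a feasible solution as \pi_h(a \mid s) = q_h(s,a) / \sum_{a'} q_h(s,a') wherever the denominator is positive. An optimistic planner of the kind the abstract mentions would, roughly speaking, solve such a program repeatedly with estimated transitions in place of the unknown P.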


Probabilistic Machine Learning for Healthcare

arXiv.org Machine Learning

Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be beneficial including calibration and missing data. Beyond predictive models, we also investigate the utility of probabilistic machine learning models in phenotyping, in generative models for clinical use cases, and in reinforcement learning.