 Reinforcement Learning


Bound Controller for a Quadruped Robot using Pre-Fitting Deep Reinforcement Learning

arXiv.org Artificial Intelligence

The bound gait is an important gait in quadruped robot locomotion. It can be used to cross obstacles and often serves as a transition mode between trot and gallop. However, because of the complexity of the models, the bound gait built by conventional control methods is often unnatural and slow to compute. In the present work, we introduce a method to achieve the bound gait based on model-free pre-fit deep reinforcement learning (PF-DRL). We first constructed a net with the same structure as the actor net in PPO2 and pre-fit it using data collected from a robot running a conventional model-based controller. Next, the trained weights were transferred into PPO2 and optimized further. Moreover, targeting the symmetrical and periodic characteristics of bounding, we designed a reward function based on contact points. We also used feature engineering to improve the input features of the DRL model and improve performance on flat ground. Finally, we trained the bound controller in simulation and successfully deployed it on the Jueying Mini robot. In our experiments, it performs better than the conventional method, with higher computational efficiency and a more stable center-of-mass height.


Learning When to Switch: Composing Controllers to Traverse a Sequence of Terrain Artifacts

arXiv.org Artificial Intelligence

Legged robots often use separate control policies that are highly engineered for traversing difficult terrain such as stairs, gaps, and steps, where switching between policies is only possible when the robot is in a region that is common to adjacent controllers. Deep Reinforcement Learning (DRL) is a promising alternative to hand-crafted control design, though it typically requires the full set of test conditions to be known before training. DRL policies can result in complex (often unrealistic) behaviours that have few or no overlapping regions between adjacent policies, making it difficult to switch behaviours. In this work we develop multiple DRL policies with Curriculum Learning (CL), each of which can traverse a single respective terrain condition, while ensuring an overlap between policies. We then train a network for each destination policy that estimates the likelihood of successfully switching from any other policy. We evaluate our switching method on a previously unseen combination of terrain artifacts and show that it performs better than heuristic methods. While our method is trained on individual terrain types, it performs comparably to a Deep Q Network trained on the full set of terrain conditions. This approach allows the development of separate policies in constrained conditions with embedded prior knowledge about each behaviour, is scalable to any number of behaviours, and prepares DRL methods for applications in the real world.


A Level-wise Taxonomic Perspective on Automated Machine Learning to Date and Beyond: Challenges and Opportunities

arXiv.org Artificial Intelligence

Automated machine learning (AutoML) is essentially the automation of the process of applying machine learning to real-world problems. The primary goals of AutoML tools are to provide methods and processes that make machine learning available to non-machine-learning experts (domain experts), to improve the efficiency of machine learning, and to accelerate research on machine learning. Although automation and efficiency are some of AutoML's main selling points, the process still requires a surprising level of human involvement. A number of vital steps of the machine learning pipeline, including understanding the attributes of domain-specific data, defining prediction problems, and creating a suitable training data set, still tend to be done manually by a data scientist on an ad-hoc basis. Often, this process requires a lot of back-and-forth between the data scientist and domain experts, making the whole process more difficult and inefficient. Altogether, AutoML systems are still far from a "real automatic system". In this review article, we present a level-wise taxonomic perspective on AutoML systems to date and beyond, i.e., we introduce a new classification system with seven levels to distinguish AutoML systems based on their level of autonomy. We start with a discussion of what an end-to-end machine learning pipeline actually looks like and which sub-tasks of the machine learning pipeline have been automated so far. Next, we highlight the sub-tasks that are still done manually by a data scientist in most cases and how that limits a domain expert's access to machine learning. Then, we introduce the novel level-based taxonomy of AutoML systems and define each level according to its scope of automation support. Finally, we provide a road map of future research in the area of AutoML and discuss some important challenges in achieving this ambitious goal.


Reward Propagation Using Graph Convolutional Networks

arXiv.org Artificial Intelligence

Potential-based reward shaping provides an approach for designing good reward functions, with the purpose of speeding up learning. However, automatically finding potential functions for complex environments is a difficult problem (in fact, of the same difficulty as learning a value function from scratch). We propose a new framework for learning potential functions by leveraging ideas from graph representation learning. Our approach relies on Graph Convolutional Networks which we use as a key ingredient in combination with the probabilistic inference view of reinforcement learning. More precisely, we leverage Graph Convolutional Networks to perform message passing from rewarding states. The propagated messages can then be used as potential functions for reward shaping to accelerate learning. We verify empirically that our approach can achieve considerable improvements in both small and high-dimensional control problems.
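As a reminder of what a potential function is used for, the standard potential-based shaping rule adds gamma * Phi(s') - Phi(s) to the environment reward. The sketch below is illustrative only: the chain of five states and the hand-written potential are hypothetical stand-ins, and it does not reproduce the paper's Graph Convolutional Network propagation, which would supply Phi instead.

```python
GAMMA = 0.9  # discount factor (assumed value for illustration)


def shaped_reward(reward, phi_s, phi_next, gamma=GAMMA):
    """Return the shaped reward r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s


# Hypothetical potential on a 5-state chain: larger closer to the goal (state 4).
# In the paper's setting this table would come from a learned model instead.
potential = {s: float(s) for s in range(5)}


def shape_transition(s, s_next, reward):
    """Shape the reward of one transition s -> s_next using the table above."""
    return shaped_reward(reward, potential[s], potential[s_next])


if __name__ == "__main__":
    # Moving toward the goal earns a bonus, moving away a penalty.
    print(shape_transition(2, 3, 0.0))
    print(shape_transition(3, 2, 0.0))
```

Because the bonus is a difference of potentials, it telescopes along a trajectory, which is why this form of shaping leaves the optimal policy unchanged.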


Finding the Near Optimal Policy via Adaptive Reduced Regularization in MDPs

arXiv.org Machine Learning

Reinforcement learning (RL) has achieved great empirical success, especially when the policy and value function are parameterized by neural networks. Many studies [16, 21, 24, 11] have shown striking performance of RL compared to human-level performance. Dynamic programming [19, 20, 10, 3] and policy gradient methods [31, 26, 13] are the most frequently used optimization tools in these studies. However, when policy gradient methods are applied, theoretical understanding of the success of RL is still limited, whether the policy is searched over the simplex or over a parameterized space. There is a line of recent work [6, 1, 5] on the convergence of policy gradient methods for MDPs without parameterization, while another line of recent work [15, 7, 30, 8] focuses on MDPs with parameterization. In addition, during the process of learning MDPs, it is often observed that the obtained policy can be quite deterministic while the environment is not fully explored. Some prior works [2, 17, 9, 28] propose adding the Shannon entropy to each reward to make the policy stochastic, so that the agent can explore the environment instead of being trapped in a local region. Intuitively and empirically, adding entropy regularization helps soften the learning process and encourages agents to explore more, so it might speed up convergence.
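The effect of entropy regularization can be seen in a single state. Maximizing the expected action value plus tau times the Shannon entropy of the policy yields a softmax policy with temperature tau, and the maximum equals tau * log(sum(exp(Q / tau))). The following is a minimal sketch with hypothetical action values; it is not the adaptive scheme studied in the paper.

```python
import math


def softmax_policy(q_values, tau):
    """Policy proportional to exp(Q / tau); tau > 0 is the regularization weight."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(q_values, probs, tau):
    """Expected action value plus tau times the policy entropy."""
    expected = sum(p * q for p, q in zip(probs, q_values))
    return expected + tau * entropy(probs)


if __name__ == "__main__":
    q = [1.0, 0.5, 0.0]  # hypothetical action values for one state
    for tau in (1.0, 0.1):
        pi = softmax_policy(q, tau)
        print(tau, [round(p, 3) for p in pi], round(entropy(pi), 3))
```

A large tau keeps the policy close to uniform (more exploration), while a small tau pushes it toward the greedy action, which is the trade-off that motivates reducing the regularization over time.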


High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards

arXiv.org Machine Learning

Robots that can learn in the physical world will be important to enable robots to escape their stiff and pre-programmed movements. For dynamic high-acceleration tasks, such as juggling, learning in the real world is particularly challenging, as one must push the limits of the robot and its actuation without harming the system, amplifying the necessity of sample efficiency and safety for robot learning algorithms. In contrast to prior work, which mainly focuses on the learning algorithm, we propose a learning system that directly incorporates these requirements in the design of the policy representation, initialization, and optimization. We demonstrate that this system enables the high-speed Barrett WAM manipulator to learn juggling two balls from 56 minutes of experience with a binary reward signal. The final policy juggles continuously for up to 33 minutes or about 4500 repeated catches. The videos documenting the learning process and the evaluation can be found at https://sites.google.com/view/jugglingbot


The MAGICAL Benchmark for Robust Imitation

arXiv.org Artificial Intelligence

Imitation Learning (IL) algorithms are typically evaluated in the same environment that was used to create demonstrations. This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings. This paper presents the MAGICAL benchmark suite, which permits systematic evaluation of generalisation by quantifying robustness to different kinds of distribution shift that an IL algorithm is likely to encounter in practice. Using the MAGICAL suite, we confirm that existing IL algorithms overfit significantly to the context in which demonstrations are provided. We also show that standard methods for reducing overfitting are effective at creating narrow perceptual invariances, but are not sufficient to enable transfer to contexts that require substantially different behaviour, which suggests that new approaches will be needed in order to robustly generalise demonstrator intent. Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/.


Robust Imitation Learning from Noisy Demonstrations

arXiv.org Artificial Intelligence

The goal of sequential decision making is to learn a good policy that makes good decisions (Puterman, 1994). Imitation learning (IL) is an approach that learns a policy from demonstrations (i.e., sequences of demonstrators' decisions) (Schaal, 1999). Researchers have shown that a good policy can be learned efficiently from high-quality demonstrations collected from experts (Ng and Russell, 2000; Syed et al., 2008; Ziebart et al., 2010; Ho and Ermon, 2016; Sun et al., 2019). However, demonstrations in the real world often have lower quality due to noise or insufficient expertise of demonstrators, especially when humans are involved in the data collection process (Mandlekar et al., 2018). This is problematic because low-quality demonstrations can reduce the efficiency of IL both in theory and in practice (Tangkaratt et al., 2020). In this paper, we show theoretically and experimentally that IL can perform well even in the presence of noise.


Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space -- Fundamental Theory and Methods

arXiv.org Artificial Intelligence

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as a foundation for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies including 1) discounted RL and 2) optimal control tasks. Fundamental mathematical properties -- admissibility, uniqueness of the solution to the Bellman equation (BE), monotone improvement, convergence, and optimality of the solution to the Hamilton-Jacobi-Bellman equation (HJBE) -- are all investigated in depth and improved from the existing theory, for both the general framework and the case studies. Finally, the proposed methods are simulated with an inverted-pendulum model, using both model-based and partially model-free implementations, to support the theory and investigate the methods further.
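The evaluate-then-improve loop that PI refers to is easiest to see in the classical discrete-time, tabular setting. The sketch below is that classical version only, not the DPI or IPI methods proposed for continuous time and space; the four-state chain, its rewards, and the function names are hypothetical.

```python
def policy_iteration(n_states, n_actions, step, gamma=0.9, tol=1e-10):
    """Tabular policy iteration for a deterministic MDP.

    `step(s, a)` returns (next_state, reward).
    """
    policy = [0] * n_states
    values = [0.0] * n_states
    while True:
        # Policy evaluation: sweep the Bellman equation for the fixed policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                s_next, r = step(s, policy[s])
                new_v = r + gamma * values[s_next]
                delta = max(delta, abs(new_v - values[s]))
                values[s] = new_v
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in range(n_states):
            def q(a):
                s_next, r = step(s, a)
                return r + gamma * values[s_next]
            best = max(range(n_actions), key=q)
            if q(best) > q(policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, values


def chain_step(s, a):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.

    Reward 1.0 whenever the next state is the rightmost state (3).
    """
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0)


if __name__ == "__main__":
    policy, values = policy_iteration(4, 2, chain_step)
    print(policy, [round(v, 3) for v in values])
```

Each improvement step can only raise the value of the current policy, which is the discrete counterpart of the monotone improvement property listed in the abstract.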


Carbon Relay Extends AIOps Platform to Kubernetes HPA - Container Journal

#artificialintelligence

Carbon Relay announced this week that its Red Sky platform for configuring and optimizing container applications using machine learning algorithms now also makes it possible to scale Kubernetes clusters more efficiently. Company CTO Ofer Idan says Carbon Relay has extended the machine learning algorithms it developed for its IT operations platform based on artificial intelligence (AIOps) to include support for the Kubernetes Horizontal Pod Autoscaler (HPA). That capability can be employed to ensure application performance is maintained consistently as applications scale, or to prevent the overprovisioning of infrastructure resources, he says. The Red Sky platform is available in both open source and enterprise editions. The enterprise edition includes deep reinforcement learning capabilities to continually train the artificial intelligence (AI) agent, automatic Kubernetes application configuration, data sharing, and advanced automation and scheduling capabilities.
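For context on what is being tuned, the Kubernetes documentation describes the HPA's basic scaling rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance of 1 (0.1 by default). The sketch below illustrates only that documented rule; the function name is hypothetical, and it says nothing about how Red Sky chooses its settings.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count suggested by the basic HPA rule for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    print(desired_replicas(4, 200, 100))  # load is twice the target
    print(desired_replicas(4, 50, 100))   # load is half the target
    print(desired_replicas(4, 105, 100))  # within tolerance
```

The target metric value is the kind of parameter that trades application performance against overprovisioning: a lower target scales out earlier and uses more resources.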