Reinforcement Learning
Online 3D Bin Packing with Constrained Deep Reinforcement Learning
Zhao, Hang, She, Qijin, Zhu, Chenyang, Yang, Yin, Xu, Kai
We solve a challenging yet practically useful variant of 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement also subjects to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervisions and transformations to DRL facilitate the agent to learn feasible policies efficiently. Our method can also be generalized e.g., with the ability to handle lookahead or items with different orientations. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains a human-level performance.
Q-Learning with Differential Entropy of Q-Tables
Nguyen, Tung D., Kasmarik, Kathryn E., Abbass, Hussein A.
It is well-known that information loss can occur in the classic and simple Q-learning algorithm. Entropy-based policy search methods were introduced to replace Q-learning and to design algorithms that are more robust against information loss. We conjecture that the reduction in performance during prolonged training sessions of Q-learning is caused by a loss of information, which is non-transparent when only examining the cumulative reward without changing the Q-learning algorithm itself. We introduce Differential Entropy of Q-tables (DE-QT) as an external information loss detector to the Q-learning algorithm. The behaviour of DE-QT over training episodes is analyzed to find an appropriate stopping criterion during training. The results reveal that DE-QT can detect the most appropriate stopping point, where a balance between a high success rate and a high efficiency is met for classic Q-Learning algorithm.
Robots Learning to Move like Animals
Whether it's a dog chasing after a ball, or a monkey swinging through the trees, animals can effortlessly perform an incredibly rich repertoire of agile locomotion skills. But designing controllers that enable legged robots to replicate these agile behaviors can be a very challenging task. The superior agility seen in animals, as compared to robots, might lead one to wonder: can we create more agile robotic controllers with less effort by directly imitating animals? In this work, we present a framework for learning robotic locomotion skills by imitating animals. Given a reference motion clip recorded from an animal (e.g. a dog), our framework uses reinforcement learning to train a control policy that enables a robot to imitate the motion in the real world.
7 Best Courses to Learn Artificial Intelligence in 2020
This is another awesome course by Kirill Eremenko and his SuperDataScience Team on how to solve real-world business problems with AI. If you are business people or just curious how AI can help you then you should join this course. The complex topic of Artificial Intelligence and Machine Learning is presented the best it can without getting too technical. I highly recommend to business professionals trying to improve their skillset and help their business use AI. Talking about social proof, this course is trusted by more than 14,000 students and it has on average 4.3 rating which is amazing proof that this is a great course.
Noise, overestimation and exploration in Deep Reinforcement Learning
We will discuss some statistical noise related phenomena, that were investigated by different authors in the framework of Deep Reinforcement Learning algorithms. The following algorithms are touched: DQN, Double DQN, DDPG, TD3, Hill-Climbing. Firstly, we consider overestimation, that is the harmful property resulting from noise. Then we deal with noise used for exploration, this is the useful noise. We discuss setting the noise parameter in TD3 for typical PyBullet environments associated with articulate bodies such as HopperBulletEnv and Walker2DBulletEnv. In the appendix, in relation with the Hill-Climbing algorithm, we will look at one more example of noise: adaptive noise.
SOAC: The Soft Option Actor-Critic Architecture
Li, Chenghao, Ma, Xiaoteng, Zhang, Chongjie, Yang, Jun, Xia, Li, Zhao, Qianchuan
The option framework has shown great promise by automatically extracting temporally-extended sub-tasks from a long-horizon task. Methods have been proposed for concurrently learning low-level intra-option policies and high-level option selection policy. However, existing methods typically suffer from two major challenges: ineffective exploration and unstable updates. In this paper, we present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges. Our approach introduces an information-theoretical intrinsic reward for encouraging the identification of diverse and effective options. Meanwhile, we utilize a probability inference model to simplify the optimization problem as fitting optimal trajectories. Experimental results demonstrate that our approach significantly outperforms prior on-policy and off-policy methods in a range of Mujoco benchmark tasks while still providing benefits for transfer learning. In these tasks, our approach learns a diverse set of options, each of whose state-action space has strong coherence.
Machine Learning - Redcrix Technologies (P) Ltd.
Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly. In contrast, unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled.
Learning to explore using active neural SLAM
Advances in machine learning, computer vision and robotics have opened up avenues of building intelligent robots which can navigate in the physical world and perform complex tasks in our homes and offices. Exploration is a key challenge in building intelligent navigation agents. When an autonomous agent is dropped in an unseen environment, it needs to explore as much of the environment as fast as possible. How do we go about training autonomous exploration agents? One popular approach is using end-to-end deep Reinforcement Learning (RL).