Reward Engineering for Object Pick and Place Training

arXiv.org Artificial Intelligence

Robotic grasping is a crucial area of research, as it can accelerate automation in several industries that use robots, ranging from manufacturing to healthcare. Reinforcement learning is the field of study in which an agent learns a policy for executing actions by exploring an environment and exploiting the rewards it receives. Reinforcement learning can thus be used by the agent to learn how to execute a certain task, in our case grasping an object. We have used the Pick and Place environment provided by OpenAI's Gym to engineer rewards. Hindsight Experience Replay (HER) has shown promising results on problems with sparse rewards. In the default configuration of the OpenAI baseline and environment, the reward function is calculated using the distance between the target location and the robot end-effector. By weighting the cost according to the distance of the end-effector from the goal along the x, y, and z axes, an intuitive strategy, we were able to almost halve the learning time compared to the baselines provided by OpenAI. In this project, we were also able to introduce certain user-desired trajectories into the learnt policies (city-block / Manhattan trajectories). This helps us understand that by engineering the rewards we can tune the agent to learn policies in a certain way, even if the resulting behaviour is not the optimal one but is the desired one.
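The idea of weighting the cost per axis can be illustrated with a minimal sketch. This is not the authors' reward function: the function names, the weight values, the 0.05 threshold, and the use of a weighted city-block (L1) distance are all assumptions made for illustration only.

```python
import math


def sparse_reward(achieved, goal, threshold=0.05):
    """Sparse baseline-style reward: 0 when within `threshold` of the goal, else -1.

    The threshold value is an assumption for this sketch.
    """
    return 0.0 if math.dist(achieved, goal) <= threshold else -1.0


def axis_weighted_reward(achieved, goal, weights=(1.0, 1.0, 2.0)):
    """Dense reward: negative weighted city-block distance along x, y, z.

    `weights` is a hypothetical choice; a larger weight on one axis
    penalises error along that axis more strongly.
    """
    return -sum(w * abs(a - g) for w, a, g in zip(weights, achieved, goal))


if __name__ == "__main__":
    gripper = (0.0, 0.0, 0.0)
    target = (1.0, 1.0, 1.0)
    print(sparse_reward(gripper, target))
    print(axis_weighted_reward(gripper, target))
```

Changing the relative weights is one way such a reward could favour axis-aligned motion, which is the kind of shaping the abstract describes for city-block trajectories.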


A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies

arXiv.org Machine Learning

We consider the estimation of the policy gradient in partially observable Markov decision processes (POMDPs) with a special class of structured policies that are finite-state controllers. We show that the gradient estimation can be done in the Actor-Critic framework, by making the critic compute a "value" function that does not depend on the states of the POMDP. This function is the conditional mean of the true value function that depends on the states. We show that the critic can be implemented using temporal difference (TD) methods with linear function approximations, and that the analytical results on TD and Actor-Critic can be transferred to this case. Although Actor-Critic algorithms have been used extensively in Markov decision processes (MDPs), up to now they have not been proposed for POMDPs as an alternative to the earlier proposed GPOMDP algorithm, an actor-only method. Furthermore, we show that the same idea applies to semi-Markov problems with a subset of finite-state controllers.
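As a rough illustration of a TD critic with linear function approximation, the following sketch shows one textbook TD(0) weight update. It assumes a discounted formulation and a feature vector built only from quantities the controller can see (for example observations and internal controller state); the function name, step size, and discount factor are hypothetical and are not taken from the paper.

```python
def td0_update(w, phi, reward, phi_next, alpha=0.1, gamma=0.9):
    """One TD(0) step for a linear critic v(x) = w . phi(x).

    phi and phi_next are feature vectors that do not depend on the
    hidden POMDP state (an assumption of this sketch).
    """
    v = sum(wi * xi for wi, xi in zip(w, phi))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    # Temporal-difference error for this transition.
    delta = reward + gamma * v_next - v
    # Gradient of v with respect to w is phi for a linear critic.
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi)]


if __name__ == "__main__":
    weights = [0.0, 0.0]
    weights = td0_update(weights, [1.0, 0.0], 1.0, [0.0, 1.0])
    print(weights)
```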


Distributional Reward Decomposition for Reinforcement Learning

Neural Information Processing Systems

Many reinforcement learning (RL) tasks have specific properties that can be leveraged to adapt existing RL algorithms to those tasks and further improve performance. A general class of such properties is the multiple reward channel: in such environments, the full reward can be decomposed into sub-rewards obtained from different channels. Existing work on reward decomposition either requires prior knowledge of the environment to decompose the full reward, or decomposes the reward without prior knowledge but with degraded performance. In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm that captures the multiple reward channel structure under the distributional setting. Empirically, our method captures the multi-channel structure and discovers meaningful reward decompositions, without requiring any prior knowledge.
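To make the notion of decomposing a return in a distributional setting concrete, here is a small sketch under a strong assumption: if the sub-returns from two channels are independent and each is represented as a categorical distribution over evenly spaced integer-indexed atoms, the distribution of their sum is the discrete convolution of the two. The helper name and the example values are hypothetical, and this is not presented as the DRDRL algorithm itself.

```python
def convolve(p, q):
    """Distribution of X + Y for independent categorical X ~ p and Y ~ q.

    Index i of each list is the probability of value i, so index k of the
    result is the probability that the two sub-returns sum to k.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


if __name__ == "__main__":
    channel_a = [0.5, 0.5]  # sub-return of 0 or 1 with equal probability
    channel_b = [0.5, 0.5]
    print(convolve(channel_a, channel_b))
```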


Diversity-Driven Exploration Strategy for Deep Reinforcement Learning

Neural Information Processing Systems

Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces, deceptive local optima, or sparse rewards. To tackle this problem, we present a diversity-driven approach for exploration, which can be easily combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance measure to the loss function, the proposed methodology significantly enhances an agent's exploratory behaviors, thus preventing the policy from being trapped in local optima. We further propose an adaptive scaling method for stabilizing the learning process. We demonstrate the effectiveness of our method in huge 2D gridworlds and a variety of benchmark environments, including Atari 2600 and MuJoCo. Experimental results show that our method outperforms baseline approaches in most tasks in terms of mean scores and exploration efficiency.
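The phrase "adding a distance measure to the loss function" can be sketched as follows. This is a minimal illustration under stated assumptions: policies are represented as categorical action distributions for a single state, the distance is a KL divergence, and the scaling factor `alpha` is a fixed hypothetical value (the adaptive scaling mentioned above is not modelled). The function names are invented for this sketch.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two categorical action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def diversity_loss(base_loss, policy_probs, prior_probs_list, alpha=0.1):
    """Base loss minus a scaled mean distance to earlier policies.

    Subtracting the distance term means that minimising the loss pushes
    the current policy away from the behaviour of the prior policies.
    """
    mean_dist = sum(
        kl_divergence(policy_probs, q) for q in prior_probs_list
    ) / len(prior_probs_list)
    return base_loss - alpha * mean_dist


if __name__ == "__main__":
    current = [0.9, 0.1]
    priors = [[0.5, 0.5], [0.6, 0.4]]
    print(diversity_loss(1.0, current, priors))
```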

