Goto

Collaborating Authors

 Reinforcement Learning


Lipschitz Continuity in Model-based Reinforcement Learning

arXiv.org Artificial Intelligence

Model-based reinforcement-learning methods learn transition and reward models and use them to guide behavior. We analyze the impact of learning models that are Lipschitz continuous---the distance between function values for two inputs is bounded by a linear function of the distance between the inputs. Our first result shows a tight bound on model errors for multi-step predictions with Lipschitz continuous models. We go on to prove an error bound for the value-function estimate arising from such models and show that the estimated value function is itself Lipschitz continuous. We conclude with empirical results that demonstrate significant benefits to enforcing Lipschitz continuity of neural net models during reinforcement learning.


Cell Selection with Deep Reinforcement Learning in Sparse Mobile Crowdsensing

arXiv.org Artificial Intelligence

Sparse Mobile CrowdSensing (MCS) is a novel MCS paradigm where data inference is incorporated into the MCS process for reducing sensing costs while its quality is guaranteed. Since the sensed data from different cells (sub-areas) of the target sensing area will probably lead to diverse levels of inference data quality, cell selection (i.e., choose which cells of the target area to collect sensed data from participants) is a critical issue that will impact the total amount of data that requires to be collected (i.e., data collection costs) for ensuring a certain level of quality. To address this issue, this paper proposes a Deep Reinforcement learning based Cell selection mechanism for Sparse MCS, called DR-Cell. First, we properly model the key concepts in reinforcement learning including state, action, and reward, and then propose to use a deep recurrent Q-network for learning the Q-function that can help decide which cell is a better choice under a certain state during cell selection. Furthermore, we leverage the transfer learning techniques to reduce the amount of data required for training the Q-function if there are multiple correlated MCS tasks that need to be conducted in the same target area. Experiments on various real-life sensing datasets verify the effectiveness of DR-Cell over the state-of-the-art cell selection mechanisms in Sparse MCS by reducing up to 15% of sensed cells with the same data inference quality guarantee.


Frontier AI: How far are we from artificial "general" intelligence, really?

#artificialintelligence

Another related area that has seen considerable acceleration is reinforcement learning โ€“ a technique where the AI teaches itself how to do something by trying again and again, separating good moves (that lead to rewards) from bad ones, and altering its approach each time, until it masters the skill. Reinforcement learning is another technique that goes back as far as the 1950s, and was considered for a long time an interesting idea that didn't work well. However, that all changed in late 2013 when DeepMind, then an independent startup, taught an AI to play 22 Atari 2600 games, including Space Invaders, at a superhuman level. In 2016, its AlphaGo, an AI trained with reinforcement learning, beat the South Korean Go master Lee Sedol. Then just a few months ago in December 2017, AlphaZero, a more generalized and powerful version of AlphaGo used the same approach to master not just Go, but also chess and shogi. Without any human guidance other than the game rules, AlphaZero taught itself how to play chess at a master level in only four hours.


Beating Atari Games with OpenAI's Evolutionary Strategies โ€ข Filestack Blog

#artificialintelligence

Last month, Filestack sponsored an AI meetup wherein I presented a brief introduction to reinforcement learning and evolutionary strategies. Beforehand, I had promised code examples showing how to beat Atari games using PyTorch. In reality, I did not have time for that kind of side project and so I found some other examples of training agents to play Flappy Bird using Keras, which were entertaining but not complete enough for me to recommend as a springboard for further exploration. Luckily, I recently found some time to develop the promised training scripts. Therefore, I would like to provide an in-depth look of how we can use the PyTorch-ES suite for training reinforcement agents in a variety of environments, including Atari games and OpenAI Gym simulations. In deep reinforcement learning that uses the Q-learning algorithm, which has become very popular, training an intelligent agent includes distinct phases for "observation" and "learning".


Robust Dual View Deep Agent

arXiv.org Machine Learning

Motivated by recent advance of machine learning using Deep Reinforcement Learning this paper proposes a modified architecture that produces more robust agents and speeds up the training process. Our architecture is based on Asynchronous Advantage Actor-Critic (A3C) algorithm where the total input dimensionality is halved by dividing the input into two independent streams. We use ViZDoom, 3D world software that is based on the classical first person shooter video game, Doom, as a test case. The experiments show that in comparison to single input agents, the proposed architecture succeeds to have the same playing performance and shows more robust behavior, achieving significant reduction in the number of training parameters of almost 30%.


A Multi-task Selected Learning Approach for Solving New Type 3D Bin Packing Problem

arXiv.org Artificial Intelligence

This paper studies a new type of 3D bin packing problem (BPP), in which a number of cuboid-shaped items must be put into a bin one by one orthogonally. The objective is to find a way to place these items that can minimize the surface area of the bin. This problem is based on the fact that there is no fixed-sized bin in many real business scenarios and the cost of a bin is proportional to its surface area. Based on previous research on 3D BPP, the surface area is determined by the sequence, spatial locations and orientations of items. It is a new NP-hard combinatorial optimization problem on unfixed-sized bin packing, for which we propose a multi-task framework based on Selected Learning, generating the sequence and orientations of items packed into the bin simultaneously. During training steps, Selected Learning chooses one of loss functions derived from Deep Reinforcement Learning and Supervised Learning corresponding to the training procedure. Numerical results show that the method proposed significantly outperforms Lego baselines by a substantial gain of 7.52%. Moreover, we produce large scale 3D Bin Packing order data set for studying bin packing problems and will release it to the research community.


An Adaptive Clipping Approach for Proximal Policy Optimization

arXiv.org Artificial Intelligence

Very recently proximal policy optimization (PPO) algorithms have been proposed as first-order optimization methods for effective reinforcement learning. While PPO is inspired by the same learning theory that justifies trust region policy optimization (TRPO), PPO substantially simplifies algorithm design and improves data efficiency by performing multiple epochs of \emph{clipped policy optimization} from sampled data. Although clipping in PPO stands for an important new mechanism for efficient and reliable policy update, it may fail to adaptively improve learning performance in accordance with the importance of each sampled state. To address this issue, a new surrogate learning objective featuring an adaptive clipping mechanism is proposed in this paper, enabling us to develop a new algorithm, known as PPO-$\lambda$. PPO-$\lambda$ optimizes policies repeatedly based on a theoretical target for adaptive policy improvement. Meanwhile, destructively large policy update can be effectively prevented through both clipping and adaptive control of a hyperparameter $\lambda$ in PPO-$\lambda$, ensuring high learning reliability. PPO-$\lambda$ enjoys the same simple and efficient design as PPO. Empirically on several Atari game playing tasks and benchmark control tasks, PPO-$\lambda$ also achieved clearly better performance than PPO.


Terrain RL Simulator

arXiv.org Artificial Intelligence

We provide $89$ challenging simulation environments that range in difficulty. The difficulty of solving a task is linked not only to the number of dimensions in the action space but also to the size and shape of the distribution of configurations the agent experiences. Therefore, we are releasing a number of simulation environments that include randomly generated terrain. The library also provides simple mechanisms to create new environments with different agent morphologies and the option to modify the distribution of generated terrain. We believe using these and other more complex simulations will help push the field closer to creating human-level intelligence.


On Learning Intrinsic Rewards for Policy Gradient Methods

arXiv.org Artificial Intelligence

In many sequential decision making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behavior that is considered good by the agent designer. A number of different formulations of the reward-design problem, or close variants thereof, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et.al. that defines the optimal intrinsic reward function as one that when used by an RL agent achieves behavior that optimizes the task-specifying or extrinsic reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead search based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remains an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains.


Multi-Reward Reinforced Summarization with Saliency and Entailment

arXiv.org Artificial Intelligence

Abstractive text summarization is the task of compressing and rewriting a long document into a short summary while maintaining saliency, directed logical entailment, and non-redundancy. In this work, we address these three important aspects of a good summary via a reinforcement learning approach with two novel reward functions: ROUGE-Sal and Entail, on top of a coverage-based baseline. The ROUGESal reward modifies the ROUGE metric by up-weighting the salient phrases/words detected via a keyphrase classifier. The Entail reward gives high (lengthnormalized) scores to logically-entailed summaries using an entailment classifier. Further, we show superior performance improvement when these rewards are combined with traditional metric (ROUGE) based rewards, via our novel and effective multi-reward approach of optimizing multiple rewards simultaneously in alternate mini-batches. Our method achieves the new state-of-the-art results on CNN/Daily Mail dataset as well as strong improvements in a test-only transfer setup on DUC-2002.