 Reinforcement Learning


Wizard of AI: Meet India's foremost reinforcement learning expert FactorDaily

#artificialintelligence

It was his unique teaching style that got me and a bunch of my friends hooked to this topic and field – his enthusiasm towards the material, the intuitive examples that he gives…,


Program Synthesis Through Reinforcement Learning Guided Tree Search

arXiv.org Artificial Intelligence

Program synthesis is the task of generating a program from a provided specification. Traditionally, this has been treated as a search problem by the programming languages (PL) community and more recently as a supervised learning problem by the machine learning community. Here, we propose a third approach, representing the task of synthesizing a given program as a Markov decision process solvable via reinforcement learning (RL). From observations about the states of partial programs, we attempt to find a program that is optimal over a provided reward metric on pairs of programs and states. We instantiate this approach on a subset of the RISC-V assembly language operating on floating point numbers, and as an optimization inspired by search-based techniques from the PL community, we combine RL with a priority search tree. We evaluate this instantiation and demonstrate the effectiveness of our combined method compared to a variety of baselines, including a pure RL ablation and a state-of-the-art Markov chain Monte Carlo search method on this task.


Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

arXiv.org Machine Learning

In this work, we take a representation learning perspective on hierarchical reinforcement learning, where the problem of learning lower layers in a hierarchy is transformed into the problem of learning trajectory-level generative models. We show that we can learn continuous latent representations of trajectories, which are effective in solving temporally extended and multi-stage problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders, and learns latent representations of trajectories. A key component of this method is to learn both a latent-conditioned policy and a latent-conditioned model which are consistent with each other. Given the same latent, the policy generates a trajectory which should match the trajectory predicted by the model. This model provides a built-in prediction mechanism, by predicting the outcome of closed loop policy behavior. We propose a novel algorithm for performing hierarchical RL with this model, combining model-based planning in the learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards for several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.


Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning

arXiv.org Machine Learning

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting personalized items or services to users. The vast majority of traditional recommender systems treat the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system capable of continuously improving its strategies during its interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies by recommending items through trial and error and receiving reinforcement for these items from users' feedback. Users' feedback can be positive or negative, and both types of feedback have great potential to boost recommendations. However, negative feedback is far more abundant than positive feedback; incorporating both simultaneously is therefore challenging, since positive feedback could be buried by negative feedback. In this paper, we develop a novel approach to incorporate both into the proposed deep recommender system (DEERS) framework. Experimental results based on real-world e-commerce data demonstrate the effectiveness of the proposed framework. Further experiments have been conducted to understand the importance of both positive and negative feedback in recommendations.


Top Trends in AI in 2018

#artificialintelligence

According to Gartner's 2017 hype cycle of emerging technologies, Deep Learning and Machine Learning have reached the peak of inflated expectations. Artificial General Intelligence (AGI) and Deep Reinforcement Learning are in the innovation trigger phase. The sentiment over Artificial Intelligence (AI) is euphoric. Every technology firm is jumping on the AI-first bandwagon. Companies like Google, Microsoft, Amazon, and Alibaba are pushing the frontiers.


Simplifying Reward Design through Divide-and-Conquer

arXiv.org Artificial Intelligence

While significant advances have been made in planning and reinforcement learning for robots, these algorithms require access to a reward (or cost) function in order to be successful. Unfortunately, designing a good reward function by hand remains challenging in many tasks. When designing the reward, the goal is to choose a function that guides the robot to accomplish the task in any potential test environment that it might encounter. Typically, the designer considers a representative set of training environments, and finds a reward function that induces desirable behavior across all of them, as in Figure 1 (Top). In practice, this can be both challenging and frustrating for the reward designer. The process often results in many iterations of tuning, whereby changing the reward function corrects the behavior in one environment, but breaks it in another, and so on. We posit that designing a good reward function for a single environment at a time is easier than designing one for all training environments in consideration simultaneously. Imagine the task of motion planning in the home. The reward function provided to the planner must correctly encode the desired tradeoffs: the robot must stay away from static objects, it should give wider berth to fragile objects (as in Figure 1 (Bottom)), and it needs to keep a comfortable distance from the person, prioritizing more sensitive areas, such as the head [9].


Diversity is All You Need: Learning Skills without a Reward Function

arXiv.org Artificial Intelligence

Intelligent creatures can explore their environments and learn useful skills without supervision. In this paper, we propose DIAYN ("Diversity is All You Need"), a method for learning useful skills without a reward function. Our proposed method learns skills by maximizing an information theoretic objective using a maximum entropy policy. On a variety of simulated robotic tasks, we show that this simple objective results in the unsupervised emergence of diverse skills, such as walking and jumping. In a number of reinforcement learning benchmark environments, our method is able to learn a skill that solves the benchmark task despite never receiving the true task reward. We show how pretrained skills can provide a good parameter initialization for downstream tasks, and can be composed hierarchically to solve complex, sparse reward tasks. Our results suggest that unsupervised discovery of skills can serve as an effective pretraining mechanism for overcoming challenges of exploration and data efficiency in reinforcement learning.
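
The abstract does not spell out the information-theoretic objective. As a rough illustration only, the sketch below computes the per-step pseudo-reward commonly associated with DIAYN, log q(z|s) - log p(z), where q is a learned skill discriminator and p is the prior over skills. The function name, the list-of-probabilities interface, and the uniform prior are assumptions made for this example, not details taken from the text above.

```python
import math


def diayn_pseudo_reward(discriminator_probs, skill):
    """Pseudo-reward for one state under one skill (illustrative sketch).

    discriminator_probs: hypothetical output of a skill discriminator,
        a list where entry k estimates q(z = k | s) for the current state.
    skill: index of the skill the policy is currently executing.
    Assumes a uniform prior p(z) over the skills.
    """
    num_skills = len(discriminator_probs)
    log_q = math.log(discriminator_probs[skill])  # log q(z | s)
    log_p = math.log(1.0 / num_skills)            # log p(z), uniform prior
    return log_q - log_p


# A state that the discriminator cannot attribute to any skill earns nothing.
uninformative = diayn_pseudo_reward([0.25, 0.25, 0.25, 0.25], skill=0)

# A state that clearly identifies skill 0 rewards skill 0 and penalizes skill 1.
distinctive = diayn_pseudo_reward([0.7, 0.1, 0.1, 0.1], skill=0)
mismatched = diayn_pseudo_reward([0.7, 0.1, 0.1, 0.1], skill=1)
```

Under this form, a skill is rewarded for visiting states from which it can be told apart from the other skills, which is one way to read "diversity" in the abstract.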


Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation

arXiv.org Machine Learning

Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models to find molecules that optimize desired properties while incorporating highly complex and non-differentiable rules remains a challenging task. Here we propose Graph Convolutional Policy Network (GCPN), a general graph convolutional network-based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. Experimental results show that GCPN can achieve 61% improvement on chemical property optimization over state-of-the-art baselines while resembling known molecules, and achieve 184% improvement on the constrained property optimization task.


A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

arXiv.org Machine Learning

Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. A final section of the paper shows that all of our main results extend to the study of Q-learning applied to high-dimensional optimal stopping problems.
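
For readers unfamiliar with the algorithm being analyzed, here is a minimal sketch of TD(0) with linear function approximation. It is not the paper's code: the two-state chain, the one-hot features, the step size, and all function names are hypothetical choices made so the result can be checked by hand.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td0_linear(phi, step, start, alpha=0.1, gamma=0.5, n_steps=5000):
    """Estimate a value function V(s) ~ theta . phi(s) with TD(0).

    phi:  maps a state to its feature vector (a list of floats).
    step: maps a state to (reward, next_state) under the fixed policy.
    """
    theta = [0.0] * len(phi(start))
    s = start
    for _ in range(n_steps):
        r, s_next = step(s)
        # TD error: bootstrapped target minus the current estimate.
        delta = r + gamma * dot(theta, phi(s_next)) - dot(theta, phi(s))
        # Move the weights along the feature vector of the visited state.
        theta = [t + alpha * delta * f for t, f in zip(theta, phi(s))]
        s = s_next
    return theta


# Hypothetical two-state chain: 0 -> 1 with reward 1, then 1 -> 0 with reward 0.
def phi(s):
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]


def step(s):
    return (1.0, 1) if s == 0 else (0.0, 0)


theta = td0_linear(phi, step, start=0)
```

With gamma = 0.5, solving V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0) by hand gives V(0) = 4/3 and V(1) = 2/3, and the learned weights approach these values. One-hot features make this the tabular special case; the same update applies unchanged to any other feature map.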


Deep Reinforcement Learning for General Video Game AI

arXiv.org Machine Learning

The realization that video games are perfect testbeds for artificial intelligence methods has in recent years spread to the whole AI community, in particular since Chess and Go have been effectively conquered, and there is an almost daily flurry of new papers applying AI methods to video games. In particular, the Arcade Learning Environment (ALE), which builds on an emulator for the Atari 2600 games console and contains several dozen games [1], has been used in numerous published papers since DeepMind's landmark paper showing that Q-learning combined with deep convolutional networks could learn to play many of the ALE games at superhuman level [2]. As an AI benchmark, ALE is limited in the sense that there is only a finite set of games. This is a limitation it has in common with any framework based on existing published games. However, to test the general video game playing ability of an agent, it is necessary to test on games for which the agent was not optimized.