 Reinforcement Learning


Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

arXiv.org Machine Learning

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.


Reinforcement Learning w/ Keras OpenAI: DQNs – Towards Data Science

#artificialintelligence

Q-learning (which doesn't stand for anything, by the way) is centered around creating a "virtual table" that accounts for how much reward is assigned to each possible action given the current state of the environment. Let's break that down one step at a time: What do we mean by "virtual table?" Imagine that for each possible configuration of the input space, you have a table that assigns a score for each of the possible actions you can take. If this were magically possible, then it would be extremely easy for you to "beat" the environment: simply choose the action that has the highest score! Two points to note about this score.
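As a rough illustration of the "virtual table" idea, the sketch below stores one score per (state, action) pair and picks the highest-scoring action. The state labels, action names, and scores are made-up placeholders, not values from the article.

```python
# Hypothetical toy table: every state maps to a score for each action.
# All labels and numbers here are illustrative assumptions.
q_table = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}


def greedy_action(table, state):
    """Return the action with the highest score for the given state."""
    scores = table[state]
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for state in q_table:
        print(state, "->", greedy_action(q_table, state))
```

For realistic input spaces the table is far too large to store explicitly, which is why it is only "virtual".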


Transforming from Autonomous to Smart: Reinforcement Learning Basics

#artificialintelligence

In the blog "From Autonomous to Smart: Importance of Artificial Intelligence," we laid out the artificial intelligence (AI) challenges in creating "smart" edge devices. We also talked about how Moore's Law isn't going to bail us out of these challenges: Internet of Things (IoT) data and the complexity of the problems that we are trying to address at the edge (think "smart" cars) are growing much faster than Moore's Law can accommodate. So we are going to use this blog to deep dive into the category of artificial intelligence called reinforcement learning. We are going to see how reinforcement learning might help us to address these challenges; to work smarter at the edge when brute force technology advances will not suffice. With the rapid increases in computing power, it's easy to get seduced into thinking that raw computing power can solve problems like smart edge devices (e.g., cars, trains, airplanes, wind turbines, jet engines, medical devices). Look at the dramatic increase in the number of possible moves between checkers and chess even though the board layout is exactly the same.


[D] What Is In Your Demand Forecasting Toolkit? • r/MachineLearning

#artificialintelligence

Well, I work in this area now, and since this is upvoted a bit I'll give my thoughts. And I'll assume you're constraining the term "demand forecasting" to how it's often used in business contexts... as well as your recent posts on issues getting RNN/LSTM to work on your time-series data. IMO the best tool for most product/service demand prediction tasks is domain knowledge for good feature engineering and for getting your data to be more stationary. Why? Product/service demand forecasting problems often start with only a few explanatory variables, and those variables don't explain the variance well (more precisely, low mutual information) relative to the number of actual factors going into the demand. Contrast this with areas getting more media such as deep reinforcement learning, where states and actions are fully representable/observed (e.g., AlphaGo).


Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

arXiv.org Machine Learning

In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with nonlinear parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent of an error between the true and approximate value functions, which would provide such guarantees. We prove that approximate TD is a gradient descent provided the current policy is reversible. This holds even with nonlinear approximations. A policy with transition probabilities $P(s,s')$ between states is reversible if there exists a function $\mu$ over states such that $\frac{P(s,s')}{P(s',s)}=\frac{\mu(s')}{\mu(s)}$. In particular, every move can be undone with some probability. This condition is restrictive; it is satisfied, for instance, for a navigation problem in any unoriented graph. In this case, approximate TD is exactly a gradient descent of the Dirichlet norm, the norm of the difference of gradients between the true and approximate value functions. The Dirichlet norm also controls the bias of approximate policy gradient. These results hold even with no decay factor ($\gamma=1$) and do not rely on contractivity of the Bellman operator, thus proving stability of TD even with $\gamma=1$ for reversible policies.
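The reversibility condition can be checked numerically on a small example. The sketch below assumes a uniform random walk on a made-up undirected graph and tests the condition in its cross-multiplied form, P(s,s') mu(s) = P(s',s) mu(s'), taking mu(s) to be the degree of s. The graph, the function names, and the choice of mu are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

# Hypothetical undirected graph given as adjacency lists (an assumption).
GRAPH = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}


def uniform_walk(graph):
    """Transition probabilities of a walk that picks a neighbour uniformly."""
    return {
        s: {t: Fraction(1, len(nbrs)) for t in nbrs}
        for s, nbrs in graph.items()
    }


def is_reversible(P, mu):
    """Check P(s,t) * mu(s) == P(t,s) * mu(t) for every allowed move s -> t."""
    for s, row in P.items():
        for t, p_st in row.items():
            p_ts = P.get(t, {}).get(s, 0)
            if p_ts == 0:  # the move s -> t can never be undone
                return False
            if p_st * mu[s] != p_ts * mu[t]:
                return False
    return True


if __name__ == "__main__":
    P = uniform_walk(GRAPH)
    degree = {s: len(nbrs) for s, nbrs in GRAPH.items()}
    print(is_reversible(P, degree))
```

Exact fractions are used instead of floats so that the equality test is not affected by rounding.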


A Simple Intro to Q-Learning in R: Floor Plan Navigation

#artificialintelligence

The question to be answered here is: What's the best way to get from Room 2 to Room 5 (outside)? Notice that by answering this question using reinforcement learning, we will also know how to find optimal routes from any room to outside. And if we run the iterative algorithm again for a new target state, we can find out the optimal route from any room to that new target state. Since Q-Learning is model-free, we don't need to know how likely it is that our agent will move between any room and any other room (the transition probabilities). If you had observed the behavior in this system over time, you might be able to find that information, but in many cases it just isn't available.
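A minimal tabular Q-learning sketch of this kind of room-navigation problem follows, written in Python rather than the R used in the post. The floor plan in `DOORS`, the reward of 100 for stepping outside, and the values of `alpha`, `gamma`, and `episodes` are all assumptions chosen for illustration; they are not taken from the post.

```python
import random

# Hypothetical floor plan (an assumption): room -> rooms reachable through
# one door. Room 5 stands for "outside".
DOORS = {0: [4], 1: [3, 5], 2: [3], 3: [1, 2, 4], 4: [0, 3, 5], 5: [1, 4]}
GOAL = 5


def train(doors, goal, episodes=500, alpha=0.5, gamma=0.8, seed=0):
    """Learn Q[room][next_room] from random walks that end at the goal."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in nbrs} for s, nbrs in doors.items()}
    for _ in range(episodes):
        state = rng.choice(list(doors))
        while state != goal:
            action = rng.choice(doors[state])  # purely random exploration
            reward = 100.0 if action == goal else 0.0
            # The goal is terminal, so nothing is bootstrapped from it.
            best_next = 0.0 if action == goal else max(q[action].values())
            target = reward + gamma * best_next
            q[state][action] += alpha * (target - q[state][action])
            state = action  # moving through the door lands in that room
    return q


def route(q, start, goal, max_steps=20):
    """Follow the highest-scoring door from each room until the goal."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        scores = q[path[-1]]
        path.append(max(scores, key=scores.get))
    return path


if __name__ == "__main__":
    q_values = train(DOORS, GOAL)
    print(route(q_values, start=2, goal=GOAL))
```

No transition probabilities appear anywhere in the code: the table is updated only from sampled moves, which is the sense in which the method is model-free. Changing `goal` and retraining gives routes to a different target room.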


r/MachineLearning - [D] Introduction to Deep Q-learning with SynapticJS & ConvNetJS - build a connect 4 AI

@machinelearnbot

Hi everyone, I am quite new to this, but I would like to get feedback on my first JavaScript AI for Connect 4: I wanted to build a small React app and train an AI via Node to play with! Let me know what you think!


Dialog-based Interactive Image Retrieval

arXiv.org Artificial Intelligence

Existing methods for interactive image retrieval have demonstrated the merit of integrating user feedback, improving retrieval results. However, most current systems rely on restricted forms of user feedback, such as binary relevance responses, or feedback based on a fixed set of relative attributes, which limits their impact. In this paper, we introduce a new approach to interactive image search that enables users to provide feedback via natural language, allowing for more natural and effective interaction. We formulate the task of dialog-based interactive image retrieval as a reinforcement learning problem, and reward the dialog system for improving the rank of the target image during each dialog turn. To avoid the cumbersome and costly process of collecting human-machine conversations as the dialog system learns, we train our system with a user simulator, which is itself trained to describe the differences between target and candidate images. The efficacy of our approach is demonstrated in a footwear retrieval application. Extensive experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.


A Deeper Look at Experience Replay

arXiv.org Artificial Intelligence

Experience replay is now widely used in various deep reinforcement learning (RL) algorithms; in this paper we rethink its utility. It introduces a new hyper-parameter, the memory buffer size, which needs careful tuning. Unfortunately, the importance of this new hyper-parameter has been underestimated in the community for a long time. In this paper we conduct a systematic empirical study of experience replay under various function representations. We show that a large replay buffer can significantly hurt performance. Moreover, we propose a simple O(1) method to remedy the negative influence of a large replay buffer. We demonstrate its utility in both a simple grid world and challenging domains like Atari games.
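For readers unfamiliar with the mechanism, here is a minimal sketch of a plain uniform replay buffer, where `capacity` plays the role of the memory buffer size discussed above. The class and method names are assumptions for illustration, and this is not the O(1) remedy proposed in the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, transition):
        """Add one transition, e.g. (state, action, reward, next_state)."""
        self.memory.append(transition)

    def sample(self, batch_size):
        """Draw up to batch_size distinct stored transitions uniformly."""
        k = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3, seed=0)
    for step in range(5):
        buffer.push(step)  # integers stand in for real transitions
    print(len(buffer), sorted(buffer.sample(3)))
```

With a larger `capacity`, each sampled batch is drawn from a wider window of older transitions, which is the trade-off the abstract refers to.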


Towards Diverse Text Generation with Inverse Reinforcement Learning

arXiv.org Machine Learning

Text generation is a crucial task in NLP. Recently, several adversarial generative models have been proposed to alleviate the exposure bias problem in text generation. Though these models have achieved great success, they still suffer from the problems of reward sparsity and mode collapse. In order to address these two problems, in this paper, we employ inverse reinforcement learning (IRL) for text generation. Specifically, the IRL framework learns a reward function on training data, and then an optimal policy to maximize the expected total reward. Similar to the adversarial models, the reward and policy function in IRL are optimized alternately. Our method has two advantages: (1) the reward function can produce denser reward signals; (2) the generation policy, trained by "entropy regularized" policy gradient, encourages the generation of more diversified texts. Experimental results demonstrate that our proposed method can generate higher quality texts than the previous methods.