AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Feature Construction for Inverse Reinforcement Learning

Levine, Sergey, Popovic, Zoran, Koltun, Vladlen

Neural Information Processing SystemsFeb-15-2020, 02:12:10 GMT

The goal of inverse reinforcement learning is to find a reward function for a Markov decision process, given example traces from its optimal policy. Current IRL techniques generally rely on user-supplied features that form a concise basis for the reward. We present an algorithm that instead constructs reward features from a large collection of component features, by building logical conjunctions of those component features that are relevant to the example policy. Given example traces, the algorithm returns a reward function as well as the constructed features. The reward function can be used to recover a full, deterministic, stationary policy, and the features can be used to transplant the reward function into any novel environment on which the component features are well defined.

component feature, feature construction, inverse reinforcement learning, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

Konidaris, George, Kuindersma, Scott, Grupen, Roderic, Barto, Andrew G.

Neural Information Processing SystemsFeb-15-2020, 01:56:48 GMT

We introduce CST, an algorithm for constructing skill trees from demonstration trajectories in continuous reinforcement learning domains. CST uses a changepoint detection method to segment each trajectory into a skill chain by detecting a change of appropriate abstraction, or that a segment is too complex to model as a single skill. The skill chains from each trajectory are then merged to form a skill tree. We demonstrate that CST constructs an appropriate skill tree that can be further refined through learning in a challenging continuous domain, and that it can be used to segment demonstration trajectories on a mobile manipulator into chains of skills where each skill is assigned an appropriate abstraction. Papers published at the Neural Information Processing Systems Conference.

constructing skill tree, demonstration trajectory, reinforcement learning agent, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Regularized Policy Iteration

Farahmand, Amir M., Ghavamzadeh, Mohammad, Mannor, Shie, Szepesvári, Csaba

Neural Information Processing SystemsFeb-15-2020, 01:42:30 GMT

In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.

machine learning, regularized policy iteration, reinforcement learning, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

Castro, Dotan D., Volkinshtein, Dmitry, Meir, Ron

Neural Information Processing SystemsFeb-15-2020, 01:42:19 GMT

Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through the cortical and basal ganglia. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming separate time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms.

actor critic learning, convergence and neural implementation, temporal difference, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Jie, Tang, Abbeel, Pieter

Neural Information Processing SystemsFeb-15-2020, 01:29:34 GMT

Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (a) using the past experience to estimate {\em only} the gradient of the expected return $U(\theta)$ at the current policy parameterization $\theta$, rather than to obtain a more complete estimate of $U(\theta)$, and (b) using past experience under the current policy {\em only} rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines---a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.

importance sampling, likelihood ratio policy gradient, policy gradient method, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)

Add feedback

Learning to Explore and Exploit in POMDPs

Cai, Chenghui, Liao, Xuejun, Carin, Lawrence

Neural Information Processing SystemsFeb-15-2020, 01:12:46 GMT

A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem.

artificial intelligence, explore and exploit, upstream oil & gas, (4 more...)

Neural Information Processing Systems

Industry: Energy > Oil & Gas > Upstream (0.73)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

Add feedback

Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Bush, Keith, Pineau, Joelle

Neural Information Processing SystemsFeb-15-2020, 01:12:40 GMT

Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings as the reconstructed observable state-space of an off-line, model-based reinforcement learning approach to control. We demonstrate the embedding of a system changes as a result of learning and that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach in simulation to learn a neurostimulation policy that is more efficient in treating epilepsy than conventional policies.

manifold embedding, model-based reinforcement learning, partial observability, (1 more...)

Neural Information Processing Systems

Industry:

Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.30)
Health & Medicine > Therapeutic Area > Genetic Disease (0.30)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Near-optimal Regret Bounds for Reinforcement Learning

Auer, Peter, Jaksch, Thomas, Ortner, Ronald

Neural Information Processing SystemsFeb-15-2020, 01:11:16 GMT

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s1,s2 there is a policy which moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret O(DSAT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(DSAT) on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.

algorithm, near-optimal regret bound, reinforcement learning, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Structure Learning in Human Sequential Decision-Making

Acuna, Daniel, Schrater, Paul R.

Neural Information Processing SystemsFeb-15-2020, 00:57:43 GMT

We use graphical models and structure learning to explore how people learn policies in sequential decision making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one and two armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies.

human sequential decision-making, structure learning

Neural Information Processing Systems

Industry: Education > Focused Education > Special Education (0.57)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)

Add feedback

LSTD with Random Projections

Ghavamzadeh, Mohammad, Lazaric, Alessandro, Maillard, Odalric, Munos, Rémi

Neural Information Processing SystemsFeb-15-2020, 00:57:36 GMT

We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. Papers published at the Neural Information Processing Systems Conference.

algorithm, lstd, random projection, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)

Add feedback