AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

The Thing That We Tried Didn't Work Very Well : Deictic Representation in Reinforcement Learning

Finney, Sarah, Gardiol, Natalia, Kaelbling, Leslie Pack, Oates, Tim

arXiv.org Artificial IntelligenceDec-12-2012

Most reinforcement learning methods operate on propositional representations of the world state. Such representations are often intractably large and generalize poorly. Using a deictic representation is believed to be a viable alternative: they promise generalization while allowing the use of existing reinforcement-learning methods. Yet, there are few experiments on learning with deictic representations reported in the literature. In this paper we explore the effectiveness of two forms of deictic representation and a na\"{i}ve propositional representation in a simple blocks-world domain. We find, empirically, that the deictic representations actually worsen learning performance. We conclude with a discussion of possible causes of these results and strategies for more effective learning in domains with objects.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

1301.0567

Country: North America > United States > Massachusetts > Middlesex County (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Learning via Human Feedback in Continuous State and Action Spaces

Ngo, Vien Anh (Ravensburg-Weingarten University of Applied Sciences) | Ertel, Wolfgang (Ravensburg-Weingarten University of Applied Sciences)

AAAI ConferencesNov-5-2012

We consider the problem of extending manually trainedagents via evaluative reinforcement (TAMER) in con-tinuous state and action spaces. The early work TAMERframework allows a non-technical human to train anagent through a natural form of human feedback, neg-ative or positive. The advantages of TAMER havebeen shown on applications such as training Tetris andMountain Car with only human feedback, Cart-poleand Mountain Car with human feedback and environ-ment reward (augmenting reinforcement learning withhuman feedback). However, those methods are origi-nally designed for discrete state-action, or continuousstate-discrete action problems. In this paper, we intro-duce an extension of TAMER to allow both continu-ous states and actions. The new scheme, actor-criticTAMER, extends the original TAMER to allow usingany general function approximation of a human trainer’sreinforcement signal. Our extension still allows rein-forcement learning to be easily combined with humanfeedback. The experimental results show that the pro-posed method helps a human trainer successfully trainan agent in two continuous state-action domains: Moun-tain Car, and Cart-pole (balancing).

artificial intelligence, machine learning, reinforcement learning, (18 more...)

AAAI Conferences

2012 AAAI Fall Symposium Series

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Colorado > Denver County > Denver (0.04)
Europe > Germany (0.04)

Genre: Research Report (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Between Instruction and Reward: Human-Prompted Switching

Pilarski, Patrick M. (University of Alberta) | Sutton, Richard S. (University of Alberta)

AAAI ConferencesNov-5-2012

Intelligent systems promise to amplify, augment, and extend innate human abilities. A principal example is that of assistive rehabilitation robots---artificial intelligence and machine learning enable new electromechanical systems that restore biological functions lost through injury or illness. In order for an intelligent machine to assist a human user, it must be possible for a human to communicate their intentions and preferences to their non-human counterpart. While there are a number of techniques that a human can use to direct a machine learning system, most research to date has focused on the contrasting strategies of instruction and reward. The primary contribution of our work is to demonstrate that the middle ground between instruction and reward is a fertile space for research and immediate technological progress. To support this idea, we introduce the setting of human-prompted switching, and illustrate the successful combination of switching with interactive learning using a concrete real-world example: human control of a multi-joint robot arm. We believe techniques that fall between the domains of instruction and reward are complementary to existing approaches, and will open up new lines of rapid progress for interactive human training of machine learning systems.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

AAAI Conferences

2012 AAAI Fall Symposium Series

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
(4 more...)

Industry:

Education > Educational Setting > Online (0.49)
Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Active Imitation Learning via Reduction to I.I.D. Active Learning

Judah, Kshitij (Oregon State University) | Fern, Alan Paul (Oregon State University) | Dietterich, Thomas Glenn (Oregon State University)

AAAI ConferencesNov-5-2012

In standard passive imitation learning, the goal is to learn an expert’s policy by passively observing full execution trajectories of it. Unfortunately, generating such trajectories can require substantial expert effort and be impractical in some cases. In this paper, we consider Active Imitation Learning (AIL) with the goal of reducing this effort by querying the expert about the desired action at individual states, which are selected based on answers to past queries and the learner’s interactions with an environment simulator. Our new approach is based on reducing AIL to i.i.d. active learning, which can leverage progress in the i.i.d. setting. We introduce and analyze reductions for both non-stationary and stationary policies, showing that the label complexity (number of queries) of AIL can be substantially less than passive learning. We also introduce a practical algorithm inspired by the reductions, which is shown to be highly effective in four test domains compared to a number of alternatives.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

AAAI Conferences

2012 AAAI Fall Symposium Series

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Oregon > Benton County > Corvallis (0.04)
North America > United States > Massachusetts (0.04)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

$QD$-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations

Kar, Soummya, Moura, Jose' M. F., Poor, H. Vincent

arXiv.org Machine LearningOct-24-2012

The paper considers a class of multi-agent Markov decision processes (MDPs), in which the network agents respond differently (as manifested by the instantaneous one-stage random costs) to a global controlled state and the control actions of a remote controller. The paper investigates a distributed reinforcement learning setup with no prior information on the global state transition and local agent cost statistics. Specifically, with the agents' objective consisting of minimizing a network-averaged infinite horizon discounted cost, the paper proposes a distributed version of $Q$-learning, $\mathcal{QD}$-learning, in which the network agents collaborate by means of local processing and mutual information exchange over a sparse (possibly stochastic) communication network to achieve the network goal. Under the assumption that each agent is only aware of its local online cost data and the inter-agent communication network is \emph{weakly} connected, the proposed distributed scheme is almost surely (a.s.) shown to yield asymptotically the desired value function and the optimal stationary control policy at each network agent. The analytical techniques developed in the paper to address the mixed time-scale stochastic dynamics of the \emph{consensus + innovations} form, which arise as a result of the proposed interactive distributed scheme, are of independent interest.

machine learning, reinforcement learning, state-action pair, (18 more...)

arXiv.org Machine Learning

doi: 10.1109/TSP.2013.2241057

1205.0047

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment > Sports > Soccer (0.67)
Banking & Finance (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Marginalizing Out Future Passengers in Group Elevator Control

Nikovski, Daniel N., Brand, Matthew

arXiv.org Artificial IntelligenceOct-19-2012

Group elevator scheduling is an NPhard sequential decision-making problem with unbounded state spaces and substantial uncertainty. Decision-theoretic reasoning plays a surprisingly limited role in fielded systems. A new opportunity for probabilistic methods has opened with the recent discovery of a tractable solution for the expected waiting times of all passengers in the building, marginalized over all possible passenger itineraries [Nikovski and Brand, 2003]. Though commercially competitive, this solution does not contemplate future passengers. Yet in up-peak traffic, the effects of future passengers arriving at the lobby and entering elevator cars can dominate all waiting times. We develop a probabilistic model of how these arrivals affect the behavior of elevator cars at the lobby, and demonstrate how this model can be used to very significantly reduce the average waiting time of all passengers.

artificial intelligence, machine learning, reinforcement learning, (21 more...)

arXiv.org Artificial Intelligence

1212.2499

Country: North America > United States > Massachusetts (0.46)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.34)

Industry: Transportation > Passenger (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.46)

Add feedback

Monte Carlo Matrix Inversion Policy Evaluation

Lu, Fletcher, Schuurmans, Dale

arXiv.org Artificial IntelligenceOct-19-2012

In 1950, Forsythe and Leibler (1950) introduced a statistical technique for finding the inverse of a matrix by characterizing the elements of the matrix inverse as expected values of a sequence of random walks. Barto and Duff (1994) subsequently showed relations between this technique and standard dynamic programming and temporal differencing methods. The advantage of the Monte Carlo matrix inversion (MCMI) approach is that it scales better with respect to state-space size than alternative techniques. In this paper, we introduce an algorithm for performing reinforcement learning policy evaluation using MCMI. We demonstrate that MCMI improves on runtime over a maximum likelihood model-based policy evaluation approach and on both runtime and accuracy over the temporal differencing (TD) policy evaluation approach. We further improve on MCMI policy evaluation by adding an importance sampling technique to our algorithm to reduce the variance of our estimator. Lastly, we illustrate techniques for scaling up MCMI to large state spaces in order to perform policy improvement.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

1212.2471

Country:

North America > Canada (0.28)
North America > United States > Massachusetts (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Mean-Field Learning: a Survey

Tembine, Hamidou, Tempone, Raul, Vilanova, Pedro

arXiv.org Machine LearningOct-17-2012

In this paper we study iterative procedures for stationary equilibria in games with large number of players. Most of learning algorithms for games with continuous action spaces are limited to strict contraction best reply maps in which the Banach-Picard iteration converges with geometrical convergence rate. When the best reply map is not a contraction, Ishikawa-based learning is proposed. The algorithm is shown to behave well for Lipschitz continuous and pseudo-contractive maps. However, the convergence rate is still unsatisfactory. Several acceleration techniques are presented. We explain how cognitive users can improve the convergence rate based only on few number of measurements. The methodology provides nice properties in mean field games where the payoff function depends only on own-action and the mean of the mean-field (first moment mean-field games). A learning framework that exploits the structure of such games, called, mean-field learning, is proposed. The proposed mean-field learning framework is suitable not only for games but also for non-convex global optimization problems. Then, we introduce mean-field learning without feedback and examine the convergence to equilibria in beauty contest games, which have interesting applications in financial markets. Finally, we provide a fully distributed mean-field learning and its speedup versions for satisfactory solution in wireless networks. We illustrate the convergence rate improvement with numerical examples.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

1210.4657

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Games (0.46)
Banking & Finance (0.34)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

Add feedback

Dynamic Teaching in Sequential Decision Making Environments

Walsh, Thomas J., Goschin, Sergiu

arXiv.org Machine LearningOct-16-2012

We describe theoretical bounds and a practical algorithm for teaching a model by demonstration in a sequential decision making environment. Unlike previous efforts that have optimized learners that watch a teacher demonstrate a static policy, we focus on the teacher as a decision maker who can dynamically choose different policies to teach different parts of the environment. We develop several teaching frameworks based on previously defined supervised protocols, such as Teaching Dimension, extending them to handle noise and sequences of inputs encountered in an MDP. We provide theoretical bounds on the learnability of several important model classes in this setting and suggest a practical algorithm for dynamic teaching.

learner, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

1210.4918

Country: North America > United States > Kansas (0.28)

Genre: Research Report (0.50)

Industry:

Education (0.69)
Transportation > Passenger (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Sparse Q-learning with Mirror Descent

Mahadevan, Sridhar, Liu, Bo

arXiv.org Machine LearningOct-16-2012

This paper explores a new framework for reinforcement learning based on online convex optimization, in particular mirror descent and related algorithms. Mirror descent can be viewed as an enhanced gradient method, particularly suited to minimization of convex functions in highdimensional spaces. Unlike traditional gradient methods, mirror descent undertakes gradient updates of weights in both the dual space and primal space, which are linked together using a Legendre transform. Mirror descent can be viewed as a proximal algorithm where the distance generating function used is a Bregman divergence. A new class of proximal-gradient based temporal-difference (TD) methods are presented based on different Bregman divergences, which are more powerful than regular TD learning. Examples of Bregman divergences that are studied include p-norm functions, and Mahalanobis distance based on the covariance of sample gradients. A new family of sparse mirror-descent reinforcement learning methods are proposed, which are able to find sparse fixed points of an l1-regularized Bellman equation at significantly less computational cost than previous methods based on second-order matrix methods. An experimental study of mirror-descent reinforcement learning is presented using discrete and continuous Markov decision processes.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

1210.4893

Country:

North America > United States > Massachusetts (0.46)
Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback