Tesauro, Gerald
Bayesian Inference in Monte-Carlo Tree Search
Tesauro, Gerald, Rajan, V T, Segal, Richard
Monte-Carlo Tree Search (MCTS) methods are drawing great interest after yielding breakthrough results in computer Go. This paper proposes a Bayesian approach to MCTS that is inspired by distribution-free approaches such as UCT [13], yet significantly differs in important respects. The Bayesian framework allows potentially much more accurate (Bayes-optimal) estimation of node values and node uncertainties from a limited number of simulation trials. We further propose propagating inference in the tree via fast analytic Gaussian approximation methods: this can make the overhead of Bayesian inference manageable in domains such as Go, while preserving high accuracy of expected-value estimates. We find substantial empirical outperformance of UCT in an idealized bandit-tree test environment, where we can obtain valuable insights by comparing with known ground truth. Additionally we rigorously prove on-policy and off-policy convergence of the proposed methods.
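To make the Gaussian-propagation idea concrete, here is a minimal sketch (not the authors' exact algorithm): each leaf's win probability gets a Gaussian approximation to its Beta posterior from simulation counts, and a max node's posterior is formed by moment-matching the maximum of its children's Gaussians using Clark-style formulas. All function names are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_of_gaussians(m1, v1, m2, v2):
    """Moment-matched Gaussian approximation to max(X1, X2) for
    independent X1 ~ N(m1, v1), X2 ~ N(m2, v2) (Clark's formulas)."""
    theta = max(math.sqrt(v1 + v2), 1e-12)
    a = (m1 - m2) / theta
    mean = m1 * Phi(a) + m2 * Phi(-a) + theta * phi(a)
    second = (m1 * m1 + v1) * Phi(a) + (m2 * m2 + v2) * Phi(-a) + (m1 + m2) * theta * phi(a)
    return mean, max(second - mean * mean, 1e-12)

def leaf_posterior(wins, losses):
    """Gaussian approximation to the Beta(1+wins, 1+losses) posterior
    over a leaf's win probability, from its simulation counts."""
    a, b = 1.0 + wins, 1.0 + losses
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

def backup_max(children):
    """Approximate posterior of a max node by folding the Gaussian
    max approximation over its children's (mean, var) posteriors."""
    m, v = children[0]
    for m2, v2 in children[1:]:
        m, v = max_of_gaussians(m, v, m2, v2)
    return m, v
```

A selection rule can then rank children by posterior mean plus an uncertainty term, analogous to (but distinct from) the UCT exploration bonus.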
Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
Tesauro, Gerald, Das, Rajarshi, Chan, Hoi, Kephart, Jeffrey, Levine, David, Rawson, Freeman, Lefurgy, Charles
Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers' firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies and obvious "cookbook" RL implementations.
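The sketch below illustrates the kind of multi-criteria reward and tabular frequency controller the abstract describes; the reward shape, frequency levels, and constants are assumptions for illustration, not values from the paper or its testbed.

```python
import random
from collections import defaultdict

def reward(resp_time_ms, power_watts, rt_target_ms=200.0, power_weight=0.01):
    """Illustrative multi-criteria reward: penalize response times above an
    SLA target and charge a linear cost for measured CPU power.
    Constants are placeholders, not the paper's actual signal."""
    perf_term = -max(0.0, resp_time_ms - rt_target_ms) / rt_target_ms
    return perf_term - power_weight * power_watts

FREQ_LEVELS = [1.6, 2.0, 2.4, 2.8]   # hypothetical CPU frequency settings (GHz)
Q = defaultdict(float)               # Q[(state, freq)] value table

def choose_freq(state, eps=0.1):
    """Epsilon-greedy choice of the next CPU frequency setting."""
    if random.random() < eps:
        return random.choice(FREQ_LEVELS)
    return max(FREQ_LEVELS, key=lambda f: Q[(state, f)])

def update(state, freq, r, next_state, alpha=0.1, gamma=0.95):
    """One tabular Q-learning backup for the frequency controller."""
    best_next = max(Q[(next_state, f)] for f in FREQ_LEVELS)
    Q[(state, freq)] += alpha * (r + gamma * best_next - Q[(state, freq)])
```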
Extending Q-Learning to General Adaptive Multi-Agent Systems
Tesauro, Gerald
Recent multi-agent extensions of Q-Learning require knowledge of other agents' payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed "Hyper-Q" Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents' strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented.
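A minimal sketch of the Hyper-Q idea for Rock-Paper-Scissors, assuming a grid discretization of the strategy simplex and a Dirichlet-style point estimate of the opponent's mixed strategy; the update rule and estimator below are a simplified reading of the abstract, not the paper's exact formulation.

```python
import itertools
import numpy as np

ACTIONS = 3                      # Rock, Paper, Scissors
GRID = 10                        # discretization of the probability simplex

def simplex_grid(k=ACTIONS, n=GRID):
    """All mixed strategies whose probabilities are multiples of 1/n."""
    return [np.array(c, dtype=float) / n
            for c in itertools.product(range(n + 1), repeat=k) if sum(c) == n]

STRATS = simplex_grid()
Q = np.zeros((len(STRATS), len(STRATS)))   # Q[opponent-estimate, own mixed strategy]

def nearest(p):
    """Index of the grid strategy closest to mixed strategy p."""
    return min(range(len(STRATS)), key=lambda i: float(np.sum((STRATS[i] - p) ** 2)))

def estimate_opponent(counts):
    """Simple Bayesian point estimate of the opponent's mixed strategy:
    mean of a Dirichlet(1,1,1) posterior over observed action counts."""
    return (counts + 1.0) / (counts.sum() + ACTIONS)

def choose_strategy(y_idx):
    """Greedy Hyper-Q policy: the grid strategy valued highest against the
    current opponent estimate (exploration would be added in practice)."""
    return int(Q[y_idx].argmax())

def hyper_q_step(y_idx, x_idx, r, y_next_idx, alpha=0.1, gamma=0.9):
    """One Hyper-Q-style backup over the discretized strategy space."""
    target = r + gamma * Q[y_next_idx].max()
    Q[y_idx, x_idx] += alpha * (target - Q[y_idx, x_idx])
```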
On-line Policy Improvement using Monte-Carlo Search
Tesauro, Gerald, Galperin, Gregory R.
Policy iteration is known to have rapid and robust convergence properties, and for Markov tasks with lookup-table state-space representations, it is guaranteed to converge to the optimal policy. In typical uses of policy iteration, the policy improvement step is an extensive off-line procedure. For example, in dynamic programming, one performs a sweep through all states in the state space. Reinforcement learning provides another approach to policy improvement; recently, several authors have investigated using RL in conjunction with nonlinear function approximators to represent the value functions and/or policies (Tesauro, 1992; Crites and Barto, 1996; Zhang and Dietterich, 1996). These studies are based on following actual state-space trajectories rather than sweeps through the full state space, but are still too slow to compute improved policies in real time.
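The on-line improvement step can be sketched as follows: for each candidate action in the current state, run Monte-Carlo rollouts that follow the base policy and pick the action with the best average return. The `env.sample_step` interface and function names here are assumptions for illustration.

```python
def rollout_value(env, state, action, base_policy, n_sims=100, horizon=200):
    """Average return of taking `action` in `state` and then following the
    base policy, estimated by Monte-Carlo simulation. `env` is assumed to
    expose sample_step(state, action) -> (next_state, reward, done)."""
    total = 0.0
    for _ in range(n_sims):
        s, a, ret = state, action, 0.0
        for _ in range(horizon):
            s, r, done = env.sample_step(s, a)
            ret += r
            if done:
                break
            a = base_policy(s)
        total += ret
    return total / n_sims

def improved_action(env, state, actions, base_policy, n_sims=100):
    """On-line policy improvement via Monte-Carlo search (sketched):
    choose the action whose rollouts under the base policy score best."""
    return max(actions,
               key=lambda a: rollout_value(env, state, a, base_policy, n_sims))
```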
Practical Issues in Temporal Difference Learning
Tesauro, Gerald
This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex nontrivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. The hidden units in these networks have apparently discovered useful features, a longstanding goal of computer games research.
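For reference, the TD(λ) update with eligibility traces looks roughly as follows; a linear value function stands in here for the backgammon network, so this is a schematic of the learning rule rather than the original training code.

```python
import numpy as np

def td_lambda_episode(features, rewards, w, alpha=0.01, gamma=1.0, lam=0.7):
    """One episode of TD(lambda) with a linear value function V(s) = w . phi(s).
    features[t] is phi(s_t) for the visited states; rewards[t] is the reward
    received on the transition out of s_t; the value after the final listed
    state is taken to be zero (e.g., the game outcome arrives as a reward)."""
    z = np.zeros_like(w)                     # eligibility trace
    for t in range(len(rewards)):
        v_t = w @ features[t]
        v_next = w @ features[t + 1] if t + 1 < len(features) else 0.0
        delta = rewards[t] + gamma * v_next - v_t
        z = gamma * lam * z + features[t]    # accumulate the trace
        w += alpha * delta * z               # TD(lambda) weight update
    return w
```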
Can neural networks do better than the Vapnik-Chervonenkis bounds?
Cohn, David, Tesauro, Gerald
These experiments are designed to test whether average generalization performance can surpass the worst-case bounds obtained from formal learning theory using the Vapnik-Chervonenkis dimension (Blumer et al., 1989). We indeed find that, in some cases, the average generalization is significantly better than the VC bound: the approach to perfect performance is exponential in the number of examples m, rather than the 1/m result of the bound. In other cases, we do find the 1/m behavior of the VC bound, and in these cases, the numerical prefactor is closely related to the prefactor contained in the bound.
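Schematically, the functional forms being contrasted are the worst-case VC-style decay of the error bound with sample size m (for VC dimension d) versus the two empirical behaviors reported; the constants below are placeholders, not quantities from the paper.

```latex
% Worst-case VC-style bound versus the two observed average-case behaviors
\varepsilon_{\mathrm{VC}}(m) = O\!\Big(\tfrac{d}{m}\,\log\tfrac{m}{d}\Big),
\qquad
\varepsilon_{\mathrm{avg}}(m) \propto e^{-m/m_0} \ \text{(some cases)},
\qquad
\varepsilon_{\mathrm{avg}}(m) \approx \tfrac{c}{m} \ \text{(others)}.
```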