Reinforcement Learning (RL) is a heuristic method for learning locally optimal policies in Markov Decision Processes (MDPs). Its classical formulation (Sutton & Barto 1998) maintains point estimates of the expected values of states or state-action pairs. Bayesian RL (Dearden, Friedman, & Russell 1998) extends this to beliefs over those values. However, the concept of values sits uneasily with the original notion of Bayesian Networks (BNs), which were defined (Pearl 1988) as having explicitly causal semantics. In this paper, we show how Bayesian RL can be cast in an explicitly Bayesian Network formalism by making use of backwards-in-time causality. We show that the heuristic used by RL can be seen as an instance of a more general BN inference heuristic, which cuts causal links in the network and replaces them with noncausal, approximate hashing links for speed. This view brings RL into line with standard Bayesian AI concepts and suggests similar hashing heuristics for other general inference tasks.
Apr-16-2008, 20:37:16 GMT