Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Sallans, Brian, Hinton, Geoffrey E.
The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned online using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a cooperative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
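A minimal sketch of the approach the abstract describes, assuming the product of experts is a restricted Boltzmann machine over binary state and action units. The class name, hyperparameters, and the omission of visible biases are illustrative simplifications, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FreeEnergyQ:
    """Q(s, a) approximated as the negative free energy of an RBM whose
    visible layer is the concatenated (state, action) vector."""

    def __init__(self, n_state, n_action, n_hidden, lr=0.1, gamma=0.95):
        self.Ws = 0.01 * rng.standard_normal((n_state, n_hidden))   # state-hidden weights
        self.Wa = 0.01 * rng.standard_normal((n_action, n_hidden))  # action-hidden weights
        self.c = np.zeros(n_hidden)                                 # hidden biases
        self.lr, self.gamma = lr, gamma

    def hidden_input(self, s, a):
        return self.c + s @ self.Ws + a @ self.Wa

    def q(self, s, a):
        # -F(s, a) = sum_j log(1 + exp(x_j)) for binary hidden units
        # (visible biases dropped for brevity).
        return np.sum(np.logaddexp(0.0, self.hidden_input(s, a)))

    def sarsa_update(self, s, a, r, s_next, a_next, terminal=False):
        # TD error between consecutive Q-value estimates, as in SARSA.
        target = r if terminal else r + self.gamma * self.q(s_next, a_next)
        delta = target - self.q(s, a)
        p = sigmoid(self.hidden_input(s, a))  # expected hidden activations
        # Gradient of -F w.r.t. each weight is (visible unit) x (expected hidden unit).
        self.Ws += self.lr * delta * np.outer(s, p)
        self.Wa += self.lr * delta * np.outer(a, p)
        self.c += self.lr * delta * p

    def sample_action(self, s, n_sweeps=20):
        # Clamp the state units and Gibbs-sample the action and hidden units;
        # low-free-energy (high-Q) actions are visited more often.
        a = (rng.random(self.Wa.shape[0]) < 0.5).astype(float)
        for _ in range(n_sweeps):
            h = (rng.random(self.c.shape[0])
                 < sigmoid(self.hidden_input(s, a))).astype(float)
            a = (rng.random(self.Wa.shape[0]) < sigmoid(h @ self.Wa.T)).astype(float)
        return a
```

In the multi-agent setting of the paper, the action vector would factor into one group of units per agent; the fully binary action layer above is the simplest stand-in for that structure.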
Learning Factored Representations for Partially Observable Markov Decision Processes
Sallans, Brian
The problem of reinforcement learning in a non-Markov environment is explored using a dynamic Bayesian network, where conditional independence assumptions between random variables are compactly represented by network parameters. The parameters are learned online, and approximations are used to perform inference and to compute the optimal value function. The relative effects of inference and value function approximations on the quality of the final policy are investigated by learning to solve a moderately difficult driving task. The two value function approximations, linear and quadratic, were found to perform similarly, but the quadratic model was more sensitive to initialization. Both performed below the level of human performance on the task. The dynamic Bayesian network performed comparably to a model using a localist hidden state representation, while requiring exponentially fewer parameters.
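A minimal sketch contrasting the two value-function approximations the abstract compares, assuming the agent's state is summarized by a factored belief vector b of per-variable marginals from the dynamic Bayesian network. The class names, the per-action parameterization, and the TD-style update are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class LinearQ:
    """Linear value approximation: Q(b, a) = w_a . b,
    one weight vector per action over the belief features."""

    def __init__(self, n_factors, n_actions, lr=0.05, gamma=0.9):
        self.w = np.zeros((n_actions, n_factors))
        self.lr, self.gamma = lr, gamma

    def q(self, b, a):
        return self.w[a] @ b

    def update(self, b, a, r, b_next, a_next):
        # Online TD update; the gradient of a linear Q is just the belief vector.
        delta = r + self.gamma * self.q(b_next, a_next) - self.q(b, a)
        self.w[a] += self.lr * delta * b

class QuadraticQ:
    """Quadratic value approximation: Q(b, a) = b . M_a . b + w_a . b,
    adding pairwise interactions between belief features."""

    def __init__(self, n_factors, n_actions, lr=0.05, gamma=0.9):
        self.M = np.zeros((n_actions, n_factors, n_factors))
        self.w = np.zeros((n_actions, n_factors))
        self.lr, self.gamma = lr, gamma

    def q(self, b, a):
        return b @ self.M[a] @ b + self.w[a] @ b

    def update(self, b, a, r, b_next, a_next):
        delta = r + self.gamma * self.q(b_next, a_next) - self.q(b, a)
        self.M[a] += self.lr * delta * np.outer(b, b)  # quadratic-term gradient
        self.w[a] += self.lr * delta * b
```

The extra M_a matrices give the quadratic model many more free parameters, which is consistent with the abstract's observation that it was more sensitive to initialization despite performing similarly overall.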