Department of Brain and Cognitive Sciences Massachusetts Institute of Technology Cambridge, MA 02139 Abstract We present a new algorithm for associative reinforcement learning. Thealgorithm is based upon the idea of matching a network's output probability with a probability distribution derived from the environment's reward signal. This Probability Matching algorithm is shown to perform faster and be less susceptible to local minima than previously existing algorithms. We use Probability Matching totrain mixture of experts networks, an architecture for which other reinforcement learning rules fail to converge reliably on even simple problems. This architecture is particularly well suited for our algorithm as it can compute arbitrarily complex functions yet calculation of the output probability is simple. 1 INTRODUCTION The problem of learning associative networks from scalar reinforcement signals is notoriously difficult.
The original goal of this post was to explore the relationship between the softmax and sigmoid functions. In truth, this relationship had always seemed just out of reach: "One has an exponent in the numerator! One has a 1 in the denominator!" And of course, the two have different names. Once derived, I quickly realized how this relationship backed out into a more general modeling framework motivated by the conditional probability axiom itself.
John S. Denker and Yann leCun AT&T Bell Laboratories Holmdel, NJ 07733 Abstract (1) The outputs of a typical multi-output classification network do not satisfy the axioms of probability; probabilities should be positive and sum to one. This problem can be solved by treating the trained network as a preprocessor that produces a feature vector that can be further processed, for instance by classical statistical estimation techniques. It is particularly useful to combine these two ideas: we implement the ideas of section 1 using Parzen windows, where the shape and relative size of each window is computed using the ideas of section 2. This allows us to make contact between important theoretical ideas (e.g. the ensemble formalism) and practical techniques (e.g. Our results also shed new light on and generalize the well-known "softmax" scheme. 1 Distribution of Categories in Output Space In many neural-net applications, it is crucial to produce a set of C numbers that serve as estimates of the probability of C mutually exclusive outcomes. For example, inspeech recognition, these numbers represent the probability of C different phonemes; the probabilities of successive segments can be combined using a Hidden Markov Model.
Previously referred to as `miraculous' in the scientific literature because of its powerful properties and its wide application as optimal solution to the problem of induction/inference, (approximations to) Algorithmic Probability (AP) and the associated Universal Distribution are (or should be) of the greatest importance in science. Here we investigate the emergence, the rates of emergence and convergence, and the Coding-theorem like behaviour of AP in Turing-subuniversal models of computation. We investigate empirical distributions of computing models in the Chomsky hierarchy. We introduce measures of algorithmic probability and algorithmic complexity based upon resource-bounded computation, in contrast to previously thoroughly investigated distributions produced from the output distribution of Turing machines. This approach allows for numerical approximations to algorithmic (Kolmogorov-Chaitin) complexity-based estimations at each of the levels of a computational hierarchy. We demonstrate that all these estimations are correlated in rank and that they converge both in rank and values as a function of computational power, despite fundamental differences between computational models. In the context of natural processes that operate below the Turing universal level because of finite resources and physical degradation, the investigation of natural biases stemming from algorithmic rules may shed light on the distribution of outcomes. We show that up to 60\% of the simplicity/complexity bias in distributions produced even by the weakest of the computational models can be accounted for by Algorithmic Probability in its approximation to the Universal Distribution.
Recently, a model for supervised learning of probabilistic transducers representedby suffix trees was introduced. However, this algorithm tendsto build very large trees, requiring very large amounts of computer memory. In this paper, we propose anew, more compact, transducermodel in which one shares the parameters of distributions associatedto contexts yielding similar conditional output distributions. We illustrate the advantages of the proposed algorithm withcomparative experiments on inducing a noun phrase recogmzer.