Uncertainty
Learning the Structure of Similarity
The additive clustering (ADCLUS) model (Shepard & Arabie, 1979) treats the similarity of two stimuli as a weighted additive measure of their common features. Inspired by recent work in unsupervised learning with multiple cause models, we propose a new, statistically well-motivated algorithm for discovering the structure of natural stimulus classes using the ADCLUS model, which promises substantial gains in conceptual simplicity, practical efficiency, and solution quality over earlier efforts.
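As a rough illustration of the similarity measure described in the abstract above, the following sketch computes the weighted sum of shared binary features for every pair of stimuli. The function name `adclus_similarity`, the feature matrix `F`, and the weights `w` are hypothetical and chosen only for this example; the sketch shows the forward model, not the learning algorithm the paper proposes.

```python
import numpy as np

def adclus_similarity(F, w):
    """Weighted count of common features for every pair of stimuli.

    F : (n_stimuli, n_features) binary matrix, F[i, k] = 1 if stimulus i has feature k
    w : (n_features,) non-negative feature weights
    Returns S with S[i, j] = sum_k w[k] * F[i, k] * F[j, k].
    """
    F = np.asarray(F, dtype=float)
    w = np.asarray(w, dtype=float)
    return (F * w) @ F.T

# Illustrative (made-up) data: three stimuli, three features.
F = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
w = np.array([0.5, 0.3, 0.2])
S = adclus_similarity(F, w)
print(S)
```

Each off-diagonal entry of `S` is the total weight of the features that the two stimuli have in common, so the matrix is symmetric by construction.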
Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks
Jaakkola, Tommi, Saul, Lawrence K., Jordan, Michael I.
Often the parameters used in these networks need to be learned from examples. Unfortunately, estimating the parameters via exact probabilistic calculations (i.e., the EM algorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them exactly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. The complementary networks can be used for continuous density estimation as well. 1 Introduction The appeal of probabilistic networks for knowledge representation, inference, and learning (Pearl, 1988) derives both from the sound Bayesian framework and from the explicit representation of dependencies among the network variables, which allows ready incorporation of prior information into the design of the network.
Gaussian Processes for Regression
Williams, Christopher K. I., Rasmussen, Carl Edward
The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters, have been tested on a number of challenging problems and have produced excellent results. 1 INTRODUCTION In the Bayesian approach to neural networks a prior distribution over the weights induces a prior distribution over functions. This prior is combined with a noise model, which specifies the probability of observing the targets t given function values y, to yield a posterior over functions which can then be used for predictions. For neural networks the prior over functions has a complex form which means that implementations must either make approximations (e.g.
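The claim that prediction under a Gaussian process prior reduces to matrix operations for fixed hyperparameters can be sketched as follows. This is a minimal illustration under stated assumptions: the squared-exponential covariance, the Gaussian noise model, and the names `sq_exp_kernel`, `gp_predict`, `length_scale`, `signal_var`, and `noise_var` are choices made for this example and are not taken from the paper, whose covariance function is not given in the excerpt.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between rows of A and rows of B (assumed kernel)."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return signal_var * np.exp(-0.5 * np.maximum(d2, 0.0) / length_scale**2)

def gp_predict(X, t, X_star, noise_var=0.1, length_scale=1.0, signal_var=1.0):
    """Predictive mean and covariance of the latent function at X_star.

    Hyperparameters are held fixed, so everything is linear algebra.
    """
    K = sq_exp_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(len(X))
    K_s = sq_exp_kernel(X, X_star, length_scale, signal_var)
    K_ss = sq_exp_kernel(X_star, X_star, length_scale, signal_var)
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    mean = K_s.T @ alpha                             # K_s^T K^{-1} t
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                             # K_ss - K_s^T K^{-1} K_s
    return mean, cov

# Illustrative (made-up) one-dimensional data.
X = np.arange(5.0)[:, None]
t = np.sin(X).ravel()
X_star = np.array([[0.5], [2.5]])
mean, cov = gp_predict(X, t, X_star, noise_var=0.01)
print(mean, np.diag(cov))
```

The Cholesky factorization is used instead of an explicit inverse for numerical stability. Choosing or averaging over the hyperparameters, which the abstract addresses by optimization and Hybrid Monte Carlo, is outside the scope of this sketch.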
A Practical Monte Carlo Implementation of Bayesian Learning
A practical method for Bayesian training of feed-forward neural networks using sophisticated Monte Carlo methods is presented and evaluated. In reasonably small amounts of computer time this approach outperforms other state-of-the-art methods on 5 data-limited tasks from real-world domains. 1 INTRODUCTION Bayesian learning uses a prior on model parameters, combines this with information from a training set, and then integrates over the resulting posterior to make predictions. With this approach, we can use large networks without fear of overfitting, allowing us to capture more structure in the data, thus improving prediction accuracy and eliminating the tedious search (often performed using cross validation) for the model complexity that optimises the bias/variance tradeoff. In this approach the size of the model is limited only by computational considerations. The application of Bayesian learning to neural networks has been pioneered by MacKay (1992), who uses a Gaussian approximation to the posterior weight distribution.
Constructive Algorithms for Hierarchical Mixtures of Experts
Waterhouse, Steve R., Robinson, Anthony J.
By applying a likelihood splitting criterion to each expert in the HME we "grow" the tree adaptively during training. Secondly, by considering only the most probable path through the tree we may "prune" branches away, either temporarily, or permanently if they become redundant. We demonstrate results for the growing and path pruning algorithms which show significant speed-ups and more efficient use of parameters over the standard fixed structure in discriminating between two interlocking spirals and classifying 8-bit parity patterns. INTRODUCTION The HME (Jordan & Jacobs 1994) is a tree-structured network whose terminal nodes are simple function approximators in the case of regression or classifiers in the case of classification. The outputs of the terminal nodes or experts are recursively combined upwards towards the root node, to form the overall output of the network, by "gates" which are situated at the non-terminal nodes.
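The recursive combination of expert outputs by gates, and the idea of following only the most probable path, can be sketched as below. This is an illustrative forward pass only, assuming linear experts, softmax gates, and a nested-dictionary tree; the names `hme_output`, `hme_best_path_output`, and the `"gate"`, `"children"`, and `"expert"` keys are hypothetical. The growing and pruning criteria of the paper are not reproduced here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def hme_output(node, x):
    """Combine expert outputs upward through the gates, weighting by gate probabilities."""
    if "expert" in node:                      # terminal node: linear expert (assumed form)
        return node["expert"] @ x
    g = softmax(node["gate"] @ x)             # one probability per child
    return sum(g_i * hme_output(child, x)
               for g_i, child in zip(g, node["children"]))

def hme_best_path_output(node, x):
    """Follow only the most probable branch at each gate, ignoring all other branches."""
    while "expert" not in node:
        g = softmax(node["gate"] @ x)
        node = node["children"][int(np.argmax(g))]
    return node["expert"] @ x

# Illustrative (made-up) one-level tree with two scalar-output experts.
tree = {
    "gate": np.array([[1.0, 0.0],
                      [0.0, 1.0]]),
    "children": [{"expert": np.array([1.0, 0.0])},
                 {"expert": np.array([0.0, 1.0])}],
}
x = np.array([2.0, 4.0])
print(hme_output(tree, x), hme_best_path_output(tree, x))
```

Evaluating a single path touches one expert per input instead of all of them, which is one plausible source of the speed-up the abstract reports for path pruning.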
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging
We compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posteriori parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's "bagging", which recently has been found to produce impressive results for classification networks.
Discovering Structure in Continuous Variables Using Bayesian Networks
Hofmann, Reimar, Tresp, Volker
We study Bayesian networks for continuous variables using nonlinear conditional density estimators. We demonstrate that useful structures can be extracted from a data set in a self-organized way and we present sampling techniques for belief update based on Markov blanket conditional density models. 1 Introduction One of the strongest types of information that can be learned about an unknown process is the discovery of dependencies and, even more important, of independencies. A superior example is medical epidemiology where the goal is to find the causes of a disease and exclude factors which are irrelevant.
A Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine
A Bayesian-Kullback learning scheme, called Ying-Yang Machine, is proposed based on the two complementary but equivalent Bayesian representations for joint density and their Kullback divergence. The scheme not only unifies existing major supervised and unsupervised learning methods, including the classical maximum likelihood or least square learning, the maximum information preservation, the EM & em algorithm and information geometry, the recent popular Helmholtz machine, as well as other learning methods with new variants and new results, but also provides a number of new learning models. 1 INTRODUCTION Many different learning models have been developed in the literature. We may be entering an age of searching for a unified scheme for them. With a unified scheme, we may understand the existing models and their relationships more deeply, which may enable cross-fertilization among them to obtain new results and variants; we may also be guided to develop new learning models, once we better understand which cases have already been studied and which have been missed and deserve further exploration. Recently, a Bayesian-Kullback scheme, called the YING-YANG Machine, has been proposed as such an effort (Xu, 1995a). It is based on the Kullback divergence and two complementary but equivalent Bayesian representations for the joint distribution of the input space and the representation space, instead of merely using Kullback divergence for matching unstructured joint densities in information geometry type learning (Amari, 1995a&b; Byrne, 1992; Csiszar, 1975).
Adaptive Mixture of Probabilistic Transducers
We introduce and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each model in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best model from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer.