Bayesian Learning
Bayesian Unsupervised Learning of Higher Order Structure
Lewicki, Michael S., Sejnowski, Terrence J.
Many real world patterns have a hierarchical underlying structure in which simple features have a higher order structure among themselves. Because these relationships are often statistical in nature, it is natural to view the process of discovering such structures as a statistical inference problem in which a hierarchical model is fit to data. Hierarchical statistical structure can be conveniently represented with Bayesian belief networks (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Neal, 1992). These 530 M.S. Lewicki and T. 1. Sejnowski models are powerful, because they can capture complex statistical relationships among the data variables, and also mathematically convenient, because they allow efficient computation of the joint probability for any given set of model parameters. The joint probability density of a network of binary states is given by a product of conditional probabilities (1) where W is the weight matrix that parameterizes the model. Note that the probability ofan individual state Si depends only on its parents.
Ordered Classes and Incomplete Examples in Classification
The classes in classification tasks often have a natural ordering, and the training and testing examples are often incomplete. We propose a nonlinear ordinalmodel for classification into ordered classes. Predictive, simulation-based approaches are used to learn from past and classify future incompleteexamples. These techniques are illustrated by making prognoses for patients who have suffered severe head injuries.
On a Modification to the Mean Field EM Algorithm in Factorial Learning
Dunmur, A. P., Titterington, D. M.
A modification is described to the use of mean field approximations inthe E step of EM algorithms for analysing data from latent structure models, as described by Ghahramani (1995), among others. Themodification involves second-order Taylor approximations to expectations computed in the E step. The potential benefits of the method are illustrated using very simple latent profile models. 1 Introduction Ghahramani (1995) advocated the use of mean field methods as a means to avoid the heavy computation involved in the E step of the EM algorithm used for estimating parameters within a certain latent structure model, and Ghahramani & Jordan (1995) used the same ideas in a more complex situation. Dunmur & Titterington (1996a) identified Ghahramani's model as a so-called latent profile model, they observed that Zhang (1992,1993) had used mean field methods for a similar purpose, and they showed, in a simulation study based on very simple examples, that the mean field version of the EM algorithm often performed very respectably. By this it is meant that, when data were generated from the model under analysis, the estimators of the underlying parameters were efficient, judging by empirical results, especially in comparison with estimators obtained by employing the'correct' EM algorithm: the examples therefore had to be simple enough that the correct EM algorithm is numerically feasible, although any success reported for the mean field 432 A. P. Dunmur and D. M. Titterington version is, one hopes, an indication that the method will also be adequate in more complex situations in which the correct EM algorithm is not implementable because of computational complexity. In spite of the above positive remarks, there were circumstances in which there was a perceptible, if not dramatic, lack of efficiency in the simple (naive) mean field estimators, and the objective of this contribution is to propose and investigate ways of refining the method so as to improve performance without detracting from the appealing, and frequently essential, simplicity of the approach. The procedure used here is based on a second order correction to the naive mean field well known in statistical physics and sometimes called the cavity or TAP method (Mezard, Parisi & Virasoro, 1987). It has been applied recently in cluster analysis (Hofmann & Buhmann, 1996). In Section 2 we introduce the structure of our model, Section 3 explains the refined mean field approach, Section 4 provides numerical results, and Section 5 contains a statement of our conclusions.
An Apobayesian Relative of Winnow
Littlestone, Nick, Mesterharm, Chris
We study a mistake-driven variant of an online Bayesian learning algorithm(similar to one studied by Cesa-Bianchi, Helmbold, and Panizza [CHP96]). This variant only updates its state (learns) on trials in which it makes a mistake. The algorithm makes binary classifications using a linear-threshold classifier and runs in time linear inthe number of attributes seen by the learner. We have been able to show, theoretically and in simulations, that this algorithm performs well under assumptions quite different from those embodied inthe prior of the original Bayesian algorithm. It can handle situations that we do not know how to handle in linear time with Bayesian algorithms. We expect our techniques to be useful in deriving and analyzing other apobayesian algorithms. 1 Introduction We consider two styles of online learning.
Interpreting Images by Propagating Bayesian Beliefs
A central theme of computational vision research has been the realization thatreliable estimation of local scene properties requires propagating measurements across the image. Many authors have therefore suggested solving vision problems using architectures of locally connected units updating their activity in parallel. Unfortunately, theconvergence of traditional relaxation methods on such architectures has proven to be excruciatingly slow and in general they do not guarantee that the stable point will be a global minimum. In this paper we show that an architecture in which Bayesian Beliefs aboutimage properties are propagated between neighboring units yields convergence times which are several orders of magnitude fasterthan traditional methods and avoids local minima. In particular our architecture is non-iterative in the sense of Marr [5]: at every time step, the local estimates at a given location are optimal giventhe information which has already been propagated to that location. We illustrate the algorithm's performance on real images and compare it to several existing methods. 1 Theory The essence of our approach is shown in figure 1. Figure 1a shows the prototypical ill-posed problem: interpolation of a function from sparse data.
Compositionality, MDL Priors, and Object Recognition
Bienenstock, Elie, Geman, Stuart, Potter, Daniel
Images are ambiguous at each of many levels of a contextual hierarchy. Nevertheless,the high-level interpretation of most scenes is unambiguous, as evidenced by the superior performance of humans. Thisobservation argues for global vision models, such as deformable templates.Unfortunately, such models are computationally intractable for unconstrained problems. We propose a compositional modelin which primitives are recursively composed, subject to syntactic restrictions, to form tree-structured objects and object groupings. Ambiguity is propagated up the hierarchy in the form of multiple interpretations, which are later resolved by a Bayesian, equivalently minimum-description-Iength, cost functional.
Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA
Pearlmutter, Barak A., Parra, Lucas C.
We cast the problem as one of maximum likelihood density estimation, andin that framework introduce an algorithm that searches for independent components using both temporal and spatial cues. We call the resulting algorithm "Contextual ICA," after the (Bell and Sejnowski 1995) Infomax algorithm, which we show to be a special case of cICA. Because cICA can make use of the temporal structure of its input, it is able separate in a number of situations where standard methods cannot, including sources with low kurtosis, coloredGaussian sources, and sources which have Gaussian histograms. 1 The Blind Source Separation Problem Consider a set of n indepent sources
Continuous Sigmoidal Belief Networks Trained using Slice Sampling
These include Boltzmann machines (Hinton and Sejnowski 1986),binary sigmoidal belief networks (Neal 1992) and Helmholtz machines (Hinton et al. 1995; Dayan et al. 1995). However, some hidden variables, such as translation or scaling in images of shapes, are best represented using continuous values.Continuous-valued Boltzmann machines have been developed (Movellan and McClelland 1993), but these suffer from long simulation settling times and the requirement of a "negative phase" during learning. Tibshirani (1992) and Bishop et al. (1996) consider learning mappings from a continuous latent variable space to a higher-dimensional input space. MacKay (1995) has developed "density networks" that can model both continuous and categorical latent spaces using stochasticity at the topmost network layer. In this paper I consider a new hierarchical top-down connectionist model that has stochastic hidden variables at all layers; moreover, these variables can adapt to be continuous or categorical. The proposed top-down model can be viewed as a continuous-valued belief network, whichcan be simulated by performing a quick top-down pass (Pearl 1988).
Regression with Input-Dependent Noise: A Bayesian Treatment
Bishop, Christopher M., Quazaz, Cazhaow S.
In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constantvariance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would allow the noise variance itself to depend on the input variables. However, the use of maximum likelihood to train such models would give highly biased results. In this paper we show how a Bayesian treatment can allow for an input-dependent variance while overcoming thebias of maximum likelihood. 1 Introduction In regression problems it is important not only to predict the output variables but also to have some estimate of the error bars associated with those predictions.