Technology
TAP Gibbs Free Energy, Belief Propagation and Sparsity
Csató, Lehel, Opper, Manfred, Winther, Ole
The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka's expectation propagation. Lastly,we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification anddensity estimation with Gaussian processes and on an independent componentanalysis problem.
Convolution Kernels for Natural Language
Collins, Michael, Duffy, Nigel
We describe the application of kernel methods to Natural Language Processing (NLP)problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations ofthese structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
A Generalization of Principal Components Analysis to the Exponential Family
Collins, Michael, Dasgupta, S., Schapire, Robert E.
Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family,Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, andgive examples on simulated data.
The Infinite Hidden Markov Model
Beal, Matthew J., Ghahramani, Zoubin, Rasmussen, Carl E.
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number ofdistinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite-- consider, for example, symbols being possible words appearing in English text.
Gaussian Process Regression with Mismatched Models
I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning withexcessively strong smoothness assumptions can be particularly dangerous:For example, a student with a standard radial basis function covariance function will learn a rougher teacher function onlylogarithmically slowly. All predictions are confirmed by simulations. 1 Introduction There has in the last few years been a good deal of excitement about the use of Gaussian processes (GPs) as an alternative to feedforward networks [1]. GPs make prior assumptions about the problem to be learned very transparent, and even though they are nonparametric models, inference-at least in the case of regression considered below-is relatively straightforward. One crucial question for applications is then how'fast' GPs learn, i.e. how many training examples are needed to achieve a certain level of generalization performance.
Probabilistic principles in unsupervised learning of visual structure: human data and a model
Edelman, Shimon, Hiles, Benjamin P., Yang, Hwajin, Intrator, Nathan
To find out how the representations of structured visual objects depend on the co-occurrence statistics of their constituents, we exposed subjects to a set of composite images with tight control exerted over (1) the conditional probabilitiesof the constituent fragments, and (2) the value of Barlow's criterion of "suspicious coincidence" (the ratio of joint probability to the product of marginals). We then compared the part verification response timesfor various probe/target combinations before and after the exposure. For composite probes, the speedup was much larger for targets thatcontained pairs of fragments perfectly predictive of each other, compared to those that did not. This effect was modulated by the significance oftheir co-occurrence as estimated by Barlow's criterion. For lone-fragment probes, the speedup in all conditions was generally lower than for composites. These results shed light on the brain's strategies for unsupervised acquisition of structural information in vision.
Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
El-Yaniv, Ran, Souroujon, Oren
We present a powerful meta-clustering technique called Iterative Double Clustering(IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressiveperformance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous whenthe data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly,this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples.
Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
Amari, Shun-ichi, Park, Hyeyoung, Ozeki, Tomoko
Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection theory suchas AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estimator andthe Bayesian predictive distribution in terms of Gaussian random fields, by using simple models. 1 Introduction A neural network is specified by a number of parameters which are synaptic weights and biases. Learning takes place by modifying these parameters from observed input-output examples.