Goto

Collaborating Authors

 Country


Adaptive Sparseness Using Jeffreys Prior

Neural Information Processing Systems

In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters thatneed to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments withseveral publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.


A kernel method for multi-labelled classification

Neural Information Processing Systems

This article presents a Support Vector Machine (SVM) like learning system tohandle multi-label problems. Such problems are usually decomposed intomany two-class problems but the expressive power of such a system can be weak [5, 7]. We explore a new direct approach. It is based on a large margin ranking system that shares a lot of common properties withSVMs. We tested it on a Yeast gene functional classification problem with positive results.


Learning from Infinite Data in Finite Time

Neural Information Processing Systems

We propose the following general method for scaling learning algorithms to arbitrarily large data sets. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. On the other hand, they require large computational resources to learn from. While in the past the factor limiting the quality of learnable models was typically the quantity of data available, in many domains today data is superabundant, and the bottleneck is t he time required to process it.


Adaptive Nearest Neighbor Classification Using Support Vector Machines

Neural Information Processing Systems

The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples dueto the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features.


TAP Gibbs Free Energy, Belief Propagation and Sparsity

Neural Information Processing Systems

The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka's expectation propagation. Lastly,we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification anddensity estimation with Gaussian processes and on an independent componentanalysis problem.


Convolution Kernels for Natural Language

Neural Information Processing Systems

We describe the application of kernel methods to Natural Language Processing (NLP)problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations ofthese structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.


The Infinite Hidden Markov Model

Neural Information Processing Systems

We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number ofdistinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite-- consider, for example, symbols being possible words appearing in English text.



Gaussian Process Regression with Mismatched Models

Neural Information Processing Systems

I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning withexcessively strong smoothness assumptions can be particularly dangerous:For example, a student with a standard radial basis function covariance function will learn a rougher teacher function onlylogarithmically slowly. All predictions are confirmed by simulations. 1 Introduction There has in the last few years been a good deal of excitement about the use of Gaussian processes (GPs) as an alternative to feedforward networks [1]. GPs make prior assumptions about the problem to be learned very transparent, and even though they are nonparametric models, inference-at least in the case of regression considered below-is relatively straightforward. One crucial question for applications is then how'fast' GPs learn, i.e. how many training examples are needed to achieve a certain level of generalization performance.


Probabilistic principles in unsupervised learning of visual structure: human data and a model

Neural Information Processing Systems

To find out how the representations of structured visual objects depend on the co-occurrence statistics of their constituents, we exposed subjects to a set of composite images with tight control exerted over (1) the conditional probabilitiesof the constituent fragments, and (2) the value of Barlow's criterion of "suspicious coincidence" (the ratio of joint probability to the product of marginals). We then compared the part verification response timesfor various probe/target combinations before and after the exposure. For composite probes, the speedup was much larger for targets thatcontained pairs of fragments perfectly predictive of each other, compared to those that did not. This effect was modulated by the significance oftheir co-occurrence as estimated by Barlow's criterion. For lone-fragment probes, the speedup in all conditions was generally lower than for composites. These results shed light on the brain's strategies for unsupervised acquisition of structural information in vision.