Goto

Collaborating Authors

 Statistical Learning


Conditional Models on the Ranking Poset

Neural Information Processing Systems

A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.


An Impossibility Theorem for Clustering

Neural Information Processing Systems

Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) tradeoffs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.


On the Complexity of Learning the Kernel Matrix

Neural Information Processing Systems

We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.


The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities

Neural Information Processing Systems

A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information.


Information Diffusion Kernels

Neural Information Processing Systems

A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with nonparametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.


The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

Neural Information Processing Systems

In this paper we analyze the relationships between the eigenvalues of the m x m Gram matrix K for a kernel k(·,.) We bound the dif between the two spectra and provide a performance bound on kernel peA. 1 Introduction Over recent years there has been a considerable amount of interest in kernel methods for supervised learning (e.g. Support Vector Machines and Gaussian Process predict ion) and for unsupervised learning (e.g. In this paper we study the stability of the subspace of feature space extracted by kernel peA with respect to the sample of size m, and relate this to the feature space that would be extracted in the infinite sample-size limit. This analysis essentially "lifts" into (a potentially infinite dimensional) feature space an analysis which can also be carried out for peA, comparing the k-dimensional eigenspace extracted from a sample covariance matrix and the k-dimensional eigenspace extracted from the population covariance matrix, and comparing the residuals from the k-dimensional compression for the m-sample and the population.


Dyadic Classification Trees via Structural Risk Minimization

Neural Information Processing Systems

Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.



Kernel-Based Extraction of Slow Features: Complex Cells Learn Disparity and Translation Invariance from Natural Images

Neural Information Processing Systems

In Slow Feature Analysis (SFA [1]), it has been demonstrated that high-order invariant properties can be extracted by projecting inputs into a nonlinear space and computing the slowest changing features in this space; this has been proposed as a simple general model for learning nonlinear invariances in the visual system. However, this method is highly constrained by the curse of dimensionality which limits it to simple theoretical simulations. This paper demonstrates that by using a different but closely-related objective function for extracting slowly varying features ([2, 3]), and then exploiting the kernel trick, this curse can be avoided. Using this new method we show that both the complex cell properties of translation invariance and disparity coding can be learnt simultaneously from natural images when complex cells are driven by simple cells also learnt from the image. The notion of maximising an objective function based upon the temporal predictability of output has been progressively applied in modelling the development of invariances in the visual system.


Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

Neural Information Processing Systems

Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance.