Asia
Hyperkernels
Ong, Cheng S., Williamson, Robert C., Smola, Alex J.
We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.
Margin Analysis of the LVQ Algorithm
Crammer, Koby, Gilad-bachrach, Ran, Navot, Amir, Tishby, Naftali
Prototypes based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmical aspects of such algorithms. On the theory side, we present margin based generalization bounds that suggest that these kinds of classifiers can be more accurate then the 1-NN rule. Furthermore, we derived a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20 years old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework.
Conditional Models on the Ranking Poset
Lebanon, Guy, Lafferty, John D.
A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.
On the Complexity of Learning the Kernel Matrix
Bousquet, Olivier, Herrmann, Daniel
We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.
The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities
Watanabe, Sumio, Amari, Shun-ichi
A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information.
Information Diffusion Kernels
Lebanon, Guy, Lafferty, John D.
A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with nonparametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications.
Data-Dependent Bounds for Bayesian Mixture Methods
We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting.
Learning in Spiking Neural Assemblies
We consider a statistical framework for learning in a class of networks of spiking neurons. Our aim is to show how optimal local learning rules can be readily derived once the neural dynamics and desired functionality of the neural assembly have been specified, in contrast to other models which assume (sub-optimal) learning rules. Within this framework we derive local rules for learning temporal sequences in a model of spiking neurons and demonstrate its superior performance to correlation (Hebbian) based approaches. We further show how to include mechanisms such as synaptic depression and outline how the framework is readily extensible to learning in networks of highly complex spiking neurons. A stochastic quantal vesicle release mechanism is considered and implications on the complexity of learning discussed.