Country
Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers
We present a variational Bayesian method for model selection over families of kernels classifiers like Support Vector machines or Gaussian processes.The algorithm needs no user interaction and is able to adapt a large number of kernel parameters to given data without having to sacrifice training cases for validation. This opens the possibility touse sophisticated families of kernels in situations where the small "standard kernel" classes are clearly inappropriate. We relate the method to other work done on Gaussian processes and clarify the relation between Support Vector machines and certain Gaussian process models. 1 Introduction Bayesian techniques have been widely and successfully used in the neural networks and statistics community and are appealing because of their conceptual simplicity, generality and consistency with which they solve learning problems. In this paper we present a new method for applying the Bayesian methodology to Support Vector machines. We will briefly review Gaussian Process and Support Vector classification in this section and clarify their relationship by pointing out the common roots. Although we focus on classification here, it is straightforward to apply the methods to regression problems as well. In section 2 we introduce our algorithm and show relations to existing methods. Finally, we present experimental results in section 3 and close with a discussion in section 4. Let X be a measure space (e.g.
v-Arc: Ensemble Learning in the Presence of Outliers
Rรคtsch, Gunnar, Schรถlkopf, Bernhard, Smola, Alex J., Mรผller, Klaus-Robert, Onoda, Takashi, Mika, Sebastian
The idea of a large minimum margin [17] explains the good generalization performance ofAdaBoost in the low noise regime. However, AdaBoost performs worse on noisy tasks [10, 11], such as the iris and the breast cancer benchmark data sets [1]. On the latter tasks, a large margin on all training points cannot be achieved without adverse effects on the generalization error. This experimental observation was supported by the study of [13] where the generalization error of ensemble methods wasbounded by the sum of the fraction of training points which have a margin smaller than some value p, say, plus a complexity term depending on the base hypotheses andp. While this bound can only capture part of what is going on in practice, it nevertheless already conveys the message that in some cases it pays to allow for some points which have a small margin, or are misclassified, if this leads to a larger overall margin on the remaining points. To cope with this problem, it was mandatory to construct regularized variants of AdaBoost, which traded off the number of margin errors and the size of the margin 562 G.Riitsch, B. Sch6lkopf, A. J. Smola, K.-R.
Invariant Feature Extraction and Classification in Kernel Spaces
Mika, Sebastian, Rรคtsch, Gunnar, Weston, Jason, Schรถlkopf, Bernhard, Smola, Alex J., Mรผller, Klaus-Robert
In hyperspectral imagery one pixel typically consists of a mixture of the reflectance spectra of several materials, where the mixture coefficients correspond to the abundances of the constituting materials. Weassume linear combinations of reflectance spectra with some additive normal sensor noise and derive a probabilistic MAP framework for analyzing hyperspectral data. As the material reflectance characteristicsare not know a priori, we face the problem of unsupervised linear unmixing.
Boosting Algorithms as Gradient Descent
Mason, Llew, Baxter, Jonathan, Bartlett, Peter L., Frean, Marcus R.
Recent theoretical results suggest that the effectiveness of these algorithms is due to their tendency to produce large margin classifiers [1, 18]. Loosely speaking, if a combination of classifiers correctly classifies most of the training data with a large margin, then its error probability is small. In [14] we gave improved upper bounds on the misclassification probability of a combined classifier in terms of the average over the training data of a certain cost function of the margins. That paper also described DOOM, an algorithm for directly minimizingthe margin cost function by adjusting the weights associated with Boosting Algorithms as Gradient Descent 513 each base classifier (the base classifiers are suppiled to DOOM). DOOM exhibits performance improvements over AdaBoost, even when using the same base hypotheses, whichprovides additional empirical evidence that these margin cost functions are appropriate quantities to optimize. In this paper, we present a general class of algorithms (called AnyBoost) which are gradient descent algorithms for choosing linear combinations of elements of an inner product function space so as to minimize some cost functional. The normal operation of a weak learner is shown to be equivalent to maximizing a certain inner product. We prove convergence of AnyBoost under weak conditions. In Section 3, we show that this general class of algorithms includes as special cases nearly all existing voting methods.
Algorithms for Independent Components Analysis and Higher Order Statistics
Lee, Daniel D., Rokni, Uri, Sompolinsky, Haim
A latent variable generative model with finite noise is used to describe severaldifferent algorithms for Independent Components Analysis (lCA). In particular, the Fixed Point ICA algorithm is shown to be equivalent to the Expectation-Maximization algorithm for maximum likelihood under certain constraints, allowing the conditions for global convergence to be elucidated. The algorithms can also be explained by their generic behavior near a singular point where the size of the optimal generativebases vanishes. An expansion of the likelihood about this singular point indicates the role of higher order correlations in determining thefeatures discovered by ICA. The application and convergence of these algorithms are demonstrated on a simple illustrative example.
Topographic Transformation as a Discrete Latent Variable
Jojic, Nebojsa, Frey, Brendan J.
We describe a way to add transformation invariance toa generative density model by approximating the nonlinear transformation manifold by a discrete set of transformations. An EM algorithm for the original model can be extended to the new model by computing expectations over the set of transformations. We show how to add a discrete transformation variable to Gaussian mixture modeling, factor analysis and mixtures of factor analysis. We give results on filtering microscopy images, face and facial pose clustering, and handwritten digit modeling and recognition.
Learning to Parse Images
Hinton, Geoffrey E., Ghahramani, Zoubin, Teh, Yee Whye
We describe a class of probabilistic models that we call credibility networks. Using parse trees as internal representations of images, credibility networks are able to perform segmentation and recognition simultaneously,removing the need for ad hoc segmentation heuristics. Promising results in the problem of segmenting handwritten digitswere obtained.