Support Vector Machines
The Kernel Trick for Distances
A method is described which, like the kernel trick in support vector machines (SVMs),lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms. 1 Introduction One of the crucial ingredients of SVMs is the so-called kernel trick for the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. Although the mathematical result underlying the kernel trick is almost a century old [6], it was only much later [1, 3,13] that it was made fruitful for the machine learning community. Kernel methods have since led to interesting generalizations of learning algorithms and to successful real-world applications. The present paper attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15].
From Margin to Sparsity
Graepel, Thore, Herbrich, Ralf, Williamson, Robert C.
We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PACstyle generalisation error boundfor the classifier learned by the perceptron learning algorithm. Thebound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently availablefor the support vector solution itself. 1 Introduction In the last few years there has been a large controversy about the significance of the attained margin, i.e. the smallest real valued output of a classifiers before thresholding, as an indicator of generalisation performance. Results in the YC, PAC and luckiness frameworks seem to indicate that a large margin is a prerequisite for small generalisation error bounds (see [14, 12]). These results caused many researchers to focus on large margin methods such as the well known support vector machine (SYM).
An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods
This book is an introduction to support vector machines and related kernel methods in supervised learning, whose task is to estimate an input-output functional relationship from a training set of examples. A learning problem is referred to as classification if its output take discrete values in a set of possible categories and regression if it has continuous real-valued output.
An Improved Decomposition Algorithm for Regression Support Vector Machines
The Karush-Kuhn-Tucker Theorem is used to derive conditions for determining whether or not a given working set is optimal. These conditions become the algorithm)s termination criteria) as an alternative to Osuna)s criteria (also used by Joachims without modification) which used conditions for individual points. The advantage of the new conditions is that knowledge of the hyperplane)s constant factor b) which in some cases is difficult to compute) is not required. Further investigation of the new termination conditions allows to form the strategy for selecting an optimal working set. The new algorithm is applicable to the pattern recognition SVM) and is provably equivalent to Joachims) algorithm. One can also interpret the new algorithm in the sense of the method of feasible directions. Experimental results presented in the last section demonstrate superior performance of the new method in comparison with traditional training of regression SVM. 2 General Principles of Regression SVM Decomposition The original decomposition algorithm proposed for the pattern recognition SVM in [2] has been extended to the regression SVM in [4]. For the sake of completeness I will repeat the main steps of this extension with the aim of providing terse and streamlined notation to lay the ground for working set selection.
A SNoW-Based Face Detector
Yang, Ming-Hsuan, Roth, Dan, Ahuja, Narendra
A novel learning approach for human face detection using a network of linear units is presented. The SNoW learning architecture is a sparse network of linear functions over a predefined or incrementally learned feature space and is specifically tailored for learning in the presence of a very large number of features. A wide range of face images in different poses, with different expressions and under different lighting conditions are used as a training set to capture the variations of human faces. Experimental results on commonly used benchmark data sets of a wide range of face images show that the SNoW-based approach outperforms methods that use neural networks, Bayesian methods, support vector machines and others. Furthermore, learning and evaluation using the SNoW-based method are significantly more efficient than with other methods. 1 Introduction Growing interest in intelligent human computer interactions has motivated a recent surge in research on problems such as face tracking, pose estimation, face expression and gesture recognition. Most methods, however, assume human faces in their input images have been detected and localized.
Support Vector Method for Multivariate Density Estimation
Vapnik, Vladimir, Mukherjee, Sayan
A new method for multivariate density estimation is developed based on the Support Vector Method (SVM) solution of inverse ill-posed problems. The solution has the form of a mixture of densities. This method with Gaussian kernels compared favorably to both Parzen's method and the Gaussian Mixture Model method. For synthetic data we achieve more accurate estimates for densities of 2, 6, 12, and 40 dimensions. 1 Introduction The problem of multivariate density estimation is important for many applications, in particular, for speech recognition [1] [7]. When the unknown density belongs to a parametric set satisfying certain conditions one can estimate it using the maximum likelihood (ML) method. Often these conditions are too restrictive. Therefore, nonparametric methods were proposed. The most popular of these, Parzen's method [5], uses the following estimate given data
The Relevance Vector Machine
The support vector machine (SVM) is a state-of-the-art technique for regression and classification, combining excellent generalisation properties with a sparse kernel representation. However, it does suffer from a number of disadvantages, notably the absence of probabilistic outputs, the requirement to estimate a tradeoff parameter and the need to utilise'Mercer' kernel functions. In this paper we introduce the Relevance Vector Machine (RVM), a Bayesian treatment of a generalised linear model of identical functional form to the SVM. The RVM suffers from none of the above disadvantages, and examples demonstrate that for comparable generalisation performance, the RVM requires dramatically fewer kernel functions.
Leveraged Vector Machines
We describe an iterative algorithm for building vector machines used in classification tasks. The algorithm builds on ideas from support vector machines, boosting, and generalized additive models. The algorithm can be used with various continuously differential functions that bound the discrete (0-1) classification loss and is very simple to implement. We test the proposed algorithm with two different loss functions on synthetic and natural data. We also describe a norm-penalized version of the algorithm for the exponential loss function used in AdaBoost.
Bayesian Model Selection for Support Vector Machines, Gaussian Processes and Other Kernel Classifiers
We present a variational Bayesian method for model selection over families of kernels classifiers like Support Vector machines or Gaussian processes. The algorithm needs no user interaction and is able to adapt a large number of kernel parameters to given data without having to sacrifice training cases for validation. This opens the possibility to use sophisticated families of kernels in situations where the small "standard kernel" classes are clearly inappropriate. We relate the method to other work done on Gaussian processes and clarify the relation between Support Vector machines and certain Gaussian process models.