Support Vector Machines
Fast Sparse Gaussian Process Methods: The Informative Vector Machine
We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information- theoretic principles, previously suggested for active learning. Our goal is not only to learn d{sparse predictors (which can be evalu- ated in O(d) rather than O(n), d (cid:28) n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n (cid:1) d2), and in large real-world classi(cid:12)cation experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be signi(cid:12)cantly faster in training. In contrast to the SVM, our approximation produces esti- mates of predictive probabilities ('error bars'), allows for Bayesian model selection and is less complex in implementation.
Hyperkernels
We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Ker- nel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.
The Decision List Machine
We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a trade- ofi between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifler it flnds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine.
Knowledge-Based Support Vector Machine Classifiers
Prior knowledge in the form of multiple polyhedral sets, each be(cid:173) longing to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formu(cid:173) lation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorpo(cid:173) ration of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a lin(cid:173) ear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data.
Dyadic Classification Trees via Structural Risk Minimization
Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of clas- sification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal perfor- mance is attained with linear (in the number of training data) complexity growing and pruning algorithms.
On the Complexity of Learning the Kernel Matrix
We investigate data based procedures for selecting the kernel when learn- ing with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a loga- rithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting.
Support Vector Machines for Multiple-Instance Learning
This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharma(cid:173) ceutical data set and on applications in automated image indexing and document categorization.
Forward-Decoding Kernel-Based Phone Recognition
Forward decoding kernel machines (FDKM) combine large-margin clas(cid:173) sifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and par(cid:173) tially labeled data. Training over very large data sets is accomplished us(cid:173) ing a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TlMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
Mismatch String Kernels for SVM Protein Classification
We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence sim- ilarity based on shared occurrences of -length subsequences, counted with up to mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most success- ful method for remote homology detection, while achieving considerable computational savings.
Learning a Distance Metric from Relative Comparisons
This paper presents a method for learning a distance metric from rel- ative comparison such as "A is closer to B than A is to C". Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard meth- ods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents.