Large Margin DAGs for Multiclass Classification

Neural Information Processing Systems

We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifier. For an N-class problem, the DDAG contains N(N - 1)/2 classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on N and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision-node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.

1 Introduction

The problem of multiclass classification, especially for systems like SVMs, doesn't present an easy solution. It is generally simpler to construct classifier theory and algorithms for two mutually-exclusive classes than for N mutually-exclusive classes.
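How a DDAG turns N(N - 1)/2 pairwise classifiers into one multiclass decision can be illustrated with a minimal sketch. It assumes the usual list-based evaluation, in which each node compares the first and last classes still under consideration and eliminates the loser, so a prediction takes N - 1 pairwise decisions. The names `ddag_predict` and `make_toy_pairwise` are hypothetical, and the nearest-center node classifiers are a toy stand-in for the maximal margin hyperplanes that DAGSVM uses.

```python
from itertools import combinations


def ddag_predict(x, classes, pairwise):
    """Evaluate a DDAG on input x.

    classes  : class labels in a fixed order (the order defines the DAG).
    pairwise : dict mapping (i, j), with i before j in `classes`,
               to a function that returns either i or j for input x.
    """
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise[(first, last)](x)
        if winner == first:
            remaining.pop()    # the last class is eliminated
        else:
            remaining.pop(0)   # the first class is eliminated
    return remaining[0]


def make_toy_pairwise(centers):
    """Hypothetical 1-D node classifiers: pick the class whose center is nearer."""
    table = {}
    for i, j in combinations(sorted(centers), 2):
        table[(i, j)] = (
            lambda x, i=i, j=j:
            i if abs(x - centers[i]) <= abs(x - centers[j]) else j
        )
    return table


centers = {0: 0.0, 1: 5.0, 2: 10.0, 3: 15.0}   # illustrative data only
pairwise = make_toy_pairwise(centers)           # 4 * 3 / 2 = 6 classifiers
label = ddag_predict(9.0, [0, 1, 2, 3], pairwise)
```

Only N - 1 of the N(N - 1)/2 classifiers are consulted per input in this sketch, which is one plausible reading of why evaluation is fast.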


Optimal Kernel Shapes for Local Linear Regression

Neural Information Processing Systems

Local linear regression performs very well in many low-dimensional forecasting problems. In high-dimensional spaces, its performance typically decays due to the well-known "curse-of-dimensionality". A possible way to approach this problem is by varying the "shape" of the weighting kernel. In this work we suggest a new, data-driven method for estimating the optimal kernel shape. Experiments using an artificially generated data set and data from the UC Irvine repository show the benefits of kernel shaping.

1 Introduction

Local linear regression has attracted considerable attention in both statistical and machine learning literature as a flexible tool for nonparametric regression analysis [Cle79, FG96, AMS97]. Like most statistical smoothing approaches, local modeling suffers from the so-called "curse-of-dimensionality", the well-known fact that the proportion of the training data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate with increasing dimension of the input space.
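To make the role of the kernel "shape" concrete, here is a minimal sketch of local linear regression at a single query point. It assumes a Gaussian weighting kernel whose shape is a positive semidefinite matrix `shape` in the squared distance (x - x0)^T shape (x - x0). That parameterization and the function name `local_linear_predict` are assumptions made for illustration, and the sketch does not implement the paper's method for estimating the shape.

```python
import numpy as np


def local_linear_predict(X, y, x0, shape):
    """Fit a weighted linear model around x0 and return its value at x0.

    X     : (n, d) training inputs
    y     : (n,) training targets
    x0    : (d,) query point
    shape : (d, d) positive semidefinite kernel shape matrix (assumed form)
    """
    diffs = X - x0
    # Squared distance under the shape matrix, one value per training point.
    d2 = np.einsum("ij,jk,ik->i", diffs, shape, diffs)
    weights = np.exp(-0.5 * d2)

    # Design matrix centered at x0, so the intercept is the prediction at x0.
    A = np.hstack([np.ones((len(X), 1)), diffs])

    # Weighted least squares via row scaling by the square root of the weights.
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[0]
```

Stretching or shrinking `shape` along a direction changes how quickly the weights decay along it, which is the sense in which the neighborhood can be adapted to the data.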


Approximate Inference Algorithms for Two-Layer Bayesian Networks

Neural Information Processing Systems

We present a class of approximate inference algorithms for graphical models of the QMR-DT type. We give convergence rates for these algorithms and for the Jaakkola and Jordan (1999) algorithm, and verify these theoretical predictions empirically.


Invariant Feature Extraction and Classification in Kernel Spaces

Neural Information Processing Systems

In hyperspectral imagery one pixel typically consists of a mixture of the reflectance spectra of several materials, where the mixture coefficients correspond to the abundances of the constituting materials. We assume linear combinations of reflectance spectra with some additive normal sensor noise and derive a probabilistic MAP framework for analyzing hyperspectral data. As the material reflectance characteristics are not known a priori, we face the problem of unsupervised linear unmixing.


A Multi-class Linear Learning Algorithm Related to Winnow

Neural Information Processing Systems

In this paper, we present Committee, a new multi-class learning algorithm related to the Winnow family of algorithms. Committee is an algorithm for combining the predictions of a set of sub-experts in the online mistake-bounded model of learning. A sub-expert is a special type of attribute that predicts with a distribution over a finite number of classes. Committee learns a linear function of sub-experts and uses this function to make class predictions. We provide bounds for Committee that show it performs well when the target can be represented by a few relevant sub-experts. We also show how Committee can be used to solve more traditional problems composed of attributes. This leads to a natural extension that learns on multi-class problems that contain both traditional attributes and sub-experts.


Boosting Algorithms as Gradient Descent

Neural Information Processing Systems

Recent theoretical results suggest that the effectiveness of these algorithms is due to their tendency to produce large margin classifiers [1, 18]. Loosely speaking, if a combination of classifiers correctly classifies most of the training data with a large margin, then its error probability is small. In [14] we gave improved upper bounds on the misclassification probability of a combined classifier in terms of the average over the training data of a certain cost function of the margins.
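The margin of a combined classifier and the training average of a cost function of the margins can be sketched as follows. The sketch assumes labels and base-classifier outputs in {-1, +1} and the common normalized definition of the margin, y * sum_t w_t f_t(x) / sum_t |w_t|. The function names are hypothetical, and the cost function is left as a parameter because the excerpt does not say which one is used.

```python
def margins(weights, base_predictions, labels):
    """Normalized margins of a weighted vote of base classifiers.

    weights          : one weight per base classifier
    base_predictions : base_predictions[t][i] is classifier t's output
                       (+1 or -1) on example i
    labels           : true labels (+1 or -1)
    """
    total = sum(abs(w) for w in weights)
    result = []
    for i, y in enumerate(labels):
        vote = sum(w * preds[i] for w, preds in zip(weights, base_predictions))
        result.append(y * vote / total)
    return result


def average_cost(margin_values, cost):
    """Average of cost(margin) over the training examples."""
    return sum(cost(m) for m in margin_values) / len(margin_values)
```

A margin near +1 means almost all of the weighted vote agrees with the label; a margin at or below 0 means the combined classifier does not classify the example correctly.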


The Relaxed Online Maximum Margin Algorithm

Neural Information Processing Systems

We describe a new incremental algorithm for training linear threshold functions: the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm. Our analysis implies that the more computationally intensive maximum-margin algorithm also satisfies this mistake bound; this is the first worst-case performance guarantee for this algorithm. We describe some experiments using ROMMA and a variant that updates its hypothesis more aggressively as batch algorithms to recognize handwritten digits. The computational complexity and simplicity of these algorithms are similar to those of the perceptron algorithm, but their generalization is much better. We describe a sense in which the performance of ROMMA converges to that of SVM in the limit if bias isn't considered.
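The maximum-margin problem mentioned above, minimizing the length of the weight vector subject to linear constraints, is commonly written as follows. The notation is assumed rather than taken from the abstract: w is the weight vector, (x_i, y_i) are the m examples seen so far with y_i in {-1, +1}, and there is no bias term, in line with the abstract's closing remark. The relaxation that ROMMA itself maintains is not given in the abstract and is not shown here.

```latex
\min_{w} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i) \ge 1, \qquad i = 1, \dots, m
```

The factor 1/2 is a convention that does not change the minimizer.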


Algorithms for Independent Components Analysis and Higher Order Statistics

Neural Information Processing Systems

A latent variable generative model with finite noise is used to describe several different algorithms for Independent Components Analysis (ICA). In particular, the Fixed Point ICA algorithm is shown to be equivalent to the Expectation-Maximization algorithm for maximum likelihood under certain constraints, allowing the conditions for global convergence to be elucidated. The algorithms can also be explained by their generic behavior near a singular point where the size of the optimal generative bases vanishes. An expansion of the likelihood about this singular point indicates the role of higher order correlations in determining the features discovered by ICA. The application and convergence of these algorithms are demonstrated on a simple illustrative example.


Topographic Transformation as a Discrete Latent Variable

Neural Information Processing Systems

A very small amount of shearing will move the point only slightly, so deforming the object by shearing will trace a continuous curve in the space of pixel intensities. As illustrated in Fig. 1a, extensive levels of shearing will produce a highly nonlinear curve (consider shearing a thin vertical line), although the curve can be approximated by a straight line locally. Linear approximations of the transformation manifold have been used to significantly improve the performance of feedforward discriminative classifiers such as nearest neighbors (Simard et al., 1993) and multilayer perceptrons (Simard et al., 1992). Linear generative models (factor analysis, mixtures of factor analysis) have also been modified using linear approximations of the transformation manifold to build in some degree of transformation invariance (Hinton et al., 1997). In general, the linear approximation is accurate for transformations that couple neighboring pixels, but is inaccurate for transformations that couple nonneighboring pixels. In some applications (e.g., handwritten digit recognition), the input can be blurred so that the linear approximation becomes more robust. For significant levels of transformation, the nonlinear manifold can be better modeled using a discrete approximation. For example, the curve in Figure 1a can be 478 N. Jojic and B. J. Frey


An Improved Decomposition Algorithm for Regression Support Vector Machines

Neural Information Processing Systems

The Karush-Kuhn-Tucker Theorem is used to derive conditions for determining whether or not a given working set is optimal. These conditions become the algorithm's termination criteria, as an alternative to Osuna's criteria (also used by Joachims without modification), which used conditions for individual points. The advantage of the new conditions is that knowledge of the hyperplane's constant factor b, which in some cases is difficult to compute, is not required. Further investigation of the new termination conditions allows us to form a strategy for selecting an optimal working set. The new algorithm is applicable to the pattern recognition SVM, and is provably equivalent to Joachims' algorithm. One can also interpret the new algorithm in the sense of the method of feasible directions. Experimental results presented in the last section demonstrate superior performance of the new method in comparison with traditional training of regression SVM.

2 General Principles of Regression SVM Decomposition

The original decomposition algorithm proposed for the pattern recognition SVM in [2] has been extended to the regression SVM in [4]. For the sake of completeness I will repeat the main steps of this extension with the aim of providing terse and streamlined notation to lay the ground for working set selection.