
Collaborating Authors

 Mason, Llew


Boosting Algorithms as Gradient Descent

Neural Information Processing Systems

Recent theoretical results suggest that the effectiveness of these algorithms is due to their tendency to produce large margin classifiers [1, 18]. Loosely speaking, if a combination of classifiers correctly classifies most of the training data with a large margin, then its error probability is small. In [14] we gave improved upper bounds on the misclassification probability of a combined classifier in terms of the average over the training data of a certain cost function of the margins. That paper also described DOOM, an algorithm for directly minimizing the margin cost function by adjusting the weights associated with each base classifier (the base classifiers are supplied to DOOM). DOOM exhibits performance improvements over AdaBoost, even when using the same base hypotheses, which provides additional empirical evidence that these margin cost functions are appropriate quantities to optimize. In this paper, we present a general class of algorithms (called AnyBoost) which are gradient descent algorithms for choosing linear combinations of elements of an inner product function space so as to minimize some cost functional. The normal operation of a weak learner is shown to be equivalent to maximizing a certain inner product. We prove convergence of AnyBoost under weak conditions. In Section 3, we show that this general class of algorithms includes as special cases nearly all existing voting methods.
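
Since the abstract describes AnyBoost only at a high level, here is a minimal Python sketch of the idea as stated: gradient descent in a function space, where on each round the weak learner is asked to return a base classifier with large inner product with the negative gradient of the cost functional. The decision-stump weak learner, the fixed step size, and the exponential cost in the demo are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the AnyBoost idea: gradient descent
# on a cost functional of the combined classifier, where the weak learner supplies
# a base classifier correlated with the negative functional gradient.

import numpy as np

def stump_weak_learner(X, w):
    """Return the decision stump f(x) = s * sign(x[j] - theta) that (approximately)
    maximizes the weighted correlation sum_i w_i * f(x_i)."""
    best = (0, 0.0, 1.0, -np.inf)          # (feature, threshold, sign, score)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            pred = np.where(X[:, j] > theta, 1.0, -1.0)
            for s in (1.0, -1.0):
                score = np.sum(w * s * pred)
                if score > best[3]:
                    best = (j, theta, s, score)
    j, theta, s, _ = best
    return lambda Z: s * np.where(Z[:, j] > theta, 1.0, -1.0)

def anyboost(X, y, cost_grad, rounds=20, step=0.5):
    """Gradient descent on C(F) = (1/m) sum_i c(y_i F(x_i)).
    cost_grad(z) is c'(z); the negative gradient at example i is -y_i c'(y_i F(x_i)) / m."""
    m = len(y)
    F = np.zeros(m)                         # current combined scores F(x_i)
    ensemble = []                           # list of (alpha, base classifier)
    for _ in range(rounds):
        w = -y * cost_grad(y * F) / m       # negative functional gradient at the data
        f = stump_weak_learner(X, w)
        pred = f(X)
        if np.sum(w * pred) <= 0:           # no descent direction left: stop
            break
        alpha = step                        # fixed step here; a line search is another option
        F += alpha * pred
        ensemble.append((alpha, f))
    return ensemble

# With the exponential cost c(z) = exp(-z), the per-example weights mimic
# AdaBoost-style reweighting; other differentiable margin costs slot in via cost_grad.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
    ens = anyboost(X, y, cost_grad=lambda z: -np.exp(-z))
    votes = sum(a * f(X) for a, f in ens)
    print("training error:", np.mean(np.sign(votes) != y))
```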


Direct Optimization of Margins Improves Generalization in Combined Classifiers

Neural Information Processing Systems

[Figure caption: The dark curve is AdaBoost, the light curve is DOOM. DOOM sacrifices significant training error for improved test error (horizontal marks on margin 0 line).] Many learning algorithms for pattern classification minimize some cost function of the training data, with the aim of minimizing error (the probability of misclassifying an example). One example of such a cost function is simply the classifier's error on the training data.
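
To make the cost-function framing concrete, the short sketch below computes the margin of a weighted vote on each training example and an average sigmoidal cost of those margins; the 0/1 training error is the special case where the cost is a step function of the margin. The particular 1 - tanh(lambda * z) cost and the synthetic base classifiers are illustrative choices, not necessarily the exact cost optimized by DOOM.

```python
# Illustrative margins of a voting classifier and a smooth cost of those margins.

import numpy as np

def margins(base_predictions, alphas, y):
    """Margin of the weighted vote on each example.
    base_predictions: (n_classifiers, n_examples) array of +/-1 predictions.
    alphas: nonnegative weights, normalized so margins lie in [-1, 1]."""
    a = np.asarray(alphas, dtype=float)
    a = a / a.sum()
    return y * (a @ base_predictions)

def margin_cost(m, lam=4.0):
    """Average sigmoidal cost of the margins: large positive margins give cost near 0,
    large negative margins give cost near 2; training error is the 0/1 analogue."""
    return np.mean(1.0 - np.tanh(lam * m))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.choice([-1.0, 1.0], size=50)
    preds = np.sign(rng.normal(size=(5, 50)) + 0.8 * y)   # 5 weak-ish base classifiers
    m = margins(preds, np.ones(5), y)
    print("training error:", np.mean(m <= 0))
    print("average margin cost:", margin_cost(m))
```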



Generalization in Decision Trees and DNF: Does Size Matter?

Neural Information Processing Systems

Recent theoretical results for pattern classification with thresholded real-valued functions (such as support vector machines, sigmoid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions).
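
As a concrete instance of the representation described above, the sketch below rewrites a small DNF formula (a disjunction of k conjunctive terms) as a thresholded convex combination of boolean functions: its terms plus the constant-true function. The specific weights, which give every input a margin of at least 1/(2k - 1), are one natural construction chosen for illustration and may differ from the paper's exact choice.

```python
# A DNF formula as a thresholded convex combination of {-1,+1}-valued boolean functions.

from itertools import product

# Example DNF over 4 boolean variables: (x0 AND x1) OR (NOT x2 AND x3) OR (x1 AND x2)
terms = [
    lambda x: x[0] and x[1],
    lambda x: (not x[2]) and x[3],
    lambda x: x[1] and x[2],
]

def dnf(x):
    return 1 if any(t(x) for t in terms) else -1

def term_output(t, x):
    return 1 if t(x) else -1                  # each term as a {-1, +1}-valued function

def voted(x):
    """Convex combination: weight 1/(2k-1) on each term and (k-1)/(2k-1) on the
    constant +1 function, so the weights sum to 1."""
    k = len(terms)
    total = sum(term_output(t, x) for t in terms) + (k - 1)   # constant function's share
    return total / (2 * k - 1)

# Thresholding the vote reproduces the DNF on all 2^4 inputs, and every input is
# classified with margin at least 1/(2k - 1).
for x in product([False, True], repeat=4):
    F = voted(x)
    assert (1 if F > 0 else -1) == dnf(x)
    assert abs(F) >= 1 / (2 * len(terms) - 1) - 1e-12
print("DNF reproduced as a thresholded convex combination; minimum margin =",
      1 / (2 * len(terms) - 1))
```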