Measure Based Regularization
Bousquet, Olivier, Chapelle, Olivier, Hein, Matthias
We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.
Approximate Analytical Bootstrap Averages for Support Vector Classifiers
Malzahn, Dörthe, Opper, Manfred
We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling.
Geometric Clustering Using the Information Bottleneck Method
Still, Susanne, Bialek, William, Bottou, Léon
We argue that K-means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of "smooth" initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms.
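As a point of reference for the classic algorithms named in this abstract, the following is a minimal sketch of a soft K-means update run over a decreasing temperature schedule, in the spirit of deterministic annealing. It does not reproduce the Information Bottleneck derivation. The one-dimensional data, the function names `soft_kmeans_step` and `anneal`, the temperature schedule, and the number of steps per temperature are all illustrative assumptions.

```python
import math


def soft_kmeans_step(points, centers, temperature):
    """One soft-assignment update for one-dimensional data (illustrative)."""
    weighted_sum = [0.0] * len(centers)
    weight_total = [0.0] * len(centers)
    for x in points:
        # Soft assignment: p(c | x) is proportional to exp(-(x - mu_c)^2 / T).
        logits = [-((x - mu) ** 2) / temperature for mu in centers]
        shift = max(logits)  # subtract the maximum for numerical stability
        probs = [math.exp(logit - shift) for logit in logits]
        total = sum(probs)
        for c, p in enumerate(probs):
            weighted_sum[c] += (p / total) * x
            weight_total[c] += p / total
    # Each center moves to the responsibility-weighted mean of the data.
    return [
        s / w if w > 0 else mu
        for s, w, mu in zip(weighted_sum, weight_total, centers)
    ]


def anneal(points, centers, temperatures, steps_per_temperature=20):
    """Iterate the update at each temperature, from high to low."""
    for temperature in temperatures:
        for _ in range(steps_per_temperature):
            centers = soft_kmeans_step(points, centers, temperature)
    return centers


# Hypothetical data: two well-separated groups on the real line.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
final_centers = anneal(data, [1.0, 4.0], [4.0, 1.0, 0.25, 0.01])
```

At low temperature the soft assignments become nearly hard and the update coincides with the usual K-means step, which is why the two algorithms are often presented together.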
When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?
Donoho, David, Stodden, Victoria
We interpret nonnegative matrix factorization geometrically, as the problem of finding a simplicial cone which contains a cloud of data points and which is contained in the positive orthant. We show that under certain conditions, basically requiring that some of the data are spread across the faces of the positive orthant, there is a unique such simplicial cone. We give examples of synthetic image articulation databases which obey these conditions; these require separated support and factorial sampling. For such databases there is a generative model in terms of 'parts' and NMF correctly identifies the 'parts'. We show that our theoretical results are predictive of the performance of published NMF code, by running the published algorithms on one of our synthetic image articulation databases.
Self-calibrating Probability Forecasting
Vovk, Vladimir, Shafer, Glenn, Nouretdinov, Ilia
In the problem of probability forecasting the learner's goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object's label. An online algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for "multiprobability forecasting" (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class "Venn probability machines". Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.
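To make the idea of multiprobability forecasting concrete, here is a minimal sketch of a 1-Nearest Neighbor Venn-style predictor. It assumes the usual construction: each candidate label is tried for the new object, every example is placed in a category given by the label of its nearest neighbor, and the forecast for that candidate is the empirical label distribution in the new object's category. The function names, the squared Euclidean distance, and the toy data are assumptions made for illustration, not details taken from the abstract.

```python
def nearest_label(index, points, labels):
    """Label of the nearest neighbor of points[index], excluding itself."""
    best_j, best_d = None, float("inf")
    for j, p in enumerate(points):
        if j == index:
            continue
        d = sum((a - b) ** 2 for a, b in zip(points[index], p))
        if d < best_d:
            best_j, best_d = j, d
    return labels[best_j]


def venn_1nn(train_x, train_y, new_x, label_space):
    """Return one label distribution per hypothetical label of new_x.

    Requires at least one training example.
    """
    predictions = {}
    for guess in label_space:
        # Tentatively add the new object with the guessed label.
        points = list(train_x) + [new_x]
        labels = list(train_y) + [guess]
        # Taxonomy: an example's category is its nearest neighbor's label.
        categories = [
            nearest_label(i, points, labels) for i in range(len(points))
        ]
        new_category = categories[-1]
        members = [
            labels[i] for i, c in enumerate(categories) if c == new_category
        ]
        # Empirical label frequencies in the new object's category.
        predictions[guess] = {
            y: members.count(y) / len(members) for y in label_space
        }
    return predictions


# Hypothetical one-dimensional data with two labels.
xs = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
ys = [0, 0, 0, 1, 1, 1]
forecast = venn_1nn(xs, ys, (0.04,), label_space=[0, 1])
```

The result is a set of distributions, one per candidate label, rather than a single distribution; the spread between them indicates how narrow the multiprobability forecast is.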
PAC-Bayesian Generic Chaining
Audibert, Jean-Yves, Bousquet, Olivier
There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-Bayesian setting.
Near-Minimax Optimal Classification with Dyadic Classification Trees
The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate.
Online Learning of Non-stationary Sequences
Monteleoni, Claire, Jaakkola, Tommi S.
We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.
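One well-known expert-tracking scheme with a switching parameter is the Fixed-Share update of Herbster and Warmuth. The sketch below shows it only as an illustration of what a switching dynamics over experts can look like; it is not the algorithm proposed in this paper, which additionally learns the switching parameter. The function name `fixed_share`, the learning rate `eta`, and the loss sequence are assumptions.

```python
import math


def fixed_share(losses, alpha, eta=1.0):
    """Run a Fixed-Share style update over per-round expert losses.

    losses: list of rounds, each a list of n >= 2 expert losses
    alpha:  assumed switching rate in [0, 1); alpha = 0 never switches
    eta:    learning rate (illustrative choice)
    Returns the weight vector available before each round.
    """
    n = len(losses[0])
    weights = [1.0 / n] * n
    history = []
    for round_losses in losses:
        history.append(list(weights))
        # Loss update: downweight each expert exponentially in its loss.
        scaled = [
            w * math.exp(-eta * loss)
            for w, loss in zip(weights, round_losses)
        ]
        total = sum(scaled)
        posterior = [s / total for s in scaled]
        # Share update: each expert keeps 1 - alpha of its mass and the
        # remaining alpha is spread evenly over the other n - 1 experts.
        weights = [
            (1.0 - alpha) * p + alpha * (1.0 - p) / (n - 1)
            for p in posterior
        ]
    return history


# Hypothetical sequence: expert 0 is best at first, then expert 1 takes over.
rounds = [[0.0, 1.0]] * 50 + [[1.0, 0.0]] * 50
static_weights = fixed_share(rounds, alpha=0.0)
switching_weights = fixed_share(rounds, alpha=0.05)
```

With `alpha = 0` the weights depend only on cumulative loss, so the learner is slow to abandon an expert that was good early on. A positive `alpha` keeps some mass on every expert and lets the weights recover after a switch, which is why the choice of this parameter matters.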
Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
The decision functions constructed by support vector machines (SVM's) usually depend only on a subset of the training set--the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM's. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM.