Country
Asymptotic Universality for Learning Curves of Support Vector Machines
Opper, Manfred, Urbanczik, Robert
Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization erroralthough the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1 Introduction Powerful systems for data inference, like neural networks implement complex inputoutput relationsby learning from example data. The price one has to pay for the flexibility of these models is the need to choose the proper model complexity for a given task, i.e. the system architecture which gives good generalization ability for novel data. This has become an important problem also for support vector machines [1].
A Variational Approach to Learning Curves
Malzahn, Dรถrthe, Opper, Manfred
We combine the replica approach from statistical physics with a variational approachto analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relationsbetween empirical error measures, the generalization error and the posterior variance.
Means, Correlations and Bounds
Leisink, Martijn, Kappen, Bert
The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to nontrivial regions (i.e. a subset of [-1, 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1 Introduction Over the last decade, bounding techniques have become a popular tool to deal with graphical models that are too complex for exact computation. A nice property of bounds is that they give at least some information you can rely on.
Boosting and Maximum Likelihood for Exponential Models
Lebanon, Guy, Lafferty, John D.
We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing theexponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized toform a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis andgive additional insight into the relationship between boosting and logistic regression.
Kernel Machines and Boolean Functions
Kowalczyk, Adam, Smola, Alex J., Williamson, Robert C.
We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented bythe help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.
Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
Khardon, Roni, Roth, Dan, Servedio, Rocco A.
We study online learning in Boolean domains using kernels which capture featureexpansions equivalent to using conjunctions over basic features. Wedemonstrate a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability ofthe resulting classifier. We first describe several kernel functions which capture either limited forms of conjunctions or all conjunctions. We show that these kernels can be used to efficiently run the Perceptron algorithmover an exponential number of conjunctions; however we also prove that using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. Wealso consider an analogous use of kernel functions to run the multiplicative-update Winnow algorithm over an expanded feature space of exponentially many conjunctions. While known upper bounds imply that Winnow can learn DNF formulae with a polynomial mistake bound in this setting, we prove that it is computationally hard to simulate Winnow's behaviorfor learning DNF over such a feature set, and thus that such kernel functions for Winnow are not efficiently computable.
Novel iteration schemes for the Cluster Variation Method
Kappen, Hilbert J., Wiegerinck, Wim
It has been noted by several authors that Belief Propagation can can also give impressive results for graphs that are not trees [2]. The Cluster Variation Method (CVM), is a method that has been developed in the physics community for approximate inference in the Ising model [3]. The CVM approximates thejoint probability distribution by a number of (overlapping) marginal distributions (clusters). The quality of the approximation is determined by the size and number of clusters. When the clusters consist of only two variables, the method is known as the Bethe approximation.
Distribution of Mutual Information
The mutual information of two random variables z and J with joint probabilities {7rij} is commonly used in learning Bayesian nets as well as in many other fields. The chances 7rij are usually estimated by the empirical sampling frequency nij In leading to a point estimate J(nijIn) for the mutual information. To answer questions like "is J (nij In) consistent with zero?" or "what is the probability that the true mutual information is much larger than the point estimate?"
Algorithmic Luckiness
Herbrich, Ralf, Williamson, Robert C.
In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses ina given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1 Introduction Statistical learning theory is mainly concerned with the study of uniform bounds on the expected error of hypotheses from a given hypothesis space [9, 1].