We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PACstyle generalisation error boundfor the classifier learned by the perceptron learning algorithm. Thebound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently availablefor the support vector solution itself. 1 Introduction In the last few years there has been a large controversy about the significance of the attained margin, i.e. the smallest real valued output of a classifiers before thresholding, as an indicator of generalisation performance. Results in the YC, PAC and luckiness frameworks seem to indicate that a large margin is a prerequisite for small generalisation error bounds (see [14, 12]).
The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently beendemonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisationabilities. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simpleand is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digitswhich show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine.
In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework  we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses ina given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1 Introduction Statistical learning theory is mainly concerned with the study of uniform bounds on the expected error of hypotheses from a given hypothesis space [9, 1].
There are two mathematical flavors of margin bound dependent upon the weights Wi of the vote and the features Xi that the vote is taken over. Those (, ) with a bound on Li w and Li x ("bib" bounds). The results here are of the "bll2" form. We improve on Shawe-Taylor et al.  and Bartlett  by a log(m)2 sample complexity factor and much tighter constants (1000 or unstated versus 9 or 18 as suggested by Section 2.2). In addition, the bound here covers margin errors without weakening the error-free case.