Rethinking Generalisation
Marcu, Antonia, Prügel-Bennett, Adam
Vision, Learning and Control University of Southampton Southampton, UK Abstract In this paper, we present a new approach to computing the generalisation performance assuming that the distribution of risks, ρ (r), for a learning scenario is known. This allows us to compute the expected error of a learning machine using empirical risk minimisation. We show that it is possible to obtain results for both classification and regression. We show a critical quantity in determining the generalisation performance is the power-law behaviour of ρ ( r) around its minimum value. We compute ρ ( r) for the case of all Boolean functions and for the perceptron. We start with a simplistic analysis but then do a more formal one later on. We show that the simplistic results are qualitatively correct and provide a good approximation to the actual results if we replace the true training set size with an approximate training set size. Keywords: Generalisation, Learning Theory 1. Introduction Traditional computational learning theory aims to eliminate all rules that do not correctly explain the data. A rule can be thought of as a fixed set of parameters of a learning machine; more formally, a hypothesis. This process relies on the idea that rules with poor generalisation performance (high risk) will, with high probability, make errors on a sufficiently large randomly chosen training data set (Vapnik and Chervonenkis, 1971; Valiant, 1984; Baum and Haussler, 1989; Blumer et al., 1989; Haussler, 1992; Vapnik, 1992). Suppose there exists a mechanism for selecting a rule from the subset of rules that have the lowest errors on the training set. Then, there is a very small probability that any of the selected rules has a high risk. However, this crucially depends on there being effectively a finite number of hypotheses, otherwise, there could still be a high-risk set of parameters which by chance did well on the particular training set. In the case where the learning machine has a continuous parameter space (so that the dimensionality of the space is uncountably infinite), we consider the effective size of the hypothesis space to be the Vapnik-Chervonenkis (VC) dimension. The VC dimension measures the number of possible ways in which the machine can give different outputs to a finite number of training examples (Vapnik and Chervonenkis, 1971). This effective size or capacity lies at the heart of conventional computational learning theory. By limiting the capacity we can obtain stronger bounds on the generalisation performance. In this paper, we challenge this traditional approach.
Nov-11-2019
- Country:
- Europe > United Kingdom > England > Hampshire > Southampton (0.24)
- Genre:
- Research Report (0.50)
- Technology: