Goto

Collaborating Authors

 Technology


Occam's Razor

Neural Information Processing Systems

The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work. 1 Introduction Occam's Razor is a well known principle of "parsimony of explanations" which is influential in scientific thinking in general and in problems of statistical inference in particular. In this paper we review its consequences for Bayesian statistical models, where its behaviour can be easily demonstrated and quantified.


Learning Continuous Distributions: Simulations With Field Theoretic Priors

Neural Information Processing Systems

Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory (,smoothness scale') can be determined self consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space-the'Occam factor' -that discriminates against models with more parameters [1, 2].


Weak Learners and Improved Rates of Convergence in Boosting

Neural Information Processing Systems

The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established.


A Tighter Bound for Graphical Models

Neural Information Processing Systems

The neurons in these networks are the random variables, whereas the connections between them model the causal dependencies. Usually, some of the nodes have a direct relation with the random variables in the problem and are called'visibles'. The other nodes, known as'hiddens', are used to model more complex probability distributions. Learning in graphical models can be done as long as the likelihood that the visibles correspond to a pattern in the data set, can be computed. In general the time it takes, scales exponentially with the number of hidden neurons.


Foundations for a Circuit Complexity Theory of Sensory Processing

Neural Information Processing Systems

We introduce total wire length as salient complexity measure for an analysis of the circuit complexity of sensory processing in biological neural systems and neuromorphic engineering. This new complexity measure is applied to a set of basic computational problems that apparently need to be solved by circuits for translation-and scale-invariant sensory processing. We exhibit new circuit design strategies for these new benchmark functions that can be implemented within realistic complexity bounds, in particular with linear or almost linear total wire length. 1 Introduction Circuit complexity theory is a classical area of theoretical computer science, that provides estimates for the complexity of circuits for computing specific benchmark functions, such as binary addition, multiplication and sorting (see, e.g.


Some New Bounds on the Generalization Error of Combined Classifiers

Neural Information Processing Systems

In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allow us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h -norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee.


Sparsity of Data Representation of Optimal Kernel Machine and Leave-one-out Estimator

Neural Information Processing Systems

Vapnik's result that the expectation of the generalisation error ofthe optimal hyperplane is bounded by the expectation of the ratio of the number of support vectors to the number of training examples is extended to a broad class of kernel machines. The class includes Support Vector Machines for soft margin classification and regression, and Regularization Networks with a variety of kernels and cost functions. We show that key inequalities in Vapnik's result become equalities once "the classification error" is replaced by "the margin error", with the latter defined as an instance with positive cost. In particular we show that expectations of the true margin error and the empirical margin error are equal, and that the sparse solutions for kernel machines are possible only if the cost function is "partially" insensitive. 1 Introduction Minimization of regularized risk is a backbone of several recent advances in machine learning, including Support Vector Machines (SVM) [13], Regularization Networks (RN) [5] or Gaussian Processes [15]. Such a machine is typically implemented as a weighted sum of a kernel function evaluated for pairs composed of a data vector in question and a number of selected training vectors, so called support vectors.


Second Order Approximations for Probability Models

Neural Information Processing Systems

In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absense of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.


On Reversing Jensen's Inequality

Neural Information Processing Systems

Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown.


A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Neural Information Processing Systems

We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to nontrivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses.