Technology
Error-correcting Codes on a Bethe-like Lattice
Vicente, Renato, Saad, David, Kabashima, Yoshiyuki
We analyze Gallager codes by employing a simple mean-field approximation thatdistorts the model geometry and preserves important interactions between sites. The method naturally recovers the probability propagation decodingalgorithm as an extremization of a proper free-energy. We find a thermodynamic phase transition that coincides with information theoreticalupper-bounds and explain the practical code performance in terms of the free-energy landscape.
The Kernel Trick for Distances
A method is described which, like the kernel trick in support vector machines (SVMs),lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms. 1 Introduction One of the crucial ingredients of SVMs is the so-called kernel trick for the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. Although the mathematical result underlying the kernel trick is almost a century old [6], it was only much later [1, 3,13] that it was made fruitful for the machine learning community. Kernel methods have since led to interesting generalizations of learning algorithms and to successful real-world applications. The present paper attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15].
Occam's Razor
Rasmussen, Carl Edward, Ghahramani, Zoubin
The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work. 1 Introduction Occam's Razor is a well known principle of "parsimony of explanations" which is influential inscientific thinking in general and in problems of statistical inference in particular. In this paper we review its consequences for Bayesian statistical models, where its behaviour can be easily demonstrated and quantified.
Learning Continuous Distributions: Simulations With Field Theoretic Priors
Nemenman, Ilya, Bialek, William
Learning of a smooth but nonparametric probability density can be regularized usingmethods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter ofthe theory (,smoothness scale') can be determined self consistently bythe data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space-the'Occam factor' -that discriminates against models with more parameters [1, 2].
Weak Learners and Improved Rates of Convergence in Boosting
We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, weprovide improved convergence rate bounds for the generalization errorin situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established.
Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Malzahn, Dรถrthe, Opper, Manfred
Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian processregression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improvedand argue that similar techniques can be applied to general likelihood models. 1 Introduction Gaussian process (GP) models have gained considerable interest in the Neural Computation Community(see e.g.[I, 2, 3, 4]) in recent years. Being nonparametric models by construction their theoretical understanding seems to be less well developed comparedto simpler parametric models like neural networks. We are especially interested in developing theoretical approaches which will at least give good approximations togeneralization errors when the number of training data is sufficiently large. In this paper we present a step in this direction which is based on a statistical mechanics approach.In contrast to most previous applications of statistical mechanics to learning theory we are not limited to the so called "thermodynamic" limit which would require a high dimensional input space.
Foundations for a Circuit Complexity Theory of Sensory Processing
Legenstein, Robert A., Maass, Wolfgang
We introduce total wire length as salient complexity measure for an analysis ofthe circuit complexity of sensory processing in biological neural systems and neuromorphic engineering. This new complexity measure is applied to a set of basic computational problems that apparently need to be solved by circuits for translation-and scale-invariant sensory processing. Weexhibit new circuit design strategies for these new benchmark functions that can be implemented within realistic complexity bounds, in particular with linear or almost linear total wire length. 1 Introduction Circuit complexity theory is a classical area of theoretical computer science, that provides estimates for the complexity of circuits for computing specific benchmark functions, such as binary addition, multiplication and sorting (see, e.g.
Some New Bounds on the Generalization Error of Combined Classifiers
Koltchinskii, Vladimir, Panchenko, Dmitriy, Lozano, Fernando
In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allow us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so called Gaussian complexity functions. Asa simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. Wealso substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h -norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee.
On Reversing Jensen's Inequality
Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simplelower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operationslike integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models.We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown.