Europe
Explaining Away in Weight Space
Explaining away has mostly been considered in terms of inference of states in belief networks. We show how it can also arise in a Bayesian context in inference about the weights governing relationships such as those between stimuli and reinforcers in conditioning experiments such as bacA, 'Ward blocking. We show how explaining away in weight space can be accounted for using an extension of a Kalman filter model; provide a new approximate way of looking at the Kalman gain matrix as a whitener for the correlation matrix of the observation process; suggest a network implementation of this whitener using an architecture due to Goodall; and show that the resulting model exhibits backward blocking.
Model Complexity, Goodness of Fit and Diminishing Returns
Cadez, Igor V., Smyth, Padhraic
Such learning tasks can typically be characterized by the existence of a model and a loss function. A fitted model of complexity k is a function of the data points D and depends on a specific set of fitted parameters B. The loss function (goodnessof-fit) is a functional of the model and maps each specific model to a scalar used to evaluate the model, e.g., likelihood for density estimation or sum-of-squares for regression. Figure 1 illustrates a typical empirical curve for loss function versus complexity, for mixtures of Markov models fitted to a large data set of 900,000 sequences. The complexity k is the number of Markov models being used in the mixture (see Cadez et al. (2000) for further details on the model and the data set). The empirical curve has a distinctly concave appearance, with large relative gains in fit for low complexity models and much more modest relative gains for high complexity models.
Convergence of Large Margin Separable Linear Classification
Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed "optimal hyperplane" approaches zero at a rate proportional to the inverse training sample size. This rate is usually characterized by the margin and the maximum norm of the input data. In this paper, we argue that another quantity, namely the robustness of the input data distribution, also plays an important role in characterizing the convergence behavior of expected misclassification error. Based on this concept of robustness, we show that for a large margin separable linear classification problem, the expected misclassification error may converge exponentially in the number of training sample size.
Computing with Finite and Infinite Networks
Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning.
Algebraic Information Geometry for Learning Machines with Singularities
Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities and two different problems are studied.
Error-correcting Codes on a Bethe-like Lattice
Vicente, Renato, Saad, David, Kabashima, Yoshiyuki
We analyze Gallager codes by employing a simple mean-field approximation that distorts the model geometry and preserves important interactions between sites. The method naturally recovers the probability propagation decoding algorithm as an extremization of a proper free-energy. We find a thermodynamic phase transition that coincides with information theoretical upper-bounds and explain the practical code performance in terms of the free-energy landscape.
The Kernel Trick for Distances
A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms. 1 Introduction One of the crucial ingredients of SVMs is the so-called kernel trick for the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. Although the mathematical result underlying the kernel trick is almost a century old [6], it was only much later [1, 3,13] that it was made fruitful for the machine learning community. Kernel methods have since led to interesting generalizations of learning algorithms and to successful real-world applications. The present paper attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15].
Occam's Razor
Rasmussen, Carl Edward, Ghahramani, Zoubin
The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work. 1 Introduction Occam's Razor is a well known principle of "parsimony of explanations" which is influential in scientific thinking in general and in problems of statistical inference in particular. In this paper we review its consequences for Bayesian statistical models, where its behaviour can be easily demonstrated and quantified.
Weak Learners and Improved Rates of Convergence in Boosting
The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established.
Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Malzahn, Dörthe, Opper, Manfred
Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models. 1 Introduction Gaussian process (GP) models have gained considerable interest in the Neural Computation Community (see e.g.[I, 2, 3, 4]) in recent years. Being nonparametric models by construction their theoretical understanding seems to be less well developed compared to simpler parametric models like neural networks. We are especially interested in developing theoretical approaches which will at least give good approximations to generalization errors when the number of training data is sufficiently large. In this paper we present a step in this direction which is based on a statistical mechanics approach.