The Rectified Gaussian Distribution

Neural Information Processing Systems

The variables of the rectified Gaussian are constrained to be nonnegative, enabling the use of nonconvex energy functions. Two multimodal examples, the competitive and cooperative distributions, illustrate the representational power of the rectified Gaussian. Since the cooperative distribution can represent the translations of a pattern, it demonstrates the potential of the rectified Gaussian for modeling pattern manifolds.
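To make the role of the nonnegativity constraint concrete, here is a minimal sketch. It assumes a quadratic energy E(x) = 0.5 x^T A x - b^T x and an unnormalized density exp(-E(x)) on x >= 0; the matrix A and vector b are illustrative choices, not values taken from the abstract. A is not positive definite, so the energy is nonconvex, yet it still grows in every nonnegative direction and has two separate low-energy modes on the axes.

```python
import math

def energy(x, A, b):
    """Quadratic energy E(x) = 0.5 * x^T A x - b^T x."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    lin = sum(bi * xi for bi, xi in zip(b, x))
    return 0.5 * quad - lin

def unnormalized_density(x, A, b):
    """exp(-E(x)) on the nonnegative orthant, zero elsewhere."""
    if any(xi < 0.0 for xi in x):
        return 0.0
    return math.exp(-energy(x, A, b))

# Illustrative (assumed) parameters: A has eigenvalues 3 and -1,
# so it is indefinite, but x^T A x > 0 for every nonzero x >= 0.
A = [[1.0, 2.0], [2.0, 1.0]]
b = [1.0, 1.0]

e_mode_a = energy((1.0, 0.0), A, b)   # minimum along the first axis
e_mode_b = energy((0.0, 1.0), A, b)   # minimum along the second axis
e_between = energy((0.5, 0.5), A, b)  # midpoint between the two modes
```

The midpoint has higher energy than either axis point, so the density has two peaks separated by a ridge, which an ordinary Gaussian cannot represent.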


From Regularization Operators to Support Vector Kernels

Neural Information Processing Systems

Support Vector (SV) Machines for pattern recognition, regression estimation and operator inversion exploit the idea of transforming into a high dimensional feature space where they perform a linear algorithm. Instead of evaluating this map explicitly, one uses Hilbert-Schmidt kernels $k(x, y)$ which correspond to dot products of the mapped data in high dimensional space, i.e. $k(x, y) = (\Phi(x) \cdot \Phi(y))$ (1), with $\Phi: \mathbb{R}^n \to \mathcal{F}$ denoting the map into feature space. Mostly, this map and many of its properties are unknown. Even worse, so far no general rule was available.
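The idea that a kernel equals a dot product in feature space can be checked on a small case where the feature map happens to be known. The sketch below assumes the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only for illustration; the function names `phi` and `poly_kernel` are hypothetical.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without building phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, y)       # works in the 2-D input space
via_features = dot(phi(x), phi(y))   # works in the 3-D feature space
```

Both routes give the same number up to rounding; the kernel route never constructs the feature vectors, which is what makes very high dimensional feature spaces practical.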


Data-Dependent Structural Risk Minimization for Perceptron Decision Trees

Neural Information Processing Systems

Using displays of line orientations taken from Wolfe's experiments [1992], we study the hypothesis that the distinction between parallel versus serial processes arises from the availability of global information in the internal representations of the visual scene. The model operates in two phases. First, the visual displays are compressed via principal component analysis. Second, the compressed data is processed by a target detector module in order to identify the existence of a target in the display. Our main finding is that targets in displays which were found experimentally to be processed in parallel can be detected by the system, while targets in experimentally-serial displays cannot. This fundamental difference is explained via variance analysis of the compressed representations, providing a numerical criterion distinguishing parallel from serial displays. Our model yields a mapping of response-time slopes that is similar to Duncan and Humphreys's "search surface" [1989], providing an explicit formulation of their intuitive notion of feature similarity. It presents a neural realization of the processing that may underlie the classical metaphorical explanations of visual search.


Structural Risk Minimization for Nonparametric Time Series Prediction

Neural Information Processing Systems

The problem of time series prediction is studied within the uniform convergence framework of Vapnik and Chervonenkis. The dependence inherent in the temporal structure is incorporated into the analysis, thereby generalizing the available theory for memoryless processes. Finite sample bounds are calculated in terms of covering numbers of the approximating class, and the tradeoff between approximation and estimation is discussed. A complexity regularization approach is outlined, based on Vapnik's method of Structural Risk Minimization, and shown to be applicable in the context of mixing stochastic processes.


Two Approaches to Optimal Annealing

Neural Information Processing Systems

The latter studies are based on examining the Kramers-Moyal expansion of the master equation for the weight space probability densities. A different approach, based on the deterministic dynamics of macroscopic quantities called order parameters, has been recently presented [6, 7]. This approach enables one to monitor the evolution of the order parameters and the system performance at all times. In this paper we examine the relation between the two approaches and contrast the results obtained for different learning rate annealing schedules in the asymptotic regime. We employ the order parameter approach to examine the dependence of the dynamics on the number of hidden nodes in a multilayer system.


Asymptotic Theory for Regularization: One-Dimensional Linear Case

Neural Information Processing Systems

The generalization ability of a neural network can sometimes be improved dramatically by regularization. To analyze the improvement one needs more refined results than the asymptotic distribution of the weight vector. Here we study the simple case of one-dimensional linear regression under quadratic regularization, i.e., ridge regression. We study the random design, misspecified case, where we derive expansions for the optimal regularization parameter and the ensuing improvement. It is possible to construct examples where it is best to use no regularization.
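For orientation, one-dimensional ridge regression has a closed-form solution. The sketch below assumes a model without intercept and the objective sum_i (y_i - w x_i)^2 + lam w^2; the scaling of the penalty is an assumption and may differ from the convention used in the paper.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of sum_i (y_i - w * x_i)**2 + lam * w**2 over scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_ols = ridge_1d(xs, ys, 0.0)     # no regularization
w_ridge = ridge_1d(xs, ys, 14.0)  # penalty equal to sum of x_i**2
```

Setting `lam` to zero recovers ordinary least squares, and any positive `lam` shrinks the estimate toward zero. On noise-free, correctly specified data like this toy set, the shrinkage can only hurt, a simple instance of a case where no regularization is best.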


Relative Loss Bounds for Multidimensional Regression Problems

Neural Information Processing Systems

We study online generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax function that need to consider the linear activations to all the output neurons. We use distance functions of a certain kind in two completely independent roles in deriving and analyzing online learning algorithms for such tasks. We use one distance function to define a matching loss function for the (possibly multidimensional) transfer function, which allows us to generalize earlier results from one-dimensional to multidimensional outputs. We use another distance function as a tool for measuring progress made by the online updates. This shows how previously studied algorithms such as gradient descent and exponentiated gradient fit into a common framework. We evaluate the performance of the algorithms using relative loss bounds that compare the loss of the online algorithm to the best off-line predictor from the relevant model class, thus completely eliminating probabilistic assumptions about the data.
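The two update rules named above can be compared side by side in their simplest form. This sketch assumes a single linear output, the squared loss 0.5 (prediction - y)^2, and, for exponentiated gradient, weights kept on the probability simplex; the multidimensional and softmax cases treated in the paper are not reproduced here, and the function names are hypothetical.

```python
import math

def predict(w, x):
    """Linear prediction w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def gd_step(w, x, y, eta):
    """Gradient descent: additive update along the negative gradient."""
    err = predict(w, x) - y
    return [wi - eta * err * xi for wi, xi in zip(w, x)]

def eg_step(w, x, y, eta):
    """Exponentiated gradient: multiplicative update, then renormalize."""
    err = predict(w, x) - y
    unnorm = [wi * math.exp(-eta * err * xi) for wi, xi in zip(w, x)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

x, y = [1.0, 2.0], 1.0
w_gd = gd_step([0.0, 0.0], x, y, eta=0.1)
w_eg = eg_step([0.5, 0.5], x, y, eta=0.1)
```

Gradient descent moves the weights additively, while exponentiated gradient rescales each weight by an exponential factor and keeps the weights positive and summing to one. Both use the same gradient, err * x.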


Boltzmann Machine Learning Using Mean Field Theory and Linear Response Correction

Neural Information Processing Systems

We present a new approximate learning algorithm for Boltzmann Machines, using a systematic expansion of the Gibbs free energy to second order in the weights. The linear response correction to the correlations is given by the Hessian of the Gibbs free energy. The computational complexity of the algorithm is cubic in the number of neurons. We compare the performance of the exact BM learning algorithm with first order (Weiss) mean field theory and second order (TAP) mean field theory. The learning task consists of a fully connected Ising spin glass model on 10 neurons. We conclude that 1) the method works well for paramagnetic problems, 2) the TAP correction gives a significant improvement over the Weiss mean field theory, both for paramagnetic and spin glass problems, and 3) the inclusion of diagonal weights improves the Weiss approximation for paramagnetic problems, but not for spin glass problems.
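As background for the first order (Weiss) approximation, the mean field magnetizations satisfy the self-consistency equations m_i = tanh(theta_i + sum_j w_ij m_j). The sketch below solves them by damped fixed-point iteration. It assumes +/-1 units, a symmetric weight matrix and no diagonal terms; the solver, its parameters and the toy values are illustrative, and neither the TAP term nor the linear response correction is included.

```python
import math

def weiss_mean_field(w, theta, iters=200, damping=0.5):
    """Damped fixed-point iteration for m_i = tanh(theta_i + sum_j w_ij m_j)."""
    n = len(theta)
    m = [0.0] * n
    for _ in range(iters):
        target = [
            math.tanh(theta[i] + sum(w[i][j] * m[j] for j in range(n) if j != i))
            for i in range(n)
        ]
        # Mix old and new values to stabilize the iteration.
        m = [damping * old + (1.0 - damping) * new for old, new in zip(m, target)]
    return m

# Toy two-unit system with weak coupling (assumed values).
w = [[0.0, 0.2], [0.2, 0.0]]
theta = [0.1, -0.3]
m = weiss_mean_field(w, theta)

# With zero weights, each unit reduces to m_i = tanh(theta_i).
m_free = weiss_mean_field([[0.0, 0.0], [0.0, 0.0]], theta)
```

For weak couplings the iteration is a contraction and converges quickly; strongly coupled (spin glass like) systems may need more damping or may have several fixed points.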


Selecting Weighting Factors in Logarithmic Opinion Pools

Neural Information Processing Systems

A simple linear averaging of the outputs of several networks, as e.g. in bagging [3], seems to follow naturally from a bias/variance decomposition of the sum-squared error. The sum-squared error of the average model is a quadratic function of the weighting factors assigned to the networks in the ensemble [7], suggesting a quadratic programming algorithm for finding the "optimal" weighting factors. If we interpret the output of a network as a probability statement, the sum-squared error corresponds to minus the log-likelihood or the Kullback-Leibler divergence, and linear averaging of the outputs to logarithmic averaging of the probability statements: the logarithmic opinion pool. The crux of this paper is that this whole story about model averaging, bias/variance decompositions, and quadratic programming to find the optimal weighting factors, is not specific to the sum-squared error, but applies to the combination of probability statements of any kind in a logarithmic opinion pool, as long as the Kullback-Leibler divergence plays the role of the error measure. As examples we treat model averaging for classification models under a cross-entropy error measure and models for estimating variances.
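The difference between linear and logarithmic averaging is easy to see on discrete distributions. In the sketch below, the logarithmic opinion pool is taken to be the normalized weighted geometric mean, p(k) proportional to the product over models of p_a(k)^{w_a}, with weights assumed nonnegative and summing to one and all probabilities assumed strictly positive. The function names and example numbers are illustrative.

```python
import math

def log_opinion_pool(dists, weights):
    """Normalized weighted geometric mean of discrete distributions."""
    n = len(dists[0])
    log_p = [
        sum(w * math.log(d[k]) for w, d in zip(weights, dists))
        for k in range(n)
    ]
    top = max(log_p)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def linear_pool(dists, weights):
    """Weighted arithmetic mean of discrete distributions."""
    n = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(n)]

dists = [[0.5, 0.5], [0.9, 0.1]]
weights = [0.5, 0.5]
pooled_log = log_opinion_pool(dists, weights)
pooled_lin = linear_pool(dists, weights)
```

With these numbers the logarithmic pool gives 0.75 for the first class and the linear pool gives 0.7: the geometric mean penalizes a class more strongly when any single model assigns it low probability.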


Generalization in Decision Trees and DNF: Does Size Matter?

Neural Information Processing Systems

Recent theoretical results for pattern classification with thresholded real-valued functions (such as support vector machines, sigmoid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions).