Goto

Collaborating Authors

 Computational Learning Theory


Can neural networks do better than the Vapnik-Chervonenkis bounds?

Neural Information Processing Systems

These experiments are designed to test whether average generalization performance can surpass the worst-case bounds obtained from formal learning theory using the Vapnik-Chervonenkis dimension (Blumer et al., 1989). We indeed find that, in some cases, the average generalization is significantly better than the VC bound: the approach to perfect performance is exponential in the number of examples m, rather than the 11m result of the bound. In other cases, we do find the 11m behavior of the VC bound, and in these cases, the numerical prefactor is closely related to prefactor contained in the bound.


Learning Time-varying Concepts

Neural Information Processing Systems

This work extends computational learning theory to situations in which concepts vary over time, e.g., system identification of a time-varying plant. We have extended formal definitions of concepts and learning to provide a framework in which an algorithm can track a concept as it evolves over time. Given this framework and focusing on memory-based algorithms, we have derived some PAC-style sample complexity results that determine, for example, when tracking is feasible. We have also used a similar framework and focused on incremental tracking algorithms for which we have derived some bounds on the mistake or error rates for some specific concept classes.


Chaitin-Kolmogorov Complexity and Generalization in Neural Networks

Neural Information Processing Systems

We present a unified framework for a number of different ways of failing to generalize properly. The complexity of the function computed is therefore increased, and generalization is degraded. We analyze replicated networks, in which a number of identical networks are independently trained on the same data and their results averaged. We conclude that replication almost always results in a decrease in the expected complexity of the network, and that replication therefore increases expected generalization. Simulations confirming the effect are also presented.


Polynomial Uniform Convergence of Relative Frequencies to Probabilities

Neural Information Processing Systems

We define the concept of polynomial uniform convergence of relative frequencies to probabilities in the distribution-dependent context. Let Xn {O, l}n, let Pn be a probability distribution on Xn and let Fn C 2X,. be a family of events. The family {(Xn, Pn, Fn)}n l has the property of polynomial uniform convergence if the probability that the maximum difference (over Fn) between the relative frequency and the probabil(cid:173) ity of an event exceed a given positive e be at most 6 (0 6 1), when the sample on which the frequency is evaluated has size polynomial in n,l/e,l/b. Given at-sample (Xl, ...,Xt), let C t)(XI, ...,Xt) be the Vapnik-Chervonenkis dimension of the family {{x}, ...,xtl n f I f E Fn} and M(n, t) the expectation E(C t) It). Applications to distribution-dependent PAC learning are discussed.


The VC-Dimension versus the Statistical Capacity of Multilayer Networks

Neural Information Processing Systems

A general relationship is developed between the VC-dimension and the statistical lower epsilon-capacity which shows that the VC-dimension can be lower bounded (in order) by the statistical lower epsilon-capacity of a network trained with random samples. This relationship explains quan(cid:173) titatively how generalization takes place after memorization, and relates the concept of generalization (consistency) with the capacity of the optimal classifier over a class of classifiers with the same structure and the capacity of the Bayesian classifier. Furthermore, it provides a general methodology to evaluate a lower bound for the VC-dimension of feedforward multilayer neural networks. This general methodology is applied to two types of networks which are important for hardware implementations: two layer (N - 2L - 1) net(cid:173) works with binary weights, integer thresholds for the hidden units and zero threshold for the output unit, and a single neuron ((N - 1) net(cid:173) works) with binary weigths and a zero threshold. Here W is the total number of weights of the (N - 2L - 1) networks.


Automatic Capacity Tuning of Very Large VC-Dimension Classifiers

Neural Information Processing Systems

Large VC-dimension classifiers can learn difficult tasks, but are usually impractical because they generalize well only if they are trained with huge quantities of data. In this paper we show that even high-order polynomial classifiers in high dimensional spaces can be trained with a small amount of training data and yet generalize better than classifiers with a smaller VC-dimension. This is achieved with a maximum margin algorithm (the Generalized Portrait). The technique is applicable to a wide variety of classifiers, including Perceptrons, polynomial classifiers (sigma-pi unit net(cid:173) works) and Radial Basis Functions. The effective number of parameters is adjusted automatically by the training algorithm to match the complexity of the problem.


Developing Population Codes by Minimizing Description Length

Neural Information Processing Systems

The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representa(cid:173) tion that is cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center ofthis bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump.


Autoencoders, Minimum Description Length and Helmholtz Free Energy

Neural Information Processing Systems

An autoencoder network uses a set of recognition weights to convert an input vector into a code vector. It then uses a set of generative weights to convert the code vector into an approximate reconstruction of the input vector. We derive an objective function for training autoencoders based on the Minimum Description Length (MDL) principle. The aim is to minimize the information required to describe both the code vector and the reconstruction error. We show that this information is minimized by choosing code vectors stochastically according to a Boltzmann distri(cid:173) bution, where the generative weights define the energy of each possible code vector given the input vector.


Neural Networks with Quadratic VC Dimension

Neural Information Processing Systems

This paper shows that neural networks which use continuous acti(cid:173) vation functions have VC dimension at least as large as the square of the number of weights w. This result settles a long-standing open question, namely whether the well-known O( w log w) bound, known for hard-threshold nets, also held for more general sigmoidal nets. Implications for the number of samples needed for valid gen(cid:173) eralization are discussed.


A Convergence Proof for the Softassign Quadratic Assignment Algorithm

Neural Information Processing Systems

The softassign quadratic assignment algorithm has recently emerged as an effective strategy for a variety of optimization prob(cid:173) lems in pattern recognition and combinatorial optimization. While the effectiveness of the algorithm was demonstrated in thousands of simulations, there was no known proof of convergence. Here, we provide a proof of convergence for the most general form of the algorithm.