Statistical Learning
From Isolation to Cooperation: An Alternative View of a System of Experts
Schaal, Stefan, Atkeson, Christopher G.
We introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Each expert is trained by minimizing a penalized local cross validation error using second order methods. In this way, an expert is able to find a local distance metric by adjusting the size and shape of the receptive field in which its predictions are valid, and also to detect relevant input features by adjusting its bias on the importance of individual input dimensions. We derive asymptotic results for our method. In a variety of simulations the properties of the algorithm are demonstrated with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning.
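A minimal sketch of the cooperative prediction step described above, assuming Gaussian receptive fields and already-trained local linear experts; all names and shapes here are illustrative, not the authors' implementation:

```python
import numpy as np

def blended_prediction(x, centers, metrics, betas):
    """Blend the predictions of independent local linear experts at query x.

    centers[k]: receptive-field center c_k, shape (d,)
    metrics[k]: positive-definite distance metric D_k, shape (d, d)
    betas[k]:   local linear model [intercept, slope], shape (d + 1,)
    """
    num, den = 0.0, 0.0
    for c, D, beta in zip(centers, metrics, betas):
        diff = x - c
        # Gaussian receptive field: activation decays with Mahalanobis distance,
        # so each expert only contributes where its predictions are valid
        w = np.exp(-0.5 * diff @ D @ diff)
        # the expert's own locally linear prediction
        y_k = beta[0] + beta[1:] @ diff
        num += w * y_k
        den += w
    return num / den
```

In this scheme the experts never interact during training; the metric D_k (learned by each expert via the penalized local cross-validation described above) controls both the size and shape of the blending weights at query time.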
An Information-theoretic Learning Algorithm for Neural Network Classification
Miller, David J., Rao, Ajit V., Rose, Kenneth, Gersho, Allen
A new learning algorithm is developed for the design of statistical classifiers minimizing the rate of misclassification. The method, which is based on ideas from information theory and analogies to statistical physics, assigns data to classes in probability. The distributions are chosen to minimize the expected classification error while simultaneously enforcing the classifier's structure and a level of "randomness" measured by Shannon's entropy. Achievement of the classifier structure is quantified by an associated cost. The constrained optimization problem is equivalent to the minimization of a Helmholtz free energy, and the resulting optimization method is a basic extension of the deterministic annealing algorithm that explicitly enforces structural constraints on assignments while reducing the entropy and expected cost with temperature. In the limit of low temperature, the error rate is minimized directly and a hard classifier with the requisite structure is obtained. This learning algorithm can be used to design a variety of classifier structures. The approach is compared with standard methods for radial basis function design and is demonstrated to substantially outperform other design methods on several benchmark examples, while often retaining design complexity comparable to, or only moderately greater than, that of strict descent-based methods.
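A sketch of the probabilistic assignment at the heart of the deterministic annealing view, assuming a given matrix of per-class costs; the annealing schedule and the structural constraints of the full method are omitted:

```python
import numpy as np

def soft_assignments(costs, T):
    """Gibbs-distributed class assignments at temperature T.

    costs: (n_samples, n_classes) array of per-class assignment costs.
    Returns p(c|x) proportional to exp(-cost / T); higher T means higher
    entropy (more "randomness"), and as T -> 0 the distribution hardens
    toward the minimum-cost, i.e. hard, classifier.
    """
    logits = -costs / T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
    # a full design would lower T on a schedule, re-optimizing the
    # classifier parameters at each temperature
```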
Prediction of Beta Sheets in Proteins
Krogh, Anders, Riis, Soren Kamaric
Most current methods for prediction of protein secondary structure use a small window of the protein sequence to predict the structure of the central amino acid. We describe a new method for prediction of the non-local structure called β-sheet, which consists of two or more β-strands that are connected by hydrogen bonds. Since β-strands are often widely separated in the protein chain, a network with two windows is introduced. After training on a set of proteins the network predicts the sheets well, but there are many false positives. By using a global energy function the β-sheet prediction is combined with a local prediction of the three secondary structures α-helix, β-strand and coil.
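The two-window idea can be sketched as follows, assuming a one-hot encoded sequence and a single hidden layer; the weights here are random placeholders, and the actual network architecture, training procedure, and energy function are not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)
win, n_amino, n_hidden = 3, 20, 8          # window size, alphabet size, hidden units
W1 = rng.normal(size=(n_hidden, 2 * win * n_amino))
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=n_hidden)
b2 = 0.0

def pair_score(seq_onehot, i, j, half=win // 2):
    """Score whether the residues around positions i and j pair in a beta sheet.

    Two windows of the one-hot encoded sequence, one centered on each
    position, are concatenated and passed through a one-hidden-layer net,
    so widely separated strands can be scored jointly.
    """
    x = np.concatenate([seq_onehot[i - half:i + half + 1].ravel(),
                        seq_onehot[j - half:j + half + 1].ravel()])
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))  # pairing probability

seq = np.eye(n_amino)[rng.integers(0, n_amino, size=30)]  # toy random sequence
print(pair_score(seq, 5, 20))
```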
EM Optimization of Latent-Variable Density Models
Bishop, Christopher M., Svensén, Markus, Williams, Christopher K. I.
There is currently considerable interest in developing general nonlinear density models based on latent, or hidden, variables. Such models have the ability to discover the presence of a relatively small number of underlying 'causes' which, acting in combination, give rise to the apparent complexity of the observed data set. Unfortunately, to train such models generally requires large computational effort. In this paper we introduce a novel latent variable algorithm which retains the general nonlinear capabilities of previous models but which uses a training procedure based on the EM algorithm. We demonstrate the performance of the model on a toy problem and on data from flow diagnostics for a multiphase oil pipeline.
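The paper's model is more structured than this, but the flavor of EM training for a latent-variable density model can be sketched with a set of latent points generating data through isotropic Gaussians, where both steps are available in closed form; all names here are illustrative:

```python
import numpy as np

def em_step(X, centers, var):
    """One EM step for an isotropic-Gaussian latent-points density model.

    Each latent point k generates data through a Gaussian of shared
    variance `var` centered at centers[k] (a stand-in for a nonlinear
    mapping of the latent variable).
    """
    n, d = X.shape
    # E-step: responsibilities r[i, k] = p(latent point k | x_i)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    log_r = -0.5 * d2 / var
    log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate centers and the shared variance in closed form
    centers = (r.T @ X) / r.sum(axis=0)[:, None]
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    var = (r * d2).sum() / (n * d)
    return centers, var
```

Iterating em_step to convergence monotonically increases the data likelihood, which is the computational appeal of EM over generic gradient-based training of such models.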
Softassign versus Softmax: Benchmarks in Combinatorial Optimization
Gold, Steven, Rangarajan, Anand
A new technique, termed softassign, is applied for the first time to two classic combinatorial optimization problems, the traveling salesman problem and graph partitioning. Softassign, which has emerged from the recurrent neural network/statistical physics framework, enforces two-way (assignment) constraints without the use of penalty terms in the energy functions. The softassign can also be generalized from two-way winner-take-all constraints to the multiple membership constraints which are required for graph partitioning. The softassign technique is compared to the softmax (Potts glass). Within the statistical physics framework, softmax with a penalty term has been a widely used method for enforcing the two-way constraints common to many combinatorial optimization problems. The benchmarks present evidence that softassign has clear advantages in accuracy, speed, parallelizability, and algorithmic simplicity over softmax with a penalty term in optimization problems with two-way constraints.
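A sketch of the core softassign operation, assuming a given benefit matrix and a fixed inverse temperature beta; in the full algorithms this sits inside an annealing loop where beta is gradually increased:

```python
import numpy as np

def softassign(Q, beta=1.0, iters=50):
    """Softassign: turn a benefit matrix Q into a (near) doubly
    stochastic assignment matrix without penalty terms.

    Exponentiation at inverse temperature beta guarantees positivity;
    alternating row and column normalizations (Sinkhorn iterations)
    then enforce both sets of two-way constraints simultaneously.
    """
    M = np.exp(beta * (Q - Q.max()))  # positive and numerically stable
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)  # rows sum to one
        M /= M.sum(axis=0, keepdims=True)  # columns sum to one
    return M
```

This is the contrast with the softmax (Potts glass) approach: softmax can normalize along only one axis, so the second constraint must be imposed through a penalty term in the energy function.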
The Gamma MLP for Speech Phoneme Recognition
Lawrence, Steve, Tsoi, Ah Chung, Back, Andrew D.
We define a Gamma multi-layer perceptron (MLP) as an MLP with the usual synaptic weights replaced by gamma filters (as proposed by de Vries and Principe, 1992) and associated gain terms throughout all layers. We derive gradient descent update equations and apply the model to the recognition of speech phonemes. We find that both the inclusion of gamma filters in all layers and the inclusion of synaptic gains improve the performance of the Gamma MLP. We compare the Gamma MLP with TDNN, Back-Tsoi FIR MLP, and Back-Tsoi IIR MLP architectures, and a local approximation scheme, and find that the Gamma MLP yields a substantial reduction in error rates.
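The gamma filter that replaces each synaptic weight can be sketched as a cascade of identical leaky integrators; the tap weights below are illustrative placeholders rather than trained parameters:

```python
import numpy as np

def gamma_filter(u, K=4, mu=0.5, w=None):
    """Gamma filter of order K applied to an input signal u.

    State update: x_k(t) = (1 - mu) * x_k(t-1) + mu * x_{k-1}(t-1),
    with x_0(t) = u(t). The single feedback parameter mu sets the
    memory depth independently of the number of taps K, which is the
    advantage over an FIR delay line.
    """
    u = np.asarray(u, dtype=float)
    if w is None:
        w = np.ones(K + 1) / (K + 1)  # illustrative tap weights
    x = np.zeros(K + 1)
    y = np.empty_like(u)
    for t, u_t in enumerate(u):
        x_prev = x.copy()
        x[0] = u_t
        for k in range(1, K + 1):
            x[k] = (1 - mu) * x_prev[k] + mu * x_prev[k - 1]
        y[t] = w @ x  # filter output is a weighted sum of the tap states
    return y
```

In the Gamma MLP both mu and the tap weights w are adapted by the gradient descent update equations, at every synapse in every layer.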
Discriminant Adaptive Nearest Neighbor Classification and Regression
Hastie, Trevor, Tibshirani, Robert
Nearest neighbor classification expects the class conditional probabilities to be locally constant, and suffers from bias in high dimensions. We propose a locally adaptive form of nearest neighbor classification to try to finesse this curse of dimensionality. We use a local linear discriminant analysis to estimate an effective metric for computing neighborhoods. We determine the local decision boundaries from centroid information, then shrink neighborhoods in directions orthogonal to these local decision boundaries and elongate them parallel to the boundaries. Thereafter, any neighborhood-based classifier can be employed, using the modified neighborhoods. We also propose a method for global dimension reduction that combines local dimension information.
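A sketch of the adaptive metric, assuming the local within-class (W) and between-class (B) covariance matrices have already been estimated from a neighborhood of the query; eps plays the role of the shrinkage parameter:

```python
import numpy as np

def dann_metric(W, B, eps=1.0):
    """Discriminant adaptive metric from local LDA covariance estimates.

    Sphering the space by W^{-1/2}, then adding eps * I to the sphered
    between-class matrix, shrinks neighborhoods orthogonal to the local
    decision boundary and elongates them parallel to it.
    """
    # symmetric inverse square root of W via eigendecomposition
    vals, vecs = np.linalg.eigh(W)
    W_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(W.shape[0])) @ W_inv_sqrt

# neighborhood distances under the adaptive metric:
# d(x, x0) = (x - x0)^T Sigma (x - x0), with Sigma = dann_metric(W, B)
```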
Gradient and Hamiltonian Dynamics Applied to Learning in Neural Networks
Howse, James W., Abdallah, Chaouki T., Heileman, Gregory L.
The process of machine learning can be considered in two stages: model selection and parameter estimation. In this paper a technique is presented for constructing dynamical systems with desired qualitative properties. The approach is based on the fact that an n-dimensional nonlinear dynamical system can be decomposed into one gradient and (n - 1) Hamiltonian systems. Thus, the model selection stage consists of choosing the gradient and Hamiltonian portions appropriately so that a certain behavior is obtainable. To estimate the parameters, a stably convergent learning rule is presented.
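A planar (n = 2) sketch of such a decomposed vector field, with the potential and Hamiltonian chosen purely for illustration:

```python
import numpy as np

def vector_field(x, grad_V, grad_H):
    """Planar example of a gradient-plus-Hamiltonian vector field.

    The gradient part -grad_V(x) dissipates the potential V and drives
    convergence; the Hamiltonian part S @ grad_H(x), with S skew-symmetric,
    conserves H and shapes the trajectories without destroying stability.
    """
    S = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])  # skew-symmetric structure matrix
    return -grad_V(x) + S @ grad_H(x)

# illustrative quadratic choices: V(x) = H(x) = 0.5 * ||x||^2
x = np.array([1.0, 0.5])
dx = vector_field(x, grad_V=lambda x: x, grad_H=lambda x: x)
print(dx)  # spiral toward the origin: dissipation plus rotation
```

Choosing V and H is the model selection stage in this view; fitting their parameters with a stably convergent rule is the estimation stage.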