
 Statistical Learning


Robust Parameter Estimation and Model Selection for Neural Network Regression

Neural Information Processing Systems

In this paper, it is shown that the conventional back-propagation (BPP) algorithm for neural network regression is robust to leverages (data with x corrupted), but not to outliers (data with y corrupted). A robust alternative is to model the error as a mixture of normal distributions. The influence function for this mixture model is calculated, and the condition for the model to be robust to outliers is given. The EM algorithm [5] is used to estimate the parameters. The usefulness of model selection criteria is also discussed.
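As a rough illustration of the mixture-of-normals error model, the sketch below fits a straight line by EM. It is not the paper's method: a linear model stands in for the neural network, and the mixture parameters `sigma_in`, `sigma_out` and `eps` are assumed known and fixed at illustrative values. All function and parameter names are hypothetical.

```python
import math

def normal_pdf(r, sigma):
    """Density of a zero-mean normal with standard deviation sigma."""
    return math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def fit_robust_line(xs, ys, sigma_in=1.0, sigma_out=10.0, eps=0.1, n_iter=30):
    """EM fit of y = a + b*x with two-component normal-mixture errors.

    Assumption: sigma_in, sigma_out and eps are known (illustrative values).
    Returns (a, b, resp); resp[i] is the posterior probability that
    point i was generated by the narrow (inlier) component.
    """
    resp = [1.0] * len(xs)  # the first M-step is then ordinary least squares
    a = b = 0.0
    for _ in range(n_iter):
        # M-step: weighted least squares; each point is weighted by the
        # expected precision of its error component.
        w = [r / sigma_in ** 2 + (1.0 - r) / sigma_out ** 2 for r in resp]
        sw = sum(w)
        sx = sum(wi * x for wi, x in zip(w, xs))
        sy = sum(wi * y for wi, y in zip(w, ys))
        sxx = sum(wi * x * x for wi, x in zip(w, xs))
        sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
        b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
        a = (sy - b * sx) / sw
        # E-step: posterior probability of the inlier component per residual.
        resp = []
        for x, y in zip(xs, ys):
            r = y - (a + b * x)
            p_in = (1.0 - eps) * normal_pdf(r, sigma_in)
            p_out = eps * normal_pdf(r, sigma_out)
            resp.append(p_in / (p_in + p_out))
    return a, b, resp
```

A point whose y value is badly corrupted ends up with a responsibility near zero, so it contributes to the fit only with the small weight 1/sigma_out**2, which is the sense in which the mixture model limits the influence of outliers.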


Locally Adaptive Nearest Neighbor Algorithms

Neural Information Processing Systems

Four versions of a k-nearest neighbor algorithm with locally adaptive k are introduced and compared to the basic k-nearest neighbor algorithm (kNN). Locally adaptive kNN algorithms choose the value of k that should be used to classify a query by consulting the results of cross-validation computations in the local neighborhood of the query. Local kNN methods are shown to perform similarly to kNN in experiments with twelve commonly used data sets. Encouraging results in three constructed tasks show that local methods can significantly outperform kNN in specific applications. Local methods can be recommended for online learning and for applications where different regions of the input space are covered by patterns solving different sub-tasks.
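The idea of choosing k from local cross-validation can be sketched as follows. This is one hypothetical variant, not any of the four versions from the paper: it scores each candidate k by leave-one-out accuracy on the query's nearest training points and then classifies the query with the best-scoring k. The names `candidate_ks` and `neighborhood` are assumptions made for the example.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def knn_predict(points, labels, query, k, skip=None):
    """Plain kNN majority vote; `skip` excludes one training index."""
    idx = [i for i in range(len(points)) if i != skip]
    idx.sort(key=lambda i: sq_dist(points[i], query))
    votes = Counter(labels[i] for i in idx[:k])
    return votes.most_common(1)[0][0]

def local_knn_predict(points, labels, query, candidate_ks=(1, 3, 5), neighborhood=7):
    """Pick k by leave-one-out accuracy near the query, then classify."""
    order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], query))
    local = order[:neighborhood]  # the query's local neighborhood

    def loo_correct(k):
        # Number of local points classified correctly when each one is
        # held out of the training set in turn.
        return sum(
            knn_predict(points, labels, points[i], k, skip=i) == labels[i]
            for i in local
        )

    best_k = max(candidate_ks, key=loo_correct)  # ties go to the first candidate
    return knn_predict(points, labels, query, best_k), best_k
```

Because `best_k` is recomputed per query, different regions of the input space can end up using different values of k.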


Assessing the Quality of Learned Local Models

Neural Information Processing Systems

An approach is presented to learning high dimensional functions in the case where the learning algorithm can affect the generation of new data. A local modeling algorithm, locally weighted regression, is used to represent the learned function. Architectural parameters of the approach, such as distance metrics, are also localized and become a function of the query point instead of being global. Statistical tests are given for when a local model is good enough and sampling should be moved to a new area. Our methods explicitly deal with the case where prediction accuracy requirements exist during exploration: by gradually shifting a "center of exploration" and controlling the speed of the shift with local prediction accuracy, a goal-directed exploration of state space takes place along the fringes of the current data support until the task goal is achieved.
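For readers unfamiliar with locally weighted regression, a minimal one-dimensional sketch is given here. It assumes a Gaussian weighting kernel with a fixed, global `bandwidth`; the paper instead localizes such parameters and adds statistical tests, neither of which is reproduced. The function name and signature are hypothetical.

```python
import math

def lwr_predict(xs, ys, query, bandwidth=1.0):
    """Locally weighted linear regression prediction at `query` (1-D inputs).

    Assumption: Gaussian kernel weights with a fixed bandwidth. If the
    query lies so far from all data that every weight underflows to zero,
    the division below fails; a real implementation would guard for that.
    """
    # Weight each training point by its closeness to the query.
    w = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw  # weighted mean of x
    my = sum(wi * y for wi, y in zip(w, ys)) / sw  # weighted mean of y
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    # Evaluate the local line at the query point.
    return my + slope * (query - mx)
```

A new linear model is fitted for every query, so the learned function is represented by the stored data plus the weighting scheme rather than by a single global parameter vector.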


Unsupervised Parallel Feature Extraction from First Principles

Neural Information Processing Systems

EE., Linkoping University, S-58183 Linkoping, Sweden

Abstract

We describe a number of learning rules that can be used to train unsupervised parallel feature extraction systems. The learning rules are derived using gradient ascent of a quality function. We consider a number of quality functions that are rational functions of higher order moments of the extracted feature values. We show that one system learns the principal components of the correlation matrix. Principal component analysis systems are usually not optimal feature extractors for classification. Therefore we design quality functions which produce feature vectors that support unsupervised classification. The properties of the different systems are compared with the help of different artificially designed datasets and a database consisting of all Munsell color spectra.

1 Introduction

There are a number of unsupervised Hebbian learning algorithms (see Oja, 1992 and references therein) that perform some version of the Karhunen-Loeve expansion.


Supervised learning from incomplete data via an EM approach

Neural Information Processing Systems

Real-world learning tasks may involve high-dimensional data sets with arbitrary patterns of missing data. In this paper we present a framework based on maximum likelihood density estimation for learning from such data sets. We use mixture models for the density estimates and make two distinct appeals to the Expectation Maximization (EM) principle (Dempster et al., 1977) in deriving a learning algorithm: EM is used both for the estimation of mixture components and for coping with missing data. The resulting algorithm is applicable to a wide range of supervised as well as unsupervised learning problems.


Central and Pairwise Data Clustering by Competitive Neural Networks

Neural Information Processing Systems

Data clustering amounts to a combinatorial optimization problem to reduce the complexity of a data representation and to increase its precision. Central and pairwise data clustering are studied in the maximum entropy framework. For central clustering we derive a set of reestimation equations and a minimization procedure which yields an optimal number of clusters, their centers and their cluster probabilities. A mean-field approximation for pairwise clustering is used to estimate assignment probabilities. A self-consistent solution to multidimensional scaling and pairwise clustering is derived which yields an optimal embedding and clustering of data points in a d-dimensional Euclidean space.

1 Introduction

A central problem in information processing is the reduction of the data complexity with minimal loss in precision to discard noise and to reveal basic structure of data sets. Data clustering addresses this tradeoff by optimizing a cost function which preserves the original data as completely as possible and which simultaneously favors prototypes with minimal complexity (Linde et al., 1980; Gray, 1984; Chou et al., 1989; Rose et al., 1990). We discuss an objective function for the joint optimization of distortion errors and the complexity of a reduced data representation.
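To give a feel for reestimation in a maximum entropy setting, the sketch below alternates Gibbs-style soft assignments with weighted-mean center updates on one-dimensional data. It is an assumption-laden simplification, not the paper's equations: the number of clusters and the `temperature` are fixed by hand, squared distance is used as the distortion, and no complexity term or cluster probabilities are estimated. All names are hypothetical.

```python
import math

def soft_assignments(x, centers, temperature):
    """Assignment probabilities p(k | x) proportional to exp(-(x - c_k)^2 / T)."""
    d = [(x - c) ** 2 for c in centers]
    d_min = min(d)  # subtract the minimum so the largest exponent is zero
    w = [math.exp(-(di - d_min) / temperature) for di in d]
    total = sum(w)
    return [wi / total for wi in w]

def maxent_cluster(data, centers, temperature=1.0, n_iter=50):
    """Alternate soft assignments and center reestimation at fixed temperature."""
    centers = list(centers)
    for _ in range(n_iter):
        num = [0.0] * len(centers)
        den = [0.0] * len(centers)
        for x in data:
            p = soft_assignments(x, centers, temperature)
            for k, pk in enumerate(p):
                num[k] += pk * x
                den[k] += pk
        # Each center becomes the assignment-weighted mean of the data;
        # a center that received no weight keeps its previous position.
        centers = [n / dn if dn > 0.0 else c for n, dn, c in zip(num, den, centers)]
    return centers
```

At a low temperature the assignments become nearly hard and the update approaches k-means; at a high temperature all centers are pulled toward the global mean of the data.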




Fast Pruning Using Principal Components

Neural Information Processing Systems

In this procedure one transforms variables to a basis in which the covariance is diagonal and then projects out the low variance directions. While application of PCA to remove input variables is useful in some cases (Leen et al., 1990), there is no guarantee that low variance variables have little effect on error. We propose a saliency measure, based on PCA, that identifies those variables that have the least effect on error. Our proposed Principal Components Pruning algorithm applies this measure to obtain a simple and cheap pruning technique in the context of supervised learning.

Special Case: PCP in Linear Regression

In unbiased linear models, one can bound the bias introduced from pruning the principal degrees of freedom in the model.


Unsupervised Learning of Mixtures of Multiple Causes in Binary Data

Neural Information Processing Systems

This paper presents a formulation for unsupervised learning of clusters reflecting multiple causal structure in binary data. Unlike the standard mixture model, a multiple cause model accounts for observed data by combining assertions from many hidden causes, each of which can pertain to varying degree to any subset of the observable dimensions. A crucial issue is the mixing function for combining beliefs from different cluster-centers in order to generate data reconstructions whose errors are minimized both during recognition and learning. We demonstrate a weakness inherent to the popular weighted sum followed by sigmoid squashing, and offer an alternative form of the nonlinearity. Results are presented demonstrating the algorithm's ability to successfully discover coherent multiple causal representations of noisy test data and in images of printed characters.

1 Introduction

The objective of unsupervised learning is to identify patterns or features reflecting underlying regularities in data. Single-cause techniques, including the k-means algorithm and the standard mixture model (Duda and Hart, 1973), represent clusters of data points sharing similar patterns of 1s and 0s under the assumption that each data point belongs to, or was generated by, one and only one cluster-center; output activity is constrained to sum to 1. In contrast, a multiple-cause model permits more than one cluster-center to become fully active in accounting for an observed data vector. The advantage of a multiple cause model is that a relatively small number of hidden variables can be applied combinatorially to generate a large data set.