Practical Confidence and Prediction Intervals

Neural Information Processing Systems

We propose a new method to compute prediction intervals. Especially for small data sets, the width of a prediction interval depends not only on the variance of the target distribution, but also on the accuracy of our estimator of the mean of the target, i.e., on the width of the confidence interval. The confidence interval follows from the variation in an ensemble of neural networks, each of them trained and stopped on bootstrap replicates of the original data set. A second improvement is the use of the residuals on validation patterns instead of on training patterns for estimation of the variance of the target distribution. As illustrated on a synthetic example, our method is better than existing methods with regard to extrapolation and interpolation in data regimes with a limited amount of data, and yields prediction intervals whose actual confidence levels are closer to the desired confidence levels. 1 STATISTICAL INTERVALS In this paper we will consider feedforward neural networks for regression tasks: estimating an underlying mathematical function between input and output variables based on a finite number of data points possibly corrupted by noise.
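
The bootstrap idea in the abstract above can be sketched roughly as follows. This is only an illustration under strong simplifications, not the authors' procedure: a straight-line least-squares fit stands in for each neural network, the noise variance is taken directly as the mean squared residual on the patterns left out of each bootstrap replicate, and the names `fit_line`, `bootstrap_interval` and the Gaussian factor `z` are assumptions made for this example.

```python
import random
import statistics

def fit_line(points):
    """Least-squares fit y = a + b*x; a stand-in for training one network."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0.0:  # degenerate replicate: every x identical
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b

def bootstrap_interval(data, x_new, n_models=50, z=1.96, seed=0):
    """Return (confidence interval, prediction interval) at x_new."""
    rng = random.Random(seed)
    preds, val_sq_resid = [], []
    for _ in range(n_models):
        # Bootstrap replicate: sample indices with replacement.
        idx = [rng.randrange(len(data)) for _ in data]
        chosen = set(idx)
        a, b = fit_line([data[i] for i in idx])
        preds.append(a + b * x_new)
        # Residuals on the patterns NOT in this replicate (validation patterns).
        for i, (x, y) in enumerate(data):
            if i not in chosen:
                val_sq_resid.append((y - (a + b * x)) ** 2)
    mean = statistics.fmean(preds)
    model_var = statistics.variance(preds)  # spread of the ensemble
    noise_var = statistics.fmean(val_sq_resid) if val_sq_resid else 0.0
    half_conf = z * model_var ** 0.5
    half_pred = z * (model_var + noise_var) ** 0.5
    return (mean - half_conf, mean + half_conf), (mean - half_pred, mean + half_pred)
```

The prediction interval is never narrower than the confidence interval, because it adds the estimated noise variance to the ensemble variance; with few data points the ensemble term is what widens both.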


For Valid Generalization the Size of the Weights is More Important than the Size of the Network

Neural Information Processing Systems

Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networks perform well, if the weights are not too big, the size of the weights determines the generalization performance. In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of {-1, 1}.


Statistically Efficient Estimations Using Cortical Lateral Connections

Neural Information Processing Systems

Coarse codes are widely used throughout the brain to encode sensory and motor variables. Methods designed to interpret these codes, such as population vector analysis, are either inefficient, i.e., the variance of the estimate is much larger than the smallest possible variance, or biologically implausible, like maximum likelihood. Moreover, these methods attempt to compute a scalar or vector estimate of the encoded variable. Neurons are faced with a similar estimation problem. They must read out the responses of the presynaptic neurons, but, by contrast, they typically encode the variable with a further population code rather than as a scalar. We show how a nonlinear recurrent network can be used to perform these estimations in an optimal way while keeping the estimate in a coarse code format. This work suggests that lateral connections in the cortex may be involved in cleaning up uncorrelated noise among neurons representing similar variables.


Complex-Cell Responses Derived from Center-Surround Inputs: The Surprising Power of Intradendritic Computation

Neural Information Processing Systems

Biophysical modeling studies have previously shown that cortical pyramidal cells driven by strong NMDA-type synaptic currents and/or containing dendritic voltage-dependent Ca or Na channels respond more strongly when synapses are activated in several spatially clustered groups of optimal size, in comparison to the same number of synapses activated diffusely about the dendritic arbor [8]. The nonlinear intradendritic interactions giving rise to this "cluster sensitivity" property are akin to a layer of virtual nonlinear "hidden units" in the dendrites, with implications for the cellular basis of learning and memory [7, 6], and for certain classes of nonlinear sensory processing [8]. In the present study, we show that a single neuron, with access only to excitatory inputs from unoriented ON- and OFF-center cells in the LGN, exhibits the principal nonlinear response properties of a "complex" cell in primary visual cortex, namely orientation tuning coupled with translation invariance and contrast insensitivity. We conjecture that this type of intradendritic processing could explain how complex cell responses can persist in the absence of oriented simple cell input [13].

B. W. Mel, D. L. Ruderman and K. A. Archie


Learning Decision Theoretic Utilities through Reinforcement Learning

Neural Information Processing Systems

Probability models can be used to predict outcomes and compensate for missing data, but even a perfect model cannot be used to make decisions unless the utility of the outcomes, or preferences between them, are also provided. This arises in many real-world problems, such as medical diagnosis, where the cost of the test as well as the expected improvement in the outcome must be considered. Relatively little work has been done on learning the utilities of outcomes for optimal decision making. In this paper, we show how temporal-difference reinforcement learning (TD(λ)) can be used to determine decision theoretic utilities within the context of a mixture model and apply this new approach to a problem in medical diagnosis. TD(λ) learning of utilities reduces the number of tests that have to be done to achieve the same level of performance compared with the probability model alone, which results in significant cost savings and increased efficiency.


The Effect of Correlated Input Data on the Dynamics of Learning

Neural Information Processing Systems

The convergence properties of the gradient descent algorithm in the case of the linear perceptron may be obtained from the response function. We derive a general expression for the response function and apply it to the case of data with simple input correlations. It is found that correlations may severely slow down learning. This explains the success of PCA as a method for reducing training time. Motivated by this finding we furthermore propose to transform the input data by removing the mean across input variables as well as examples to decrease correlations. Numerical findings for a medical classification problem are in fine agreement with the theoretical results. 1 INTRODUCTION Learning and generalization are important areas of research within the field of neural networks. Although good generalization is the ultimate goal in feed-forward networks (perceptrons), it is of practical importance to understand the mechanisms which control the amount of time required for learning, i.e., the dynamics of learning. This is of course particularly important in the case of a large data set. An exact analysis of this mechanism is possible for the linear perceptron and as usual it is hoped that the results to some extent may be carried over to explain the behaviour of nonlinear perceptrons.
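
The effect of mean removal can be illustrated with a small numerical sketch. It rests on the standard fact that gradient descent on a linear model with squared error converges at a rate governed by the spread of eigenvalues of the input second-moment matrix (1/N) Σ x xᵀ. The data, the helper names and the choice of two input variables are assumptions made for this example, and only the mean across examples is removed here, which is a simplification of the transform the abstract proposes.

```python
import math
import random

def second_moment(points):
    """Entries (a, b, c) of the 2x2 matrix (1/N) sum x x^T = [[a, b], [b, c]]."""
    n = len(points)
    a = sum(x * x for x, _ in points) / n
    b = sum(x * y for x, y in points) / n
    c = sum(y * y for _, y in points) / n
    return a, b, c

def condition_number(a, b, c):
    """Ratio of the largest to the smallest eigenvalue of [[a, b], [b, c]]."""
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b * b)
    return (half_trace + root) / (half_trace - root)

def center(points):
    """Subtract the per-variable mean taken over all examples."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return [(x - mx, y - my) for x, y in points]

# Hypothetical inputs: unit-variance noise around a large common offset,
# which makes the two input variables strongly correlated in the raw data.
rng = random.Random(0)
raw = [(5.0 + rng.gauss(0, 1), 5.0 + rng.gauss(0, 1)) for _ in range(500)]

cond_raw = condition_number(*second_moment(raw))
cond_centered = condition_number(*second_moment(center(raw)))
print(f"condition number raw: {cond_raw:.1f}, centered: {cond_centered:.2f}")
```

A large condition number forces a small learning rate along the dominant direction and hence slow progress along the others; centering removes the dominant direction created by the shared offset.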


Blind Separation of Delayed and Convolved Sources

Neural Information Processing Systems

We address the difficult problem of separating multiple speakers with multiple microphones in a real room. We combine the work of Torkkola and Amari, Cichocki and Yang, to give Natural Gradient information maximisation rules for recurrent (IIR) networks and for blindly deconvolving mixed signals.


Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs

Neural Information Processing Systems

In supervised learning there is usually a clear distinction between inputs and outputs - inputs are what you will measure, outputs are what you will predict from those measurements. This paper shows that the distinction between inputs and outputs is not this simple. Some features are more useful as extra outputs than as inputs. By using a feature as an output we get more than just the case values but can learn a mapping from the other inputs to that feature. For many features this mapping may be more useful than the feature value itself. We present two regression problems and one classification problem where performance improves if features that could have been used as inputs are used as extra outputs instead.


Text-Based Information Retrieval Using Exponentiated Gradient Descent

Neural Information Processing Systems

The following investigates the use of single-neuron learning algorithms to improve the performance of text-retrieval systems that accept natural-language queries. A retrieval process is explained that transforms the natural-language query into the query syntax of a real retrieval system: the initial query is expanded using statistical and learning techniques and is then used for document ranking and binary classification. The results of experiments suggest that Kivinen and Warmuth's Exponentiated Gradient Descent learning algorithm works significantly better than previous approaches. 1 Introduction The following work explores two learning algorithms - Least Mean Squared (LMS) [1] and Exponentiated Gradient Descent (EG) [2] - in the context of text-based Information Retrieval (IR) systems. The experiments presented in [3] use connectionist learning models to improve the retrieval of relevant documents from a large collection of text. Previous work in the area employs various techniques for improving retrieval [6, 7, 14].
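
For orientation, the two update rules named above can be sketched for a single linear neuron with squared loss. This is a generic illustration of LMS and of the normalized EG update, not the retrieval system described in the paper; the toy data, the learning rate and the function names are assumptions made for this example.

```python
import math

def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def lms_step(weights, x, y, eta):
    """Additive (gradient descent) update: w_i <- w_i - 2*eta*(y_hat - y)*x_i."""
    err = predict(weights, x) - y
    return [w - 2.0 * eta * err * xi for w, xi in zip(weights, x)]

def eg_step(weights, x, y, eta):
    """Multiplicative update: w_i <- w_i * exp(-2*eta*(y_hat - y)*x_i), renormalized."""
    err = predict(weights, x) - y
    unnorm = [w * math.exp(-2.0 * eta * err * xi) for w, xi in zip(weights, x)]
    total = sum(unnorm)
    return [u / total for u in unnorm]  # weights stay positive and sum to 1

def train_eg(data, n_features, eta=0.5, epochs=200):
    """Run EG from uniform weights over repeated passes through the data."""
    weights = [1.0 / n_features] * n_features
    for _ in range(epochs):
        for x, y in data:
            weights = eg_step(weights, x, y, eta)
    return weights

# Hypothetical toy task: only the first of three binary features is relevant.
toy = [([1, 0, 0], 1), ([0, 1, 0], 0), ([0, 0, 1], 0), ([1, 1, 0], 1), ([1, 0, 1], 1)]
eg_weights = train_eg(toy, 3)
```

The practical difference is that EG rescales weights multiplicatively and keeps them on the probability simplex, which tends to suit problems where few of many features matter, whereas LMS moves weights additively.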


Clustering via Concave Minimization

Neural Information Processing Systems

If a polyhedral distance is used, the problem can be formulated as that of minimizing a piecewise-linear concave function on a polyhedral set which is shown to be equivalent to a bilinear program: minimizing a bilinear function on a polyhedral set. A fast finite k-Median Algorithm consisting of solving few linear programs in closed form leads to a stationary point of the bilinear program. Computational testing on a number of real-world databases was carried out. On the Wisconsin Diagnostic Breast Cancer (WDBC) database, k-Median training set correctness was comparable to that of the k-Mean Algorithm; however, its testing set correctness was better. Additionally, on the Wisconsin Prognostic Breast Cancer (WPBC) database, distinct and clinically important survival curves were extracted by the k-Median Algorithm, whereas the k-Mean Algorithm failed to obtain such distinct survival curves for the same database.
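
A k-Median iteration of the kind described above might look like the following sketch. It assumes the 1-norm as the polyhedral distance, in which case the closed-form center update is the coordinate-wise median of each cluster; the function names, the stopping rule and the handling of empty clusters are assumptions made for this example rather than details taken from the paper.

```python
import statistics

def l1_distance(p, q):
    """1-norm distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center (in the 1-norm) for every point."""
    return [
        min(range(len(centers)), key=lambda j: l1_distance(p, centers[j]))
        for p in points
    ]

def k_median(points, centers, max_iter=100):
    """Alternate assignment and median updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Coordinate-wise median minimizes the summed 1-norm distance.
                new_centers.append(tuple(statistics.median(col) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)
```

Replacing the median with the mean and the 1-norm with the squared 2-norm turns the same loop into the k-Mean Algorithm that the abstract compares against.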