A Mean Field Algorithm for Bayes Learning in Large Feed-forward Neural Networks
Opper, Manfred, Winther, Ole
In the Bayes approach to statistical inference [Berger, 1985] one assumes that the prior uncertainty about the parameters of an unknown data generating mechanism can be encoded in a probability distribution, the so-called prior. Using the prior and the likelihood of the data given the parameters, the posterior distribution of the parameters can be derived from Bayes rule. From this posterior, various estimates for functions of the parameters, like predictions about unseen data, can be calculated. However, in general, those predictions cannot be realised by specific parameter values, but only by an ensemble average over parameters according to the posterior probability. Hence, exact implementations of Bayes method for neural networks require averages over network parameters which in general can only be performed by time-consuming Monte Carlo procedures.
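The setup can be restated compactly (the notation below is ours, not the abstract's): with a prior $p(\mathbf{w})$ over network weights and data $D$, Bayes rule gives the posterior, and a prediction is the posterior average of the network function rather than the output at any single weight vector:

```latex
p(\mathbf{w}\mid D) \;=\; \frac{p(D\mid\mathbf{w})\,p(\mathbf{w})}{\int p(D\mid\mathbf{w}')\,p(\mathbf{w}')\,d\mathbf{w}'},
\qquad
\hat{y}(\mathbf{x}) \;=\; \int f(\mathbf{x};\mathbf{w})\, p(\mathbf{w}\mid D)\, d\mathbf{w} .
```

It is the second integral that Monte Carlo methods approximate by sampling weights from the posterior, and that a mean field approach instead approximates analytically.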
Multi-Task Learning for Stock Selection
Ghosn, Joumana, Bengio, Yoshua
Artificial Neural Networks can be used to predict future returns of stocks in order to make financial decisions. Should one build a separate network for each stock or share the same network for all the stocks? In this paper we also explore intermediate alternatives, in which some layers are shared and others are not. When the predictions of future returns for different stocks are viewed as different tasks, sharing some parameters across stocks is a form of multi-task learning. In a series of experiments with Canadian stocks, we obtain yearly returns that are more than 14% above various benchmarks. 1 Introduction Previous applications of ANNs to financial time-series suggest that several of these prediction and decision-taking tasks present sufficient non-linearities to justify the use of ANNs (Refenes, 1994; Moody, Levin and Rehfuss, 1993).
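As a minimal sketch of one point in the design space the abstract describes (hard sharing of a hidden layer with per-stock output heads; all names, shapes, and the tanh nonlinearity are illustrative, not the authors' architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_features, n_hidden = 5, 10, 8

# Hard parameter sharing: one hidden layer serves every stock,
# while each stock keeps its own output head (one "task" per stock).
W_shared = rng.normal(size=(n_features, n_hidden))   # shared across all stocks
W_heads = rng.normal(size=(n_stocks, n_hidden, 1))   # one output head per stock

def predict_return(x, stock):
    """Predicted future return for one stock: shared tanh layer, stock-specific linear head."""
    h = np.tanh(x @ W_shared)            # shared representation
    return (h @ W_heads[stock]).item()   # task-specific output

x = rng.normal(size=n_features)
print([predict_return(x, s) for s in range(n_stocks)])
```

Fully separate networks and fully shared networks are the two extremes; the intermediate alternatives correspond to choosing which of these weight matrices are shared.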
Unsupervised Learning by Convex and Conic Coding
Lee, Daniel D., Seung, H. Sebastian
Unsupervised learning algorithms based on convex and conic encoders are proposed. The encoders find the closest convex or conic combination of basis vectors to the input. The learning algorithms produce basis vectors that minimize the reconstruction error of the encoders. The convex algorithm develops locally linear models of the input, while the conic algorithm discovers features. Both algorithms are used to model handwritten digits and compared with vector quantization and principal component analysis.
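A small sketch of what the encoders compute, assuming the natural least-squares formulation (our reading, not code from the paper): the conic code is a nonnegative least-squares problem, and the convex code additionally constrains the coefficients to sum to one.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
W = rng.random((16, 4))   # basis vectors as columns (hypothetical data)
x = rng.random(16)        # input to encode

# Conic encoder: closest nonnegative (conic) combination of basis vectors,
#   h* = argmin_{h >= 0} ||x - W h||, i.e. nonnegative least squares.
h_conic, residual = nnls(W, x)
print("conic code:", h_conic, "reconstruction error:", residual)

# Convex encoder: additionally require sum(h) == 1. One common trick is to
# append a heavily weighted row that enforces the sum-to-one constraint.
rho = 100.0
W_aug = np.vstack([W, rho * np.ones(W.shape[1])])
x_aug = np.append(x, rho)
h_convex, _ = nnls(W_aug, x_aug)
print("convex code:", h_convex, "sum:", h_convex.sum())
```

Vector quantization and PCA sit at the two extremes of this family: VQ allows only a single active basis vector, while PCA allows unconstrained linear combinations.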
The Effect of Correlated Input Data on the Dynamics of Learning
Halkjær, Søren, Winther, Ole
The convergence properties of the gradient descent algorithm in the case of the linear perceptron may be obtained from the response function. We derive a general expression for the response function and apply it to the case of data with simple input correlations. It is found that correlations may severely slow down learning. This explains the success of PCA as a method for reducing training time. Motivated by this finding, we furthermore propose to transform the input data by removing the mean across input variables as well as examples to decrease correlations. Numerical findings for a medical classification problem are in good agreement with the theoretical results. 1 INTRODUCTION Learning and generalization are important areas of research within the field of neural networks. Although good generalization is the ultimate goal in feed-forward networks (perceptrons), it is of practical importance to understand the mechanisms which control the amount of time required for learning, i.e. the dynamics of learning. This is of course particularly important in the case of a large data set. An exact analysis of this mechanism is possible for the linear perceptron, and as usual it is hoped that the results may to some extent be carried over to explain the behaviour of nonlinear perceptrons.
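The proposed transformation is simple to state in code; the following double-centering sketch (the function name is ours) removes the mean across input variables and across examples:

```python
import numpy as np

def double_center(X):
    """Remove the mean across input variables (columns) and examples (rows).

    X has shape (n_examples, n_inputs). Column centering followed by row
    centering leaves both sets of means at zero, reducing the simple input
    correlations analysed in the text (a sketch, not the authors' code).
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each input variable
    X = X - X.mean(axis=1, keepdims=True)   # center each example
    return X

X = np.random.default_rng(0).random((100, 20))
Xc = double_center(X)
print(np.abs(Xc.mean(axis=0)).max(), np.abs(Xc.mean(axis=1)).max())
```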
Analytical Mean Squared Error Curves in Temporal Difference Learning
Singh, Satinder P., Dayan, Peter
We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its stepsize and eligibility trace parameters. 1 INTRODUCTION A reassuring theory of asymptotic convergence is available for many reinforcement learning (RL) algorithms.
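For concreteness, here is a lookup-table TD(λ) sketch with offline (end-of-trial) updates of the kind the analysis assumes; the names `alpha` (stepsize) and `lam` (eligibility trace parameter) are ours:

```python
import numpy as np

def td_lambda_offline(episodes, n_states, alpha=0.1, lam=0.9, gamma=1.0):
    """Offline TD(lambda) with a lookup table: accumulate the updates over a
    trial and apply them only when the trial ends (sketch under assumed notation)."""
    V = np.zeros(n_states)
    for trial in episodes:                      # trial = [(s, r, s_next), ...]
        delta_V = np.zeros(n_states)
        e = np.zeros(n_states)                  # accumulating eligibility trace
        for s, r, s_next in trial:
            e *= gamma * lam
            e[s] += 1.0
            td_error = r + gamma * V[s_next] - V[s]
            delta_V += alpha * td_error * e
        V += delta_V                            # offline: update after the trial
    return V

# Tiny absorbing chain 0 -> 1 -> 2 (terminal), reward 1 on the final step.
episodes = [[(0, 0.0, 1), (1, 1.0, 2)]] * 50
print(td_lambda_offline(episodes, n_states=3))
```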
Statistical Mechanics of the Mixture of Experts
Kang, Kukjin, Oh, Jong-Hoon
We study the generalization capability of the mixture of experts learning from examples generated by another network with the same architecture. When the number of examples is smaller than a critical value, the network shows a symmetric phase where the role of the experts is not specialized. Upon crossing the critical point, the system undergoes a continuous phase transition to a symmetry breaking phase where the gating network partitions the input space effectively and each expert is assigned to an appropriate subspace. We also find that the mixture of experts with multiple levels of hierarchy shows multiple phase transitions. 1 Introduction Recently there has been considerable interest in the neural network community in techniques that integrate the collective predictions of a set of networks [1, 2, 3, 4]. The mixture of experts [1, 2] is a well known example which implements the philosophy of divide-and-conquer elegantly.
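As an illustration of the architecture under study (a generic one-level mixture of experts with linear gating and tanh experts; weights and shapes are invented, and this is not the paper's teacher-student statistical mechanics setup):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(x, V, W):
    """One-level mixture of experts: a softmax gating network weights the
    outputs of the experts.

    V: (n_experts, n_inputs) gating weights
    W: (n_experts, n_inputs) expert weights
    """
    g = softmax(V @ x)          # gating: soft partition of the input space
    y_experts = np.tanh(W @ x)  # one output per expert
    return g @ y_experts        # gated combination

rng = np.random.default_rng(0)
x = rng.normal(size=6)
V, W = rng.normal(size=(3, 6)), rng.normal(size=(3, 6))
print(mixture_of_experts(x, V, W))
```

In the symmetric phase described above, the learned experts remain interchangeable; past the phase transition, the gating weights differentiate and each expert specializes on its own region of input space.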
Viewpoint Invariant Face Recognition using Independent Component Analysis and Attractor Networks
Bartlett, Marian Stewart, Sejnowski, Terrence J.
We have explored two approaches to recognizing faces across changes in pose. First, we developed a representation of face images based on independent component analysis (ICA) and compared it to a principal component analysis (PCA) representation for face recognition. The ICA basis vectors for this data set were more spatially local than the PCA basis vectors and the ICA representation had greater invariance to changes in pose. Second, we present a model for the development of viewpoint invariant responses to faces from visual experience in a biological system. The temporal continuity of natural visual experience was incorporated into an attractor network model by Hebbian learning following a lowpass temporal filter on unit activities.
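A rough sketch of the first comparison using off-the-shelf decompositions (scikit-learn stand-ins on random placeholder data; the paper's ICA implementation and preprocessing are not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Hypothetical stand-in for the face-image matrix (n_images, n_pixels);
# the actual data set and alignment steps in the paper differ.
X = np.random.default_rng(0).random((200, 64 * 64))

pca = PCA(n_components=20).fit(X)
ica = FastICA(n_components=20, random_state=0).fit(X)

# PCA basis images are typically global; ICA basis images tend to be more
# spatially local, which the paper links to greater pose invariance.
pca_basis = pca.components_.reshape(-1, 64, 64)
ica_basis = ica.components_.reshape(-1, 64, 64)
```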
Clustering via Concave Minimization
Bradley, Paul S., Mangasarian, Olvi L., Street, W. Nick
If a polyhedral distance is used, the clustering problem can be formulated as that of minimizing a piecewise-linear concave function on a polyhedral set, which is shown to be equivalent to a bilinear program: minimizing a bilinear function on a polyhedral set. A fast finite k-Median Algorithm, consisting of solving a few linear programs in closed form, leads to a stationary point of the bilinear program.
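In the spirit of that algorithm, here is a plain alternating k-median sketch under the 1-norm (a polyhedral distance), where the closed-form center step is the coordinate-wise median; this sketch does not reproduce the authors' bilinear-programming formulation:

```python
import numpy as np

def k_median(X, k, n_iter=20, seed=0):
    """Alternating k-median clustering with the 1-norm (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center in the 1-norm.
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # Center step: the coordinate-wise median minimizes the 1-norm cost
        # for each cluster in closed form.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return centers, labels

X = np.random.default_rng(1).random((100, 2))
centers, labels = k_median(X, k=3)
```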
Local Bandit Approximation for Optimal Learning Problems
Duff, Michael O., Barto, Andrew G.
A Bayesian formulation of the problem leads to a clear concept of a solution whose computation, however, appears to entail an examination of an intractably large number of hyperstates. This paper has suggested extending the Gittins index approach (which applies with great power and elegance to the special class of multi-armed bandit processes) to general adaptive MDPs. The hope has been that if certain salient features of the value of information could be captured, even approximately, then one could be led to a reasonable method for avoiding certain defects of certainty-equivalence approaches (problems with identifiability, "metastability"). Obviously, positive evidence, in the form of empirical results from simulation experiments, would lend support to these ideas; work along these lines is underway. Local bandit approximation is but one approximate computational approach for problems of optimal learning and dual control. Most prominent in the literature of control theory is the "wide-sense" approach of [Bar-Shalom & Tse, 1976], which utilizes local quadratic approximations about nominal state/control trajectories. For certain problems, this method has demonstrated superior performance compared to a certainty-equivalence approach, but it is computationally very intensive and unwieldy, particularly for problems with controller dimension greater than one. One could revert to the view of the bandit problem, or general adaptive MDP, as simply a very large MDP defined over hyperstates, and then consider a somewhat direct approach in which one performs approximate dynamic programming with function approximation over this domain; details of function approximation, feature selection, and "training" all become important design issues.
A Hierarchical Model of Visual Rivalry
Binocular rivalry is the alternating percept that can result when the two eyes see different scenes. Recent psychophysical evidence supports an account for one component of binocular rivalry similar to that for other bistable percepts. Recent neurophysiological evidence shows that some binocular neurons are modulated with the changing percept; others are not, even if they are selective between the stimuli presented to the eyes. We extend our model to a hierarchy to address these effects. 1 Introduction Although binocular rivalry leads to distinct perceptual distress, it is revealing about the mechanisms of visual information processing. Various experiments have suggested that simple input competition cannot be the whole story.