Country
Gaussian Processes for Regression
Williams, Christopher K. I., Rasmussen, Carl Edward
The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters have been tested on a number of challenging problems and have produced excellent results. 1 INTRODUCTION In the Bayesian approach to neural networks a prior distribution over the weights induces a prior distribution over functions. This prior is combined with a noise model, which specifies the probability of observing the targets t given function values y, to yield a posterior over functions which can then be used for predictions. For neural networks the prior over functions has a complex form which means that implementations must either make approximations (e.g.
Exploiting Tractable Substructures in Intractable Networks
Saul, Lawrence K., Jordan, Michael I.
We develop a refined mean field approximation for inference and learning in probabilistic neural networks. Our mean field theory, unlike most, does not assume that the units behave as independent degrees of freedom; instead, it exploits in a principled way the existence of large substructures that are computationally tractable. To illustrate the advantages of this framework, we show how to incorporate weak higher order interactions into a first-order hidden Markov model, treating the corrections (but not the first order structure) within mean field theory. 1 INTRODUCTION Learning the parameters in a probabilistic neural network may be viewed as a problem in statistical estimation.
Boosting Decision Trees
Drucker, Harris, Cortes, Corinna
We introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Each expert is trained by minimizing a penalized local cross validation error using second order methods. In this way, an expert is able to find a local distance metric by adjusting the size and shape of the receptive field in which its predictions are valid, and also to detect relevant input features by adjusting its bias on the importance of individual input dimensions. We derive asymptotic results for our method. In a variety of simulations the properties of the algorithm are demonstrated with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning.
Factorial Hidden Markov Models
Ghahramani, Zoubin, Jordan, Michael I.
Due to the simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has emerged as one of the basic statistical tools for modeling discrete time series, finding widespread application in the areas of speech recognition (Rabiner and Juang, 1986) and computational molecular biology (Baldi et al., 1994). An HMM is essentially a mixture model, encoding information about the history of a time series in the value of a single multinomial variable (the hidden state). This multinomial assumption allows an efficient parameter estimation algorithm to be derived (the Baum-Welch algorithm). However, it also severely limits the representational capacity of HMMs.
A Smoothing Regularizer for Recurrent Neural Networks
We derive a smoothing regularizer for recurrent network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first order Tikhonov stabilizer to dynamic models. The closed-form expression of the regularizer covers both time-lagged and simultaneous recurrent nets, with feedforward nets and onelayer linear nets as special cases. We have successfully tested this regularizer in a number of case studies and found that it performs better than standard quadratic weight decay. 1 Introd uction One technique for preventing a neural network from overfitting noisy data is to add a regularizer to the error function being minimized. Regularizers typically smooth the fit to noisy data. Well-established techniques include ridge regression, see (Hoerl & Kennard 1970), and more generally spline smoothing functions or Tikhonov stabilizers that penalize the mth-order squared derivatives of the function being fit, as in (Tikhonov & Arsenin 1977), (Eubank 1988), (Hastie & Tibshirani 1990) and (Wahba 1990). Thes(-ilethods have recently been extended to networks of radial basis functions (Girosi, Jones & Poggio 1995), and several heuristic approaches have been developed for sigmoidal neural networks, for example, quadratic weight decay (Plaut, Nowlan & Hinton 1986), weight elimination (Scalettar & Zee 1988),(Chauvin 1990),(Weigend, Rumelhart & Huberman 1990) and soft weight sharing (Nowlan & Hinton 1992).
Universal Approximation and Learning of Trajectories Using Oscillators
Natural and artificial neural circuits must be capable of traversing specific state space trajectories. A natural approach to this problem is to learn the relevant trajectories from examples. Unfortunately, gradient descent learning of complex trajectories in amorphous networks is unsuccessful. We suggest a possible approach where trajectories are realized by combining simple oscillators, in various modular ways. We contrast two regimes of fast and slow oscillations. In all cases, we show that banks of oscillators with bounded frequencies have universal approximation properties. Open questions are also discussed briefly.
A Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine
A Bayesian-Kullback learning scheme, called Ying-Yang Machine, is proposed based on the two complement but equivalent Bayesian representations for joint density and their Kullback divergence. Not only the scheme unifies existing major supervised and unsupervised learnings, including the classical maximum likelihood or least square learning, the maximum information preservation, the EM & em algorithm and information geometry, the recent popular Helmholtz machine, as well as other learning methods with new variants and new results; but also the scheme provides a number of new learning models. 1 INTRODUCTION Many different learning models have been developed in the literature. We may come to an age of searching a unified scheme for them. With a unified scheme, we may understand deeply the existing models and their relationships, which may cause cross-fertilization on them to obtain new results and variants; We may also be guided to develop new learning models, after we get better understanding on which cases we have already studied or missed, which deserve to be further explored. Recently, a Baysian-Kullback scheme, called the YING-YANG Machine, has been proposed as such an effort(Xu, 1995a). It bases on the Kullback divergence and two complement but equivalent Baysian representations for the joint distribution of the input space and the representation space, instead of merely using Kullback divergence for matching un-structuralized joint densities in information geometry type learnings (Amari, 1995a&b; Byrne, 1992; Csiszar, 1975).
Symplectic Nonlinear Component Analysis
Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linear correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linear statistical dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis finally should find a factorial representation for nonlinear statistical dependent distributed signals. This paper proposes for this task a novel feed-forward, information conserving, nonlinear map - the explicit symplectic transformations. It also solves the problem of non-Gaussian output distributions by considering single coordinate higher order statistics.