Backpropagation
Training Deep Gaussian Processes using Stochastic Expectation Propagation and Probabilistic Backpropagation
Bui, Thang D., Hernández-Lobato, José Miguel, Li, Yingzhen, Hernández-Lobato, Daniel, Turner, Richard E.
Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are probabilistic and non-parametric and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep models. The focus of this paper is scalable approximate Bayesian learning of these networks. The paper develops a novel and efficient extension of probabilistic backpropagation, a state-of-the-art method for training Bayesian neural networks, that can be used to train DGPs. The new method leverages a recently proposed method for scaling Expectation Propagation, called stochastic Expectation Propagation. The method is able to automatically discover useful input warping, expansion or compression, and it is therefore is a flexible form of Bayesian kernel design. We demonstrate the success of the new method for supervised learning on several real-world datasets, showing that it typically outperforms GP regression and is never much worse.
Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks
Hernández-Lobato, José Miguel, Adams, Ryan P.
Large multilayer neural networks trained with backpropagation have recently achieved state-of-the-art results in a wide range of problems. However, using backprop for neural net learning still has some disadvantages, e.g., having to tune a large number of hyperparameters to the data, lack of calibrated probabilistic predictions, and a tendency to overfit the training data. In principle, the Bayesian approach to learning neural networks does not have these problems. However, existing Bayesian techniques lack scalability to large dataset and network sizes. In this work we present a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP). Similar to classical backpropagation, PBP works by computing a forward propagation of probabilities through the network and then doing a backward computation of gradients. A series of experiments on ten real-world datasets show that PBP is significantly faster than other techniques, while offering competitive predictive abilities. Our experiments also show that PBP provides accurate estimates of the posterior variance on the network weights.
Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights
Soudry, Daniel, Hubara, Itay, Meir, Ron
Multilayer Neural Networks (MNNs) are commonly trained using gradient descent-based methods, such as BackPropagation (BP). Inference in probabilistic graphical models is often done using variational Bayes methods, such as Expectation Propagation (EP). We show how an EP based approach can also be used to train deterministic MNNs. Specifically, we approximate the posterior of the weights given the data using a “mean-field” factorized distribution, in an online setting. Using online EP and the central limit theorem we find an analytical approximation to the Bayes update of this posterior, as well as the resulting Bayes estimates of the weights and outputs. Despite a different origin, the resulting algorithm, Expectation BackPropagation (EBP), is very similar to BP in form and efficiency. However, it has several additional advantages: (1) Training is parameter-free, given initial conditions (prior) and the MNN architecture. This is useful for large-scale problems, where parameter tuning is a major challenge. (2) The weights can be restricted to have discrete values. This is especially useful for implementing trained MNNs in precision limited hardware chips, thus improving their speed and energy efficiency by several orders of magnitude. We test the EBP algorithm numerically in eight binary text classification tasks. In all tasks, EBP outperforms: (1) standard BP with the optimal constant learning rate (2) previously reported state of the art. Interestingly, EBP-trained MNNs with binary weights usually perform better than MNNs with continuous (real) weights - if we average the MNN output using the inferred posterior.
Missing Value Imputation With Unsupervised Backpropagation
Gashler, Michael S., Smith, Michael R., Morris, Richard, Martinez, Tony
Unfortunately, real-world datasets often include only samples of observed values mixed with many missing or unknown elements. Missing values may occur due to human impatience, human error during data entry, data loss, faulty sensory equipment, changes in data collection methods, inability to decipher handwriting, privacy issues, legal requirements, and a variety of other practical factors. Thus, improvements to methods for imputing missing values can have far-reaching impact on improving the effectiveness of existing learning algorithms for operating on real-world data. We present a method for imputation called Unsupervised Backpropagation (UBP), which trains a multilayer perceptron (MLP) to fit to the manifold represented by the known features in a dataset. We demonstrate this algorithm with the task of imputing missing values, and we show that it is significantly more effective than other methods for imputation. Backpropagation has long been a popular method for training neural networks (Rumelhart et al., 1986; Werbos, 1990).
Training a Feed-forward Neural Network with Artificial Bee Colony Based Backpropagation Method
Nandy, Sudarshan, Sarkar, Partha Pratim, Das, Achintya
Back-propagation algorithm is one of the most widely used and popular techniques to optimize the feed forward neural network training. Nature inspired meta-heuristic algorithms also provide derivative-free solution to optimize complex problem. Artificial bee colony algorithm is a nature inspired meta-heuristic algorithm, mimicking the foraging or food source searching behaviour of bees in a bee colony and this algorithm is implemented in several applications for an improved optimized outcome. The proposed method in this paper includes an improved artificial bee colony algorithm based back-propagation neural network training method for fast and improved convergence rate of the hybrid neural network learning method. The result is analysed with the genetic algorithm based back-propagation method, and it is another hybridized procedure of its kind. Analysis is performed over standard data sets, reflecting the light of efficiency of proposed method in terms of convergence speed and rate.
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
Caruana, Rich, Lawrence, Steve, Giles, C. Lee
The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest tworeasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity.
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
Caruana, Rich, Lawrence, Steve, Giles, C. Lee
The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity.
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
Caruana, Rich, Lawrence, Steve, Giles, C. Lee
The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity.
Tempering Backpropagation Networks: Not All Weights are Created Equal
Schraudolph, Nicol N., Sejnowski, Terrence J.
Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for ''tempering'' each weight accordingly. In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that ofthe residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights. This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.