Gradient Descent
Learning sparsity in reservoir computing through a novel bio-inspired algorithm
Manneschi, Luca, Lin, Andrew C., Vasilaki, Eleni
The mushroom body is the key network for the representation of learned olfactory stimuli in Drosophila and insects. The sparse activity of Kenyon cells, the principal neurons in the mushroom body, plays a key role in the learned classification of different odours. In the specific case of the fruit fly, the sparseness of the network is enforced by an inhibitory feedback neuron called APL, and by an intrinsic high firing threshold of the Kenyon cells. In this work we took inspiration from the fruit fly brain to formulate a novel machine learning algorithm that is able to optimize the sparsity level of a reservoir by changing the firing thresholds of the nodes. The sparsity is only applied on the readout layer so as not to change the timescales of the reservoir and to allow the derivation of a one-layer update rule for the firing thresholds. The proposed algorithm is a combination of learning a neuron-specific sparsity threshold via gradient descent and a global sparsity threshold via a Markov chain Monte Carlo method. The proposed model outperforms the standard gradient descent, which is limited to the readout weights of the reservoir, on two example tasks. It demonstrates how the learnt sparse representation can lead to better classification performance, memorization ability and convergence time.
On the geometry of solutions and on the capacity of multi-layer neural networks with ReLU activations
Baldassi, Carlo, Malatesta, Enrico M., Zecchina, Riccardo
Rectified Linear Units (ReLU) have become the main model for the neural units in current deep learning systems. This choice has been originally suggested as a way to compensate for the so called vanishing gradient problem which can undercut stochastic gradient descent (SGD) learning in networks composed of multiple layers. Here we provide analytical results on the effects of ReLUs on the capacity and on the geometrical landscape of the solution space in two-layer neural networks with either binary or real-valued weights. We study the problem of storing an extensive number of random patterns and find that, quite unexpectedly, the capacity of the network remains finite as the number of neurons in the hidden layer increases, at odds with the case of threshold units in which the capacity diverges. Possibly more important, a large deviation approach allows us to find that the geometrical landscape of the solution space has a peculiar structure: while the majority of solutions are close in distance but still isolated, there exist rare regions of solutions which are much more dense than the similar ones in the case of threshold units. These solutions are robust to perturbations of the weights and can tolerate large perturbations of the inputs. The analytical results are corroborated by numerical findings.
Meta-descent for Online, Continual Prediction
Jacobsen, Andrew, Schlegel, Matthew, Linke, Cameron, Degris, Thomas, White, Adam, White, Martha
This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second order update---a vector approximation of the inverse Hessian. Another family of approaches use meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.
Learning Privately over Distributed Features: An ADMM Sharing Approach
Hu, Yaochen, Liu, Peng, Kong, Linglong, Niu, Di
Distributed machine learning has been widely studied in order to handle exploding amount of data. In this paper, we study an important yet less visited distributed learning problem where features are inherently distributed or vertically partitioned among multiple parties, and sharing of raw data or model parameters among parties is prohibited due to privacy concerns. We propose an ADMM sharing framework to approach risk minimization over distributed features, where each party only needs to share a single value for each sample in the training process, thus minimizing the data leakage risk. We establish convergence and iteration complexity results for the proposed parallel ADMM algorithm under non-convex loss. We further introduce a novel differentially private ADMM sharing algorithm and bound the privacy guarantee with carefully designed noise perturbation. The experiments based on a prototype system shows that the proposed ADMM algorithms converge efficiently in a robust fashion, demonstrating advantage over gradient based methods especially for data set with high dimensional feature spaces.
$\texttt{DeepSqueeze}$: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
Tang, Hanlin, Lian, Xiangru, Qiu, Shuang, Yuan, Lei, Zhang, Ce, Zhang, Tong, Liu, Ji
Communication is a key bottleneck in distributed training. Recently, an \emph{error-compensated} compression technology was particularly designed for the \emph{centralized} learning and receives huge successes, by showing significant advantages over state-of-the-art compression based methods in saving the communication cost. Since the \emph{decentralized} training has been witnessed to be superior to the traditional \emph{centralized} training in the communication restricted scenario, therefore a natural question to ask is "how to apply the error-compensated technology to the decentralized learning to further reduce the communication cost." However, a trivial extension of compression based centralized training algorithms does not exist for the decentralized scenario. key difference between centralized and decentralized training makes this extension extremely non-trivial. In this paper, we propose an elegant algorithmic design to employ error-compensated stochastic gradient descent for the decentralized scenario, named $\texttt{DeepSqueeze}$. Both the theoretical analysis and the empirical study are provided to show the proposed $\texttt{DeepSqueeze}$ algorithm outperforms the existing compression based decentralized learning algorithms. To the best of our knowledge, this is the first time to apply the error-compensated compression to the decentralized learning.
The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning
Liu, Suyun, Vicente, Luis Nunes
Optimization of conflicting functions is of paramount importance in decision making, and real world applications frequently involve data that is uncertain or unknown, resulting in multi-objective optimization (MOO) problems of stochastic type. We study the stochastic multi-gradient (SMG) method, seen as an extension of the classical stochastic gradient method for single-objective optimization. At each iteration of the SMG method, a stochastic multi-gradient direction is calculated by solving a quadratic subproblem, and it is shown that this direction is biased even when all individual gradient estimators are unbiased. We establish rates to compute a point in the Pareto front, of order similar to what is known for stochastic gradient in both convex and strongly convex cases. The analysis handles the bias in the multi-gradient and the unknown a priori weights of the limiting Pareto point. The SMG method is framed into a Pareto-front type algorithm for the computation of the entire Pareto front. The Pareto-front SMG algorithm is capable of robustly determining Pareto fronts for a number of synthetic test problems. One can apply it to any stochastic MOO problem arising from supervised machine learning, and we report results for logistic binary classification where multiple objectives correspond to distinct-sources data groups.
SGD momentum optimizer with step estimation by online parabola model
In stochastic gradient descent, especially for neural network training, there are currently dominating first order methods: not modeling local distance to minimum. This information required for optimal step size is provided by second order methods, however, they have many difficulties, starting with full Hessian having square of dimension number of coefficients. This article proposes a minimal step from successful first order momentum method toward second order: online parabola modelling in just a single direction: normalized $\hat{v}$ from momentum method. It is done by estimating linear trend of gradients $\vec{g}=\nabla F(\vec{\theta})$ in $\hat{v}$ direction: such that $g(\vec{\theta}_\bot+\theta\hat{v})\approx \lambda (\theta -p)$ for $\theta = \vec{\theta}\cdot \hat{v}$, $g= \vec{g}\cdot \hat{v}$, $\vec{\theta}_\bot=\vec{\theta}-\theta\hat{v}$. Using linear regression, $\lambda$, $p$ are MSE estimated by just updating four averages (of $g$, $\theta$, $g\theta$, $\theta^2$) in the considered direction. Exponential moving averages allow here for inexpensive online estimation, weakening contribution of the old gradients. Controlling sign of curvature $\lambda$, we can repel from saddles in contrast to attraction in standard Newton method. In the remaining directions: not considered in second order model, we can simultaneously perform e.g. gradient descent.
Learning One-hidden-layer neural networks via Provable Gradient Descent with Random Initialization
Although deep learning has shown its powerful performance in many applications, the mathematical principles behind neural networks are still mysterious. In this paper, we consider the problem of learning a one-hidden-layer neural network with quadratic activations. We focus on the under-parameterized regime where the number of hidden units is smaller than the dimension of the inputs. We shall propose to solve the problem via a provable gradient-based method with random initialization. For the non-convex neural networks training problem we reveal that the gradient descent iterates are able to enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converges to a globally optimal model at a linear rate with near-optimal sample complexity. We further corroborate our theoretical findings via various experiments.
Controlling Model Complexity in Probabilistic Model-Based Dynamic Optimization of Neural Network Structures
Saito, Shota, Shirakawa, Shinichi
A method of simultaneously optimizing both the structure of neural networks and the connection weights in a single training loop can reduce the enormous computational cost of neural architecture search. We focus on the probabilistic model-based dynamic neural network structure optimization that considers the probability distribution of structure parameters and simultaneously optimizes both the distribution parameters and connection weights based on gradient methods. Since the existing algorithm searches for the structures that only minimize the training loss, this method might find overly complicated structures. In this paper, we propose the introduction of a penalty term to control the model complexity of obtained structures. We formulate a penalty term using the number of weights or units and derive its analytical natural gradient. The proposed method minimizes the objective function injected the penalty term based on the stochastic gradient descent. We apply the proposed method in the unit selection of a fully-connected neural network and the connection selection of a convolutional neural network. The experimental results show that the proposed method can control model complexity while maintaining performance.
Quantitative $W_1$ Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential State-Dependent Noise
Cheng, Xiang, Yin, Dong, Bartlett, Peter L., Jordan, Michael I.
Stochastic Gradient Descent (SGD) is one of the workhorses of modern day machine learning. In many nonconvex optimization problems, such as training deep neural networks, SGD is able to produce solutions with good generalization error. Further, there is evidence that the generalization error of an SGD solution can be significantly better than Gradient Descent (GD) [12]. This suggests that, to understand the behavior of SGD, it is not enough to consider the limiting cases (such as small step-size or large batch-size), when it degenerates to GD. We take an alternate view of SGD as a sampling algorithm, and aim to understand its convergence to an appropriate stationary distribution.