Gradient Descent
Is Learning in Biological Neural Networks based on Stochastic Gradient Descent? An analysis using stochastic processes
Christensen, Sören, Kallsen, Jan
In recent years, there has been an intense debate about how learning in biological neural networks (BNNs) differs from learning in artificial neural networks. It is often argued that the updating of connections in the brain relies only on local information, and therefore a stochastic gradient-descent type optimization method cannot be used. In this paper, we study a stochastic model for supervised learning in BNNs. We show that a (continuous) gradient step occurs approximately when each learning opportunity is processed by many local updates. This result suggests that stochastic gradient descent may indeed play a role in optimizing BNNs.
Towards Real-time Training of Physics-informed Neural Networks: Applications in Ultrafast Ultrasound Blood Flow Imaging
Guan, Haotian, Dong, Jinping, Lee, Wei-Ning
Physics-informed Neural Network (PINN) is one of the most preeminent solvers of Navier-Stokes equations, which are widely used as the governing equation of blood flow. However, current approaches, relying on full Navier-Stokes equations, are impractical for ultrafast Doppler ultrasound, the state-of-the-art technique for depiction of complex blood flow dynamics \emph{in vivo} through acquired thousands of frames (or, timestamps) per second. In this article, we first propose a novel training framework of PINN for solving Navier-Stokes equations by discretizing Navier-Stokes equations into steady state and sequentially solving steady-state Navier-Stokes equations with transfer learning. The novel training framework is coined as SeqPINN. Upon the success of SeqPINN, we adopt the idea of averaged constant stochastic gradient descent (SGD) as initialization and propose a parallel training scheme for all timestamps. To ensure an initialization that generalizes well, we borrow the concept of Stochastic Weight Averaging Gaussian to perform uncertainty estimation as an indicator of generalizability of the initialization. This algorithm, named SP-PINN, further expedites training of PINN while achieving comparable accuracy with SeqPINN. Finite-element simulations and \emph{in vitro} phantoms of single-branch and trifurcate blood vessels are used to evaluate the performance of SeqPINN and SP-PINN. Results show that both SeqPINN and SP-PINN are manyfold faster than the original design of PINN, while respectively achieving Root Mean Square Errors (RMSEs) of 1.01 cm/s and 1.26 cm/s on the straight vessel and 1.91 cm/s and 2.56 cm/s on the trifurcate blood vessel when recovering blood flow velocities.
A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning
Wadia, Neha S., Dandi, Yatin, Jordan, Michael I.
The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition.
Approximation Results for Gradient Descent trained Neural Networks
The paper contains approximation guarantees for neural networks that are trained with gradient flow, with error measured in the continuous $L_2(\mathbb{S}^{d-1})$-norm on the $d$-dimensional unit sphere and targets that are Sobolev smooth. The networks are fully connected of constant depth and increasing width. Although all layers are trained, the gradient flow convergence is based on a neural tangent kernel (NTK) argument for the non-convex second but last layer. Unlike standard NTK analysis, the continuous error norm implies an under-parametrized regime, possible by the natural smoothness assumption required for approximation. The typical over-parametrization re-enters the results in form of a loss in approximation rate relative to established approximation methods for Sobolev smooth functions.
Soft Quantization using Entropic Regularization
Lakshmanan, Rajmadan, Pichler, Alois
The quantization problem aims to find the best possible approximation of probability measures on ${\mathbb{R}}^d$ using finite, discrete measures. The Wasserstein distance is a typical choice to measure the quality of the approximation. This contribution investigates the properties and robustness of the entropy-regularized quantization problem, which relaxes the standard quantization problem. The proposed approximation technique naturally adopts the softmin function, which is well known for its robustness in terms of theoretical and practicability standpoints. Moreover, we use the entropy-regularized Wasserstein distance to evaluate the quality of the soft quantization problem's approximation, and we implement a stochastic gradient approach to achieve the optimal solutions. The control parameter in our proposed method allows for the adjustment of the optimization problem's difficulty level, providing significant advantages when dealing with exceptionally challenging problems of interest. As well, this contribution empirically illustrates the performance of the method in various expositions.
Enabling energy-Efficient object detection with surrogate gradient descent in spiking neural networks
Luo, Jilong, Xiao, Shanlin, Chen, Yinsheng, Yu, Zhiyi
Spiking Neural Networks (SNNs) are a biologically plausible neural network model with significant advantages in both event-driven processing and spatio-temporal information processing, rendering SNNs an appealing choice for energyefficient object detection. However, the non-differentiability of the biological neuronal dynamics model presents a challenge during the training of SNNs. Furthermore, a suitable decoding strategy for object detection in SNNs is currently lacking. In this study, we introduce the Current Mean Decoding (CMD) method, which solves the regression problem to facilitate the training of deep SNNs for object detection tasks. Based on the gradient surrogate and CMD, we propose the SNN-YOLOv3 model for object detection. Our experiments demonstrate that SNN-YOLOv3 achieves a remarkable performance with an mAP of 61.87% on the PASCAL VOC dataset, requiring only 6 time steps. Compared to SpikingYOLO, we have managed to increase mAP by nearly 10% while reducing energy consumption by two orders of magnitude.
DBsurf: A Discrepancy Based Method for Discrete Stochastic Gradient Estimation
Arabi, Pau Mulet, Flowers, Alec, Mauch, Lukas, Cardinaux, Fabien
Computing gradients of an expectation with respect to the distributional parameters of a discrete distribution is a problem arising in many fields of science and engineering. Typically, this problem is tackled using Reinforce, which frames the problem of gradient estimation as a Monte Carlo simulation. Unfortunately, the Reinforce estimator is especially sensitive to discrepancies between the true probability distribution and the drawn samples, a common issue in low sampling regimes that results in inaccurate gradient estimates. In this paper, we introduce DBsurf, a reinforce-based estimator for discrete distributions that uses a novel sampling procedure to reduce the discrepancy between the samples and the actual distribution. To assess the performance of our estimator, we subject it to a diverse set of tasks. Among existing estimators, DBsurf attains the lowest variance in a least squares problem commonly used in the literature for benchmarking. Furthermore, DBsurf achieves the best results for training variational auto-encoders (VAE) across different datasets and sampling setups. Finally, we apply DBsurf to build a simple and efficient Neural Architecture Search (NAS) algorithm with state-of-the-art performance.
Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise
Liu, Zijian, Zhang, Jiawei, Zhou, Zhengyuan
We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) is the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provides a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) proposes another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be in the form of $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.
DRAG: Divergence-based Adaptive Aggregation in Federated learning on Non-IID Data
Zhu, Feng, Zhang, Jingjing, Liu, Shengyun, Wang, Xin
Local stochastic gradient descent (SGD) is a fundamental approach in achieving communication efficiency in Federated Learning (FL) by allowing individual workers to perform local updates. However, the presence of heterogeneous data distributions across working nodes causes each worker to update its local model towards a local optimum, leading to the phenomenon known as ``client-drift" and resulting in slowed convergence. To address this issue, previous works have explored methods that either introduce communication overhead or suffer from unsteady performance. In this work, we introduce a novel metric called ``degree of divergence," quantifying the angle between the local gradient and the global reference direction. Leveraging this metric, we propose the divergence-based adaptive aggregation (DRAG) algorithm, which dynamically ``drags" the received local updates toward the reference direction in each round without requiring extra communication overhead. Furthermore, we establish a rigorous convergence analysis for DRAG, proving its ability to achieve a sublinear convergence rate. Compelling experimental results are presented to illustrate DRAG's superior performance compared to state-of-the-art algorithms in effectively managing the client-drift phenomenon. Additionally, DRAG exhibits remarkable resilience against certain Byzantine attacks. By securely sharing a small sample of the client's data with the FL server, DRAG effectively counters these attacks, as demonstrated through comprehensive experiments.
Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD
Livne, Etay, Kaplun, Gal, Malach, Eran, Shalev-Schwatz, Shai
When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi}, proposed an online shuffling algorithm called CorgiPile, which greatly improves efficiency of data access, at the cost some performance loss, which is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD which combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogenous data) without compromising the data access efficiency of CorgiPile. We provide a comprehensive theoretical analysis of the convergence properties of our method and demonstrate its practical advantages through experimental results.