Goto

Collaborating Authors

 Gradient Descent


Reviews: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Neural Information Processing Systems

Originality: To the best of my knowledge, the results are novel and provide important extensions/improvements over the previous art. Quality: I did a high level check of the proofs and it seems sound to me. Clarity: the paper is a joy to read. The problem definition, assumptions, the algorithm, and statement of results are very well presented. Significance: the results provide several extensions and improvements over the previous work, including training deeper models, training all layers, training with SGD (rather than GD), and smaller required overparameterization.


Reviews: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Neural Information Processing Systems

This paper provides a generalization bound for training over-parameterized deep neural networks with ReLU activation and cross-entropy loss using SGD. Initially the paper received mixed reviews, with two positive and one negative reviews. On the one hand, the analysis is found to be intuitive, general, and potentially influential, the generalization bound is found to be more general and sharper than many existing generalization error bounds for over-parameterized neural networks, and the paper to be very well written. On the other, hand the width requirement is found to be too strict. The rebuttal addressed the issues raised by the reviewers, one rating was increased from 6 to 8, and the negative review updated the score to 6. Upon discussion, the reviewers agreed that the paper should be accepted.


Reviews: Provable Non-linear Inductive Matrix Completion

Neural Information Processing Systems

This paper considers the problem of Non-linear Inductive Matrix Completion (NIMC) in a deep learning formulation. In NIMC one is given a query set, an item set and a few query-item relevance values, and the goal is to learn the query-item relevance function. The main contribution of the paper is to provide theoretical guarantees for using a one hidden layer network to estimate that function via an L2 loss. In particular, this can be thought of as two one-layer networks, one learning the embedding of the queries and the other the embedding of the items, while the relevance function is taken to be the inner product of the outputs of the two networks. The authors prove that for sigmoid and tanh activation functions the objective function is locally strongly convex around the global optimum and that stochastic gradient descent converges linearly if initialized sufficiently well.


Reviews: Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Neural Information Processing Systems

This paper studies the learning dynamics of two-layer neural networks in the teacher-student scenario under the assumptions that the input is i.i.d. The dynamics is set to be the online algorithm or the stochastic gradient descent (SGD) with mini-batch of single sample, and the dataset size is also assumed to be sufficiently large so that the parameters have no correlation with forthcoming samples. Thanks to these assumptions, the dynamics is governed only by the covariances of connections of the student and teacher, and the closed-form macroscopic dynamics of those covariances can be derived from the SGD dynamics itself. Using this macroscopic dynamics, the generalization error which is also characterized by the covariances only, can be accurately calculated. Meanwhile when both layers of the student are leant, the generalization ability strongly depends on the choice of the activation function: For the sigmoid activation, the generalization error decreases'' as the overparameterization level increases'' while for the other activations the generalization error almost stays constant with respect to it.


Reviews: Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Neural Information Processing Systems

This paper derives a coupled system of ODEs modelling this teacher-student setup. The authors provide an asymptotic analysis of the dynamics when only the first layer is trained, and generalization error increases with the size of the student network, and results when both layers are trained are also obtained. All reviewers agree that it is a good contribution.


Closed-Form Feedback-Free Learning with Forward Projection

arXiv.org Machine Learning

State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This novel randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Target values for pre-activation membrane potentials are generated layer-wise via nonlinear projections of pre-synaptic inputs and the labels. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets. In few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving significant speed up for training. Interpretation functions defined on local neuronal activity in FP-based models successfully identified clinically salient features for diagnosis in two biomedical datasets. Forward Projection is a computationally efficient machine learning approach that yields interpretable neural network models without retrograde communication of neuronal activity during training.


Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity

arXiv.org Machine Learning

Asynchronous Stochastic Gradient Descent (Asynchronous SGD) is a cornerstone method for parallelizing learning in distributed machine learning. However, its performance suffers under arbitrarily heterogeneous computation times across workers, leading to suboptimal time complexity and inefficiency as the number of workers scales. While several Asynchronous SGD variants have been proposed, recent findings by Tyurin & Richt\'arik (NeurIPS 2023) reveal that none achieve optimal time complexity, leaving a significant gap in the literature. In this paper, we propose Ringmaster ASGD, a novel Asynchronous SGD method designed to address these limitations and tame the inherent challenges of Asynchronous SGD. We establish, through rigorous theoretical analysis, that Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous and dynamically fluctuating worker computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.


Training Dynamics of In-Context Learning in Linear Attention

arXiv.org Artificial Intelligence

While attention-based models have demonstrated the remarkable ability of in-context learning, the theoretical understanding of how these models acquired this ability through gradient descent training is still preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine two parametrizations of linear self-attention: one with the key and query weights merged as a single matrix (common in theoretical studies), and one with separate key and query matrices (closer to practical settings). For the merged parametrization, we show the training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an analytical time-course solution for a certain class of datasets and initialization. For the separate parametrization, we show the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context with the number of principal components increasing over training time. Overall, we characterize how in-context learning abilities evolve during gradient descent training of linear attention, revealing dynamics of abrupt acquisition versus progressive improvements in models with different parametrizations.


SAPPHIRE: Preconditioned Stochastic Variance Reduction for Faster Large-Scale Statistical Learning

arXiv.org Machine Learning

Regularized empirical risk minimization (rERM) has become important in data-intensive fields such as genomics and advertising, with stochastic gradient methods typically used to solve the largest problems. However, ill-conditioned objectives and non-smooth regularizers undermine the performance of traditional stochastic gradient methods, leading to slow convergence and significant computational costs. To address these challenges, we propose the $\texttt{SAPPHIRE}$ ($\textbf{S}$ketching-based $\textbf{A}$pproximations for $\textbf{P}$roximal $\textbf{P}$reconditioning and $\textbf{H}$essian $\textbf{I}$nexactness with Variance-$\textbf{RE}$educed Gradients) algorithm, which integrates sketch-based preconditioning to tackle ill-conditioning and uses a scaled proximal mapping to minimize the non-smooth regularizer. This stochastic variance-reduced algorithm achieves condition-number-free linear convergence to the optimum, delivering an efficient and scalable solution for ill-conditioned composite large-scale convex machine learning problems. Extensive experiments on lasso and logistic regression demonstrate that $\texttt{SAPPHIRE}$ often converges $20$ times faster than other common choices such as $\texttt{Catalyst}$, $\texttt{SAGA}$, and $\texttt{SVRG}$. This advantage persists even when the objective is non-convex or the preconditioner is infrequently updated, highlighting its robust and practical effectiveness.


A Unified Analysis of Stochastic Gradient Descent with Arbitrary Data Permutations and Beyond

arXiv.org Artificial Intelligence

We aim to provide a unified convergence analysis for permutation-based Stochastic Gradient Descent (SGD), where data examples are permuted before each epoch. By examining the relations among permutations, we categorize existing permutation-based SGD algorithms into four categories: Arbitrary Permutations, Independent Permutations (including Random Reshuffling), One Permutation (including Incremental Gradient, Shuffle One and Nice Permutation) and Dependent Permutations (including GraBs Lu et al., 2022; Cooper et al., 2023). Existing unified analyses failed to encompass the Dependent Permutations category due to the inter-epoch dependencies in its permutations. In this work, we propose a general assumption that captures the inter-epoch permutation dependencies. Using the general assumption, we develop a unified framework for permutation-based SGD with arbitrary permutations of examples, incorporating all the aforementioned representative algorithms. Furthermore, we adapt our framework on example ordering in SGD for client ordering in Federated Learning (FL). Specifically, we develop a unified framework for regularized-participation FL with arbitrary permutations of clients.