Gradient Descent
Particle-based Variational Inference with Generalized Wasserstein Gradient Flow
Cheng, Ziheng, Zhang, Shiyue, Yu, Longlin, Zhang, Cheng
Particle-based variational inference methods (ParVIs) such as Stein variational gradient descent (SVGD) update the particles based on the kernelized Wasserstein gradient flow for the Kullback-Leibler (KL) divergence. However, the design of kernels is often non-trivial and can be restrictive for the flexibility of the method. Recent works show that functional gradient flow approximations with quadratic form regularization terms can improve performance. In this paper, we propose a ParVI framework, called generalized Wasserstein gradient descent (GWG), based on a generalized Wasserstein gradient flow of the KL divergence, which can be viewed as a functional gradient method with a broader class of regularizers induced by convex functions. We show that GWG exhibits strong convergence guarantees. We also provide an adaptive version that automatically chooses Wasserstein metric to accelerate convergence. In experiments, we demonstrate the effectiveness and efficiency of the proposed framework on both simulated and real data problems.
Leveraging the two timescale regime to demonstrate convergence of neural networks
Marion, Pierre, Berthier, Raphaรซl
We study the training dynamics of shallow neural networks, in a two-timescale regime in which the stepsizes for the inner layer are much smaller than those for the outer layer. In this regime, we prove convergence of the gradient flow to a global optimum of the non-convex optimization problem in a simple univariate setting. The number of neurons need not be asymptotically large for our result to hold, distinguishing our result from popular recent approaches such as the neural tangent kernel or mean-field regimes. Experimental illustration is provided, showing that the stochastic gradient descent behaves according to our description of the gradient flow and thus converges to a global optimum in the two-timescale regime, but can fail outside of this regime.
(S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability
Even, Mathieu, Pesme, Scott, Gunasekar, Suriya, Flammarion, Nicolas
In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the "edge of stability" regime. Our findings are supported by experimental results.
Decentralized Learning over Wireless Networks with Broadcast-Based Subgraph Sampling
Herrera, Daniel Pรฉrez, Chen, Zheng, Larsson, Erik G.
This work centers on the communication aspects of decentralized learning over wireless networks, using consensus-based decentralized stochastic gradient descent (D-SGD). Considering the actual communication cost or delay caused by in-network information exchange in an iterative process, our goal is to achieve fast convergence of the algorithm measured by improvement per transmission slot. We propose BASS, an efficient communication framework for D-SGD over wireless networks with broadcast transmission and probabilistic subgraph sampling. In each iteration, we activate multiple subsets of non-interfering nodes to broadcast model updates to their neighbors. These subsets are randomly activated over time, with probabilities reflecting their importance in network connectivity and subject to a communication cost constraint (e.g., the average number of transmission slots per iteration). During the consensus update step, only bi-directional links are effectively preserved to maintain communication symmetry. In comparison to existing link-based scheduling methods, the inherent broadcasting nature of wireless channels offers intrinsic advantages in speeding up convergence of decentralized learning by creating more communicated links with the same number of transmission slots.
Algorithmic Regularization in Tensor Optimization: Towards a Lifted Approach in Matrix Sensing
Ma, Ziye, Lavaei, Javad, Sojoudi, Somayeh
Gradient descent (GD) is crucial for generalization in machine learning models, as it induces implicit regularization, promoting compact representations. In this work, we examine the role of GD in inducing implicit regularization for tensor optimization, particularly within the context of the lifted matrix sensing framework. This framework has been recently proposed to address the non-convex matrix sensing problem by transforming spurious solutions into strict saddles when optimizing over symmetric, rank-1 tensors. We show that, with sufficiently small initialization scale, GD applied to this lifted problem results in approximate rank-1 tensors and critical points with escape directions. Our findings underscore the significance of the tensor parametrization of matrix sensing, in combination with first-order methods, in achieving global optimality in such problems.
Meta- (out-of-context) learning in neural networks
Krasheninnikov, Dmitrii, Krasheninnikov, Egor, Mlodozeniec, Bruno, Krueger, David
Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of a phenomenon we call meta-out-of-context learning (meta-OCL) via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily "internalize" the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code can be found at https://github.com/krasheninnikov/internalization.
Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo
Wang, Ziyi, Chen, Yujie, Song, Qifan, Zhang, Ruqi
Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy. This paper investigates low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that, to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves quadratic improvement ($\widetilde{\mathbf{O}}\left({\epsilon^{-2}{\mu^*}^{-2}\log^2\left({\epsilon^{-1}}\right)}\right)$) compared to the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\widetilde{\mathbf{O}}\left({{\epsilon}^{-4}{\lambda^{*}}^{-1}\log^5\left({\epsilon^{-1}}\right)}\right)$). Moreover, we prove that low-precision SGHMC is more robust to the quantization error compared to low-precision SGLD due to the robustness of the momentum-based update w.r.t. gradient noise. Empirically, we conduct experiments on synthetic data, and {MNIST, CIFAR-10 \& CIFAR-100} datasets, which validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited machine learning.
Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients
Krahn, Maximilian, Sasdelli, Michelle, Yang, Fengyi, Golyanik, Vladislav, Kannala, Juho, Chin, Tat-Jun, Birdal, Tolga
We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains to be an open challenge. Most known BNN-optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables, by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.
Population Descent: A Natural-Selection Based Hyper-Parameter Tuning Framework
Pomalapally, Abhinav, Mabsout, Bassel El, Mansuco, Renato
First-order gradient descent has been the base of the most successful optimization algorithms ever implemented. On supervised learning problems with very high dimensionality, such as neural network optimization, it is almost always the algorithm of choice, mainly due to its memory and computational efficiency. However, it is a classical result in optimization that gradient descent converges to local minima on non-convex functions. Even more importantly, in certain high-dimensional cases, escaping the plateaus of large saddle points becomes intractable. On the other hand, black-box optimization methods are not sensitive to the local structure of a loss function's landscape but suffer the curse of dimensionality. Instead, memetic algorithms aim to combine the benefits of both. Inspired by this, we present Population Descent, a memetic algorithm focused on hyperparameter optimization. We show that an adaptive m-elitist selection approach combined with a normalized-fitness-based randomization scheme outperforms more complex state-of-the-art algorithms by up to 13% on common benchmark tasks.
Quantum Advantage Seeker with Kernels (QuASK): a software framework to speed up the research in quantum machine learning
Di Marcantonio, Francesco, Incudini, Massimiliano, Tezza, Davide, Grossi, Michele
Exploiting the properties of quantum information to the benefit of machine learning models is perhaps the most active field of research in quantum computation. This interest has supported the development of a multitude of software frameworks (e.g. Qiskit, Pennylane, Braket) to implement, simulate, and execute quantum algorithms. Most of them allow us to define quantum circuits, run basic quantum algorithms, and access low-level primitives depending on the hardware such software is supposed to run. For most experiments, these frameworks have to be manually integrated within a larger machine learning software pipeline. The researcher is in charge of knowing different software packages, integrating them through the development of long code scripts, analyzing the results, and generating the plots. Long code often leads to erroneous applications, due to the average number of bugs growing proportional with respect to the program length. Moreover, other researchers will struggle to understand and reproduce the experiment, due to the need to be familiar with all the different software frameworks involved in the code script. We propose QuASK, an open-source quantum machine learning framework written in Python that aids the researcher in performing their experiments, with particular attention to quantum kernel techniques. QuASK can be used as a command-line tool to download datasets, pre-process them, quantum machine learning routines, analyze and visualize the results. QuASK implements most state-of-the-art algorithms to analyze the data through quantum kernels, with the possibility to use projected kernels, (gradient-descent) trainable quantum kernels, and structure-optimized quantum kernels. Our framework can also be used as a library and integrated into pre-existing software, maximizing code reuse.