 Gradient Descent


Learning and Transferring Sparse Contextual Bigrams with Linear Transformers

arXiv.org Artificial Intelligence

Transformers have excelled in natural language modeling, and one reason behind this success is their exceptional ability to combine contextual information and global knowledge. However, the theoretical basis remains unclear. In this paper, we first introduce the Sparse Contextual Bigram (SCB), a natural extension of the classical bigram model, where the next token's generation depends on a sparse set of earlier positions determined by the last token. We then analyze the training dynamics and sample complexity of learning SCB using a one-layer linear transformer with a gradient-based algorithm. We show that when trained from scratch, the training process can be split into an initial sample-intensive stage where the correlation is boosted from zero to a nontrivial value, followed by a more sample-efficient stage of further improvement. Additionally, we prove that, provided a nontrivial correlation between the downstream and pretraining tasks, finetuning from a pretrained model allows us to bypass the initial sample-intensive stage. We also empirically demonstrate that our algorithm can outperform SGD in this setting and discuss its relationship with the usual softmax-based transformers.


Communication-Efficient Federated Learning over Wireless Channels via Gradient Sketching

arXiv.org Artificial Intelligence

Large-scale federated learning (FL) over wireless multiple access channels (MACs) has emerged as a crucial learning paradigm with a wide range of applications. However, its widespread adoption is hindered by several major challenges, including limited bandwidth shared by many edge devices, noisy and erroneous wireless communications, and heterogeneous datasets with different distributions across edge devices. To overcome these fundamental challenges, we propose Federated Proximal Sketching (FPS), which is tailored to band-limited wireless channels and handles data heterogeneity across edge devices. FPS uses a count sketch data structure to address the bandwidth bottleneck and enable efficient compression while maintaining accurate estimation of significant coordinates. Additionally, we modify the loss function in FPS so that it can deal with varying degrees of data heterogeneity. We establish the convergence of the FPS algorithm under mild technical conditions and characterize how the bias induced by factors such as data heterogeneity and noisy wireless channels affects the overall result. We complement the proposed theoretical framework with numerical experiments that demonstrate the stability, accuracy, and efficiency of FPS in comparison to state-of-the-art methods on both synthetic and real-world datasets. Overall, our results show that FPS is a promising solution for tackling the above challenges of FL over wireless MACs.
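
To make the count sketch idea mentioned in the abstract concrete, here is a minimal Python sketch of a generic count sketch applied to a gradient vector. It is not the FPS algorithm: the class name `CountSketch`, its parameters (`rows`, `cols`, `dim`, `seed`), and the use of seeded random tables in place of hash functions are all assumptions made for illustration. It shows two properties that make the structure attractive for bandwidth-limited aggregation: sketches built with the same seed add entrywise, and a large coordinate can be estimated from the compressed table.

```python
import random
import statistics


class CountSketch:
    """Toy count sketch for a fixed-length vector (illustrative only)."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols, self.dim = rows, cols, dim
        # For every row, each coordinate gets a random bucket and a random sign.
        # (A seeded table stands in for the hash functions of a real sketch.)
        self.bucket = [[rng.randrange(cols) for _ in range(dim)] for _ in range(rows)]
        self.sign = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(rows)]
        self.table = [[0.0] * cols for _ in range(rows)]

    def add(self, vec):
        """Accumulate a length-dim vector into the rows x cols table."""
        for r in range(self.rows):
            for i, v in enumerate(vec):
                self.table[r][self.bucket[r][i]] += self.sign[r][i] * v

    def merge(self, other):
        """Add another sketch built with the same seed (sketching is linear)."""
        for r in range(self.rows):
            for c in range(self.cols):
                self.table[r][c] += other.table[r][c]

    def estimate(self, i):
        """Median over rows of the signed bucket value for coordinate i."""
        return statistics.median(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )


# A vector with a single nonzero coordinate is recovered exactly,
# because no other coordinate can collide with it.
grad = [0.0] * 100
grad[7] = 3.0
sketch = CountSketch(rows=5, cols=16, dim=100, seed=1)
sketch.add(grad)
recovered = sketch.estimate(7)
```

With dense vectors the estimate is only approximate, since other coordinates collide in the same buckets; the median over rows is what keeps the error on large coordinates small.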


Functional Gradient Flows for Constrained Sampling

arXiv.org Machine Learning

Recently, through a unified gradient flow perspective of Markov chain Monte Carlo (MCMC) and variational inference (VI), particle-based variational inference methods (ParVIs) have been proposed that tend to combine the best of both worlds. While typical ParVIs such as Stein Variational Gradient Descent (SVGD) approximate the gradient flow within a reproducing kernel Hilbert space (RKHS), many attempts have been made recently to replace RKHS with more expressive function spaces, such as neural networks. While successful, these methods are mainly designed for sampling from unconstrained domains. In this paper, we offer a general solution to constrained sampling by introducing a boundary condition for the gradient flow which would confine the particles within the specific domain. This allows us to propose a new functional gradient ParVI method for constrained sampling, called constrained functional gradient flow (CFG), with provable continuous-time convergence in total variation (TV). We also present novel numerical strategies to handle the boundary integral term arising from the domain constraints. Our theory and experiments demonstrate the effectiveness of the proposed framework.


Super Gradient Descent: Global Optimization requires Global Gradient

arXiv.org Artificial Intelligence

Global optimization plays a critical role in addressing complex real-life challenges across various fields. In engineering, it is applied to structural design optimization, where minimizing weight or material use while ensuring durability is essential for cost-effective and safe construction. In financial services, portfolio optimization requires balancing risk and return by finding the global minimum or maximum in investment strategies. In logistics and transportation, global optimization is crucial for solving routing problems, such as determining the shortest path or optimizing delivery routes, which leads to significant cost savings and improved efficiency. Similarly, in energy systems, global optimization is key to managing and distributing power more efficiently, reducing operational costs, and optimizing renewable energy usage. In machine learning, the need for global optimization is especially pronounced. The performance of models often depends on the ability to minimize complex, non-convex loss functions. While traditional methods like gradient descent are effective in many cases, they frequently get trapped in local minima, which can hinder the model's overall performance. This is particularly relevant in tasks that require complex models, where the optimization landscape is highly non-linear and fraught with local minima.
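
The local-minimum problem described above can be seen on a one-dimensional toy objective. The following Python sketch is an illustration only: the function `f(x) = x^4 - 3x^2 + x`, the learning rate, and the starting points are assumptions chosen for the example and do not come from the paper. Plain gradient descent started at `x = 2` settles in the shallower local minimum, while the same procedure started at `x = -2` reaches the global one.

```python
def f(x):
    """Toy non-convex objective with two local minima (hypothetical example)."""
    return x**4 - 3.0 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4.0 * x**3 - 6.0 * x + 1.0


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_right = gradient_descent(2.0)   # ends near x = 1.13, a local minimum
x_left = gradient_descent(-2.0)   # ends near x = -1.30, the global minimum
```

Both runs satisfy the usual stopping criterion (a vanishing gradient), yet `f(x_left)` is markedly lower than `f(x_right)`, which is the failure mode the paragraph refers to.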


Shuffling Gradient-Based Methods for Nonconvex-Concave Minimax Optimization

arXiv.org Machine Learning

This paper develops novel shuffling gradient-based methods for tackling two classes of minimax problems: nonconvex-linear and nonconvex-strongly concave settings. The first algorithm addresses the nonconvex-linear minimax model and achieves the state-of-the-art oracle complexity typically observed in nonconvex optimization. It also employs a new shuffling estimator for the "hyper-gradient", departing from standard shuffling techniques in optimization. The second method consists of two variants: semi-shuffling and full-shuffling schemes. These variants tackle the nonconvex-strongly concave minimax setting. We establish their oracle complexity bounds under standard assumptions, which, to the best of our knowledge, are the best known for this specific setting. Numerical examples demonstrate the performance of our algorithms and compare them with two other methods. Our results show that the new methods achieve performance comparable to SGD, supporting the potential of incorporating shuffling strategies into minimax algorithms.


A Stein Gradient Descent Approach for Doubly Intractable Distributions

arXiv.org Machine Learning

Bayesian inference for doubly intractable distributions is challenging because they include intractable terms, which are functions of the parameters of interest. Although several alternatives have been developed for such models, they are computationally intensive due to repeated auxiliary variable simulations. We propose a novel Monte Carlo Stein variational gradient descent (MC-SVGD) approach for inference for doubly intractable distributions. Through an efficient gradient approximation, our MC-SVGD approach rapidly transforms an arbitrary reference distribution to approximate the posterior distribution of interest, without requiring any predefined variational distribution class for the posterior. Such a transport map is obtained by minimizing the Kullback-Leibler divergence between the transformed and posterior distributions in a reproducing kernel Hilbert space (RKHS). We also investigate the convergence rate of the proposed method. We illustrate the application of the method to challenging examples, including a Potts model, an exponential random graph model, and a Conway-Maxwell-Poisson regression model. The proposed method achieves substantial computational gains over existing algorithms, while providing comparable inferential performance for the posterior distributions.
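
For readers unfamiliar with the particle transport underlying SVGD, the following Python sketch shows a plain one-dimensional SVGD update with an RBF kernel. It is not MC-SVGD: it assumes the score function `grad_log_p` is available in closed form, whereas the abstract concerns models where that gradient must be approximated. The function name `svgd_step`, the fixed `bandwidth`, the step size, and the Gaussian target are all assumptions made for illustration.

```python
import math


def svgd_step(particles, grad_log_p, step, bandwidth=1.0):
    """One SVGD update for 1D particles with an RBF kernel of fixed bandwidth."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth**2))
            # Attraction: kernel-weighted score pulls particles toward high density.
            # Repulsion: derivative of k(xj, xi) w.r.t. xj keeps particles apart.
            phi += k * grad_log_p(xj) + k * (xi - xj) / bandwidth**2
        updated.append(xi + step * phi / n)
    return updated


# Hypothetical target: a Gaussian with mean 2 and unit variance.
def score(x):
    return -(x - 2.0)


particles = [-1.0 + 2.0 * i / 9 for i in range(10)]  # reference particles in [-1, 1]
for _ in range(1000):
    particles = svgd_step(particles, score, step=0.1)

particle_mean = sum(particles) / len(particles)
```

The particles drift from the reference interval toward the target mean while the repulsive term prevents them from collapsing onto a single point.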


Trustworthiness of Stochastic Gradient Descent in Distributed Learning

arXiv.org Artificial Intelligence

Distributed learning (DL) is a method for accelerating the training of deep learning models by distributing training tasks across multiple computing nodes [1]. However, as data scales continue to grow, the complexity of model gradients increases accordingly, which constrains communication efficiency [3]. Consider, for example, training deep learning models on ImageNet [2], which contains over 14 million labeled images in approximately 22,000 categories. Gradient compression aims to reduce the communication overhead of transmitting gradients between nodes, which improves the system's computational efficiency [4, 5, 6]. It has emerged as an effective optimization technique in distributed learning, especially when training complex models on large-scale data. Among various gradient compression techniques, PowerSGD [6] and Top-K SGD [7] have emerged as prominent solutions for their ability to substantially reduce communication costs while preserving scalability and model accuracy in large-scale distributed learning. These two algorithms are particularly suitable for our study as they represent fundamental approaches to gradient compression: PowerSGD uses low-rank approximation, while Top-K SGD leverages sparsification through threshold quantization. Both techniques are widely recognized for their practical effectiveness, especially when combined, to varying extents, with advanced features such as error feedback, warm start, and all-reduce, making them ideal candidates among compressed SGD methods for assessing privacy risks in distributed deep learning systems.
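
As a rough illustration of sparsification with error feedback, the Python sketch below keeps the `k` largest-magnitude gradient entries and carries the discarded remainder into the next round. It is a generic textbook-style sketch, not the implementation studied in the paper; the function names `top_k_sparsify` and `compress_with_error_feedback` and the toy gradient values are assumptions.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def compress_with_error_feedback(grad, residual, k):
    """Add the carried-over residual, send the top-k part, keep the remainder."""
    corrected = [g + r for g, r in zip(grad, residual)]
    sent = top_k_sparsify(corrected, k)
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual


# Round 1: only the two largest entries are transmitted.
residual = [0.0, 0.0, 0.0, 0.0]
sent1, residual = compress_with_error_feedback([0.5, -3.0, 0.25, 2.0], residual, k=2)
# Round 2: the previously dropped 0.5 is added back and now gets transmitted.
sent2, residual = compress_with_error_feedback([0.75, 0.0, 0.0, 0.5], residual, k=2)
```

Only `k` values (plus their indices) need to be communicated per round, and the residual ensures that small coordinates are delayed rather than lost.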


Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD

arXiv.org Machine Learning

We consider the problem of high-dimensional heavy-tailed statistical estimation in the streaming setting, which is much harder than the traditional batch setting due to memory constraints. We cast this problem as stochastic convex optimization with heavy-tailed stochastic gradients, and prove that the widely used Clipped-SGD algorithm attains near-optimal sub-Gaussian statistical rates whenever the second moment of the stochastic gradient noise is finite. More precisely, with $T$ samples, we show that Clipped-SGD, for smooth and strongly convex objectives, achieves an error of $\sqrt{\frac{\mathsf{Tr}(\Sigma)+\sqrt{\mathsf{Tr}(\Sigma)\|\Sigma\|_2}\log(\frac{\log(T)}{\delta})}{T}}$ with probability $1-\delta$, where $\Sigma$ is the covariance of the clipped gradient. Note that the fluctuations (depending on $\frac{1}{\delta}$) are of lower order than the term $\mathsf{Tr}(\Sigma)$. This improves upon the current best rate of $\sqrt{\frac{\mathsf{Tr}(\Sigma)\log(\frac{1}{\delta})}{T}}$ for Clipped-SGD, known only for smooth and strongly convex objectives. Our results also extend to smooth convex and Lipschitz convex objectives. Key to our result is a novel iterative refinement strategy for martingale concentration, improving upon the PAC-Bayes approach of Catoni and Giulini.
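
A minimal Python sketch of clipped SGD for streaming mean estimation may help fix ideas. It treats mean estimation as minimizing the expected value of `0.5 * ||x - sample||^2`, processes one sample at a time in constant memory, and rescales any stochastic gradient whose norm exceeds a threshold. The function names, the `1/t` step size, and the threshold values are assumptions for illustration; the abstract does not specify the schedule analyzed in the paper.

```python
import math


def clip(vec, threshold):
    """Rescale vec so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= threshold:
        return list(vec)
    return [v * threshold / norm for v in vec]


def clipped_sgd_mean(stream, dim, threshold, step=lambda t: 1.0 / t):
    """Streaming mean estimate; only the current iterate is kept in memory."""
    x = [0.0] * dim
    for t, sample in enumerate(stream, start=1):
        # Stochastic gradient of 0.5 * ||x - sample||^2 with respect to x.
        grad = [xi - si for xi, si in zip(x, sample)]
        clipped = clip(grad, threshold)
        lr = step(t)
        x = [xi - lr * gi for xi, gi in zip(x, clipped)]
    return x


# A single extreme sample moves the iterate by at most step * threshold.
after_outlier = clipped_sgd_mean([[1000.0, 0.0]], dim=2, threshold=1.0)
# With a generous threshold, clipping is inactive and the first step lands on the sample.
no_clipping = clipped_sgd_mean([[1.0, 2.0]] * 3, dim=2, threshold=100.0)
```

The first call illustrates why clipping helps under heavy tails: an outlier of magnitude 1000 shifts the estimate by only one unit.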


Learned Reference-based Diffusion Sampling for multi-modal distributions

arXiv.org Machine Learning

Over the past few years, several approaches utilizing score-based diffusion have been proposed to sample from probability distributions, that is, without having access to exact samples and relying solely on evaluations of unnormalized densities. In practice, the performance of these methods heavily depends on key hyperparameters that require ground truth samples to be accurately tuned. Our work aims to highlight and address this fundamental issue, focusing in particular on multimodal distributions, which pose significant challenges for existing sampling methods. Building on existing approaches, we introduce Learned Reference-based Diffusion Sampler (LRDS), a methodology specifically designed to leverage prior knowledge on the location of the target modes in order to bypass the obstacle of hyperparameter tuning. LRDS proceeds in two steps by (i) learning a reference diffusion model on samples located in high-density space regions and tailored for multimodality, and (ii) using this reference model to foster the training of a diffusion-based sampler. We experimentally demonstrate that LRDS best exploits prior knowledge on the target distribution compared to competing algorithms on a variety of challenging distributions.

We consider the problem of sampling from a probability density π known up to a normalizing constant. In particular, we are interested in sampling from multimodal distributions, i.e., distributions whose density admits multiple local maxima, called modes. Finding the modes of such distributions is a notoriously hard problem, yet, perhaps surprisingly, even if the location of the modes is known, sampling from π remains a very challenging problem (Noé et al., 2019; Pompe et al., 2020; Grenioux et al., 2023). In this work, we aim to address this specific issue and will assume that we have access to the location of the modes as prior information on π. However, we do not assume a priori access to ground truth samples from π.

Annealed MCMC. Markov Chain Monte Carlo (MCMC) samplers are among the most popular approaches for sampling. In particular, gradient-based methods based on discretizations of Langevin or Hamiltonian dynamics (Roberts & Tweedie, 1996; Neal, 2012; Hoffman & Gelman, 2014) are guaranteed to be efficient for high-dimensional target distributions that are log-concave or satisfy functional inequalities (Dalalyan, 2017; Durmus & Moulines, 2017).
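
As a concrete instance of a gradient-based sampler built on a discretization of Langevin dynamics, the Python sketch below implements the unadjusted Langevin algorithm in one dimension. It only needs the gradient of the log-density, so the normalizing constant never appears. The function names `ula_step` and `ula_chain`, the step size, and the standard normal target are assumptions for illustration; this is not the annealed scheme or LRDS discussed in the paper.

```python
import math
import random


def ula_step(x, grad_log_density, step, noise):
    """One Euler-Maruyama step of Langevin dynamics with a given noise draw."""
    return x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise


def ula_chain(x0, grad_log_density, step, n_steps, seed=0):
    """Run the unadjusted Langevin algorithm and return all iterates."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x = ula_step(x, grad_log_density, step, rng.gauss(0.0, 1.0))
        samples.append(x)
    return samples


# Hypothetical log-concave target: standard normal, so grad log pi(x) = -x.
samples = ula_chain(3.0, lambda x: -x, step=0.1, n_steps=20000)
kept = samples[1000:]  # discard a burn-in prefix
sample_mean = sum(kept) / len(kept)
```

On a log-concave target such as this one the chain mixes quickly. On a multimodal density with well-separated modes, the same update tends to stay near the mode where it started, which is the difficulty the passage above describes.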


Improving Stochastic Cubic Newton with Momentum

arXiv.org Artificial Intelligence

We study stochastic second-order methods for solving general non-convex optimization problems. We propose using a special version of momentum to stabilize the stochastic gradient and Hessian estimates in Newton's method. We show that momentum provably improves the variance of stochastic estimates and allows the method to converge for any noise level. Using the cubic regularization technique, we prove a global convergence rate for our method on general non-convex problems to a second-order stationary point, even when using only a single stochastic data sample per iteration. This starkly contrasts with all existing stochastic second-order methods for non-convex problems, which typically require large batches. Therefore, we are the first to demonstrate global convergence for batches of arbitrary size in the non-convex case for the Stochastic Cubic Newton. Additionally, we show improved speed on convex stochastic problems for our regularized Newton methods with momentum.