Goto

Collaborating Authors

 Gradient Descent


MIRA: A Method of Federated MultI-Task Learning for LaRge LAnguage Models

arXiv.org Artificial Intelligence

In this paper, we introduce a method for fine-tuning Large Language Models (LLMs), inspired by Multi-Task learning in a federated manner. Our approach leverages the structure of each client's model and enables a learning scheme that considers other clients' tasks and data distribution. To mitigate the extensive computational and communication overhead often associated with LLMs, we utilize a parameter-efficient fine-tuning method, specifically Low-Rank Adaptation (LoRA), reducing the number of trainable parameters. Experimental results, with different datasets and models, demonstrate the proposed method's effectiveness compared to existing frameworks for federated fine-tuning of LLMs in terms of average and local performances. The proposed scheme outperforms existing baselines by achieving lower local loss for each client while maintaining comparable global performance.


Tighter Performance Theory of FedExProx

arXiv.org Machine Learning

We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies - based on gradient diversity and Polyak stepsizes - again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Lojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.


Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks

arXiv.org Artificial Intelligence

Accumulated detailed knowledge about the neuronal activities in human brains has brought more attention to bio-inspired spiking neural networks (SNNs). In contrast to non-spiking deep neural networks (DNNs), SNNs can encode and transmit spatiotemporal information more efficiently by exploiting biologically realistic and low-power event-driven neuromorphic architectures. However, the supervised learning of SNNs still remains a challenge because the spike-timing-dependent plasticity (STDP) of connected spiking neurons is difficult to implement and interpret in existing backpropagation learning schemes. This paper proposes a fractional-order spike-timing-dependent gradient descent (FO-STDGD) learning model by considering a derived nonlinear activation function that describes the relationship between the quasi-instantaneous firing rate and the temporal membrane potentials of nonleaky integrate-and-fire neurons. The training strategy can be generalized to any fractional orders between 0 and 2 since the FO-STDGD incorporates the fractional gradient descent method into the calculation of spike-timing-dependent loss gradients. The proposed FO-STDGD model is tested on the MNIST and DVS128 Gesture datasets and its accuracy under different network structure and fractional orders is analyzed. It can be found that the classification accuracy increases as the fractional order increases, and specifically, the case of fractional order 1.9 improves by 155% relative to the case of fractional order 1 (traditional gradient descent). In addition, our scheme demonstrates the state-of-the-art computational efficacy for the same SNN structure and training epochs.


SNAP: Stopping Catastrophic Forgetting in Hebbian Learning with Sigmoidal Neuronal Adaptive Plasticity

arXiv.org Artificial Intelligence

Artificial Neural Networks (ANNs) suffer from catastrophic forgetting, where the learning of new tasks causes the catastrophic forgetting of old tasks. Existing Machine Learning (ML) algorithms, including those using Stochastic Gradient Descent (SGD) and Hebbian Learning typically update their weights linearly with experience i.e., independently of their current strength. This contrasts with biological neurons, which at intermediate strengths are very plastic, but consolidate with Long-Term Potentiation (LTP) once they reach a certain strength. We hypothesize this mechanism might help mitigate catastrophic forgetting. We introduce Sigmoidal Neuronal Adaptive Plasticity (SNAP) an artificial approximation to Long-Term Potentiation for ANNs by having the weights follow a sigmoidal growth behaviour allowing the weights to consolidate and stabilize when they reach sufficiently large or small values. We then compare SNAP to linear weight growth and exponential weight growth and see that SNAP completely prevents the forgetting of previous tasks for Hebbian Learning but not for SGD-base learning.


Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging

arXiv.org Artificial Intelligence

Text embeddings are vital for tasks such as text retrieval and semantic textual similarity (STS). Recently, the advent of pretrained language models, along with unified benchmarks like the Massive Text Embedding Benchmark (MTEB), has facilitated the development of versatile general-purpose text embedding models. Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks. However, our experimental analysis reveals two significant drawbacks of joint training: 1) Task Conflict: Gradients from different tasks interfere with each other, leading to negative transfer. 2) Data Imbalance: Disproportionate data distribution introduces biases that negatively impact performance across tasks. To overcome these challenges, we explore model merging-a technique that combines independently trained models to mitigate gradient conflicts and balance data distribution. We introduce a novel method, Self Positioning, which efficiently searches for optimal model combinations within the interpolation space of task vectors using stochastic gradient descent. Our experiments demonstrate that Self Positioning significantly enhances multi-task performance on the MTEB dataset, achieving an absolute improvement of 0.7 points. It outperforms traditional resampling methods while reducing computational costs. This work offers a robust approach to building generalized text embedding models with superior performance across diverse embedding-related tasks.


Stochastic Gradient Descent Jittering for Inverse Problems: Alleviating the Accuracy-Robustness Tradeoff

arXiv.org Artificial Intelligence

Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. The proposed method achieves cleaner reconstructions for out-of-distribution data and demonstrates enhanced robustness to adversarial attacks.


A Mirror Descent Perspective of Smoothed Sign Descent

arXiv.org Artificial Intelligence

Recent work by Woodworth et al. (2020) shows that the optimization dynamics of gradient descent for overparameterized problems can be viewed as low-dimensional dual dynamics induced by a mirror map, explaining the implicit regularization phenomenon from the mirror descent perspective. However, the methodology does not apply to algorithms where update directions deviate from true gradients, such as ADAM. We use the mirror descent framework to study the dynamics of smoothed sign descent with a stability constant $\varepsilon$ for regression problems. We propose a mirror map that establishes equivalence to dual dynamics under some assumptions. By studying dual dynamics, we characterize the convergent solution as an approximate KKT point of minimizing a Bregman divergence style function, and show the benefit of tuning the stability constant $\varepsilon$ to reduce the KKT error.


Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

arXiv.org Artificial Intelligence

We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art, who only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, demonstrating noise symmetry in real-life settings and showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.


Independently-Normalized SGD for Generalized-Smooth Nonconvex Optimization

arXiv.org Machine Learning

Recent studies have shown that many nonconvex machine learning problems meet a so-called generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms designed for generalized-smooth nonconvex optimization encounter significant limitations in both their design and convergence analysis. In this work, we first study deterministic generalized-smooth nonconvex optimization and analyze the convergence of normalized gradient descent under the generalized Polyak-Lojasiewicz condition. Our results provide a comprehensive understanding of the interplay between gradient normalization and function geometry. Then, for stochastic generalized-smooth nonconvex optimization, we propose an independently-normalized stochastic gradient descent algorithm, which leverages independent sampling, gradient normalization and clipping to achieve an $\mathcal{O}(\epsilon^{-4})$ sample complexity under relaxed assumptions. Experiments demonstrate the fast convergence of our algorithm.


Tensor Decomposition with Unaligned Observations

arXiv.org Machine Learning

This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the $\ell_2$ loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset.