AITopics

2502.17009

Country:

North America > Canada (0.14)
Europe > Switzerland (0.14)
Europe > Norway (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

arXiv.org Machine LearningFeb-17-2025

Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

Islamov, Rustem, Horvath, Samuel, Lucchi, Aurelien, Richtarik, Peter, Gorbunov, Eduard

Strong Differential Privacy (DP) and Optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has optimal convergence rate and also near optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the superiority of Clip21-SGD2M over baselines in terms of the optimization performance for a given DP-budget.

artificial intelligence, deep learning, machine learning, (17 more...)

2502.11682

Country: Europe (0.27)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

arXiv.org Artificial IntelligenceNov-24-2024

Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

Compagnoni, Enea Monzio, Liu, Tianlin, Islamov, Rustem, Proske, Frank Norbert, Orvieto, Antonio, Lucchi, Aurelien

Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.

artificial intelligence, deep learning, machine learning, (18 more...)

2411.15958

Country:

Europe > Germany (0.14)
North America > Canada (0.14)
Europe > Switzerland (0.14)
Europe > Norway (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine LearningOct-24-2024

Loss Landscape Characterization of Neural Networks without Over-Parametrization

Islamov, Rustem, Ajroldi, Niccolò, Orvieto, Antonio, Lucchi, Aurelien

Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are even more remarkable given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Lojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. In order to address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.

artificial intelligence, dist, machine learning, (16 more...)

2410.12455

Country: Europe > Germany (0.28)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine LearningMay-30-2024

Near Optimal Decentralized Optimization with Compression and Momentum Tracking

Islamov, Rustem, Gao, Yuan, Stich, Sebastian U.

Decentralized machine learning approaches are increasingly popular in numerous applications such as the internet-of-things (IoT) and networked autonomous systems [Marvasti et al., 2014, Savazzi et al., 2020], primarily due to their scalability to larger datasets and systems, as well as their respect for data locality and privacy concerns. In this work, we focus on decentralized optimization techniques that operate without a central coordinator, relying solely on on-device computation and local communication with neighboring devices. This encompasses traditional scenarios like training Machine Learning models in large data centers, as well as emerging applications where computations occur directly on devices. Such a setting is preferred over centralized topology which often poses a significant bottleneck on the central node in terms of communication latency, bandwidth, and fault tolerance. Considering the enormous size of modern Machine Learning models, classic single-node training is often impossible. Moreover, the training of large models requires a huge amount of data that does not fit the memory of a single machine. Therefore, modern training techniques heavily rely on distributed computations over a set of computation nodes/clients [Shoeybi et al., 2019, Wang et al., 2020, Ramesh et al., 2021, 2022]. One of the instances of distributed training is Federated Learning (FL) [Konecnỳ et al., 2016, Kairouz et al., 2021] which has recently gathered a lot of attention.

artificial intelligence, international conference, machine learning, (12 more...)

2405.20114

Country: Europe > Italy (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)

arXiv.org Machine LearningNov-6-2023

EControl: Fast Distributed Optimization with Compression and Error Control

Gao, Yuan, Islamov, Rustem, Stich, Sebastian

Modern distributed training relies heavily on communication compression to reduce the communication overhead. In this work, we study algorithms employing a popular class of contractive compressors in order to reduce communication overhead. However, the naive implementation often leads to unstable convergence or even exponential divergence due to the compression bias. Error Compensation (EC) is an extremely popular mechanism to mitigate the aforementioned issues during the training of models enhanced by contractive compression operators. Compared to the effectiveness of EC in the data homogeneous regime, the understanding of the practicality and theoretical foundations of EC in the data heterogeneous regime is limited. Existing convergence analyses typically rely on strong assumptions such as bounded gradients, bounded data heterogeneity, or large batch accesses, which are often infeasible in modern machine learning applications. We resolve the majority of current issues by proposing EControl, a novel mechanism that can regulate error compensation by controlling the strength of the feedback signal. We prove fast convergence for EControl in standard strongly convex, general convex, and nonconvex settings without any additional assumptions on the problem or data heterogeneity. We conduct extensive numerical evaluations to illustrate the efficacy of our method and support our theoretical findings.

artificial intelligence, econtrol, machine learning, (18 more...)

2311.05645

Country: Europe (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine LearningOct-31-2023

AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

Islamov, Rustem, Safaryan, Mher, Alistarh, Dan

We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds, as well as data distribution. In these algorithms, workers compute possibly stale and stochastic gradients associated with their local data at some iteration back in history and then return those gradients to the server without synchronizing with other workers. We present a unified convergence theory for non-convex smooth functions in the heterogeneous regime. The proposed analysis provides convergence for pure asynchronous SGD and its various modifications. Moreover, our theory explains what affects the convergence rate and what can be done to improve the performance of asynchronous algorithms. In particular, we introduce a novel asynchronous method based on worker shuffling. As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling and shuffle-once mini-batch SGD. The derived rates match the best-known results for those algorithms, highlighting the tightness of our approach. Finally, our numerical evaluations support theoretical findings and show the good practical performance of our method.

artificial intelligence, machine learning, sgd, (15 more...)

2310.20452

Country: Europe (0.28)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

arXiv.org Artificial IntelligenceMay-30-2023

Clip21: Error Feedback for Gradient Clipping

Khirirat, Sarit, Gorbunov, Eduard, Horváth, Samuel, Islamov, Rustem, Karray, Fakhri, Richtárik, Peter

Motivated by the increasing popularity and importance of large-scale training under differential privacy (DP) constraints, we study distributed gradient methods with gradient clipping, i.e., clipping applied to the gradients computed from local information at the nodes. While gradient clipping is an essential tool for injecting formal DP guarantees into gradient-based methods [1], it also induces bias which causes serious convergence issues specific to the distributed setting. Inspired by recent progress in the error-feedback literature which is focused on taming the bias/error introduced by communication compression operators such as Top-$k$ [2], and mathematical similarities between the clipping operator and contractive compression operators, we design Clip21 -- the first provably effective and practically useful error feedback mechanism for distributed methods with gradient clipping. We prove that our method converges at the same $\mathcal{O}\left(\frac{1}{K}\right)$ rate as distributed gradient descent in the smooth nonconvex regime, which improves the previous best $\mathcal{O}\left(\frac{1}{\sqrt{K}}\right)$ rate which was obtained under significantly stronger assumptions. Our method converges significantly faster in practice than competing methods.

artificial intelligence, machine learning, null 1, (17 more...)

2305.18929

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

arXiv.org Artificial IntelligenceMay-29-2023

Partially Personalized Federated Learning: Breaking the Curse of Data Heterogeneity

Mishchenko, Konstantin, Islamov, Rustem, Gorbunov, Eduard, Horváth, Samuel

We present a partially personalized formulation of Federated Learning (FL) that strikes a balance between the flexibility of personalization and cooperativeness of global training. In our framework, we split the variables into global parameters, which are shared across all clients, and individual local parameters, which are kept private. We prove that under the right split of parameters, it is possible to find global parameters that allow each client to fit their data perfectly, and refer to the obtained problem as overpersonalized. For instance, the shared global parameters can be used to learn good data representations, whereas the personalized layers are fine-tuned for a specific client. Moreover, we present a simple algorithm for the partially personalized formulation that offers significant benefits to all clients. In particular, it breaks the curse of data heterogeneity in several settings, such as training with local steps, asynchronous training, and Byzantine-robust training.

artificial intelligence, machine learning, personalization, (12 more...)

2305.18285

Country: Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)