AITopics

This paper focuses on decentralized stochastic bilevel optimization (DSBO) where agents only communicate with their neighbors. We propose Decentralized Stochastic Gradient Descent and Ascent with Gradient Tracking (DSGDA-GT), a novel algorithm that only requires first-order oracles that are much cheaper than second-order oracles widely adopted in existing works. We further provide a finite-time convergence analysis showing that for $n$ agents collaboratively solving the DSBO problem, the sample complexity of finding an $\epsilon$-stationary point in our algorithm is $\mathcal{O}(n^{-1}\epsilon^{-7})$, which matches the currently best-known results of the single-agent counterpart with linear speedup. The numerical experiments demonstrate both the communication and training efficiency of our algorithm.

artificial intelligence, deep learning, machine learning, (14 more...)

2410.19319

Country:

North America > United States > Illinois > Champaign County > Urbana (0.04)
North America > United States > California > Yolo County > Davis (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Maes, Lucas, Zhang, Tianyue H., Jolicoeur-Martineau, Alexia, Mitliagkas, Ioannis, Scieur, Damien, Lacoste-Julien, Simon, Guille-Escuret, Charles

Understanding Adam Requires Better Rotation Dependent Assumptions

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.

large language model, machine learning, natural language, (18 more...)

2410.19964

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Privacy without Noisy Gradients: Slicing Mechanism for Generative Model Training

Greenewald, Kristjan, Yu, Yuancheng, Wang, Hao, Xu, Kai

Training generative models with differential privacy (DP) typically involves injecting noise into gradient updates or adapting the discriminator's training procedure. As a result, such approaches often struggle with hyper-parameter tuning and convergence. We consider the slicing privacy mechanism that injects noise into random low-dimensional projections of the private data, and provide strong privacy guarantees for it. These noisy projections are used for training generative models. To enable optimizing generative models using this DP approach, we introduce the smoothed-sliced $f$-divergence and show it enjoys statistical consistency. Moreover, we present a kernel-based estimator for this divergence, circumventing the need for adversarial training. Extensive numerical experiments demonstrate that our approach can generate synthetic data of higher quality compared with baselines. Beyond performance improvement, our method, by sidestepping the need for noisy gradients, offers data scientists the flexibility to adjust generator architecture and hyper-parameters, run the optimization over any number of epochs, and even restart the optimization process -- all without incurring additional privacy costs.

artificial intelligence, machine learning, natural language, (16 more...)

2410.19941

Country:

Asia > Middle East > Israel (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(6 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Gradient Descent Efficiency Index

Dhingra, Aviral

Gradient descent is a widely used iterative algorithm for finding local minima in multivariate functions. However, the final iterations often either overshoot the minima or make minimal progress, making it challenging to determine an optimal stopping point. This study introduces a new efficiency metric, Ek, designed to quantify the effectiveness of each iteration. The proposed metric accounts for both the relative change in error and the stability of the loss function across iterations. This measure is particularly valuable in resource-constrained environments, where costs are closely tied to training time. Experimental validation across multiple datasets and models demonstrates that Ek provides valuable insights into the convergence behavior of gradient descent, complementing traditional performance metrics. The index has the potential to guide more informed decisions in the selection and tuning of optimization algorithms in machine learning applications and be used to compare the "effectiveness" of models relative to each other.

artificial intelligence, iteration, machine learning, (14 more...)

2410.19448

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)

Mehmood, Sheheryar, Ochs, Peter

Automatic Differentiation of Optimization Algorithms with Time-Varying Updates

Numerous Optimization Algorithms have a time-varying update rule thanks to, for instance, a changing step size, momentum parameter or, Hessian approximation. In this paper, we apply unrolled or automatic differentiation to a time-varying iterative process and provide convergence (rate) guarantees for the resulting derivative iterates. We adapt these convergence results and apply them to proximal gradient descent with variable step size and FISTA when solving partly smooth problems. We confirm our findings numerically by solving $\ell_1$ and $\ell_2$-regularized linear and logisitc regression respectively. Our theoretical and numerical results show that the convergence rate of the algorithm is reflected in its derivative iterates.

artificial intelligence, convergence, machine learning, (14 more...)

2410.15923

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York (0.04)
(9 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

arXiv.org Machine LearningOct-24-2024

A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities

Dandi, Yatin, Pesce, Luca, Cui, Hugo, Krzakala, Florent, Lu, Yue M., Loureiro, Bruno

A key property of neural networks is their capacity of adapting to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remain limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training in the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum modify with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Different from previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous and allows for finitely supported second layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.

artificial intelligence, machine learning, matrix, (17 more...)

arXiv.org Machine Learning

2410.18938

Country:

North America > United States (0.27)
Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.54)

Industry:

Education (0.67)
Government > Regional Government > North America Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Chen, Naitong, Huggins, Jonathan H., Campbell, Trevor

Tuning-free coreset Markov chain Monte Carlo

A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), Figure 1: Relative Coreset MCMC posterior approximation for training coreset weights in Coreset MCMC error (average squared coordinate-wise z-score) without user tuning effort. Empirical results using ADAM with different learning rates versus the demonstrate that Hot DoG provides higher proposed Hot DoG method (with fixed r = 0.001). Median quality posterior approximations than other values after 200,000 optimization iterations across learning-rate-free stochastic gradient methods, 10 trials are used for the relative comparison for a variety and performs competitively to optimallytuned of datasets, models, and coreset sizes.

artificial intelligence, machine learning, regression, (16 more...)

2410.18973

Country: North America > Canada > British Columbia (0.04)

Genre: Research Report (0.67)

Industry: Consumer Products & Services (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Yau, Chung-Yiu, Liu, Haoming, Wai, Hoi-To

Fully Stochastic Primal-dual Gradient Algorithm for Non-convex Optimization on Random Graphs

Stochastic decentralized optimization algorithms often suffer from issues such as synchronization overhead and intermittent communication. This paper proposes a $\underline{\rm F}$ully $\underline{\rm S}$tochastic $\underline{\rm P}$rimal $\underline{\rm D}$ual gradient $\underline{\rm A}$lgorithm (FSPDA) that suggests an asynchronous decentralized procedure with (i) sparsified non-blocking communication on random undirected graphs and (ii) local stochastic gradient updates. FSPDA allows multiple local gradient steps to accelerate convergence to stationarity while finding a consensual solution with stochastic primal-dual updates. For problems with smooth (possibly non-convex) objective function, we show that FSPDA converges to an $\mathrm{\mathcal{O}( {\it \sigma /\sqrt{nT}} )}$-stationary solution after $\mathrm{\it T}$ iterations without assuming data heterogeneity. The performance of FSPDA is on par with state-of-the-art algorithms whose convergence depend on static graph and synchronous updates. To our best knowledge, FSPDA is the first asynchronous algorithm that converges exactly under the non-convex setting. Numerical experiments are presented to show the benefits of FSPDA.

algorithm, artificial intelligence, machine learning, (18 more...)

2410.18774

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Massachusetts (0.04)
Europe > Italy (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

A Historical Trajectory Assisted Optimization Method for Zeroth-Order Federated Learning

Wu, Chenlin, He, Xiaoyu, Li, Zike, Gong, Jing, Zheng, Zibin

Federated learning heavily relies on distributed gradient descent techniques. In the situation where gradient information is not available, the gradients need to be estimated from zeroth-order information, which typically involves computing finite-differences along isotropic random directions. This method suffers from high estimation errors, as the geometric features of the objective landscape may be overlooked during the isotropic sampling. In this work, we propose a non-isotropic sampling method to improve the gradient estimation procedure. Gradients in our method are estimated in a subspace spanned by historical trajectories of solutions, aiming to encourage the exploration of promising regions and hence improve the convergence. The proposed method uses a covariance matrix for sampling which is a convex combination of two parts. The first part is a thin projection matrix containing the basis of the subspace which is designed to improve the exploitation ability. The second part is the historical trajectories. We implement this method in zeroth-order federated settings, and show that the convergence rate aligns with existing ones while introducing no significant overheads in communication or local computation. The effectiveness of our proposal is verified on several numerical experiments in comparison to several commonly-used zeroth-order federated optimization algorithms.

artificial intelligence, machine learning, optimization problem, (16 more...)

2409.15955

Country: Asia > China > Guangdong Province > Zhuhai (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Arous, Gérard Ben, Gerbelot, Cédric, Piccolo, Vanessa

Stochastic gradient descent in high dimensions for multi-spiked tensor PCA

arXiv.org Machine LearningOct-23-2024

We study the dynamics in high dimensions of online stochastic gradient descent for the multi-spiked tensor model. This multi-index model arises from the tensor principal component analysis (PCA) problem with multiple spikes, where the goal is to estimate $r$ unknown signal vectors within the $N$-dimensional unit sphere through maximum likelihood estimation from noisy observations of a $p$-tensor. We determine the number of samples and the conditions on the signal-to-noise ratios (SNRs) required to efficiently recover the unknown spikes from natural random initializations. We show that full recovery of all spikes is possible provided a number of sample scaling as $N^{p-2}$, matching the algorithmic threshold identified in the rank-one case [Ben Arous, Gheissari, Jagannath 2020, 2021]. Our results are obtained through a detailed analysis of a low-dimensional system that describes the evolution of the correlations between the estimators and the spikes, while controlling the noise in the dynamics. We find that the spikes are recovered sequentially in a process we term "sequential elimination": once a correlation exceeds a critical threshold, all correlations sharing a row or column index become sufficiently small, allowing the next correlation to grow and become macroscopic. The order in which correlations become macroscopic depends on their initial values and the corresponding SNRs, leading to either exact recovery or recovery of a permutation of the spikes. In the matrix case, when $p=2$, if the SNRs are sufficiently separated, we achieve exact recovery of the spikes, whereas equal SNRs lead to recovery of the subspace spanned by the spikes.

artificial intelligence, machine learning, recovery, (15 more...)

arXiv.org Machine Learning

2410.18162

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.25)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)