AITopics

doi: 10.1109/TNNLS.2024.3511670

2412.02291

Country:

Asia > China > Beijing > Beijing (0.05)
Asia > Middle East > Jordan (0.04)
Asia > China > Chongqing Province > Chongqing (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation (0.93)
Leisure & Entertainment > Games > Computer Games (0.68)
Automobiles & Trucks (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

arXiv.org Machine Learning Dec-8-2024

Anytime Acceleration of Gradient Descent

Zhang, Zihan, Lee, Jason D., Du, Simon S., Chen, Yuxin

This work investigates stepsize-based acceleration of gradient descent with anytime convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O(T^{-1.119})$ for any stopping time $T$, where the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem \citep{kornowski2024open} regarding whether stepsize-based acceleration can yield anytime convergence rates of $o(T^{-1})$. We further extend our theory to yield anytime convergence guarantees of $\exp(-\Omega(T/\kappa^{0.893}))$ for smooth and strongly convex optimization, with $\kappa$ being the condition number.
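To make the "anytime" setting concrete, here is a minimal sketch of gradient descent driven by a stepsize schedule that is fixed in advance and does not depend on the stopping time $T$. The quadratic test function, the names `gradient_descent` and `constant_schedule`, and the constant $1/L$ schedule are all assumptions chosen for illustration; the constant schedule is the classical baseline, not the accelerated schedule proposed in the paper.

```python
import itertools

def gradient_descent(grad, x0, stepsizes, num_steps):
    """Run gradient descent, drawing stepsizes from a predetermined iterable."""
    x = list(x0)
    for eta in itertools.islice(stepsizes, num_steps):
        g = grad(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical test problem: f(x) = 0.5 * sum(a_i * x_i^2), smoothness L = max(a_i).
A = [1.0, 0.1, 0.01]
L_SMOOTH = max(A)

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))

def grad_f(x):
    return [a * xi for a, xi in zip(A, x)]

def constant_schedule():
    """Baseline schedule 1/L (an assumption, NOT the paper's schedule)."""
    while True:
        yield 1.0 / L_SMOOTH

x0 = [1.0, 1.0, 1.0]
# The same infinite schedule is reused for every stopping time T.
values = {T: f(gradient_descent(grad_f, x0, constant_schedule(), T))
          for T in (10, 100, 1000)}
```

Because the schedule is an infinite generator, stopping at any $T$ simply truncates the same sequence of stepsizes, which is the property an anytime guarantee refers to.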

artificial intelligence, machine learning, stepsize schedule, (14 more...)

2411.17668

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > Massachusetts (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Dutta, Sanchayan, Sra, Suvrit

Memory-augmented Transformers can implement Linear First-Order Optimization Methods

arXiv.org Artificial Intelligence Dec-8-2024

We show that memory-augmented Transformers (Memformers) can implement linear first-order optimization methods such as conjugate gradient descent, momentum methods, and more generally, methods that linearly combine past gradients. Building on prior work that demonstrates how Transformers can simulate preconditioned gradient descent, we provide theoretical and empirical evidence that Memformers can learn more advanced optimization algorithms. Specifically, we analyze how memory registers in Memformers store suitable intermediate attention values allowing them to implement algorithms such as conjugate gradient. Our results show that Memformers can efficiently learn these methods by training on random linear regression tasks, even learning methods that outperform conjugate gradient. This work extends our knowledge about the algorithmic capabilities of Transformers, showing how they can learn complex optimization methods.
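As an illustration of what "methods that linearly combine past gradients" means, the sketch below runs heavy-ball momentum and then rebuilds the final iterate as the starting point minus a weighted sum of all past gradients. The function names, the quadratic objective, and the parameter values are assumptions for this example; it does not reproduce the Memformer construction itself.

```python
def heavy_ball(grad, x0, eta, beta, num_steps):
    """Heavy-ball momentum; also returns the gradients seen along the way."""
    x = list(x0)
    v = [0.0] * len(x0)
    grads = []
    for _ in range(num_steps):
        g = grad(x)
        grads.append(g)
        v = [beta * vi - eta * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x, grads

def combine_past_gradients(x0, grads, eta, beta):
    """Rebuild the final iterate as x0 minus a linear combination of past gradients."""
    k = len(grads)
    x = list(x0)
    for j, g in enumerate(grads):
        # Gradient j contributes through every later velocity update.
        coeff = sum(beta ** (i - j) for i in range(j, k))
        x = [xi - eta * coeff * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical objective: f(w) = 0.5 * (w0^2 + 4 * w1^2).
def grad_quadratic(w):
    return [w[0], 4.0 * w[1]]

w0 = [1.0, -2.0]
w_final, past_grads = heavy_ball(grad_quadratic, w0, eta=0.1, beta=0.5, num_steps=20)
w_rebuilt = combine_past_gradients(w0, past_grads, eta=0.1, beta=0.5)
```

The two results agree up to floating-point error, which shows that momentum is one member of the linear first-order family named in the abstract.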

artificial intelligence, machine learning, transformer, (16 more...)

2410.07263

Country:

North America > United States > California > Yolo County > Davis (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine Learning Dec-8-2024

Stochastic Gradient Descent Revisited

Louzi, Azar

The advent of artificial intelligence (AI) has been rendered possible by the spectacular acceleration of computing chip capacity over the last few decades, and has driven a technological revolution that has not spared any aspect of life, including healthcare, supply chain management, and social media. AI describes a set of machine learning methods that abandon any form of structural representation of data and look instead into uncovering data patterns to produce probabilistic relationships between input and output quantities of interest. While it has significantly improved people's standards of living, AI has nevertheless engendered many operational risks (e.g. by producing undesirable or unexpected outcomes) as well as systemic risks (e.g. the "Flash Crash", whereby a blue-chip company's share price suddenly plummeted and bounced back in the span of minutes [KL13]). To better manage, prevent and mitigate such risks, some level of mathematical insight must be brought in to shed light onto the inner workings of AI, so that practitioners and regulators alike can act upon it to increase its efficiency and curb its shortcomings. SGD is the engine of AI, making it a natural stepping stone toward mathematically explaining AI. Indeed, to capture their intricacies, machine learning problems are often modeled using wide and highly parametrized neural networks [GBC16], which are then solved using SGD or an adaptive variant thereof, namely Adagrad, Adadelta, RMSProp, Adamax or Adam [Rud17]. To approximate a stationary point of a given loss landscape (also referred to as objective or cost function [LZB22; AL24; AMA05]), SGD recursively spawns a trajectory of iterates by factoring in, at each step, a stochastic gradient modulated by a positive learning rate.
Whereas classical SGD literature provides convergence guarantees and convergence rates within a (strongly) convex framework [Duf96; BV04; RM51], machine learning models are often highly nonconvex and require new SGD frameworks to better understand and parametrize them.
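The SGD recursion described above (each iterate moves along a stochastic gradient scaled by a positive learning rate) can be sketched as follows. The toy least-squares data, the constant learning rate, and all names are assumptions made for this example, which sits in the classical convex setting rather than the nonconvex one the paper targets.

```python
import random

def sgd(stochastic_grad, theta0, learning_rate, num_steps, rng):
    """theta_{k+1} = theta_k - gamma_k * g_k, with g_k a stochastic gradient."""
    theta = list(theta0)
    for k in range(num_steps):
        g = stochastic_grad(theta, rng)
        gamma = learning_rate(k)  # positive learning rate at step k
        theta = [t - gamma * gi for t, gi in zip(theta, g)]
    return theta

# Hypothetical toy data: noiseless linear responses y = <TRUE_THETA, x>.
rng = random.Random(0)
TRUE_THETA = [2.0, -1.0]
DATA = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    y = sum(t * xi for t, xi in zip(TRUE_THETA, x))
    DATA.append((x, y))

def stochastic_grad(theta, rng):
    """Gradient of 0.5 * (<theta, x> - y)^2 at one uniformly sampled data point."""
    x, y = rng.choice(DATA)
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]

theta_hat = sgd(stochastic_grad, [0.0, 0.0], lambda k: 0.5, 2000, rng)
```

Passing the learning rate as a function of the step index keeps the sketch open to decreasing schedules, which is the usual choice when the gradients are noisy at the optimum.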

artificial intelligence, convergence, machine learning, (17 more...)

2412.0607

Genre: Research Report (0.50)

Industry: Banking & Finance > Trading (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Abbe, Emmanuel, Cornacchia, Elisabetta, Hązła, Jan, Kougang-Yombi, Donald

Learning High-Degree Parities: The Crucial Role of the Initialization

arXiv.org Artificial Intelligence Dec-6-2024

Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On the one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities; on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural network learning.
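For readers unfamiliar with the objects involved, the sketch below defines a parity function on $\{-1,+1\}^d$ together with the two kinds of initial weights contrasted in the abstract. The function names are hypothetical, and both the additive form of the Gaussian perturbation and the absence of any weight scaling are assumptions; the paper's exact parametrization may differ.

```python
import random

def parity(x, subset):
    """Parity chi_S(x) = product of x_i over i in S, for x in {-1, +1}^d."""
    value = 1
    for i in subset:
        value *= x[i]
    return value

def rademacher_init(num_weights, rng):
    """Discrete initialization: each weight is -1 or +1 with equal probability."""
    return [rng.choice((-1.0, 1.0)) for _ in range(num_weights)]

def perturbed_rademacher_init(num_weights, sigma, rng):
    """Rademacher weights plus Gaussian noise of standard deviation sigma
    (additive perturbation is an assumption of this sketch)."""
    return [w + rng.gauss(0.0, sigma) for w in rademacher_init(num_weights, rng)]

d = 6
rng = random.Random(0)
x = [rng.choice((-1, 1)) for _ in range(d)]
full_parity = parity(x, range(d))             # degree d (full parity)
almost_full_parity = parity(x, range(d - 1))  # degree d - 1 (almost-full parity)
```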

artificial intelligence, initialization, machine learning, (15 more...)

2412.0491

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
South America > Brazil > São Paulo (0.04)
North America > United States > Massachusetts (0.04)
(3 more...)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Taheri, Hossein, Thrampoulidis, Christos, Mazumdar, Arya

Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

In this paper, we study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation. Our first result is a novel bound on the excess risk of deep networks trained with the logistic loss, via an algorithmic stability analysis. Compared to previous works, our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds. Importantly, the bounds we derive in this paper are tighter, hold even for neural networks of small width, do not scale unfavorably with width, are algorithm-dependent, and consequently capture the role of initialization on the sample complexity of gradient descent for deep nets. Specialized to noiseless data separable with margin $\gamma$ by neural tangent kernel (NTK) features of a network of width $\Omega(\text{poly}(\log(n)))$, we show the test-error rate to be $e^{O(L)}/{\gamma^2 n}$, where $n$ is the training set size and $L$ denotes the number of hidden layers. This is an improvement in the test loss bound compared to previous works while maintaining the poly-logarithmic width conditions. We further investigate excess risk bounds for deep nets trained with noisy data, establishing that under a polynomial condition on the network width, gradient descent can achieve the optimal excess risk. Finally, we show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution. In particular, we show for a one-hidden-layer neural network of constant width $m$ with quadratic activation and standard Gaussian initialization that mini-batch SGD with linear sample complexity and with a large step-size $\eta=m$ reaches perfect test accuracy after only $\lceil\log(d)\rceil$ iterations, where $d$ is the data dimension.

artificial intelligence, initialization, machine learning, (18 more...)

2410.10024

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > Canada > British Columbia (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Ortega, Tomas, Huang, Chun-Yin, Li, Xiaoxiao, Jafarkhani, Hamid

Communication Compression for Distributed Learning without Control Variates

arXiv.org Artificial Intelligence Dec-5-2024

Distributed learning algorithms, such as the ones employed in Federated Learning (FL), require communication compression to reduce the cost of client uploads. The compression methods used in practice are often biased and require error feedback to achieve convergence when the compression is aggressive. In turn, error feedback requires client-specific control variates, which directly contradicts privacy-preserving principles and requires stateful clients. In this paper, we propose Compressed Aggregate Feedback (CAFe), a novel distributed learning framework that allows highly compressible client updates by exploiting past aggregated updates, and does not require control variates. We consider Distributed Gradient Descent (DGD) as a representative algorithm and provide a theoretical proof of CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-smooth regime with bounded gradient dissimilarity. Experimental results confirm that CAFe consistently outperforms distributed learning with direct compression and highlight the compressibility of the client updates with CAFe.
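The contrast between direct compression and compression that exploits a past aggregated update can be sketched as below. This is one plausible reading of the abstract, not the authors' algorithm: it assumes each client compresses the difference between its update and the previous aggregate, and that the server adds the previous aggregate back. The top-k compressor and all function names are likewise assumptions.

```python
def top_k(vec, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def mean(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def direct_round(client_updates, k):
    """Direct compression: every client compresses its raw update."""
    return mean([top_k(u, k) for u in client_updates])

def aggregate_feedback_round(client_updates, prev_aggregate, k):
    """Assumed scheme: compress (update - previous aggregate), then add it back."""
    compressed = [
        top_k([ui - pi for ui, pi in zip(u, prev_aggregate)], k)
        for u in client_updates
    ]
    return [pi + ci for pi, ci in zip(prev_aggregate, mean(compressed))]

# Hypothetical round: client updates stay close to the previous aggregate.
prev_aggregate = [1.0, 1.0, 1.0, 1.0]
client_updates = [[1.0, 1.0, 1.0, 1.5], [1.0, 1.0, 0.5, 1.0]]
exact = mean(client_updates)
direct = direct_round(client_updates, k=1)
with_feedback = aggregate_feedback_round(client_updates, prev_aggregate, k=1)
```

No per-client state is kept in this sketch: the only memory is the previous aggregate, which the server already broadcasts, in line with the abstract's claim that control variates are not required.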

artificial intelligence, compression, machine learning, (14 more...)

2412.04538

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Orange County > Irvine (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

De Gregorio, Alessandro, Iafrate, Francesco

Pathwise optimization for bridge-type estimators and its applications

Sparse parametric models are of great interest in statistical learning and are often analyzed by means of regularized estimators. Pathwise methods make it possible to compute efficiently the full solution path for penalized estimators, for any possible value of the penalization parameter $\lambda$. In this paper we deal with the pathwise optimization for bridge-type problems; i.e. we are interested in the minimization of a loss function, such as negative log-likelihood or residual sum of squares, plus the sum of $\ell^q$ norms with $q\in(0,1]$ involving adaptive coefficients. For some loss functions this regularization achieves asymptotically the oracle properties (such as the selection consistency). Nevertheless, since the objective function involves nonconvex and nondifferentiable terms, the minimization problem is computationally challenging. The aim of this paper is to apply some general algorithms, arising from nonconvex optimization theory, to compute efficiently the path solutions for the adaptive bridge estimator with multiple penalties. In particular, we take into account two different approaches: accelerated proximal gradient descent and blockwise alternating optimization. The convergence and the path consistency of these algorithms are discussed. In order to assess our methods, we apply these algorithms to the penalized estimation of diffusion processes observed at discrete times. The latter is a recent research topic in the field of statistics for time-dependent data.
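A minimal sketch of the proximal gradient idea is given below, restricted to the convex boundary case $q=1$, where the proximal step is soft-thresholding, and run over a decreasing grid of $\lambda$ values with warm starts to trace a solution path. It is neither accelerated nor applicable to $q<1$, and the toy loss, the weights, and all names are assumptions rather than the authors' algorithm.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z| (the q = 1 case only)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def proximal_gradient(grad_loss, x0, step, penalties, num_steps):
    """Minimize loss(x) + sum_j penalties[j] * |x_j| by proximal gradient steps."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_loss(x)
        x = [soft_threshold(xi - step * gi, step * lam)
             for xi, gi, lam in zip(x, g, penalties)]
    return x

# Hypothetical loss 0.5 * ||x - B||^2 with adaptive weights on the penalty.
B = [3.0, 0.2, -1.0]
WEIGHTS = [1.0, 1.0, 1.0]

def grad_loss(x):
    return [xi - bi for xi, bi in zip(x, B)]

# Pathwise computation: sweep lambda downward, warm-starting each solve.
path = {}
x_warm = [0.0, 0.0, 0.0]
for lam in (2.0, 1.0, 0.5):
    penalties = [lam * w for w in WEIGHTS]
    x_warm = proximal_gradient(grad_loss, x_warm, 0.5, penalties, 100)
    path[lam] = x_warm
```

Warm starts are the reason pathwise methods are cheap: neighbouring values of $\lambda$ have nearby solutions, so each solve begins close to its answer.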

alessandro de gregorio, algorithm, estimator, (14 more...)

2412.04047

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Lazio > Rome (0.04)
Europe > Germany > Hamburg (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Wei, Dennis, Padhi, Inkit, Ghosh, Soumya, Dhurandhar, Amit, Ramamurthy, Karthikeyan Natesan, Chang, Maria

Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods

Training data attribution (TDA) is the task of attributing model behavior to elements in the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training. To serve as a gold standard for TDA in this "final-model-only" setting, we propose further training, with appropriate adjustment and averaging, to measure the sensitivity of the given model to training instances. We then unify existing gradient-based methods for TDA by showing that they all approximate the further training gold standard in different ways. We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.

2412.03906

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(4 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

Blanchet, Jose, Mijatović, Aleksandar, Yang, Wenhao

Limit Theorems for Stochastic Gradient Descent with Infinite Variance

Stochastic gradient descent is a classic algorithm that has gained great popularity, especially in the last decades, as the most common approach for training models in machine learning. While the algorithm has been well-studied when stochastic gradients are assumed to have a finite variance, there is significantly less research addressing its theoretical properties in the case of infinite variance gradients. In this paper, we establish the asymptotic behavior of stochastic gradient descent in the context of infinite variance stochastic gradients, assuming that the stochastic gradient is regularly varying with index $\alpha\in(1,2)$. The closest result in this context was established in 1969, in the one-dimensional case and assuming that stochastic gradients belong to a more restrictive class of distributions. We extend it to the multidimensional case, covering a broader class of infinite variance distributions. As we show, the asymptotic distribution of the stochastic gradient descent algorithm can be characterized as the stationary distribution of a suitably defined Ornstein-Uhlenbeck process driven by an appropriate stable Lévy process. Additionally, we explore the applications of these results in linear regression and logistic regression models.

converge, exp, lemma 7, (14 more...)

2410.1634

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
North America > United States > Illinois (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)