Reddi, Sashank
Structured Preconditioners in Adaptive Optimization: A Unified Analysis
Xie, Shuo, Wang, Tianhao, Reddi, Sashank, Kumar, Sanjiv, Li, Zhiyuan
We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners, for both online regret minimization and offline convex optimization. Our analysis not only provides matching rates for several important structured preconditioned algorithms, including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of the original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal AdaGrad and AdaGrad-Norm, which use less space and compute) are often presented as computationally efficient approximations to full-matrix AdaGrad, with the aim of improving optimization through better approximation. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is substantially cheaper than full-matrix AdaGrad, can outperform it both theoretically and experimentally.
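To make the distinction concrete, here is a minimal numpy sketch contrasting a diagonal AdaGrad update with a one-sided, Shampoo-style update on a single matrix parameter. The toy least-squares objective, step sizes, and damping constant eps are illustrative assumptions, not the paper's exact algorithms or experiments.

```python
# Minimal sketch: diagonal AdaGrad vs. a one-sided Shampoo-style preconditioner
# for one matrix parameter W.  Hyperparameters and the toy objective are
# illustrative only.
import numpy as np

def diag_adagrad_step(W, G, state, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate preconditioning."""
    state["h"] = state.get("h", np.zeros_like(W)) + G ** 2      # accumulate squared grads
    return W - lr * G / (np.sqrt(state["h"]) + eps)

def one_sided_shampoo_step(W, G, state, lr=0.1, eps=1e-8):
    """One-sided Shampoo: a single Kronecker factor on the left (row space)."""
    m = W.shape[0]
    state["L"] = state.get("L", eps * np.eye(m)) + G @ G.T      # accumulate left statistic
    evals, evecs = np.linalg.eigh(state["L"])                   # inverse square root via eigh
    L_inv_sqrt = (evecs * (1.0 / np.sqrt(np.maximum(evals, eps)))) @ evecs.T
    return W - lr * L_inv_sqrt @ G

# Toy quadratic: minimize 0.5 * ||A W - B||_F^2.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(20, 8)), rng.normal(size=(20, 5))
W1, W2, s1, s2 = np.zeros((8, 5)), np.zeros((8, 5)), {}, {}
for _ in range(200):
    W1 = diag_adagrad_step(W1, A.T @ (A @ W1 - B), s1)
    W2 = one_sided_shampoo_step(W2, A.T @ (A @ W2 - B), s2)
print("diag AdaGrad loss:", 0.5 * np.linalg.norm(A @ W1 - B) ** 2)
print("one-sided Shampoo loss:", 0.5 * np.linalg.norm(A @ W2 - B) ** 2)
```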
Efficient Stagewise Pretraining via Progressive Subnetworks
Panigrahi, Abhishek, Saunshi, Nikunj, Lyu, Kaifeng, Miryoosefi, Sobhan, Reddi, Sashank, Kale, Satyen, Kumar, Sanjiv
Recent developments in large language models have sparked interest in efficient pretraining methods. A recent effective paradigm is stage-wise training, where the size of the model is gradually increased over the course of training (e.g., gradual stacking (Reddi et al., 2023)). While the resource and wall-time savings are appealing, this approach has limitations, notably the inability to evaluate the full model during earlier stages and degraded model quality due to the smaller model capacity in the initial stages. In this work, we propose an alternative framework, progressive subnetwork training, that maintains the full model throughout training but only trains subnetworks within the model in each step. We focus on a simple instantiation of this framework, Random Path Training (RaPTr), which trains only a sub-path of layers in each step, progressively increasing the path length across stages. RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs than standard training, and is competitive with or better than other efficient training methods. Furthermore, RaPTr shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1-5% compared to standard training and stacking. Finally, we provide a theoretical basis for RaPTr that justifies (a) the increasing complexity of subnetworks across stages, and (b) the stability in loss across stage transitions due to residual connections and layer norm.
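A minimal numpy sketch of the sub-path sampling idea described above: each step activates only a random subset of residual layers (skipped layers act as the identity), and the path length grows across stages. The tiny residual network and the stage schedule are illustrative assumptions, and no actual training is performed here.

```python
# Sketch of RaPTr-style sub-path sampling over a stack of residual blocks.
# The network, schedule, and forward pass are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 6, 16
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]

def forward(x, active):
    """Residual forward pass that skips inactive layers entirely."""
    h = x
    for W, keep in zip(weights, active):
        if keep:
            h = h + np.tanh(h @ W)      # residual block; a skipped layer is the identity
    return h

stages = [2, 4, 6]                      # active path length grows from 2 layers to all 6
x = rng.normal(size=(4, d))
for path_len in stages:
    for step in range(3):
        idx = rng.choice(n_layers, size=path_len, replace=False)   # random sub-path
        active = np.isin(np.arange(n_layers), idx)
        out = forward(x, active)
        print(f"path_len={path_len} step={step} output norm={np.linalg.norm(out):.3f}")
```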
The Inductive Bias of Flatness Regularization for Deep Matrix Factorization
Gatmiry, Khashayar, Li, Zhiyuan, Chuang, Ching-Yao, Reddi, Sashank, Ma, Tengyu, Jegelka, Stefanie
Recent works on over-parameterized neural networks have shown that the stochasticity in optimizers has the implicit regularization effect of minimizing the sharpness of the loss function (in particular, the trace of its Hessian) over the family of zero-loss solutions. More explicit forms of flatness regularization also empirically improve generalization performance. However, it remains unclear why and when flatness regularization leads to better generalization. This work takes the first step toward understanding the inductive bias of minimum-Hessian-trace solutions in an important setting: learning deep linear networks from linear measurements, also known as \emph{deep matrix factorization}. We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameter (i.e., the product of all layer matrices), which in turn leads to better generalization. We empirically verify our theoretical findings on synthetic datasets.
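A small numpy sketch of the two objects the result connects: a deep linear network's end-to-end matrix (the product of all layer matrices) and its Schatten 1-norm (nuclear norm), which the paper shows is approximately what trace-of-Hessian flatness regularization minimizes over zero-loss solutions. The random layers and dimensions below are purely illustrative.

```python
# End-to-end matrix of a deep linear network and its Schatten 1-norm.
import numpy as np

rng = np.random.default_rng(0)
depth, dims = 3, [10, 8, 8, 10]            # W3 @ W2 @ W1 maps R^10 -> R^10
layers = [rng.normal(scale=0.3, size=(dims[i + 1], dims[i])) for i in range(depth)]

E = layers[0]
for W in layers[1:]:
    E = W @ E                              # end-to-end matrix: product of all layers

nuclear_norm = np.linalg.norm(E, ord="nuc")   # Schatten 1-norm = sum of singular values
print("end-to-end shape:", E.shape, " Schatten 1-norm:", round(float(nuclear_norm), 3))
```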
Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces
Rawat, Ankit Singh, Menon, Aditya Krishna, Jitkrittum, Wittawat, Jayasumana, Sadeep, Yu, Felix X., Reddi, Sashank, Kumar, Sanjiv
Classification problems with a large number of labels arise in language modelling [Mikolov et al., 2013, Levy and Goldberg, 2014], recommender systems [Covington et al., 2016, Xu et al., 2016], and information retrieval [Agrawal et al., 2013, Prabhu and Varma, 2014]. Such large-output problems pose a core challenge: losses such as the softmax cross-entropy can be prohibitive to optimise, as they depend on the entire set of labels. Several works have thus devised negative sampling schemes for efficiently and effectively approximating such losses [Bengio and Senecal, 2008, Blanc and Rendle, 2018, Ruiz et al., 2018, Bamler and Mandt, 2020]. Broadly, negative sampling techniques sample a subset of "negative" labels, which are used to contrast against the observed "positive" labels. One further applies a suitable weighting on these "negatives", which ostensibly corrects the sampling bias introduced by the dependence on a random subset of labels. Intuitively, such bias assesses how closely a scheme approximates the unsampled loss on the full label set. This bias is well understood for sampled softmax schemes (see, e.g., Bengio and Senecal [2008]); surprisingly, however, far less is understood about other popular schemes, e.g., within-batch and uniform sampling.
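A minimal numpy sketch of the kind of sampled loss discussed here: a sampled softmax over a few "negative" labels, using the standard logit correction logit - log q(y) that removes the bias introduced by a proposal distribution q (in the spirit of Bengio and Senecal [2008]). The label space size, the Zipf-like proposal, and the single-example setup are illustrative assumptions.

```python
# Sampled softmax with the standard proposal-probability bias correction.
import numpy as np

rng = np.random.default_rng(0)
num_labels, num_neg, dim = 1000, 20, 32
W = rng.normal(size=(num_labels, dim))          # label embeddings / classifier weights
x = rng.normal(size=(dim,))                     # one example's representation
pos = 7                                         # its observed positive label

# Zipf-like proposal: frequent labels are sampled more often.
p = 1.0 / np.arange(1, num_labels + 1)
p /= p.sum()
negs = rng.choice(num_labels, size=num_neg, replace=False, p=p)
negs = negs[negs != pos]                        # negatives exclude the positive
cand = np.concatenate(([pos], negs))
q = p[cand]                                     # proposal probability of each candidate

logits = W[cand] @ x - np.log(q)                # subtracting log q corrects the sampling bias
m = logits.max()
log_z = m + np.log(np.sum(np.exp(logits - m)))  # log-sum-exp over sampled candidates only
sampled_loss = -(logits[0] - log_z)             # positive label sits at index 0
print("sampled softmax loss:", round(float(sampled_loss), 4))
```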
Federated Composite Optimization
Yuan, Honglin, Zaheer, Manzil, Reddi, Sashank
Federated Learning (FL) is a distributed learning paradigm that scales on-device learning collaboratively and privately. Standard FL algorithms such as Federated Averaging (FedAvg) are primarily geared towards smooth, unconstrained settings. In this paper, we study the Federated Composite Optimization (FCO) problem, where the objective function in FL includes an additive (possibly) non-smooth component. Such optimization problems are fundamental to machine learning and arise naturally in the context of regularization (e.g., sparsity, low rank, monotonicity, and constraints). To tackle this problem, we propose different primal/dual averaging approaches and study their communication and computation complexities. Of particular interest is Federated Dual Averaging (FedDualAvg), a federated variant of the dual averaging algorithm. FedDualAvg uses a novel double averaging procedure, which combines the gradient averaging step of standard dual averaging with an averaging of client updates akin to standard federated averaging. Our theoretical analysis and empirical experiments demonstrate that FedDualAvg outperforms baselines for FCO.
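A heavily simplified numpy sketch in the spirit of the double averaging described above, for an L1-regularized least-squares problem: clients accumulate gradients in a local dual state, the server averages those dual states, and the primal point is recovered with a soft-thresholding (prox) step. The step size, local-step count, and this exact simplified form are illustrative assumptions, not the paper's precise FedDualAvg procedure.

```python
# Simplified federated dual averaging sketch for min_x sum_i 0.5*||A_i x - b_i||^2 + lam*||x||_1.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, local_steps, rounds = 10, 4, 5, 20
lam, eta = 0.05, 0.005

A = [rng.normal(size=(30, d)) for _ in range(n_clients)]
b = [a @ (rng.normal(size=d) * (rng.random(d) < 0.3)) for a in A]   # sparse ground truth

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

z_server, t = np.zeros(d), 0
for _ in range(rounds):
    z_clients = []
    for i in range(n_clients):
        z, tt = z_server.copy(), t
        for _ in range(local_steps):
            x = soft_threshold(-eta * z, eta * (tt + 1) * lam)   # primal from dual state
            z += A[i].T @ (A[i] @ x - b[i])                      # accumulate local gradient
            tt += 1
        z_clients.append(z)
    z_server = np.mean(z_clients, axis=0)                        # server averages dual states
    t += local_steps

x = soft_threshold(-eta * z_server, eta * (t + 1) * lam)
print("nonzeros in solution:", int(np.count_nonzero(np.abs(x) > 1e-6)))
```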
Learning to Learn by Zeroth-Order Oracle
Ruan, Yangjun, Xiong, Yuanhao, Reddi, Sashank, Kumar, Sanjiv, Hsieh, Cho-Jui
In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient via a ZO gradient estimator and then produces parameter updates using the knowledge of previous iterations. To reduce the high variance introduced by the ZO gradient estimator, we further introduce another RNN that learns the Gaussian sampling rule and dynamically guides the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, one of the most widely used ZO optimization tasks). Finally, we conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer. Learning to learn (L2L) is a recently proposed meta-learning framework in which deep neural networks are leveraged to learn optimization algorithms automatically. The most common choice for the learned optimizer is a recurrent neural network (RNN), since it can capture long-term dependencies and propose parameter updates based on knowledge of previous iterations. By training RNN optimizers on predefined optimization tasks, the optimizers learn to explore the loss landscape and adaptively choose descent directions and steps (Lv et al., 2017).
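A minimal numpy sketch of the zeroth-order gradient estimator that such optimizers build on: a two-point Gaussian-smoothing estimate that queries only function values. The quadratic test function, smoothing radius mu, and query count q are illustrative assumptions; the paper's learned RNN components are not reproduced here.

```python
# Two-point Gaussian-smoothing ZO gradient estimate, used inside plain ZO gradient descent.
import numpy as np

rng = np.random.default_rng(0)
d = 20
A_diag = np.linspace(1.0, 10.0, d)
f = lambda x: 0.5 * np.sum(A_diag * x ** 2)      # black-box objective (values only)

def zo_gradient(f, x, mu=1e-3, q=10):
    """Average q two-point Gaussian-smoothing estimates of grad f(x)."""
    g = np.zeros_like(x)
    fx = f(x)
    for _ in range(q):
        u = rng.normal(size=x.shape)             # Gaussian query direction
        g += (f(x + mu * u) - fx) / mu * u
    return g / q

x = rng.normal(size=d)
for step in range(200):
    x -= 0.02 * zo_gradient(f, x)                # descend using only function queries
print("final objective value:", round(float(f(x)), 5))
```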
Adaptive Methods for Nonconvex Optimization
Zaheer, Manzil, Reddi, Sashank, Sachan, Devendra, Kale, Satyen, Kumar, Sanjiv
Adaptive gradient methods that rely on scaling gradients down by the square root of exponential moving averages of past squared gradients, such as RMSProp, Adam, and Adadelta, have found wide application in optimizing the nonconvex problems that arise in deep learning. However, it has recently been demonstrated that such methods can fail to converge even in simple convex optimization settings. In this work, we provide a new analysis of such methods applied to nonconvex stochastic optimization problems, characterizing the effect of increasing minibatch size. Our analysis shows that under this scenario such methods do converge to stationarity up to the statistical limit of variance in the stochastic gradients (scaled by a constant factor). In particular, our result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent the non-convergence issues. Furthermore, we provide a new adaptive optimization algorithm, Yogi, which controls the increase in effective learning rate, leading to even better performance with similar theoretical guarantees on convergence. Extensive experiments show that Yogi, with very little hyperparameter tuning, outperforms methods such as Adam on several challenging machine learning tasks.
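A minimal numpy sketch of Yogi's second-moment update next to Adam's, on a toy quadratic: Yogi replaces Adam's exponential moving average of squared gradients with an additive, sign-controlled update so the effective learning rate cannot grow too quickly. The hyperparameters below are common defaults chosen for illustration, not the paper's experimental settings.

```python
# Adam vs. Yogi second-moment accumulators on a toy quadratic 0.5*||x||^2.
import numpy as np

def adam_v(v, g, beta2=0.999):
    return beta2 * v + (1 - beta2) * g ** 2                     # exponential moving average

def yogi_v(v, g, beta2=0.999):
    return v - (1 - beta2) * np.sign(v - g ** 2) * g ** 2       # additive, sign-controlled update

rng = np.random.default_rng(0)
d = 5
x_a = x_y = rng.normal(size=d)
m_a = v_a = m_y = v_y = np.zeros(d)
lr, beta1, eps = 0.01, 0.9, 1e-3
for t in range(500):
    g_a, g_y = x_a.copy(), x_y.copy()                           # grad of 0.5*||x||^2 is x
    m_a = beta1 * m_a + (1 - beta1) * g_a
    v_a = adam_v(v_a, g_a)
    x_a = x_a - lr * m_a / (np.sqrt(v_a) + eps)
    m_y = beta1 * m_y + (1 - beta1) * g_y
    v_y = yogi_v(v_y, g_y)
    x_y = x_y - lr * m_y / (np.sqrt(v_y) + eps)
print("Adam ||x||:", round(float(np.linalg.norm(x_a)), 4),
      " Yogi ||x||:", round(float(np.linalg.norm(x_y)), 4))
```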
Large-scale randomized-coordinate descent methods with non-separable linear constraints
Reddi, Sashank, Hefny, Ahmed, Downey, Carlton, Dubey, Avinava, Sra, Suvrit
We develop randomized (block) coordinate descent (CD) methods for linearly constrained convex optimization. Unlike most CD methods, we do not assume the constraints to be separable, but let them be coupled linearly. To our knowledge, ours is the first CD method that allows linear coupling constraints without making the global iteration complexity depend exponentially on the number of constraints. We present algorithms and analysis for four key problem scenarios: (i) smooth; (ii) smooth + nonsmooth separable; (iii) asynchronous parallel; and (iv) stochastic. We illustrate the empirical behavior of our algorithms through simulation experiments.
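A minimal numpy sketch of the core difficulty with a coupling constraint: with a single constraint sum(x) = c, any update along e_i - e_j keeps the iterate feasible, so each step picks a random pair of coordinates and minimizes along that feasible direction. The quadratic objective and this pairwise special case are illustrative assumptions; the paper's algorithms handle general linear coupling constraints and the smooth, composite, asynchronous, and stochastic settings.

```python
# Randomized pairwise coordinate descent under the coupling constraint sum(x) = c.
import numpy as np

rng = np.random.default_rng(0)
d = 30
Q = np.diag(np.linspace(1.0, 5.0, d))            # convex quadratic objective 0.5 * x^T Q x
c = 1.0

x = np.full(d, c / d)                            # feasible start: sum(x) = c
for step in range(5000):
    i, j = rng.choice(d, size=2, replace=False)  # random coordinate pair
    g = Q @ x
    delta = g[i] - g[j]                          # directional derivative along e_i - e_j
    lr = 1.0 / (Q[i, i] + Q[j, j])               # exact minimizer of the 1-D subproblem
    x[i] -= lr * delta                           # this pairwise move preserves sum(x) = c
    x[j] += lr * delta
print("objective:", round(float(0.5 * x @ Q @ x), 6), " sum(x):", round(float(x.sum()), 6))
```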