Collaborating Authors

 Ahn, Kwangjun


General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization

arXiv.org Machine Learning

Training large-scale neural network models, such as large language models, requires a well-designed optimization strategy to ensure stable and fast convergence. For instance, training typically requires a carefully designed optimizer, such as the Adam optimizer [Kingma and Ba, 2014], along with meticulously tuned learning rate scheduling. Recently, Defazio et al. [2024] introduced the schedule-free method, which achieves impressive training performance without any need for learning rate scheduling. In brief, the schedule-free method is an add-on scheme that can be applied to any chosen base optimizer, converting it into a schedule-free variant. While this method has shown strong empirical performance in training large neural network models, its theoretical analysis has, to date, been limited to the convex setting [Defazio et al., 2024]. Our aim is to extend the theoretical understanding of schedule-free methods to nonconvex optimization. In this work, as an initial step, we focus on the version where the base optimizer is chosen as SGD, referred to as schedule-free SGD.
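
As a point of reference, the schedule-free recurrence of Defazio et al. [2024] with SGD as the base optimizer can be written in a few lines. The sketch below is a minimal illustration on a toy quadratic; the objective, step size, and interpolation weight are assumptions chosen for demonstration, not values from the paper. The key structural feature is that the gradient is queried at an interpolation y_t between the base iterate z_t and the running average x_t, which is what removes the need for a decaying schedule.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.1, beta=0.9, steps=1000):
    """Schedule-free SGD (Defazio et al., 2024): gradients are queried at an
    interpolation y_t of the base iterate z_t and the running average x_t."""
    z = x = np.asarray(x0, dtype=float)
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x      # gradient-query point
        z = z - lr * grad(y)               # base SGD step with a CONSTANT lr
        x = (1 - 1 / t) * x + (1 / t) * z  # online (Polyak-style) average
    return x

# Toy usage on the convex quadratic f(x) = 0.5 * ||x||^2 (illustrative only).
x_final = schedule_free_sgd(lambda x: x, x0=np.ones(5))
print(x_final)  # close to the minimizer at the origin, with no lr schedule
```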


Learning to Achieve Goals with Belief State Transformers

arXiv.org Artificial Intelligence

We introduce the "Belief State Transformer", a next-token predictor that takes both a prefix and a suffix as input, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small-scale problems.
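
A minimal sketch of the two-headed objective may make the idea concrete. Everything below (model sizes, the concatenation-based fusion of the two belief vectors, the omission of causal masking) is an illustrative assumption rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BeliefStateSketch(nn.Module):
    """Sketch of the Belief State Transformer objective: encode a prefix
    forward and a suffix backward, fuse them into a compact belief state,
    and predict both the prefix's next token and the suffix's previous
    token. Shapes and the fusion rule are illustrative assumptions."""

    def __init__(self, vocab, d=64, nhead=4, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead, batch_first=True), layers)
        self.fwd, self.bwd = make_enc(), make_enc()
        self.next_head = nn.Linear(2 * d, vocab)  # next token after prefix
        self.prev_head = nn.Linear(2 * d, vocab)  # previous token before suffix

    def forward(self, prefix, suffix):
        f = self.fwd(self.emb(prefix))[:, -1]          # forward belief
        b = self.bwd(self.emb(suffix.flip(1)))[:, -1]  # backward belief
        belief = torch.cat([f, b], dim=-1)             # compact belief state
        return self.next_head(belief), self.prev_head(belief)

# Training would sum cross-entropy losses over both heads:
model = BeliefStateSketch(vocab=100)
next_logits, prev_logits = model(torch.randint(100, (8, 12)),
                                 torch.randint(100, (8, 12)))
```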


Improved Sample Complexity of Imitation Learning for Barrier Model Predictive Control

arXiv.org Artificial Intelligence

Imitation learning has emerged as a powerful tool in machine learning, enabling agents to learn complex behaviors by imitating expert demonstrations acquired either from a human demonstrator or a policy computed offline [3, 11, 12, 13]. Despite its significant success, imitation learning often suffers from a compounding error problem: Successive evaluations of the approximate policy could accumulate error, resulting in out-of-distribution failures [3]. Recent results in imitation learning [31, 32, 34] have identified smoothness (i.e., Lipschitzness of the derivative of the optimal controller with respect to the initial state) and stability of the expert as two key properties that circumvent this issue, thereby allowing for end-to-end performance guarantees for the final learned controller. In this paper, our focus is on enabling such guarantees when the expert being imitated is a Model Predictive Controller (MPC), a powerful class of control algorithms based on solving an optimization problem over a receding prediction horizon [23]. In some cases, the solution to this multiparametric optimization problem, known as the explicit MPC representation [6], can be pre-computed. For instance, in our setup -- linear systems with polytopic constraints -- the optimal control input is a piecewise affine (and, hence, highly non-smooth) function of the state [6].
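
To see concretely why the explicit MPC law is non-smooth, consider the smallest possible instance: a scalar system with a box input constraint and a one-step horizon. This toy problem is our own illustration, not the paper's setup, but it already exhibits the piecewise affine structure that a smooth learned policy must approximate.

```python
import numpy as np

def explicit_mpc_1step(x, r=1.0):
    """One-step MPC for the scalar system x+ = x + u with cost
    (x + u)^2 + r*u^2 and input constraint |u| <= 1. The unconstrained
    minimizer u = -x/(1 + r) is affine in x; the constraint clips it,
    so the explicit law is piecewise affine with kinks at saturation."""
    return np.clip(-x / (1.0 + r), -1.0, 1.0)

xs = np.linspace(-5, 5, 11)
print(explicit_mpc_1step(xs))  # saturates at +-1: affine pieces with kinks
```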


Adam with model exponential moving average is effective for nonconvex optimization

arXiv.org Artificial Intelligence

In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when the scale varies significantly across different coordinates, we demonstrate that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements -- momentum and discounting factors -- as well as model EMA, motivating their wide application in practice.
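
The analyzed combination can be sketched as follows. The paper's exact clipping rule may differ; here, as a simplifying assumption of ours, the per-step update norm is simply capped, and the EMA of the weights is what gets returned.

```python
import numpy as np

def adam_with_ema(grad, w0, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                  clip=1.0, ema=0.999, steps=1000):
    """Adam with a clipped update plus an exponential moving average of
    the model weights. The clipping rule here (capping the update norm)
    is an illustrative choice, not necessarily the paper's."""
    w = np.asarray(w0, dtype=float)
    w_ema = w.copy()
    m = v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g            # momentum (first moment)
        v = b2 * v + (1 - b2) * g * g        # discounted second moment
        step = lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
        norm = np.linalg.norm(step)
        if norm > clip:                      # clipped update
            step *= clip / norm
        w = w - step
        w_ema = ema * w_ema + (1 - ema) * w  # model EMA: the returned iterate
    return w_ema

print(adam_with_ema(lambda w: 2 * w, w0=np.ones(3)))  # near the minimizer 0
```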


Does SGD really happen in tiny subspaces?

arXiv.org Machine Learning

Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. Similar observations are made for the large learning rate regime (also known as Edge of Stability) and Sharpness-Aware Minimization. We discuss the main causes and implications of this spurious alignment, shedding light on the intricate dynamics of neural network training.
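
The core measurement can be sketched in a few lines: split each gradient into its component inside the top-k Hessian eigenspace and the remainder. For illustration we use an explicit symmetric Hessian matrix; in practice this is done on a network with Hessian-vector products and an iterative eigensolver.

```python
import numpy as np

def split_by_dominant_subspace(H, g, k=10):
    """Split a gradient g into its components inside and outside the
    top-k eigenspace ("dominant subspace") of a symmetric Hessian H."""
    eigvals, eigvecs = np.linalg.eigh(H)  # eigenvalues in ascending order
    V = eigvecs[:, -k:]                   # top-k eigenvectors
    g_dom = V @ (V.T @ g)                 # projection onto dominant subspace
    return g_dom, g - g_dom               # dominant / bulk components

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = A @ A.T                               # toy PSD "Hessian" for illustration
g_dom, g_bulk = split_by_dominant_subspace(H, rng.standard_normal(50))
# The paper's finding: SGD restricted to g_dom stalls, while g_bulk alone
# trains about as well as the full gradient.
```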


Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise

arXiv.org Artificial Intelligence

Despite the success of the Adam optimizer in practice, the theoretical understanding of its algorithmic components remains limited. In particular, most existing analyses of Adam show convergence rates that can just as well be achieved by non-adaptive algorithms like SGD. In this work, we provide a different perspective based on online learning that underscores the importance of Adam's algorithmic components. Inspired by Cutkosky et al. (2023), we consider the framework called online learning of updates, where we choose the updates of an optimizer based on an online learner. With this framework, the design of a good optimizer is reduced to the design of a good online learner. Our main observation is that Adam corresponds to a principled online learning framework called Follow-the-Regularized-Leader (FTRL). Building on this observation, we study the benefits of its algorithmic components from the online learning perspective.
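
The correspondence can be checked numerically in a simplified form. Below, Adam's update direction (without bias correction and with eps = 0, both simplifications of ours) is compared against the closed-form minimizer of a discounted linear loss with a coordinate-wise quadratic regularizer; the two agree up to a fixed constant.

```python
import numpy as np

rng = np.random.default_rng(1)
b1, b2, lr = 0.9, 0.999, 1e-2
grads = [rng.standard_normal(4) for _ in range(50)]

# Adam update direction (no bias correction, eps = 0, for clarity).
m = v = np.zeros(4)
for g in grads:
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
adam_step = -lr * m / np.sqrt(v)

# Discounted FTRL over updates: minimize sum_s b1^(t-s) <g_s, u> plus a
# coordinate-wise quadratic penalty weighted by sqrt(sum_s b2^(t-s) g_s^2).
t = len(grads)
disc_g = sum(b1 ** (t - s) * g for s, g in enumerate(grads, start=1))
disc_v = sum(b2 ** (t - s) * g * g for s, g in enumerate(grads, start=1))
ftrl_step = -lr * disc_g / np.sqrt(disc_v)

# The two coincide up to the constant (1 - b1) / sqrt(1 - b2) ~ 3.162.
print(adam_step / ftrl_step)  # the same constant in every coordinate
```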


Transformers learn to implement preconditioned gradient descent for in-context learning

arXiv.org Artificial Intelligence

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.
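
For a single layer, the statement has a concrete closed form: the learned map coincides with the prediction after one preconditioned gradient step, starting from zero, on the in-context least-squares loss. The sketch below posits a particular preconditioner P for illustration; in the paper, P is whatever the trained attention weights encode, and it additionally adapts to noise in the data.

```python
import numpy as np

def precond_gd_prediction(X, y, x_query, P):
    """Prediction after ONE step of preconditioned gradient descent, from
    w0 = 0, on the in-context least-squares loss (1/2n) * ||Xw - y||^2.
    A trained single-layer linear transformer implements this map for some
    learned preconditioner P (posited here for illustration)."""
    n = len(y)
    grad_at_zero = -(X.T @ y) / n  # gradient of the loss at w0 = 0
    w1 = -P @ grad_at_zero         # one preconditioned GD step
    return x_query @ w1

rng = np.random.default_rng(2)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
# E.g., P = inverse empirical covariance makes the step Newton-like.
P = np.linalg.inv(X.T @ X / 20)
print(precond_gd_prediction(X, y, rng.standard_normal(5), P))
```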


The Crucial Role of Normalization in Sharpness-Aware Minimization

arXiv.org Machine Learning

Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima -- a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.
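
The update being analyzed is easy to state: perturb the weights along the (normalized) gradient direction, then descend using the gradient at the perturbed point. The toy objective and hyperparameters below are illustrative assumptions of ours.

```python
import numpy as np

def sam_step(w, grad, lr=0.05, rho=0.1, normalize=True):
    """One SAM step. With `normalize=True` the ascent direction is rescaled
    to a fixed length rho, which is the component studied in the paper."""
    g = grad(w)
    if normalize:
        g = g / (np.linalg.norm(g) + 1e-12)  # the normalization in question
    w_adv = w + rho * g                       # perturbed ("sharpness") point
    return w - lr * grad(w_adv)               # descend from perturbed point

w = np.ones(3) * 2.0
for _ in range(200):
    w = sam_step(w, lambda w: 2 * w)          # f(w) = ||w||^2
print(w)
# With normalization, the perturbation never shrinks, so the iterate hovers
# at a small but nonzero distance (set by lr and rho) from the minimum,
# continuing to probe curvature instead of converging exactly.
```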


Learning threshold neurons via the "edge of stability"

arXiv.org Artificial Intelligence

Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
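
The stability threshold at play is the classical 2/η bound on curvature for gradient descent. The one-dimensional toy below (our own illustration, not the paper's two-layer model) shows the flavor: with a large step size, GD cannot settle at a minimum whose curvature exceeds 2/η, and the sharpness along the trajectory straddles that threshold instead of settling below it.

```python
# Minimal illustration of the 2/lr stability threshold on
# f(x) = (x^2 - 1)^2 / 4, whose minima at x = +-1 have curvature 2.
f_grad = lambda x: x * (x**2 - 1)   # f'(x)
f_hess = lambda x: 3 * x**2 - 1     # f''(x), the "sharpness" in 1-D

lr = 1.05                            # large step: 2/lr < 2 = curvature at minima
x = 1.2
for t in range(100):
    x = x - lr * f_grad(x)
    if t % 20 == 0:
        print(f"step {t:3d}  x={x:+.3f}  "
              f"sharpness={f_hess(x):.3f}  2/lr={2 / lr:.3f}")
# GD settles into an oscillation around x = 1 rather than converging,
# with the observed sharpness hovering around the 2/lr threshold.
```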


Linear attention is (maybe) all you need (to understand transformer optimization)

arXiv.org Artificial Intelligence

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training transformers by carefully studying a simple yet canonical linearized shallow transformer model. Specifically, we train linear transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized transformer model could actually be a valuable, realistic abstraction for understanding transformer optimization.
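
Concretely, "linear transformer" here means self-attention without the softmax. A single such layer, with tokens carrying (x_i, y_i) pairs, is sketched below; the weight shapes, read-out convention, and initialization are illustrative assumptions, and the weights are left untrained (training over random regression instances is what the paper studies).

```python
import numpy as np

def linear_attention(Z, W_q, W_k, W_v):
    """A single linear self-attention layer: attention WITHOUT softmax.
    Z has shape (tokens, dim); the layer adds (Z W_q)(Z W_k)^T (Z W_v)
    to the residual stream. Weight shapes are illustrative."""
    return Z + (Z @ W_q) @ (Z @ W_k).T @ (Z @ W_v)

# In-context linear regression: context tokens are (x_i, y_i) pairs and the
# query token carries (x_query, 0); the prediction is read off the last
# token's final coordinate after applying the layer(s).
rng = np.random.default_rng(3)
d, n = 5, 20
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
Z = np.hstack([X, (X @ w_true)[:, None]])             # tokens (x_i, y_i)
Z = np.vstack([Z, np.r_[rng.standard_normal(d), 0]])  # query token (x_q, 0)

W_q, W_k, W_v = (rng.standard_normal((d + 1, d + 1)) * 0.01 for _ in range(3))
pred = linear_attention(Z, W_q, W_k, W_v)[-1, -1]     # untrained read-out
print(pred)
```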