AITopics | signum

Collaborating Authors

signum

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

In Search of Adam's Secret Sauce

Orvieto, Antonio, Gower, Robert M.

arXiv.org Artificial IntelligenceDec-2-2025

Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study - training over 1500 language models across different data configurations and scales - comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping setting and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal, beta1 = beta2. Beyond robust performance, this choice affords new theoretical insights, highlights the "secret sauce" on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients-one that arises from a mean-field Gaussian variational inference perspective.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.21829

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Asia > Middle East > Jordan (0.04)
Europe > Norway > Northern Norway > Troms > Tromsø (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)

Add feedback

Benchmarking Optimizers for Large Language Model Pretraining

Semenov, Andrei, Pagliardini, Matteo, Jaggi, Martin

arXiv.org Artificial IntelligenceSep-3-2025

The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across standardized LLM pretraining scenarios, systematically varying model size, batch size, and training duration. Through careful tuning of each method, we provide guidance to practitioners on which optimizer is best suited for each scenario. For researchers, our work highlights promising directions for future optimization research. Finally, by releasing our code and making all experiments fully reproducible, we hope our efforts can help the development and rigorous benchmarking of future methods.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.0144

Country:

Asia > Middle East > Jordan (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

FOCUS: First Order Concentrated Updating Scheme

Liu, Yizhou, Liu, Ziming, Gore, Jeff

arXiv.org Artificial IntelligenceJan-21-2025

Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley's sharpness, Adam's performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest that gradient noise may be an underappreciated limiting factor in LLM training, and FOCUS offers promising solutions.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2501.12243

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Deconstructing What Makes a Good Optimizer for Language Models

Zhao, Rosie, Morwani, Depen, Brandfonbrener, David, Vyas, Nikhil, Kakade, Sham

arXiv.org Artificial IntelligenceJul-10-2024

Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably both in their optimal performance and also in terms of how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification. Given our findings, we further dissect these approaches, examining two simplified versions of Adam: a) signed momentum (Signum) which we see recovers both the performance and hyperparameter stability of Adam and b) Adalayer, a layerwise variant of Adam which we introduce to study Adam's preconditioning. Examining Adalayer leads us to the conclusion that the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters, and, perhaps surprisingly, the remaining layers can be trained with SGD.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2407.07972

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Top 30 Twitter Accounts On AI You Should Follow - Signum.ai

#artificialintelligenceJan-9-2023, 06:25:33 GMT

The world is going nuts over AI and tools built on it. They plunge into this area and can't help consuming more and more info about it, and are even creating their own intelligent products. To help you catch up on the latest news on the AI industry, we've selected Twitter accounts that are both top-rated and trending. This account needs no introduction, as it is curated by the industry leader in AI. Getting the news straight from the horse's mouth is worth your time.

industry news, tweet, twitter account, (15 more...)

#artificialintelligence

Country:

North America > United States > New York (0.05)
North America > United States > Massachusetts (0.05)
Asia > China (0.05)

Industry: Information Technology > Services (0.89)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)

Add feedback

SignSGD: Fault-Tolerance to Blind and Byzantine Adversaries

Akoun, Jason, Meyer, Sebastien

arXiv.org Machine LearningFeb-7-2022

Distributed learning has become a necessity for training ever-growing models by sharing calculation among several devices. However, some of the devices can be faulty, deliberately or not, preventing the proper convergence. As a matter of fact, the baseline distributed SGD algorithm does not converge in the presence of one Byzantine adversary. In this article we focus on the more robust SignSGD algorithm derived from SGD. We provide an upper bound for the convergence rate of SignSGD proving that this new version is robust to Byzantine adversaries. We implemented SignSGD along with Byzantine strategies attempting to crush the learning process. Therefore, we provide empirical observations from our experiments to support our theory. Our code is available on GitHub https://github.com/jasonakoun/signsgd-fault-tolerance and our experiments are reproducible by using the provided parameters.

adversary, byzantine adversary, signsgd, (16 more...)

arXiv.org Machine Learning

2202.02085

Country: Europe > France (0.04)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.52)

Add feedback

Learn About Trends Before Everyone Else Signum Startup Review Feedough

#artificialintelligenceJul-28-2019, 22:36:31 GMT

Digital trends are hard to understand. An average person would have never imagined the internet to go crazy at fidget spinners. Random people become memes and got free viral marketing and no one knows why that happened. What if you get to know what's going to be hot on the internet and act on it before anyone else? Well, Signum is that astrologer for digital trends. Signum is a subscription-based mailing list where you receive bi-weekly emails with a report on emerging hot trends before everyone else learns about them.

artificial intelligence, signum, social media, (18 more...)

#artificialintelligence

Country: North America > United States > New York (0.06)

Genre: Overview (0.31)

Technology:

Information Technology > Artificial Intelligence (0.97)
Information Technology > Communications > Social Media (0.36)

Add feedback

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Zheng, Shuai, Huang, Ziyue, Kwok, James T.

arXiv.org Machine LearningMay-26-2019

Communication overhead is a major bottleneck hampering the scalability of distributed machine learning systems. Recently, there has been a surge of interest in using gradient compression to improve the communication efficiency of distributed neural network training. Using 1-bit quantization, signSGD with majority vote achieves a 32x reduction on communication cost. However, its convergence is based on unrealistic assumptions and can diverge in practice. In this paper, we propose a general distributed compressed SGD with Nesterov's momentum. We consider two-way compression, which compresses the gradients both to and from workers. Convergence analysis on nonconvex problems for general gradient compressors is provided. By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction on communication. Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on the ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with $46\%$ less wall clock time.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

1905.10936

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback