AITopics | swan

Collaborating Authors

swan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Gradient Multi-Normalization for Efficient LLMTraining

Neural Information Processing SystemsJun-16-2026, 15:53:23 GMT

Training large language models (LLMs) commonly relies on adaptive optimizers such as Adam (Kingma & Ba, 2015), which accelerate convergence through moment estimates but incur substantial memory overhead. Recent stateless approaches such as SWAN (Ma et al., 2024) have shown that appropriate preprocessing of instantaneous gradient matrices can match the performance of adaptive methods without storing optimizer states. Building on this insight, we introduce gradient multi-normalization, a principled framework for designing stateless optimizers that normalize gradients with respect to multiple norms simultaneously. Whereas standard first-order methods can be viewed as gradient normalization under a single norm (Bernstein & Newhouse, 2024), our formulation generalizes this perspective to a multi-norm setting. We derive an efficient alternating scheme that enforces these normalization constraints and show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem. This unifies and extends prior stateless optimizers, showing that SWAN arises as a specific instance with particular norm choices. Leveraging this principle, we develop SinkGD, a lightweight matrix optimizer that retains the memory footprint of SGD (w/o momentum) while substantially reducing computation relative to whitening-based methods. On the memory-efficient LLaMA training benchmark (Zhao et al., 2024a), SinkGD achieves state-of-the-art performance, reaching the same evaluation perplexity as Adam using only 40% of the training tokens.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Information Technology (0.46)
Law (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Add feedback

SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance

Singh, Kunal, Biswas, Ankan, Bhowmick, Sayandeep, Moturi, Pradeep, Gollapalli, Siva Kishore

arXiv.org Artificial IntelligenceFeb-23-2025

We propose Step-by-Step Coding (SBSC): a multi-turn math reasoning framework that enables Large Language Models (LLMs) to generate sequence of programs for solving Olympiad level math problems. At each step/turn, by leveraging the code execution outputs and programs of previous steps, the model generates the next sub-task and the corresponding program to solve it. This way, SBSC, sequentially navigates to reach the final answer. SBSC allows more granular, flexible and precise approach to problem-solving compared to existing methods. Extensive experiments highlight the effectiveness of SBSC in tackling competition and Olympiad-level math problems. For Claude-3.5-Sonnet, we observe SBSC (greedy decoding) surpasses existing state-of-the-art (SOTA) program generation based reasoning strategies by absolute 10.7% on AMC12, 8% on AIME and 12.6% on MathOdyssey. Given SBSC is multi-turn in nature, we also benchmark SBSC's greedy decoding against self-consistency decoding results of existing SOTA math reasoning strategies and observe performance gain by absolute 6.2% on AMC, 6.7% on AIME and 7.4% on MathOdyssey.

conference paper, sbsc, sympy import symbol, (15 more...)

arXiv.org Artificial Intelligence

2502.16666

Country: Asia > India > Maharashtra > Mumbai (0.04)

Genre:

Workflow (1.00)
Research Report > New Finding (0.92)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Gradient Multi-Normalization for Stateless and Scalable LLM Training

Scetbon, Meyer, Ma, Chao, Gong, Wenbo, Meeds, Edward

arXiv.org Artificial IntelligenceFeb-10-2025

Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015) which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024) address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalizes stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce the normalization of gradients w.r.t these norms. We show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem, and that SWAN is a particular instance of our approach with carefully chosen norms, providing a deeper understanding of its design. However, SWAN's computationally expensive whitening/orthogonalization step limit its practicality for large LMs. Using our principled perspective, we develop of a more efficient, scalable, and practical stateless optimizer. Our algorithm relaxes the properties of SWAN, significantly reducing its computational cost while retaining its memory efficiency, making it applicable to training large-scale models. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2502.06742

Country:

Asia > Middle East > Jordan (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

Ma, Chao, Gong, Wenbo, Scetbon, Meyer, Meeds, Edward

arXiv.org Artificial IntelligenceDec-23-2024

Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require to maintain optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction on total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training the LLaMA model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.13148

Country:

Asia > Middle East > Jordan (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Add feedback

Firing Pat Gelsinger doesn't solve Intel's problems

EngadgetDec-3-2024, 17:34:20 GMT

Despite Intel's recent woes, I didn't expect to see CEO Pat Gelsinger joining 15,000 or so of his colleagues being shown the door. Gelsinger is a storied engineer and business success who laid down an exhaustive rescue plan when he took the helm of the beleaguered chipmaker in 2021. It was never going to be a quick fix, given the company's long legacy of missteps. Gelsinger may be the public face of Intel's current malaise, but the problems started long before his tenure and will likely keep going. Gelsinger was tasked with addressing almost two decades' worth of bad decisions, all of which have compounded.

artificial intelligence, gelsinger, intel, (17 more...)

Engadget

Country:

North America > United States (0.15)
Asia > Taiwan (0.05)

Industry: Information Technology > Hardware (0.50)

Technology: Information Technology > Artificial Intelligence (0.49)

Add feedback

Design-o-meter: Towards Evaluating and Refining Graphic Designs

Goyal, Sahil, Mahajan, Abhinav, Mishra, Swasti, Udhayanan, Prateksha, Shukla, Tripti, Joseph, K J, Srinivasan, Balaji Vasan

arXiv.org Artificial IntelligenceNov-22-2024

Graphic designs are an effective medium for visual communication. They range from greeting cards to corporate flyers and beyond. Off-late, machine learning techniques are able to generate such designs, which accelerates the rate of content production. An automated way of evaluating their quality becomes critical. Towards this end, we introduce Design-o-meter, a data-driven methodology to quantify the goodness of graphic designs. Further, our approach can suggest modifications to these designs to improve its visual appeal. To the best of our knowledge, Design-o-meter is the first approach that scores and refines designs in a unified framework despite the inherent subjectivity and ambiguity of the setting. Our exhaustive quantitative and qualitative analysis of our approach against baselines adapted for the task (including recent Multimodal LLM-based approaches) brings out the efficacy of our methodology. We hope our work will usher more interest in this important and pragmatic problem setting.

design-o-meter, graphic design, meta, (17 more...)

arXiv.org Artificial Intelligence

2411.14959

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Monaco (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre:

Research Report (0.64)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Add feedback

Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction

Li, Wenhao, Zhou, Jie, Luo, Chuan, Tang, Chao, Zhang, Kun, Zhao, Shixiong

arXiv.org Artificial IntelligenceAug-18-2024

In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, give rise to the complexity of effective recommendations in online and dynamic scenes. In this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancing the diverged logical information between scenes. We demonstrate SwAN's potential to optimize dynamic multi-scene recommendation problems by effectively online handling cold-start recommendations for any newly arrived scenes. More encouragingly, SwAN has been successfully deployed in Meituan's online catering recommendation service, which serves millions of customers per day, and SwAN has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion.

experiment, information, recommendation, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3640457.3688115

2408.07278

Country:

Europe > Italy > Apulia > Bari (0.05)
Asia > China > Beijing > Beijing (0.05)
Asia > China > Hong Kong (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Services > e-Commerce Services (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Tackling Graph Oversquashing by Global and Local Non-Dissipativity

Gravina, Alessio, Eliasof, Moshe, Gallicchio, Claudio, Bacciu, Davide, Schönlieb, Carola-Bibiane

arXiv.org Artificial IntelligenceMay-2-2024

A common problem in Message-Passing Neural Networks is oversquashing -- the limited ability to facilitate effective information flow between distant nodes. Oversquashing is attributed to the exponential decay in information transmission as node distances increase. This paper introduces a novel perspective to address oversquashing, leveraging properties of global and local non-dissipativity, that enable the maintenance of a constant information flow rate. Namely, we present SWAN, a uniquely parameterized model GNN with antisymmetry both in space and weight domains, as a means to obtain non-dissipativity. Our theoretical analysis asserts that by achieving these properties, SWAN offers an enhanced ability to transmit information over extended distances. Empirical evaluations on synthetic and real-world benchmarks that emphasize long-range interactions validate the theoretical understanding of SWAN, and its ability to mitigate oversquashing.

node, swan, tackling graph oversquashing, (12 more...)

arXiv.org Artificial Intelligence

2405.01009

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.04)
Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Beyond Empirical Windowing: An Attention-Based Approach for Trust Prediction in Autonomous Vehicles

Niu, Minxue, Zheng, Zhaobo, Akash, Kumar, Misu, Teruhisa

arXiv.org Artificial IntelligenceJan-16-2024

Humans' internal states play a key role in human-machine interaction, leading to the rise of human state estimation as a prominent field. Compared to swift state changes such as surprise and irritation, modeling gradual states like trust and satisfaction are further challenged by label sparsity: long time-series signals are usually associated with a single label, making it difficult to identify the critical span of state shifts. Windowing has been one widely-used technique to enable localized analysis of long time-series data. However, the performance of downstream models can be sensitive to the window size, and determining the optimal window size demands domain expertise and extensive search. To address this challenge, we propose a Selective Windowing Attention Network (SWAN), which employs window prompts and masked attention transformation to enable the selection of attended intervals with flexible lengths. We evaluate SWAN on the task of trust prediction on a new multimodal driving simulation dataset. Experiments show that SWAN significantly outperforms an existing empirical window selection baseline and neural network baselines including CNN-LSTM and Transformer. Furthermore, it shows robustness across a wide span of windowing ranges, compared to the traditional windowing approach.

sequence, swan, trust prediction, (16 more...)

arXiv.org Artificial Intelligence

2312.10209

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Santa Clara County > San Jose (0.04)
Asia > India (0.04)

Genre: Research Report (1.00)

Industry: Transportation (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SWAN: A Generic Framework for Auditing Textual Conversational Systems

Sakai, Tetsuya

arXiv.org Artificial IntelligenceMay-14-2023

We argue that such frameworks should satisfy the following requirements at least. Alertness They should detect potential problems with extremely high recall (i.e., near-zero misses), while appropriately crediting the benefits of the conversational systems. Moreover, when aiming for high recall, different people involved (i.e., not just users, but also workers who label data for training the system, etc.) should be taken into account; in particular, if the evaluation framework ignores some negative impacts on marginalised people, it does not satisfy the alertness requirement. Specificity By this we mean that the evaluation framework should be specific when locating the problem(s) within conversations. For example, an evaluation result that says"There is a problem somewhere inside this conversation session" is less useful than one that says"There is a problem in this particular system turn," which in turn is less useful than one that says "There is a problem in this particular claim within this system turn."

artificial intelligence, chatbot, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.0829

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
Asia > Taiwan > Taiwan Province > Taipei (0.04)
(10 more...)

Genre: Research Report (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.88)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.68)

Add feedback