AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing Systems, Feb-11-2026, 19:07:15 GMT

eddea82ad2755b24c4e168c5fc2ebd40-Paper.pdf

acprop, arxiv preprint arxiv, optimizer, (13 more...)

Country:

North America > United States > Illinois (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Neural Information Processing Systems, Dec-25-2025, 04:41:24 GMT

Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

We propose ACProp (Asynchronous-centering-Prop), an adaptive optimizer that combines centering of the second momentum with an asynchronous update (e.g. for the $t$-th update, the denominator uses information up to step $t-1$, while the numerator uses the gradient at the $t$-th step). ACProp has both strong theoretical properties and strong empirical performance. Using the example by Reddi et al. (2018), we show that asynchronous optimizers (e.g. AdaShift, ACProp) have a weaker convergence condition than synchronous optimizers (e.g. Adam, RMSProp, AdaBelief); within asynchronous optimizers, we show that centering of the second momentum further weakens the convergence condition. We demonstrate that ACProp has a convergence rate of $O(\frac{1}{\sqrt{T}})$ for the stochastic non-convex case, which matches the oracle rate and outperforms the $O(\frac{\log T}{\sqrt{T}})$ rate of RMSProp and Adam. We validate ACProp in extensive empirical studies: ACProp outperforms both SGD and other adaptive optimizers in image classification with CNNs, and outperforms well-tuned adaptive optimizers in the training of various GAN models, reinforcement learning and transformers. To sum up, ACProp has good theoretical properties, including a weak convergence condition and an optimal convergence rate, and strong empirical performance, including good generalization like SGD and training stability like Adam.
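The asynchronous, centered update described in this abstract can be sketched for a single scalar parameter. This is a minimal illustration based only on the abstract's wording, not the authors' reference implementation: the function name `async_centered_step`, the exponential-moving-average form of both moments, and the hyperparameter defaults are assumptions.

```python
import math

def async_centered_step(theta, grad, m, s, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical asynchronous, centered adaptive step on a scalar.

    theta: current parameter; grad: gradient at step t;
    m, s: running mean and centered second moment carried over from step t-1.
    """
    # Asynchronous update: the numerator is the step-t gradient, while the
    # denominator uses only s from step t-1, so it does not depend on grad.
    theta_new = theta - lr * grad / math.sqrt(s + eps)
    # Running mean of gradients (assumed EMA form).
    m_new = beta1 * m + (1.0 - beta1) * grad
    # Centering: track the squared deviation (g - m)^2 instead of g^2.
    s_new = beta2 * s + (1.0 - beta2) * (grad - m_new) ** 2
    return theta_new, m_new, s_new

# Single step from theta=1.0 with gradient 2.0 and a previous s of 4.0.
theta, m, s = async_centered_step(theta=1.0, grad=2.0, m=0.0, s=4.0, lr=0.1, eps=0.0)
```

In this sketch `s` has to start at a positive value (or `eps` must be positive), because the first step divides by the square root of the previous `s`.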

acprop, momentum centering and asynchronous update, optimizer, (11 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

arXiv.org Artificial Intelligence, Sep-3-2025

Convergence Analysis of Aggregation-Broadcast in LoRA-enabled Distributed Fine-Tuning

Chen, Xin, Chen, Shuaijun, Tavallaie, Omid, Tran, Nguyen, Xiang, Shuhuang, Zomaya, Albert

Federated Learning (FL) enables collaborative model training across decentralized data sources while preserving data privacy. However, the growing size of Machine Learning (ML) models poses communication and computation challenges in FL. Low-Rank Adaptation (LoRA) has recently been introduced into FL as an efficient fine-tuning method, reducing communication overhead by updating only a small number of trainable parameters. Despite its effectiveness, how to aggregate LoRA-updated local models on the server remains a critical and understudied problem. In this paper, we provide a unified convergence analysis for LoRA-based FL. We first categorize current aggregation methods into two major types: Sum-Product (SP) and Product-Sum (PS). Then we formally define the Aggregation-Broadcast Operator (ABO) and derive both weak and strong convergence conditions under mild assumptions. Furthermore, we present both weak and strong convergence conditions that guarantee convergence of the local model and the global model, respectively. These theoretical analyses offer a principled understanding of various aggregation strategies. Notably, we prove that the SP and PS aggregation methods satisfy the weak and strong convergence conditions, respectively, but differ in their ability to achieve the optimal convergence rate. Extensive experiments on standard benchmarks validate our theoretical findings.
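To see why the order of aggregation matters for LoRA factors, the sketch below compares two natural ways a server could combine client updates of the form B_i A_i: averaging the products, or multiplying the averaged factors. The abstract does not define SP and PS, so the function names here (`mean_of_products`, `product_of_means`) are neutral, hypothetical labels and no mapping to the paper's terms is implied.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mean_matrices(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(M[i][j] for M in mats) / n for j in range(cols)]
            for i in range(rows)]

def mean_of_products(Bs, As):
    # Form each client's full update B_i A_i first, then average.
    return mean_matrices([matmul(B, A) for B, A in zip(Bs, As)])

def product_of_means(Bs, As):
    # Average the B factors and the A factors separately, then multiply.
    return matmul(mean_matrices(Bs), mean_matrices(As))

# Two toy clients with rank-1 factors (B is 2x1, A is 1x2).
Bs = [[[1.0], [0.0]], [[0.0], [1.0]]]
As = [[[1.0, 0.0]], [[0.0, 1.0]]]
avg_prod = mean_of_products(Bs, As)   # [[0.5, 0.0], [0.0, 0.5]]
prod_avg = product_of_means(Bs, As)   # [[0.25, 0.25], [0.25, 0.25]]
```

The two results differ because matrix multiplication does not commute with averaging, which is one reason the choice of aggregation rule needs its own convergence analysis.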

artificial intelligence, machine learning, natural language, (15 more...)

2508.01348

Country:

North America > United States (0.28)
Oceania > Australia (0.28)
North America > Mexico (0.28)
Europe > United Kingdom > England (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)

Sanyal, Aniket, Chavdarova, Tatjana

Understanding Lookahead Dynamics Through Laplace Transform

arXiv.org Machine Learning, Jun-17-2025

We introduce a frequency-domain framework for convergence analysis of hyperparameters in game optimization, leveraging High-Resolution Differential Equations (HRDEs) and Laplace transforms. Focusing on the Lookahead algorithm--characterized by gradient steps $k$ and averaging coefficient $α$--we transform the discrete-time oscillatory dynamics of bilinear games into the frequency domain to derive precise convergence criteria. Our higher-precision $O(γ^2)$-HRDE models yield tighter criteria, while our first-order $O(γ)$-HRDE models offer practical guidance by prioritizing actionable hyperparameter tuning over complex closed-form solutions. Empirical validation in discrete-time settings demonstrates the effectiveness of our approach, which may further extend to locally linear operators, offering a scalable framework for selecting hyperparameters for learning in games.
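The roles of the gradient-step count $k$ and averaging coefficient $α$ can be illustrated on the simplest bilinear game, f(x, y) = x·y, with x minimizing and y maximizing. The sketch below wraps simultaneous gradient descent-ascent in a Lookahead loop; the function names, the choice of base optimizer, and the hyperparameter values are assumptions for illustration and are not taken from the paper.

```python
def gda_step(x, y, gamma):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y."""
    return x - gamma * y, y + gamma * x

def lookahead_gda(x, y, gamma, k, alpha, outer_steps):
    """Lookahead: take k fast GDA steps, then move the slow weights
    a fraction alpha of the way toward the fast weights."""
    for _ in range(outer_steps):
        fx, fy = x, y
        for _ in range(k):
            fx, fy = gda_step(fx, fy, gamma)
        x, y = x + alpha * (fx - x), y + alpha * (fy - y)
    return x, y

def norm(x, y):
    return (x * x + y * y) ** 0.5

# Plain GDA spirals away from the equilibrium (0, 0) ...
gx, gy = 1.0, 1.0
for _ in range(1000):
    gx, gy = gda_step(gx, gy, 0.1)

# ... while Lookahead with k=5, alpha=0.5 contracts toward it.
lx, ly = lookahead_gda(1.0, 1.0, gamma=0.1, k=5, alpha=0.5, outer_steps=200)
```

Each plain GDA step multiplies the distance to the equilibrium by sqrt(1 + γ²) > 1, so it diverges for any step size; the averaging step pulls the iterate back inside the spiral, and whether it contracts depends on the combination of γ, k and α.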

artificial intelligence, laplace transform, machine learning, (18 more...)

2506.13712

Country:

Europe > Russia (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Russia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence, Jun-17-2025

CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

Song, Wenxuan, Chen, Jiayi, Ding, Pengxiang, Huang, Yuxin, Zhao, Han, Wang, Donglin, Li, Haoang

In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite this progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal because of the large number of iterations required. To address this, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate error accumulation during distillation. Although distillation brings an acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
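The relationship between autoregressive decoding, Jacobi decoding, and an early exit can be shown with a toy deterministic "model". Everything in this sketch is hypothetical: `next_token` stands in for a real VLA model, and the early exit is modelled as a simple iteration cap, which is only one possible reading of "relaxing convergence conditions" and not the paper's actual criterion.

```python
def next_token(prefix):
    """Hypothetical toy model: the next token is a function of the prefix."""
    return (sum(prefix) + 1) % 7

def autoregressive_decode(prompt, n):
    """Generate n tokens one at a time (n sequential model calls)."""
    out = []
    for _ in range(n):
        out.append(next_token(prompt + out))
    return out

def jacobi_decode(prompt, n, max_iters):
    """Refine all n tokens in parallel until a fixed point or the cap.

    Returns the current guess and the number of iterations used.
    A small max_iters acts as an early exit before full convergence.
    """
    guess = [0] * n
    iters = 0
    for iters in range(1, max_iters + 1):
        # Every position is recomputed from the previous iteration's guess.
        new = [next_token(prompt + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: identical to greedy decoding
            break
        guess = new
    return guess, iters

prompt = [3, 1]
reference = autoregressive_decode(prompt, 6)
converged, used = jacobi_decode(prompt, 6, max_iters=7)
early, used_early = jacobi_decode(prompt, 6, max_iters=2)
```

After j Jacobi iterations at least the first j tokens match the autoregressive output, so an early exit returns a guess whose prefix is already exact while later positions may still be unconverged.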

artificial intelligence, ceed-vla, iteration, (16 more...)

2506.13725

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Robots > Manipulation (0.34)

Neural Information Processing Systems, Feb-11-2025, 02:00:49 GMT

Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

We propose ACProp (Asynchronous-centering-Prop), an adaptive optimizer which combines centering of second momentum and asynchronous update (e.g. for $t$-th update, denominator uses information up to step $t-1$, while numerator uses gradient at $t$-th step). ACProp has both strong theoretical properties and empirical performance. With the example by Reddi et al. (2018), we show that asynchronous optimizers (e.g. AdaShift, ACProp) have weaker convergence condition than synchronous optimizers (e.g. Adam, RMSProp, AdaBelief); within asynchronous optimizers, we show that centering of second momentum further weakens the convergence condition.

acprop, momentum centering and asynchronous update, optimizer, (9 more...)

Technology: Information Technology > Artificial Intelligence (0.40)

Neural Information Processing Systems, Apr-6-2023, 18:26:58 GMT

Generalized Learning Vector Quantization

We propose a new learning method, "Generalized Learning Vector Quantization (GLVQ)," in which reference vectors are updated based on the steepest descent method in order to minimize the cost function. The cost function is determined so that the obtained learning rule satisfies the convergence condition. We prove that Kohonen's rule as used in LVQ does not satisfy the convergence condition and thus degrades recognition ability. Experimental results for printed Chinese character recognition reveal that GLVQ is superior to LVQ in recognition ability.
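A steepest-descent update of reference vectors can be sketched as follows. The abstract does not state the cost function, so this example assumes the relative-distance measure commonly associated with GLVQ, μ = (d1 − d2)/(d1 + d2), with squared Euclidean distances and an identity outer function; the function names and the learning rate are likewise illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def relative_distance(x, w_correct, w_wrong):
    """Assumed cost term: negative when x is closer to the correct prototype."""
    d1, d2 = sq_dist(x, w_correct), sq_dist(x, w_wrong)
    return (d1 - d2) / (d1 + d2)

def glvq_step(x, w_correct, w_wrong, lr=0.1):
    """One gradient-descent step on the assumed cost for sample x.

    The constant factor from differentiating the squared distance
    is absorbed into lr.
    """
    d1, d2 = sq_dist(x, w_correct), sq_dist(x, w_wrong)
    denom = (d1 + d2) ** 2
    # Pull the nearest correct-class prototype toward x ...
    new_correct = [w + lr * d2 / denom * (xi - w) for w, xi in zip(w_correct, x)]
    # ... and push the nearest wrong-class prototype away from x.
    new_wrong = [w - lr * d1 / denom * (xi - w) for w, xi in zip(w_wrong, x)]
    return new_correct, new_wrong

x = [0.0, 0.0]
w1, w2 = [1.0, 0.0], [0.0, 2.0]
mu_before = relative_distance(x, w1, w2)
w1_new, w2_new = glvq_step(x, w1, w2)
mu_after = relative_distance(x, w1_new, w2_new)
```

Both prototypes move on every sample, and each move is scaled by the distance to the other prototype, so the step shrinks as the sample becomes clearly separated.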

convergence condition, generalized learning vector quantization, recognition ability, (2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence, Dec-6-2022

BALPA: A Balanced Primal-Dual Algorithm for Nonsmooth Optimization with Application to Distributed Optimization

Guo, Luyao, Cao, Jinde, Shi, Xinli, Yang, Shaofu

In this paper, we propose a novel primal-dual proximal splitting algorithm (PD-PSA), named BALPA, for the composite optimization problem with equality constraints, where the loss function consists of a smooth term and a nonsmooth term composed with a linear mapping. In BALPA, the dual update is designed as a proximal point for a time-varying quadratic function, which balances the implementation of primal and dual update and retains the proximity-induced feature of classic PD-PSAs. In addition, by this balance, BALPA eliminates the inefficiency of classic PD-PSAs for composite optimization problems in which the Euclidean norm of the linear mapping or the equality constraint mapping is large. Therefore, BALPA not only inherits the advantages of simple structure and easy implementation of classic PD-PSAs but also ensures a fast convergence when these norms are large. Moreover, we propose a stochastic version of BALPA (S-BALPA) and apply the developed BALPA to distributed optimization to devise a new distributed optimization algorithm. Furthermore, a comprehensive convergence analysis for BALPA and S-BALPA is conducted, respectively. Finally, numerical experiments demonstrate the efficiency of the proposed algorithms.

algorithm, artificial intelligence, machine learning, (17 more...)

2212.02835

Country: Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Zhang, Guoqiang, Kenta, Niwa, Kleijn, W. Bastiaan

Extending AdamW by Leveraging Its Second Moment and Magnitude

arXiv.org Artificial Intelligence, Dec-9-2021

Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution for a twice-differentiable function. It is found that the learning rate has to be sufficiently small to ensure local stability of the optimal solution. The above convergence results also hold for AdamW. In this work, we propose a new adaptive optimisation method, which we refer to as Aida, by extending AdamW in two aspects with the purpose of relaxing the requirement of a small learning rate for local stability. Firstly, we consider tracking the 2nd moment $r_t$ of the $p$th power of the gradient magnitudes. $r_t$ reduces to $v_t$ of AdamW when $p=2$. Suppose $\{m_t\}$ is the first moment of AdamW. It is known that the update direction $m_{t+1}/(v_{t+1}+\epsilon)^{0.5}$ (or $m_{t+1}/(v_{t+1}^{0.5}+\epsilon)$) of AdamW (or Adam) can be decomposed as the sign vector $\mathrm{sign}(m_{t+1})$ multiplied elementwise by a vector of magnitudes $|m_{t+1}|/(v_{t+1}+\epsilon)^{0.5}$ (or $|m_{t+1}|/(v_{t+1}^{0.5}+\epsilon)$). Aida is designed to compute the $q$th power of the magnitude in the form of $|m_{t+1}|^q/(r_{t+1}+\epsilon)^{q/p}$ (or $|m_{t+1}|^q/((r_{t+1})^{q/p}+\epsilon)$), which reduces to that of AdamW when $(p,q)=(2,1)$. Suppose the origin 0 is a local optimal solution of a twice-differentiable function. It is found theoretically that when $q>1$ and $p>1$ in Aida, the origin 0 is locally stable only when the weight decay is non-zero. Experiments are conducted on ten toy optimisation problems and on training Transformer and Swin-Transformer for two deep learning (DL) tasks. The empirical study demonstrates that in a number of scenarios (including the two DL tasks), Aida with particular setups of $(p,q)$ not equal to $(2,1)$ outperforms the setup $(p,q)=(2,1)$ of AdamW.
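The sign-times-magnitude decomposition in this abstract can be written out for a single coordinate. The sketch follows the first (AdamW-style) form of the formula quoted above; the function names are hypothetical, and the exponential-moving-average rule in `update_r` is an assumption, since the abstract only says that r_t tracks the p-th power of the gradient magnitudes.

```python
def aida_direction(m, r, p=2.0, q=1.0, eps=1e-8):
    """Per-coordinate direction sign(m) * |m|^q / (r + eps)^(q/p)."""
    sign = (m > 0) - (m < 0)
    return sign * abs(m) ** q / (r + eps) ** (q / p)

def update_r(r, grad, p=2.0, beta2=0.999):
    """Assumed EMA of |grad|^p; with p=2 this is the usual second moment."""
    return beta2 * r + (1.0 - beta2) * abs(grad) ** p

# With (p, q) = (2, 1) the direction reduces to m / sqrt(v + eps).
adamw_like = aida_direction(m=-0.3, r=0.04, p=2.0, q=1.0, eps=0.0)
# A different setup changes only the magnitude, never the sign.
other = aida_direction(m=-0.3, r=0.04, p=2.0, q=2.0, eps=0.0)
```

Because the sign factor is separated out, any choice of (p, q) keeps the same descent direction per coordinate and only rescales the step length.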

adamw, aida, convergence, (17 more...)

2112.06125

Country:

Oceania > New Zealand > North Island > Wellington Region > Wellington (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)