AITopics

2502.05001

Country: Europe > United Kingdom > England (0.46)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

Zhou, Jiajun, Yang, Yifan, Zhen, Kai, Liu, Ziyue, Zhao, Yequan, Banijamali, Ershad, Mouchtaris, Athanasios, Wong, Ngai, Zhang, Zheng

Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency of inference. However, quantization often degrades model performance, so fine-tuning is required for various downstream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam optimization require backpropagation, which is error-prone in low-precision settings. To overcome these limitations, we propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision (e.g., 4- or 8-bit) forward passes. Our method avoids the error-prone low-precision straight-through estimator and uses optimized stochastic rounding to mitigate the increased bias. QuZO simplifies the training process while achieving results comparable to first-order methods in ${\rm FP}8$ and superior accuracy in ${\rm INT}8$ and ${\rm INT}4$ training. Experiments demonstrate that low-bit QuZO training achieves performance comparable to MeZO optimization on GLUE, Multi-Choice, and Generation tasks, while reducing memory cost by $2.94 \times$ in LLaMA2-7B fine-tuning compared to quantized first-order methods.
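The abstract above combines two ideas: a gradient estimate built from forward passes only, and stochastic rounding to keep quantization unbiased. The following is a minimal sketch of both on a toy quadratic loss, not the QuZO implementation. The function names, the two-point estimator form, and the choice to apply stochastic rounding to the perturbation direction are all assumptions made for illustration.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fractional remainder, so the result equals x in expectation."""
    scaled = x / step
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) * step


def zo_gradient(loss, params, eps, step, rng):
    """Two-point zeroth-order gradient estimate: two forward evaluations of
    `loss`, no backpropagation. The random direction is quantized with
    stochastic rounding (an illustrative assumption)."""
    u = [stochastic_round(rng.gauss(0.0, 1.0), step, rng) for _ in params]
    plus = [p + eps * ui for p, ui in zip(params, u)]
    minus = [p - eps * ui for p, ui in zip(params, u)]
    scale = (loss(plus) - loss(minus)) / (2 * eps)
    return [scale * ui for ui in u]


def demo_quadratic(steps=500, lr=0.05, eps=1e-2, step=0.25, seed=0):
    """Minimize a toy quadratic with zeroth-order SGD.
    Returns (initial loss, final loss)."""
    target = [1.0, -2.0, 0.5, 3.0]

    def loss(w):
        return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

    rng = random.Random(seed)
    w = [0.0] * len(target)
    start = loss(w)
    for _ in range(steps):
        g = zo_gradient(loss, w, eps, step, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return start, loss(w)
```

Calling `demo_quadratic()` returns the loss before and after training. The toy problem only shows that forward-only updates can make progress; it says nothing about behaviour at LLM scale.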

large language model, machine learning, quantization, (19 more...)

2502.12346

Genre: Research Report (0.64)

Industry: Energy > Oil & Gas (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Fentaw, Haftu W., Campbell, Steve, Caton, Simon

Exploring Quantum Control Landscape and Solution Space Complexity through Dimensionality Reduction & Optimization Algorithms

Understanding the quantum control landscape (QCL) is important for designing effective quantum control strategies. In this study, we analyze the QCL for a single two-level quantum system (qubit) using various control strategies. We employ Principal Component Analysis (PCA) to visualize and analyze the QCL for higher-dimensional control parameters. Our results indicate that dimensionality reduction techniques such as PCA can play an important role in understanding the complex nature of quantum control in higher dimensions. Evaluations of traditional control techniques and machine learning algorithms reveal that Genetic Algorithms (GA) outperform Stochastic Gradient Descent (SGD), while Q-learning (QL) shows great promise compared to Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO). Additionally, our experiments highlight the importance of reward function design in DQN and PPO, demonstrating that immediate rewards yield better performance than delayed rewards for systems with short time steps. We study solution space complexity using the Cluster Density Index (CDI) as a key metric for analyzing the density of optimal solutions in the landscape. The CDI reflects cluster quality and helps determine whether a given algorithm generates regions of high fidelity. Our results provide insights into effective quantum control strategies, emphasizing the significance of parameter selection and algorithm optimization.

algorithm, fidelity, landscape, (16 more...)

2502.11905

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.61)

Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent

Wu, Junda, Xiong, Yuxin, Li, Xintong, Xia, Yu, Wang, Ruoyu, Wang, Yu, Yu, Tong, Kim, Sungchul, Rossi, Ryan A., Yao, Lina, Shang, Jingbo, McAuley, Julian

Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Unlike pre-training, where MLLMs receive rich visual-text alignment, instruction-tuning is often text-driven with weaker visual supervision, leading to the degradation of pre-trained visual understanding and causing visual forgetting. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue, often compressing visual representations and prioritizing task alignment over visual retention, which further worsens visual forgetting. To overcome this limitation, we introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation richness, interpreting this degradation through the information bottleneck principle as excessive compression that leads to the degradation of crucial pre-trained visual knowledge. Building on this view, we propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations while mitigating the over-compression effects described by the information bottleneck. By explicitly disentangling the optimization of visual understanding from task-specific alignment, MDGD preserves pre-trained visual knowledge while enabling efficient task adaptation. To enable lightweight instruction-tuning, we further develop a memory-efficient fine-tuning approach using gradient masking, which selectively updates a subset of model parameters to enable parameter-efficient fine-tuning (PEFT), reducing computational overhead while preserving rich visual representations. Extensive experiments across various downstream tasks and backbone MLLMs demonstrate that MDGD effectively mitigates visual forgetting from pre-trained tasks while enabling strong adaptation to new tasks.
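To make the effective-rank measure mentioned above concrete, here is a minimal sketch of a common entropy-based definition: the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this formula is an assumption, and the function name and example spectra are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    Normalizes the positive singular values into a probability distribution
    and returns exp(Shannon entropy). A flat spectrum of k equal values
    yields k; a spectrum dominated by one value yields a number close to 1.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

For example, `effective_rank([4.0, 3.0, 3.0])` is larger than `effective_rank([10.0, 0.1, 0.1])`. The second spectrum is the kind of compressed representation the abstract associates with visual forgetting.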

artificial intelligence, machine learning, representation, (13 more...)

2502.11740

Country: North America > United States (0.50)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.46)
Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Jeong, Wooseong, Yoon, Kuk-Jin

Selective Task Group Updates for Multi-Task Optimization

Multi-task learning enables the acquisition of task-generic knowledge by training multiple tasks within a unified architecture. However, training all tasks together in a single architecture can lead to performance degradation, known as negative transfer, which is a main concern in multi-task learning. Previous works have addressed this issue by optimizing the multi-task network through gradient manipulation or weighted loss adjustments. However, their optimization strategy focuses on addressing task imbalance in shared parameters, neglecting the learning of task-specific parameters. As a result, they show limitations in mitigating negative transfer, since the learning of shared space and task-specific information influences each other during optimization. To address this, we propose a different approach to enhance multi-task performance by selectively grouping tasks and updating them for each batch during optimization. We introduce an algorithm that adaptively determines how to effectively group tasks and update them during the learning process. To track inter-task relations and optimize multi-task networks simultaneously, we propose proximal inter-task affinity, which can be measured during the optimization process. We provide a theoretical analysis on how dividing tasks into multiple groups and updating them sequentially significantly affects multi-task performance by enhancing the learning of task-specific parameters. Our methods substantially outperform previous multi-task optimization approaches and are scalable to different architectures and various numbers of tasks. Multi-task learning (MTL) stands out as a key approach for crafting efficient and robust deep learning models that can adeptly manage numerous tasks within a unified architecture (Caruana, 1997).

artificial intelligence, machine learning, optimization problem, (17 more...)

2502.11986

Country: Europe (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Stability-based Generalization Bounds for Variational Inference

Wei, Yadi, Khardon, Roni

Variational inference (VI) is widely used for approximate inference in Bayesian machine learning. In addition to this practical success, generalization bounds for variational inference and related algorithms have been developed, mostly through the connection to PAC-Bayes analysis. A second line of work has provided algorithm-specific generalization bounds through stability arguments or using mutual information bounds, and has shown that the bounds are tight in practice, but unfortunately these bounds do not directly apply to approximate Bayesian algorithms. This paper fills this gap by developing algorithm-specific stability-based generalization bounds for a class of approximate Bayesian algorithms that includes VI, specifically when using stochastic gradient descent to optimize their objective. As in the non-Bayesian case, the generalization error is bounded by expected parameter differences on a perturbed dataset. The new approach complements PAC-Bayes analysis and can provide tighter bounds in some cases. An experimental illustration shows that the new approach yields non-vacuous bounds on modern neural network architectures and datasets and that it can shed light on performance differences between variant approximate Bayesian algorithms.

artificial intelligence, bayesian inference, machine learning, (15 more...)

2502.12353

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Snow, Luke, Krishnamurthy, Vikram

Efficient Neural SDE Training using Wiener-Space Cubature

A neural stochastic differential equation (SDE) is an SDE with drift and diffusion terms parametrized by neural networks. The training procedure for neural SDEs consists of optimizing the SDE vector field (neural network) parameters to minimize the expected value of an objective functional on infinite-dimensional path-space. Existing training techniques focus on methods to efficiently compute path-wise gradients of the objective functional with respect to these parameters, then pair this with Monte-Carlo simulation to estimate the expectation, and stochastic gradient descent to optimize. In this work we introduce a novel training technique which bypasses and improves upon Monte-Carlo simulation; we extend results in the theory of Wiener-space cubature to approximate the expected objective functional by a weighted sum of deterministic ODE solutions. This allows us to compute gradients by efficient ODE adjoint methods. Furthermore, we exploit a high-order recombination scheme to drastically reduce the number of ODE solutions necessary to achieve a reasonable approximation. We show that this Wiener-space cubature approach can surpass the O(1/sqrt(n)) rate of Monte-Carlo simulation, or the O(log(n)/n) rate of quasi-Monte-Carlo, to achieve an O(1/n) rate under reasonable assumptions.

approximation, artificial intelligence, machine learning, (16 more...)

2502.12395

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Carrell, Annabelle Michael, Gong, Albert, Shetty, Abhishek, Dwivedi, Raaz, Mackey, Lester

Low-Rank Thinning

arXiv.org Machine Learning, Feb-17-2025

The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.

algorithm, artificial intelligence, machine learning, (16 more...)

2502.12063

Country:

North America > United States (0.28)
Europe > United Kingdom > England (0.27)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Riabinin, Artem, Khaled, Ahmed, Richtárik, Peter

A Novel Unified Parametric Assumption for Nonconvex Optimization

arXiv.org Machine Learning, Feb-17-2025

Nonconvex optimization is central to modern machine learning, but the general framework of nonconvex optimization yields weak convergence guarantees that are too pessimistic compared to practice. On the other hand, while convexity enables efficient optimization, it is of limited applicability to many practical problems. To bridge this gap and better understand the practical success of optimization algorithms in nonconvex settings, we introduce a novel unified parametric assumption. Our assumption is general enough to encompass a broad class of nonconvex functions while also being specific enough to enable the derivation of a unified convergence theorem for gradient-based methods. Notably, by tuning the parameters of our assumption, we demonstrate its versatility in recovering several existing function classes as special cases and in identifying functions amenable to efficient optimization. We derive our convergence theorem for both deterministic and stochastic optimization, and conduct experiments to verify that our assumption can hold practically over optimization trajectories.

artificial intelligence, assumption 1, machine learning, (12 more...)

2502.12329

Country:

North America (0.46)
Asia (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

arXiv.org Artificial Intelligence, Feb-16-2025

Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent

Chu, Ya-Chi, Gao, Wenzhi, Ye, Yinyu, Udell, Madeleine

This paper investigates the convergence properties of the hypergradient descent method (HDM), a 25-year-old heuristic originally proposed for adaptive stepsize selection in stochastic first-order methods. We provide the first rigorous convergence analysis of HDM using the online learning framework of [Gao24] and apply this analysis to develop new state-of-the-art adaptive gradient methods with empirical and theoretical support. Notably, HDM automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instability of HDM reported in the literature and proposes efficient strategies to address it. We also develop two HDM variants with heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show HDM with heavy-ball momentum (HDM-HB) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, HDM-HB often matches the performance of L-BFGS, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.
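As a rough illustration of the heuristic the abstract refers to, the sketch below implements the classical hypergradient stepsize rule: the stepsize is increased when consecutive gradients point in similar directions and decreased when they oppose each other. This is the basic rule only, not the variants or safeguards analyzed in the paper. The function names, the test problem, and the hyperparameter values are assumptions chosen for the example.

```python
def hypergradient_descent(grad, x0, alpha0, beta, steps):
    """Gradient descent with an online-adapted stepsize.

    Update rule (classical heuristic):
        alpha_t = alpha_{t-1} + beta * <g_t, g_{t-1}>
        x_{t+1} = x_t - alpha_t * g_t
    where g_t = grad(x_t) and beta is the hyper-stepsize.
    Returns the final iterate and the final stepsize.
    """
    x = list(x0)
    alpha = alpha0
    prev_grad = [0.0] * len(x)  # no previous gradient on the first step
    for _ in range(steps):
        g = grad(x)
        alpha += beta * sum(gi * pi for gi, pi in zip(g, prev_grad))
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        prev_grad = g
    return x, alpha


def quadratic(x):
    """Toy ill-conditioned objective f(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def quadratic_grad(x):
    """Gradient of the toy objective."""
    return [x[0], 10.0 * x[1]]
```

On this toy problem, `hypergradient_descent(quadratic_grad, [1.0, 1.0], 0.01, 1e-4, 500)` grows the stepsize above its initial value and reaches a lower objective than the same number of fixed-stepsize iterations (`beta=0.0`). A larger `beta` can destabilize the iteration, which is the kind of instability the abstract says the paper explains.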

artificial intelligence, convergence, machine learning, (16 more...)

2502.11229

Country:

Europe (0.45)
North America (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Setting > Online (0.71)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)