Gradient space


Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space

Neural Information Processing Systems

Distilling knowledge from an ensemble of teacher models is expected to yield better performance than distilling from a single teacher. Current methods mainly adopt a vanilla averaging rule, i.e., they simply take the average of all teacher losses for training the student network. However, this approach treats teachers equally and ignores the diversity among them. When conflicts or competitions exist among teachers, which is common, the internal compromise might hurt distillation performance. In this paper, we examine the diversity of teacher models in the gradient space and cast ensemble knowledge distillation as a multi-objective optimization problem, so that we can determine a better optimization direction for training the student network. We also introduce a tolerance parameter to accommodate disagreement among teachers. In this way, our method can be seen as a dynamic weighting method for each teacher in the ensemble.
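In the two-teacher case, the min-norm subproblem underlying this kind of multi-objective (MGDA-style) gradient weighting has a closed form. A minimal numpy sketch, assuming two flattened teacher gradients; the function name and toy gradients are illustrative, not taken from the paper:

```python
import numpy as np

def min_norm_weight(g1, g2):
    """Closed-form min-norm point between two gradient vectors.

    Returns alpha in [0, 1] such that alpha*g1 + (1-alpha)*g2 has
    minimum norm within the convex hull of {g1, g2} (the two-teacher
    case of the MGDA subproblem used in multi-objective optimization).
    """
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:          # identical gradients: any weight works
        return 0.5
    alpha = (g2 - g1) @ g2 / denom
    return float(np.clip(alpha, 0.0, 1.0))

# Two orthogonal (disagreeing) teacher gradients for a toy 2-parameter student.
g_teacher_a = np.array([1.0, 0.0])
g_teacher_b = np.array([0.0, 1.0])

alpha = min_norm_weight(g_teacher_a, g_teacher_b)   # 0.5: equal weighting
g_update = alpha * g_teacher_a + (1 - alpha) * g_teacher_b
```

When the two gradients agree, the weights are symmetric; when one teacher's gradient dominates or conflicts, alpha shifts toward the direction that reduces the combined norm, which is the dynamic-weighting behavior the abstract describes.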


On the Importance of Gradients for Detecting Distributional Shifts in the Wild

Neural Information Processing Systems

Detecting out-of-distribution (OOD) data has become a critical component in ensuring the safe deployment of machine learning models in the real world. Existing OOD detection approaches primarily rely on the output or feature space for deriving OOD scores, while largely overlooking information from the gradient space. In this paper, we present GradNorm, a simple and effective approach for detecting OOD inputs using information extracted from the gradient space. GradNorm directly employs the vector norm of the gradients backpropagated from the KL divergence between the softmax output and a uniform probability distribution. Our key idea is that the magnitude of gradients is higher for in-distribution (ID) data than for OOD data, making it informative for OOD detection. GradNorm demonstrates superior performance, reducing the average FPR95 by up to 16.33% compared to the previous best method.


8493eeaccb772c0878f99d60a0bd2bb3-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all the reviewers for carefully checking the paper and acknowledging its "efficiency and practicality". We will also clarify this in the revised version. R1 asks for a discussion of the similarities and differences between our technical results and those of [19]. Hence, training on the medoids is robust to noisy labels; indeed, Eq. 5 finds the best subset. R2 asks how clean the coreset is (see the fraction of clean data points in the coreset).




Continual Gradient Low-Rank Projection Fine-Tuning for LLMs

Wang, Chenxu, Lyu, Yilin, Sun, Zicheng, Jing, Liping

arXiv.org Artificial Intelligence

Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model's ability to learn new tasks and transfer knowledge, due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel training strategy that overcomes these limitations by synergistically combining full and low-rank parameters and jointly updating them within a unified low-rank gradient subspace. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting. Extensive experiments on continual learning benchmarks demonstrate GORP's superior performance compared to existing state-of-the-art approaches. Code is available at https://github.com/Wcxwcxw/GORP.
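GORP's exact procedure is in the linked repository. As a hedged illustration of the general idea of restricting updates to a low-rank gradient subspace, one common construction projects the gradient matrix onto its top singular directions (a GaLore-style projection, shown here only to convey the concept, not GORP itself):

```python
import numpy as np

def project_gradient(grad, rank):
    """Project a gradient matrix onto its top-`rank` left singular
    subspace. Optimizer state can then be kept in the compact
    (rank x n) representation instead of the full (m x n) gradient,
    which is the efficiency argument for low-rank gradient methods."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                 # orthonormal basis of the subspace
    low_rank = P.T @ grad           # compact (rank x n) representation
    return P, low_rank

rng = np.random.default_rng(1)
G = rng.normal(size=(64, 16))       # toy gradient of a 64x16 weight matrix
P, G_low = project_gradient(G, rank=4)
G_back = P @ G_low                  # lift the update back to full parameter space
```

The projection basis would typically be recomputed only every so many steps, so the per-step cost is dominated by the cheap low-rank multiplications rather than the SVD.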


Review for NeurIPS paper: Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space

Neural Information Processing Systems

The motivation of AE-KD is to have the optimization direction of the student guided by all the teachers. However, when some weak teachers (with low generalization accuracy) are present in the ensemble teacher pool, why are these weak teachers treated equally with the strong teachers in the gradient space? Intuitively, the student's guidance should favor the strong teachers and keep away from the weak ones. What is the difference between them? 3. How are the weights \alpha_m in Eq. (11) optimized? Are they optimized end-to-end together with the student?






Outlier Gradient Analysis: Efficiently Improving Deep Learning Model Performance via Hessian-Free Influence Functions

Chhabra, Anshuman, Li, Bo, Chen, Jian, Mohapatra, Prasant, Liu, Hongfu

arXiv.org Artificial Intelligence

Data-centric learning focuses on enhancing algorithmic performance from the perspective of the training data [Oala et al., 2023]. In contrast to model-centric learning, which designs novel algorithms or optimization techniques to improve performance with fixed training data, data-centric learning operates with a fixed learning algorithm while modifying the training data through trimming, augmentation, or other methods that improve utility [Zha et al., 2023]. Data-centric learning holds significant potential in many areas, such as model interpretation, training-subset selection, data generation, noisy-label detection, and active learning [Chhabra et al., 2024, Kwon et al., 2024].

The essence of data-centric learning lies in estimating data influence, also known as data valuation [Hammoudeh and Lowd, 2022], in the context of a learning task. Intuitively, the impact of an individual data sample can be measured by the change in learning utility when training with and without that sample. This leave-one-out influence [Cook and Weisberg, 1982] provides a rough gauge of the sample's influence relative to the otherwise fixed training set. The Shapley value [Ghorbani and Zou, 2019, Jia et al., 2019], originating from cooperative game theory, instead quantifies the increase in value when a group of samples collaborates to achieve the learning goal: unlike leave-one-out influence, it is the weighted average utility change from adding the point to different training subsets. Despite making no assumptions about the learning model, these retraining-based methods incur significant computational costs, especially for large-scale data analysis and deep models [Hammoudeh and Lowd, 2022].
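The leave-one-out influence mentioned above is easy to state concretely. A toy numpy example, using the sample mean as the "model" and squared error on a single validation target; all names and numbers here are illustrative:

```python
import numpy as np

def leave_one_out_influence(train, val):
    """Leave-one-out influence for a toy model (the sample mean) under
    squared error against a validation target `val`.

    influence[i] = loss(train without i) - loss(full train), so a
    positive value means the sample's presence reduces validation loss
    (the sample helps), and a negative value means it hurts.
    """
    def loss(data):
        return (np.mean(data) - val) ** 2
    full = loss(train)
    return np.array([loss(np.delete(train, i)) - full
                     for i in range(len(train))])

train = np.array([1.0, 1.1, 0.9, 5.0])   # last point is an outlier
infl = leave_one_out_influence(train, val=1.0)
# The outlier gets the most negative influence: removing it drops the
# validation loss from 1.0 to 0.0.
```

Retraining-based estimators like this require one retraining per sample (or per subset, for Shapley values), which is exactly the computational cost the paragraph above notes for large-scale data and deep models.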