AITopics | negative gradient

Collaborating Authors

negative gradient

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Neural Information Processing SystemsJun-18-2026, 03:32:08 GMT

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.

large language model, machine learning, negative gradient, (19 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

VPO: Reasoning Preferences Optimization Based on \mathcal{V} -Usable Information

Neural Information Processing SystemsJun-14-2026, 08:02:14 GMT

Direct Preference Optimization (DPO) is a widely used preference optimization algorithm in large language model (LLM) alignment, which reparameterizes the reward function in reinforcement learning with human feedback (RLHF) without requiring a separate reward model. However, during the DPO training process, when a large negative gradient is applied to low-confidence samples, LLMs with a softmax output head tend to squeeze the confidence in the model's output distribution towards the highest-confidence sentence, which may lead to a decrease in the confidence of both preference and non-preference samples, while increasing the confidence of unrelated tokens. This phenomenon becomes more complex in reasoning tasks. In this work, focusing on reasoning tasks, we propose VPO, a negative gradient constraint method for human non-preference samples based on $\mathcal{V}$-usable information. By using $\mathcal{V}$-usable information to measure the similarity between preference pairs and selectively constrain the negative gradient, VPO can alleviate the squeezing effect of DPO, enhance alignment with the generation objective, and maintain the model's ability to distinguish between preference and non-preference samples. We compare VPO with DPO and its latest variants on mathematical reasoning tasks using the LLama 3.1 and Qwen 2.5 series, including both Base and Instruct models. Our results demonstrate that VPO consistently and significantly outperforms existing methods.

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.59)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.59)

Add feedback

The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View

Yao, Xinhao, Yu, Lu, Hu, Xiaolin, Teng, Fengwei, Cui, Qing, Zhou, Jun, Liu, Yong

arXiv.org Artificial IntelligenceOct-7-2025

The ongoing debate on whether reinforcement learning with verifiable rewards (RLVR) expands or shrinks the reasoning capabilities of large language models (LLMs) remains unresolved. Some studies contend that RLVR mainly improves sampling efficiency but at the expense of diversity and exploratory capacity, resulting in capability boundary shrinkage. In contrast, others demonstrate that prolonged training can lead to the emergence of novel reasoning strategies, suggesting capability boundary expansion. To reconcile these contradictory findings, we theoretically and empirically show that both perspectives are partially valid-each aligning with a separate phase in an inherent two-stage probability mass dynamic: (1) Exploitation stage: initially, the model primarily samples explored high-reward and low-reward tokens, while rarely selecting the potentially optimal token. Positive advantage estimates increase the probability of high-reward tokens and decrease those of low-reward tokens, yet the optimal token's probability remains largely unchanged during this stage. (2) Exploration stage: as training advances, the growth rate of previously acquired high-reward tokens slows as their probabilities approach saturation. When a potentially optimal token-now receiving positive advantage estimates-is occasionally sampled, its probability increases, while those of the originally high-reward tokens decrease. This dynamic suggests that over-exploitation during the exploitation stage may lead to capability boundary shrinkage, whereas prolonged training into the exploration stage can promote an expansion of the reasoning capability boundary. Building upon our insights, we revisit the potential of only using relative negative gradients for prolonging training, providing a theoretical and empirical foundation for the development of more advanced reasoning capabilities.

arxiv preprint arxiv, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.04028

Genre: Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback

On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Deng, Wenlong, Ren, Yi, Li, Muchen, Sutherland, Danica J., Li, Xiaoxiao, Thrampoulidis, Christos

arXiv.org Artificial IntelligenceMay-27-2025

large language model, machine learning, negative gradient, (20 more...)

arXiv.org Artificial Intelligence

2505.1883

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Is All Learning (Natural) Gradient Descent?

Shoji, Lucas, Suzuki, Kenta, Kozachkov, Leo

arXiv.org Artificial IntelligenceSep-24-2024

This paper shows that a wide class of effective learning rules -- those that improve a scalar performance measure over a given time window -- can be rewritten as natural gradient descent with respect to a suitably defined loss function and metric. Specifically, we show that parameter updates within this class of learning rules can be expressed as the product of a symmetric positive definite matrix (i.e., a metric) and the negative gradient of a loss function. We also demonstrate that these metrics have a canonical form and identify several optimal ones, including the metric that achieves the minimum possible condition number. The proofs of the main results are straightforward, relying only on elementary linear algebra and calculus, and are applicable to continuous-time, discrete-time, stochastic, and higher-order learning rules, as well as loss functions that explicitly depend on time.

descent, gradient, loss function, (15 more...)

arXiv.org Artificial Intelligence

2409.16422

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)

Add feedback

ReGentS: Real-World Safety-Critical Driving Scenario Generation Made Stable

Yin, Yuan, Khayatan, Pegah, Zablocki, Éloi, Boulch, Alexandre, Cord, Matthieu

arXiv.org Artificial IntelligenceSep-12-2024

Machine learning based autonomous driving systems often face challenges with safety-critical scenarios that are rare in real-world data, hindering their large-scale deployment. While increasing real-world training data coverage could address this issue, it is costly and dangerous. This work explores generating safety-critical driving scenarios by modifying complex real-world regular scenarios through trajectory optimization. We propose ReGentS, which stabilizes generated trajectories and introduces heuristics to avoid obvious collisions and optimization problems. Our approach addresses unrealistic diverging trajectories and unavoidable collision scenarios that are not useful for training robust planner. We also extend the scenario generation framework to handle real-world data with up to 32 agents.

adversary, agent, scenario, (14 more...)

arXiv.org Artificial Intelligence

2409.0783

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)

Genre: Research Report (0.50)

Industry: Transportation > Ground > Road (0.89)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

Add feedback

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

Tajwar, Fahim, Singh, Anikait, Sharma, Archit, Rafailov, Rafael, Schneider, Jeff, Xie, Tengyang, Ermon, Stefano, Finn, Chelsea, Kumar, Aviral

arXiv.org Artificial IntelligenceJun-2-2024

Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Different methods come with different implementation tradeoffs and performance differences, and existing empirical findings present different conclusions, for instance, some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: what kind of approaches are important for fine-tuning with preference data and why? In this paper, we answer this question by performing a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or negative gradient under a notion of mode-seeking objectives for categorical distributions. Mode-seeking objectives are able to alter probability mass on specific bins of a categorical distribution at a fast rate compared to maximum likelihood, allowing them to relocate masses across bins more effectively. Our analysis prescribes actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement.

gradient, negative gradient, preference fine-tuning, (14 more...)

arXiv.org Artificial Intelligence

2404.14367

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)

Add feedback

Guided AbsoluteGrad: Magnitude of Gradients Matters to Explanation's Localization and Saliency

Huang, Jun, Liu, Yan

arXiv.org Artificial IntelligenceApr-23-2024

This paper proposes a new gradient-based XAI method called Guided AbsoluteGrad for saliency map explanations. We utilize both positive and negative gradient magnitudes and employ gradient variance to distinguish the important areas for noise deduction. We also introduce a novel evaluation metric named ReCover And Predict (RCAP), which considers the Localization and Visual Noise Level objectives of the explanations. We propose two propositions for these two objectives and prove the necessity of evaluating them. We evaluate Guided AbsoluteGrad with seven gradient-based XAI methods using the RCAP metric and other SOTA metrics in three case studies: (1) ImageNet dataset with ResNet50 model; (2) International Skin Imaging Collaboration (ISIC) dataset with EfficientNet model; (3) the Places365 dataset with DenseNet161 model. Our method surpasses other gradient-based approaches, showcasing the quality of enhanced saliency map explanations through gradient magnitude.

explanation, saliency map, saliency value, (16 more...)

arXiv.org Artificial Intelligence

2404.15564

Country: North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Vision (0.88)

Add feedback

ViT-MUL: A Baseline Study on Recent Machine Unlearning Methods Applied to Vision Transformers

Cho, Ikhyun, Park, Changyeon, Hockenmaier, Julia

arXiv.org Artificial IntelligenceFeb-7-2024

Hence, there is a crucial need for MUL studies that Machine unlearning (MUL) is an arising field in machine specifically target ViT models. In response to this need, we learning that seeks to erase the learned information conduct comprehensive machine unlearning experiments on of specific training data points from a trained model. Despite ViT models using the recently proposed MUL algorithms the recent active research in MUL within computer and datasets [5]. Specifically, we utilize two most widelyused vision, the majority of work has focused on ResNet-based ViT models, ViT-base and ViT-large, applying and analyzing models. Given that Vision Transformers (ViT) have become recent machine unlearning algorithms on these architectures.

algorithm, mucac, vision transformer, (12 more...)

arXiv.org Artificial Intelligence

2403.09681

Country:

North America > United States > Illinois (0.05)
North America > United States > California (0.05)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.64)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.94)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

DUCK: Distance-based Unlearning via Centroid Kinematics

Cotogni, Marco, Bonato, Jacopo, Sabetta, Luigi, Pelosin, Francesco, Nicolosi, Alessandro

arXiv.org Artificial IntelligenceDec-4-2023

Machine Unlearning is rising as a new field, driven by the pressing necessity of ensuring privacy in modern artificial intelligence models. This technique primarily aims to eradicate any residual influence of a specific subset of data from the knowledge acquired by a neural model during its training. This work introduces a novel unlearning algorithm, denoted as Distance-based Unlearning via Centroid Kinematics (DUCK), which employs metric learning to guide the removal of samples matching the nearest incorrect centroid in the embedding space. Evaluation of the algorithm's performance is conducted across various benchmark datasets in two distinct scenarios, class removal, and homogeneous sampling removal, obtaining state-of-the-art performance. We introduce a novel metric, called Adaptive Unlearning Score (AUS), encompassing not only the efficacy of the unlearning process in forgetting target data but also quantifying the performance loss relative to the original model. Moreover, we propose a novel membership inference attack to assess the algorithm's capacity to erase previously acquired knowledge, designed to be adaptable to future methodologies.

dataset, hr scenario, scenario, (15 more...)

arXiv.org Artificial Intelligence

2312.02052

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Virginia (0.04)

Genre: Research Report > Promising Solution (0.36)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback