AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)

Neural Information Processing SystemsFeb-18-2026, 05:40:21 GMT

cf8b2205e39f81726a8d828ecbe00ad0-Paper-Conference.pdf

cal-dpo, large language model, machine learning, (18 more...)

Country:

North America > United States > Pennsylvania (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Media (0.68)
Banking & Finance (0.67)
Government > Regional Government > North America Government > United States Government (0.45)
Education > Curriculum > Subject-Specific Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Communications > Social Media (0.69)

arXiv.org Artificial IntelligenceNov-25-2025

Personalized LLM Decoding via Contrasting Personal Preference

Bu, Hyungjune, Jung, Chanjoo, Kang, Minjae, Kim, Jaehyung

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.

large language model, machine learning, natural language, (19 more...)

2506.12109

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-10-2025, 17:08:20 GMT

cf8b2205e39f81726a8d828ecbe00ad0-Paper-Conference.pdf

arxiv preprint arxiv, cal-dpo, dataset, (13 more...)

Country:

North America > United States > Pennsylvania (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Media (0.68)
Banking & Finance (0.67)
Government > Regional Government > North America Government > United States Government (0.45)
Education > Curriculum > Subject-Specific Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Communications > Social Media (0.69)

arXiv.org Artificial IntelligenceOct-10-2025

Contrastive Weak-to-strong Generalization

Jiang, Houcheng, Fang, Junfeng, Wu, Jiaxin, Zhang, Tianyu, Gao, Chen, Li, Yong, Wang, Xiang, He, Xiangnan, Deng, Yang

Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.

implicit reward, large language model, machine learning, (20 more...)

2510.07884

Country: Asia (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Bui, The Viet, Mai, Tien, Nguyen, Hong Thanh

Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning

arXiv.org Artificial IntelligenceSep-29-2025

We study the problem of online multi-agent reinforcement learning (MARL) in environments with sparse rewards, where reward feedback is not provided at each interaction but only revealed at the end of a trajectory. This setting, though realistic, presents a fundamental challenge: the lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. To address this issue, we propose a novel framework that integrates online inverse preference learning with multi-agent on-policy optimization into a unified architecture. At its core, our approach introduces an implicit multi-agent reward learning model, built upon a preference-based value-decomposition network, which produces both global and local reward signals. These signals are further used to construct dual advantage streams, enabling differentiated learning targets for the centralized critic and decentralized actors. In addition, we demonstrate how large language models (LLMs) can be leveraged to provide preference labels that enhance the quality of the learned reward model. Empirical evaluations on state-of-the-art benchmarks, including MAMuJoCo and SMACv2, show that our method achieves superior performance compared to existing baselines, highlighting its effectiveness in addressing sparse-reward challenges in online MARL.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2509.21828

Country: North America > United States > Oregon (0.28)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceJul-8-2025

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Wang, Bo, Cheng, Qinyuan, Peng, Runyu, Bao, Rong, Li, Peiji, Guo, Qipeng, Li, Linyang, Zeng, Zhiyuan, Zhou, Yunhua, Qiu, Xipeng

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to \textbf{25\%} relative gain and \textbf{6\%} absolute win rate increase in instruction following tasks. Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.

large language model, machine learning, natural language, (19 more...)

2507.00018

Country:

Asia (1.00)
North America > United States (0.93)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Neural Information Processing SystemsMay-27-2025, 17:18:32 GMT

Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

cal-dpo, calibrated direct preference optimization, implicit reward, (2 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)

arXiv.org Artificial IntelligenceMar-6-2025

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Yang, Wen, Wu, Junhong, Wang, Chen, Zong, Chengqing, Zhang, Jiajun

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that $\textit{captures}$ learned preferences from well-aligned English models by implicit rewards and $\textit{transfers}$ them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding

alignment, arxiv preprint arxiv, preference alignment, (14 more...)

2503.04647

Country:

North America > United States (0.14)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > New Finding (0.86)

Industry: Education > Curriculum > Subject-Specific Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceDec-18-2024

Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Xiao, Teng, Yuan, Yige, Zhu, Huaisheng, Li, Mingxiao, Honavar, Vasant G

We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.

cal-dpo, large language model, machine learning, (17 more...)

2412.14516

Country:

North America > United States > Pennsylvania (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Media (0.68)
Banking & Finance (0.67)
Government > Regional Government > North America Government > United States Government (0.45)
Education > Curriculum > Subject-Specific Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)