 crm




Computing Optimal Nash Equilibria in Multiplayer Games

Neural Information Processing Systems

There are other approaches (e.g., [ [...] Here, if all team members play strategies according to an NE minimizing the adversary's utility, the Eq. (1c) ensures that binary variable [...] This space is represented by Eq. (1), which involves nonlinear terms in Eq. (1a) [...] Section 3.4 shows that our techniques can significantly reduce the time [...] The procedure of CRM is shown in Algorithm 2, which is illustrated in Appendix A. [...] A collection N of subsets of players is a binary collection if: 1. { i | i N } N ; [...] Eqs. (1b)-(1g), (3), and (4) is the space of NEs. [...] Example 1 provides an example of N.



Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning

Zhang, Zheng, Shan, Ziwei, Song, Kaitao, Li, Yexin, Ren, Kan

arXiv.org Artificial Intelligence

Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment. These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison. Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth. Recent advances in enhancing reasoning abilities have significantly improved the performance of large language models (LLMs) (Snell et al., 2025; Jaech et al., 2024), where models derive final answers through explicit step-by-step reasoning.



User-centric Subjective Leaderboard by Customizable Reward Modeling

Jia, Qi, Song, Xiujie, Zhang, Zicheng, Guo, Yijin, Zhang, Kaiwei, Chen, Zijian, Zhai, Guangtao

arXiv.org Artificial Intelligence

Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs). With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences.


Bidirectional Knowledge Distillation for Enhancing Sequential Recommendation with Large Language Models

Wu, Jiongran, Liu, Jiahao, Li, Dongsheng, Zhang, Guangping, Han, Mingzhe, Gu, Hansu, Zhang, Peng, Shang, Li, Lu, Tun, Gu, Ning

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated exceptional performance in understanding and generating semantic patterns, making them promising candidates for sequential recommendation tasks. However, when combined with conventional recommendation models (CRMs), LLMs often face challenges related to high inference costs and static knowledge transfer methods. In this paper, we propose a novel mutual distillation framework, LLMD4Rec, that fosters dynamic and bidirectional knowledge exchange between LLM-centric and CRM-based recommendation systems. Unlike traditional unidirectional distillation methods, LLMD4Rec enables iterative optimization by alternately refining both models, enhancing the semantic understanding of CRMs and enriching LLMs with collaborative signals from user-item interactions. By leveraging sample-wise adaptive weighting and aligning output distributions, our approach eliminates the need for additional parameters while ensuring effective knowledge transfer. Extensive experiments on real-world datasets demonstrate that LLMD4Rec significantly improves recommendation accuracy across multiple benchmarks without increasing inference costs. This method provides a scalable and efficient solution for combining the strengths of both LLMs and CRMs in sequential recommendation systems.