
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Wang, Zhilin, Zeng, Jiaqi, Delalleau, Olivier, Shin, Hoo-Chang, Soares, Felipe, Bukharin, Alexander, Evans, Ellie, Dong, Yi, Kuchaiev, Oleksii

arXiv.org Artificial Intelligence

Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding, and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We further demonstrate that HelpSteer3-Preference can be used to train Generative RMs, and show how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference Models (NVIDIA Open Model): https://huggingface.co/collections/nvidia/reward-models-68377c5955575f71fcc7a2a3
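Reward models of this kind are commonly trained on preference pairs with a Bradley-Terry objective, minimizing -log σ(r_chosen - r_rejected). The sketch below is illustrative of that standard loss, not the paper's specific training recipe; the function name and scores are assumptions.

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected), computed stably."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))  # log(1 + e^-margin)
    return -margin + math.log1p(math.exp(margin))

# A well-separated pair incurs near-zero loss; a tie costs exactly log 2.
assert pairwise_preference_loss(5.0, -5.0) < 0.01
assert abs(pairwise_preference_loss(1.0, 1.0) - math.log(2)) < 1e-12
```

Minimizing this loss over a dataset of (chosen, rejected) response pairs pushes the RM to score preferred responses higher, which is what benchmarks like RM-Bench then measure.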


Response to Reviewer #1, Comment 1: "The significance of the proposed method is not very clear"

Neural Information Processing Systems

We greatly appreciate the reviewers' effort and helpful comments. Comment 1: "The significance of the proposed method is not very clear..." It also has great theoretical significance in the optimization area. Though the convergence rate of this method could be suboptimal, it is a practical way to ... In addition, [6] shows some examples of saddle-point algorithms where projection onto the constraint sets is hard. Comment 2: "Why do we consider a nuclear norm constraint for this classification problem?" ... We find that this paper does not have Sections 5.4 and 5.6.


Overall Response: We sincerely thank all the reviewers for their helpful comments and constructive suggestions

Neural Information Processing Systems

We sincerely thank all the reviewers for their helpful comments and constructive suggestions. Actually, our motivation for using attention and distillation differs from their origins. Regarding the comparison with related work: ...Net [C2] and NestedNet [C3] (results taken from their papers). The proposed method is robust to hyper-parameters.



We provide two responses to the common concerns raised by the reviewers, and then reply to each reviewer, respectively

Neural Information Processing Systems

We would like to thank all the reviewers for your helpful comments and suggestions. As shown in Appendix A.3, the layer-wise GCN network has the highest computational complexity in the computational propagation flow. Please see the response in Common Response 2. For a fair comparison, we only report the result on the semi-supervised task. Please see the response in Common Response 2.


The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support

BN, Suhas, Mahajan, Yash, Mattioli, Dominik, Sherrill, Andrew M., Arriaga, Rosa I., Wiese, Chris W., Abdullah, Saeed

arXiv.org Artificial Intelligence

This paper investigates the capacity of small language models (0.5B-5B parameters) to generate empathetic responses for individuals with PTSD. We introduce Trauma-Informed Dialogue for Empathy (TIDE), a novel dataset comprising 10,000 two-turn conversations across 500 diverse, clinically-grounded PTSD personas (https://huggingface.co/datasets/yenopoya/TIDE). Using frontier model outputs as ground truth, we evaluate eight small LLMs in zero-shot settings and after fine-tuning. Fine-tuning enhances empathetic capabilities, improving cosine similarity and perceived empathy, although gains vary across emotional scenarios and smaller models exhibit a "knowledge transfer ceiling." As expected, Claude Sonnet 3.5 consistently outperforms all models, but surprisingly, the smaller models often approach human-rated empathy levels. Demographic analyses showed that older adults favored responses that validated distress before offering support (p = .004), while graduate-educated users preferred emotionally layered replies in specific scenarios. Gender-based differences were minimal (p > 0.15), suggesting the feasibility of broadly empathetic model designs. This work offers insights into building resource-efficient, emotionally intelligent systems for mental health support.
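One of the abstract's reported metrics, cosine similarity between a small model's reply and the frontier-model ground truth, reduces to a comparison of embedding vectors. The sketch below shows that metric in isolation; the three-dimensional vectors are hypothetical stand-ins for real sentence embeddings, and the embedding model itself is not specified here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: a frontier-model reference reply vs. a
# fine-tuned small-model reply that closely tracks it.
ref = [0.20, 0.70, 0.10]
cand = [0.25, 0.65, 0.12]
assert 0.99 < cosine_similarity(ref, cand) <= 1.0
```

Fine-tuning "improving cosine similarity" then means the small model's reply embeddings drift closer in direction to the frontier-model references.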


R3: Robust Rubric-Agnostic Reward Models

Anugraha, David, Tang, Zilu, Miranda, Lester James V., Zhao, Hanyang, Farhansyah, Mohammad Rifqi, Kuwanto, Garry, Wijaya, Derry, Winata, Genta Indra

arXiv.org Artificial Intelligence

Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3.


Incentivizing High-Quality Human Annotations with Golden Questions

Liu, Shang, Cai, Zhongze, Wang, Hanzhao, Ma, Zhongyao, Li, Xiaocheng

arXiv.org Machine Learning

Human-annotated data plays a vital role in training large language models (LLMs), for example in supervised fine-tuning and human preference alignment. However, it is not guaranteed that paid human annotators produce high-quality data. In this paper, we study how to incentivize human annotators to do so. We start from a principal-agent model to capture the dynamics between the company (the principal) and the annotator (the agent), where the principal can only monitor the annotation quality by examining $n$ samples. We investigate the maximum likelihood estimators (MLE) and the corresponding hypothesis testing to incentivize annotators: the agent is given a bonus if the MLE passes the test. By analyzing the variance of the outcome, we show that the strategic behavior of the agent makes the hypothesis testing very different from traditional ones: unlike the exponential rate proved by large deviation theory, the principal-agent model's hypothesis testing rate is $\Theta(1/\sqrt{n \log n})$. Our theory implies two criteria for the \emph{golden questions} used to monitor the performance of the annotators: they should be of (1) high certainty and (2) similar format to normal ones. In that light, we select a set of golden questions in human preference data. Through incentive-compatible experiments, we find that the annotators' behavior is better revealed by those golden questions than by traditional survey techniques such as instructed manipulation checks.
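The paper's analysis of strategic annotators yields the subtler $\Theta(1/\sqrt{n \log n})$ rate, but the basic monitoring mechanism can be sketched as a one-sided binomial test on $n$ golden questions: award the bonus only when the annotator's score on questions with known answers is unlikely under careless (chance-level) annotation. The baseline accuracy and significance level below are illustrative assumptions, not values from the paper.

```python
import math

def binom_pvalue(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance a p-accurate
    annotator gets at least k of n golden questions right."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def passes_test(correct, n, baseline=0.5, alpha=0.05):
    """Award the bonus only if the score is implausible under
    careless, chance-level annotation (assumed baseline accuracy)."""
    return binom_pvalue(correct, n, baseline) < alpha

assert passes_test(18, 20)      # 18/20 correct: clearly diligent
assert not passes_test(11, 20)  # 11/20: consistent with guessing
```

High-certainty golden questions (criterion 1) keep the baseline well below a diligent annotator's accuracy, which is what makes the two hypotheses separable at all.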