Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin

arXiv.org Artificial Intelligence 

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes: task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements stand in opposition to standard reward modeling approaches, which are designed under the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. We find that these tendencies are also echoed by popular LLM-as-Judge evaluation methods, which consistently identify a winning response even in cases of diverging preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.

As large language models (LLMs) continue to rise in prominence and serve millions of people on a daily basis, there is an increasing need to ensure that systems are pluralistically aligned (Sorensen et al., 2024). Learning from human preferences has emerged as the standard method for adapting LLMs to facilitate user-assistant interactions, with much success. Despite these advances, however, the field continues to struggle with the challenge of handling diverging preferences, where users disagree on the ideal response to a prompt. Prior works on developing pluralistically aligned LLMs have focused on synthetic preference datasets, where disagreements are simulated based on author-defined features and frequencies (Poddar et al., 2024; Chen et al., 2024). In this work, we take a step back to ask a foundational question: when and why do human annotators disagree in their preferences? To make this research possible, we introduce MultiPref-Disagreements and HelpSteer2-Disagreements.
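To make the reward-modeling point concrete, the sketch below shows the standard Bradley-Terry pairwise loss and why it cannot distinguish unanimous agreement from a split annotator vote: the per-annotator split is collapsed to a single majority label before training. The reward scores and vote counts here are hypothetical illustrations, not values or code from the paper.

```python
import math
from collections import Counter

def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

# Hypothetical per-annotator votes over two responses (A vs. B) for two prompts.
votes = {
    "prompt_1": ["A", "A", "A", "A", "A"],  # unanimous agreement
    "prompt_2": ["A", "A", "A", "B", "B"],  # diverging preferences (3-2 split)
}

for prompt, labels in votes.items():
    majority_label, _ = Counter(labels).most_common(1)[0]
    # Standard aggregation keeps only the majority label; the vote split is
    # discarded, so both prompts yield the same (chosen, rejected) pair and loss.
    loss = bt_nll(r_chosen=0.8, r_rejected=-0.2)
    print(f"{prompt}: majority={majority_label}, BT loss={loss:.3f}")
```

Because only the aggregated label enters the loss, a reward model trained this way receives the same training signal from a 3-2 split as from a 5-0 consensus, which is the failure mode the paper addresses by identifying diverging preferences explicitly.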
