Goto

Collaborating Authors

 Revel, Manon


SEAL: Systematic Error Analysis for Value ALignment

arXiv.org Artificial Intelligence

Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.


Selecting Representative Bodies: An Axiomatic View

arXiv.org Artificial Intelligence

As the world's democratic institutions are challenged by dissatisfied citizens, political scientists and also computer scientists have proposed and analyzed various (innovative) methods to select representative bodies, a crucial task in every democracy. However, a unified framework to analyze and compare different selection mechanisms is missing, resulting in very few comparative works. To address this gap, we advocate employing concepts and tools from computational social choice in order to devise a model in which different selection mechanisms can be formalized. Such a model would allow for desirable representation axioms to be conceptualized and evaluated. We make the first step in this direction by proposing a unifying mathematical formulation of different selection mechanisms as well as various social-choice-inspired axioms such as proportionality and monotonicity.


The Optimal Size of an Epistemic Congress

arXiv.org Artificial Intelligence

We analyze the optimal size of a congress in a representative democracy. We take an epistemic view where voters decide on a binary issue with one ground truth outcome, and each voter votes correctly according to their competence levels in $[0, 1]$. Assuming that we can sample the best experts to form an epistemic congress, we find that the optimal congress size should be linear in the population size. This result is striking because it holds even when allowing the top representatives to be accurate with arbitrarily high probabilities. We then analyze real world data, finding that the actual sizes of congresses are much smaller than the optimal size our theoretical results suggest. We conclude by analyzing under what conditions congresses of sub-optimal sizes would still outperform direct democracy, in which all voters vote.