Goto

Collaborating Authors

 unawareness


The Hawthorne Effect in Reasoning Models Evaluating and Steering Test Awareness

Neural Information Processing Systems

Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated--which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its performance on safety-related tasks1. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art openweight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.


Mechanism Design under Unawareness -- Extended Abstract

arXiv.org Artificial Intelligence

We study the design of mechanisms under asymmetric awareness and information. While the mechanism designer cannot necessarily commit to a particular social choice function in the face of unawareness, she can at least commit to properties of social choice functions such as efficiency given ex post awareness. Assuming quasi-linear utilities and private values, we show that we can implement in conditional dominant strategies a social choice function that is utilitarian ex post efficient under pooled awareness without the need of the social planner being fully aware ex ante. To this end, we develop novel dynamic versions of Vickrey-Clarke-Groves mechanisms in which true types are revealed and subsequently elaborated at endogenous higher awareness levels. We explore how asymmetric awareness affects budget balance and participation constraints. We show that ex ante unforeseen contingencies are no excuse for deficits. Finally, we propose a dynamic elaboration reverse second price auction for efficient procurement of complex incompletely specified projects with budget balance and participation constraints.


The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

arXiv.org Artificial Intelligence

Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.



Algorithmic Tradeoffs in Fair Lending: Profitability, Compliance, and Long-Term Impact

arXiv.org Artificial Intelligence

As financial institutions increasingly rely on machine learning models to automate lending decisions, concerns about algorithmic fairness have risen. This paper explores the tradeoff between enforcing fairness constraints (such as demographic parity or equal opportunity) and maximizing lender profitability. Through simulations on synthetic data that reflects real-world lending patterns, we quantify how different fairness interventions impact profit margins and default rates. Our results demonstrate that equal opportunity constraints typically impose lower profit costs than demographic parity, but surprisingly, removing protected attributes from the model (fairness through unawareness) outperforms explicit fairness interventions in both fairness and profitability metrics. We further identify the specific economic conditions under which fair lending becomes profitable and analyze the feature-specific drivers of unfairness. These findings offer practical guidance for designing lending algorithms that balance ethical considerations with business objectives.


Reconsidering Fairness Through Unawareness from the Perspective of Model Multiplicity

arXiv.org Machine Learning

Fairness through Unawareness (FtU) describes the idea that discrimination against demographic groups can be avoided by not considering group membership in the decisions or predictions. This idea has long been criticized in the machine learning literature as not being sufficient to ensure fairness. In addition, the use of additional features is typically thought to increase the accuracy of the predictions for all groups, so that FtU is sometimes thought to be detrimental to all groups. In this paper, we show both theoretically and empirically that FtU can reduce algorithmic discrimination without necessarily reducing accuracy. We connect this insight with the literature on Model Multiplicity, to which we contribute with novel theoretical and empirical results. Furthermore, we illustrate how, in a real-life application, FtU can contribute to the deployment of more equitable policies without losing efficacy. Our findings suggest that FtU is worth considering in practical applications, particularly in high-risk scenarios, and that the use of protected attributes such as gender in predictive models should be accompanied by a clear and well-founded justification.


Review for NeurIPS paper: Achieving Equalized Odds by Resampling Sensitive Attributes

Neural Information Processing Systems

Weaknesses: Intuitively randomising sensitive feature should lead to fairer results, however, fairness though unawareness poses a risk of unfairness by proxy as there are ways of predicting protected characteristic features from other features [Ruggieri et all, 2010, Adler et al 2016]. Also a continuous analog of fairness through unawareness [Dwark et al 2012] has been proposed via counterfactual fairness [Matt J. Kusner, et al, Counterfactual fairness, 2017]. In the counterfactual fairness, one has to estimate a dependency structure over the features, i.e. a causal graph, in order to create a counterfactual example when changing/flipping observational sensitive feature. To properly evaluate the contribution of the proposed approach, it has to be compared --methodologically and empirically -- not only to fairness through unawareness, but also to counterfactual fairness approaches. Another concern is that very little information is dedicate to the analysis how to estimate p(A Y).


A Systems Thinking Approach to Algorithmic Fairness

arXiv.org Artificial Intelligence

Systems thinking provides us with a way to model the algorithmic fairness problem by allowing us to encode prior knowledge and assumptions about where we believe bias might exist in the data generating process. We can then model this using a series of causal graphs, enabling us to link AI/ML systems to politics and the law. By treating the fairness problem as a complex system, we can combine techniques from machine learning, causal inference, and system dynamics. Each of these analytical techniques is designed to capture different emergent aspects of fairness, allowing us to develop a deeper and more holistic view of the problem. This can help policymakers on both sides of the political aisle to understand the complex trade-offs that exist from different types of fairness policies, providing a blueprint for designing AI policy that is aligned to their political agendas.


SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning

arXiv.org Artificial Intelligence

This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the robot is unaware of a concept that's key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation. Through dialogue, the robot discovers and then learns to exploit unforeseen possibilities. Using SECURE, the robot not only learns from the user's corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the robot to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that a robot that is semantics-aware -- that is, it exploits the logical consequences of both sentence and discourse semantics in the learning and inference process -- learns to solve rearrangement under unawareness more effectively than a robot that lacks such capabilities.


Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach

Journal of Artificial Intelligence Research

Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.