Law
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Chiu, Yu Ying, Lee, Michael S., Calcott, Rachel, Handoko, Brandon, de Font-Reaulx, Paul, Rodriguez, Paula, Zhang, Chen Bo Calvin, Han, Ziwen, Sehwag, Udari Madhushani, Maurya, Yash, Knight, Christina Q, Lloyd, Harry R., Bacus, Florence, Mazeika, Mantas, Liu, Bing, Choi, Yejin, Gordon, Mitchell L, Levine, Sydney
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback
Luo, Chu Fei, Dahan, Samuel, Zhu, Xiaodan
As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improves the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.
Detecting Adversarial Fine-tuning with Auditing Agents
Egler, Sarah, Schulman, John, Carlini, Nicholas
Large Language Model (LLM) providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM can bypass safeguards. Particularly concerning, such attacks may avoid detection with datasets that are only implicitly harmful. Our work studies robust detection mechanisms for adversarial use of fine-tuning APIs. We introduce the concept of a fine-tuning auditing agent and show it can detect harmful fine-tuning prior to model deployment. We provide our auditing agent with access to the fine-tuning dataset, as well as the fine-tuned and pre-fine-tuned models, and request the agent assigns a risk score for the fine-tuning job. We evaluate our detection approach on a diverse set of eight strong fine-tuning attacks from the literature, along with five benign fine-tuned models, totaling over 1400 independent audits. These attacks are undetectable with basic content moderation on the dataset, highlighting the challenge of the task. With the best set of affordances, our auditing agent achieves a 56.2% detection rate of adversarial fine-tuning at a 1% false positive rate. Most promising, the auditor is able to detect covert cipher attacks that evade safety evaluations and content moderation of the dataset. While benign fine-tuning with unintentional subtle safety degradation remains a challenge, we establish a baseline configuration for further work in this area. We release our auditing agent at https://github.com/safety-research/finetuning-auditor.
Machine Learning for Climate Policy: Understanding Policy Progression in the European Green Deal
West, Patricia, Wan, Michelle WL, Hepburn, Alexander, Simpson, Edwin, Santos-Rodriguez, Raul, Clark, Jeffrey N
Climate change demands effective legislative action to mitigate its impacts. This study explores the application of machine learning (ML) to understand the progression of climate policy from announcement to adoption, focusing on policies within the European Green Deal. We present a dataset of 165 policies, incorporating text and metadata. We aim to predict a policy's progression status, and compare text representation methods, including TF-IDF, BERT, and ClimateBERT. Metadata features are included to evaluate the impact on predictive performance. On text features alone, ClimateBERT outperforms other approaches (RMSE = 0.17, R^2 = 0.29), while BERT achieves superior performance with the addition of metadata features (RMSE = 0.16, R^2 = 0.38). Using methods from explainable AI highlights the influence of factors such as policy wording and metadata including political party and country representation. These findings underscore the potential of ML tools in supporting climate policy analysis and decision-making.
Algorithmic Fairness in AI Surrogates for End-of-Life Decision-Making
Artificial intelligence surrogates are systems designed to infer preferences when individuals lose decision-making capacity. Fairness in such systems is a domain that has been insufficiently explored. Traditional algorithmic fairness frameworks are insufficient for contexts where decisions are relational, existential, and culturally diverse. This paper explores an ethical framework for algorithmic fairness in AI surrogates by mapping major fairness notions onto potential real-world end-of-life scenarios. It then examines fairness across moral traditions. The authors argue that fairness in this domain extends beyond parity of outcomes to encompass moral representation, fidelity to the patient's values, relationships, and worldview.
Membership Inference over Diffusion-models-based Synthetic Tabular Data
This study investigates the privacy risks associated with diffusion-based synthetic tabular data generation methods, focusing on their susceptibility to Membership Inference Attacks (MIAs). We examine two recent models, TabDDPM and TabSyn, by developing query-based MIAs based on the step-wise error comparison method. Our findings reveal that TabDDPM is more vulnerable to these attacks. TabSyn exhibits resilience against our attack models. Our work underscores the importance of evaluating the privacy implications of diffusion models and encourages further research into robust privacy-preserving mechanisms for synthetic data generation.
Aligning Language Models with Investor and Market Behavior for Financial Recommendations
Spadea, Fernando, Seneviratne, Oshani
Most financial recommendation systems often fail to account for key behavioral and regulatory factors, leading to advice that is misaligned with user preferences, difficult to interpret, or unlikely to be followed. We present FLARKO (Financial Language-model for Asset Recommendation with Knowledge-graph Optimization), a novel framework that integrates Large Language Models (LLMs), Knowledge Graphs (KGs), and Kahneman-Tversky Optimization (KTO) to generate asset recommendations that are both profitable and behaviorally aligned. FLARKO encodes users' transaction histories and asset trends as structured KGs, providing interpretable and controllable context for the LLM. To demonstrate the adaptability of our approach, we develop and evaluate both a centralized architecture (CenFLARKO) and a federated variant (FedFLARKO). To our knowledge, this is the first demonstration of combining KTO for fine-tuning of LLMs for financial asset recommendation. We also present the first use of structured KGs to ground LLM reasoning over behavioral financial data in a federated learning (FL) setting. Evaluated on the FAR-Trans dataset, FLARKO consistently outperforms state-of-the-art recommendation baselines on behavioral alignment and joint profitability, while remaining interpretable and resource-efficient.
Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games
Su, Chris, Li, Harrison, Marques, Matheus, Flint, George, Zhu, Kevin, Dev, Sunishchal
Recent work reports that Large Reasoning Models (LRMs) undergo a collapse in performance on solving puzzles beyond certain perplexity thresholds. In subsequent discourse, questions have arisen as to whether the nature of the task muddles an evaluation of true reasoning. One potential confound is the requirement that the model keep track of the state space on its own. We provide a large language model (LLM) with an environment interface for Tower of Hanoi problems, allowing it to make a move with a tool call, provide written justification, observe the resulting state space, and reprompt itself for the next move. We observe that access to an environment interface does not delay or eradicate performance collapse. Furthermore, LLM-parameterized policy analysis reveals increasing divergence from both optimal policies and uniformly random policies, suggesting that the model exhibits mode-like collapse at each level of complexity, and that performance is dependent upon whether the mode reflects the correct solution for the problem. We suggest that a similar phenomena might take place in LRMs.
Exploring the Potential of Citiverses for Regulatory Learning
Hupont, Isabelle, Ponti, Marisa, Schade, Sven
Citiverses hold the potential to support regulatory learning by offering immersive, virtual environments for experimenting with policy scenarios and technologies. This paper proposes a science-for-policy agenda to explore the potential of citiverses as experimentation spaces for regulatory learning, grounded in a consultation with a high-level panel of experts, including policymakers from the European Commission, national government science advisers and leading researchers in digital regulation and virtual worlds. It identifies key research areas, including scalability, real-time feedback, complexity modelling, cross-border collaboration, risk reduction, citizen participation, ethical considerations and the integration of emerging technologies. In addition, the paper analyses a set of experimental topics, spanning transportation, urban planning and the environment/climate crisis, that could be tested in citiverse platforms to advance regulatory learning in these areas. The proposed work is designed to inform future research for policy and emphasizes a responsible approach to developing and using citiverses. It prioritizes careful consideration of the ethical, economic, ecological and social dimensions of different regulations. The paper also explores essential preliminary steps necessary for integrating citiverses into the broader ecosystems of experimentation spaces, including test beds, living labs and regulatory sandboxes
Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System
Ben-Zion, Ziv, Raffelhüschen, Paul, Zettl, Max, Lüönd, Antonia, Burrer, Achim, Homan, Philipp, Spiller, Tobias R
AI companions powered by large language models (LLMs) are increasingly integrated into users' daily lives, offering emotional support and companionship. While existing safety systems focus on overt harms, they rarely address early-stage problematic behaviors that can foster unhealthy emotional dynamics, including over-attachment or reinforcement of social isolation. We developed SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), a LLM-based supervisory system with a specific system prompt that detects and mitigates risky emotional patterns before escalation. SHIELD targets five dimensions of concern: (1) emotional over-attachment, (2) consent and boundary violations, (3) ethical roleplay violations, (4) manipulative engagement, and (5) social isolation reinforcement. These dimensions were defined based on media reports, academic literature, existing AI risk frameworks, and clinical expertise in unhealthy relationship dynamics. To evaluate SHIELD, we created a 100-item synthetic conversation benchmark covering all five dimensions of concern. Testing across five prominent LLMs (GPT-4.1, Claude Sonnet 4, Gemma 3 1B, Kimi K2, Llama Scout 4 17B) showed that the baseline rate of concerning content (10-16%) was significantly reduced with SHIELD (to 3-8%), a 50-79% relative reduction, while preserving 95% of appropriate interactions. The system achieved 59% sensitivity and 95% specificity, with adaptable performance via prompt engineering. This proof-of-concept demonstrates that transparent, deployable supervisory systems can address subtle emotional manipulation in AI companions. Most development materials including prompts, code, and evaluation methods are made available as open source materials for research, adaptation, and deployment.