Law
Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models
Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe. Current alignment approaches struggle with value specification complexity and fail to address emergent power-seeking behaviors. We propose "Corrigibility as a Singular Target" (CAST)-designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them. This paradigm shift from static value-loading to dynamic human empowerment transforms instrumental drives: self-preservation serves only to maintain the principal's control; goal modification becomes facilitating principal guidance. We present a comprehensive empirical research agenda spanning training methodologies (RLAIF, SFT, synthetic data generation), scalability testing across model sizes, and demonstrations of controlled instructability. Our vision: FMs that become increasingly responsive to human guidance as capabilities grow, offering a path to beneficial AI that remains as tool-like as possible, rather than supplanting human judgment. This addresses the core alignment problem at its source, preventing the default trajectory toward misaligned instrumental convergence.
Conditioning Large Language Models on Legal Systems? Detecting Punishable Hate Speech
Ludwig, Florian, Zesch, Torsten, Zufall, Frederike
The assessment of legal problems requires the consideration of a specific legal system and its levels of abstraction, from constitutional law to statutory law to case law. The extent to which Large Language Models (LLMs) internalize such legal systems is unknown. In this paper, we propose and investigate different approaches to condition LLMs at different levels of abstraction in legal systems. This paper examines different approaches to conditioning LLMs at multiple levels of abstraction in legal systems to detect potentially punishable hate speech. We focus on the task of classifying whether a specific social media posts falls under the criminal offense of incitement to hatred as prescribed by the German Criminal Code. The results show that there is still a significant performance gap between models and legal experts in the legal assessment of hate speech, regardless of the level of abstraction with which the models were conditioned. Our analysis revealed, that models conditioned on abstract legal knowledge lacked deep task understanding, often contradicting themselves and hallucinating answers, while models using concrete legal knowledge performed reasonably well in identifying relevant target groups, but struggled with classifying target conducts.
A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems
Klisura, Đorđe, Torres, Astrid R Bernaga, Gárate-Escamilla, Anna Karen, Biswal, Rajesh Roshan, Yang, Ke, Pataci, Hilal, Rios, Anthony
Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini's zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.
TaxAgent: How Large Language Model Designs Fiscal Policy
Wang, Jizhou, Fang, Xiaodan, Huang, Lei, Huang, Yongfeng
--Economic inequality is a global challenge, intensifying disparities in education, healthcare, and social stability. Traditional systems like the U.S. federal income tax reduce inequality but lack adaptability. Although models like the Saez Optimal T axation adjust dynamically, they fail to address taxpayer heterogeneity and irrational behavior . This study introduces T axAgent, a novel integration of large language models (LLMs) with agent-based modeling (ABM) to design adaptive tax policies. In our macroeconomic simulation, heterogeneous H-Agents (households) simulate real-world taxpayer behaviors while the T axAgent (government) utilizes LLMs to iteratively optimize tax rates, balancing equity and productivity. Benchmarked against Saez Optimal T axation, U.S. federal income taxes, and free markets, T axAgent achieves superior equity-efficiency tradeoffs. This research offers a novel taxation solution and a scalable, data-driven framework for fiscal policy evaluation. Economic inequality is a critical global issue with profound social, political, and economic impacts. Research highlights its detrimental effects on education, healthcare, political stability, and economic growth[1, 2, 3].
IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages
Azmi, Muhammad Falensi, Kautsar, Muhammad Dehan Al, Wicaksono, Alfan Farizki, Koto, Fajri
Although region-specific large language models (LLMs) are increasingly developed, their safety remains underexplored, particularly in culturally diverse settings like Indonesia, where sensitivity to local norms is essential and highly valued by the community. In this work, we present IndoSafety, the first high-quality, human-verified safety evaluation dataset tailored for the Indonesian context, covering five language varieties: formal and colloquial Indonesian, along with three major local languages: Javanese, Sundanese, and Minangkabau. IndoSafety is constructed by extending prior safety frameworks to develop a taxonomy that captures Indonesia's sociocultural context. We find that existing Indonesian-centric LLMs often generate unsafe outputs, particularly in colloquial and local language settings, while fine-tuning on IndoSafety significantly improves safety while preserving task performance. Our work highlights the critical need for culturally grounded safety evaluation and provides a concrete step toward responsible LLM deployment in multilingual settings. Warning: This paper contains example data that may be offensive, harmful, or biased.
Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning
Patnaik, Sohan, Aggarwal, Milan, Bhatia, Sumit, Krishnamurthy, Balaji
LLMssuch as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improves the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the eff icacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (https://github.com/Sohanpatnaik106/collate).
MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework
Qi, Yupeng, Lyu, Ziyu, Yang, Min, Wang, Yanlin, Bai, Lu, Cui, Lixin
As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies solve this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which might reduce helpfulness, while the safety-constrained offline methods perform poorly in adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a \textbf{\underline{Mi}}xture of Experts (MoE) framework for safety-helpfulness \textbf{\underline{d}}ual \textbf{\underline{P}}reference \textbf{\underline{O}}ptimization. Firstly, MidPO devises single-preference enhanced direct preference optimization approach to transform the base model into two independent experts, termed safety and helpfulness experts, and fine-tunes the two independent experts for optimal safety or helpfulness performance. Secondly, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism to allocate contributions from each expert adaptively. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.
Weak Supervision for Real World Graphs
Nair, Pratheeksha, Rabbany, Reihaneh
Node classification in real world graphs often suffers from label scarcity and noise, especially in high stakes domains like human trafficking detection and misinformation monitoring. While direct supervision is limited, such graphs frequently contain weak signals, noisy or indirect cues, that can still inform learning. We propose WSNET, a novel weakly supervised graph contrastive learning framework that leverages these weak signals to guide robust representation learning. WSNET integrates graph structure, node features, and multiple noisy supervision sources through a contrastive objective tailored for weakly labeled data. Across three real world datasets and synthetic benchmarks with controlled noise, WSNET consistently outperforms state of the art contrastive and noisy label learning methods by up to 15% in F1 score. Our results highlight the effectiveness of contrastive learning under weak supervision and the promise of exploiting imperfect labels in graph based settings.
Gender Inequality in English Textbooks Around the World: an NLP Approach
Textbooks are important for shaping children's understanding of the world. Previous studies of individual countries have suggested that gender inequality exists. There lacks a study that compares gender inequality in textbooks around the world. This study uses NLP approaches to quantify gender inequality in English textbooks in 7 cultural spheres, 22 countries, by measuring the count, firstness, and TF IDF words by gender. The study also counted the names that appeared in TF IDF word lists and sorted the names by gender, found out that LLMs can distinguish between the different TF IDF word lists, and mapped the TF IDF words to GloVe to see that some keywords are closer to one gender than the other. The study found more male count, firstness, and names. The study found that there is significant gender inequality in all the textbooks. Gender inequality is demonstrated the least in textbooks of the Latin Cultural Sphere.
AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output
Suzuki, Hisami, Katsumata, Satoru, Kodama, Takashi, Takahashi, Tetsuro, Nakayama, Kouta, Sekine, Satoshi
In this paper we present AnswerCarefully, a dataset for promoting the safety and appropriateness of Japanese LLM outputs. The dataset consists of 1,800 pairs of questions and reference answers, where the questions require special attention in answering. It covers a wide range of risk categories established in prior English-language datasets, but the data samples are original in that they are manually created to reflect the socio-cultural context of LLM usage in Japan. We show that using this dataset for instruction to fine-tune a Japanese LLM led to improved output safety without compromising the utility of general responses. We also report the results of a safety evaluation of 12 Japanese LLMs using this dataset as a benchmark. Finally, we describe the latest update on the dataset which provides English translations and annotations of the questions, aimed at facilitating the derivation of similar datasets in different languages and regions.