Law
Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks
Radharapu, Bhaktipriya, Revel, Manon, Ung, Megan, Ruder, Sebastian, Williams, Adina
The increasing use of LLMs as substitutes for humans in ``aligning'' LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a ``no-consensus'' benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human disagreement even on topics where humans themselves are divided.
DP-RTFL: Differentially Private Resilient Temporal Federated Learning for Trustworthy AI in Regulated Industries
Federated Learning (FL) has emerged as a critical paradigm for enabling privacy-preserving machine learning, particularly in regulated sectors such as finance and healthcare. However, standard FL strategies often encounter significant operational challenges related to fault tolerance, system resilience against concurrent client and server failures, and the provision of robust, verifiable privacy guarantees essential for handling sensitive data. These deficiencies can lead to training disruptions, data loss, compromised model integrity, and non-compliance with data protection regulations (e.g., GDPR, CCPA). This paper introduces Differentially Private Resilient Temporal Federated Learning (DP-RTFL), an advanced FL framework designed to ensure training continuity, precise state recovery, and strong data privacy. DP-RTFL integrates local Differential Privacy (LDP) at the client level with resilient temporal state management and integrity verification mechanisms, such as hash-based commitments (referred to as Zero-Knowledge Integrity Proofs or ZKIPs in this context). The framework is particularly suited for critical applications like credit risk assessment using sensitive financial data, aiming to be operationally robust, auditable, and scalable for enterprise AI deployments. The implementation of the DP-RTFL framework is available as open-source.
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
Zheng, Baolin, Chen, Guanlin, Zhong, Hongqiong, Teng, Qingyang, Tan, Yingshui, Liu, Zhendong, Wang, Weixun, Liu, Jiaheng, Yang, Jian, Jing, Huiyun, Wei, Jincheng, Su, Wenbo, Zhu, Xiaoyong, Zheng, Bo, Zhang, Kaifu
Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant security vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. Existing MLLM safety benchmarks, however, fall short in terms of data quality and coverge, and modal risk combinations, resulting in inflated and contradictory evaluation results, which hinders the discovery and governance of security concerns. Besides, we argue that vulnerabilities to harmful queries and oversensitivity to harmless ones should be considered simultaneously in MLLMs safety evaluation, whereas these were previously considered separately. In this paper, to address these shortcomings, we introduce Unified Safety Benchmarks (USB), which is one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features high-quality queries, extensive risk categories, comprehensive modal combinations, and encompasses both vulnerability and oversensitivity evaluations. From the perspective of two key dimensions: risk categories and modality combinations, we demonstrate that the available benchmarks -- even the union of the vast majority of them -- are far from being truly comprehensive. To bridge this gap, we design a sophisticated data synthesis pipeline that generates extensive, high-quality complementary data addressing previously unexplored aspects. By combining open-source datasets with our synthetic data, our benchmark provides 4 distinct modality combinations for each of the 61 risk sub-categories, covering both English and Chinese across both vulnerability and oversensitivity dimensions.
An Explicit Syllogistic Legal Reasoning Framework for Large Language Models
Zhang, Kepu, Yu, Weijie, Sun, Zhongxiang, Xu, Jun
Syllogistic reasoning is crucial for sound legal decision-making, allowing legal professionals to draw logical conclusions by applying general principles to specific case facts. While large language models (LLMs) can answer legal questions, they often struggle with explicit syllogistic reasoning. Their outputs tend to be implicit, unstructured, and consequently, less explainable and trustworthy. To overcome these limitations, we introduce SyLeR, a novel framework designed to enable LLMs to perform explicit syllogistic legal reasoning. SyLeR employs a tree-structured hierarchical retrieval mechanism to synthesize relevant legal statutes and precedents, thereby constructing comprehensive major premises. This is followed by a two-stage fine-tuning process: an initial supervised fine-tuning warm-up establishes a foundational understanding of syllogistic reasoning, while reinforcement learning, guided by a structure-aware reward mechanism, refines the model's capacity to generate diverse, logically sound, and well-structured reasoning paths. We conducted extensive experiments to evaluate SyLeR's performance. Our evaluations spanned diverse dimensions, including both in-domain and cross-domain user groups (legal laypersons and practitioners), multiple languages (Chinese and French), and various LLM backbones (legal-specific and open-domain LLMs). The results consistently demonstrate that SyLeR significantly enhances response accuracy and reliably produces explicit, explainable, and trustworthy legal reasoning.
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
Wu, Xiaorui, Mao, Xiaofeng, Li, Fei, Zhang, Xin, Li, Xuanhong, Teng, Chong, Ji, Donghong, Li, Zhuang
Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
Beyond the Black Box: Interpretability of LLMs in Finance
Large Language Models (LLMs) exhibit remarkable capabilities across a spectrum of tasks in financial services, including report generation, chatbots, sentiment analysis, regulatory compliance, investment advisory, financial knowledge retrieval, and summarization. However, their intrinsic complexity and lack of transparency pose significant challenges, especially in the highly regulated financial sector, where interpretability, fairness, and accountability are critical. As far as we are aware, this paper presents the first application in the finance domain of understanding and utilizing the inner workings of LLMs through mechanistic interpretability, addressing the pressing need for transparency and control in AI systems. Mechanistic interpretability is the most intuitive and transparent way to understand LLM behavior by reverse-engineering their internal workings. By dissecting the activations and circuits within these models, it provides insights into how specific features or components influence predictions - making it possible not only to observe but also to modify model behavior. In this paper, we explore the theoretical aspects of mechanistic interpretability and demonstrate its practical relevance through a range of financial use cases and experiments, including applications in trading strategies, sentiment analysis, bias, and hallucination detection. While not yet widely adopted, mechanistic interpretability is expected to become increasingly vital as adoption of LLMs increases. Advanced interpretability tools can ensure AI systems remain ethical, transparent, and aligned with evolving financial regulations. In this paper, we have put special emphasis on how these techniques can help unlock interpretability requirements for regulatory and compliance purposes - addressing both current needs and anticipating future expectations from financial regulators globally.
PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder
Sun, Yiqun, Huang, Qiang, Tung, Anthony K. H., Yu, Jun
Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and their corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on two large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at https://github.com/dukesun99/ACL-PRISM.
ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving
Chen, Yongming, Chen, Miner, Liao, Liewen, Jiang, Mingyang, Zuo, Xiang, Zhang, Hengrui, Xi, Yuchen, Zhang, Songan
Reinforcement learning (RL) in autonomous driving employs a trial-and-error mechanism, enhancing robustness in unpredictable environments. However, crafting effective reward functions remains challenging, as conventional approaches rely heavily on manual design and demonstrate limited efficacy in complex scenarios. To address this issue, this study introduces a responsibility-oriented reward function that explicitly incorporates traffic regulations into the RL framework. Specifically, we introduced a Traffic Regulation Knowledge Graph and leveraged Vision-Language Models alongside Retrieval-Augmented Generation techniques to automate reward assignment. This integration guides agents to adhere strictly to traffic laws, thus minimizing rule violations and optimizing decision-making performance in diverse driving conditions. Experimental validations demonstrate that the proposed methodology significantly improves the accuracy of assigning accident responsibilities and effectively reduces the agent's liability in traffic incidents.
LLM Agents Should Employ Security Principles
Zhang, Kaiyuan, Su, Zian, Chen, Pin-Yu, Bertino, Elisa, Zhang, Xiangyu, Li, Ninghui
Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well-established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense-in-depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle. We evaluate with state-of-the-art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.
Exploring Societal Concerns and Perceptions of AI: A Thematic Analysis through the Lens of Problem-Seeking
This study introduces a novel conceptual framework distinguishing problem-seeking from problem-solving to clarify the unique features of human intelligence in contrast to AI. Problem-seeking refers to the embodied, emotionally grounded process by which humans identify and set goals, while problem-solving denotes the execution of strategies aimed at achieving such predefined objectives. The framework emphasizes that while AI excels at efficiency and optimization, it lacks the orientation derived from experiential grounding and the embodiment flexibility intrinsic to human cognition. To empirically explore this distinction, the research analyzes metadata from 157 YouTube videos discussing AI. Conducting a thematic analysis combining qualitative insights with keyword-based quantitative metrics, this mixed-methods approach uncovers recurring themes in public discourse, including privacy, job displacement, misinformation, optimism, and ethical concerns. The results reveal a dual sentiment: public fascination with AI's capabilities coexists with anxiety and skepticism about its societal implications. The discussion critiques the orthogonality thesis, which posits that intelligence is separable from goal content, and instead argues that human intelligence integrates goal-setting and goal-pursuit. It underscores the centrality of embodied cognition in human reasoning and highlights how AI's limitations come from its current reliance on computational processing. The study advocates for enhancing emotional and digital literacy to foster responsible AI engagement. It calls for reframing public discourse to recognize AI as a tool that augments -- rather than replaces -- human intelligence. By positioning problem seeking at the core of cognition and as a critical dimension of intelligence, this research offers new perspectives on ethically aligned and human-centered AI development.