Law
CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
Cho, Seonglae, Wu, Zekun, Koshiyama, Adriano
Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby reducing spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma-2 2B and LLaMA-3.1 8B, notably achieving a +3.3% improvement in MMLU performance with 4000 samples and a +27.2% improvement in HarmBench with only 108 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Luo, Hanjun, Dai, Shenyu, Ni, Chiming, Li, Xinfeng, Zhang, Guibin, Wang, Kun, Liu, Tongliang, Salam, Hanan
Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents' step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing "Strict" and "Lenient" judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor.
ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
Han, Peixuan, Liu, Zijia, You, Jiaxuan
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader learns how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.
LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
Zhang, Haiqi, Zhu, Zhengyuan, Zhang, Zeyu, Li, Chengkai
With the rapid expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of taxonomies of factual claims from social media by generating topics at multiple levels of granularity. The resulting hierarchical structure significantly reduces redundancy and improves information accessibility. We also propose dedicated taxonomy evaluation metrics to enable comprehensive assessment. Evaluations conducted on three diverse datasets demonstrate LLMTaxo's effectiveness in producing clear, coherent, and comprehensive taxonomies. Among the evaluated models, GPT-4o mini consistently outperforms others across most metrics. The framework's flexibility and low reliance on manual intervention underscore its potential for broad applicability.
Label Indeterminacy in AI & Law
Steging, Cor, Zbiegieล, Tadeusz
Machine learning is increasingly used in the legal domain, where it typically operates retrospectively by treating past case outcomes as ground truth. However, legal outcomes are often shaped by human interventions that are not captured in most machine learning approaches. A final decision may result from a settlement, an appeal, or other procedural actions. This creates label indeterminacy: the outcome could have been different if the intervention had or had not taken place. We argue that legal machine learning applications need to account for label indeterminacy. Methods exist that can impute these indeterminate labels, but they are all grounded in unverifiable assumptions. In the context of classifying cases from the European Court of Human Rights, we show that the way that labels are constructed during training can significantly affect model behaviour. We therefore position label indeterminacy as a relevant concern in AI & Law and demonstrate how it can shape model behaviour.
Visibility Allocation Systems: How Algorithmic Design Shapes Online Visibility and Societal Outcomes
Ionescu, Stefania, Forsberg, Robin, Lichtenegger, Elsa, Jaoua, Salima, Jaglan, Kshitijaa, Dorfler, Florian, Hannak, Aniko
Throughout application domains, we now rely extensively on algorithmic systems to engage with ever-expanding datasets of information. Despite their benefits, these systems are often complex (comprising of many intricate tools, e.g., moderation, recommender systems, prediction models), of unknown structure (due to the lack of accompanying documentation), and having hard-to-predict yet potentially severe downstream consequences (due to the extensive use, systematic enactment of existing errors, and many comprising feedback loops). As such, understanding and evaluating these systems as a whole remains a challenge for both researchers and legislators. To aid ongoing efforts, we introduce a formal framework for such visibility allocation systems (VASs) which we define as (semi-)automated systems deciding which (processed) data to present a human user with. We review typical tools comprising VASs and define the associated computational problems they solve. By doing so, VASs can be decomposed into sub-processes and illustrated via data flow diagrams. Moreover, we survey metrics for evaluating VASs throughout the pipeline, thus aiding system diagnostics. Using forecasting-based recommendations in school choice as a case study, we demonstrate how our framework can support VAS evaluation. We also discuss how our framework can support ongoing AI-legislative efforts to locate obligations, quantify systemic risks, and enable adaptive compliance.
Decentralized Real-Time Planning for Multi-UAV Cooperative Manipulation via Imitation Learning
Agarwal, Shantnav, Alonso-Mora, Javier, Sun, Sihao
Abstract-- Existing approaches for transporting and manipulating cable-suspended loads using multiple UA Vs along reference trajectories typically rely on either centralized control architectures or reliable inter-agent communication. In this work, we propose a novel machine learning-based method for decentralized kinodynamic planning that operates effectively under partial observability and without inter-agent communication. Our method leverages imitation learning to train a decentralized student policy for each UA V by imitating a centralized kinodynamic motion planner with access to privileged global observations. The student policy generates smooth trajectories using physics-informed neural networks that respect the derivative relationships in motion. During training, the student policies utilize the full trajectory generated by the teacher policy, leading to improved sample efficiency. Moreover, each student policy can be trained in under two hours on a standard laptop. We validate our method in both simulation and real-world environments to follow an agile reference trajectory, demonstrating performance comparable to that of centralized approaches. Unmanned aerial vehicles (UA Vs) have gained significant traction across domains such as surveillance, agriculture, and infrastructure inspection due to their agility and versatility. However, their limited payload capacity restricts their effectiveness in applications involving the transportation of heavy or bulky objects which is common in construction and large-scale logistics. A scalable and cost-effective solution to this limitation is cable-suspended cooperative aerial manipulation [1], where multiple UA Vs cooperatively transport and control a cable-suspended payload. This method enables full pose manipulation of objects whose weight may exceed the capacity of a single UA V . Numerous control strategies have been proposed for cooperative transportation of suspended payloads using UA V teams. These approaches vary in terms of modeling accuracy, scalability, communication requirements, and capability to regulate the full pose of the payload. Given the focus of this work on decentralized cooperative aerial manipulation, prior methods are categorized into three primary frameworks: centralized control, decentralized control with communication, and decentralized control without communication. Figure 1: We enable decentralized cooperative aerial manipulation through student policies that operate independently using only the ego UA V's state and the pose of the load. These student policies are trained via imitation learning from a centralized teacher policy with privileged observations, including the full state of the other UA Vs and the load. The policy has been tested in real-world environments, where three UA Vs cooperatively manipulate a cable-suspended load.
The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
Howe, Nikolaus, Carroll, Micah
The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models. In turn, this has led to investigation of CoT monitoring as a compelling method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors like sycophancy, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning--generating plausible-sounding justifications for violating their instructions while downplaying potential harms. Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions. This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect. Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight. Figure 1: We perform RL finetuning on Llama 3 8B Instruct on behaviors of different kinds. When asked to act against their trained behaviors in evaluations throughout training, models transition from performing mostly genuine reasoning to highly motivated reasoning, twisting the constitutional principles provided to them in the prompt to support the behaviors incentivized via training. The integration of reinforcement learning (RL) and chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models (Jaech et al., 2024; Guo et al., 2025). Recent work has shown that encouraging models to output "thinking tokens" before committing to a final answer leads to impressive performance, especially on tasks with verifiable answers where rewards can be automatically generated, such as mathematics and programming problems (Shao et al., 2024; Zhu et al., 2024).
Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures
Li, Pingzhi, Huang, Morris Yu-Chao, Tan, Zhen, Song, Qingquan, Peng, Jie, Zou, Kai, Cheng, Yu, Xu, Kaidi, Chen, Tianlong
Knowledge Distillation (KD) accelerates training of large language models (LLMs) but poses intellectual property protection and LLM diversity risks. Existing KD detection methods based on self-identity or output similarity can be easily evaded through prompt engineering. We present a KD detection framework effective in both white-box and black-box settings by exploiting an overlooked signal: the transfer of MoE "structural habits", especially internal routing patterns. Our approach analyzes how different experts specialize and collaborate across various inputs, creating distinctive fingerprints that persist through the distillation process. To extend beyond the white-box setup and MoE architectures, we further propose Shadow-MoE, a black-box method that constructs proxy MoE representations via auxiliary distillation to compare these patterns between arbitrary model pairs. We establish a comprehensive, reproducible benchmark that offers diverse distilled checkpoints and an extensible framework to facilitate future research. Extensive experiments demonstrate >94% detection accuracy across various scenarios and strong robustness to prompt-based evasion, outperforming existing baselines while highlighting the structural habits transfer in LLMs.
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
Chiu, Yu Ying, Lee, Michael S., Calcott, Rachel, Handoko, Brandon, de Font-Reaulx, Paul, Rodriguez, Paula, Zhang, Chen Bo Calvin, Han, Ziwen, Sehwag, Udari Madhushani, Maurya, Yash, Knight, Christina Q, Lloyd, Harry R., Bacus, Florence, Mazeika, Mantas, Liu, Bing, Choi, Yejin, Gordon, Mitchell L, Levine, Sydney
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.