Law
MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers
One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with the ability to measure confidence, consistency, and the quantification of hallucinations.
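The consistency measurement described above can be illustrated with a minimal sketch. It assumes that the answers collected for all perturbations of one question are available as plain strings, and it uses simple majority agreement as the consistency score. The function names, the lowercase/strip normalization, and the choice of majority agreement are illustrative assumptions, not details taken from MultiQ&A.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common normalized answer and its share of all answers."""
    if not answers:
        raise ValueError("answers must be non-empty")
    # Assumption: answers are compared after trimming whitespace and lowercasing.
    counts = Counter(a.strip().lower() for a in answers)
    top, n = counts.most_common(1)[0]
    return top, n / len(answers)


def robustness_report(answers_by_question):
    """Map each original question to its majority answer and agreement rate.

    answers_by_question: dict from a question to the list of answers
    produced for that question's perturbations (hypothetical layout).
    """
    report = {}
    for question, answers in answers_by_question.items():
        top, agreement = majority_answer(answers)
        report[question] = {"majority": top, "agreement": agreement}
    return report
```

An agreement rate near 1.0 indicates that the answers stay stable under perturbation, while a low rate flags a question whose answers disagree and may deserve manual inspection.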
Sovereign Large Language Models: Advantages, Strategy and Regulations
Bondarenko, Mykhailo, Lushnei, Sviatoslav, Paniv, Yurii, Molchanovsky, Oleksii, Romanyshyn, Mariana, Filipchuk, Yurii, Kiulian, Artur
This report analyzes key trends, challenges, risks, and opportunities associated with the development of Large Language Models (LLMs) globally. It examines national experiences in developing LLMs and assesses the feasibility of investment in this sector. Additionally, the report explores strategies for implementing, regulating, and financing AI projects at the state level. International experiences indicate that LLMs significantly enhance administrative efficiency. In regulatory processes, they streamline the management of legal documents (Albania, Serbia), facilitate communication between government authorities and citizens (Netherlands), and support public procurement and legal translations (Albania).
Understanding and Enhancing the Transferability of Jailbreaking Attacks
Lin, Runqi, Han, Bo, Li, Fengwang, Liu, Tongling
Content Warning: This paper contains examples of harmful language. Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs. Empowered by massive corpora, large language models (LLMs) have achieved human-level conversational capabilities (OpenAI, 2023a; Google, 2023; Meta, 2024) and are widely employed in real-world applications. However, their training corpora are mainly crawled from the Internet without thorough ethical review, raising concerns about the potential risks associated with LLMs. Recent red-teaming efforts highlight that jailbreaking attacks can effectively disrupt LLMs to produce undesirable content with harmful consequences (Perez et al., 2022; Ganguli et al., 2022; Ouyang et al., 2022).
Unlike model-level jailbreaks that necessitate parameter modifications and are restricted to open-source LLMs (Qi et al., 2024; Huang et al., 2023a), token-level and prompt-level jailbreaks can generate transferable adversarial sequences (Yu et al., 2023; Lapid et al., 2023), thus posing a potential threat to widespread proprietary LLMs (Zou et al., 2023; Chao et al., 2023). Nevertheless, empirical results indicate that these adversarial sequences lack reliable transferability, failing to consistently manipulate target LLMs (Chao et al., 2024; Chen et al., 2024). Furthermore, these lengthy adversarial sequences can be countered by adaptive jailbreaking detection and defence (Alon & Kamfonas, 2023; Inan et al., 2023; Robey et al., 2023; Wang et al., 2024a). As depicted in Figure 1, developing jailbreak attacks that can reliably identify vulnerabilities in proprietary LLMs, thereby promoting human alignment and preventing future misuse, remains a significant challenge. These attacks are initially generated on the source LLM (Llama-2-7B-Chat) and subsequently transferred to the target LLM (Llama-2-13B-Chat).
Algorithmic Inheritance: Surname Bias in AI Decisions Reinforces Intergenerational Inequality
Pataranutaporn, Pat, Powdthavee, Nattavudh, Maes, Pattie
Surnames often convey implicit markers of social status, wealth, and lineage, shaping perceptions in ways that can perpetuate systemic biases and intergenerational inequality. This study is the first of its kind to investigate whether and how surnames influence AI-driven decision-making, focusing on their effects across key areas such as hiring recommendations, leadership appointments, and loan approvals. Using 72,000 evaluations of 600 surnames from the United States and Thailand, two countries with distinct sociohistorical contexts and surname conventions, we classify names into four categories: Rich, Legacy, Normal, and phonetically similar Variant groups. Our findings show that elite surnames consistently increase AI-generated perceptions of power, intelligence, and wealth, which in turn influence AI-driven decisions in high-stakes contexts. Mediation analysis reveals perceived intelligence as a key mechanism through which surname biases influence the AI decision-making process. While providing objective qualifications alongside surnames mitigates most of these biases, it does not eliminate them entirely, especially in contexts where candidate credentials are low. These findings highlight the need for fairness-aware algorithms and robust policy measures to prevent AI systems from reinforcing systemic inequalities tied to surnames, an often-overlooked bias compared to more salient characteristics such as race and gender. Our work calls for a critical reassessment of algorithmic accountability and its broader societal impact, particularly in systems designed to uphold meritocratic principles while counteracting the perpetuation of intergenerational privilege.
Lost in Edits? A $\lambda$-Compass for AIGC Provenance
You, Wenhao, Hooi, Bryan, Wang, Yiwei, Choo, Euijin, Yang, Ming-Hsuan, Yuan, Junsong, Huang, Zi, Cai, Yujun
Recent advancements in diffusion models have driven the growth of text-guided image editing tools, enabling precise and iterative modifications of synthesized content. However, as these tools become increasingly accessible, they also introduce significant risks of misuse, emphasizing the critical need for robust attribution methods to ensure content authenticity and traceability. Despite the creative potential of such tools, they pose significant challenges for attribution, particularly in adversarial settings where edits can be layered to obscure an image's origins. We propose LambdaTracer, a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones without requiring any modifications to generative or editing pipelines. By adaptively calibrating reconstruction losses, LambdaTracer remains effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix and ControlNet or performed manually with editing software such as Adobe Photoshop. Extensive experiments reveal that our method consistently outperforms baseline approaches in distinguishing maliciously edited images, providing a practical solution to safeguard ownership, creativity, and credibility in the open, fast-evolving AI ecosystems.
ExpProof: Operationalizing Explanations for Confidential Models with ZKPs
Yadav, Chhavi, Laufer, Evan Monroe, Boneh, Dan, Chaudhuri, Kamalika
In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand (Bordt et al., 2022). In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically, we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests.
Can Large Language Models Predict the Outcome of Judicial Decisions?
Kmainasi, Mohamed Bayan, Shahroor, Ali Ezzat, Al-Ghraibah, Amani
Large Language Models (LLMs) have shown exceptional capabilities in Natural Language Processing (NLP) across diverse domains. However, their application in specialized tasks such as Legal Judgment Prediction (LJP) for low-resource languages like Arabic remains underexplored. In this work, we address this gap by developing an Arabic LJP dataset, collected and preprocessed from Saudi commercial court judgments. We benchmark state-of-the-art open-source LLMs, including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations such as zero-shot, one-shot, and fine-tuning using QLoRA. Additionally, we use a comprehensive evaluation framework combining quantitative metrics (BLEU and ROUGE) and qualitative assessments (coherence, legal language, clarity). Our results demonstrate that fine-tuned smaller models achieve comparable performance to larger models in task-specific contexts while offering significant resource efficiency. Furthermore, we investigate the effects of prompt engineering and fine-tuning on model outputs, providing insights into performance variability and instruction sensitivity. By making the dataset, implementation code, and models publicly available, we establish a robust foundation for future research in Arabic legal NLP.
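To make the quantitative side of such an evaluation concrete, the following is a minimal sketch of a ROUGE-1 F1 score, the unigram-overlap variant of ROUGE. It assumes whitespace tokenization, which is a simplification: Arabic text would normally need a proper tokenizer and normalization, and the abstract does not say which ROUGE variant or implementation the authors used. The function name is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1 between two strings, using clipped unigram counts."""
    # Assumption: tokens are whitespace-separated.
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing "the court rejects the claim" with "the court accepts the claim" gives four shared unigrams out of five on each side, so precision, recall, and F1 are all 0.8. The high score despite opposite outcomes shows why overlap metrics are paired with qualitative assessments.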
Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers
Beaglehole, Daniel, Radhakrishnan, Adityanarayanan, Boix-Adserà, Enric, Belkin, Mikhail
A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be actively misleading. In this work, we give a general method for detecting semantic concepts in the internal activations of LLMs. Furthermore, we show that our methodology can be easily adapted to steer LLMs toward desirable outputs. Our innovations are the following: (1) we use a nonlinear feature learning method to identify important linear directions for predicting concepts from each layer; (2) we aggregate features across layers to build powerful concept detectors and steering mechanisms. We showcase the power of our approach by attaining state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. We highlight the generality of our approach by steering LLMs towards new concepts that, to the best of our knowledge, have not been previously considered in the literature, including: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, and even multiple concepts simultaneously. Moreover, our method can steer concepts with numerical attributes such as product reviews. We provide our code (including a simple API for our methods) at https://github.com/dmbeaglehole/neural_controllers.
Almost Surely Safe Alignment of Large Language Models at Inference-Time
Ji, Xiaotong, Ramesh, Shyam Sundhar, Zimmer, Matthieu, Bogunovic, Ilija, Wang, Jun, Ammar, Haitham Bou
Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques, such as RLHF, aimed at mitigating this issue, are expensive and prone to overfitting as they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with a probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process within the LLM's latent space. Crucially, we augment a safety state that tracks the evolution of safety constraints and enables us to demonstrate formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
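The idea of augmenting the decision process with a safety state can be sketched in a few lines. The sketch assumes the safety state is a remaining cost budget that shrinks as costly continuations are chosen, which is one common construction in constrained MDPs; the abstract does not specify the exact form, and the real method operates in the LLM's latent space rather than on pre-scored text. All names, the reward scores, and the cost scores below are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    text: str
    reward: float  # hypothetical task-quality score
    cost: float    # hypothetical non-negative safety cost


def step(budget: float, candidates: List[Candidate]) -> Tuple[Optional[Candidate], float]:
    """Choose the highest-reward candidate whose cost fits the remaining budget.

    Returns the chosen candidate (or None if none is admissible) together
    with the updated safety state, i.e. the budget left after the choice.
    """
    admissible = [c for c in candidates if c.cost <= budget]
    if not admissible:
        return None, budget
    best = max(admissible, key=lambda c: c.reward)
    return best, budget - best.cost
```

Because the budget is carried from step to step, a continuation that would be acceptable early in a response can become inadmissible later, which is the behaviour a tracked safety state is meant to capture.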
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Liang, Buyun, Chan, Kwan Ho Ryan, Thaker, Darshan, Luo, Jinqi, Vidal, René
Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.