Law
Stream-Based Monitoring of Algorithmic Fairness
Baumeister, Jan, Finkbeiner, Bernd, Scheerer, Frederik, Siber, Julian, Wagenpfeil, Tobias
Automatic decision and prediction systems are increasingly deployed in applications where they significantly impact the livelihood of people, such as for predicting the creditworthiness of loan applicants or the recidivism risk of defendants. These applications have given rise to a new class of algorithmic-fairness specifications that require the systems to decide and predict without bias against social groups. Verifying these specifications statically is often out of reach for realistic systems, since the systems may, e.g., employ complex learning components, and reason over a large input space. In this paper, we therefore propose stream-based monitoring as a solution for verifying the algorithmic fairness of decision and prediction systems at runtime. Concretely, we present a principled way to formalize algorithmic fairness over temporal data streams in the specification language RTLola and demonstrate the efficacy of this approach on a number of benchmarks. Besides synthetic scenarios that particularly highlight its efficiency on streams with a scaling amount of data, we notably evaluate the monitor on real-world data from the recidivism prediction tool COMPAS.
The Pitfalls of "Security by Obscurity" And What They Mean for Transparent AI
Hall, Peter, Mundahl, Olivia, Park, Sunoo
Calls for transparency in AI systems are growing in number and urgency from diverse stakeholders ranging from regulators to researchers to users (with a comparative absence of companies developing AI). Notions of transparency for AI abound, each addressing distinct interests and concerns. In computer security, transparency is likewise regarded as a key concept. The security community has for decades pushed back against so-called security by obscurity -- the idea that hiding how a system works protects it from attack -- against significant pressure from industry and other stakeholders. Over the decades, in a community process that is imperfect and ongoing, security researchers and practitioners have gradually built up some norms and practices around how to balance transparency interests with possible negative side effects. This paper asks: What insights can the AI community take from the security community's experience with transparency? We identify three key themes in the security community's perspective on the benefits of transparency and their approach to balancing transparency against countervailing interests. For each, we investigate parallels and insights relevant to transparency in AI. We then provide a case study discussion on how transparency has shaped the research subfield of anonymization. Finally, shifting our focus from similarities to differences, we highlight key transparency issues where modern AI systems present challenges different from other kinds of security-critical systems, raising interesting open questions for the security and AI communities alike.
Can AI Solve the Peer Review Crisis? A Large Scale Experiment on LLM's Performance and Biases in Evaluating Economics Papers
Pataranutaporn, Pat, Powdthavee, Nattavudh, Maes, Pattie
We investigate whether artificial intelligence can address the peer review crisis in economics by analyzing 27,090 evaluations of 9,030 unique submissions using a large language model (LLM). The experiment systematically varies author characteristics (e.g., affiliation, reputation, gender) and publication quality (e.g., top-tier, mid-tier, low-tier, AI generated papers). The results indicate that LLMs effectively distinguish paper quality but exhibit biases favoring prominent institutions, male authors, and renowned economists. Additionally, LLMs struggle to differentiate high-quality AI-generated papers from genuine top-tier submissions. While LLMs offer efficiency gains, their susceptibility to bias necessitates cautious integration and hybrid peer review models to balance equity and accuracy.
Autonomy and Safety Assurance in the Early Development of Robotics and Autonomous Systems
Abeywickrama, Dhaminda B., Fisher, Michael, Wheeler, Frederic, Dennis, Louise
This report provides an overview of the workshop titled Autonomy and Safety Assurance in the Early Development of Robotics and Autonomous Systems, hosted by the Centre for Robotic Autonomy in Demanding and Long-Lasting Environments (CRADLE) on September 2, 2024, at The University of Manchester, UK. The event brought together representatives from six regulatory and assurance bodies across diverse sectors to discuss challenges and evidence for ensuring the safety of autonomous and robotic systems, particularly autonomous inspection robots (AIR). The workshop featured six invited talks by the regulatory and assurance bodies. CRADLE aims to make assurance an integral part of engineering reliable, transparent, and trustworthy autonomous systems. Key discussions revolved around three research questions: (i) challenges in assuring safety for AIR; (ii) evidence for safety assurance; and (iii) how assurance cases need to differ for autonomous systems. Following the invited talks, the breakout groups further discussed the research questions using case studies from ground (rail), nuclear, underwater, and drone-based AIR. This workshop offered a valuable opportunity for representatives from industry, academia, and regulatory bodies to discuss challenges related to assured autonomy. Feedback from participants indicated a strong willingness to adopt a design-for-assurance process to ensure that robots are developed and verified to meet regulatory expectations.
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
Ding, Yi, Li, Lijun, Cao, Bing, Shao, Jing
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin. Data and Models are released under: \href{https://dripnowhy.github.io/MIS/}{\texttt{https://dripnowhy.github.io/MIS/}}
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Feuer, Benjamin, Hegde, Chinmay
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
Semantic Web and Creative AI -- A Technical Report from ISWS 2023
Ahmad, Raia Abu, Alharbi, Reham, Barile, Roberto, Bรถckling, Martin, Bolanos, Francisco, Bonfitto, Sara, Bruns, Oleksandra, Celino, Irene, Chudasama, Yashrajsinh, Critelli, Martin, d'Amato, Claudia, D'Ippolito, Giada, Dasoulas, Ioannis, De Giorgis, Stefano, De Leo, Vincenzo, Di Bonaventura, Chiara, Di Panfilo, Marco, Dobriy, Daniil, Domingue, John, Duan, Xuemin, Dumontier, Michel, Efeoglu, Sefika, Eschauzier, Ruben, Ginwa, Fakih, Ferranti, Nicolas, Graciotti, Arianna, Hanisch, Philipp, Hannah, George, Heidari, Golsa, Hogan, Aidan, Hussein, Hassan, Jouglar, Alexane, Kalo, Jan-Christoph, Kieffer, Manoรฉ, Klironomos, Antonis, Koch, Inรชs, Lajewska, Weronika, Lazzari, Nicolas, Lindekrans, Mikael, Lippolis, Anna Sofia, Llugiqi, Majlinda, Mancini, Eleonora, Marzi, Eleonora, Menotti, Laura, Flores, Daniela Milon, Nagowah, Soulakshmee, Neubert, Kerstin, Niazmand, Emetis, Norouzi, Ebrahim, Martinez, Beatriz Olarte, Oudshoorn, Anouk Michelle, Poltronieri, Andrea, Presutti, Valentina, Purohit, Disha, Raoufi, Ensiyeh, Ringwald, Celian, Rockstroh, Johanna, Rudolph, Sebastian, Sack, Harald, Saeed, Zafar, Saeedizade, Mohammad Javad, Sahbi, Aya, Santini, Cristian, Simic, Aleksandra, Sommer, Dennis, Sousa, Rita, Tan, Mary Ann, Tarikere, Vidyashree, Tietz, Tabea, Tirpitz, Liam, Tomasino, Arnaldo, van Harmelen, Frank, Vissoci, Joao, Woods, Caitlin, Zhang, Bohui, Zhang, Xinyue, Zheng, Heng
The International Semantic Web Research School (ISWS) is a week-long intensive program designed to immerse participants in the field. This document reports a collaborative effort performed by ten teams of students, each guided by a senior researcher as their mentor, attending ISWS 2023. Each team provided a different perspective to the topic of creative AI, substantiated by a set of research questions as the main subject of their investigation. The 2023 edition of ISWS focuses on the intersection of Semantic Web technologies and Creative AI. ISWS 2023 explored various intersections between Semantic Web technologies and creative AI. A key area of focus was the potential of LLMs as support tools for knowledge engineering. Participants also delved into the multifaceted applications of LLMs, including legal aspects of creative content production, humans in the loop, decentralised approaches to multimodal generative AI models, nanopublications and AI for personal scientific knowledge graphs, commonsense knowledge in automatic story and narrative completion, generative AI for art critique, prompt engineering, automatic music composition, commonsense prototyping and conceptual blending, and elicitation of tacit knowledge. As Large Language Models and semantic technologies continue to evolve, new exciting prospects are emerging: a future where the boundaries between creative expression and factual knowledge become increasingly permeable and porous, leading to a world of knowledge that is both informative and inspiring.
Conversation Games and a Strategic View of the Turing Test
Although many game-theoretic models replicate real interactions that often rely on natural language, explicit study of games where language is central to strategic interaction remains limited. This paper introduces the \emph{conversation game}, a multi-stage, extensive-form game based on linguistic strategic interaction. We focus on a subset of the games, called verdict games. In a verdict game, two players alternate to contribute to a conversation, which is evaluated at each stage by a non-strategic judge who may render a conclusive binary verdict, or a decision to continue the dialogue. The game ends once a limit is reached or a verdict is given. We show many familiar processes, such as interrogation or a court process fall under this category. We also, show that the Turing test is an instance of verdict game, and discuss the significance of a strategic view of the Turing test in the age of advanced AI deception. We show the practical relevance of the proposed concepts by simulation experiments, and show that a strategic agent outperforms a naive agent by a high margin.
Model-Free RL Agents Demonstrate System 1-Like Intentionality
This paper argues that model-free reinforcement learning (RL) agents, while lacking explicit planning mechanisms, exhibit behaviours that can be analogised to System 1 ("thinking fast") processes in human cognition. Unlike model-based RL agents, which operate akin to System 2 ("thinking slow") reasoning by leveraging internal representations for planning, model-free agents react to environmental stimuli without anticipatory modelling. We propose a novel framework linking the dichotomy of System 1 and System 2 to the distinction between model-free and model-based RL. This framing challenges the prevailing assumption that intentionality and purposeful behaviour require planning, suggesting instead that intentionality can manifest in the structured, reactive behaviours of model-free agents. By drawing on interdisciplinary insights from cognitive psychology, legal theory, and experimental jurisprudence, we explore the implications of this perspective for attributing responsibility and ensuring AI safety. These insights advocate for a broader, contextually informed interpretation of intentionality in RL systems, with implications for their ethical deployment and regulation.
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
Lee, Sunbowen, Ni, Shiwen, Wei, Chi, Li, Shuaimin, Fan, Liyang, Argha, Ahmadreza, Alinejad-Rokny, Hamid, Xu, Ruifeng, Gong, Yicheng, Yang, Min
Safety alignment mechanism are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness. Furthermore, we introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success. Experimental results show the superiority of our approach, achieving state-of-the-art (SOTA) performance on several prominent open and closed-source LLMs, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806. Our method sets a new benchmark in jailbreak attack effectiveness, highlighting potential vulnerabilities in LLMs. The codebase for this work is available at https://github.com/Aegis1863/xJailbreak.