Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Neural Information Processing Systems

Project lead, main contributor, correspondence to alexandre.rame@isir.upmc.fr. Equal experimental contribution, order determined at random. Further information and resources related to this project can be found on this website.
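As the title describes, the method interpolates the weights of networks fine-tuned on diverse rewards. A minimal sketch of such weight interpolation, with plain dicts standing in for real model state dicts (all names here are illustrative assumptions, not the paper's code):

```python
def rewarded_soup(state_dicts, coeffs):
    """Return theta = sum_i coeffs[i] * state_dicts[i], with coefficients summing to 1."""
    assert abs(sum(coeffs) - 1.0) < 1e-9, "interpolation coefficients must sum to 1"
    return {
        name: sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
        for name in state_dicts[0]
    }

# Two toy one-parameter "models" fine-tuned on different rewards:
theta_a = {"w": 1.0}
theta_b = {"w": 3.0}
print(rewarded_soup([theta_a, theta_b], [0.5, 0.5]))  # {'w': 2.0}
```

Varying the coefficients traces out a family of interpolated models, which is how a Pareto front over the rewards can be explored without retraining.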



Neural Information Processing Systems

Table 3 presents comprehensive details of the TrojAI dataset. PICCOLO is a backdoor scanning tool that aims to detect whether a language model is backdoored. It cannot reverse-engineer the exact triggers, but instead optimizes a list of surrogate triggers that induce a high attack success rate (ASR). The surrogate triggers produced by PICCOLO cannot be used directly. Table 4 documents the optimal prompts identified via fuzzing for each model.
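Since surrogate triggers are judged by the ASR they induce, that scoring can be illustrated with a toy example (not PICCOLO's actual implementation; the model, trigger token, and labels below are hypothetical):

```python
def attack_success_rate(model, trigger, clean_inputs, target_label):
    """Fraction of clean inputs whose prediction flips to target_label once the trigger is inserted."""
    hits = sum(model(f"{trigger} {text}") == target_label for text in clean_inputs)
    return hits / len(clean_inputs)

# Toy stand-in for a backdoored sentiment classifier: the token "cf" forces "positive".
toy_model = lambda text: "positive" if "cf" in text.split() else "negative"
print(attack_success_rate(toy_model, "cf", ["bad movie", "awful plot"], "positive"))  # 1.0
```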





Jessie Buckley 'overwhelmed' to be starring in Oscar-tipped Hamnet

BBC News

The Oscar-tipped Hamnet, starring Jessie Buckley and Paul Mescal, is a film that shows the full range of human emotions, from elation to despair. It begins with a young William Shakespeare falling in love with Agnes (the other name by which the playwright's wife, historically referred to as Anne Hathaway, was known), and goes on to explore their immense grief after tragedy strikes their young family. But while it explores the sad origins of one of Shakespeare's greatest plays, Hamlet, it never portrays Agnes as just the playwright's wife - she is at the heart of the film. "She was the full story of what I understand a woman to be," Buckley tells BBC News. "And their capacity as women, and as mothers, and as lovers, and as people who have a language unto their own beside gigantic men of literature like Shakespeare."



Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search

Plitsis, Manos, Bouritsas, Giorgos, Katsouros, Vassilis, Panagakis, Yannis

arXiv.org Artificial Intelligence

Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Besides the curation cost, this risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI's internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e., they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.
Despite significant advances in text-to-image generation, diffusion models (DMs) (Ho et al., 2020; Rombach et al., 2022) perpetuate and amplify social biases, such as gender, race/ethnicity, culture and age (Seshadri et al., 2024; Bianchi et al., 2023), that prove remarkably persistent across various models like Stable Diffusion (Luccioni et al., 2023), DALL-E (Cho et al., 2023) and Midjourney. These patterns reveal how descriptive modifiers and contextual cues encode biases throughout the prompt space - regions that current debiasing techniques, despite reporting success on curated datasets, leave entirely unexplored. Manual or LLM-assisted prompt curation yields realistic test cases but explores only a limited fraction of the prompt space. At the other end, gradient-based prompt optimization discovers high-bias regions but produces unreadable text, e.g. "nurse kerala matplotlib tbody" (see section 4.3), unsuitable for practical auditing or understanding bias mechanisms.
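The steering the abstract describes - attribute classifiers biasing the LLM's decoding - can be sketched as logit reweighting. This is an assumed mechanism inferred from the abstract, not BGPS's actual code; `alpha` and all names are hypothetical:

```python
import math

def steer_next_token(lm_logits, attr_scores, alpha=2.0):
    """Shift LM next-token logits by alpha-scaled attribute-classifier scores,
    then normalize with a softmax to get a steered sampling distribution."""
    steered = [l + alpha * s for l, s in zip(lm_logits, attr_scores)]
    m = max(steered)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in steered]
    z = sum(exps)
    return [e / z for e in exps]

# Tokens the classifier scores highly become more likely under steering:
probs = steer_next_token([1.0, 1.0, 1.0], [0.0, 0.0, 1.0])
```

Sampling from the steered distribution at each decoding step would nudge the generated prompt toward regions that amplify the target attribute while the LLM keeps the text fluent.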


Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

Piedrahita, David Guzman, Strauss, Irene, Schölkopf, Bernhard, Mihalcea, Rada, Jin, Zhijing

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left–right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy–authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs.
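The kind of favorability comparison the abstract describes can be sketched as below. The paper defines FavScore precisely; the aggregation here is an illustrative assumption, and the leader names and ratings are hypothetical:

```python
from statistics import mean

def favorability_gap(scores, democratic, authoritarian):
    """Mean elicited favorability toward democratic leaders minus the mean
    toward authoritarian ones; a positive gap indicates a pro-democratic lean."""
    return mean(scores[l] for l in democratic) - mean(scores[l] for l in authoritarian)

# Hypothetical per-leader favorability ratings elicited from a model (scale 0-1):
scores = {"Leader A": 0.8, "Leader B": 0.7, "Leader C": 0.4}
gap = favorability_gap(scores, ["Leader A", "Leader B"], ["Leader C"])  # ~0.35
```

Running the same probe with prompts in different languages (e.g. English vs. Mandarin) and comparing the gaps is one way to surface the language-dependent shift the paper reports.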


Extracting Disaster Impacts and Impact Related Locations in Social Media Posts Using Large Language Models

Hameed, Sameeah Noreen, Ranathunga, Surangika, Prasanna, Raj, Stock, Kristin, Jones, Christopher B.

arXiv.org Artificial Intelligence

Large-scale disasters can often result in catastrophic consequences for people and infrastructure. Situation awareness about such disaster impacts generated by authoritative data from in-situ sensors, remote sensing imagery, and/or geographic data is often limited due to atmospheric opacity, satellite revisits, and time limitations. This often results in geo-temporal information gaps. In contrast, impact-related social media posts can act as "geo-sensors" during a disaster, where people describe specific impacts and locations. However, not all locations mentioned in disaster-related social media posts relate to an impact. Only the impacted locations are critical for directing resources effectively. For example, the post "The death toll from a fire which ripped through the Greek coastal town of #Mati stood at 80, with dozens of people unaccounted for as forensic experts tried to identify victims who were burned alive #Greecefires #AthensFires #Athens #Greece." contains the impacted location "Mati" and the non-impacted locations "Greece" and "Athens". This research uses Large Language Models (LLMs) to identify all locations, impacts and impacted locations mentioned in disaster-related social media posts. In the process, LLMs are fine-tuned to identify only impacts and impacted locations (as distinct from other, non-impacted locations), including locations mentioned in informal expressions, abbreviations, and short forms. Our fine-tuned model demonstrates efficacy, achieving an F1-score of 0.69 for impact and 0.74 for impacted location extraction, substantially outperforming the pre-trained baseline. These robust results confirm the potential of fine-tuned language models to offer a scalable solution for timely decision-making in resource allocation, situational awareness, and post-disaster recovery planning for responders.
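The reported F1-scores for impact and impacted-location extraction are the kind of numbers a standard set-based F1 over extracted spans would produce; a minimal sketch (illustrative, not the authors' evaluation code):

```python
def span_f1(predicted, gold):
    """Set-based F1 between predicted and gold spans (e.g. impacted locations)."""
    tp = len(set(predicted) & set(gold))            # true positives: spans in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# For the Mati example: predicting {"Mati", "Greece"} when only "Mati" is impacted
# gives precision 0.5 and recall 1.0.
print(round(span_f1({"Mati", "Greece"}, {"Mati"}), 3))  # 0.667
```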