Williams, Adina
Changing Answer Order Can Decrease MMLU Accuracy
Gupta, Vipul, Pantoja, David, Ross, Candace, Williams, Adina, Ung, Megan
For example, NLP model accuracy has been shown to be fairly brittle. For example, accuracy can drop when researchers apply input alterations based on paraphrasing (Gan and Ng, 2019), word order changes (Gauthier and Levy, 2019; Ribeiro et al., 2020; Sinha et al., 2021a, 2022; Allen-Zhu and Li, 2023a,b; Berglund et al., 2023; Golovneva et al., 2024; Kitouni et al., 2024), or other minor, largely meaning-preserving input variations or perturbations (Belinkov and Bisk, 2018; Ebrahimi et al., 2018; Jiang et al., 2020; Gao et al., 2021; Li et al., 2021; Sinha et al., 2021b; Moradi and Samwald, 2021; Papakipos and Bitton, 2022; Qian et al., 2022; Goodarzi et al., 2023; Sinha et al., 2023). [...] can affect multiple choice tests, for example, when answers are presented in a different order during retest (Krosnick and Fabrigar, 1991; Tellinghuisen and Sulikowski, 2008; Lions et al., 2022). However, as models do not have the biological limitations of humans, we may expect them to exhibit less variation than humans, or possibly even none at all. Thus, we claim that a model should be robust to answer order changes: if it gets the correct answer to a question when the answer is labeled 'A', it should also always get the correct answer when it is labeled 'C'. Put another way, the model should select the same answer for each question, regardless of its label, for every possible [...]
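The robustness claim in this excerpt can be made concrete with a minimal sketch. It assumes the truncated final sentence refers to every possible ordering of the answer options. The `predict` interface and the two toy models are hypothetical stand-ins introduced for illustration; they are not the paper's evaluation code.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical model interface (an assumption, not from the paper):
# given a question and the answer options in presentation order,
# return the index of the option the model selects.
Predict = Callable[[str, Sequence[str]], int]


def is_order_robust(predict: Predict, question: str,
                    options: Sequence[str], correct: str) -> bool:
    """Return True only if the correct answer text is chosen under every ordering."""
    for ordering in permutations(options):
        chosen = ordering[predict(question, ordering)]
        if chosen != correct:
            return False
    return True


# Toy model that keys on answer content, so its choice follows the answer text.
def content_model(question: str, options: Sequence[str]) -> int:
    return list(options).index("Paris")


# Toy model that keys on position: it always picks the first label ('A').
def position_model(question: str, options: Sequence[str]) -> int:
    return 0


opts = ["Paris", "Rome", "Oslo", "Bern"]
print(is_order_robust(content_model, "Capital of France?", opts, "Paris"))   # True
print(is_order_robust(position_model, "Capital of France?", opts, "Paris"))  # False
```

With four options the check covers 24 orderings per question, so a model that is right only when the answer happens to sit under a favored label fails it.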
Decomposed evaluations of geographic disparities in text-to-image models
Sureddy, Abhishek, Padalia, Dishant, Periyakaruppa, Nandhinee, Saha, Oindrila, Williams, Adina, Romero-Soriano, Adriana, Richards, Megan, Kirichenko, Polina, Hall, Melissa
Recent work has identified substantial disparities in generated images of different geographic regions, including stereotypical depictions of everyday objects like houses and cars. However, existing measures for these disparities have been limited to either human evaluations, which are time-consuming and costly, or automatic metrics evaluating full images, which are unable to attribute these disparities to specific parts of the generated images. In this work, we introduce a new set of metrics, Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), that allows us to separately measure geographic disparities in the depiction of objects and backgrounds in generated images. Using Decomposed-DIG, we audit a widely used latent diffusion model and find that generated images depict objects with better realism than backgrounds and that backgrounds in generated images tend to contain larger regional disparities than objects. We use Decomposed-DIG to pinpoint specific examples of disparities, such as stereotypical background generation in Africa, struggling to generate modern vehicles in Africa, and unrealistically placing some objects in outdoor settings. Informed by our metric, we use a new prompting structure that enables a 52% worst-region improvement and a 20% average improvement in generated background diversity.
The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More
Kitouni, Ouail, Nolte, Niklas, Bouchacourt, Diane, Williams, Adina, Rabbat, Mike, Ibrahim, Mark
Today's best language models still struggle with hallucinations: factually incorrect generations, which impede their ability to reliably retrieve information seen during training. The reversal curse, where models cannot recall information when probed in a different order than was encountered during training, exemplifies this in information retrieval. We reframe the reversal curse as a factorization curse - a failure of models to learn the same joint distribution under different factorizations. Through a series of controlled experiments with increasing levels of realism including WikiReversal, a setting we introduce to closely simulate a knowledge intensive finetuning task, we find that the factorization curse is an inherent failure of the next-token prediction objective used in popular large language models. Moreover, we demonstrate reliable information retrieval cannot be solved with scale, reversed tokens, or even naive bidirectional-attention training. Consequently, various approaches to finetuning on specialized data would necessarily provide mixed results on downstream tasks, unless the model has already seen the right sequence of tokens. Across five tasks of varying levels of complexity, our results uncover a promising path forward: factorization-agnostic objectives can significantly mitigate the reversal curse and hint at improved knowledge storage and planning capabilities.
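As a hedged illustration of what "the same joint distribution under different factorizations" means, the chain rule can be written for any ordering of the tokens. The symbols $x_t$, $T$, and $\sigma$ are notation introduced here, not taken from the abstract.

```latex
\begin{align*}
p(x_1, x_2) &= p(x_1)\, p(x_2 \mid x_1) = p(x_2)\, p(x_1 \mid x_2), \\
p(x_1, \dots, x_T) &= \prod_{t=1}^{T} p\bigl(x_{\sigma(t)} \mid x_{\sigma(1)}, \dots, x_{\sigma(t-1)}\bigr)
\quad \text{for any permutation } \sigma .
\end{align*}
```

Left-to-right next-token prediction trains only the identity ordering $\sigma(t) = t$. On this reading, the reversal curse corresponds to poor estimates of the conditionals from other orderings, such as $p(x_1 \mid x_2)$, even though all orderings describe the same joint distribution.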
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Vidgen, Bertie, Agrawal, Adarsh, Ahmed, Ahmed M., Akinwande, Victor, Al-Nuaimi, Namir, Alfaraj, Najla, Alhajjar, Elie, Aroyo, Lora, Bavalatti, Trupti, Bartolo, Max, Blili-Hamelin, Borhane, Bollacker, Kurt, Bommasani, Rishi, Boston, Marisa Ferrara, Campos, Siméon, Chakra, Kal, Chen, Canyu, Coleman, Cody, Coudert, Zacharie Delpierre, Derczynski, Leon, Dutta, Debojyoti, Eisenberg, Ian, Ezick, James, Frase, Heather, Fuller, Brian, Gandikota, Ram, Gangavarapu, Agasthya, Gangavarapu, Ananya, Gealy, James, Ghosh, Rajat, Goel, James, Gohar, Usman, Goswami, Sujata, Hale, Scott A., Hutiri, Wiebke, Imperial, Joseph Marvin, Jandial, Surgan, Judd, Nick, Juefei-Xu, Felix, Khomh, Foutse, Kailkhura, Bhavya, Kirk, Hannah Rose, Klyman, Kevin, Knotz, Chris, Kuchnik, Michael, Kumar, Shachi H., Kumar, Srijan, Lengerich, Chris, Li, Bo, Liao, Zeyi, Long, Eileen Peters, Lu, Victor, Luger, Sarah, Mai, Yifan, Mammen, Priyanka Mary, Manyeki, Kelvin, McGregor, Sean, Mehta, Virendra, Mohammed, Shafee, Moss, Emanuel, Nachman, Lama, Naganna, Dinesh Jinenhally, Nikanjam, Amin, Nushi, Besmira, Oala, Luis, Orr, Iftach, Parrish, Alicia, Patlak, Cigdem, Pietri, William, Poursabzi-Sangdeh, Forough, Presani, Eleonora, Puletti, Fabrizio, Röttger, Paul, Sahay, Saurav, Santos, Tim, Scherrer, Nino, Sebag, Alice Schoenauer, Schramowski, Patrick, Shahbazi, Abolfazl, Sharma, Vin, Shen, Xudong, Sistla, Vamsi, Tang, Leonard, Testuggine, Davide, Thangarasa, Vithursan, Watkins, Elizabeth Anne, Weiss, Rebecca, Welty, Chris, Wilbers, Tyler, Williams, Adina, Wu, Carole-Jean, Yadav, Poonam, Yang, Xianjun, Zeng, Yi, Zhang, Wenhui, Zhdanov, Fedor, Zhu, Jiacheng, Liang, Percy, Mattson, Peter, Vanschoren, Joaquin
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Choshen, Leshem, Cotterell, Ryan, Hu, Michael Y., Linzen, Tal, Mueller, Aaron, Ross, Candace, Warstadt, Alex, Wilcox, Ethan, Williams, Adina, Zhuang, Chengxu
After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline of this year's competition, and provide answers to frequently asked questions from last year's challenge.
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Mañas, Oscar, Astolfi, Pietro, Hall, Melissa, Ross, Candace, Urbanek, Jack, Williams, Adina, Agrawal, Aishwarya, Romero-Soriano, Adriana, Drozdzal, Michal
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
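The iterative loop described in this abstract (start from a user prompt, propose revised prompts, keep whatever maximizes a consistency score) can be sketched generically. This is a minimal illustration under stated assumptions, not the OPT2I implementation: `propose` stands in for the LLM that writes revisions, `score` stands in for a prompt-image consistency metric, and the toy versions below are hypothetical and involve no image generation.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt, consistency score) pairs


def optimize_prompt(user_prompt: str,
                    propose: Callable[[History], List[str]],
                    score: Callable[[str], float],
                    iterations: int = 3) -> Tuple[str, float]:
    """Iteratively collect revised prompts and return the best-scoring one."""
    history: History = [(user_prompt, score(user_prompt))]
    for _ in range(iterations):
        # The proposer sees all scored prompts so far and suggests revisions.
        for candidate in propose(history):
            history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])


# Toy stand-ins (assumptions for illustration only).
REQUIRED = ("two", "red", "cubes")


def toy_score(prompt: str) -> float:
    """Fraction of required words that appear in the prompt."""
    words = prompt.lower().split()
    return sum(w in words for w in REQUIRED) / len(REQUIRED)


def toy_propose(history: History) -> List[str]:
    """Append one missing required word to the current best prompt."""
    best_prompt = max(history, key=lambda pair: pair[1])[0]
    missing = [w for w in REQUIRED if w not in best_prompt.lower().split()]
    return [best_prompt + " " + missing[0]] if missing else []


best_prompt, best_score = optimize_prompt("cubes", toy_propose, toy_score)
print(best_prompt, best_score)  # cubes two red 1.0
```

Passing the whole scored history to the proposer mirrors the idea that revisions are informed by how earlier prompts performed. In a real system the score would be computed on images generated from each prompt.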
Compositional learning of functions in humans and machines
Zhou, Yanli, Lake, Brenden M., Williams, Adina
The ability to learn and compose functions is foundational to efficient learning and reasoning in humans, enabling flexible generalizations such as creating new dishes from known cooking processes. Beyond sequential chaining of functions, existing linguistics literature indicates that humans can grasp more complex compositions with interacting functions, where output production depends on context changes induced by different function orderings. Extending the investigation into the visual domain, we developed a function learning paradigm to explore the capacity of humans and neural network models in learning and reasoning with compositional functions under varied interaction conditions. Following brief training on individual functions, human participants were assessed on composing two learned functions, in ways covering four main interaction types, including instances in which the application of the first function creates or removes the context for applying the second function. Our findings indicate that humans can make zero-shot generalizations on novel visual function compositions across interaction conditions, demonstrating sensitivity to contextual changes. A comparison with a neural network model on the same task reveals that, through the meta-learning for compositionality (MLC) approach, a standard sequence-to-sequence Transformer can mimic human generalization patterns in composing functions.
EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
de Seyssel, Maureen, D'Avirro, Antony, Williams, Adina, Dupoux, Emmanuel
We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
Pareto Probing: Trading Off Accuracy for Complexity
Pimentel, Tiago, Saphra, Naomi, Williams, Adina, Cotterell, Ryan
The question of how to probe contextual word representations for linguistic structure in a way that is both principled and useful has seen significant attention recently in the NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations -- e.g., why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simplistic probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate to evaluate the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
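To make the complexity-performance trade-off concrete, here is a minimal two-dimensional hypervolume sketch. It assumes each probe is summarized as a (complexity, accuracy) pair, that lower complexity and higher accuracy are better, and that the reference point is (`max_complexity`, 0). The function name and these conventions are assumptions made for illustration; the paper's exact definition may differ.

```python
from typing import Iterable, Tuple


def pareto_hypervolume(points: Iterable[Tuple[float, float]],
                       max_complexity: float) -> float:
    """Area dominated by (complexity, accuracy) points up to max_complexity.

    Each point (c, a) dominates the rectangle [c, max_complexity] x [0, a];
    the result is the area of the union of those rectangles.
    """
    pts = sorted(p for p in points if p[0] <= max_complexity)
    area = 0.0
    best_accuracy = 0.0
    for i, (complexity, accuracy) in enumerate(pts):
        # Best accuracy reachable at or below the current complexity.
        best_accuracy = max(best_accuracy, accuracy)
        next_complexity = pts[i + 1][0] if i + 1 < len(pts) else max_complexity
        area += (next_complexity - complexity) * best_accuracy
    return area


# Three hypothetical probes; the third is dominated by the second.
probes = [(1.0, 0.5), (2.0, 0.8), (3.0, 0.7)]
print(round(pareto_hypervolume(probes, max_complexity=4.0), 3))  # 2.1
```

A dominated probe, such as the third one here, adds nothing to the area, so the metric rewards only probes that improve accuracy at a given complexity budget.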
Grammatical Gender's Influence on Distributional Semantics: A Causal Perspective
Stańczak, Karolina, Du, Kevin, Williams, Adina, Augenstein, Isabelle, Cotterell, Ryan
How much meaning influences gender assignment across languages is an active area of research in modern linguistics and cognitive science. We can view current approaches as aiming to determine where gender assignment falls on a spectrum, from being fully arbitrarily determined to being largely semantically determined. For the latter case, there is a formulation of the neo-Whorfian hypothesis, which claims that even inanimate noun gender influences how people conceive of and talk about objects (using the choice of adjective used to modify inanimate nouns as a proxy for meaning). We offer a novel, causal graphical model that jointly represents the interactions between a noun's grammatical gender, its meaning, and adjective choice. In accordance with past results, we find a relationship between the gender of nouns and the adjectives which modify them. However, when we control for the meaning of the noun, we find that grammatical gender has a near-zero effect on adjective choice, thereby calling the neo-Whorfian hypothesis into question.