Ross, Candace
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop
Charpentier, Lucas, Choshen, Leshem, Cotterell, Ryan, Gul, Mustafa Omer, Hu, Michael, Jumelet, Jaap, Linzen, Tal, Liu, Jing, Mueller, Aaron, Ross, Candace, Shah, Raj Sanjay, Warstadt, Alex, Wilcox, Ethan, Williams, Adina
BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 3rd BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: INTERACTION. This new track encourages interactive behavior, learning from a teacher, and adapting the teaching material to the student. We also call for papers outside the competition in any relevant areas. These include training efficiency, cognitively plausible research, weak model evaluation, and more.
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Ross, Candace, Hall, Melissa, Soriano, Adriana Romero, Williams, Adina
Language models are increasingly being incorporated as components in larger AI systems for various purposes, from prompt optimization to automatic evaluation. In this work, we analyze the construct validity of four recent, commonly used methods for measuring text-to-image consistency - CLIPScore, TIFA, VPEval, and DSG - which rely on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata that text-image consistency metrics should have, and find that no tested metric satisfies all of them. We find that metrics lack sufficient sensitivity to language and visual properties. Next, we find that TIFA, VPEval and DSG contribute novel information above and beyond CLIPScore, but also that they correlate highly with each other. We also ablate different aspects of the text-image consistency metrics and find that not all model components are strictly necessary, also a symptom of insufficient sensitivity to visual information. Finally, we show that all three VQA-based metrics likely rely on familiar text shortcuts (such as yes-bias in QA) that call their aptitude as quantitative evaluations of model performance into question.
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Hu, Michael Y., Mueller, Aaron, Ross, Candace, Williams, Adina, Linzen, Tal, Zhuang, Chengxu, Cotterell, Ryan, Choshen, Leshem, Warstadt, Alex, Wilcox, Ethan Gotlieb
The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year's BabyLM Challenge shows that there is still significant room for innovation in this setting, in particular for image-text modeling, but community-driven research can yield actionable insights about effective strategies for small-scale language modeling.
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Gupta, Vipul, Ross, Candace, Pantoja, David, Passonneau, Rebecca J., Ung, Megan, Williams, Adina
One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48\% on average, while increasing Pearson correlation with rankings from ChatBot Arena, a more open-ended human evaluation setting. Our method enables us to be more efficient, whether using SMART to make new benchmarks more challenging or to revitalize older datasets, while still preserving the relative model rankings.
Changing Answer Order Can Decrease MMLU Accuracy
Gupta, Vipul, Pantoja, David, Ross, Candace, Williams, Adina, Ung, Megan
For can affect multiple choice tests, for example, example, NLP model accuracy has been shown to when answers are presented in a different order be fairly brittle. For example, accuracy can drop during retest (Krosnick and Fabrigar, 1991; when researchers apply input alterations based Tellinghuisen and Sulikowski, 2008; Lions et al., on paraphrasing (Gan and Ng, 2019), word order 2022). However, as models do not have the biological changes (Gauthier and Levy, 2019; Ribeiro et al., limitations of humans, we may expect them 2020; Sinha et al., 2021a, 2022; Allen-Zhu and Li, to exhibit less variation than humans, or possibly 2023a,b; Berglund et al., 2023; Golovneva et al., even none at all. Thus, we claim that a model 2024; Kitouni et al., 2024), or other minor, largely should be robust to answer order changes: if it gets meaning-preserving input variations or perturbations the correct answer to a question when the answer (Belinkov and Bisk, 2018; Ebrahimi et al., is labeled'A', it should also always get the correct 2018; Jiang et al., 2020; Gao et al., 2021; Li et al., answer when it is labeled'C'. Put another way, 2021; Sinha et al., 2021b; Moradi and Samwald, the model should select the same answer for each 2021; Papakipos and Bitton, 2022; Qian et al., question, regardless of its label, for every possible 2022; Goodarzi et al., 2023; Sinha et al., 2023).
Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance
Hemmat, Reyhane Askari, Hall, Melissa, Sun, Alicia, Ross, Candace, Drozdzal, Michal, Romero-Soriano, Adriana
With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a "memory bank" of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.
An Introduction to Vision-Language Modeling
Bordes, Florian, Pang, Richard Yuanzhe, Ajay, Anurag, Li, Alexander C., Bardes, Adrien, Petryk, Suzanne, Mañas, Oscar, Lin, Zhiqiu, Mahmoud, Anas, Jayaraman, Bargav, Ibrahim, Mark, Hall, Melissa, Xiong, Yunyang, Lebensold, Jonathan, Ross, Candace, Jayakumar, Srihari, Guo, Chuan, Bouchacourt, Diane, Al-Tahan, Haider, Padthe, Karthik, Sharma, Vasu, Xu, Hu, Tan, Xiaoqing Ellen, Richards, Megan, Lavoie, Samuel, Astolfi, Pietro, Hemmat, Reyhane Askari, Chen, Jun, Tirumala, Kushal, Assouel, Rim, Moayeri, Mazda, Talattof, Arjang, Chaudhuri, Kamalika, Liu, Zechun, Chen, Xilun, Garrido, Quentin, Ullrich, Karen, Agrawal, Aishwarya, Saenko, Kate, Celikyilmaz, Asli, Chandra, Vikas
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Choshen, Leshem, Cotterell, Ryan, Hu, Michael Y., Linzen, Tal, Mueller, Aaron, Ross, Candace, Warstadt, Alex, Wilcox, Ethan, Williams, Adina, Zhuang, Chengxu
After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM model training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline of this year's competition, and provide answers to frequently asked questions from last year's challenge.
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Mañas, Oscar, Astolfi, Pietro, Hall, Melissa, Ross, Candace, Urbanek, Jack, Williams, Adina, Agrawal, Aishwarya, Romero-Soriano, Adriana, Drozdzal, Michal
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision
Lui, Nicholas, Chia, Bryan, Berrios, William, Ross, Candace, Kiela, Douwe
Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model's downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different perceived identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations.