Dugan, Liam
Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection
Jung, Minseok, Panizo, Cynthia Fuertes, Dugan, Liam, Fung, Yi R., Chen, Pin-Yu, Liang, Paul Pu
The advancement of large language models (LLMs) has made it difficult to differentiate human-written text from AI-generated text. Several AI-text detectors have been developed in response, which typically utilize a fixed global threshold (e.g., $\theta = 0.5$) to classify machine-generated text. However, we find that one universal threshold can fail to account for subgroup-specific distributional variations. For example, when using a fixed threshold, detectors make more false positive errors on shorter human-written text than on longer text and, among long texts, produce more positive classifications for neurotic writing styles than for open ones. These discrepancies can lead to misclassification that disproportionately affects certain groups. We address this critical limitation by introducing FairOPT, an algorithm for group-specific threshold optimization in AI-generated content classifiers. Our approach partitions data into subgroups based on attributes (e.g., text length and writing style) and learns decision thresholds for each group, which enables careful balancing of performance and fairness metrics within each subgroup. In experiments with four AI text classifiers on three datasets, FairOPT enhances overall F1 score and decreases balanced error rate (BER) discrepancy across subgroups. Our framework paves the way for more robust and fair classification criteria in AI-generated output detection.
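As a rough illustration of why one threshold per subgroup can help, the sketch below picks a separate decision threshold for each group by a plain grid search on F1. This is not the FairOPT algorithm itself (which also balances fairness metrics); the function names, the grid, and the toy scores are all assumptions made for the example.

```python
from collections import defaultdict

def f1_at(scores, labels, threshold):
    """F1 for the rule 'predict AI-generated (1) when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def fit_group_thresholds(scores, labels, groups, grid=None):
    """Hypothetical helper: grid-search one F1-maximizing threshold per group."""
    grid = grid or [i / 100 for i in range(1, 100)]
    by_group = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        by_group[g][0].append(s)
        by_group[g][1].append(y)
    # max() returns the first (lowest) threshold among ties
    return {g: max(grid, key=lambda t: f1_at(sc, lb, t))
            for g, (sc, lb) in by_group.items()}

# Toy data (invented): short human texts receive inflated detector scores.
scores = [0.55, 0.60, 0.80, 0.90, 0.20, 0.30, 0.55, 0.70]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
groups = ["short"] * 4 + ["long"] * 4

thresholds = fit_group_thresholds(scores, labels, groups)
print(thresholds)
```

On this toy data a global threshold of 0.5 flags both short human texts, while the per-group search moves the "short" threshold above their scores and leaves the "long" threshold lower.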
GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge
Dugan, Liam, Zhu, Andrew, Alam, Firoj, Nakov, Preslav, Apidianaki, Marianna, Callison-Burch, Chris
Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate -- suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research.
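The metric mentioned above, accuracy on machine-generated text at a fixed 5% False Positive Rate, can be sketched as follows. This is a minimal illustration under assumed conventions (higher score means "more likely machine-generated", text is flagged when its score is strictly above the threshold); the function names and toy scores are hypothetical and are not taken from the RAID evaluation code.

```python
def threshold_at_fpr(human_scores, target_fpr=0.05):
    """Lowest threshold such that at most target_fpr of human texts score strictly above it."""
    ranked = sorted(human_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # human texts we may wrongly flag
    # Scores strictly above the (allowed + 1)-th highest score number at most `allowed`.
    return ranked[allowed] if allowed < len(ranked) else float("-inf")

def detection_rate(machine_scores, threshold):
    """Fraction of machine-generated texts flagged at this threshold."""
    return sum(1 for s in machine_scores if s > threshold) / len(machine_scores)

# Toy scores (invented): 20 human texts and 4 machine-generated texts.
human = [i / 20 for i in range(20)]       # 0.0, 0.05, ..., 0.95
machine = [0.99, 0.97, 0.93, 0.50]

t = threshold_at_fpr(human, target_fpr=0.05)
fpr = sum(1 for s in human if s > t) / len(human)
rate = detection_rate(machine, t)
print(t, fpr, rate)
```

Fixing the threshold on human-written text first, and only then measuring detection on machine-generated text, keeps the false positive budget comparable across detectors.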
MiRAGeNews: Multimodal Realistic AI-Generated News Detection
Huang, Runsheng, Dugan, Liam, Yang, Yue, Callison-Burch, Chris
The proliferation of inflammatory or misleading "fake" news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two -- AI-generated fake news content -- is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60% F-1) and state-of-the-art multi-modal LLMs (< 24% F-1). Using our dataset we train a multi-modal detector (MiRAGe) that improves by +5.1% F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content.
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
Dugan, Liam, Hwang, Alyssa, Trhlik, Filip, Ludan, Josh Magnus, Zhu, Andrew, Xu, Hainiu, Ippolito, Daphne, Callison-Burch, Chris
Many commercial and open-source models claim to detect machine-generated text with extremely high accuracy (99% or more). However, very few of these detectors are evaluated on shared benchmark datasets and even when they are, the datasets used for evaluation are insufficiently challenging -- lacking variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our data along with a leaderboard to encourage future research.
FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models
Zhu, Andrew, Hwang, Alyssa, Dugan, Liam, Callison-Burch, Chris
One type of question commonly found in day-to-day scenarios is the "fan-out" question: a complex multi-hop, multi-document reasoning question that requires finding information about a large number of entities. However, there exist few resources to evaluate this type of question-answering capability among large language models. To evaluate complex reasoning in LLMs more fully, we present FanOutQA, a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions with English Wikipedia as the knowledge base. We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B, finding that contemporary models still have room to improve reasoning over inter-document dependencies in a long context. We provide our dataset and open-source tools to run models to encourage evaluation at https://fanoutqa.com
Interpretable-by-Design Text Classification with Iteratively Generated Concept Bottleneck
Ludan, Josh Magnus, Lyu, Qing, Yang, Yue, Dugan, Liam, Yatskar, Mark, Callison-Burch, Chris
Deep neural networks excel in text classification tasks, yet their application in high-stakes domains is hindered by their lack of interpretability. To address this, we propose Text Bottleneck Models (TBMs), an intrinsically interpretable text classification framework that offers both global and local explanations. Rather than directly predicting the output label, TBMs predict categorical values for a sparse set of salient concepts and use a linear layer over those concept values to produce the final prediction. These concepts can be automatically discovered and measured by a Large Language Model (LLM), without the need for human curation. On 12 diverse datasets, using GPT-4 for both concept generation and measurement, we show that TBMs can rival the performance of established black-box baselines such as GPT-4 fewshot and finetuned DeBERTa, while falling short against finetuned GPT-3.5. Overall, our findings suggest that TBMs are a promising new framework that enhances interpretability, with minimal performance tradeoffs, particularly for general-domain text.
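The prediction step described above (categorical concept values fed to a linear layer) might look roughly like the sketch below. The concept names, weights, and helper function are invented for illustration; in the paper the concepts are discovered and measured by an LLM, whereas here the concept values are hard-coded.

```python
def predict_with_explanation(concept_values, weights, bias=None):
    """Linear layer over concept values; returns the label and per-concept contributions."""
    bias = bias or {}
    logits, contributions = {}, {}
    for label, w in weights.items():
        # Each concept contributes weight * value to this label's logit.
        contributions[label] = {c: w.get(c, 0.0) * v for c, v in concept_values.items()}
        logits[label] = bias.get(label, 0.0) + sum(contributions[label].values())
    best = max(logits, key=logits.get)
    return best, contributions[best]

# Hypothetical concepts for a restaurant review, scored in {-1, 0, 1}.
concept_values = {"praises_food": 1, "mentions_long_wait": 0}

# Hypothetical learned weights of the linear layer, one row per label.
weights = {
    "positive": {"praises_food": 2.0, "mentions_long_wait": -1.5},
    "negative": {"praises_food": -2.0, "mentions_long_wait": 1.5},
}

label, explanation = predict_with_explanation(concept_values, weights)
print(label, explanation)
```

Because the final layer is linear, the per-concept contributions serve as a local explanation of a single prediction, and the weight table itself serves as a global one.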
Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications
Zhu, Andrew, Dugan, Liam, Hwang, Alyssa, Callison-Burch, Chris
Language model applications are becoming increasingly popular and complex, often including features like tool usage and retrieval augmentation. However, existing frameworks for such applications are often opinionated, deciding for developers how their prompts ought to be formatted and imposing limitations on customizability and reproducibility. To solve this we present Kani: a lightweight, flexible, and model-agnostic open-source framework for building language model applications. Kani helps developers implement a variety of complex features by supporting the core building blocks of chat interaction: model interfacing, chat management, and robust function calling. All Kani core functions are easily overridable and well documented to empower developers to customize functionality for their own needs. Kani thus serves as a useful tool for researchers, hobbyists, and industry professionals alike to accelerate their development while retaining interoperability and fine-grained control.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Srivastava, Aarohi, Rastogi, Abhinav, Rao, Abhishek, Shoeb, Abu Awal Md, Abid, Abubakar, Fisch, Adam, Brown, Adam R., Santoro, Adam, Gupta, Aditya, Garriga-Alonso, Adrià, Kluska, Agnieszka, Lewkowycz, Aitor, Agarwal, Akshat, Power, Alethea, Ray, Alex, Warstadt, Alex, Kocurek, Alexander W., Safaya, Ali, Tazarv, Ali, Xiang, Alice, Parrish, Alicia, Nie, Allen, Hussain, Aman, Askell, Amanda, Dsouza, Amanda, Slone, Ambrose, Rahane, Ameet, Iyer, Anantharaman S., Andreassen, Anders, Madotto, Andrea, Santilli, Andrea, Stuhlmüller, Andreas, Dai, Andrew, La, Andrew, Lampinen, Andrew, Zou, Andy, Jiang, Angela, Chen, Angelica, Vuong, Anh, Gupta, Animesh, Gottardi, Anna, Norelli, Antonio, Venkatesh, Anu, Gholamidavoodi, Arash, Tabassum, Arfa, Menezes, Arul, Kirubarajan, Arun, Mullokandov, Asher, Sabharwal, Ashish, Herrick, Austin, Efrat, Avia, Erdem, Aykut, Karakaş, Ayla, Roberts, B. Ryan, Loe, Bao Sheng, Zoph, Barret, Bojanowski, Bartłomiej, Özyurt, Batuhan, Hedayatnia, Behnam, Neyshabur, Behnam, Inden, Benjamin, Stein, Benno, Ekmekci, Berk, Lin, Bill Yuchen, Howald, Blake, Orinion, Bryan, Diao, Cameron, Dour, Cameron, Stinson, Catherine, Argueta, Cedrick, Ramírez, César Ferri, Singh, Chandan, Rathkopf, Charles, Meng, Chenlin, Baral, Chitta, Wu, Chiyu, Callison-Burch, Chris, Waites, Chris, Voigt, Christian, Manning, Christopher D., Potts, Christopher, Ramirez, Cindy, Rivera, Clara E., Siro, Clemencia, Raffel, Colin, Ashcraft, Courtney, Garbacea, Cristina, Sileo, Damien, Garrette, Dan, Hendrycks, Dan, Kilman, Dan, Roth, Dan, Freeman, Daniel, Khashabi, Daniel, Levy, Daniel, González, Daniel Moseguí, Perszyk, Danielle, Hernandez, Danny, Chen, Danqi, Ippolito, Daphne, Gilboa, Dar, Dohan, David, Drakard, David, Jurgens, David, Datta, Debajyoti, Ganguli, Deep, Emelin, Denis, Kleyko, Denis, Yuret, Deniz, Chen, Derek, Tam, Derek, Hupkes, Dieuwke, Misra, Diganta, Buzan, Dilyar, Mollo, Dimitri Coelho, Yang, Diyi, Lee, Dong-Ho, Schrader, Dylan, Shutova, Ekaterina, Cubuk, Ekin
Dogus, Segal, Elad, Hagerman, Eleanor, Barnes, Elizabeth, Donoway, Elizabeth, Pavlick, Ellie, Rodola, Emanuele, Lam, Emma, Chu, Eric, Tang, Eric, Erdem, Erkut, Chang, Ernie, Chi, Ethan A., Dyer, Ethan, Jerzak, Ethan, Kim, Ethan, Manyasi, Eunice Engefu, Zheltonozhskii, Evgenii, Xia, Fanyue, Siar, Fatemeh, Martínez-Plumed, Fernando, Happé, Francesca, Chollet, Francois, Rong, Frieda, Mishra, Gaurav, Winata, Genta Indra, de Melo, Gerard, Kruszewski, Germán, Parascandolo, Giambattista, Mariani, Giorgio, Wang, Gloria, Jaimovitch-López, Gonzalo, Betz, Gregor, Gur-Ari, Guy, Galijasevic, Hana, Kim, Hannah, Rashkin, Hannah, Hajishirzi, Hannaneh, Mehta, Harsh, Bogar, Hayden, Shevlin, Henry, Schütze, Hinrich, Yakura, Hiromu, Zhang, Hongming, Wong, Hugh Mee, Ng, Ian, Noble, Isaac, Jumelet, Jaap, Geissinger, Jack, Kernion, Jackson, Hilton, Jacob, Lee, Jaehoon, Fisac, Jaime Fernández, Simon, James B., Koppel, James, Zheng, James, Zou, James, Kocoń, Jan, Thompson, Jana, Wingfield, Janelle, Kaplan, Jared, Radom, Jarema, Sohl-Dickstein, Jascha, Phang, Jason, Wei, Jason, Yosinski, Jason, Novikova, Jekaterina, Bosscher, Jelle, Marsh, Jennifer, Kim, Jeremy, Taal, Jeroen, Engel, Jesse, Alabi, Jesujoba, Xu, Jiacheng, Song, Jiaming, Tang, Jillian, Waweru, Joan, Burden, John, Miller, John, Balis, John U., Batchelder, Jonathan, Berant, Jonathan, Frohberg, Jörg, Rozen, Jos, Hernandez-Orallo, Jose, Boudeman, Joseph, Guerr, Joseph, Jones, Joseph, Tenenbaum, Joshua B., Rule, Joshua S., Chua, Joyce, Kanclerz, Kamil, Livescu, Karen, Krauth, Karl, Gopalakrishnan, Karthik, Ignatyeva, Katerina, Markert, Katja, Dhole, Kaustubh D., Gimpel, Kevin, Omondi, Kevin, Mathewson, Kory, Chiafullo, Kristen, Shkaruta, Ksenia, Shridhar, Kumar, McDonell, Kyle, Richardson, Kyle, Reynolds, Laria, Gao, Leo, Zhang, Li, Dugan, Liam, Qin, Lianhui, Contreras-Ochando, Lidia, Morency, Louis-Philippe, Moschella, Luca, Lam, Lucas, Noble, Lucy, Schmidt, Ludwig, He, Luheng, Colón, Luis Oliveros, Metz, Luke, Şenel,
Lütfi Kerem, Bosma, Maarten, Sap, Maarten, ter Hoeve, Maartje, Farooqi, Maheen, Faruqui, Manaal, Mazeika, Mantas, Baturan, Marco, Marelli, Marco, Maru, Marco, Quintana, Maria Jose Ramírez, Tolkiehn, Marie, Giulianelli, Mario, Lewis, Martha, Potthast, Martin, Leavitt, Matthew L., Hagen, Matthias, Schubert, Mátyás, Baitemirova, Medina Orduna, Arnaud, Melody, McElrath, Melvin, Yee, Michael A., Cohen, Michael, Gu, Michael, Ivanitskiy, Michael, Starritt, Michael, Strube, Michael, Swędrowski, Michał, Bevilacqua, Michele, Yasunaga, Michihiro, Kale, Mihir, Cain, Mike, Xu, Mimee, Suzgun, Mirac, Walker, Mitch, Tiwari, Mo, Bansal, Mohit, Aminnaseri, Moin, Geva, Mor, Gheini, Mozhdeh, T, Mukund Varma, Peng, Nanyun, Chi, Nathan A., Lee, Nayeon, Krakover, Neta Gur-Ari, Cameron, Nicholas, Roberts, Nicholas, Doiron, Nick, Martinez, Nicole, Nangia, Nikita, Deckers, Niklas, Muennighoff, Niklas, Keskar, Nitish Shirish, Iyer, Niveditha S., Constant, Noah, Fiedel, Noah, Wen, Nuan, Zhang, Oliver, Agha, Omar, Elbaghdadi, Omar, Levy, Omer, Evans, Owain, Casares, Pablo Antonio Moreno, Doshi, Parth, Fung, Pascale, Liang, Paul Pu, Vicol, Paul, Alipoormolabashi, Pegah, Liao, Peiyuan, Liang, Percy, Chang, Peter, Eckersley, Peter, Htut, Phu Mon, Hwang, Pinyu, Miłkowski, Piotr, Patil, Piyush, Pezeshkpour, Pouya, Oli, Priti, Mei, Qiaozhu, Lyu, Qing, Chen, Qinlang, Banjade, Rabin, Rudolph, Rachel Etta, Gabriel, Raefer, Habacker, Rahel, Risco, Ramon, Millière, Raphaël, Garg, Rhythm, Barnes, Richard, Saurous, Rif A., Arakawa, Riku, Raymaekers, Robbe, Frank, Robert, Sikand, Rohan, Novak, Roman, Sitelew, Roman, LeBras, Ronan, Liu, Rosanne, Jacobs, Rowan, Zhang, Rui, Salakhutdinov, Ruslan, Chi, Ryan, Lee, Ryan, Stovall, Ryan, Teehan, Ryan, Yang, Rylan, Singh, Sahib, Mohammad, Saif M., Anand, Sajant, Dillavou, Sam, Shleifer, Sam, Wiseman, Sam, Gruetter, Samuel, Bowman, Samuel R., Schoenholz, Samuel S., Han, Sanghyun, Kwatra, Sanjeev, Rous, Sarah A., Ghazarian, Sarik, Ghosh, Sayan, Casey, Sean,
Bischoff, Sebastian, Gehrmann, Sebastian, Schuster, Sebastian, Sadeghi, Sepideh, Hamdan, Shadi, Zhou, Sharon, Srivastava, Shashank, Shi, Sherry, Singh, Shikhar, Asaadi, Shima, Gu, Shixiang Shane, Pachchigar, Shubh, Toshniwal, Shubham, Upadhyay, Shyam, Debnath, Shyamolima, Shakeri, Siamak, Thormeyer, Simon, Melzi, Simone, Reddy, Siva, Makini, Sneha Priscilla, Lee, Soo-Hwan, Torene, Spencer, Hatwar, Sriharsha, Dehaene, Stanislas, Divic, Stefan, Ermon, Stefano, Biderman, Stella, Lin, Stephanie, Prasad, Stephen, Piantadosi, Steven T., Shieber, Stuart M., Misherghi, Summer, Kiritchenko, Svetlana, Mishra, Swaroop, Linzen, Tal, Schuster, Tal, Li, Tao, Yu, Tao, Ali, Tariq, Hashimoto, Tatsu, Wu, Te-Lin, Desbordes, Théo, Rothschild, Theodore, Phan, Thomas, Wang, Tianle, Nkinyili, Tiberius, Schick, Timo, Kornev, Timofei, Tunduny, Titus, Gerstenberg, Tobias, Chang, Trenton, Neeraj, Trishala, Khot, Tushar, Shultz, Tyler, Shaham, Uri, Misra, Vedant, Demberg, Vera, Nyamai, Victoria, Raunak, Vikas, Ramasesh, Vinay, Prabhu, Vinay Uday, Padmakumar, Vishakh, Srikumar, Vivek, Fedus, William, Saunders, William, Zhang, William, Vossen, Wout, Ren, Xiang, Tong, Xiaoyu, Zhao, Xinran, Wu, Xinyi, Shen, Xudong, Yaghoobzadeh, Yadollah, Lakretz, Yair, Song, Yangqiu, Bahri, Yasaman, Choi, Yejin, Yang, Yichi, Hao, Yiding, Chen, Yifu, Belinkov, Yonatan, Hou, Yu, Hou, Yufang, Bai, Yuntao, Seid, Zachary, Zhao, Zhuoye, Wang, Zijian, Wang, Zijie J., Wang, Zirui, Wu, Ziyi
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models
Dugan, Liam, Wadhawan, Anshul, Spence, Kyle, Callison-Burch, Chris, McGuire, Morgan, Zordan, Victor
Recent work in speech-to-speech translation (S2ST) has focused primarily on offline settings, where the full input utterance is available before any output is given. This, however, is not reasonable in many real-world scenarios. In latency-sensitive applications, rather than waiting for the full utterance, translations should be spoken as soon as the information in the input is present. In this work, we introduce a system for simultaneous S2ST targeting real-world use cases. Our system supports translation from 57 languages to English with tunable parameters for dynamically adjusting the latency of the output -- including four policies for determining when to speak an output sequence. We show that these policies achieve offline-level accuracy with minimal increases in latency over a Greedy (wait-$k$) baseline. We open-source our evaluation code and interactive test script to aid future SimulS2ST research and application development.
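For readers unfamiliar with the wait-$k$ baseline named above, the sketch below generates the standard wait-k read/write schedule: read k source units first, then alternate between writing one target unit and reading one more source unit until the source is exhausted. It illustrates only this baseline schedule, not the four policies of the paper, and the function name and token-level granularity are assumptions.

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions = []
    read = written = 0
    while written < num_target:
        # Stay k source units ahead of the output while source input remains.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

schedule = wait_k_schedule(num_source=4, num_target=4, k=2)
print(schedule)
# → ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```

A smaller k lowers latency because output starts sooner, at the cost of committing to translations with less source context.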
Exploring the Curious Case of Code Prompts
Zhang, Li, Dugan, Liam, Xu, Hainiu, Callison-Burch, Chris
Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some but not all tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.