Thain, Nithum
Improving Neutral Point of View Text Generation through Parameter-Efficient Reinforcement Learning and a Small-Scale High-Quality Dataset
Hoffmann, Jessica, Ahlheim, Christiane, Yu, Zac, Walfrand, Aria, Jin, Jarvis, Tano, Marie, Beirami, Ahmad, van Liemt, Erin, Thain, Nithum, Sidahmed, Hakim, Dixon, Lucas
This paper describes the construction of a dataset and the evaluation of training methods to improve generative large language models' (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e., to provide significantly more informative, diverse and impartial answers. The dataset, the SHQ-NPOV dataset, comprises 300 high-quality, human-written quadruplets: a query on a sensitive topic, an answer, an NPOV rating, and a set of links to source texts elaborating the various points of view. The first key contribution of this paper is a new methodology to create such datasets through iterative rounds of human peer-critique and annotator training, which we release alongside the dataset. The second key contribution is the identification of a highly effective training regime for parameter-efficient reinforcement learning (PE-RL) to improve NPOV generation. We compare and extensively evaluate PE-RL and multiple baselines-including LoRA finetuning (a strong baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline ($97.06\%\rightarrow 99.08\%$), but also scores much higher on features linguists identify as key to separating good answers from the best answers ($60.25\%\rightarrow 85.21\%$ for presence of supportive details, $68.74\%\rightarrow 91.43\%$ for absence of oversimplification). A qualitative analysis corroborates this. Finally, our evaluation finds no statistical differences between results on topics that appear in the training dataset and those on separated evaluation topics, which provides strong evidence that our approach to training PE-RL exhibits very effective out of topic generalization.
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, null, Mesnard, Thomas, Hardin, Cassidy, Dadashi, Robert, Bhupatiraju, Surya, Pathak, Shreya, Sifre, Laurent, Riviรจre, Morgane, Kale, Mihir Sanjay, Love, Juliette, Tafti, Pouya, Hussenot, Lรฉonard, Sessa, Pier Giuseppe, Chowdhery, Aakanksha, Roberts, Adam, Barua, Aditya, Botev, Alex, Castro-Ros, Alex, Slone, Ambrose, Hรฉliou, Amรฉlie, Tacchetti, Andrea, Bulanova, Anna, Paterson, Antonia, Tsai, Beth, Shahriari, Bobak, Lan, Charline Le, Choquette-Choo, Christopher A., Crepy, Clรฉment, Cer, Daniel, Ippolito, Daphne, Reid, David, Buchatskaya, Elena, Ni, Eric, Noland, Eric, Yan, Geng, Tucker, George, Muraru, George-Christian, Rozhdestvenskiy, Grigory, Michalewski, Henryk, Tenney, Ian, Grishchenko, Ivan, Austin, Jacob, Keeling, James, Labanowski, Jane, Lespiau, Jean-Baptiste, Stanway, Jeff, Brennan, Jenny, Chen, Jeremy, Ferret, Johan, Chiu, Justin, Mao-Jones, Justin, Lee, Katherine, Yu, Kathy, Millican, Katie, Sjoesund, Lars Lowe, Lee, Lisa, Dixon, Lucas, Reid, Machel, Mikuลa, Maciej, Wirth, Mateo, Sharman, Michael, Chinaev, Nikolai, Thain, Nithum, Bachem, Olivier, Chang, Oscar, Wahltinez, Oscar, Bailey, Paige, Michel, Paul, Yotov, Petko, Chaabouni, Rahma, Comanescu, Ramona, Jana, Reena, Anil, Rohan, McIlroy, Ross, Liu, Ruibo, Mullins, Ryan, Smith, Samuel L, Borgeaud, Sebastian, Girgin, Sertan, Douglas, Sholto, Pandya, Shree, Shakeri, Siamak, De, Soham, Klimenko, Ted, Hennigan, Tom, Feinberg, Vlad, Stokowiec, Wojciech, Chen, Yu-hui, Ahmed, Zafarali, Gong, Zhitao, Warkentin, Tris, Peran, Ludovic, Giang, Minh, Farabet, Clรฉment, Vinyals, Oriol, Dean, Jeff, Kavukcuoglu, Koray, Hassabis, Demis, Ghahramani, Zoubin, Eck, Douglas, Barral, Joelle, Pereira, Fernando, Collins, Eli, Joulin, Armand, Fiedel, Noah, Senter, Evan, Andreev, Alek, Kenealy, Kathleen
This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics
Chang, Tyler A., Tomanek, Katrin, Hoffmann, Jessica, Thain, Nithum, van Liemt, Erin, Meier-Hellstern, Kathleen, Dixon, Lucas
We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia's Neutral Point of View (NPOV) principle: acknowledge the absence of a single true answer and surface multiple perspectives. We frame this as retrieval augmented generation, where perspectives are retrieved from a knowledge base and the LLM is tasked with generating a fluent and faithful response from the given perspectives. As a starting point, we use a deterministic retrieval system and then focus on common LLM failure modes that arise during this approach to text generation, namely hallucination and coverage errors. We propose and evaluate three methods to detect such errors based on (1) word-overlap, (2) salience, and (3) LLM-based classifiers. Our results demonstrate that LLM-based classifiers, even when trained only on synthetic errors, achieve high error detection performance, with ROC AUC scores of 95.3% for hallucination and 90.5% for coverage error detection on unambiguous error cases. We show that when no training data is available, our other methods still yield good results on hallucination (84.0%) and coverage error (85.2%) detection.
ConstitutionalExperts: Training a Mixture of Principle-based Prompts
Petridis, Savvas, Wedin, Ben, Yuan, Ann, Wexler, James, Thain, Nithum
Large language models (LLMs) are highly capable at a variety of tasks given the right prompt, but writing one is still a difficult and tedious process. In this work, we introduce ConstitutionalExperts, a method for learning a prompt consisting of constitutional principles (i.e. rules), given a training dataset. Unlike prior methods that optimize the prompt as a single entity, our method incrementally improves the prompt by surgically editing individual principles. We also show that we can improve overall performance by learning unique prompts for different semantic regions of the training data and using a mixture-of-experts (MoE) architecture to route inputs at inference time. We compare our method to other state of the art prompt-optimization techniques across six benchmark datasets. We also investigate whether MoE improves these other techniques. Our results suggest that ConstitutionalExperts outperforms other prompt optimization techniques by 10.9% (F1) and that mixture-of-experts improves all techniques, suggesting its broad applicability.
Towards Agile Text Classifiers for Everyone
Mozes, Maximilian, Hoffmann, Jessica, Tomanek, Katrin, Kouate, Muhamed, Thain, Nithum, Yuan, Ann, Bolukbasi, Tolga, Dixon, Lucas
Text-based safety classifiers are widely used for content moderation and increasingly to tune generative language model behavior - a topic of growing concern for the safety of digital assistants and chatbots. However, different policies require different classifiers, and safety policies themselves improve from iteration and adaptation. This paper introduces and evaluates methods for agile text classification, whereby classifiers are trained using small, targeted datasets that can be quickly developed for a particular policy. Experimenting with 7 datasets from three safety-related domains, comprising 15 annotation schemes, led to our key finding: prompt-tuning large language models, like PaLM 62B, with a labeled dataset of as few as 80 examples can achieve state-of-the-art performance. We argue that this enables a paradigm shift for text classification, especially for models supporting safer online discourse. Instead of collecting millions of examples to attempt to create universal safety classifiers over months or years, classifiers could be tuned using small datasets, created by individuals or small organizations, tailored for specific use cases, and iterated on and adapted in the time-span of a day.
Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals
Balashankar, Ananth, Wang, Xuezhi, Qin, Yao, Packer, Ben, Thain, Nithum, Chen, Jilin, Chi, Ed H., Beutel, Alex
Counterfactual Data Augmentation (CDA) is a commonly used technique for improving robustness in natural language classifiers. However, one fundamental challenge is how to discover meaningful counterfactuals and efficiently label them, with minimal human labeling cost. Most existing methods either completely rely on human-annotated labels, an expensive process which limits the scale of counterfactual data, or implicitly assume label invariance, which may mislead the model with incorrect labels. In this paper, we present a novel framework that utilizes counterfactual generative models to generate a large number of diverse counterfactuals by actively sampling from regions of uncertainty, and then automatically label them with a learned pairwise classifier. Our key insight is that we can more correctly label the generated counterfactuals by training a pairwise classifier that interpolates the relationship between the original example and the counterfactual. We demonstrate that with a small amount of human-annotated counterfactual data (10%), we can generate a counterfactual augmentation dataset with learned labels, that provides an 18-20% improvement in robustness and a 14-21% reduction in errors on 6 out-of-domain datasets, comparable to that of a fully human-annotated counterfactual dataset for both sentiment classification and question paraphrase tasks.
Gradient-Based Automated Iterative Recovery for Parameter-Efficient Tuning
Mozes, Maximilian, Bolukbasi, Tolga, Yuan, Ann, Liu, Frederick, Thain, Nithum, Dixon, Lucas
Pretrained large language models (LLMs) are able to solve a wide variety of tasks through transfer learning. Various explainability methods have been developed to investigate their decision making process. TracIn (Pruthi et al., 2020) is one such gradient-based method which explains model inferences based on the influence of training examples. In this paper, we explore the use of TracIn to improve model performance in the parameter-efficient tuning (PET) setting. We develop conversational safety classifiers via the prompt-tuning PET method and show how the unique characteristics of the PET regime enable TracIn to identify the cause for certain misclassifications by LLMs. We develop a new methodology for using gradient-based explainability techniques to improve model performance, G-BAIR: gradient-based automated iterative recovery. We show that G-BAIR can recover LLM performance on benchmarks after manually corrupting training labels. This suggests that influence methods like TracIn can be used to automatically perform data cleaning, and introduces the potential for interactive debugging and relabeling for PET-based transfer learning methods.
Fairness without Demographics through Adversarially Reweighted Learning
Lahoti, Preethi, Beutel, Alex, Chen, Jilin, Lee, Kang, Prost, Flavien, Thain, Nithum, Wang, Xuezhi, Chi, Ed H.
Much of the previous machine learning (ML) fairness literature assumes that protected features such as race and sex are present in the dataset, and relies upon them to mitigate fairness concerns. However, in practice factors like privacy and regulation often preclude the collection of protected features, or their use for training or inference, severely limiting the applicability of traditional fairness research. Therefore we ask: How can we train an ML model to improve fairness when we do not even know the protected group memberships? In this work we address this problem by proposing Adversarially Reweighted Learning (ARL). In particular, we hypothesize that non-protected features and task labels are valuable for identifying fairness issues, and can be used to co-train an adversarial reweighting approach for improving fairness. Our results show that {ARL} improves Rawlsian Max-Min fairness, with notable AUC improvements for worst-case protected groups in multiple datasets, outperforming state-of-the-art alternatives.
Debiasing Embeddings for Reduced Gender Bias in Text Classification
Prost, Flavien, Thain, Nithum, Bolukbasi, Tolga
We investigate how this bias affects downstream classification tasks, using the case study of occupation classification (De-Arteaga et al., 2019). We show that traditional techniques for debiasing embeddings can actually worsen the bias of the downstream classifier by providing a less noisy channel for communicating gender information. With a relatively minor adjustment, however, we show how these same techniques can be used to simultaneously reduce bias and maintain high classification accuracy.
Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification
Borkan, Daniel, Dixon, Lucas, Sorensen, Jeffrey, Thain, Nithum, Vasserman, Lucy
Machine learning systems, if not constrained, will compounding existing challenges to fairness in society often learn the simplest associations that can predict the labels, so at large. In this paper, we introduce a suite of threshold-agnostic any incorrect associations present in the training data can produce metrics that provide a nuanced view of this unintended bias, by unintended associations in the final model. Toxicity models specifically considering the various ways that a classifier's score distribution have been shown to capture and reproduce biases common can vary across designated groups. We also introduce a large new in society, for example mis-associating the names of frequently test set of online comments with crowd-sourced annotations for attacked identity groups (such as "gay", and "muslim" etc.) with identity references. We use this to show how our metrics can be toxicity [5, 17]. This unintended model bias could be due to the used to find new and potentially subtle unintended bias in existing demographic composition of the online user pool, the latent or public models.