Chen, Mingda
Improving Factuality with Explicit Working Memory
Chen, Mingda, Li, Yang, Padthe, Karthik, Shao, Rulin, Sun, Alicia, Zettlemoyer, Luke, Gosh, Gargi, Yih, Wen-tau
In the realm of long-form text generation, a notable vulnerability of large language models (LLMs) is their propensity for hallucination, wherein the generated text contains factually inaccurate information. By prepending the input prompt with relevant documents from trustworthy sources, retrieval-augmented generation (RAG) (Lewis et al., 2020; Shi et al., 2024) has been shown to be a simple yet effective approach that substantially mitigates the hallucination issue. To further enhance the factual accuracy of model output, various iterative prompting methods have been proposed that build upon RAG. For instance, FLARE (Jiang et al., 2023) generates responses sentence by sentence, and if a newly generated sentence contains low-probability tokens, it retrieves a new set of documents and re-runs RAG to regenerate the sentence. Alternatively, Self-RAG (Asai et al., 2024) employs a self-critic component to verify the correctness of each partial generation and repeatedly queries a retrieval system to update the background knowledge, thereby producing more accurate and faithful responses. While these systems demonstrate significant empirical improvements, they remain constrained by the traditional RAG design: context-relevant knowledge obtained through retrieval is the only online feedback to the model, and it is incorporated solely as part of the input string.
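As an illustration of the FLARE-style loop described in the abstract, the following is a minimal sketch (not the paper's implementation), assuming hypothetical `retrieve` and `generate_sentence` helpers and an arbitrary probability threshold.

```python
# Hypothetical helpers: `retrieve(query) -> list[str]` returns documents, and
# `generate_sentence(context) -> (sentence | None, token_probs)` produces the
# next sentence with per-token probabilities. Neither is from the paper.

def flare_style_generate(prompt, retrieve, generate_sentence,
                         prob_threshold=0.4, max_sentences=20):
    """Generate sentence by sentence, re-retrieving whenever the model emits
    low-probability (potentially hallucinated) tokens."""
    docs = retrieve(prompt)                              # initial retrieval
    response = []
    for _ in range(max_sentences):
        context = "\n".join(docs) + "\n" + prompt + " " + " ".join(response)
        sentence, token_probs = generate_sentence(context)
        if sentence is None:                             # model ended the answer
            break
        if token_probs and min(token_probs) < prob_threshold:
            # Low-confidence sentence: query the retriever with it, refresh the
            # documents, and regenerate the sentence once with the new context.
            docs = retrieve(prompt + " " + sentence)
            context = "\n".join(docs) + "\n" + prompt + " " + " ".join(response)
            sentence, _ = generate_sentence(context)
        response.append(sentence)
    return " ".join(response)
```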
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Lee, Yejin, Sun, Anna, Hosmer, Basil, Acun, Bilge, Balioglu, Can, Wang, Changhan, Hernandez, Charles David, Puhrsch, Christian, Haziza, Daniel, Guessous, Driss, Massa, Francisco, Kahn, Jacob, Wan, Jeffrey, Reizenstein, Jeremy, Zhai, Jiaqi, Isaacson, Joe, Schlosser, Joel, Pino, Juan, Sadagopan, Kaushik Ram, Shamis, Leonid, Ma, Linjian, Hwang, Min-Jae, Chen, Mingda, Elhoushi, Mostafa, Rodriguez, Pedro, Pasunuru, Ram, Yih, Scott, Popuri, Sravya, Liu, Xing, Wu, Carole-Jean
Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only have its applications broadened to various sectors, but the technology also poses new system design and optimization opportunities. It is capable of understanding and responding in multiple modalities; however, this advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users worldwide, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across these generative AI models, linear operations constitute a significant share of inference latency due to the feed-forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.
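A minimal sketch of the kind of measurement such a characterization relies on: profiling an auto-regressive decode loop with `torch.profiler` to attribute time to attention versus linear kernels. The model interface and decode loop below are assumptions, not the paper's setup.

```python
# Illustrative only: `model` is any HF-style decoder returning `.logits`, and the
# decode loop recomputes the full prefix each step (no KV cache), which is exactly
# the kind of inefficiency such profiles expose.
import torch
from torch.profiler import profile, ProfilerActivity

@torch.no_grad()
def profile_decode(model, input_ids, new_tokens=32):
    model.eval()
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        ids = input_ids
        for _ in range(new_tokens):
            logits = model(ids).logits                # one auto-regressive step
            next_id = logits[:, -1:].argmax(dim=-1)   # greedy decoding
            ids = torch.cat([ids, next_id], dim=-1)
    sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
    # Attention and linear (addmm) kernels typically dominate the table.
    print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```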
RA-DIT: Retrieval-Augmented Dual Instruction Tuning
Lin, Xi Victoria, Chen, Xilun, Chen, Mingda, Shi, Weijia, Lomeli, Maria, James, Rich, Rodriguez, Pedro, Kahn, Jacob, Szilvasy, Gergely, Lewis, Mike, Zettlemoyer, Luke, Yih, Scott
Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or post-hoc integration of the data store, which leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in the 0-shot setting and +1.4% in the 5-shot setting on average.
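To make the second fine-tuning step concrete, here is a minimal sketch of an LM-supervised retriever update in the spirit of step (2), assuming a frozen LM and per-chunk answer log-probabilities; the actual RA-DIT objective and implementation may differ.

```python
# Hedged sketch: push the retriever's distribution over retrieved chunks toward
# the distribution implied by how much each chunk helps the frozen LM score the
# gold answer. Shapes and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def retriever_loss(retriever_scores, lm_answer_logprobs, temperature=1.0):
    """
    retriever_scores:   (k,) similarity scores for k retrieved chunks (trainable).
    lm_answer_logprobs: (k,) log p_LM(answer | chunk_i, question) for each chunk,
                        computed with the language model held fixed.
    """
    log_p_retriever = F.log_softmax(retriever_scores / temperature, dim=-1)
    p_lm = F.softmax(lm_answer_logprobs / temperature, dim=-1)   # LM "preference"
    # KL(p_lm || p_retriever): gradients flow only into the retriever scores.
    return F.kl_div(log_p_retriever, p_lm, reduction="sum")
```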
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
Chen, Mingda, Heffernan, Kevin, Çelebi, Onur, Mourachko, Alex, Schwenk, Holger
We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
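For illustration, a minimal sketch of an xSIM-style error rate over a fixed evaluation set follows; it uses plain cosine similarity rather than the margin-based scoring of the real metric, and xSIM++ would additionally insert rule-based hard negatives into the English candidate pool.

```python
# Hedged sketch of an xSIM-style proxy: retrieve the nearest English candidate for
# every source sentence in a shared embedding space and report the error rate.
import numpy as np

def xsim_error_rate(src_embs, en_embs, gold_idx):
    """
    src_embs: (n, d) embeddings of the non-English sentences.
    en_embs:  (m, d) embeddings of English candidates (m >= n with hard negatives).
    gold_idx: (n,)   index of the correct English sentence for each source.
    """
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    en = en_embs / np.linalg.norm(en_embs, axis=1, keepdims=True)
    nearest = (src @ en.T).argmax(axis=1)        # cosine-similarity retrieval
    return float((nearest != gold_idx).mean())   # fraction of mismatched pairs
```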
Efficient Open Domain Multi-Hop Question Answering with Few-Shot Data Synthesis
Chen, Mingda, Chen, Xilun, Yih, Wen-tau
Few-shot learning for open domain multi-hop question answering typically relies on large language models (LLMs). While powerful, LLMs are inefficient at inference time. We propose a data synthesis framework for multi-hop question answering that improves smaller language models with fewer than 10 human-annotated question-answer pairs. The framework is built upon data generation functions parameterized by LLMs and prompts, which require minimal hand-crafted features. Empirically, we synthesize millions of multi-hop questions and claims. After finetuning language models on the synthetic data, we evaluate the models on popular benchmarks for multi-hop question answering and fact verification. Our experimental results show that finetuning on the synthetic data improves model performance significantly, allowing our finetuned models to be competitive with prior models while being almost one-third their size in terms of parameter counts.
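The following is a minimal sketch of one prompt-parameterized generation function in the spirit of the framework: composing a multi-hop question from two linked passages with a handful of human-written demonstrations. `call_llm` and the prompt wording are hypothetical placeholders, not the paper's actual prompts.

```python
# Hypothetical few-shot prompt for synthesizing a two-hop question from two passages.
FEW_SHOT_EXAMPLES = [
    # fewer than 10 human-annotated (doc_a, doc_b, question, answer) tuples go here
]

def make_prompt(doc_a, doc_b):
    demos = "\n\n".join(
        f"Passage A: {a}\nPassage B: {b}\nQuestion: {q}\nAnswer: {ans}"
        for a, b, q, ans in FEW_SHOT_EXAMPLES
    )
    return (
        f"{demos}\n\n"
        f"Passage A: {doc_a}\nPassage B: {doc_b}\n"
        "Write a question that requires both passages to answer, then give the answer.\n"
        "Question:"
    )

def synthesize_example(doc_a, doc_b, call_llm):
    """call_llm: str -> str, any LLM completion endpoint (an assumption here)."""
    completion = call_llm(make_prompt(doc_a, doc_b))
    question, _, answer = completion.partition("Answer:")
    return question.strip(), answer.strip()
```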
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
Chen, Mingda, Duquenne, Paul-Ambroise, Andrews, Pierre, Kao, Justine, Mourachko, Alexandre, Schwenk, Holger, Costa-jussà, Marta R.
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output, and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that, when evaluated at the sentence level, BLASER correlates significantly better with human judgment than ASR-dependent metrics, including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows that combining speech and text as inputs to BLASER does not increase the correlation with human scores, but that the best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
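A minimal sketch of an unsupervised BLASER-style score follows: encode source, system output, and reference speech into a shared space and combine cosine similarities. `encode_speech` stands in for any multilingual multimodal speech encoder, and the particular combination below is an illustrative assumption; the supervised variant instead trains a small regressor on such embeddings.

```python
# Hedged sketch: similarity-based quality score over speech embeddings only.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def blaser_style_score(src_wav, mt_wav, ref_wav, encode_speech):
    e_src = encode_speech(src_wav)   # source-language speech
    e_mt = encode_speech(mt_wav)     # system translation speech
    e_ref = encode_speech(ref_wav)   # human reference speech
    # Average the output's similarity to both the source and the reference.
    return 0.5 * (cosine(e_src, e_mt) + cosine(e_ref, e_mt))
```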
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Lan, Zhenzhong, Chen, Mingda, Goodman, Sebastian, Gimpel, Kevin, Sharma, Piyush, Soricut, Radu
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/
Many nontrivial NLP tasks, including those that have limited training data, have greatly benefited from these pre-trained models. One of the most compelling signs of these breakthroughs is the evolution of machine performance on a reading comprehension task designed for middle and high-school English exams in China, the RACE test (Lai et al., 2017): the paper that originally describes the task and formulates the modeling challenge reports then state-of-the-art machine accuracy at 44.1%; the latest published result reports their model performance at 83.2% (Liu et al., 2019); the work we present here pushes it even higher to 89.4%, a stunning 45.3% improvement. Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance (Devlin et al., 2019; Radford et al., 2019). It has become common practice to pre-train large models and distill them down to smaller ones (Sun et al., 2019; Turc et al., 2019) for real applications.
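A minimal sketch of the two parameter-reduction ideas mentioned in the abstract, factorized embedding parameterization and cross-layer parameter sharing, is shown below; the dimensions and layer choice are illustrative assumptions, not the released ALBERT configuration.

```python
# Hedged sketch: vocab -> small E embedding followed by an E -> H projection,
# plus one Transformer layer whose weights are reused across all layers.
import torch
import torch.nn as nn

class AlbertStyleEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        # Factorized embedding: O(V*E + E*H) parameters instead of O(V*H).
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.emb_proj = nn.Linear(embed_dim, hidden_dim)
        # A single layer whose weights are shared across all "layers".
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, token_ids):
        h = self.emb_proj(self.token_emb(token_ids))
        for _ in range(self.num_layers):     # cross-layer parameter sharing
            h = self.shared_layer(h)
        return h
```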