Lin, Jimmy
Which Model Shall I Choose? Cost/Quality Trade-offs for Text Classification Tasks
Zong, Shi, Seltzer, Josh, Pan, Jiahua, Cheng, Kathy, Lin, Jimmy
Industry practitioners always face the problem of choosing the appropriate model for deployment under different considerations, such as to maximize a metric that is crucial for production, or to reduce the total cost given financial concerns. In this work, we focus on the text classification task and present a quantitative analysis for this challenge. Using classification accuracy as the main metric, we evaluate the classifiers' performances for a variety of models, including large language models, along with their associated costs, including the annotation cost, training (fine-tuning) cost, and inference cost. We then discuss the model choices for situations like having a large number of samples needed for inference. We hope our work will help people better understand the cost/quality trade-offs for the text classification task.
Precise Zero-Shot Dense Retrieval without Relevance Labels
Gao, Luyu, Ma, Xueguang, Lin, Jimmy, Callan, Jamie
While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
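The two-step pipeline described above (generate a hypothetical document, then retrieve real documents near its embedding) can be sketched as follows. This is a toy illustration only: `generate_hypothetical_document` is a hypothetical stand-in for the instruction-following language model, and `embed` is a bag-of-words stand-in for a learned encoder such as Contriever. Neither reflects the actual models or any real API.

```python
import math
from collections import Counter


def generate_hypothetical_document(query: str) -> str:
    # Hypothetical stand-in for an instruction-following LM.
    # A real system would prompt a model; here we use a canned answer.
    canned = {
        "what causes tides": (
            "Tides are caused by the gravitational pull of the moon "
            "and the sun on the oceans."
        )
    }
    return canned.get(query, query)


def embed(text: str) -> Counter:
    # Hypothetical stand-in for a dense encoder: bag-of-words counts.
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return Counter(cleaned.split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hyde_search(query: str, corpus: list, top_k: int = 1) -> list:
    # Step 1: generate a hypothetical (possibly inaccurate) document.
    hypothetical = generate_hypothetical_document(query)
    # Step 2: embed it and rank real documents by vector similarity.
    query_vector = embed(hypothetical)
    ranked = sorted(corpus, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:top_k]


corpus = [
    "The moon's gravitational pull on the oceans produces high and low tides.",
    "Bread rises because yeast releases carbon dioxide.",
    "The stock market closed higher on Friday.",
]
print(hyde_search("what causes tides", corpus))
```

Note that the raw query is never embedded directly; only the generated document is, which is the pivot the abstract describes.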
Less is More: Parameter-Free Text Classification with Gzip
Jiang, Zhiying, Yang, Matthew Y. R., Tsirlin, Mikhail, Tang, Raphael, Lin, Jimmy
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive, with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.
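A minimal sketch of the compressor-plus-$k$NN idea, assuming the distance between two texts is the Normalized Compression Distance (NCD) computed with gzip. The abstract does not name the distance, so NCD, the function names, and the space used to join texts are assumptions made for illustration.

```python
import gzip
from collections import Counter


def compressed_length(text: str) -> int:
    # Number of bytes after gzip compression.
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    # Normalized Compression Distance: small when the two texts share
    # structure that the compressor can exploit.
    ca, cb = compressed_length(a), compressed_length(b)
    cab = compressed_length(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query: str, train: list, k: int = 3) -> str:
    # train is a list of (text, label) pairs; no parameters are learned.
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ("the team won the football match in extra time", "sports"),
    ("the striker scored twice in the second half", "sports"),
    ("the central bank raised interest rates again today", "finance"),
    ("shares fell as investors worried about inflation", "finance"),
]
print(knn_predict("the goalkeeper saved a penalty in the final match", train))
```

On texts this short the fixed gzip header dominates the compressed length, so predictions from a toy training set like this one are not reliable; the sketch only shows the mechanics.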
Improving Precancerous Case Characterization via Transformer-based Ensemble Learning
Zhong, Yizhen, Xiao, Jiajie, Vetterli, Thomas, Matin, Mahan, Loo, Ellen, Lin, Jimmy, Bourgon, Richard, Shapira, Ofer
The application of natural language processing (NLP) to cancer pathology reports has been focused on detecting cancer cases, largely ignoring precancerous cases. Improving the characterization of precancerous adenomas assists in developing diagnostic tests for early cancer detection and prevention, especially for colorectal cancer (CRC). Here we developed transformer-based deep neural network NLP models to perform CRC phenotyping, with the goal of extracting precancerous lesion attributes and distinguishing cancer and precancerous cases. We achieved a macro-F1 score of 0.914 for classifying patients into negative, non-advanced adenoma, advanced adenoma and CRC. We further improved the performance to 0.923 using an ensemble of classifiers for cancer status classification and lesion size named entity recognition (NER). Our results demonstrated the potential of using NLP to leverage real-world health record data to facilitate the development of diagnostic tests for early cancer prevention.
What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Tang, Raphael, Liu, Linqing, Pandey, Akshat, Jiang, Zhiying, Yang, Gefei, Kumar, Karun, Stenetorp, Pontus, Lin, Jimmy, Ture, Ferhan
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head-dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.
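The "upscale and aggregate" step for one word can be illustrated with a small sketch. It assumes square attention maps at several resolutions, nearest-neighbor upscaling, and a plain sum across maps. These are simplifications chosen to keep the example dependency-free, and the function names are hypothetical; the sketch is not the DAAM implementation.

```python
def upscale_nearest(grid: list, size: int) -> list:
    # Nearest-neighbor upscaling of a 2D grid to size x size
    # (a simple substitute for smoother interpolation).
    h, w = len(grid), len(grid[0])
    return [
        [grid[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


def aggregate_maps(maps: list, size: int) -> list:
    # Sum the upscaled word-pixel score maps (e.g. from different
    # layers or time steps) into one pixel-level heat map.
    total = [[0.0] * size for _ in range(size)]
    for grid in maps:
        upscaled = upscale_nearest(grid, size)
        for r in range(size):
            for c in range(size):
                total[r][c] += upscaled[r][c]
    return total


# Two toy attention maps for the same word at 1x1 and 2x2 resolution.
maps_for_word = [
    [[1.0]],
    [[0.0, 1.0], [2.0, 3.0]],
]
print(aggregate_maps(maps_for_word, 2))  # [[1.0, 2.0], [3.0, 4.0]]
```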
SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale
Tang, Raphael, Kumar, Karun, Yang, Gefei, Pandey, Akshat, Mao, Yajie, Belyaev, Vladislav, Emmadi, Madhuri, Murray, Craig, Ture, Ferhan, Lin, Jimmy
End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization since most companies lack vast human and computational resources. In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, we use a third-party ASR system as a weak supervision source, supplemented with labeling functions derived from implicit user feedback. To accelerate inference, we propose to route production-time queries across a pool of CUDA graphs of varying input lengths, the distribution of which best matches the traffic's. Compared to our third-party ASR, we achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Our system, called SpeechNet, currently serves 12 million queries per day on our voice-enabled smart television. To our knowledge, this is the first time a large-scale, Wav2vec-based deployment has been described in the academic literature.
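The routing idea above, a pool of fixed-length executors whose lengths follow the traffic distribution, can be sketched in plain Python. The sketch assumes bucket lengths are taken at evenly spaced quantiles of observed input lengths and that each query is padded to the smallest bucket that fits. The function names and the quantile rule are illustrative assumptions, and no CUDA code is involved.

```python
import bisect
import math


def buckets_from_traffic(lengths: list, n_buckets: int) -> list:
    # Choose bucket lengths at evenly spaced quantiles of observed
    # traffic, so common input lengths waste little padding.
    ordered = sorted(lengths)
    picks = set()
    for j in range(1, n_buckets + 1):
        idx = math.ceil(j * len(ordered) / n_buckets) - 1
        picks.add(ordered[idx])
    return sorted(picks)


def pick_bucket(length: int, buckets: list):
    # Smallest bucket that can hold the input, or None if none fits.
    i = bisect.bisect_left(buckets, length)
    return buckets[i] if i < len(buckets) else None


def pad_to_bucket(samples: list, buckets: list, pad_value: float = 0.0) -> list:
    target = pick_bucket(len(samples), buckets)
    if target is None:
        raise ValueError("input longer than the largest bucket")
    return samples + [pad_value] * (target - len(samples))


observed = [1, 2, 3, 4, 5, 6, 7, 8]
buckets = buckets_from_traffic(observed, 4)
print(buckets)                  # [2, 4, 6, 8]
print(pick_bucket(5, buckets))  # 6
```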
CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval
Li, Minghan, Lin, Sheng-Chieh, Oguz, Barlas, Ghoshal, Asish, Lin, Jimmy, Mehdad, Yashar, Yih, Wen-tau, Chen, Xilun
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers and have achieved state-of-the-art performance on various retrieval tasks. These methods, however, are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts. In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval. CITADEL learns to route different token vectors to the predicted lexical "keys" such that a query token vector only interacts with document token vectors routed to the same key. This design significantly reduces the computation cost while maintaining high accuracy. Notably, CITADEL achieves the same or slightly better performance than the previous state of the art, ColBERT-v2, on both in-domain (MS MARCO) and out-of-domain (BEIR) evaluations, while being nearly 40 times faster. Code and data are available at https://github.com/facebookresearch/dpr-scale.
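To make the "same key" constraint concrete, here is a toy sketch in which every token vector already carries its routing key. In the paper the routing is learned; here the keys are given, and the scoring rule (for each query token, take the best dot product per document under the same key, then sum) is an assumption borrowed from late-interaction scoring rather than a confirmed detail of CITADEL.

```python
from collections import defaultdict


def dot(u: list, v: list) -> float:
    return sum(a * b for a, b in zip(u, v))


def build_index(docs: dict) -> dict:
    # docs maps doc_id -> list of (routing_key, token_vector).
    # The index groups document token vectors by routing key.
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for key, vector in tokens:
            index[key].append((doc_id, vector))
    return index


def search(query_tokens: list, index: dict) -> list:
    scores = defaultdict(float)
    for key, query_vector in query_tokens:
        best = {}
        # Only document vectors routed to the same key are touched.
        for doc_id, doc_vector in index.get(key, []):
            s = dot(query_vector, doc_vector)
            if s > best.get(doc_id, float("-inf")):
                best[doc_id] = s
        for doc_id, s in best.items():
            scores[doc_id] += s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "d1": [("cat", [1.0, 0.0]), ("sat", [0.0, 1.0])],
    "d2": [("dog", [1.0, 0.0]), ("sat", [0.5, 0.5])],
}
query = [("cat", [1.0, 0.0]), ("sat", [0.0, 1.0])]
print(search(query, build_index(docs)))  # [('d1', 2.0), ('d2', 0.5)]
```

The saving comes from the inner loop: a query token never visits document vectors stored under other keys, unlike an all-pairs token comparison.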
On the Interaction Between Differential Privacy and Gradient Compression in Deep Learning
Lin, Jimmy
While differential privacy and gradient compression are separately well-researched topics in machine learning, the study of the interaction between these two topics is still relatively new. We perform a detailed empirical study on how the Gaussian mechanism for differential privacy and gradient compression jointly impact test accuracy in deep learning. The existing literature in gradient compression mostly evaluates compression in the absence of differential privacy guarantees, and demonstrates that sufficiently high compression rates reduce accuracy. Similarly, existing literature in differential privacy evaluates privacy mechanisms in the absence of compression, and demonstrates that sufficiently strong privacy guarantees reduce accuracy. In this work, we observe that while gradient compression generally has a negative impact on test accuracy in non-private training, it can sometimes improve test accuracy in differentially private training. Specifically, we observe that when employing aggressive sparsification or rank reduction to the gradients, test accuracy is less affected by the Gaussian noise added for differential privacy. These observations are explained through an analysis of how differential privacy and compression affect the bias and variance in estimating the average gradient. We follow this study with a recommendation on how to improve test accuracy under the context of differentially private deep learning and gradient compression. We evaluate this proposal and find that it can reduce the negative impact of noise added by differential privacy mechanisms on test accuracy by up to 24.6%, and reduce the negative impact of gradient sparsification on test accuracy by up to 15.1%.
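The two ingredients studied above can be sketched on a single gradient vector. The sketch assumes L2 clipping followed by Gaussian noise (the Gaussian mechanism as commonly used in private training) and then top-k sparsification as the compressor. The ordering, function names, and parameters are illustrative assumptions, not the paper's exact procedure or its recommendation.

```python
import math
import random


def clip(grad: list, max_norm: float) -> list:
    # Scale the gradient so its L2 norm is at most max_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def gaussian_mechanism(grad: list, max_norm: float, noise_multiplier: float,
                       rng: random.Random) -> list:
    # Clip, then add zero-mean Gaussian noise with standard deviation
    # noise_multiplier * max_norm to every coordinate.
    sigma = noise_multiplier * max_norm
    return [g + rng.gauss(0.0, sigma) for g in clip(grad, max_norm)]


def top_k_sparsify(grad: list, k: int) -> list:
    # Keep the k largest-magnitude coordinates and zero out the rest.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


rng = random.Random(0)
gradient = [0.1, -5.0, 2.0, 0.3]
private = gaussian_mechanism(gradient, max_norm=1.0, noise_multiplier=0.5, rng=rng)
compressed = top_k_sparsify(private, k=2)
print(compressed)
```

Sparsification discards coordinates and therefore introduces bias, while the added noise contributes variance; the abstract's analysis concerns how these two effects combine in the estimate of the average gradient.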
MS MARCO: Benchmarking Ranking Models in the Large-Data Regime
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Campos, Daniel, Lin, Jimmy
Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and raise questions about the internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with steps that might be required to get us there.
Investigating the Limitations of Transformers with Simple Arithmetic Tasks
Nogueira, Rodrigo, Jiang, Zhiying, Lin, Jimmy
The ability to perform arithmetic tasks is a remarkable trait of human intelligence and might form a critical component of more complex reasoning tasks. In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values. We find that how a number is represented in its surface form has a strong influence on the model's accuracy. In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., "32"), and it struggles to learn with character-level representations (e.g., "3 2"). By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation. This result bolsters evidence that subword tokenizers and positional encodings are components in current transformer designs that might need improvement. Moreover, we show that regardless of the number of parameters and training examples, models cannot seem to learn addition rules that are independent of the length of the numbers seen during training.
Abstraction and composition are two important themes in the study of human languages, made possible by different linguistic representations. Although treatments in different linguistic traditions vary, representations at the lexical, syntactic, and semantic levels are a common feature in nearly all theoretical studies of human language, and until relatively recently, these representations were explicitly "materialized" in language processing pipelines (for example, semantic role labeling takes as input a syntactic parse).
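The surface forms compared in the arithmetic study above can be generated with a few lines of code. The sketch follows the abstract's examples ("3 2" and "3 10e1 2"), so by default the units digit carries no position token. Whether a trailing "10e0" is used is not stated in the abstract, so the `mark_units` flag and the function names are assumptions.

```python
def to_characters(n: int) -> str:
    # Character-level form: digits separated by spaces, e.g. 32 -> "3 2".
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    return " ".join(str(n))


def to_position_tokens(n: int, mark_units: bool = False) -> str:
    # Position-token form: each digit is followed by a token naming its
    # power of ten, e.g. 32 -> "3 10e1 2".
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    digits = str(n)
    parts = []
    for i, digit in enumerate(digits):
        exponent = len(digits) - 1 - i
        parts.append(digit)
        if exponent > 0 or mark_units:
            parts.append(f"10e{exponent}")
    return " ".join(parts)


print(to_characters(32))                        # 3 2
print(to_position_tokens(32))                   # 3 10e1 2
print(to_position_tokens(32, mark_units=True))  # 3 10e1 2 10e0
```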