Rosenthal, Sara
Granite Embedding Models
Awasthy, Parul, Trivedi, Aashka, Li, Yulong, Bornea, Mihaela, Cox, David, Daniels, Abraham, Franz, Martin, Goodhart, Gabe, Iyer, Bhavani, Kumar, Vishwajeet, Lastras, Luis, McCarley, Scott, Murthy, Rudra, P, Vignesh, Rosenthal, Sara, Roukos, Salim, Sen, Jaydeep, Sharma, Sukriti, Sil, Avirup, Soule, Kate, Sultan, Arafat, Florian, Radu
We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse-retrieval architectures, with both English and multilingual capabilities. This report provides the technical details of training these highly effective 12-layer embedding models, along with their efficient 6-layer distilled counterparts. Extensive evaluations show that the models, developed with techniques such as retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging, significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use, at https://huggingface.co/collections/ibm-granite.

Figure 1: Average performance of the Granite Embedding models (in blue) vs. BGE, GTE, Snowflake, E5, and Nomic models on 5 QA and IR datasets: BEIR, ClapNQ, CoIR, RedHat, and UnifiedSearch (the last 2 are internal IBM datasets).

The goal of text embedding models is to convert variable-length text into a fixed vector, encoding the text semantics into a multidimensional vector such that semantically close texts are close in the vector space, while dissimilar texts have a low similarity. These embeddings can then be used in a variety of tasks, most commonly in retrieval applications, where the relevance of a document to a given query can be determined by the similarity of their embeddings (Dunn et al., 2017; Xiong et al., 2020; Neelakantan et al., 2022; Zamani et al., 2018; Zhao et al., 2020), but also in document clustering (Angelov, 2020) and text classification (Sun et al., 2019).
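As a concrete illustration of the dense-retrieval use described above, the sketch below embeds a query and a few documents with one of the released models and ranks the documents by cosine similarity. The model ID ibm-granite/granite-embedding-125m-english is taken from the public collection linked above, and the SentenceTransformer loading path is an assumption for illustration, not the training or evaluation setup from the report.

```python
# Minimal dense-retrieval sketch: rank documents by cosine similarity of their
# embeddings to a query embedding. The model ID is assumed from the public
# ibm-granite Hugging Face collection; any sentence-embedding model works here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

query = "How do I reset my account password?"
documents = [
    "To reset your password, open Settings and choose 'Forgot password'.",
    "Our quarterly revenue grew by 12% year over year.",
    "Password resets require access to the email address on file.",
]

# Encode query and documents into fixed-size vectors (normalized for cosine similarity).
query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)

# Cosine similarity between the query and each document.
scores = util.cos_sim(query_emb, doc_embs)[0]

# Print documents from most to least relevant.
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```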
MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Katsis, Yannis, Rosenthal, Sara, Fadnis, Kshitij, Gunasekara, Chulaka, Lee, Young-Suk, Popa, Lucian, Shah, Vraj, Zhu, Huaiyu, Contractor, Danish, Danilevsky, Marina
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating LLMs on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task that poses several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains, for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
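As an illustration of the multi-turn setting MTRAG evaluates, the sketch below builds the retrieval query for a later turn from the preceding user turns, so that non-standalone questions still retrieve useful context. This is a minimal sketch of the task setup, not the MTRAG harness; the toy keyword retriever and the example conversation are assumptions for illustration.

```python
# Illustrative sketch of multi-turn RAG: the retrieval query for a later turn
# is built from earlier user turns so that a non-standalone question such as
# "How long until I get my money back?" still retrieves a relevant passage.
# This is NOT the MTRAG harness; the keyword retriever below is a toy stand-in.
from typing import Dict, List

CORPUS = [
    "You can cancel a subscription from the Billing page of your account.",
    "Refunds are issued within 5-7 business days, so you get your money back after cancellation.",
    "Two-factor authentication can be enabled under Security settings.",
]

def build_retrieval_query(conversation: List[Dict[str, str]], max_turns: int = 3) -> str:
    """Concatenate the most recent user turns into a more standalone query."""
    user_turns = [t["text"] for t in conversation if t["role"] == "user"]
    return " ".join(user_turns[-max_turns:])

def keyword_retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(q_words & set(p.lower().split())))[:k]

conversation = [
    {"role": "user", "text": "How do I cancel my subscription?"},
    {"role": "agent", "text": "Go to the Billing page and choose Cancel."},
    {"role": "user", "text": "How long until I get my money back?"},  # non-standalone turn
]

query = build_retrieval_query(conversation)
for passage in keyword_retrieve(query, CORPUS):
    print(passage)  # retrieved context that a generator would condition on
```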
CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems
Rosenthal, Sara, Sil, Avirup, Florian, Radu, Roukos, Salim
Large scale research in this area began with the tasks of Machine Reading Comprehension (Rajpurkar et al., 2016; Rogers et al., 2023; Fisch et al., 2021), and Information Retrieval (Manning et al., 2008; Voorhees and Harman, 2005; Thakur et al., 2021) and has more recently come to be known as Retrieval Augmented Generation (Lewis et al., 2021; Guu et al., 2020) which encompasses both tasks. The recent popularity of generative AI with Large Language models (LLM), such as GPT (Brown et al., 2020), Llama (Touvron et al., [...]

[...] (NQ) (Kwiatkowski et al., 2019) and SQuAD (Rajpurkar et al., 2016, 2018), which are just a few words. It is grounded on a single gold passage, in contrast to other long-form question answering (LFQA) datasets such as ELI5 (Fan et al., 2019) where gold passages are not available. It is built from a subset of the highly successful Natural Questions (Kwiatkowski et al., 2019) dataset for extractive QA from Wikipedia documents based on users' real web search queries - specifically, the subset of NQ that has long answers (passages) but no short extractive answers.
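The selection criterion described above (keeping NQ questions that have a long-answer passage but no short extractive answer) can be sketched as a simple filter. The record layout and field names below are simplified assumptions for illustration, not the actual CLAPNQ construction code.

```python
# Hedged sketch of the selection criterion: keep questions whose annotation has
# a long answer (gold passage) but no short extractive answer. The record
# layout is a simplified, hypothetical view of NQ annotations.
from typing import Dict, List

def has_long_but_no_short_answer(example: Dict) -> bool:
    annotation = example["annotation"]
    has_long = annotation.get("long_answer_passage") is not None
    has_short = len(annotation.get("short_answers", [])) > 0
    return has_long and not has_short

nq_examples: List[Dict] = [
    {"question": "who wrote the declaration of independence",
     "annotation": {"long_answer_passage": "The Declaration was drafted by ...",
                    "short_answers": ["Thomas Jefferson"]}},      # has a short answer -> dropped
    {"question": "why is the sky blue",
     "annotation": {"long_answer_passage": "Sunlight is scattered by molecules ...",
                    "short_answers": []}},                        # long answer only -> kept
]

clapnq_candidates = [ex for ex in nq_examples if has_long_but_no_short_answer(ex)]
print([ex["question"] for ex in clapnq_candidates])
```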
Muted: Multilingual Targeted Offensive Speech Identification and Visualization
Tillmann, Christoph, Trivedi, Aashka, Rosenthal, Sara, Borse, Santosh, Zhang, Rong, Sil, Avirup, Bhattacharjee, Bishwaranjan
Offensive language such as hate, abuse, and profanity (HAP) occurs in various content on the web. While previous work has mostly dealt with sentence-level annotations, there have been a few recent attempts to identify offensive spans as well. We build upon this work and introduce Muted, a system to identify multilingual HAP content by displaying offensive arguments and their targets using heatmaps to indicate their intensity. Muted can leverage any transformer-based HAP-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. In addition, we use the spaCy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. We present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text. Finally, we demonstrate our proposed visualization tool on multilingual inputs.
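A minimal sketch of the attention-based span idea: run a transformer toxicity classifier once with attentions enabled, read how much attention the [CLS] token pays to each word, and flag high-attention words as candidate offensive spans. The stand-in model unitary/toxic-bert and the threshold are assumptions for illustration; the paper's HAP models and its spaCy-based target extraction are not reproduced here.

```python
# Sketch of the attention-heatmap idea: use a classifier's own attention,
# without fine-tuning, to surface candidate offensive words.
# "unitary/toxic-bert" is a public toxicity classifier used as a stand-in for
# the paper's HAP models; the threshold is an illustrative choice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "unitary/toxic-bert"  # stand-in classifier, not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "You are a complete idiot and nobody likes you."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last-layer attention averaged over heads: shape (seq_len, seq_len).
attn = outputs.attentions[-1][0].mean(dim=0)
cls_to_tokens = attn[0]  # attention from [CLS] to every token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
threshold = cls_to_tokens.mean() * 1.5  # simple illustrative cutoff
flagged = [tok for tok, score in zip(tokens, cls_to_tokens)
           if score > threshold and tok not in ("[CLS]", "[SEP]")]
print(flagged)  # candidate offensive span tokens
```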
PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development
Sil, Avirup, Sen, Jaydeep, Iyer, Bhavani, Franz, Martin, Fadnis, Kshitij, Bornea, Mihaela, Rosenthal, Sara, McCarley, Scott, Zhang, Rong, Kumar, Vishwajeet, Li, Yulong, Sultan, Md Arafat, Bhat, Riyaz, Florian, Radu, Roukos, Salim
The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PRIMEQA is available at https://github.com/primeqa.
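PRIMEQA's own APIs are not reproduced here; as a generic illustration of the retriever-plus-reader pattern the toolkit packages, the sketch below pairs a toy word-overlap retriever with an off-the-shelf Hugging Face extractive-QA pipeline. The reader model and corpus are assumptions for illustration.

```python
# Generic retriever + reader illustration of the pattern PRIMEQA packages;
# this uses a standard Hugging Face QA pipeline, NOT PRIMEQA's own API.
from transformers import pipeline

# Reader: an off-the-shelf extractive QA model (assumed, for illustration).
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

corpus = [
    "PrimeQA is an open-source repository for question answering research.",
    "The Eiffel Tower was completed in 1889 and is located in Paris.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(question: str, docs, k: int = 1):
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))[:k]

question = "When was the Eiffel Tower completed?"
context = " ".join(retrieve(question, corpus))
answer = reader(question=question, context=context)
print(answer["answer"], answer["score"])  # extracted span and reader confidence
```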