Roukos, Salim
Granite Embedding Models
Awasthy, Parul, Trivedi, Aashka, Li, Yulong, Bornea, Mihaela, Cox, David, Daniels, Abraham, Franz, Martin, Goodhart, Gabe, Iyer, Bhavani, Kumar, Vishwajeet, Lastras, Luis, McCarley, Scott, Murthy, Rudra, P, Vignesh, Rosenthal, Sara, Roukos, Salim, Sen, Jaydeep, Sharma, Sukriti, Sil, Avirup, Soule, Kate, Sultan, Arafat, Florian, Radu
We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse-retrieval architectures, with both English and multilingual capabilities. This report provides the technical details of training these highly effective 12-layer embedding models, along with their efficient 6-layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging, significantly outperform publicly available models of similar sizes on internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use, at https://huggingface.co/collections/ibm-granite.

Figure 1: Average performance of the Granite Embedding models (in blue) vs. BGE, GTE, Snowflake, E5, and Nomic models on 5 QA and IR datasets: BEIR, ClapNQ, CoIR, RedHat, and UnifiedSearch (the last 2 are internal IBM datasets).

The goal of text embedding models is to convert variable-length text into a fixed vector, encoding the text semantics into a multidimensional vector in such a way that semantically close texts are close in the vector space, while dissimilar texts have a low similarity. These embeddings can then be used in a variety of tasks, most commonly in retrieval applications, where the relevance of a document to a given query can be determined by the similarity of their embeddings (Dunn et al., 2017; Xiong et al., 2020; Neelakantan et al., 2022; Zamani et al., 2018; Zhao et al., 2020), but also in document clustering (Angelov, 2020) and text classification (Sun et al., 2019).
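As a minimal sketch of that retrieval use case, the snippet below embeds a query and two passages and ranks the passages by cosine similarity. It assumes the Hugging Face transformers library, a hypothetical Granite Embedding checkpoint id, and CLS pooling; substitute the actual model id and pooling strategy from the released model cards.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ibm-granite/granite-embedding-30m-english"  # assumed checkpoint id; see the collection linked above
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts):
    # Tokenize, encode, and pool to a single fixed-size vector per input text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (batch, seq_len, dim)
    pooled = hidden[:, 0]                               # CLS pooling (assumed; check the model card)
    return torch.nn.functional.normalize(pooled, dim=-1)

query = embed(["How do I reset my account password?"])
passages = embed(["Password reset instructions for enterprise accounts.",
                  "Quarterly revenue summary for the retail division."])
scores = query @ passages.T                             # cosine similarity, since vectors are unit length
print(scores)                                           # the first passage should score higher for this query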
Graph-based Uncertainty Metrics for Long-form Language Model Outputs
Jiang, Mingjian, Ruan, Yangjun, Sattigeri, Prasanna, Roukos, Salim, Hashimoto, Tatsunori
Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.
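As a toy illustration of the bipartite-graph view (a sketch using networkx, not the authors' code; the hand-written edges stand in for NLI-style support judgements between sampled generations and extracted claims):

import networkx as nx

# Generation nodes g1..g3 are sampled responses; claim nodes c1..c3 are claims extracted from them.
# An edge (g, c) means generation g supports claim c.
edges = [("g1", "c1"), ("g2", "c1"), ("g3", "c1"),   # c1 is supported by every sample
         ("g1", "c2"), ("g2", "c2"),                 # c2 by two samples
         ("g3", "c3")]                               # c3 by a single sample
G = nx.Graph(edges)
claims = ["c1", "c2", "c3"]

# Degree centrality of a claim node recovers a self-consistency-style score
# (the fraction of sampled generations that support the claim).
degree_conf = {c: G.degree(c) / 3 for c in claims}
# Closeness centrality is one of the more sophisticated alternatives discussed above.
closeness_conf = {c: nx.closeness_centrality(G, c) for c in claims}
print("degree-based confidence:   ", degree_conf)
print("closeness-based confidence:", closeness_conf)
# Uncertainty-aware decoding would then keep only claims whose confidence clears a chosen threshold.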
Retrieval Augmented Generation-Based Incident Resolution Recommendation System for IT Support
Isaza, Paulina Toro, Nidd, Michael, Zheutlin, Noah, Ahn, Jae-wook, Bhatt, Chidansh Amitkumar, Deng, Yu, Mahindru, Ruchi, Franz, Martin, Florian, Hans, Roukos, Salim
Clients wishing to implement generative AI in the domain of IT Support and AIOps face two critical issues: domain coverage and model size constraints due to model choice limitations. Clients might choose to not use larger proprietary models such as GPT-4 due to cost and privacy concerns and so are limited to smaller models with potentially less domain coverage that do not generalize to the client's domain. Retrieval augmented generation is a common solution that addresses both of these issues: a retrieval system first retrieves the necessary domain knowledge which a smaller generative model leverages as context for generation. We present a system developed for a client in the IT Support domain for support case solution recommendation that combines retrieval augmented generation (RAG) for answer generation with an encoder-only model for classification and a generative large language model for query generation. We cover architecture details, data collection and annotation, development journey and preliminary validations, expected final deployment process and evaluation plans, and finally lessons learned.
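The sketch below outlines that three-component flow in schematic form (illustrative only; the classifier labels, prompts, and the toy lexical retriever are assumptions, not the client system's actual components):

def recommend_resolution(case_text, corpus, classifier, llm, top_k=2):
    # 1) Encoder-only classifier decides whether the case can be answered from the knowledge base.
    if classifier(case_text) != "answerable_from_kb":
        return "Route the case to a human agent."
    # 2) Generative LLM rewrites the (often noisy) case text into a focused search query.
    query = llm("Write a short search query for this support case:\n" + case_text)
    # 3) Toy lexical retriever: rank knowledge-base documents by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)
    context = "\n".join(ranked[:top_k])
    # 4) RAG step: the smaller generative model proposes a resolution grounded in the retrieved context.
    return llm(f"Context:\n{context}\n\nSuggest a resolution for this case:\n{case_text}")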
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Abdelaziz, Ibrahim, Basu, Kinjal, Agarwal, Mayank, Kumaravel, Sadhana, Stallone, Matthew, Panda, Rameswar, Rizk, Yara, Bhargav, GP, Crouse, Maxwell, Gunasekara, Chulaka, Ikbal, Shajith, Joshi, Sachin, Karanam, Hima, Kumar, Vineet, Munawar, Asim, Neelam, Sumit, Raghu, Dinesh, Sharma, Udit, Soria, Adriana Meza, Sreedhar, Dheeraj, Venkateswaran, Praveen, Unuvar, Merve, Cox, David, Roukos, Salim, Lastras, Luis, Kapanipathi, Pavan
Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application programming interfaces (APIs) to complete complex tasks. These tasks are collectively termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets, comparing GRANITE-20B-FUNCTIONCALLING to more than 15 of the best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and ranks fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING generalizes better across multiple tasks in seven different evaluation datasets.
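As a small, hypothetical illustration of the dispatch side of function calling (the actual Granite prompt and output format are not given in the abstract; a generic one-JSON-object-per-call convention is assumed here):

import json

# Hypothetical tool catalogue; real deployments would expose validated, sandboxed tools.
TOOLS = {
    "get_weather": lambda city: f"22C and sunny in {city}",
    "calculator": lambda expression: str(eval(expression)),  # toy only: eval is unsafe for untrusted input
}

def dispatch(model_output: str) -> str:
    # Assume the model emits {"name": ..., "arguments": {...}} for each call it decides to make.
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

# Parallel functions: two independent calls produced for a single user turn.
calls = ['{"name": "get_weather", "arguments": {"city": "Yorktown Heights"}}',
         '{"name": "calculator", "arguments": {"expression": "7 * 6"}}']
print([dispatch(c) for c in calls])   # ['22C and sunny in Yorktown Heights', '42']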
Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels
Xian, Jasper, Samuel, Saron, Khoubsirat, Faraz, Pradeep, Ronak, Sultan, Md Arafat, Florian, Radu, Roukos, Salim, Sil, Avirup, Potts, Christopher, Khattab, Omar
We develop a method for training small-scale (under 100M parameter) neural information retrieval models with as few as 10 gold relevance labels. The method depends on generating synthetic queries for documents using a language model (LM), and the key step is that we automatically optimize the LM prompt that is used to generate these queries based on training quality. In experiments with the BIRCO benchmark, we find that models trained with our method outperform RankZephyr and are competitive with RankLLama, both of which are 7B-parameter models trained on over 100K labels. These findings point to the power of automatic prompt optimization.

Figure 1: An overview of the PATH pipeline for training a reranker with synthetic queries. A user only needs to input a prompt with the task description and as few as 10 relevance judgements to achieve strong results.
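A schematic sketch of the "prompts as hyperparameters" idea (not the authors' PATH implementation; the LM, the reranker trainer, and the scoring convention are left as user-supplied callables and are assumptions here):

def synthesize_pairs(prompt, documents, lm):
    # lm(prompt, doc) -> a synthetic query for doc (assumed interface).
    return [(lm(prompt, doc), doc) for doc in documents]

def evaluate(reranker, gold):
    # gold: as few as ~10 (query, document, relevance_label) judgements.
    return sum(reranker(q, d) == label for q, d, label in gold) / len(gold)

def optimize_prompt(candidate_prompts, documents, gold, lm, train_reranker):
    # Treat each candidate prompt as a training hyperparameter: synthesize data with it,
    # train a small reranker, and keep the prompt whose reranker scores best on the gold labels.
    best_prompt, best_score = None, float("-inf")
    for prompt in candidate_prompts:
        reranker = train_reranker(synthesize_pairs(prompt, documents, lm))
        score = evaluate(reranker, gold)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

In the paper, the reranker is a sub-100M-parameter model and the prompt search itself is automated rather than enumerated by hand.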
Can a Multichoice Dataset be Repurposed for Extractive Question Answering?
Lynn, Teresa, Altakrori, Malik H., Magdy, Samar Mohamed, Das, Rocktim Jyoti, Lyu, Chenyang, Nasr, Mohamed, Samih, Younes, Aji, Alham Fikri, Nakov, Preslav, Godbole, Shantanu, Roukos, Salim, Florian, Radu, Habash, Nizar
The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be overstated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share our insights from the process, which we hope will contribute to a deeper understanding of the challenges and the opportunities associated with task reformulation in NLP research.
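To make the reformulation concrete, here is a minimal sketch of turning one MCQA item into an extractive QA example by locating the correct option verbatim in the passage (the field names follow the public Belebele release but are assumptions here; items whose answer is not a verbatim span are exactly the cases the annotation guidelines must handle manually):

def mcqa_to_eqa(item):
    passage = item["flores_passage"]
    answer = item[f"mc_answer{item['correct_answer_num']}"]
    start = passage.find(answer)
    if start == -1:
        return None   # answer option is not a verbatim span; requires manual annotation
    return {"question": item["question"], "context": passage,
            "answers": {"text": [answer], "answer_start": [start]}}   # SQuAD-style EQA record

item = {"flores_passage": "The Nile is generally considered the longest river in Africa.",
        "question": "Which river is the longest in Africa?",
        "mc_answer1": "The Amazon", "mc_answer2": "The Nile",
        "mc_answer3": "The Danube", "mc_answer4": "The Yangtze",
        "correct_answer_num": 2}
print(mcqa_to_eqa(item))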
CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems
Rosenthal, Sara, Sil, Avirup, Florian, Radu, Roukos, Salim
Large scale research in this area began with the tasks of Machine Reading Comprehension (Rajpurkar et al., 2016; Rogers et al., 2023; Fisch et al., 2021) and Information Retrieval (Manning et al., 2008; Voorhees and Harman, 2005; Thakur et al., 2021), and has more recently come to be known as Retrieval Augmented Generation (Lewis et al., 2021; Guu et al., 2020), which encompasses both tasks. The recent popularity of generative AI with Large Language Models (LLMs), such as GPT (Brown et al., 2020) and Llama (Touvron et al., [...]). CLAPNQ answers are longer than those of extractive QA datasets such as Natural Questions (NQ) (Kwiatkowski et al., 2019) and SQuAD (Rajpurkar et al., 2016, 2018), which are just a few words. It is grounded on a single gold passage, in contrast to other long-form question answering (LFQA) datasets such as ELI5 (Fan et al., 2019) where gold passages are not available. It is built from a subset of the highly successful Natural Questions (Kwiatkowski et al., 2019) dataset for extractive QA from Wikipedia documents based on users' real web search queries - specifically, the subset of NQ that has long answers (passages) but no short extractive answers.
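As a small sketch of how that subset could be selected from NQ-style records (the field names are assumptions based on the public NQ data format, not the authors' code):

def keep_for_long_answer_subset(example):
    # Keep questions that have an annotated long answer (a passage) but no short extractive answer.
    annotations = example["annotations"]
    has_long = any(a["long_answer"]["start_token"] != -1 for a in annotations)
    has_short = any(a["short_answers"] or a.get("yes_no_answer", "NONE") != "NONE" for a in annotations)
    return has_long and not has_short

example = {"annotations": [{"long_answer": {"start_token": 10, "end_token": 80},
                            "short_answers": [], "yes_no_answer": "NONE"}]}
print(keep_for_long_answer_subset(example))   # True: long answer present, no short answer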
Self-Refinement of Language Models from External Proxy Metrics Feedback
Ramji, Keshav, Lee, Young-Suk, Astudillo, Ramón Fernandez, Sultan, Md Arafat, Naseem, Tahira, Munawar, Asim, Florian, Radu, Roukos, Salim
It is often desirable for Large Language Models (LLMs) to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics, and iteratively refines its response one principle at a time. We apply ProMiSe to open source language models Flan-T5-XXL and Llama-2-13B-Chat, to evaluate its performance on document-grounded question answering datasets, MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over the zero-shot baseline as well as a supervised fine-tuned model on human annotated data.
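A minimal sketch of the principle-at-a-time refinement loop (illustrative only; the prompts, proxy metrics, and thresholds below are placeholders, not the ProMiSe prompts or metrics):

def refine_with_proxy_metrics(question, document, llm, metrics, thresholds, max_rounds=3):
    # metrics: {principle: callable(response, question, document) -> score in [0, 1]} (user-supplied).
    response = llm(f"Answer the question using the document.\nDocument: {document}\nQuestion: {question}")
    for _ in range(max_rounds):
        for principle, metric in metrics.items():
            if metric(response, question, document) < thresholds[principle]:
                # Refine along a single principle at a time, guided by the external metric's feedback.
                response = llm(f"Revise the answer to improve its {principle}.\n"
                               f"Document: {document}\nQuestion: {question}\nCurrent answer: {response}")
    return response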
Formally Specifying the High-Level Behavior of LLM-Based Agents
Crouse, Maxwell, Abdelaziz, Ibrahim, Astudillo, Ramon, Basu, Kinjal, Dan, Soham, Kumaravel, Sadhana, Fokoue, Achille, Kapanipathi, Pavan, Roukos, Salim, Lastras, Luis
Autonomous, goal-driven agents powered by LLMs have recently emerged as promising tools for solving challenging problems without the need for task-specific finetuned models that can be expensive to procure. Currently, the design and implementation of such agents is ad hoc, as the wide variety of tasks that LLM-based agents may be applied to naturally means there can be no one-size-fits-all approach to agent design. In this work we aim to alleviate the difficulty of designing and implementing new agents by proposing a minimalistic generation framework that simplifies the process of building agents. The framework we introduce allows the user to define desired agent behaviors in a high-level, declarative specification that is then used to construct a decoding monitor which guarantees the LLM will produce an output exhibiting the desired behavior. Our declarative approach, in which the behavior is described without concern for how it should be implemented or enforced, enables rapid design, implementation, and experimentation with different LLM-based agents. We demonstrate how the proposed framework can be used to implement recent LLM-based agents (e.g., ReACT), and show how the flexibility of our approach can be leveraged to define a new agent with more complex behavior, the Plan-Act-Summarize-Solve (PASS) agent. Lastly, we demonstrate that our method outperforms other agents on multiple popular reasoning-centric question-answering benchmarks.
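As a toy illustration of the idea (not the paper's framework): the desired high-level behavior is written as a declarative set of allowed step transitions, and a monitor checks each step of an agent trace against it. In the actual system the monitor constrains decoding as the LLM generates, rather than validating a finished trace.

# ReAct-like behavior declared as the allowed successors of each step type.
SPEC = {
    "start": {"Thought"},
    "Thought": {"Action"},
    "Action": {"Observation"},
    "Observation": {"Thought", "Answer"},
}

def check_trace(steps):
    state = "start"
    for step_type, _content in steps:
        if step_type not in SPEC[state]:
            raise ValueError(f"Spec violation: {step_type!r} may not follow {state!r}")
        state = step_type
    return True

trace = [("Thought", "I need the population of Paris."),
         ("Action", "search('population of Paris')"),
         ("Observation", "About 2.1 million."),
         ("Answer", "Roughly 2.1 million people.")]
print(check_trace(trace))   # True: the trace satisfies the declared behavior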
Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs
Lee, Young-Suk, Sultan, Md Arafat, El-Kurdi, Yousef, Naseem, Tahira, Munawar, Asim, Florian, Radu, Roukos, Salim, Astudillo, Ramón Fernandez
Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B--40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) Categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) Ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves the performance of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful outputs than their larger un-tuned counterparts. Our codebase is available at https://github.com/IBM/ensemble-instruct.
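The following sketch illustrates the output-ensembling step in simplified form (a toy consensus filter using token-level F1 agreement; the actual Ensemble-Instruct selection criteria are in the codebase linked above):

def token_f1(a, b):
    # Token-level F1 overlap between two candidate outputs, after light normalization.
    ta = [w.strip(".,!?") for w in a.lower().split()]
    tb = [w.strip(".,!?") for w in b.lower().split()]
    common = sum(min(ta.count(w), tb.count(w)) for w in set(ta) & set(tb))
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def select_by_consensus(candidates, threshold=0.4):
    # Keep the candidate that agrees most with the other LMs' outputs; drop the example entirely
    # if even the best candidate falls below the agreement threshold.
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = candidates[:i] + candidates[i + 1:]
        score = sum(token_f1(cand, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None

outputs = ["The capital of France is Paris.",
           "Paris is the capital of France.",
           "I am not sure, possibly Lyon."]
print(select_by_consensus(outputs))   # keeps one of the two agreeing answers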