Padhi, Inkit
Granite Guardian
Padhi, Inkit, Nagireddy, Manish, Cornacchia, Giandomenico, Chaudhury, Subhajit, Pedapati, Tejaswini, Dognin, Pierre, Murugesan, Keerthiram, Miehling, Erik, Cooper, Martín Santillán, Fraser, Kieran, Zizzo, Giulio, Hameed, Muhammad Zaid, Purcell, Mark, Desmond, Michael, Pan, Qian, Ashktorab, Zahra, Vejsbjerg, Inge, Daly, Elizabeth M., Hind, Michael, Geyer, Werner, Rawat, Ambrish, Varshney, Kush R., Sattigeri, Prasanna
We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community.
Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods
Wei, Dennis, Padhi, Inkit, Ghosh, Soumya, Dhurandhar, Amit, Ramamurthy, Karthikeyan Natesan, Chang, Maria
Training data attribution (TDA) is the task of attributing model behavior to elements in the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training. To serve as a gold standard for TDA in this "final-model-only" setting, we propose further training, with appropriate adjustment and averaging, to measure the sensitivity of the given model to training instances. We then unify existing gradient-based methods for TDA by showing that they all approximate the further training gold standard in different ways. We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.
When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails
Nagireddy, Manish, Padhi, Inkit, Ghosh, Soumya, Sattigeri, Prasanna
Large language models (LLMs) have convincing performance in a variety of downstream tasks. However, these systems are prone to generating undesirable outputs such as harmful and biased text. In order to remedy such generations, the development of guardrail (or detector) models has gained traction. Motivated by findings from developing a detector for social bias, we adopt the notion of a use-mention distinction - which we identified as the primary source of under-performance in the preliminary versions of our social bias detector. Armed with this information, we describe a fully extensible and reproducible synthetic data generation pipeline which leverages taxonomy-driven instructions to create targeted and labeled data. Using this pipeline, we generate over 300K unique contrastive samples and provide extensive experiments to systematically evaluate performance on a suite of open source datasets. We show that our method achieves competitive performance with a fraction of the cost in compute and offers insight into iteratively developing efficient and capable guardrail models. Warning: This paper contains examples of text which are toxic, biased, and potentially harmful.
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
Hou, Yufang, Pascale, Alessandra, Carnerero-Cano, Javier, Tchrakian, Tigran, Marinescu, Radu, Daly, Elizabeth, Padhi, Inkit, Sattigeri, Prasanna
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.
Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
Kadhe, Swanand Ravindra, Ahmed, Farhan, Wei, Dennis, Baracaldo, Nathalie, Padhi, Inkit
Large language models (LLMs) have shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
Achintalwar, Swapnaja, Garcia, Adriana Alvarado, Anaby-Tavor, Ateret, Baldini, Ioana, Berger, Sara E., Bhattacharjee, Bishwaranjan, Bouneffouf, Djallel, Chaudhury, Subhajit, Chen, Pin-Yu, Chiazor, Lamogha, Daly, Elizabeth M., DB, Kirushikesh, de Paula, Rogério Abreu, Dognin, Pierre, Farchi, Eitan, Ghosh, Soumya, Hind, Michael, Horesh, Raya, Kour, George, Lee, Ja Young, Madaan, Nishtha, Mehta, Sameep, Miehling, Erik, Murugesan, Keerthiram, Nagireddy, Manish, Padhi, Inkit, Piorkowski, David, Rawat, Ambrish, Raz, Orna, Sattigeri, Prasanna, Strobelt, Hendrik, Swaminathan, Sarathkrishna, Tillmann, Christoph, Trivedi, Aashka, Varshney, Kush R., Wei, Dennis, Witherspooon, Shalisha, Zalmanovici, Marcel
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
Contextual Moral Value Alignment Through Context-Based Aggregation
Dognin, Pierre, Rios, Jesus, Luss, Ronny, Padhi, Inkit, Riemer, Matthew D, Liu, Miao, Sattigeri, Prasanna, Nagireddy, Manish, Varshney, Kush R., Bouneffouf, Djallel
Developing value-aligned AI agents is a complex undertaking and an ongoing challenge in the field of AI. Specifically within the domain of Large Language Models (LLMs), the capability to consolidate multiple independently trained dialogue agents, each aligned with a distinct moral value, into a unified system that can adapt to and be aligned with multiple moral values is of paramount importance. In this paper, we propose a system that does contextual moral value alignment based on contextual aggregation. Here, aggregation is defined as the process of integrating a subset of LLM responses that are best suited to respond to a user input, taking into account features extracted from the user's input. The proposed system shows better results in term of alignment to human value compared to the state of the art.
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
Achintalwar, Swapnaja, Baldini, Ioana, Bouneffouf, Djallel, Byamugisha, Joan, Chang, Maria, Dognin, Pierre, Farchi, Eitan, Makondo, Ndivhuwo, Mojsilovic, Aleksandra, Nagireddy, Manish, Ramamurthy, Karthikeyan Natesan, Padhi, Inkit, Raz, Orna, Rios, Jesus, Sattigeri, Prasanna, Singh, Moninder, Thwala, Siphiwe, Uceda-Sosa, Rosario A., Varshney, Kush R.
The alignment of large language models is usually done by model providers to add or control behaviors that are common or universally understood across use cases and contexts. In contrast, in this article, we present an approach and architecture that empowers application developers to tune a model to their particular values, social norms, laws and other regulations, and orchestrate between potentially conflicting requirements in context. We lay out three main components of such an Alignment Studio architecture: Framers, Instructors, and Auditors that work in concert to control the behavior of a language model. We illustrate this approach with a running example of aligning a company's internal-facing enterprise chatbot to its business conduct guidelines.
Auditing and Generating Synthetic Data with Controllable Trust Trade-offs
Belgodere, Brian, Dognin, Pierre, Ivankay, Adam, Melnyk, Igor, Mroueh, Youssef, Mojsilovic, Aleksandra, Navratil, Jiri, Nitsure, Apoorva, Padhi, Inkit, Rigotti, Mattia, Ross, Jerret, Schiff, Yair, Vedpathak, Radhika, Young, Richard A.
Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues. This paradigm relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensures fidelity to the source data, assesses utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguards trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with "TrustFormers" across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.
The Impact of Positional Encoding on Length Generalization in Transformers
Kazemnejad, Amirhossein, Padhi, Inkit, Ramamurthy, Karthikeyan Natesan, Das, Payel, Reddy, Siva
Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.