Melnyk, Igor
EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts
Chaudhury, Subhajit, Das, Payel, Swaminathan, Sarathkrishna, Kollias, Georgios, Nelson, Elliot, Pahwa, Khushbu, Pedapati, Tejaswini, Melnyk, Igor, Riemer, Matthew
Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce \textbf{EpMAN} -- a method for processing long contexts in an \textit{episodic memory} module while \textit{holistically attending to} semantically relevant context chunks. The output of \textit{episodic attention} is then used to reweight the decoder's self-attention over the stored KV cache of the context during training and generation. When an LLM decoder is trained with \textbf{EpMAN}, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is stronger and more robust across the 16k-to-256k-token range than that of baseline decoders trained with self-attention and of popular retrieval-augmented generation frameworks.
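As a rough illustration of the reweighting step described above, the sketch below applies chunk-level relevance weights to a single query's attention over a cached context. The function names, the multiplicative form of the reweighting, and the renormalization are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def epman_attention(q, K, V, chunk_ids, chunk_relevance):
    """Single query attending to cached keys/values; the attention weights
    are reweighted by the episodic relevance of the chunk each key belongs
    to (a hypothetical multiplicative form), then renormalized."""
    scores = K @ q / np.sqrt(q.shape[-1])   # (T,) scaled dot-product scores
    w = softmax(scores)                     # standard self-attention weights
    w = w * chunk_relevance[chunk_ids]      # episodic reweighting per chunk
    w = w / w.sum()                         # renormalize to a distribution
    return w @ V                            # weighted sum of cached values
```

If a chunk's relevance is zero, its keys are masked out entirely; with uniform relevance the sketch reduces to ordinary attention over the KV cache.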
Larimar: Large Language Models with Episodic Memory Control
Das, Payel, Chaudhury, Subhajit, Nelson, Elliot, Melnyk, Igor, Swaminathan, Sarath, Dai, Sihui, Lozano, Aurélie, Kollias, Georgios, Chenthamarakshan, Vijil, Navrátil, Jiří, Dan, Soham, Chen, Pin-Yu
Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar not only attains accuracy comparable to that of the most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 8-10x depending on the base LLM - as well as in flexibility, since the proposed architecture is simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting, information leakage prevention, and input context length generalization with Larimar and show their effectiveness. Our code is available at https://github.com/IBM/larimar
Distributional Preference Alignment of LLMs via Optimal Transport
Melnyk, Igor, Mroueh, Youssef, Belgodere, Brian, Rigotti, Mattia, Nitsure, Apoorva, Yurochkin, Mikhail, Greenewald, Kristjan, Navratil, Jiri, Ross, Jerret
Current LLM alignment techniques use pairwise human preferences at a sample level, and as such, they do not imply an alignment on the distributional level. We propose in this paper Alignment via Optimal Transport (AOT), a novel method for distributional preference alignment of LLMs. AOT aligns LLMs on unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order over the distribution of negative samples. We introduce a convex relaxation of this first-order stochastic dominance and cast it as an optimal transport problem with a smooth and convex cost. Thanks to the one-dimensional nature of the resulting optimal transport problem and the convexity of the cost, it has a closed-form solution via sorting on empirical measures. We fine-tune LLMs with this AOT objective, which enables alignment by penalizing violations of the stochastic dominance of the reward distribution of the positive samples over the reward distribution of the negative samples. We analyze the sample complexity of AOT by considering the dual of the OT problem and show that it converges at the parametric rate. Empirically, we show on a diverse set of alignment datasets and LLMs that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval.
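The closed-form solution via sorting can be made concrete with a small sketch. Here `aot_penalty`, the equal-sample-size assumption, and the hinge form of the relaxation are illustrative choices, not the paper's exact objective: sorting both reward samples aligns their empirical quantiles, and first-order stochastic dominance of positives over negatives is violated exactly where a sorted positive falls below the matching sorted negative.

```python
import numpy as np

def aot_penalty(pos_rewards, neg_rewards, margin=0.0):
    """One-dimensional OT between sorted empirical measures: penalize
    quantile-wise violations of first-order stochastic dominance of the
    positive reward distribution over the negative one (hinge relaxation)."""
    p = np.sort(np.asarray(pos_rewards, dtype=float))
    n = np.sort(np.asarray(neg_rewards, dtype=float))
    # Equal sample sizes assumed for simplicity; quantiles align after sorting.
    violations = np.maximum(margin - (p - n), 0.0)  # hinge on p_i >= n_i + margin
    return violations.mean()
```

When every positive quantile already dominates the matching negative quantile, the penalty is zero; otherwise it grows with the total quantile-wise shortfall.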
Risk Assessment and Statistical Significance in the Age of Foundation Models
Nitsure, Apoorva, Mroueh, Youssef, Rigotti, Mattia, Greenewald, Kristjan, Belgodere, Brian, Yurochkin, Mikhail, Navratil, Jiri, Melnyk, Igor, Ross, Jerret
Foundation models such as large language models (LLMs) have shown remarkable capabilities, redefining the field of artificial intelligence. At the same time, they present pressing and challenging socio-technical risks regarding the trustworthiness of their outputs and their alignment with human values and ethics [Bommasani et al., 2021]. Evaluating LLMs is therefore a multi-dimensional problem, where those risks are assessed across diverse tasks and domains [Chang et al., 2023]. In order to quantify these risks, Liang et al. [2022], Wang et al. [2023], and Huang et al. [2023] proposed benchmarks of automatic metrics for probing the trustworthiness of LLMs. These metrics include accuracy, robustness, fairness, toxicity of the outputs, etc. Human evaluation benchmarks can be even more nuanced, and are often employed when tasks surpass the scope of standard metrics. Notable benchmarks based on human and automatic evaluations include, among others, Chatbot Arena [Zheng et al., 2023], HELM [Bommasani et al., 2023], MosaicML's Eval, Open LLM Leaderboard [Wolf, 2023], and BIG-bench [Srivastava et al., 2022], each catering to specific evaluation areas such as chatbot performance, knowledge assessment, and domain-specific challenges. Traditional metrics, however, sometimes do not correlate well with human judgments.
Auditing and Generating Synthetic Data with Controllable Trust Trade-offs
Belgodere, Brian, Dognin, Pierre, Ivankay, Adam, Melnyk, Igor, Mroueh, Youssef, Mojsilovic, Aleksandra, Navratil, Jiri, Nitsure, Apoorva, Padhi, Inkit, Rigotti, Mattia, Ross, Jerret, Schiff, Yair, Vedpathak, Radhika, Young, Richard A.
Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues. This paradigm relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguard trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with "TrustFormers" across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.
AlphaFold Distillation for Protein Design
Melnyk, Igor, Lozano, Aurelie, Das, Payel, Chenthamarakshan, Vijil
Although our proposed AFDistill system is novel, efficient, and showed promising results during evaluations, the current approach has a number of limitations:
- Dependency on the accuracy of the AlphaFold forward folding model: the quality of the distilled model is directly tied to the accuracy of the original forward folding model, including the biases inherited from it.
- Limited coverage of protein sequence space: despite the advances in AlphaFold forward folding models, they remain limited in their ability to accurately predict the structure of many protein sequences, including the TM-score and pLDDT confidence metrics that AFDistill relies on.
- Uncertainty in structural predictions: the confidence metrics (TM-score and pLDDT) used in the distillation process are subject to uncertainty, which can lead to errors in the distilled model's predictions and ultimately affect the quality of the generated sequences in downstream applications.
- Large computational requirements: training the AFDistill model requires significant computational resources, although this may be mitigated by an amortization effect, where the high upfront training cost pays off in downstream applications through cheap and fast inference.
Reprogramming Pretrained Language Models for Antibody Sequence Infilling
Melnyk, Igor, Chenthamarakshan, Vijil, Chen, Pin-Yu, Das, Payel, Dhurandhar, Amit, Padhi, Inkit, Das, Devleena
Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Computational design of antibodies involves generating novel and diverse sequences while maintaining structural consistency. Unique to antibodies, designing the complementarity-determining region (CDR), which determines the antigen binding affinity and specificity, creates its own unique challenges. Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance, particularly a lack of diversity in the generated sequences. In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes models pretrained on a source language to tasks that are in a different language and have scarce data - where it may be difficult to train a high-performing model from scratch or effectively fine-tune an existing pretrained model on the specific task. Specifically, we introduce ReprogBert, in which a pretrained English language model is repurposed for protein sequence infilling - thus considering cross-language adaptation using less data. Results on antibody design benchmarks show that our model, trained on a low-resource antibody sequence dataset, provides highly diverse CDR sequences, with up to a more than two-fold increase in diversity over the baselines, without losing structural integrity and naturalness. The generated sequences also demonstrate enhanced antigen binding specificity and virus neutralization ability. Code is available at https://github.com/IBM/ReprogBERT
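A common construction in model reprogramming, and a plausible reading of the cross-vocabulary step above, is to keep the pretrained embeddings frozen and learn only a small map from the new vocabulary into the source embedding space. The sketch below illustrates that idea with toy sizes; the names, dimensions, and softmax-combination form are assumptions for illustration, not ReprogBert's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
eng_emb = rng.normal(size=(100, 16))  # frozen pretrained English embeddings (toy: 100 tokens, dim 16)
theta = rng.normal(size=(20, 100))    # learnable reprogramming matrix: 20 amino-acid tokens

def protein_token_embedding(token_ids):
    """Map protein tokens into the frozen English embedding space via a
    learned convex combination (softmax over the source vocabulary).
    Only `theta` would be trained; the language model stays frozen."""
    probs = np.exp(theta - theta.max(axis=1, keepdims=True))
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[token_ids] @ eng_emb
```

Because only the small matrix `theta` is trained, the approach suits the low-data regime the abstract describes.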
Knowledge Graph Generation From Text
Melnyk, Igor, Dognin, Pierre, Das, Payel
In this work we propose a novel end-to-end multi-stage Knowledge Graph (KG) generation system from textual inputs, separating the overall process into two stages. The graph nodes are generated first using a pretrained language model, followed by a simple edge construction head, enabling efficient KG extraction from the text. For each stage we consider several architectural choices that can be used depending on the available training resources. We evaluated the model on the recent WebNLG 2020 Challenge dataset, matching the state-of-the-art performance on the text-to-RDF generation task, as well as on the New York Times (NYT) and large-scale TekGen datasets, showing strong overall performance and outperforming the existing baselines. We believe that the proposed system can serve as a viable KG construction alternative to the existing linearization or sampling-based graph generation approaches. Our code can be found at https://github.com/IBM/Grapher
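The second stage can be pictured as a pairwise scoring head over the generated nodes. The sketch below is a hypothetical simplification (the `edge_head` name, the elementwise interaction feature, and the thresholding rule are all illustrative choices, not Grapher's architecture): stage 1, node generation by a pretrained language model, is assumed to have already produced node embeddings.

```python
import numpy as np

def edge_head(node_emb, rel_emb, threshold=0.0):
    """Toy stage-2 edge construction: score every ordered node pair against
    each relation embedding and keep (head, relation, tail) triples whose
    best relation score clears a threshold."""
    n, _ = node_emb.shape
    edges = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair = node_emb[i] * node_emb[j]  # simple interaction feature
            scores = rel_emb @ pair           # one score per relation type
            r = int(np.argmax(scores))
            if scores[r] > threshold:
                edges.append((i, r, j))
    return edges
```

Decoupling node generation from edge scoring in this way is what lets the pipeline avoid linearizing the whole graph into a single output sequence.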
Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge
Dognin, Pierre, Melnyk, Igor, Mroueh, Youssef, Padhi, Inkit, Rigotti, Mattia, Ross, Jerret, Schiff, Yair, Young, Richard A., Belgodere, Brian
Image captioning has recently demonstrated impressive progress, largely owing to the introduction of neural network algorithms trained on curated datasets like MS-COCO. Work in this field is often motivated by the promise of deploying captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering behind our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems.
Tabular Transformers for Modeling Multivariate Time Series
Padhi, Inkit, Schiff, Yair, Melnyk, Igor, Rigotti, Mattia, Mroueh, Youssef, Dognin, Pierre, Ross, Jerret, Nair, Ravi, Altman, Erik
Tabular datasets are ubiquitous across many industries, especially in vital sectors such as healthcare and finance. Such industrial datasets often contain sensitive information, raising privacy and confidentiality issues that preclude their public release and limit their analysis to methods that are compatible with an appropriate anonymization process. We can distinguish between two types of tabular data: static tabular data, which corresponds to independent rows in a table, and dynamic tabular data, which corresponds to tabular time series, also referred to as multivariate time series. The machine learning and deep learning communities have devoted considerable effort to learning from static tabular data, as well as to generating synthetic static tabular data that can be released as a privacy-compliant surrogate of the original data. On the other hand, less effort has been devoted to the more challenging dynamic case, where it is important to also account for the temporal component of the data.