He, Zexue
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team, Karlinsky, Leonid, Arbelle, Assaf, Daniels, Abraham, Nassar, Ahmed, Alfassi, Amit, Wu, Bo, Schwartz, Eli, Joshi, Dhiraj, Kondic, Jovana, Shabtay, Nimrod, Li, Pengyuan, Herzig, Roei, Abedin, Shafiq, Perek, Shaked, Harary, Sivan, Barzelay, Udi, Goldfarb, Adi Raz, Oliva, Aude, Wieles, Ben, Bhattacharjee, Bishwaranjan, Huang, Brandon, Auer, Christoph, Gutfreund, Dan, Beymer, David, Wood, David, Kuehne, Hilde, Hansen, Jacob, Shtok, Joseph, Wong, Ken, Bathen, Luis Angel, Mishra, Mayank, Lysak, Maksym, Dolfi, Michele, Yurochkin, Mikhail, Livathinos, Nikolaos, Harel, Nimrod, Azulai, Ophir, Naparstek, Oshri, de Lima, Rafael Teixeira, Panda, Rameswar, Doveh, Sivan, Gupta, Shubham, Das, Subhro, Zawad, Syed, Kim, Yusik, He, Zexue, Brooks, Alexander, Goodhart, Gabe, Govindjee, Anita, Leist, Derek, Ibrahim, Ibrahim, Soffer, Aya, Cox, David, Soule, Kate, Lastras, Luis, Desai, Nirmit, Ofek-koifman, Shila, Raghavan, Sriram, Syeda-Mahmood, Tanveer, Staar, Peter, Drory, Tal, Feris, Rogerio
Ensuring the safety of generative MLLMs is crucial to preventing harm, building trust, addressing ethical concerns, and enabling responsible deployment in real-world applications. Our results demonstrate that Granite Vision performs nearly on par with the baselines on the VLM-as-a-Judge task, despite being the lightest MLLM in the comparison pool. Notably, adding Safety Vectors to Granite Vision leads to a significant improvement in safety classification performance. We acknowledge that further work is needed to strengthen high-level reasoning and correct occasional erroneous outputs, which would improve reliability in sensitive tasks that require nuanced classification. To address this, we will incorporate more reasoning-focused and structure-related data into future training. In addition, we showed in this paper that identifying safety vectors (SVs) in Granite Vision's attention heads yields significant improvements when safety tasks are reformulated as classification problems. SVs currently rely on few-shot samples, which are informative but may capture only a limited range of the safety issues that can be encountered. To further improve the model's ability to identify and address safety concerns, we plan to investigate scaling up SVs with more training data in future research.
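As a rough illustration of the steering idea described above, the sketch below adds fixed "safety vectors" to the outputs of a few attention heads at inference time. The head indices, the steering strength alpha, and the additive update rule are illustrative assumptions, not Granite Vision's actual mechanism for finding or applying SVs.

```python
# Hedged sketch: steer selected attention heads with fixed "safety vectors" (SVs).
# Head indices, vector values, and the additive rule are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
num_heads, head_dim, seq_len = 8, 16, 4

# Per-head outputs of one attention layer: (num_heads, seq_len, head_dim)
head_outputs = rng.normal(size=(num_heads, seq_len, head_dim))

# Hypothetical SVs (e.g., derived from few-shot samples) for a subset of heads.
safety_vectors = {2: rng.normal(size=head_dim), 5: rng.normal(size=head_dim)}
alpha = 0.8  # steering strength (hypothetical hyperparameter)

def apply_safety_vectors(head_outputs, safety_vectors, alpha):
    """Add a fixed steering vector to every position of the selected heads."""
    steered = head_outputs.copy()
    for head_idx, vec in safety_vectors.items():
        steered[head_idx] += alpha * vec  # broadcast over sequence positions
    return steered

steered = apply_safety_vectors(head_outputs, safety_vectors, alpha)
print(np.abs(steered - head_outputs).sum(axis=(1, 2)))  # only heads 2 and 5 change
```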
M+: Extending MemoryLLM with Scalable Long-Term Memory
Wang, Yu, Krotov, Dmitry, Hu, Yuanzhe, Gao, Yifan, Zhou, Wangchunshu, McAuley, Julian, Gutfreund, Dan, Feris, Rogerio, He, Zexue
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention, as such memory can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
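A minimal sketch of the retrieval step the abstract describes: a retriever scores entries in a latent memory pool against the current hidden state and returns the top-k for the model to attend over. The dot-product scorer, the random projection, and the pool layout are assumptions, not M+'s trained components.

```python
# Hedged sketch of retrieval over a latent memory pool during generation.
import numpy as np

rng = np.random.default_rng(0)
d, pool_size, k = 64, 1000, 8

memory_pool = rng.normal(size=(pool_size, d))           # long-term memory entries
retriever_proj = rng.normal(size=(d, d)) / np.sqrt(d)   # stand-in for a learned projection

def retrieve(query_hidden, memory_pool, top_k):
    """Score memories against the current hidden state; return the top-k entries."""
    scores = memory_pool @ (retriever_proj @ query_hidden)
    top = np.argsort(scores)[-top_k:][::-1]
    return memory_pool[top], scores[top]

query = rng.normal(size=d)  # hidden state of the token being generated
retrieved, scores = retrieve(query, memory_pool, k)
print(retrieved.shape, scores)  # (8, 64) memories made available to the model
```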
Towards LifeSpan Cognitive Systems
Wang, Yu, Han, Chi, Wu, Tongtong, He, Xiaoxin, Zhou, Wangchunshu, Sadeq, Nafis, Chen, Xiusi, He, Zexue, Wang, Wei, Haffari, Gholamreza, Ji, Heng, McAuley, Julian
Building a human-like system that continuously interacts with complex environments -- whether simulated digital worlds or human society -- presents several key challenges. Central to this is enabling continuous, high-frequency interactions; we term these interactions experiences. We refer to this envisioned system as the LifeSpan Cognitive System (LSCS). A critical feature of LSCS is its ability to perform incremental and rapid updates while retaining and accurately recalling past experiences. We identify two major challenges in achieving this: (1) Abstraction and Experience Merging, and (2) Long-term Retention with Accurate Recall. These properties are essential for storing new experiences, organizing past experiences, and responding to the environment in ways that leverage relevant historical data. Unlike language models with continual learning, which typically rely on large corpora for fine-tuning and focus on improving performance within specific domains or tasks, LSCS must rapidly and incrementally update with new information from its environment at a high frequency. Existing technologies with the potential to solve these two major challenges can be classified into four classes based on a conceptual metric called Storage Complexity, which measures the relative space required to store past experiences. Each of these four classes of technologies has its own strengths and limitations. Given that none of the existing technologies can achieve LSCS alone, we propose a novel paradigm for LSCS that integrates all four classes. The new paradigm operates through two core processes: Absorbing Experiences and Generating Responses.
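To make the two core processes concrete, here is a hypothetical interface sketch; the API and the trivial raw-storage implementation (an instance of the highest Storage Complexity class) are our own illustration, not a system from the paper.

```python
# Hypothetical interface for the two core LSCS processes named in the abstract.
from abc import ABC, abstractmethod

class LifeSpanCognitiveSystem(ABC):
    @abstractmethod
    def absorb(self, experience: str) -> None:
        """Incrementally merge a new high-frequency experience into storage."""

    @abstractmethod
    def respond(self, query: str) -> str:
        """Answer by recalling relevant past experiences."""

class RawStorageLSCS(LifeSpanCognitiveSystem):
    """Trivial stand-in: stores every raw experience (highest Storage Complexity)."""
    def __init__(self):
        self.experiences: list[str] = []
    def absorb(self, experience: str) -> None:
        self.experiences.append(experience)
    def respond(self, query: str) -> str:
        hits = [e for e in self.experiences if any(w in e for w in query.split())]
        return hits[-1] if hits else "no relevant experience"

system = RawStorageLSCS()
system.absorb("met Alice at the lab on Monday")
print(system.respond("Alice lab"))
```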
Large Scale Knowledge Washing
Wang, Yu, Wu, Ruihan, He, Zexue, Chen, Xiusi, McAuley, Julian
Large language models show impressive abilities in memorizing world knowledge, which raises concerns regarding memorization of private information, toxic or sensitive knowledge, and copyrighted content. We introduce the problem of Large Scale Knowledge Washing, focusing on unlearning an extensive amount of factual knowledge. Previous unlearning methods usually define a reverse loss and update the model via backpropagation, which may harm the model's fluency and reasoning ability, or even destroy the model altogether due to extensive training with the reverse loss. Existing works introduce additional data from downstream tasks to prevent the model from losing capabilities, which requires awareness of those downstream tasks. Controlling the tradeoff between unlearning and maintaining existing capabilities is also challenging. To this end, we propose LAW (Large Scale Washing), which updates the MLP layers in decoder-only large language models to perform knowledge washing, inspired by model-editing methods and based on the hypothesis that knowledge and reasoning are disentanglable. We derive a new objective from the knowledge to be unlearned and use it to update the weights of certain MLP layers. Experimental results demonstrate the effectiveness of LAW in forgetting target knowledge while maintaining reasoning ability. The code will be open-sourced at https://github.com/wangyu-ustc/LargeScaleWashing.
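The sketch below illustrates the general flavor of closed-form MLP edits in the model-editing tradition the abstract builds on: solve a regularized least-squares problem that pushes the outputs for "forget" keys toward a neutral target while a covariance term protects other keys. LAW's actual objective differs; the zero targets and the covariance construction here are illustrative assumptions.

```python
# Hedged sketch of a closed-form MLP-weight update for washing facts.
# Minimize ||(W + dW) K_f - V_t||^2 + tr(dW C dW^T)  =>  dW = R K_f^T (K_f K_f^T + C)^{-1}
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 32, 32
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # an MLP projection, v = W k

K_forget = rng.normal(size=(d_in, 5))   # keys activating the facts to unlearn
V_target = np.zeros((d_out, 5))         # push their outputs toward a neutral value
C = rng.normal(size=(d_in, d_in)); C = C @ C.T / d_in  # covariance of keys to preserve

R = V_target - W @ K_forget             # residual the update should absorb
dW = R @ K_forget.T @ np.linalg.inv(K_forget @ K_forget.T + C)

print(np.linalg.norm(W @ K_forget), np.linalg.norm((W + dW) @ K_forget))  # shrinks
```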
Cognitive Bias in High-Stakes Decision-Making with LLMs
Echterhoff, Jessica, Liu, Yao, Alessa, Abeer, McAuley, Julian, He, Zexue
Large language models (LLMs) offer significant potential as tools to support an expanding range of decision-making tasks. However, given their training on human-created data, LLMs can both inherit societal biases against protected groups and be subject to cognitive bias. Such human-like biases can impede fair and explainable decisions made with LLM assistance. Our work introduces BiasBuster, a framework designed to uncover, evaluate, and mitigate cognitive bias in LLMs, particularly in high-stakes decision-making tasks. Inspired by prior research in psychology and cognitive science, we develop a dataset containing 16,800 prompts to evaluate different cognitive biases (e.g., prompt-induced, sequential, inherent). We test various bias mitigation strategies and propose a novel method that uses LLMs to debias their own prompts. Our analysis provides a comprehensive picture of the presence and effects of cognitive bias across different commercial and open-source models. We demonstrate that our self-help debiasing effectively mitigates cognitive bias without requiring manually crafted examples for each bias type.
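As a toy illustration of the self-help idea, the snippet below asks a model to rewrite its own prompt into a neutral form and then answers the rewritten prompt. `call_llm` is a stub, and the debiasing instruction is paraphrased; neither reproduces BiasBuster's actual prompts.

```python
# Hedged sketch: an LLM rewrites its own prompt to strip bias triggers, then answers.
DEBIAS_INSTRUCTION = (
    "Rewrite the following prompt so it no longer anchors, primes, or otherwise "
    "biases the answer, while preserving the underlying question:\n\n{prompt}"
)

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g., an API or a local checkpoint).
    return prompt.splitlines()[-1]  # trivially echoes the last line

def self_debias_answer(prompt: str) -> str:
    neutral_prompt = call_llm(DEBIAS_INSTRUCTION.format(prompt=prompt))
    return call_llm(neutral_prompt)

biased = "Most experts pick option A.\nWhich option is best, A or B?"
print(self_debias_answer(biased))
```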
CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory
He, Zexue, Karlinsky, Leonid, Kim, Donghyun, McAuley, Julian, Krotov, Dmitry, Feris, Rogerio
Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training, enabling it to handle arbitrarily long input sequences. Unlike previous methods, our associative memory module consolidates representations of individual tokens into a non-parametric distribution model, dynamically managed by properly balancing the novelty and recency of the incoming data. By retrieving information from this consolidated associative memory, the base LLM achieves a significant perplexity reduction (up to 29.7% on arXiv) in long-context modeling compared to other baselines evaluated on standard benchmarks. This architecture, which we call CAMELoT (Consolidated Associative Memory Enhanced Long Transformer), demonstrates superior performance even with a tiny context window of 128 tokens, and also enables improved in-context learning with a much larger set of demonstrations.
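The sketch below gives one plausible reading of novelty/recency-balanced consolidation: each incoming token representation either updates the running mean of its nearest memory slot (familiar input) or evicts the stalest slot (novel input). The threshold, the eviction rule, and the running-mean update are assumptions, not CAMELoT's exact dynamics.

```python
# Hedged sketch: consolidate token representations into fixed memory slots,
# balancing novelty (new slot) against recency (update an existing slot).
import numpy as np

rng = np.random.default_rng(0)
d, n_slots = 16, 4
slots = rng.normal(size=(n_slots, d))   # consolidated memory (running means)
counts = np.ones(n_slots)               # tokens absorbed per slot
age = np.zeros(n_slots)                 # steps since each slot was last updated
NOVELTY_THRESHOLD = 4.0                 # hypothetical distance cutoff

def consolidate(token: np.ndarray) -> None:
    dists = np.linalg.norm(slots - token, axis=1)
    j = int(np.argmin(dists))
    age[:] += 1
    if dists[j] < NOVELTY_THRESHOLD:          # familiar: fold into the running mean
        counts[j] += 1
        slots[j] += (token - slots[j]) / counts[j]
    else:                                      # novel: evict the stalest slot
        j = int(np.argmax(age))
        slots[j], counts[j] = token.copy(), 1.0
    age[j] = 0

for t in rng.normal(size=(100, d)):
    consolidate(t)
print(counts)  # tokens absorbed per consolidated slot
```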
Deciphering Compatibility Relationships with Textual Descriptions via Extraction and Explanation
Wang, Yu, He, Zexue, He, Zhankui, Xu, Hao, McAuley, Julian
Understanding and accurately explaining compatibility relationships between fashion items is a challenging problem in the burgeoning domain of AI-driven outfit recommendations. Present models, while making strides in this area, still occasionally fall short, offering explanations that can be elementary and repetitive. This work aims to address these shortcomings by introducing the Pair Fashion Explanation (PFE) dataset, a unique resource curated to illuminate these compatibility relationships. Furthermore, we propose an innovative two-stage pipeline model that leverages this dataset for fine-tuning. This fine-tuning allows the model to generate explanations that convey the compatibility relationships between items. Our experiments showcase the model's potential in crafting explanations that are knowledgeable, aligned with ground-truth matching correlations, and understandable and informative, as assessed by both automatic metrics and human evaluation. Our code and data are released at https://github.com/wangyu-ustc/PairFashionExplanation
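A toy sketch of a two-stage extract-then-explain pipeline of the kind described above; both stages are stubs and the interfaces are our own assumptions (the repository linked above contains the real pipeline).

```python
# Hedged sketch: stage 1 extracts compatibility cues, stage 2 generates an explanation.
def extract_compatibility_keywords(item_a: str, item_b: str) -> list[str]:
    # Stage 1 stand-in: a trained extractor would pull shared style/material cues.
    shared = set(item_a.lower().split()) & set(item_b.lower().split())
    return sorted(shared) or ["color", "style"]

def generate_explanation(item_a: str, item_b: str, keywords: list[str]) -> str:
    # Stage 2 stand-in: a fine-tuned LM would condition on the extracted keywords.
    return f"{item_a} pairs with {item_b} because they share {', '.join(keywords)}."

a, b = "black leather boots", "black leather jacket"
print(generate_explanation(a, b, extract_compatibility_keywords(a, b)))
```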
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
He, Zexue, Wang, Yu, Yan, An, Liu, Yao, Chang, Eric Y., Gentili, Amilcare, McAuley, Julian, Hsu, Chun-Nan
Curated datasets for healthcare are often limited due to the need for human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive, consisting of data from several healthcare systems and spanning 35 human body regions and 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering granular potential uses of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and fine-tuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal the varying effectiveness of the two categories of language models across different tasks, highlighting the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.
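For flavor, a minimal multi-level zero-shot evaluation loop of the kind such a benchmark enables; the example records, label sets, and `predict` stub are hypothetical and not part of MedEval's released data or harness.

```python
# Hedged sketch: score a stub model at multiple annotation levels.
from collections import defaultdict

examples = [
    {"level": "sentence", "text": "No acute fracture.", "label": "normal"},
    {"level": "report", "text": "Findings: ... Impression: ...", "label": "abnormal"},
]

def predict(text: str) -> str:
    # Stand-in for prompting a language model with task instructions.
    return "normal" if "No acute" in text else "abnormal"

correct, total = defaultdict(int), defaultdict(int)
for ex in examples:
    total[ex["level"]] += 1
    correct[ex["level"]] += predict(ex["text"]) == ex["label"]
for level in total:
    print(level, f"accuracy = {correct[level] / total[level]:.2f}")
```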
InterFair: Debiasing with Natural Language Feedback for Fair Interpretable Predictions
Majumder, Bodhisattwa Prasad, He, Zexue, McAuley, Julian
Debiasing methods in NLP models traditionally focus on isolating information related to a sensitive attribute (e.g., gender or race). We instead argue that a favorable debiasing method should use sensitive information 'fairly,' with explanations, rather than blindly eliminating it. This fair balance is often subjective and can be challenging to achieve algorithmically. We explore two interactive setups with a frozen predictive model and show that users who can provide feedback achieve a better and fairer balance between task performance and bias mitigation. In one setup, users interacting with test examples further decreased bias in the explanations (by 5-8%) while maintaining the same prediction accuracy. In the other setup, human feedback disentangled associated bias and predictive information from the input, leading simultaneously to superior bias mitigation and improved task performance (4-5%).
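A toy sketch of the interactive setup: a frozen predictor, a token-level rationale, and user feedback that zeroes out bias-carrying tokens before re-prediction. The feedback format and masking rule are our own illustration, not InterFair's protocol.

```python
# Hedged sketch: user feedback reweights a rationale over a frozen predictor's input.
def frozen_predict(tokens: list[str]) -> str:
    return "hire" if "experienced" in tokens else "reject"  # toy frozen model

def apply_feedback(tokens: list[str], rationale: dict[str, float],
                   feedback: dict[str, float]) -> list[str]:
    """Drop tokens whose feedback-adjusted rationale weight falls to zero."""
    adjusted = {t: w * feedback.get(t, 1.0) for t, w in rationale.items()}
    return [t for t in tokens if adjusted.get(t, 1.0) > 0.0]

tokens = ["experienced", "female", "engineer"]
rationale = {"experienced": 0.9, "female": 0.6, "engineer": 0.8}
feedback = {"female": 0.0}  # user: this attribute should not drive the prediction
print(frozen_predict(apply_feedback(tokens, rationale, feedback)))
```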
Farzi Data: Autoregressive Data Distillation
Sachdeva, Noveen, He, Zexue, Kang, Wang-Cheng, Ni, Jianmo, Cheng, Derek Zhiyuan, McAuley, Julian
We study data distillation for auto-regressive machine learning tasks, where the input and output have a strict left-to-right causal structure. More specifically, we propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences -- Farzi Data -- optimized to maintain (if not improve) model performance relative to training on the full dataset. Under the hood, Farzi performs memory-efficient data distillation by (i) deriving an efficient reverse-mode differentiation of the Adam optimizer using Hessian-vector products; and (ii) factorizing the high-dimensional discrete event space into a latent space that provably promotes implicit regularization. Empirically, on sequential recommendation and language modeling tasks, we achieve 98-120% of downstream full-data performance when training state-of-the-art models on Farzi Data as small as 0.1% of the original dataset. Notably, the ability to train better models with significantly less data sheds light on the design of future large auto-regressive models and opens up new opportunities to further scale up model and data sizes.
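The bilevel structure of data distillation can be sketched in a few lines: an inner loop trains a model on the synthetic data, and an outer loop updates the synthetic data itself to minimize loss on the real data. Farzi's contribution is making the outer step efficient by backpropagating through Adam with Hessian-vector products; the finite-difference meta-gradient and plain-SGD inner loop below are crude stand-ins for illustration only.

```python
# Hedged sketch of bilevel data distillation on a toy linear-regression task.
import numpy as np

rng = np.random.default_rng(0)
X_real = rng.normal(size=(256, 3)); w_true = np.array([1.5, -2.0, 0.5])
y_real = X_real @ w_true

def train_on(X_syn, y_syn, steps=100, lr=0.05):
    w = np.zeros(3)
    for _ in range(steps):                      # inner loop (SGD stand-in for Adam)
        w -= lr * 2 * X_syn.T @ (X_syn @ w - y_syn) / len(y_syn)
    return w

def outer_loss(syn_flat):
    X_syn, y_syn = syn_flat[:-2].reshape(2, 3), syn_flat[-2:]
    w = train_on(X_syn, y_syn)
    return np.mean((X_real @ w - y_real) ** 2)  # performance on the real data

syn = rng.normal(size=8)                        # 2 synthetic examples + 2 targets
eps = 1e-4
for _ in range(150):                            # outer loop: update the data itself
    grad = np.array([(outer_loss(syn + eps * e) - outer_loss(syn - eps * e)) / (2 * eps)
                     for e in np.eye(8)])
    syn -= 0.02 * grad                          # finite-difference meta-gradient step
print(f"real-data MSE after distillation: {outer_loss(syn):.4f}")
```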