Feris, Rogerio
Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
Borazjanizadeh, Nasim, Herzig, Roei, Oks, Eduard, Darrell, Trevor, Feris, Rogerio, Karlinsky, Leonid
Human reasoning relies on constructing and manipulating mental models-simplified internal representations of situations that we use to understand and solve problems. Conceptual diagrams (for example, sketches drawn by humans to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture relational and spatial information. In contrast, Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations, limiting their effectiveness in complex multi-step combinatorial and planning tasks. In this paper, we propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated intermediate conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach does not require any human initialization beyond a natural language description of the task. It integrates both textual and diagrammatic reasoning within an optimized graph-of-thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves GPT-4o's performance (for example, from 35.5% to 90.2% in Blocksworld). On more difficult planning domains with solution depths up to 40, our approach outperforms even the o1-preview reasoning model (for example, over 13% improvement in Parking). These results highlight the value of conceptual diagrams as a complementary reasoning medium in LMMs.
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team, null, Karlinsky, Leonid, Arbelle, Assaf, Daniels, Abraham, Nassar, Ahmed, Alfassi, Amit, Wu, Bo, Schwartz, Eli, Joshi, Dhiraj, Kondic, Jovana, Shabtay, Nimrod, Li, Pengyuan, Herzig, Roei, Abedin, Shafiq, Perek, Shaked, Harary, Sivan, Barzelay, Udi, Goldfarb, Adi Raz, Oliva, Aude, Wieles, Ben, Bhattacharjee, Bishwaranjan, Huang, Brandon, Auer, Christoph, Gutfreund, Dan, Beymer, David, Wood, David, Kuehne, Hilde, Hansen, Jacob, Shtok, Joseph, Wong, Ken, Bathen, Luis Angel, Mishra, Mayank, Lysak, Maksym, Dolfi, Michele, Yurochkin, Mikhail, Livathinos, Nikolaos, Harel, Nimrod, Azulai, Ophir, Naparstek, Oshri, de Lima, Rafael Teixeira, Panda, Rameswar, Doveh, Sivan, Gupta, Shubham, Das, Subhro, Zawad, Syed, Kim, Yusik, He, Zexue, Brooks, Alexander, Goodhart, Gabe, Govindjee, Anita, Leist, Derek, Ibrahim, Ibrahim, Soffer, Aya, Cox, David, Soule, Kate, Lastras, Luis, Desai, Nirmit, Ofek-koifman, Shila, Raghavan, Sriram, Syeda-Mahmood, Tanveer, Staar, Peter, Drory, Tal, Feris, Rogerio
Ensuring the safety of generative MLLMs is absolutely crucial in order to prevent harm, build trust, address ethical concerns, and enable their responsible deployment in real-world applications. Our results demonstrate that Granite Vision performs almost at par with baselines (despite being the lightest MLLM in the comparison pool) for VLM-as-a-Judge task. Notably, the addition of Safety Vectors to Granite Vision leads to a significant improvement in safety classification performance. We do acknowledge that further work needs to be done to improve high-level reasoning and correct occasional incorrect outputs to improve reliability in sensitive tasks, which require nuanced classification. To address these, we will incorporate more reasoning-focused and structure-related data into the training process in the future. In addition, we showed in this paper that finding safety vectors (SVs) in Granite Vision's attention heads led to significant improvements when safety tasks were reformulated as classification problems. Current reliance for SVs is on few-shot samples which are informative but may have limited scope in terms of capturing the range of possible safety issues that can be encountered. To further improve the model's ability to identify and address all safety concerns, we plan to investigate scaling up SVs using more training data in future research.
M+: Extending MemoryLLM with Scalable Long-Term Memory
Wang, Yu, Krotov, Dmitry, Hu, Yuanzhe, Gao, Yifan, Zhou, Wangchunshu, McAuley, Julian, Gutfreund, Dan, Feris, Rogerio, He, Zexue
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers
Mitra, Chancharik, Huang, Brandon, Chai, Tianning, Lin, Zhiqiu, Arbelle, Assaf, Feris, Rogerio, Karlinsky, Leonid, Darrell, Trevor, Ramanan, Deva, Herzig, Roei
Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1\% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
State-Space Large Audio Language Models
Bhati, Saurabhchand, Gong, Yuan, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
Large Audio Language Models (LALM) combine the audio perception models and the Large Language Models (LLM) and show a remarkable ability to reason about the input audio, infer the meaning, and understand the intent. However, these systems rely on Transformers which scale quadratically with the input sequence lengths which poses computational challenges in deploying these systems in memory and time-constrained scenarios. Recently, the state-space models (SSMs) have emerged as an alternative to transformer networks. While there have been successful attempts to replace transformer-based audio perception models with state-space ones, state-space-based LALMs remain unexplored. First, we begin by replacing the transformer-based audio perception module and then replace the transformer-based LLM and propose the first state-space-based LALM. Experimental results demonstrate that space-based LALM despite having a significantly lower number of parameters performs competitively with transformer-based LALMs on close-ended tasks on a variety of datasets.
Scaling Granite Code Models to 128K Context
Stallone, Matt, Saxena, Vaibhav, Karlinsky, Leonid, McGinn, Bridget, Bula, Tim, Mishra, Mayank, Soria, Adriana Meza, Zhang, Gaoyuan, Prasad, Aditya, Shen, Yikang, Surendran, Saptha, Guttula, Shanmukha, Patel, Hima, Selvam, Parameswaran, Dang, Xuan-Hong, Koyfman, Yan, Sood, Atin, Feris, Rogerio, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining by gradually increasing its RoPE base frequency with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support which are derived by further finetuning the long context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems
Borazjanizadeh, Nasim, Herzig, Roei, Darrell, Trevor, Feris, Rogerio, Karlinsky, Leonid
Recently, Large Language Models (LLMs) attained impressive performance in math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problems, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text, e.g., GPT4 solves only 1.4%. SearchBench problems require considering multiple pathways to the solution as well as backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT4's performance rises to 11.7%. In this work, we show that in-context learning with A* algorithm implementations enhances performance. The full potential of this promoting approach emerges when combined with our proposed Multi-Stage-Multi-Try method, which breaks down the algorithm implementation into two stages and verifies the first stage against unit tests, raising GPT-4's performance above 57%.
$\textit{Trans-LoRA}$: towards data-free Transferable Parameter Efficient Finetuning
Wang, Runqian, Ghosh, Soumya, Cox, David, Antognini, Diego, Oliva, Aude, Feris, Rogerio, Karlinsky, Leonid
Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modules need to be re-trained. Such re-training requires access to the data used to train the LoRA for the original base model. This is especially problematic for commercial cloud applications where the LoRA modules and the base models are hosted by service providers who may not be allowed to host proprietary client task data. To address this challenge, we propose $\textit{Trans-LoRA}$ -- a novel method for lossless, nearly data-free transfer of LoRAs across base models. Our approach relies on synthetic data to transfer LoRA modules. Using large language models, we design a synthetic data generator to approximate the data-generating process of the $\textit{observed}$ task data subset. Training on the resulting synthetic dataset transfers LoRA modules to new models. We show the effectiveness of our approach using both LLama and Gemma model families. Our approach achieves lossless (mostly improved) LoRA transfer between models within and across different base model families, and even between different PEFT methods, on a wide variety of tasks.
Adaptive Memory Replay for Continual Learning
Smith, James Seale, Valkov, Lazar, Halbe, Shaunak, Gutta, Vyshnavi, Feris, Rogerio, Kira, Zsolt, Karlinsky, Leonid
Foundation Models (FMs) have become the hallmark of modern AI, however, these models are trained on massive data, leading to financially expensive training. Updating FMs as new data becomes available is important, however, can lead to `catastrophic forgetting', where models underperform on tasks related to data sub-populations observed too long ago. This continual learning (CL) phenomenon has been extensively studied, but primarily in a setting where only a small amount of past data can be stored. We advocate for the paradigm where memory is abundant, allowing us to keep all previous data, but computational resources are limited. In this setting, traditional replay-based CL approaches are outperformed by a simple baseline which replays past data selected uniformly at random, indicating that this setting necessitates a new approach. We address this by introducing a framework of adaptive memory replay for continual learning, where sampling of past data is phrased as a multi-armed bandit problem. We utilize Bolzmann sampling to derive a method which dynamically selects past data for training conditioned on the current task, assuming full data access and emphasizing training efficiency. Through extensive evaluations on both vision and language pre-training tasks, we demonstrate the effectiveness of our approach, which maintains high performance while reducing forgetting by up to 10% at no training efficiency cost.
CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory
He, Zexue, Karlinsky, Leonid, Kim, Donghyun, McAuley, Julian, Krotov, Dmitry, Feris, Rogerio
Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training, enabling it to handle arbitrarily long input sequences. Unlike previous methods, our associative memory module consolidates representations of individual tokens into a non-parametric distribution model, dynamically managed by properly balancing the novelty and recency of the incoming data. By retrieving information from this consolidated associative memory, the base LLM can achieve significant (up to 29.7% on Arxiv) perplexity reduction in long-context modeling compared to other baselines evaluated on standard benchmarks. This architecture, which we call CAMELoT (Consolidated Associative Memory Enhanced Long Transformer), demonstrates superior performance even with a tiny context window of 128 tokens, and also enables improved in-context learning with a much larger set of demonstrations.