Collaborating Authors

 Arbelle, Assaf


Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

arXiv.org Artificial Intelligence

Ensuring the safety of generative MLLMs is crucial in order to prevent harm, build trust, address ethical concerns, and enable their responsible deployment in real-world applications. Our results demonstrate that Granite Vision performs almost on par with the baselines on the VLM-as-a-Judge task, despite being the lightest MLLM in the comparison pool. Notably, the addition of Safety Vectors to Granite Vision leads to a significant improvement in safety classification performance. We acknowledge that further work is needed to improve high-level reasoning and to correct occasional incorrect outputs, so as to increase reliability on sensitive tasks that require nuanced classification. To address these issues, we will incorporate more reasoning-focused and structure-related data into the training process in the future. In addition, we showed in this paper that identifying safety vectors (SVs) in Granite Vision's attention heads led to significant improvements when safety tasks were reformulated as classification problems. SVs currently rely on few-shot samples, which are informative but may have limited scope in capturing the full range of safety issues that can be encountered. To further improve the model's ability to identify and address safety concerns, we plan to investigate scaling up SVs with more training data in future research.


Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

arXiv.org Artificial Intelligence

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning and visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only a handful of few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
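A minimal NumPy sketch of how sparse attention-head activations could serve as few-shot discriminative features, assuming per-head activation vectors have already been extracted from the LMM. The head-selection criterion and classification rule below only loosely follow the paper and are illustrative, not the authors' exact procedure.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def head_accuracy(feats, labels):
    """Few-shot accuracy of one head under nearest-class-mean classification."""
    classes = sorted(set(labels))
    means = {c: feats[[i for i, l in enumerate(labels) if l == c]].mean(0) for c in classes}
    preds = [max(classes, key=lambda c: cosine(f, means[c])) for f in feats]
    return np.mean([p == l for p, l in zip(preds, labels)]), means

def select_sparse_heads(head_feats, labels, k=20):
    """head_feats: dict head_id -> (n_shots, d) array. Keep the k most discriminative heads."""
    scored = {h: head_accuracy(f, labels) for h, f in head_feats.items()}
    top = sorted(scored, key=lambda h: scored[h][0], reverse=True)[:k]
    return {h: scored[h][1] for h in top}  # head_id -> per-class mean vectors

def classify(query_feats, selected):
    """Majority vote over the selected heads' nearest-class-mean predictions."""
    votes = [max(means, key=lambda c: cosine(query_feats[h], means[c]))
             for h, means in selected.items()]
    return max(set(votes), key=votes.count)
```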


Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement

arXiv.org Artificial Intelligence

The past decade has seen a remarkable renaissance in the Machine Learning (ML) domain with the rise of neural networks, which continue to break limits at a rapid pace. Until recently, the common training paradigm was based on task-specific models, each trained on a separate dataset for a given task, e.g. classification [Krizhevsky et al., 2012], detection [Redmon et al., 2016], summarization [Nallapati et al., 2016], translation [Vaswani et al., 2017], etc. Today, we see the rise of Foundation Models [Bommasani et al., 2021], largely based on Large Language Models (LLMs), which have several interesting emergent properties, including In-Context-Learning (ICL) and Chain-of-Thought (CoT) inference. ICL is an approach where the model's behavior is modulated through the model's input, i.e. the context. This context can include information that is required to answer a desired query. This concept is extremely useful in several pipelines, for example in Retrieval-Augmented Generation (RAG) [Lewis et al., 2020] systems. In other cases, the context can include several examples of input-output pairs that outline the model's expected behavior.
[Figure 1: From an input-output dataset with no intermediate steps (CoT/Executable programs), ADLR generates examples with such steps and retains the ...]
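As an illustration of the in-context-learning setup described above, the following sketch assembles a prompt from input-output demonstrations, optionally with intermediate reasoning steps of the kind ADLR generates. The template and field names are hypothetical, not the paper's exact format.

```python
def build_icl_prompt(demos, query, with_steps=False):
    """demos: list of dicts with 'input', 'output', and optionally 'steps'."""
    parts = []
    for d in demos:
        parts.append(f"Input: {d['input']}")
        if with_steps and "steps" in d:
            parts.append(f"Reasoning: {d['steps']}")  # chain-of-thought style step
        parts.append(f"Output: {d['output']}\n")
    parts.append(f"Input: {query}\nOutput:")
    return "\n".join(parts)

demos = [{"input": "2 + 3", "steps": "add the numbers: 2 + 3 = 5", "output": "5"}]
print(build_icl_prompt(demos, "4 + 7", with_steps=True))
```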


Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

arXiv.org Artificial Intelligence

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
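A hedged PyTorch sketch of the compression idea: average each selected attention head's activation across many shots, then patch those means back in on a zero-shot query via forward hooks. The module names, hook placement, and head-selection step are placeholders, since the real method operates inside a specific LMM.

```python
import torch

def extract_mtv(head_acts_per_shot):
    """head_acts_per_shot: dict head_id -> list of (d,) tensors, one per shot.
    Returns the per-head mean, i.e. the compressed task vector."""
    return {h: torch.stack(acts).mean(0) for h, acts in head_acts_per_shot.items()}

def make_patch_hook(mtv_vec):
    """Forward hook that overwrites a head's output with its task vector."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[..., :] = mtv_vec  # broadcast over batch / token positions
        return patched
    return hook

# Usage sketch (hypothetical module layout): register hooks on the per-head
# output modules, then run the zero-shot query with no in-context examples.
# handles = [model.heads[h].register_forward_hook(make_patch_hook(v))
#            for h, v in extract_mtv(collected_acts).items()]
```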


NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning

arXiv.org Artificial Intelligence

Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a causal language model reads or generates a digit, it does not know that digit's place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented: prefixing each number with a count of its digits. For instance, instead of "42", we suggest using "{2:42}" as the new format. This approach, which we term NumeroLogic, offers an added advantage in number generation by serving as a Chain of Thought (CoT): by requiring the model to consider the number of digits first, it enhances the reasoning process before the actual number is generated. We use arithmetic tasks to demonstrate the effectiveness of the NumeroLogic formatting, and we further demonstrate its applicability to general natural language modeling, improving language-understanding performance on the MMLU benchmark.
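A minimal sketch of the encoding described above, implemented as plain string preprocessing over integers; the paper's exact handling of tokenization, decimals, and signs may differ.

```python
import re

def numerologic_encode(text):
    """Prefix every integer with its digit count, e.g. "42" -> "{2:42}"."""
    return re.sub(r"\d+", lambda m: f"{{{len(m.group())}:{m.group()}}}", text)

def numerologic_decode(text):
    """Strip the digit-count prefix back off."""
    return re.sub(r"\{\d+:(\d+)\}", r"\1", text)

assert numerologic_encode("Add 42 and 137.") == "Add {2:42} and {3:137}."
assert numerologic_decode("{3:179}") == "179"
```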


CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning

arXiv.org Artificial Intelligence

Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions to this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in plasticity, sacrificing new-task accuracy, and an inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy. We also outperform the state of the art by as much as 4.4% accuracy on a continual learning benchmark that contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings. Our code is available at https://github.com/GT-RIPL/CODA-Prompt
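A simplified PyTorch sketch of the input-conditioned prompt assembly: a frozen encoder's query feature is compared against learned keys, and the resulting weights combine a set of learnable prompt components. CODA-Prompt additionally uses learned attention vectors and orthogonality constraints, which are omitted here; dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptAssembler(nn.Module):
    def __init__(self, n_components=100, prompt_len=8, dim=768):
        super().__init__()
        self.components = nn.Parameter(torch.randn(n_components, prompt_len, dim))
        self.keys = nn.Parameter(torch.randn(n_components, dim))

    def forward(self, query):  # query: (batch, dim) feature from a frozen ViT
        # Similarity between the query and each component's key -> (batch, n_components)
        weights = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        # Weighted sum of components -> input-conditioned prompt (batch, prompt_len, dim)
        return torch.einsum("bc,cld->bld", weights, self.components)

prompt = PromptAssembler()(torch.randn(4, 768))  # -> shape (4, 8, 768)
```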


ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

arXiv.org Artificial Intelligence

Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by nothing more than short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching the VL model the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show that it is challenging for many existing data-free CL strategies. We therefore propose a data-free method built on a new approach, Adversarial Pseudo-Replay (APR), which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as ~7% while even matching some levels of experience replay (which is prohibitive for applications where data privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL
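A heavily hedged PyTorch sketch of the Adversarial Pseudo-Replay idea: perturb current-task inputs against a frozen past-task model, then use that model's responses on the perturbed inputs as a replay signal for the current model. The attack form, loss choices, and step sizes below are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_reminder(past_model, x, steps=3, eps=8 / 255, lr=2 / 255):
    """FGSM-style perturbation that pushes the frozen past-task model away
    from its own prediction, producing a 'reminder' of past knowledge."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = past_model(x + delta)
        loss = -F.cross_entropy(logits, logits.argmax(-1))
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()

def pseudo_replay_loss(current_model, past_model, x_adv):
    """Distill the past model's behavior on the adversarial reminders."""
    with torch.no_grad():
        target = F.softmax(past_model(x_adv), dim=-1)
    return F.kl_div(F.log_softmax(current_model(x_adv), dim=-1), target,
                    reduction="batchmean")
```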


QANet - Quality Assurance Network for Microscopy Cell Segmentation

arXiv.org Artificial Intelligence

Tools and methods for automatic image segmentation are rapidly developing, each with its own strengths and weaknesses. While these methods are designed to be as general as possible, there are no guarantees for their performance on new data. The choice between methods is usually based on benchmark performance, whereas the benchmark data can be significantly different from the user's data. We introduce a novel deep learning method which, given an image and a proposed corresponding segmentation, estimates the Intersection over Union measure (IoU) with respect to the unknown ground truth. We refer to this method as a Quality Assurance Network - QANet. The QANet is designed to give the user an estimate of the segmentation quality on the user's own private data, without the need for human inspection or labelling. It is based on the RibCage Network architecture, originally proposed as a discriminator in an adversarial network framework. Promising IoU prediction results are demonstrated on the Cell Segmentation Benchmark. The code is freely available at: ANONYMOUS.
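A hedged sketch of the quality-assurance idea: a small two-stream network ingests the image and a proposed segmentation mask and regresses the IoU against the unseen ground truth. The real QANet uses the RibCage architecture with cross-stream connections; this stand-in is deliberately much simpler, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class IoUEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_stream = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                                          nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.mask_stream = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                                         nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, 1), nn.Sigmoid())  # IoU in [0, 1]

    def forward(self, image, mask):
        fused = torch.cat([self.image_stream(image), self.mask_stream(mask)], dim=1)
        return self.head(fused).squeeze(-1)

# Training would minimize e.g. MSE between the predicted IoU and the true IoU
# computed from (proposed mask, ground-truth mask) pairs.
iou_hat = IoUEstimator()(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```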