Shtok, Joseph
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team, Karlinsky, Leonid, Arbelle, Assaf, Daniels, Abraham, Nassar, Ahmed, Alfassi, Amit, Wu, Bo, Schwartz, Eli, Joshi, Dhiraj, Kondic, Jovana, Shabtay, Nimrod, Li, Pengyuan, Herzig, Roei, Abedin, Shafiq, Perek, Shaked, Harary, Sivan, Barzelay, Udi, Goldfarb, Adi Raz, Oliva, Aude, Wieles, Ben, Bhattacharjee, Bishwaranjan, Huang, Brandon, Auer, Christoph, Gutfreund, Dan, Beymer, David, Wood, David, Kuehne, Hilde, Hansen, Jacob, Shtok, Joseph, Wong, Ken, Bathen, Luis Angel, Mishra, Mayank, Lysak, Maksym, Dolfi, Michele, Yurochkin, Mikhail, Livathinos, Nikolaos, Harel, Nimrod, Azulai, Ophir, Naparstek, Oshri, de Lima, Rafael Teixeira, Panda, Rameswar, Doveh, Sivan, Gupta, Shubham, Das, Subhro, Zawad, Syed, Kim, Yusik, He, Zexue, Brooks, Alexander, Goodhart, Gabe, Govindjee, Anita, Leist, Derek, Ibrahim, Ibrahim, Soffer, Aya, Cox, David, Soule, Kate, Lastras, Luis, Desai, Nirmit, Ofek-koifman, Shila, Raghavan, Sriram, Syeda-Mahmood, Tanveer, Staar, Peter, Drory, Tal, Feris, Rogerio
Ensuring the safety of generative MLLMs is crucial for preventing harm, building trust, addressing ethical concerns, and enabling their responsible deployment in real-world applications. Our results demonstrate that Granite Vision performs almost on par with the baselines on the VLM-as-a-Judge task, despite being the lightest MLLM in the comparison pool. Notably, adding Safety Vectors to Granite Vision leads to a significant improvement in safety classification performance. We acknowledge that further work is needed to improve high-level reasoning and to correct occasional incorrect outputs, so as to increase reliability in sensitive tasks that require nuanced classification. To address this, we plan to incorporate more reasoning-focused and structure-related data into future training. In addition, we showed in this paper that finding safety vectors (SVs) in Granite Vision's attention heads led to significant improvements when safety tasks were reformulated as classification problems. SVs currently rely on few-shot samples, which are informative but may be limited in how well they capture the range of safety issues that can be encountered. To further improve the model's ability to identify and address safety concerns, we plan to investigate scaling up SVs with more training data in future research.
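The abstract describes finding Safety Vectors (SVs) in the model's attention heads from few-shot examples and reformulating safety tasks as classification. As a purely illustrative sketch of that general idea, the routine below scores attention heads by how well their activations separate a handful of safe/unsafe examples and uses the per-head mean-difference directions as a lightweight classifier; the function names, tensor shapes, and the selection heuristic are our own assumptions, not the published SV procedure.

import numpy as np

def find_safety_vectors(head_acts, labels, top_k=8):
    """head_acts: (n_samples, n_heads, d) activations; labels: (n_samples,) 0=safe, 1=unsafe."""
    head_acts = np.asarray(head_acts, dtype=np.float64)
    labels = np.asarray(labels)
    safe, unsafe = head_acts[labels == 0], head_acts[labels == 1]
    # Hypothetical "safety vector" per head: mean difference between unsafe and safe activations.
    directions = unsafe.mean(axis=0) - safe.mean(axis=0)          # (n_heads, d)
    # Score each head by how cleanly projections on its direction separate the two groups.
    proj = np.einsum("nhd,hd->nh", head_acts, directions)          # (n_samples, n_heads)
    gap = proj[labels == 1].mean(axis=0) - proj[labels == 0].mean(axis=0)
    scores = gap / (proj.std(axis=0) + 1e-8)
    heads = np.argsort(scores)[::-1][:top_k]
    thresholds = proj[:, heads].mean(axis=0)
    return heads, directions[heads], thresholds

def classify_safety(sample_acts, heads, directions, thresholds):
    """Majority vote over the selected heads: returns 1 if the sample is flagged as unsafe."""
    sample_acts = np.asarray(sample_acts, dtype=np.float64)
    proj = np.einsum("hd,hd->h", sample_acts[heads], directions)
    return int((proj > thresholds).mean() >= 0.5)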
Augmenting In-Context-Learning in LLMs via Automatic Data Labeling and Refinement
Shtok, Joseph, Alfassy, Amit, Dahood, Foad Abo, Schwartz, Eliyahu, Doveh, Sivan, Arbelle, Assaf
The past decade has seen a major renaissance in the Machine Learning (ML) domain with the rise of neural networks, which continue to break limits at a rapid pace. Until recently, the common training paradigm was based on task-specific models, each trained on a separate dataset for a given task, e.g. classification [Krizhevsky et al., 2012], detection [Redmon et al., 2016], summarization [Nallapati et al., 2016], translation [Vaswani et al., 2017], etc. Today, we see the rise of Foundation Models [Bommasani et al., 2021], largely based on Large Language Models (LLMs), which exhibit several interesting emergent properties, including In-Context-Learning (ICL) and Chain-of-Thought (CoT) inference. ICL is an approach in which the model's behavior is modulated through the model's input, i.e. the context. This context can include information that is required to answer a desired query. This concept is extremely useful in several pipelines, for example in Retrieval-Augmented Generation (RAG) [Lewis et al., 2020] systems. In other cases, the context can include several examples of input-output pairs that outline the model's expected behavior.
[Figure 1: From an input-output dataset with no intermediate steps (CoT/Executable programs), ADLR generates examples with such steps and retains the ...]
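For concreteness, here is a minimal sketch of the kind of ICL prompt the abstract alludes to: a context of input-output demonstrations, optionally augmented with intermediate reasoning steps of the sort ADLR aims to generate automatically. The field names and the arithmetic demonstrations are illustrative assumptions, not taken from the paper.

def build_icl_prompt(demonstrations, query):
    # Each demonstration is a dict with "question", optional "steps", and "answer".
    parts = []
    for demo in demonstrations:
        parts.append(f"Q: {demo['question']}")
        if demo.get("steps"):  # include intermediate steps when available (CoT-style context)
            parts.append(f"Reasoning: {demo['steps']}")
        parts.append(f"A: {demo['answer']}\n")
    parts.append(f"Q: {query}\nA:")
    return "\n".join(parts)

demos = [
    {"question": "2 + 3 * 4 = ?", "steps": "3 * 4 = 12; 2 + 12 = 14", "answer": "14"},
    {"question": "10 - 4 / 2 = ?", "steps": "4 / 2 = 2; 10 - 2 = 8", "answer": "8"},
]
print(build_icl_prompt(demos, "7 + 6 * 2 = ?"))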
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Schwartz, Eli, Choshen, Leshem, Shtok, Joseph, Doveh, Sivan, Karlinsky, Leonid, Arbelle, Assaf
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated by a causal language model, its place value (e.g. thousands vs. hundreds) is not known until the entire number has been processed. To address this issue, we propose a simple adjustment to how numbers are represented: the count of digits is placed before each number. For instance, instead of "42", we suggest using "{2:42}" as the new format. This approach, which we term NumeroLogic, offers an added advantage in number generation by serving as a Chain of Thought (CoT): by requiring the model to consider the number of digits first, it enhances the reasoning process before the actual number is generated. We use arithmetic tasks to demonstrate the effectiveness of the NumeroLogic formatting. We further demonstrate NumeroLogic's applicability to general natural language modeling, improving language understanding performance on the MMLU benchmark.
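The "{2:42}" format described above is mechanical enough to sketch directly. The helpers below prefix every number in a text with its digit count and strip the prefix back out; the regular expressions are our own illustration under the assumption that numbers appear as plain digit runs, since the abstract only specifies the target format itself.

import re

def numerologic_encode(text: str) -> str:
    # "42" -> "{2:42}": prepend the digit count to every run of digits.
    return re.sub(r"\d+", lambda m: f"{{{len(m.group(0))}:{m.group(0)}}}", text)

def numerologic_decode(text: str) -> str:
    # "{3:144}" -> "144": drop the digit-count prefix.
    return re.sub(r"\{\d+:(\d+)\}", r"\1", text)

print(numerologic_encode("The answer to 12 * 12 is 144."))
# -> The answer to {2:12} * {2:12} is {3:144}.
print(numerologic_decode("{3:144}"))  # -> 144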
Delta-encoder: an effective sample synthesis method for few-shot object recognition
Schwartz, Eli, Karlinsky, Leonid, Shtok, Joseph, Harary, Sivan, Marder, Mattias, Kumar, Abhishek, Feris, Rogerio, Giryes, Raja, Bronstein, Alex
Learning to classify new categories based on just one or a few examples is a long-standing challenge in modern computer vision. In this work, we propose a simple yet effective method for few-shot (and one-shot) object recognition. Our approach is based on a modified auto-encoder, denoted the delta-encoder, that learns to synthesize new samples for an unseen category just by seeing a few examples from it. The synthesized samples are then used to train a classifier. The proposed approach learns both to extract transferable intra-class deformations, or "deltas", between same-class pairs of training examples, and to apply those deltas to the few provided examples of a novel class (unseen during training) in order to efficiently synthesize samples from that new class. The proposed method improves the state of the art in one-shot object recognition and performs comparably in the few-shot case.
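As a rough illustration of the mechanism described in the abstract, the following PyTorch sketch operates on pre-extracted feature vectors: an encoder maps a same-class pair to a low-dimensional "delta" code, a decoder reconstructs the target sample from the anchor plus that code, and at synthesis time deltas extracted from seen-class pairs are applied to a novel-class example. Layer sizes, dimensions, and the L1 reconstruction loss are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class DeltaEncoder(nn.Module):
    def __init__(self, feat_dim=512, delta_dim=16, hidden=256):
        super().__init__()
        # Encoder: from a same-class pair (anchor, target) to a low-dim "delta" code.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim * 2, hidden), nn.ReLU(), nn.Linear(hidden, delta_dim))
        # Decoder: reconstruct the target from the anchor plus the delta code.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + delta_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))

    def forward(self, anchor, target):
        delta = self.encoder(torch.cat([anchor, target], dim=-1))
        return self.decoder(torch.cat([anchor, delta], dim=-1)), delta

    @torch.no_grad()
    def synthesize(self, novel_example, source_anchor, source_target):
        # Extract a delta from a seen-class pair and apply it to a novel-class example.
        delta = self.encoder(torch.cat([source_anchor, source_target], dim=-1))
        return self.decoder(torch.cat([novel_example, delta], dim=-1))

model = DeltaEncoder()
a, t = torch.randn(4, 512), torch.randn(4, 512)   # same-class training pairs (feature vectors)
recon, _ = model(a, t)
loss = nn.functional.l1_loss(recon, t)             # train to reconstruct the target sample
novel = torch.randn(4, 512)                        # one example of an unseen class
fake = model.synthesize(novel, a, t)               # synthesized novel-class features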