Soule, Kate
Granite Embedding Models
Awasthy, Parul, Trivedi, Aashka, Li, Yulong, Bornea, Mihaela, Cox, David, Daniels, Abraham, Franz, Martin, Goodhart, Gabe, Iyer, Bhavani, Kumar, Vishwajeet, Lastras, Luis, McCarley, Scott, Murthy, Rudra, P, Vignesh, Rosenthal, Sara, Roukos, Salim, Sen, Jaydeep, Sharma, Sukriti, Sil, Avirup, Soule, Kate, Sultan, Arafat, Florian, Radu
We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse-retrieval architectures, with both English and Multilingual capabilities. This report provides the technical details of training these highly effective 12 layer embedding models, along with their efficient 6 layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval oriented pretraining, contrastive finetuning, knowledge distillation, and model merging significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely-used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use at https://huggingface.co/collections/ibm-granite . Figure 1: Average performance on the Granite embedding models (in blue) vs BGE, GTE, Snowflake, E5, and Nomic models on 5 QA and IR datasets: BEIR, ClapNQ, CoIR, RedHat, and UnifiedSearch (the last 2 are internal IBM datasets). The goal of text embedding models is to convert variable length text into a fixed vector, encoding the text semantics into a multidimensional vector in such a way that semantically close texts are close in the vector space, while dissimilar texts have a low similarity. These embeddings can then be used in a variety of tasks, most commonly in retrieval applications, where the relevance of a document to a given query can be determined by the similarity of their embeddings (Dunn et al., 2017; Xiong et al., 2020; Neelakantan et al., 2022)(Zamani et al., 2018; Zhao et al., 2020), but also in document clustering (Angelov, 2020) and text classification (Sun et al., 2019). See Contributions section for full author list.
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team, null, Karlinsky, Leonid, Arbelle, Assaf, Daniels, Abraham, Nassar, Ahmed, Alfassi, Amit, Wu, Bo, Schwartz, Eli, Joshi, Dhiraj, Kondic, Jovana, Shabtay, Nimrod, Li, Pengyuan, Herzig, Roei, Abedin, Shafiq, Perek, Shaked, Harary, Sivan, Barzelay, Udi, Goldfarb, Adi Raz, Oliva, Aude, Wieles, Ben, Bhattacharjee, Bishwaranjan, Huang, Brandon, Auer, Christoph, Gutfreund, Dan, Beymer, David, Wood, David, Kuehne, Hilde, Hansen, Jacob, Shtok, Joseph, Wong, Ken, Bathen, Luis Angel, Mishra, Mayank, Lysak, Maksym, Dolfi, Michele, Yurochkin, Mikhail, Livathinos, Nikolaos, Harel, Nimrod, Azulai, Ophir, Naparstek, Oshri, de Lima, Rafael Teixeira, Panda, Rameswar, Doveh, Sivan, Gupta, Shubham, Das, Subhro, Zawad, Syed, Kim, Yusik, He, Zexue, Brooks, Alexander, Goodhart, Gabe, Govindjee, Anita, Leist, Derek, Ibrahim, Ibrahim, Soffer, Aya, Cox, David, Soule, Kate, Lastras, Luis, Desai, Nirmit, Ofek-koifman, Shila, Raghavan, Sriram, Syeda-Mahmood, Tanveer, Staar, Peter, Drory, Tal, Feris, Rogerio
Ensuring the safety of generative MLLMs is absolutely crucial in order to prevent harm, build trust, address ethical concerns, and enable their responsible deployment in real-world applications. Our results demonstrate that Granite Vision performs almost at par with baselines (despite being the lightest MLLM in the comparison pool) for VLM-as-a-Judge task. Notably, the addition of Safety Vectors to Granite Vision leads to a significant improvement in safety classification performance. We do acknowledge that further work needs to be done to improve high-level reasoning and correct occasional incorrect outputs to improve reliability in sensitive tasks, which require nuanced classification. To address these, we will incorporate more reasoning-focused and structure-related data into the training process in the future. In addition, we showed in this paper that finding safety vectors (SVs) in Granite Vision's attention heads led to significant improvements when safety tasks were reformulated as classification problems. Current reliance for SVs is on few-shot samples which are informative but may have limited scope in terms of capturing the range of possible safety issues that can be encountered. To further improve the model's ability to identify and address all safety concerns, we plan to investigate scaling up SVs using more training data in future research.
Enterprise Benchmarks for Large Language Model Evaluation
Zhang, Bing, Takeuchi, Mikio, Kawahara, Ryo, Asthana, Shubhi, Hossain, Md. Maruf, Ren, Guang-Jie, Soule, Kate, Zhu, Yada
The advancement of large language models (LLMs) has led to a greater challenge of having a rigorous and systematic evaluation of complex tasks performed, especially in enterprise applications. Therefore, LLMs need to be able to benchmark enterprise datasets for various tasks. This work presents a systematic exploration of benchmarking strategies tailored to LLM evaluation, focusing on the utilization of domain-specific datasets and consisting of a variety of NLP tasks. The proposed evaluation framework encompasses 25 publicly available datasets from diverse enterprise domains like financial services, legal, cyber security, and climate and sustainability. The diverse performance of 13 models across different enterprise tasks highlights the importance of selecting the right model based on the specific requirements of each task. Code and prompts are available on GitHub.
Large Language Model Routing with Benchmark Datasets
Shnitzer, Tal, Ou, Anthony, Silva, Mírian, Soule, Kate, Sun, Yuekai, Solomon, Justin, Thompson, Neil, Yurochkin, Mikhail
There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, no single model typically achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the best LLM out of a collection of models for new tasks. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this LLM selection, and we show that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets, where we consistently improve performance upon using any single model for all tasks.