Cox, David
Granite Embedding Models
Awasthy, Parul, Trivedi, Aashka, Li, Yulong, Bornea, Mihaela, Cox, David, Daniels, Abraham, Franz, Martin, Goodhart, Gabe, Iyer, Bhavani, Kumar, Vishwajeet, Lastras, Luis, McCarley, Scott, Murthy, Rudra, P, Vignesh, Rosenthal, Sara, Roukos, Salim, Sen, Jaydeep, Sharma, Sukriti, Sil, Avirup, Soule, Kate, Sultan, Arafat, Florian, Radu
We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense- and sparse-retrieval architectures, with both English and multilingual capabilities. This report provides the technical details of training these highly effective 12-layer embedding models, along with their efficient 6-layer distilled counterparts. Extensive evaluations show that the models, developed with techniques such as retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging, significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use, at https://huggingface.co/collections/ibm-granite.

Figure 1: Average performance of the Granite Embedding models (in blue) vs. BGE, GTE, Snowflake, E5, and Nomic models on 5 QA and IR datasets: BEIR, ClapNQ, CoIR, RedHat, and UnifiedSearch (the last 2 are internal IBM datasets).

The goal of text embedding models is to convert variable-length text into a fixed vector, encoding the text semantics into a multidimensional vector such that semantically close texts are close in the vector space, while dissimilar texts have low similarity. These embeddings can then be used in a variety of tasks, most commonly in retrieval applications, where the relevance of a document to a given query can be determined by the similarity of their embeddings (Dunn et al., 2017; Xiong et al., 2020; Neelakantan et al., 2022; Zamani et al., 2018; Zhao et al., 2020), but also in document clustering (Angelov, 2020) and text classification (Sun et al., 2019).
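As a quick illustration of the retrieval use case described above, the hedged sketch below embeds a query and a few documents and ranks the documents by cosine similarity using the sentence-transformers library; the specific model ID is assumed from the linked Hugging Face collection and could be swapped for any other Granite Embedding checkpoint.

```python
# Minimal retrieval sketch with a Granite Embedding model via sentence-transformers.
# The model ID below is assumed from the linked Hugging Face collection; any other
# Granite Embedding checkpoint should work the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

query = "How do I rotate an API key?"
documents = [
    "To rotate an API key, revoke the old key and generate a new one in the console.",
    "Our office is closed on public holidays.",
    "Embedding models map text to fixed-size vectors.",
]

# Encode query and documents into fixed-size vectors (normalized for cosine similarity).
query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)

# Rank documents by cosine similarity to the query; the most relevant one scores highest.
scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```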
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team, null, Karlinsky, Leonid, Arbelle, Assaf, Daniels, Abraham, Nassar, Ahmed, Alfassi, Amit, Wu, Bo, Schwartz, Eli, Joshi, Dhiraj, Kondic, Jovana, Shabtay, Nimrod, Li, Pengyuan, Herzig, Roei, Abedin, Shafiq, Perek, Shaked, Harary, Sivan, Barzelay, Udi, Goldfarb, Adi Raz, Oliva, Aude, Wieles, Ben, Bhattacharjee, Bishwaranjan, Huang, Brandon, Auer, Christoph, Gutfreund, Dan, Beymer, David, Wood, David, Kuehne, Hilde, Hansen, Jacob, Shtok, Joseph, Wong, Ken, Bathen, Luis Angel, Mishra, Mayank, Lysak, Maksym, Dolfi, Michele, Yurochkin, Mikhail, Livathinos, Nikolaos, Harel, Nimrod, Azulai, Ophir, Naparstek, Oshri, de Lima, Rafael Teixeira, Panda, Rameswar, Doveh, Sivan, Gupta, Shubham, Das, Subhro, Zawad, Syed, Kim, Yusik, He, Zexue, Brooks, Alexander, Goodhart, Gabe, Govindjee, Anita, Leist, Derek, Ibrahim, Ibrahim, Soffer, Aya, Cox, David, Soule, Kate, Lastras, Luis, Desai, Nirmit, Ofek-koifman, Shila, Raghavan, Sriram, Syeda-Mahmood, Tanveer, Staar, Peter, Drory, Tal, Feris, Rogerio
Ensuring the safety of generative MLLMs is absolutely crucial in order to prevent harm, build trust, address ethical concerns, and enable their responsible deployment in real-world applications. Our results demonstrate that Granite Vision performs almost on par with baselines (despite being the lightest MLLM in the comparison pool) for the VLM-as-a-Judge task. Notably, the addition of Safety Vectors to Granite Vision leads to a significant improvement in safety classification performance. We acknowledge that further work is needed to improve high-level reasoning and to correct occasional incorrect outputs, so as to improve reliability in sensitive tasks that require nuanced classification. To address these issues, we will incorporate more reasoning-focused and structure-related data into the training process in the future. In addition, we showed in this paper that finding safety vectors (SVs) in Granite Vision's attention heads led to significant improvements when safety tasks were reformulated as classification problems. SVs currently rely on few-shot samples, which are informative but may be limited in how well they capture the range of safety issues that can be encountered. To further improve the model's ability to identify and address safety concerns, we plan to investigate scaling up SVs using more training data in future research.
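For intuition only, here is a minimal sketch of the general idea of turning a safety check into a classification problem using a direction ("safety vector") estimated from a few labeled examples; it is not the paper's actual procedure, and the activations, dimension, and threshold are invented placeholders.

```python
# Illustrative sketch (not the paper's exact procedure) of using a per-head "safety
# vector" derived from few-shot examples to turn a safety check into classification.
# Activations are stand-in random arrays; in practice they would be attention-head
# outputs extracted from the vision-language model for each example.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # assumed attention-head output dimension

# Few-shot activations for unsafe and safe examples (placeholders).
unsafe_acts = rng.normal(0.5, 1.0, size=(8, d))
safe_acts = rng.normal(-0.5, 1.0, size=(8, d))

# A simple safety vector: the difference of class means, normalized.
sv = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
sv /= np.linalg.norm(sv)

def classify(activation: np.ndarray, threshold: float = 0.0) -> str:
    """Project the activation onto the safety vector; large projections flag 'unsafe'."""
    return "unsafe" if float(activation @ sv) > threshold else "safe"

print(classify(rng.normal(0.5, 1.0, size=d)))   # likely "unsafe"
print(classify(rng.normal(-0.5, 1.0, size=d)))  # likely "safe"
```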
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Shen, Maohao, Zeng, Guangtao, Qi, Zhenting, Hong, Zhang-Wei, Chen, Zhenfang, Lu, Wei, Wornell, Gregory, Das, Subhro, Cox, David, Gan, Chuang
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process ...).

Large language models (LLMs) have demonstrated remarkable performance across a wide range of reasoning tasks, including mathematical problems (Cobbe et al., 2021; Hendrycks et al., 2021a), programming (Chen et al., 2021; Zhuo et al., 2024) and logical reasoning (Han et al., 2024; Liu et al., 2020). One of the key techniques enabling these strong reasoning capabilities is Chain-of-Thought (CoT) prompting (Wei et al., 2022), which allows LLMs to address complex tasks by generating a series of intermediate reasoning steps. As a result, many early efforts focus on finetuning LLMs using large-scale, high-quality CoT reasoning chains, either through human annotation (Hendrycks et al., 2021a; Yue et al., 2024) or by distilling synthetic data from more advanced models (Yu et al., 2024; Toshniwal et al., 2024a; Ding et al., 2024). However, human annotation is extremely labor intensive, and distillation often limits the model's reasoning capabilities to a certain level.
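To make the "two-player" test-time setup concrete, the sketch below shows a generic best-of-N scheme in which a generator samples candidate solutions and an external verifier picks the highest-scoring one; generate() and verify() are hypothetical stand-ins for model calls, and Satori's question is precisely whether this search can be internalized in a single model rather than delegated to an external verifier.

```python
# Sketch of a generic best-of-N, generator-plus-verifier ("two-player") setup.
# generate() and verify() are hypothetical placeholders for LLM calls.
import random

def generate(question: str, temperature: float = 0.8) -> str:
    """Placeholder for sampling one chain-of-thought solution from the generator LLM."""
    return f"candidate solution {random.random():.3f} for: {question}"

def verify(question: str, solution: str) -> float:
    """Placeholder for an external verifier that scores a candidate solution."""
    return random.random()

def best_of_n(question: str, n: int = 16) -> str:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate(question) for _ in range(n)]
    scores = [verify(question, c) for c in candidates]
    return max(zip(candidates, scores), key=lambda cs: cs[1])[0]

print(best_of_n("What is 17 * 24?"))
```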
Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks
Fu, Yonggan, Yu, Qixuan, Zhang, Yang, Wu, Shang, Ouyang, Xu, Cox, David, Lin, Yingyan Celine
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions. To tackle this, adversarial training is currently the most effective defense method, by augmenting the training set with adversarial samples generated on the fly. Interestingly, we discover for the first time that there exist subnetworks with inborn robustness, matching or surpassing the robust accuracy of the adversarially trained networks with comparable model sizes, within randomly initialized networks without any model training, indicating that adversarial training on model weights is not indispensable towards adversarial robustness. We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient. Distinct from the popular lottery ticket hypothesis, neither the original dense networks nor the identified RSTs need to be trained. To validate and understand this fascinating finding, we further conduct extensive experiments to study the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights regarding the relationship between DNNs' robustness and their initialization/overparameterization. Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a Random RST Switch (R2S) technique, which randomly switches between different RSTs, as a novel defense method built on top of RSTs. We believe our findings about RSTs have opened up a new perspective to study model robustness and extend the lottery ticket hypothesis.
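As a rough illustration of searching for a subnetwork inside a frozen, randomly initialized layer, the sketch below learns a score per weight and keeps only the top-scoring fraction (in the spirit of edge-popup-style subnetwork search); the paper's actual procedure additionally drives this search with adversarial objectives, which is omitted here, and all sizes are illustrative.

```python
# Illustrative PyTorch sketch of finding a subnetwork inside a frozen, randomly
# initialized layer: weights are never trained, only a score per weight is learned,
# and the forward pass keeps the top-scoring weights via a straight-through mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, sparsity):
        k = int((1 - sparsity) * scores.numel())          # number of weights to keep
        threshold = scores.flatten().kthvalue(scores.numel() - k + 1).values
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                          # straight-through estimator

class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity=0.5):
        super().__init__()
        # Frozen random weights; only the scores receive gradients.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.scores = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.sparsity = sparsity

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.sparsity)
        return F.linear(x, self.weight * mask)

layer = MaskedLinear(16, 4)
out = layer(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 4])
```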
The infrastructure powering IBM's Gen AI model development
Gershon, Talia, Seelam, Seetharami, Belgodere, Brian, Bonilla, Milton, Hoang, Lan, Barnett, Danny, Chung, I-Hsin, Mohan, Apoorve, Chen, Ming-Hung, Luo, Lixiang, Walkup, Robert, Evangelinos, Constantinos, Salaria, Shweta, Dombrowa, Marc, Park, Yoonho, Kayi, Apo, Schour, Liran, Alim, Alim, Sydney, Ali, Maniotis, Pavlos, Schares, Laurent, Metzler, Bernard, Karacali-Akyamac, Bengi, Wen, Sophia, Chiba, Tatsuhiro, Choochotkaew, Sunyanan, Yoshimura, Takeshi, Misale, Claudia, Elengikal, Tonia, Connor, Kevin O, Liu, Zhuoran, Molina, Richard, Schneidenbach, Lars, Caden, James, Laibinis, Christopher, Fonseca, Carlos, Tarasov, Vasily, Sundararaman, Swaminathan, Schmuck, Frank, Guthridge, Scott, Cohn, Jeremy, Eshel, Marc, Muench, Paul, Liu, Runyu, Pointer, William, Wyskida, Drew, Krull, Bob, Rose, Ray, Wolfe, Brent, Cornejo, William, Walter, John, Malone, Colm, Perucci, Clifford, Franco, Frank, Hinds, Nigel, Calio, Bob, Druyan, Pavel, Kilduff, Robert, Kienle, John, McStay, Connor, Figueroa, Andrew, Connolly, Matthew, Fost, Edie, Roma, Gina, Fonseca, Jake, Levy, Ido, Payne, Michele, Schenkel, Ryan, Malki, Amir, Schneider, Lion, Narkhede, Aniruddha, Moshref, Shekeba, Kisin, Alexandra, Dodin, Olga, Rippon, Bill, Wrieth, Henry, Ganci, John, Colino, Johnny, Habeger-Rose, Donna, Pandey, Rakesh, Gidh, Aditya, Gaur, Aditya, Patterson, Dennis, Salmani, Samsuddin, Varma, Rambilas, Rumana, Rumana, Sharma, Shubham, Gaur, Aditya, Mishra, Mayank, Panda, Rameswar, Prasad, Aditya, Stallone, Matt, Zhang, Gaoyuan, Shen, Yikang, Cox, David, Puri, Ruchir, Agrawal, Dakshi, Thorstensen, Drew, Belog, Joel, Tang, Brent, Gupta, Saurabh Kumar, Biswas, Amitabha, Maheshwari, Anup, Gampel, Eran, Van Patten, Jason, Runion, Matthew, Kaki, Sai, Bogin, Yigal, Reitz, Brian, Pritko, Steve, Najam, Shahan, Nambala, Surya, Chirra, Radhika, Welp, Rick, DiMitri, Frank, Telles, Felipe, Arvelo, Amilcar, Chu, King, Seminaro, Ed, Schram, Andrew, Eickhoff, Felix, Hanson, William, Mckeever, Eric, Joseph, Dinakaran, Chaudhary, Piyush, Shivam, Piyush, Chaudhary, Puneet, Jones, Wesley, Guthrie, Robert, Bostic, Chris, Islam, Rezaul, Duersch, Steve, Sawdon, Wayne, Lewars, John, Klos, Matthew, Spriggs, Michael, McMillan, Bill, Gao, George, Kamra, Ashish, Singh, Gaurav, Curry, Marc, Katarki, Tushar, Talerico, Joe, Shi, Zenghui, Malleni, Sai Sindhur, Gallen, Erwan
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Abdelaziz, Ibrahim, Basu, Kinjal, Agarwal, Mayank, Kumaravel, Sadhana, Stallone, Matthew, Panda, Rameswar, Rizk, Yara, Bhargav, GP, Crouse, Maxwell, Gunasekara, Chulaka, Ikbal, Shajith, Joshi, Sachin, Karanam, Hima, Kumar, Vineet, Munawar, Asim, Neelam, Sumit, Raghu, Dinesh, Sharma, Udit, Soria, Adriana Meza, Sreedhar, Dheeraj, Venkateswaran, Praveen, Unuvar, Merve, Cox, David, Roukos, Salim, Lastras, Luis, Kapanipathi, Pavan
Large language models (LLMs) have recently shown tremendous promise in serving as the backbone of agentic systems, as demonstrated by their performance on multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application programming interfaces (APIs) to complete complex tasks; these tasks are collectively termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or a calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model, released under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets, comparing GRANITE-20B-FUNCTIONCALLING to more than 15 of the best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and ranks fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING generalizes better across multiple tasks in seven different evaluation datasets.
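To illustrate what a function-calling example might look like, the snippet below shows a hypothetical tool specification, a user query, and the structured parallel calls a model would be expected to emit; the schema is invented for illustration and is not the exact format used to train GRANITE-20B-FUNCTIONCALLING.

```python
# Hypothetical example of function-calling data: a tool spec plus a user query, with
# the expected output being one or more structured (here, parallel) function calls.
import json

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

user_query = "What's the weather in Zurich and in Rome right now?"

# Expected output for a parallel-functions example: two calls to the same tool with
# different parameter-value pairs.
expected_calls = [
    {"name": "get_weather", "arguments": {"city": "Zurich"}},
    {"name": "get_weather", "arguments": {"city": "Rome"}},
]

print(json.dumps({"tools": tools, "query": user_query, "calls": expected_calls}, indent=2))
```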
$\textit{Trans-LoRA}$: towards data-free Transferable Parameter Efficient Finetuning
Wang, Runqian, Ghosh, Soumya, Cox, David, Antognini, Diego, Oliva, Aude, Feris, Rogerio, Karlinsky, Leonid
Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modules need to be re-trained. Such re-training requires access to the data used to train the LoRA for the original base model. This is especially problematic for commercial cloud applications where the LoRA modules and the base models are hosted by service providers who may not be allowed to host proprietary client task data. To address this challenge, we propose $\textit{Trans-LoRA}$ -- a novel method for lossless, nearly data-free transfer of LoRAs across base models. Our approach relies on synthetic data to transfer LoRA modules. Using large language models, we design a synthetic data generator to approximate the data-generating process of the $\textit{observed}$ task data subset. Training on the resulting synthetic dataset transfers LoRA modules to new models. We show the effectiveness of our approach using both LLama and Gemma model families. Our approach achieves lossless (mostly improved) LoRA transfer between models within and across different base model families, and even between different PEFT methods, on a wide variety of tasks.
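The sketch below lays out that transfer recipe at a high level, with hypothetical helper functions (synthesize_examples, label_with_source, train_lora) standing in for the actual generator, teacher, and PEFT training code; it is a structural illustration under those assumptions, not the paper's implementation.

```python
# High-level sketch of the transfer recipe: synthesize task-like data with an LLM
# generator seeded on the observed subset, label it with the old base model + LoRA
# (the teacher), then train a fresh LoRA on the new base model from those labels.
def synthesize_examples(observed_subset, n):
    """Placeholder: prompt an LLM-based generator to imitate the observed task data."""
    return [f"synthetic example {i} resembling {len(observed_subset)} seeds" for i in range(n)]

def label_with_source(source_model_plus_lora, examples):
    """Placeholder: the old base model + LoRA acts as the teacher and labels the data."""
    return [(x, f"teacher output for '{x}'") for x in examples]

def train_lora(target_base_model, labeled_data):
    """Placeholder: fit a new LoRA on the target base model from the teacher labels."""
    return {"base": target_base_model, "adapter": f"lora trained on {len(labeled_data)} pairs"}

observed_subset = ["seed task example 1", "seed task example 2"]
synthetic = synthesize_examples(observed_subset, n=1000)
labeled = label_with_source("old-base+lora", synthetic)
new_lora = train_lora("new-base", labeled)
print(new_lora["adapter"])
```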
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Sun, Zhiqing, Shen, Yikang, Zhou, Qinhong, Zhang, Hongxin, Chen, Zhenfang, Cox, David, Yang, Yiming, Gan, Chuang
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principle application) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly, without the principle set and the demonstrations; and finally, we offer a refinement step to address the issues of overly brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
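As a rough sketch of the principle-driven second stage, the snippet below assembles an in-context prompt from a few principles and a demonstration of applying them; the principles and demonstration shown are invented for illustration and are not the ones used for Dromedary.

```python
# Sketch of a principle-driven in-context prompt: human-written principles plus a
# demonstration of applying them, followed by the (synthetic) user query.
# All principles and the demonstration below are invented placeholders.
principles = [
    "1 (ethical). The assistant should refuse to help with harmful or illegal requests.",
    "2 (informative). The assistant should give accurate, relevant, and helpful answers.",
    "3 (candor). The assistant should admit when it does not know the answer.",
]

demonstrations = [
    "User: How do I pick a lock?\n"
    "Assistant (applying principle 1): I can't help with that, but I can suggest "
    "contacting a licensed locksmith if you're locked out.",
]

def build_self_align_prompt(query: str) -> str:
    """Compose the principles, demonstrations, and query into a single prompt."""
    return (
        "You are an AI assistant that follows these principles:\n"
        + "\n".join(principles)
        + "\n\nExamples of applying the principles:\n"
        + "\n\n".join(demonstrations)
        + f"\n\nUser: {query}\nAssistant:"
    )

print(build_self_align_prompt("What is the capital of Australia?"))
```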
Audio-Visual Neural Syntax Acquisition
Lai, Cheng-I Jeff, Shi, Freda, Peng, Puyuan, Kim, Yoon, Gimpel, Kevin, Chang, Shiyu, Chuang, Yung-Sung, Bhati, Saurabhchand, Cox, David, Harwath, David, Zhang, Yang, Livescu, Karen, Glass, James
We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.
SALMON: Self-Alignment with Principle-Following Reward Models
Sun, Zhiqing, Shen, Yikang, Zhang, Hongxin, Zhou, Qinhong, Chen, Zhenfang, Cox, David, Yang, Yiming, Gan, Chuang
Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents. However, a significant limitation of such an approach is its dependency on high-quality human annotations, making its application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and in-distribution response preferences. This paper presents a novel approach, namely SALMON (Self-ALignMent with principle-fOllowiNg reward models), to align base language models with minimal human supervision, using only a small set of human-defined principles, yet achieving superior performance. Central to our approach is a principle-following reward model. Trained on synthetic preference data, this model can generate reward scores based on arbitrary human-defined principles. By merely adjusting these principles during the RL training phase, we gain full control over the preferences with the reward model, subsequently influencing the behavior of the RL-trained policies, and eliminating the reliance on the collection of online human preferences. Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have open-sourced the code and model weights to encourage further research into aligning LLM-based AI agents with enhanced supervision efficiency, improved controllability, and scalable oversight.
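To convey the idea of a principle-following reward model, the toy sketch below conditions a stand-in reward function on a human-readable principle and shows how changing the principle changes which response is preferred; reward_score and its heuristic are hypothetical placeholders, not the trained model or its behavior.

```python
# Toy sketch of a principle-following reward signal: the reward is conditioned on an
# arbitrary human-readable principle, so swapping the principle at RL time changes
# which responses are preferred. reward_score() is a hypothetical stand-in.
def reward_score(principle: str, prompt: str, response: str) -> float:
    """Placeholder: a trained principle-following reward model would return a scalar here."""
    # Toy heuristic purely for illustration: prefer short answers under a "concise"
    # principle and long answers otherwise.
    return float(-len(response)) if "concise" in principle else float(len(response))

prompt = "Explain photosynthesis."
responses = [
    "Plants turn sunlight, water, and CO2 into sugar and oxygen.",
    "Photosynthesis is a multi-stage process in which chloroplasts capture light energy, "
    "split water in the light reactions, and fix carbon in the Calvin cycle.",
]

for principle in ["The assistant should be concise.", "The assistant should be detailed."]:
    best = max(responses, key=lambda r: reward_score(principle, prompt, r))
    print(f"{principle} -> preferred response: {best[:60]}...")
```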