Shoeybi, Mohammad
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Lee, Chankyu, Roy, Rajarshi, Xu, Mengyao, Raiman, Jonathan, Shoeybi, Mohammad, Catanzaro, Bryan, Ping, Wei
Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model with a variety of architectural designs and training procedures to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last
Nemotron-4 15B Technical Report
Parmar, Jupinder, Prabhumoye, Shrimai, Jennings, Joseph, Patwary, Mostofa, Subramanian, Sandeep, Su, Dan, Zhu, Chen, Narayanan, Deepak, Jhunjhunwala, Aastha, Dattagupta, Ayush, Jawa, Vibhu, Liu, Jiwei, Mahabaleshwarkar, Ameya, Nitski, Osvald, Brundyn, Annika, Maki, James, Martinez, Miguel, You, Jiaxuan, Kamalu, John, LeGresley, Patrick, Fridman, Denys, Casper, Jared, Aithal, Ashwath, Kuchaiev, Oleksii, Shoeybi, Mohammad, Cohen, Jonathan, Catanzaro, Bryan
For example, (Hoffmann et al., 2022) shows that given two roughly IsoFLOP GPT models with a similar data distribution, a 65-billion-parameter model on 1.4 trillion tokens and a 280-billion-parameter model on 300 billion tokens, the 65B model has better accuracy on downstream tasks. This trade-off of allocating compute towards training on more data as opposed to increasing model size is particularly appealing from an inference perspective, reducing latency and the amount of compute needed to serve models. As a consequence, a major focus of language modeling training efforts has shifted to collecting high-quality multi-trillion token datasets from public sources such as Common Crawl.
ODIN: Disentangled Reward Mitigates Hacking in RLHF
Chen, Lichang, Zhu, Chen, Soselia, Davit, Chen, Jiuhai, Zhou, Tianyi, Goldstein, Tom, Huang, Heng, Shoeybi, Mohammad, Catanzaro, Bryan
In this work, we study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in RL on mitigating length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length, and the other trained to decorrelate with length and therefore focus more on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
Wang, Boxin, Ping, Wei, McAfee, Lawrence, Xu, Peng, Li, Bo, Shoeybi, Mohammad, Catanzaro, Bryan
Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLM is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. Surprisingly, we find that one can ablate the encoder from InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. Our results highlight the promising direction to obtain a better GPT decoder through continued pretraining with retrieval before instruction tuning. Our code and checkpoints are publicly available at: https://github.com/NVIDIA/Megatron-LM/tree/InstructRetro/tools/retro.
Retrieval meets Long Context Large Language Models
Xu, Peng, Ping, Wei, Wu, Xianchao, McAfee, Lawrence, Zhu, Chen, Liu, Zihan, Subramanian, Sandeep, Bakhturina, Evelina, Shoeybi, Mohammad, Catanzaro, Bryan
Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented Llama2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context tasks including question answering, query-based summarization, and in-context few-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners. The long context large language models (LLM) have recently received a lot of attention in production (e.g., Anthropic, 2023; OpenAI, 2023b), research community (e.g., Chen et al., 2023; Liu et al., 2023; Tworkowski et al., 2023), and open source community (e.g., Kaiokendev, 2023).
ChatQA: Building GPT-4 Level Conversational QA Models
Liu, Zihan, Ping, Wei, Roy, Rajarshi, Xu, Peng, Lee, Chankyu, Shoeybi, Mohammad, Catanzaro, Bryan
In this work, we introduce ChatQA, a family of conversational question answering (QA) models that obtain GPT-4 level accuracies. Specifically, we propose a two-stage instruction tuning method that can significantly improve the zero-shot conversational QA results from large language models (LLMs). To handle retrieval-augmented generation in conversational QA, we fine-tune a dense retriever on a multi-turn QA dataset, which provides comparable results to using the state-of-the-art query rewriting model while largely reducing deployment cost. Notably, our ChatQA-70B can outperform GPT-4 in terms of average score on 10 conversational QA datasets (54.14 vs. 53.90), without relying on any synthetic data from OpenAI GPT models.
Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
Wang, Boxin, Ping, Wei, Xu, Peng, McAfee, Lawrence, Liu, Zihan, Shoeybi, Mohammad, Dong, Yi, Kuchaiev, Oleksii, Li, Bo, Xiao, Chaowei, Anandkumar, Anima, Catanzaro, Bryan
Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT incorporated at fine-tuning or inference stages. We first provide the recipe to reproduce RETRO up to 9.5B parameters while retrieving a text corpus with 330B tokens. Based on that, we have the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity with a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves open-domain QA results of original RETRO (e.g., EM score +8.6 on Natural Question) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our code and model at: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Yang, Zhuolin, Ping, Wei, Liu, Zihan, Korthikanti, Vijay, Nie, Weili, Huang, De-An, Fan, Linxi, Yu, Zhiding, Lan, Shiyi, Li, Bo, Liu, Ming-Yu, Zhu, Yuke, Shoeybi, Mohammad, Catanzaro, Bryan, Xiao, Chaowei, Anandkumar, Anima
Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained the state-of-the-art results in image-to-text generation. However, these models store all the knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating new data, requiring a computational-expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon the Flamingo, that supports retrieving the relevant knowledge from the external database for zero and in-context few-shot image-to-text generations. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating the database. We also construct an interleaved image and text data that facilitates in-context few-shot learning capabilities. We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks, especially for zero-shot and few-shot generation in out-of-domain settings with 4 times less parameters compared with baseline methods.
RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
Huang, Jie, Ping, Wei, Xu, Peng, Shoeybi, Mohammad, Chang, Kevin Chen-Chuan, Catanzaro, Bryan
In this paper, we investigate the in-context learning ability of retrieval-augmented encoder-decoder language models. We first conduct a comprehensive analysis of the state-of-the-art ATLAS model and identify its limitations in in-context learning, primarily due to a mismatch between pretraining and testing, as well as a restricted context length. To address these issues, we propose RAVEN, a model that combines retrieval-augmented masked language modeling and prefix language modeling. We further introduce Fusion-in-Context Learning to enhance the few-shot performance by enabling the model to leverage more in-context examples without requiring additional training or model modifications. Through extensive experiments, we demonstrate that RAVEN significantly outperforms ATLAS and achieves results comparable to the most advanced language models in certain scenarios, despite having substantially fewer parameters. Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning and encourages further research in this direction.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Workshop, BigScience, :, null, Scao, Teven Le, Fan, Angela, Akiki, Christopher, Pavlick, Ellie, Ilić, Suzana, Hesslow, Daniel, Castagné, Roman, Luccioni, Alexandra Sasha, Yvon, François, Gallé, Matthias, Tow, Jonathan, Rush, Alexander M., Biderman, Stella, Webson, Albert, Ammanamanchi, Pawan Sasanka, Wang, Thomas, Sagot, Benoît, Muennighoff, Niklas, del Moral, Albert Villanova, Ruwase, Olatunji, Bawden, Rachel, Bekman, Stas, McMillan-Major, Angelina, Beltagy, Iz, Nguyen, Huu, Saulnier, Lucile, Tan, Samson, Suarez, Pedro Ortiz, Sanh, Victor, Laurençon, Hugo, Jernite, Yacine, Launay, Julien, Mitchell, Margaret, Raffel, Colin, Gokaslan, Aaron, Simhi, Adi, Soroa, Aitor, Aji, Alham Fikri, Alfassy, Amit, Rogers, Anna, Nitzav, Ariel Kreisberg, Xu, Canwen, Mou, Chenghao, Emezue, Chris, Klamm, Christopher, Leong, Colin, van Strien, Daniel, Adelani, David Ifeoluwa, Radev, Dragomir, Ponferrada, Eduardo González, Levkovizh, Efrat, Kim, Ethan, Natan, Eyal Bar, De Toni, Francesco, Dupont, Gérard, Kruszewski, Germán, Pistilli, Giada, Elsahar, Hady, Benyamina, Hamza, Tran, Hieu, Yu, Ian, Abdulmumin, Idris, Johnson, Isaac, Gonzalez-Dios, Itziar, de la Rosa, Javier, Chim, Jenny, Dodge, Jesse, Zhu, Jian, Chang, Jonathan, Frohberg, Jörg, Tobing, Joseph, Bhattacharjee, Joydeep, Almubarak, Khalid, Chen, Kimbo, Lo, Kyle, Von Werra, Leandro, Weber, Leon, Phan, Long, allal, Loubna Ben, Tanguy, Ludovic, Dey, Manan, Muñoz, Manuel Romero, Masoud, Maraim, Grandury, María, Šaško, Mario, Huang, Max, Coavoux, Maximin, Singh, Mayank, Jiang, Mike Tian-Jian, Vu, Minh Chien, Jauhar, Mohammad A., Ghaleb, Mustafa, Subramani, Nishant, Kassner, Nora, Khamis, Nurulaqilla, Nguyen, Olivier, Espejel, Omar, de Gibert, Ona, Villegas, Paulo, Henderson, Peter, Colombo, Pierre, Amuok, Priscilla, Lhoest, Quentin, Harliman, Rheza, Bommasani, Rishi, López, Roberto Luis, Ribeiro, Rui, Osei, Salomey, Pyysalo, Sampo, Nagel, Sebastian, Bose, Shamik, Muhammad, Shamsuddeen Hassan, Sharma, Shanya, Longpre, Shayne, Nikpoor, Somaieh, Silberberg, Stanislav, Pai, Suhas, Zink, Sydney, Torrent, Tiago Timponi, Schick, Timo, Thrush, Tristan, Danchev, Valentin, Nikoulina, Vassilina, Laippala, Veronika, Lepercq, Violette, Prabhu, Vrinda, Alyafeai, Zaid, Talat, Zeerak, Raja, Arun, Heinzerling, Benjamin, Si, Chenglei, Taşar, Davut Emre, Salesky, Elizabeth, Mielke, Sabrina J., Lee, Wilson Y., Sharma, Abheesht, Santilli, Andrea, Chaffin, Antoine, Stiegler, Arnaud, Datta, Debajyoti, Szczechla, Eliza, Chhablani, Gunjan, Wang, Han, Pandey, Harshit, Strobelt, Hendrik, Fries, Jason Alan, Rozen, Jos, Gao, Leo, Sutawika, Lintang, Bari, M Saiful, Al-shaibani, Maged S., Manica, Matteo, Nayak, Nihal, Teehan, Ryan, Albanie, Samuel, Shen, Sheng, Ben-David, Srulik, Bach, Stephen H., Kim, Taewoon, Bers, Tali, Fevry, Thibault, Neeraj, Trishala, Thakker, Urmish, Raunak, Vikas, Tang, Xiangru, Yong, Zheng-Xin, Sun, Zhiqing, Brody, Shaked, Uri, Yallow, Tojarieh, Hadar, Roberts, Adam, Chung, Hyung Won, Tae, Jaesung, Phang, Jason, Press, Ofir, Li, Conglong, Narayanan, Deepak, Bourfoune, Hatim, Casper, Jared, Rasley, Jeff, Ryabinin, Max, Mishra, Mayank, Zhang, Minjia, Shoeybi, Mohammad, Peyrounette, Myriam, Patry, Nicolas, Tazi, Nouamane, Sanseviero, Omar, von Platen, Patrick, Cornette, Pierre, Lavallée, Pierre François, Lacroix, Rémi, Rajbhandari, Samyam, Gandhi, Sanchit, Smith, Shaden, Requena, Stéphane, Patil, Suraj, Dettmers, Tim, Baruwa, Ahmed, Singh, Amanpreet, Cheveleva, Anastasia, Ligozat, Anne-Laure, Subramonian, Arjun, Névéol, Aurélie, Lovering, Charles, Garrette, Dan, Tunuguntla, Deepak, Reiter, Ehud, Taktasheva, Ekaterina, Voloshina, Ekaterina, Bogdanov, Eli, Winata, Genta Indra, Schoelkopf, Hailey, Kalo, Jan-Christoph, Novikova, Jekaterina, Forde, Jessica Zosa, Clive, Jordan, Kasai, Jungo, Kawamura, Ken, Hazan, Liam, Carpuat, Marine, Clinciu, Miruna, Kim, Najoung, Cheng, Newton, Serikov, Oleg, Antverg, Omer, van der Wal, Oskar, Zhang, Rui, Zhang, Ruochen, Gehrmann, Sebastian, Mirkin, Shachar, Pais, Shani, Shavrina, Tatiana, Scialom, Thomas, Yun, Tian, Limisiewicz, Tomasz, Rieser, Verena, Protasov, Vitaly, Mikhailov, Vladislav, Pruksachatkun, Yada, Belinkov, Yonatan, Bamberger, Zachary, Kasner, Zdeněk, Rueda, Alice, Pestana, Amanda, Feizpour, Amir, Khan, Ammar, Faranak, Amy, Santos, Ana, Hevia, Anthony, Unldreaj, Antigona, Aghagol, Arash, Abdollahi, Arezoo, Tammour, Aycha, HajiHosseini, Azadeh, Behroozi, Bahareh, Ajibade, Benjamin, Saxena, Bharat, Ferrandis, Carlos Muñoz, McDuff, Daniel, Contractor, Danish, Lansky, David, David, Davis, Kiela, Douwe, Nguyen, Duong A., Tan, Edward, Baylor, Emi, Ozoani, Ezinwanne, Mirza, Fatima, Ononiwu, Frankline, Rezanejad, Habib, Jones, Hessie, Bhattacharya, Indrani, Solaiman, Irene, Sedenko, Irina, Nejadgholi, Isar, Passmore, Jesse, Seltzer, Josh, Sanz, Julio Bonis, Dutra, Livia, Samagaio, Mairon, Elbadri, Maraim, Mieskes, Margot, Gerchick, Marissa, Akinlolu, Martha, McKenna, Michael, Qiu, Mike, Ghauri, Muhammed, Burynok, Mykola, Abrar, Nafis, Rajani, Nazneen, Elkott, Nour, Fahmy, Nour, Samuel, Olanrewaju, An, Ran, Kromann, Rasmus, Hao, Ryan, Alizadeh, Samira, Shubber, Sarmad, Wang, Silas, Roy, Sourav, Viguier, Sylvain, Le, Thanh, Oyebade, Tobi, Le, Trieu, Yang, Yoyo, Nguyen, Zach, Kashyap, Abhinav Ramesh, Palasciano, Alfredo, Callahan, Alison, Shukla, Anima, Miranda-Escalada, Antonio, Singh, Ayush, Beilharz, Benjamin, Wang, Bo, Brito, Caio, Zhou, Chenxi, Jain, Chirag, Xu, Chuxin, Fourrier, Clémentine, Periñán, Daniel León, Molano, Daniel, Yu, Dian, Manjavacas, Enrique, Barth, Fabio, Fuhrimann, Florian, Altay, Gabriel, Bayrak, Giyaseddin, Burns, Gully, Vrabec, Helena U., Bello, Imane, Dash, Ishani, Kang, Jihyun, Giorgi, John, Golde, Jonas, Posada, Jose David, Sivaraman, Karthik Rangasai, Bulchandani, Lokesh, Liu, Lu, Shinzato, Luisa, de Bykhovetz, Madeleine Hahn, Takeuchi, Maiko, Pàmies, Marc, Castillo, Maria A, Nezhurina, Marianna, Sänger, Mario, Samwald, Matthias, Cullan, Michael, Weinberg, Michael, De Wolf, Michiel, Mihaljcic, Mina, Liu, Minna, Freidank, Moritz, Kang, Myungsun, Seelam, Natasha, Dahlberg, Nathan, Broad, Nicholas Michio, Muellner, Nikolaus, Fung, Pascale, Haller, Patrick, Chandrasekhar, Ramya, Eisenberg, Renata, Martin, Robert, Canalli, Rodrigo, Su, Rosaline, Su, Ruisi, Cahyawijaya, Samuel, Garda, Samuele, Deshmukh, Shlok S, Mishra, Shubhanshu, Kiblawi, Sid, Ott, Simon, Sang-aroonsiri, Sinee, Kumar, Srishti, Schweter, Stefan, Bharati, Sushil, Laud, Tanmay, Gigant, Théo, Kainuma, Tomoya, Kusa, Wojciech, Labrak, Yanis, Bajaj, Yash Shailesh, Venkatraman, Yash, Xu, Yifan, Xu, Yingxin, Xu, Yu, Tan, Zhe, Xie, Zhongli, Ye, Zifan, Bras, Mathilde, Belkada, Younes, Wolf, Thomas
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.