Awan, Ammar Ahmad
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Abdin, Marah, Jacobs, Sam Ade, Awan, Ammar Ahmad, Aneja, Jyoti, Awadallah, Ahmed, Awadalla, Hany, Bach, Nguyen, Bahree, Amit, Bakhtiari, Arash, Bao, Jianmin, Behl, Harkirat, Benhaim, Alon, Bilenko, Misha, Bjorck, Johan, Bubeck, Sébastien, Cai, Qin, Cai, Martin, Mendes, Caio César Teodoro, Chen, Weizhu, Chaudhary, Vishrav, Chen, Dong, Chen, Dongdong, Chen, Yen-Chun, Chen, Yi-Ling, Chopra, Parul, Dai, Xiyang, Del Giorno, Allie, de Rosa, Gustavo, Dixon, Matthew, Eldan, Ronen, Fragoso, Victor, Iter, Dan, Gao, Mei, Gao, Min, Gao, Jianfeng, Garg, Amit, Goswami, Abhishek, Gunasekar, Suriya, Haider, Emman, Hao, Junheng, Hewett, Russell J., Huynh, Jamie, Javaheripi, Mojan, Jin, Xin, Kauffmann, Piero, Karampatziakis, Nikos, Kim, Dongwoo, Khademi, Mahoud, Kurilenko, Lev, Lee, James R., Lee, Yin Tat, Li, Yuanzhi, Li, Yunsheng, Liang, Chen, Liden, Lars, Liu, Ce, Liu, Mengchen, Liu, Weishung, Lin, Eric, Lin, Zeqi, Luo, Chong, Madan, Piyush, Mazzola, Matt, Mitra, Arindam, Modi, Hardik, Nguyen, Anh, Norick, Brandon, Patra, Barun, Perez-Becker, Daniel, Portet, Thomas, Pryzant, Reid, Qin, Heyang, Radmilac, Marko, Rosset, Corby, Roy, Sambudha, Ruwase, Olatunji, Saarikivi, Olli, Saied, Amin, Salim, Adil, Santacroce, Michael, Shah, Shital, Shang, Ning, Sharma, Hiteshi, Shukla, Swadheen, Song, Xia, Tanaka, Masahiro, Tupini, Andrea, Wang, Xin, Wang, Lijuan, Wang, Chunyu, Wang, Yu, Ward, Rachel, Wang, Guanhua, Witte, Philipp, Wu, Haiping, Wyatt, Michael, Xiao, Bin, Xu, Can, Xu, Jiahang, Xu, Weijian, Yadav, Sonali, Yang, Fan, Yang, Jianwei, Yang, Ziyi, Yang, Yifan, Yu, Donghan, Yuan, Lu, Zhang, Chengruidong, Zhang, Cyril, Zhang, Jianwen, Zhang, Li Lyna, Zhang, Yi, Zhang, Yue, Zhang, Yunan, Zhou, Xiren
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
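As a quick illustration of what deploying such a small model looks like in practice, the sketch below loads a phi-3-mini checkpoint through Hugging Face transformers and runs a chat-format prompt. The checkpoint name, dtype, and generation settings are illustrative assumptions, not details taken from the report.

```python
# A minimal sketch, assuming the "microsoft/Phi-3-mini-4k-instruct" checkpoint name
# and recent versions of transformers/torch; not the report's own evaluation setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 3.8B parameters fit comfortably in bf16 on one device
    device_map="auto",
)

# The model is aligned for chat format, so the prompt is built from chat messages.
messages = [{"role": "user", "content": "Explain why the sky is blue in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```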
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Holmes, Connor, Tanaka, Masahiro, Wyatt, Michael, Awan, Ammar Ahmad, Rasley, Jeff, Rajbhandari, Samyam, Aminabadi, Reza Yazdani, Qin, Heyang, Bakhtiari, Arash, Kurilenko, Lev, He, Yuxiong
The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.
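The abstract mentions both non-persistent and persistent deployment options; the sketch below shows the two styles following the DeepSpeed-MII quick-start pattern. The model name is a placeholder assumption, and the exact client API may vary across MII releases.

```python
# A minimal sketch of the two deployment modes, based on the DeepSpeed-MII quick-start
# pattern; the model name is a placeholder assumption.
import mii

# Non-persistent deployment: the inference engine lives inside this Python process.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed-FastGen is"], max_new_tokens=64))

# Persistent deployment: a long-running server that many clients can query.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed-FastGen is"], max_new_tokens=64))
client.terminate_server()
```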
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Yao, Zhewei, Wu, Xiaoxia, Li, Conglong, Zhang, Minjia, Qin, Heyang, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong
Most existing multi-modal models, hindered by their inability to handle interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in training resources and data accessibility, limiting their adaptability and scalability across varied interaction scenarios. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to ensure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability, up to a 70B-parameter language model, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.
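To make the multi-modal causal attention idea more concrete, here is one plausible reading of the masking it implies for an interleaved image/text sequence: text tokens attend causally to everything before them, while image tokens attend only among themselves. This is a conceptual sketch under those assumptions, not the paper's exact formulation.

```python
# Conceptual sketch of a causal mask over an interleaved image/text sequence; the
# token layout and the is_image marker are illustrative assumptions.
import torch

def multimodal_causal_mask(is_image: torch.Tensor) -> torch.Tensor:
    """is_image: bool tensor of shape [seq_len]; True marks an image token.
    Returns a [seq_len, seq_len] bool mask where entry (q, k) == True means
    query position q may attend to key position k."""
    seq_len = is_image.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()  # standard causal mask
    image_query = is_image[:, None]  # [seq_len, 1]
    image_key = is_image[None, :]    # [1, seq_len]
    # Image-token queries see only earlier image tokens; text queries keep the full causal view.
    return torch.where(image_query, causal & image_key, causal)

# Example: two image tokens followed by three text tokens.
print(multimodal_causal_mask(torch.tensor([True, True, False, False, False])))
```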
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
Song, Shuaiwen Leon, Kruft, Bonnie, Zhang, Minjia, Li, Conglong, Chen, Shiyang, Zhang, Chengming, Tanaka, Masahiro, Wu, Xiaoxia, Rasley, Jeff, Awan, Ammar Ahmad, Holmes, Connor, Cai, Martin, Ghanem, Adam, Zhou, Zhongzhu, He, Yuxiong, Luferenko, Pete, Kumar, Divya, Weyn, Jonathan, Zhang, Ruixiong, Klocek, Sylwester, Vragov, Volodymyr, AlQuraishi, Mohammed, Ahdritz, Gustaf, Floristean, Christina, Negri, Cristina, Kotamarthi, Rao, Vishwanath, Venkatram, Ramanathan, Arvind, Foreman, Sam, Hippe, Kyle, Arcomano, Troy, Maulik, Romit, Zvyagin, Maxim, Brace, Alexander, Zhang, Bin, Bohorquez, Cindy Orozco, Clyde, Austin, Kale, Bharat, Perez-Rivera, Danilo, Ma, Heng, Mann, Carla M., Irvin, Michael, Pauloski, J. Gregory, Ward, Logan, Hayot, Valerie, Emani, Murali, Xie, Zhen, Lin, Diangen, Shukla, Maulik, Foster, Ian, Davis, James J., Papka, Michael E., Brettin, Thomas, Balaprakash, Prasanna, Tourassi, Gina, Gounley, John, Hanson, Heidi, Potok, Thomas E, Pasini, Massimiliano Lupo, Evans, Kate, Lu, Dan, Lunga, Dalton, Yin, Junqi, Dash, Sajal, Wang, Feiyi, Shankar, Mallikarjun, Lyngaas, Isaac, Wang, Xiao, Cong, Guojing, Zhang, Pei, Fan, Ming, Liu, Siyan, Hoisie, Adolfy, Yoo, Shinjae, Ren, Yihui, Tang, William, Felker, Kyle, Svyatkovskiy, Alexey, Liu, Hang, Aji, Ashwin, Dalton, Angela, Schulte, Michael, Schulz, Karl, Deng, Yuntian, Nie, Weili, Romero, Josh, Dallago, Christian, Vahdat, Arash, Xiao, Chaowei, Gibbs, Thomas, Anandkumar, Anima, Stevens, Rick
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai), which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference, and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we have made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
Yao, Zhewei, Aminabadi, Reza Yazdani, Ruwase, Olatunji, Rajbhandari, Samyam, Wu, Xiaoxia, Awan, Ammar Ahmad, Rasley, Jeff, Zhang, Minjia, Li, Conglong, Holmes, Connor, Zhou, Zhongzhu, Wyatt, Michael, Smith, Molly, Kurilenko, Lev, Qin, Heyang, Tanaka, Masahiro, Che, Shuai, Song, Shuaiwen Leon, He, Yuxiong
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning from Human Feedback) training pipeline for these powerful models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a novel system that democratizes RLHF training, making it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines various optimizations for training and inference in a unified way. The system delivers unparalleled efficiency and scalability, enabling training of models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI.
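The final stage of an InstructGPT-style pipeline optimizes the actor model with a PPO objective against rewards from the reward model. The sketch below shows just that clipped-surrogate policy loss on toy per-token values; it is a didactic illustration of the objective, not DeepSpeed-Chat's implementation.

```python
# A minimal, self-contained sketch of the PPO clipped-surrogate loss used in
# RLHF stage 3; tensors stand in for per-token quantities that a real
# actor/critic/reward setup would produce.
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_ratio=0.2):
    """Clipped-surrogate PPO loss over a batch of generated tokens."""
    ratio = torch.exp(logprobs - old_logprobs)        # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.minimum(unclipped, clipped).mean()  # negate to maximize the surrogate

# Toy example with made-up per-token values.
logprobs = torch.tensor([-1.0, -0.8, -1.2])
old_logprobs = torch.tensor([-1.1, -0.9, -1.0])
advantages = torch.tensor([0.5, -0.2, 0.3])
print(ppo_policy_loss(logprobs, old_logprobs, advantages))
```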
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Singh, Siddharth, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong, Bhatele, Abhinav
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e., without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
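For intuition, the sketch below shows how a pool of GPUs can be factored into the tensor-, expert-, and data-parallel groups that a three-dimensional hybrid algorithm needs. The axis ordering and rank layout are illustrative assumptions, not DeepSpeed-TED's actual process-group construction.

```python
# A small sketch of factoring ranks into 3D parallel groups; a real system would
# pass these rank lists to torch.distributed.new_group after initialization.
from itertools import product

def three_d_groups(world_size: int, tp: int, ep: int):
    """Return (tensor_groups, expert_groups, data_groups) as lists of rank lists."""
    assert world_size % (tp * ep) == 0
    dp = world_size // (tp * ep)

    def rank(d, e, t):
        # Flatten a (data, expert, tensor) grid of ranks.
        return (d * ep + e) * tp + t

    tensor_groups = [[rank(d, e, t) for t in range(tp)]
                     for d, e in product(range(dp), range(ep))]
    expert_groups = [[rank(d, e, t) for e in range(ep)]
                     for d, t in product(range(dp), range(tp))]
    data_groups = [[rank(d, e, t) for d in range(dp)]
                   for e, t in product(range(ep), range(tp))]
    return tensor_groups, expert_groups, data_groups

# Example: 16 GPUs split as 2-way tensor x 4-way expert x 2-way data parallelism.
tp_groups, ep_groups, dp_groups = three_d_groups(16, tp=2, ep=4)
print(tp_groups[0], ep_groups[0], dp_groups[0])  # [0, 1] [0, 2, 4, 6] [0, 8]
```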
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
Anthony, Quentin, Awan, Ammar Ahmad, Rasley, Jeff, He, Yuxiong, Shafi, Aamir, Abduljabbar, Mustafa, Subramoni, Hari, Panda, Dhabaleswar
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, necessitating distribution among processors. Training such massive models requires advanced parallelism strategies to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. Examples of models using advanced parallelism strategies include Deep Learning Recommendation Models (DLRM) and Mixture-of-Experts (MoE). Communication libraries' performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs on the Theta-GPU HPC system.
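A schematic of the mix-and-match idea: a tuning table maps each (operation, message-size bucket) pair to the backend that benchmarked fastest, and every communication call is dispatched through that table. The table contents and the selection interface here are illustrative assumptions, not MCR-DL's actual API.

```python
# A schematic sketch of per-operation backend selection; the tuning table values
# are assumed for illustration. In a real system, the chosen backend name would
# route the call to torch.distributed (NCCL/Gloo), mpi4py, etc.
TUNED_BACKENDS = {
    ("allreduce", "small"): "mpi",   # latency-bound regime (assumed winner)
    ("allreduce", "large"): "nccl",  # bandwidth-bound regime (assumed winner)
    ("alltoall", "small"): "mpi",
    ("alltoall", "large"): "nccl",
}

def size_bucket(num_bytes: int, threshold: int = 1 << 20) -> str:
    """Coarse message-size bucketing; a real tuning suite would use finer bins."""
    return "small" if num_bytes < threshold else "large"

def select_backend(op: str, num_bytes: int) -> str:
    """Look up the pre-tuned backend for this operation and message size."""
    return TUNED_BACKENDS[(op, size_bucket(num_bytes))]

# Example: a 4 KB allreduce goes to MPI, a 64 MB alltoall goes to NCCL.
print(select_backend("allreduce", 4 * 1024))         # -> "mpi"
print(select_backend("alltoall", 64 * 1024 * 1024))  # -> "nccl"
```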
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Rajbhandari, Samyam, Li, Conglong, Yao, Zhewei, Zhang, Minjia, Aminabadi, Reza Yazdani, Awan, Ammar Ahmad, Rasley, Jeff, He, Yuxiong
As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their training cost saving has been demonstrated from encoder-decoder models (prior work) to a 5x saving for auto-regressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
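To make the sparsely activated experts idea concrete, here is a minimal top-1 gated MoE layer in plain PyTorch: each token is routed to a single expert chosen by a learned gate, so per-token compute stays roughly constant as experts (and therefore parameters) are added. This is a didactic sketch with no capacity limits or expert parallelism, not DeepSpeed-MoE's implementation.

```python
# Minimal top-1 gated MoE layer; a didactic sketch, not DeepSpeed-MoE's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, hidden]. Each token is routed to exactly one expert.
        gate_probs = F.softmax(self.gate(x), dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 8 tokens of width 16 routed across 4 experts.
layer = Top1MoE(hidden_size=16, num_experts=4)
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```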
Scalable and Efficient MoE Training for Multitask Multilingual Models
Kim, Young Jin, Awan, Ammar Ahmad, Muzio, Alexandre, Salinas, Andres Felipe Cruz, Lu, Liyang, Hendy, Amr, Rajbhandari, Samyam, He, Yuxiong, Awadalla, Hany Hassan
Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gains while consuming a much lower compute budget. However, supporting large-scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation, resulting in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support for efficient MoE training has been implemented and open-sourced with the DeepSpeed library.
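One of the ingredients above, heterogeneous memory, corresponds to DeepSpeed's ZeRO optimizer-state offload to CPU. The sketch below assembles a standard DeepSpeed JSON-style configuration enabling it; the specific values are illustrative assumptions, and the expert-parallel dimension itself is configured where the MoE layers are constructed rather than in this file.

```python
# A hedged sketch of a DeepSpeed configuration pairing data parallelism with
# heterogeneous memory (ZeRO stage 2 + CPU optimizer offload). Keys follow the
# standard DeepSpeed config schema; the values are illustrative assumptions.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer state + gradients
        "offload_optimizer": {"device": "cpu"},  # keep optimizer state in CPU memory
    },
}

print(json.dumps(ds_config, indent=2))
```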