Ardalani, Newsha
Text Quality-Based Pruning for Efficient Training of Language Models
Sharma, Vasu, Padthe, Karthik, Ardalani, Newsha, Tirumala, Kushal, Howes, Russell, Xu, Hu, Huang, Po-Yao, Li, Shang-Wen, Aghajanyan, Armen, Ghosh, Gargi, Zettlemoyer, Luke
Language Models (LMs) have gained attention in recent years due to their impressive performance in various natural language processing (NLP) tasks (Zhang et al., 2022; Penedo et al., 2023; Touvron et al., 2023; Zhou et al., 2023; Liu et al., 2019). However, their training process often relies on computationally intensive procedures involving massive datasets and compute requirements, which hinders training large-scale LMs on noisy real-world or domain-specific datasets. Worse, several of these datasets are uncurated and may contain harmful content that the LM can pick up during training (Deshpande et al., 2023; Schramowski et al., 2022; Kuchnik et al., 2023). By leveraging a numerical text quality score, we demonstrate how it can be used to prune the original dataset, enabling the training of LMs using only a fraction of the data. Our approach aims to identify and eliminate low-quality text instances, thereby streamlining the training process and mitigating the burden of handling large-scale datasets. We also remove potentially harmful content from the data by ensuring that harmful content is rated poorly by our text quality score and can then be pruned. We observe an absolute improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% less data and training faster.
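The pruning recipe this abstract describes reduces to scoring each document and dropping the lowest-scoring fraction before training. The sketch below is a minimal illustration under our own assumptions, not the paper's implementation: quality_score is a hypothetical stand-in for the paper's model-agnostic quality metric, and keep_fraction=0.6 mirrors the 40% data reduction mentioned above.

    # Minimal sketch of quality-score-based corpus pruning (illustrative only).
    def quality_score(text: str) -> float:
        # Hypothetical placeholder metric: penalize empty, very short, or highly
        # repetitive documents; the paper's actual scorer is model-agnostic and richer.
        tokens = text.split()
        if not tokens:
            return 0.0
        uniqueness = len(set(tokens)) / len(tokens)
        length_factor = min(len(tokens), 512) / 512
        return uniqueness * length_factor

    def prune_corpus(docs: list[str], keep_fraction: float = 0.6) -> list[str]:
        # Keep the top-scoring fraction of documents (0.6 => train on 40% less data).
        ranked = sorted(docs, key=quality_score, reverse=True)
        return ranked[: int(len(ranked) * keep_fraction)]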
Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data
Yang, Yu, Singh, Aaditya K., Elhoushi, Mostafa, Mahmoud, Anas, Tirumala, Kushal, Gloeckle, Fabian, Rozière, Baptiste, Wu, Carole-Jean, Morcos, Ari S., Ardalani, Newsha
Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and mostly in other modalities such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.
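One way to read the embedding-space idea, sketched below under our own simplifying assumptions (it is not the paper's exact SCIP metric): embed clean code, synthetically corrupted code, and candidate files with any code encoder, project candidates onto the clean-to-corrupted direction, and prune those that sit closest to the corrupted side.

    import numpy as np

    def corruption_scores(clean_emb: np.ndarray, corrupted_emb: np.ndarray,
                          candidate_emb: np.ndarray) -> np.ndarray:
        # Direction pointing from the clean centroid toward the corrupted centroid.
        direction = corrupted_emb.mean(axis=0) - clean_emb.mean(axis=0)
        direction /= np.linalg.norm(direction) + 1e-12
        # Higher projection = embedding looks more like synthetically corrupted code.
        return (candidate_emb - clean_emb.mean(axis=0)) @ direction

    def prune_low_quality(files: list, candidate_emb: np.ndarray,
                          clean_emb: np.ndarray, corrupted_emb: np.ndarray,
                          prune_fraction: float = 0.1) -> list:
        scores = corruption_scores(clean_emb, corrupted_emb, candidate_emb)
        keep = np.argsort(scores)[: int(len(files) * (1 - prune_fraction))]
        return [files[i] for i in keep]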
Data Acquisition: A New Frontier in Data-centric AI
Chen, Lingjiao, Acun, Bilge, Ardalani, Newsha, Sun, Yifan, Kang, Feiyang, Lyu, Hanrui, Kwon, Yongchan, Jia, Ruoxi, Wu, Carole-Jean, Zaharia, Matei, Zou, James
Datasets, the cornerstone of modern machine learning (ML) systems, have been increasingly sold and purchased for different ML pipelines [2]. Several data marketplaces have emerged to serve different stages of building ML-enhanced data applications. For example, NASDAQ Data Link [3] offers financial datasets cleaned and structured for model training, Amazon AWS Data Exchange [4] focuses on generic tabular datasets, and Databricks Marketplace [5] integrates raw datasets and ML pipelines to deliver insights. The data-as-a-service market was valued at more than $30 billion and is expected to double in the next five years [6]. While data marketplaces are expanding rapidly, data acquisition for ML unfortunately remains challenging, partially due to its ad-hoc nature: based on discussions with real-world users, data acquirers often need to first negotiate varying contracts with different data providers, then purchase multiple datasets in different formats, and finally filter out unnecessary data from the purchased datasets.
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Hsia, Samuel, Golden, Alicia, Acun, Bilge, Ardalani, Newsha, DeVito, Zachary, Wei, Gu-Yeon, Brooks, David, Wu, Carole-Jean
Training and deploying large machine learning (ML) models is time-consuming and requires significant distributed computing infrastructure. Based on real-world large model training on datacenter-scale infrastructures, we show that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency, in this work, we develop an agile performance modeling framework to guide parallelization and hardware-software co-design strategies. Using a suite of real-world large ML models on state-of-the-art GPU training hardware, we demonstrate 2.24x and 5.27x throughput improvement potential for pre-training and inference scenarios, respectively.
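A back-of-the-envelope version of such a performance model, with illustrative numbers of our own rather than the paper's measurements, estimates step time as compute time plus the communication that cannot be overlapped; the exposed-communication share then bounds the achievable speedup.

    def step_time_s(compute_flops: float, peak_flops: float, mfu: float,
                    comm_bytes: float, link_bw_bytes_per_s: float,
                    overlap_fraction: float) -> float:
        # Roofline-style sketch: compute time plus non-overlapped (exposed) communication.
        t_compute = compute_flops / (peak_flops * mfu)
        t_comm = comm_bytes / link_bw_bytes_per_s
        return t_compute + t_comm * (1.0 - overlap_fraction)

    # Example: if 25% of a step is exposed communication, fully hiding it bounds
    # the per-step speedup at 1 / 0.75 ~= 1.33x.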
DataPerf: Benchmarks for Data-Centric AI Development
Mazumder, Mark, Banbury, Colby, Yao, Xiaozhe, Karlaš, Bojan, Rojas, William Gaviria, Diamos, Sudnya, Diamos, Greg, He, Lynn, Parrish, Alicia, Kirk, Hannah Rose, Quaye, Jessica, Rastogi, Charvi, Kiela, Douwe, Jurado, David, Kanter, David, Mosquera, Rafael, Ciro, Juan, Aroyo, Lora, Acun, Bilge, Chen, Lingjiao, Raje, Mehul Smriti, Bartolo, Max, Eyuboglu, Sabri, Ghorbani, Amirata, Goodman, Emmett, Inel, Oana, Kane, Tariq, Kirkpatrick, Christine R., Kuo, Tzu-Sheng, Mueller, Jonas, Thrush, Tristan, Vanschoren, Joaquin, Warren, Margaret, Williams, Adina, Yeung, Serena, Ardalani, Newsha, Paritosh, Praveen, Bat-Leah, Lilith, Zhang, Ce, Zou, James, Wu, Carole-Jean, Coleman, Cody, Ng, Andrew, Mattson, Peter, Reddi, Vijay Janapa
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
Huang, Haiyang, Ardalani, Newsha, Sun, Anna, Ke, Liu, Lee, Hsien-Hsin S., Sridhar, Anjali, Bhosale, Shruti, Wu, Carole-Jean, Lee, Benjamin
Mixture-of-Experts (MoE) models have gained popularity in achieving state-of-the-art performance in a wide range of tasks in computer vision and natural language processing. They effectively expand the model capacity while incurring a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large size and complex communication pattern. In this work, we provide a characterization of two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT) and identify their sources of inefficiencies at deployment. We propose three optimization techniques to mitigate sources of inefficiencies, namely (1) Dynamic gating, (2) Expert Buffering, and (3) Expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.23$\times$ for LM, 5.75-10.98$\times$ for MT Encoder and 2.58-5.71$\times$ for MT Decoder. It also reduces memory usage by up to 1.36$\times$ for LM and up to 1.1$\times$ for MT. We further propose Expert Buffering, a new caching mechanism that only keeps hot, active experts in GPU memory while buffering the rest in CPU memory. This reduces static memory allocation by up to 1.47$\times$. We finally propose a load balancing methodology that provides additional scalability to the workload.
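The expert-buffering idea can be pictured as a small cache of GPU-resident experts with the rest parked in CPU memory; the schematic below is our own LRU sketch with generic load/evict hooks, not the paper's implementation.

    from collections import OrderedDict

    class ExpertBuffer:
        """Keep at most `capacity` experts on the GPU; evict cold experts to CPU memory."""

        def __init__(self, capacity, load_to_gpu, evict_to_cpu):
            self.capacity = capacity
            self.load_to_gpu = load_to_gpu    # hypothetical hook, e.g. move expert to GPU
            self.evict_to_cpu = evict_to_cpu  # hypothetical hook, e.g. move expert to CPU
            self.resident = OrderedDict()     # expert_id -> GPU-resident expert

        def get(self, expert_id):
            if expert_id in self.resident:
                self.resident.move_to_end(expert_id)  # mark as most recently used
                return self.resident[expert_id]
            if len(self.resident) >= self.capacity:
                victim_id, victim = self.resident.popitem(last=False)  # evict LRU expert
                self.evict_to_cpu(victim_id, victim)
            expert = self.load_to_gpu(expert_id)
            self.resident[expert_id] = expert
            return expert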
MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation
Hsia, Samuel, Gupta, Udit, Acun, Bilge, Ardalani, Newsha, Zhong, Pan, Wei, Gu-Yeon, Brooks, David, Wu, Carole-Jean
Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of contents. The reliance on a fixed embedding representation of embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic- and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec -- a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms. On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.
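The dynamic-selection idea can be sketched as a dispatcher that, per query batch, picks an (embedding representation, hardware) path meeting the tail-latency target while maximizing model quality; the latency and quality numbers below are placeholders we made up, not MP-Rec's measured profile or policy.

    # Hypothetical per-path profile; a real system would measure these offline.
    PATHS = {
        ("table", "gpu"):          {"latency_ms_per_query": 0.020, "quality": 0.790},
        ("hybrid", "gpu"):         {"latency_ms_per_query": 0.045, "quality": 0.796},
        ("hybrid", "accelerator"): {"latency_ms_per_query": 0.028, "quality": 0.796},
    }

    def select_path(batch_size: int, latency_target_ms: float):
        # Highest-quality path whose estimated batch latency meets the target;
        # fall back to the fastest path if none qualifies.
        feasible = {p: v for p, v in PATHS.items()
                    if v["latency_ms_per_query"] * batch_size <= latency_target_ms}
        if not feasible:
            return min(PATHS, key=lambda p: PATHS[p]["latency_ms_per_query"])
        return max(feasible, key=lambda p: feasible[p]["quality"])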
Sustainable AI: Environmental Implications, Challenges and Opportunities
Wu, Carole-Jean, Raghavendra, Ramya, Gupta, Udit, Acun, Bilge, Ardalani, Newsha, Maeng, Kiwan, Chang, Gloria, Behram, Fiona Aga, Huang, James, Bai, Charles, Gschwind, Michael, Gupta, Anurag, Ott, Myle, Melnikov, Anastasia, Candido, Salvatore, Brooks, David, Chauhan, Geeta, Lee, Benjamin, Lee, Hsien-Hsin S., Akyildiz, Bugra, Balandat, Maximilian, Spisak, Joe, Jain, Ravi, Rabbat, Mike, Hazelwood, Kim
This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.
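To first order, the footprint the paper analyzes decomposes into operational carbon (energy drawn by the workload, scaled by facility overhead and grid carbon intensity) plus an amortized share of embodied manufacturing carbon; the sketch below uses that decomposition with made-up inputs, not figures from the paper.

    def workload_carbon_kgco2e(energy_kwh: float, pue: float,
                               grid_kgco2e_per_kwh: float,
                               hw_embodied_kgco2e: float,
                               hw_lifetime_share: float) -> float:
        # Operational carbon plus the workload's amortized share of embodied carbon.
        operational = energy_kwh * pue * grid_kgco2e_per_kwh
        embodied = hw_embodied_kgco2e * hw_lifetime_share
        return operational + embodied

    # Made-up example: 10 MWh at PUE 1.1 on a 0.4 kgCO2e/kWh grid, using 1% of the
    # lifetime of hardware whose manufacture emitted 50 tCO2e -> 4,900 kgCO2e.
    print(workload_carbon_kgco2e(10_000, 1.1, 0.4, 50_000, 0.01))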
Deep Learning Scaling is Predictable, Empirically
Hestness, Joel, Narang, Sharan, Ardalani, Newsha, Diamos, Gregory, Jun, Heewoo, Kianinejad, Hassan, Patwary, Md. Mostofa Ali, Yang, Yang, Zhou, Yanqi
Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
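The reported relationship can be checked with an ordinary log-log fit, error(m) ~= alpha * m**beta, where m is the training-set size and beta is the learning-curve "steepness"; the data points below are synthetic, purely to show the fitting procedure.

    import numpy as np

    def fit_power_law(train_set_sizes, errors):
        # Fit error ~= alpha * m**beta by linear regression in log-log space.
        beta, log_alpha = np.polyfit(np.log(train_set_sizes), np.log(errors), deg=1)
        return np.exp(log_alpha), beta

    # Synthetic learning curve: error shrinking as a power law in data size.
    m = np.array([1e6, 4e6, 1.6e7, 6.4e7])
    err = 2.0 * m ** -0.12
    alpha, beta = fit_power_law(m, err)
    print(alpha, beta)  # recovers approximately 2.0 and -0.12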