Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
Singh, Siddharth, Sating, Zachary, Bhatele, Abhinav
Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds: the rapid convergence benefits of second-order methods and the high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.

Stochastic optimization methods such as stochastic gradient descent (SGD) (Robbins & Monro, 1951) and Adam (Kingma & Ba, 2015) are the de facto standard for optimizing the objective function in the training of deep neural networks. These first-order optimization methods are relatively inexpensive in terms of their compute and memory requirements, and hence extremely popular. Second-order optimization methods typically have better convergence properties (fewer epochs to reach target validation metrics) than those of first-order methods.
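The abstract above does not spell out Jorge's exact update rule, so the sketch below only illustrates the general idea it describes: replacing the explicit matrix inverse of a Shampoo-style preconditioner with an iteration built entirely out of matrix multiplications, which map well to GPUs. The Newton-Schulz choice and the function names are assumptions for illustration, not Jorge's actual algorithm.

```python
# Illustrative sketch only, not Jorge's update rule: approximate a matrix
# inverse with Newton-Schulz iterations (matmuls only, no torch.linalg.inv),
# then use it in a Shampoo-style preconditioned step for a 2D parameter.
import torch

def newton_schulz_inverse(A: torch.Tensor, num_iters: int = 10, eps: float = 1e-6) -> torch.Tensor:
    """Approximate the inverse of a symmetric positive semi-definite matrix
    via the iteration X <- X (2I - A X); uses only matrix multiplications."""
    n = A.shape[0]
    I = torch.eye(n, device=A.device, dtype=A.dtype)
    A = A + eps * I  # regularize so A is invertible
    # Scaled initial guess guarantees convergence of the iteration.
    X = A.T / (torch.linalg.matrix_norm(A, ord=1) * torch.linalg.matrix_norm(A, ord=float("inf")))
    for _ in range(num_iters):
        X = X @ (2.0 * I - A @ X)
    return X

def preconditioned_step(W, grad, L, R, lr=1e-3, beta=0.95):
    """One Shampoo-style preconditioned update for a 2D parameter, with the
    exact inverse replaced by the matmul-only approximation above.
    (Shampoo itself uses inverse p-th roots; this is a simplification.)"""
    L.mul_(beta).add_(grad @ grad.T, alpha=1 - beta)  # left gradient statistics
    R.mul_(beta).add_(grad.T @ grad, alpha=1 - beta)  # right gradient statistics
    update = newton_schulz_inverse(L) @ grad @ newton_schulz_inverse(R)
    W.add_(update, alpha=-lr)

# Toy usage
device = "cuda" if torch.cuda.is_available() else "cpu"
W = torch.randn(256, 128, device=device)
grad = torch.randn_like(W)
L = torch.zeros(256, 256, device=device)
R = torch.zeros(128, 128, device=device)
preconditioned_step(W, grad, L, R)
```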
Modeling Parallel Programs using Large Language Models
Nichols, Daniel, Marathe, Aniruddha, Menon, Harshitha, Gamblin, Todd, Bhatele, Abhinav
Parallel software codes in high performance computing (HPC) continue to grow in complexity and scale as we enter the exascale era. A diverse set of emerging hardware and programming paradigms make developing, optimizing, and maintaining parallel software burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. So far, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform. However, with recent advancements in language modeling, and the wealth of code-related data that is now available online, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We train LLMs using code and performance data that is specific to parallel codes. We compare several recent LLMs on HPC related tasks and introduce a new model, HPC-Coder, trained on parallel code. In our experiments we show that this model can auto-complete HPC functions where general models cannot, decorate for loops with OpenMP pragmas, and model performance changes in two scientific application repositories.
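At inference time, the pragma-decoration and auto-completion tasks described above amount to ordinary prompted generation with a causal language model. The sketch below shows what that looks like with the Hugging Face `transformers` API; the model name is a generic stand-in, not necessarily the released HPC-Coder checkpoint, and the prompt format is an assumption for illustration.

```python
# Sketch of prompting a causal LM to suggest an OpenMP pragma for a loop.
# "gpt2" is only a placeholder; a model fine-tuned on parallel code (as
# described above) would be substituted here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "// Parallelize the loop below with an OpenMP pragma\n"
    "void saxpy(int n, float a, const float *x, float *y) {\n"
    "    #pragma omp"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    do_sample=False,                      # greedy completion of the pragma
    pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```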
Communication-minimizing Asynchronous Tensor Parallelism
Singh, Siddharth, Sating, Zack, Bhatele, Abhinav
As state-of-the-art neural networks scale to billions of parameters, designing parallel algorithms that can train these networks efficiently on multi-GPU clusters has become critical. This paper presents Tensor3D, a novel three-dimensional (3D) approach to parallelize tensor computations, that strives to minimize the idle time incurred due to communication in parallel training of large multi-billion parameter models. First, we introduce an intelligent distribution of neural network parameters across GPUs that eliminates communication required for satisfying data dependencies.

In this work, we propose Tensor3D, a three-dimensional (3D) hybrid tensor and data parallel framework which strives to alleviate the aforementioned performance bottlenecks of existing tensor parallel approaches. Our framework relies on three key ideas to minimize the idle time spent in communication. First, we show how a naive application of a tensor parallel strategy can lead to a significant amount of communication for satisfying the data dependencies of parallelized layers of a neural network. To this end, we propose an intelligent distribution of neural network parameters across GPUs that eliminates the aforementioned communication for satisfying data dependencies.
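Tensor3D's actual parameter distribution is not reproduced here, but the effect described above (a naive split forcing communication for the data dependency between consecutive layers, versus a placement that keeps the dependency local) can be shown with a two-"GPU" example. The column-then-row pairing below is the well-known Megatron-style arrangement, used here purely as an illustrative assumption, not the paper's scheme.

```python
# Illustrative sketch: simulates two "GPUs" with plain NumPy to show how the
# *placement* of parameters determines how much communication is needed
# between two consecutive linear layers.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # activations: batch=4, features=8
W1 = rng.standard_normal((8, 16))   # layer 1 weights
W2 = rng.standard_normal((16, 8))   # layer 2 weights
reference = X @ W1 @ W2             # single-device result

# Naive split: layer 1 is split along its input dimension, so each "GPU"
# produces only a partial sum; the full (4, 16) intermediate activation must
# be reconstructed (an all-reduce) before layer 2 can run.
Y_part0 = X[:, :4] @ W1[:4, :]
Y_part1 = X[:, 4:] @ W1[4:, :]
Y = Y_part0 + Y_part1               # <-- communication of a (4, 16) tensor
naive = Y @ W2

# Dependency-aware split: layer 1 column-parallel, layer 2 row-parallel, with
# the input replicated on both "GPUs". Each GPU keeps its shard of the
# intermediate activation local; only the final (4, 8) output is reduced once.
Z0 = (X @ W1[:, :8]) @ W2[:8, :]    # GPU 0: no communication in between
Z1 = (X @ W1[:, 8:]) @ W2[8:, :]    # GPU 1: no communication in between
aware = Z0 + Z1                     # <-- one reduction at the very end

assert np.allclose(reference, naive) and np.allclose(reference, aware)
```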
Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training
Singh, Siddharth, Bhatele, Abhinav
Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e., setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning, namely data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
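As a rough illustration of why the 80-90% sparsity mentioned above translates into memory and communication savings, the sketch below packs a magnitude-pruned gradient into a values-plus-indices representation and compares sizes. The packing format and the pruning step are assumptions for illustration; they are not AxoNN's actual data layout or communication protocol.

```python
# Illustrative sketch: pack a pruned (mostly-zero) tensor into a compact
# values-plus-indices form, which is the basic reason sparsity can cut both
# memory and communication volume.
import torch

def pack_sparse(t: torch.Tensor):
    """Return (indices, values) for the nonzero entries of a flattened tensor."""
    flat = t.flatten()
    idx = flat.nonzero(as_tuple=False).flatten()  # int64 positions of nonzeros
    return idx, flat[idx]

def unpack_sparse(idx: torch.Tensor, vals: torch.Tensor, shape):
    """Rebuild the dense tensor from its packed form."""
    flat = torch.zeros(torch.Size(shape).numel(), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

# Magnitude-prune 90% of a gradient tensor, as the pruning algorithms
# described above would.
grad = torch.randn(1024, 1024)
threshold = grad.abs().quantile(0.9)
grad = torch.where(grad.abs() >= threshold, grad, torch.zeros_like(grad))

idx, vals = pack_sparse(grad)
dense_bytes = grad.numel() * grad.element_size()
sparse_bytes = idx.numel() * idx.element_size() + vals.numel() * vals.element_size()
print(f"dense: {dense_bytes / 1e6:.1f} MB, packed: {sparse_bytes / 1e6:.1f} MB")

assert torch.equal(unpack_sparse(idx, vals, grad.shape), grad)
```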
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Singh, Siddharth, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong, Bhatele, Abhinav
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
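A three-dimensional hybrid scheme like the one described above has to carve the available GPUs into data-, tensor-, and expert-parallel groups over which the corresponding collectives run. The sketch below enumerates one plausible arrangement of such groups; the rank ordering is an assumption for illustration and is not DeepSpeed-TED's actual group construction.

```python
# Illustrative sketch: arrange GPU ranks into a (data x expert x tensor) grid
# and list the rank groups each kind of collective would run over.
from itertools import product

def build_groups(world_size: int, tensor_size: int, expert_size: int):
    """Split `world_size` ranks into a data x expert x tensor grid."""
    assert world_size % (tensor_size * expert_size) == 0
    data_size = world_size // (tensor_size * expert_size)

    # rank = (d * expert_size + e) * tensor_size + t   (an assumed ordering)
    def rank(d, e, t):
        return (d * expert_size + e) * tensor_size + t

    tensor_groups = [[rank(d, e, t) for t in range(tensor_size)]
                     for d, e in product(range(data_size), range(expert_size))]
    expert_groups = [[rank(d, e, t) for e in range(expert_size)]
                     for d, t in product(range(data_size), range(tensor_size))]
    data_groups = [[rank(d, e, t) for d in range(data_size)]
                   for e, t in product(range(expert_size), range(tensor_size))]
    return data_groups, tensor_groups, expert_groups

# Example: 16 GPUs, tensor parallelism of 2, expert parallelism of 4,
# leaving data parallelism of 2.
data_g, tensor_g, expert_g = build_groups(16, tensor_size=2, expert_size=4)
print("tensor-parallel groups:", tensor_g[:2])
print("expert-parallel groups:", expert_g[:2])
print("data-parallel groups:  ", data_g[:2])
```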
How to Train Your Neural Network: A Comparative Evaluation
Lin, Shu-Huai, Nichols, Daniel, Singh, Siddharth, Bhatele, Abhinav
The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields. This phenomenon has spurred the development of algorithms for distributed training of neural networks over a larger number of hardware accelerators. In this paper, we discuss and compare current state-of-the-art frameworks for large scale distributed deep learning. First, we survey current practices in distributed learning and identify the different types of parallelism used. Then, we present empirical results comparing their performance on large image and language training tasks. Additionally, we address their statistical efficiency and memory consumption behavior. Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.
Myelin: An asynchronous, message-driven parallel framework for extreme-scale deep learning
Singh, Siddharth, Bhatele, Abhinav
In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents Myelin, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, Myelin is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, Myelin achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art.
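The CPU-offloading ingredient described above can be illustrated with a generic PyTorch pattern: stage a tensor to pinned host memory with a non-blocking copy on a side CUDA stream so compute can continue, then prefetch it back before it is needed. This is only a sketch of the idea under the assumption of a single CUDA GPU; it is not Myelin's scheduler or API, and the helper names are hypothetical.

```python
# Sketch of offloading a tensor to pinned CPU memory and prefetching it back,
# overlapping the copies with GPU compute via a dedicated CUDA stream.
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA-capable GPU"

copy_stream = torch.cuda.Stream()  # side stream so copies overlap with compute

def offload(t: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy a GPU tensor to pinned host memory (hypothetical helper)."""
    host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())  # wait until t is produced
    with torch.cuda.stream(copy_stream):
        host.copy_(t, non_blocking=True)
    t.record_stream(copy_stream)  # keep t's memory alive until the copy finishes
    return host

def prefetch(host: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy a pinned host tensor back to the GPU (hypothetical helper)."""
    with torch.cuda.stream(copy_stream):
        return host.to("cuda", non_blocking=True)

activations = torch.randn(4096, 4096, device="cuda")
saved = offload(activations)  # device-to-host copy starts in the background
del activations               # GPU memory becomes available for other work

# ... other GPU computation would overlap with the copy here ...

restored = prefetch(saved)                            # host-to-device copy
torch.cuda.current_stream().wait_stream(copy_stream)  # sync before first use
print(restored.shape)
```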