GPU Cluster



Reviewer

Neural Information Processing Systems

We thank the reviewers for their detailed and thoughtful feedback; we respond to each reviewer individually below. The reported value is the mean (across training seeds) of the median (across complexes) of the AUROC, which we will clarify in the Table 2 and Figure 1 captions. We will clarify this point.


PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production

Guan, Yu, Yin, Zhiyu, Chen, Haoyu, Cheng, Sheng, Yang, Chaojie, Qian, Kun, Xu, Tianyin, Zhang, Yang, Zhao, Hanyu, Li, Yong, Lin, Wei, Cai, Dennis, Zhai, Ennan

arXiv.org Artificial Intelligence

Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to the unprecedented scale of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and rarely apply to real-world training systems. In this paper, we present PerfTracker, the first online troubleshooting system that uses fine-grained profiling to diagnose performance issues of large-scale model training in production. PerfTracker can diagnose performance issues rooted in both hardware (e.g., GPUs and their interconnects) and software (e.g., Python functions and GPU operations), and it scales to LMT on modern GPU clusters. PerfTracker summarizes the runtime behavior patterns of fine-grained LMT functions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. PerfTracker has been deployed as a production service for large-scale GPU clusters of O(10,000) GPUs (product homepage https://help.aliyun.com/zh/pai/user-guide/perftracker-online-performance-analysis-diagnostic-tool). It has been used to diagnose a variety of difficult performance issues.
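To make the idea of comparing profiles across peers ("differential observability") concrete, here is a minimal, hypothetical sketch, not PerfTracker's actual implementation: each rank's per-function call durations are summarized, and a (rank, function) pair is flagged when its summary deviates sharply from the fleet-wide median. The profile data, function names, and threshold below are all made up.

    import statistics

    # Hypothetical per-rank profiles: {rank: {function_name: [call durations in ms]}}.
    # In a real system these would come from online profiling of training workers.
    profiles = {
        0: {"all_reduce": [12.1, 11.8, 12.3], "fwd_attn": [8.0, 8.1, 7.9]},
        1: {"all_reduce": [12.0, 12.2, 11.9], "fwd_attn": [8.2, 8.0, 8.1]},
        2: {"all_reduce": [55.4, 60.2, 58.7], "fwd_attn": [8.1, 8.0, 8.2]},  # suspiciously slow
        3: {"all_reduce": [12.4, 11.7, 12.0], "fwd_attn": [7.9, 8.3, 8.0]},
    }

    def summarize(profile):
        """Collapse raw call durations into one number per function (median)."""
        return {fn: statistics.median(durations) for fn, durations in profile.items()}

    def localize(profiles, threshold=3.0):
        """Flag (rank, function) pairs whose summary deviates from the fleet median."""
        summaries = {rank: summarize(p) for rank, p in profiles.items()}
        functions = {fn for s in summaries.values() for fn in s}
        suspects = []
        for fn in functions:
            fleet_median = statistics.median(s[fn] for s in summaries.values() if fn in s)
            for rank, s in summaries.items():
                if fn in s and s[fn] > threshold * fleet_median:
                    suspects.append((rank, fn, s[fn], fleet_median))
        return suspects

    for rank, fn, value, reference in localize(profiles):
        print(f"rank {rank}: {fn} median {value:.1f} ms vs fleet {reference:.1f} ms")

In practice the summaries, thresholds, and set of profiled functions would be far richer; the point is the structure of comparing each worker against its peers rather than against an absolute baseline.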


Meta's Next Llama AI Models Are Training on a GPU Cluster 'Bigger Than Anything' Else

WIRED

Meta CEO Mark Zuckerberg laid down the newest marker in generative AI training on Wednesday, saying that the next major release of the company's Llama model is being trained on a cluster of GPUs that's "bigger than anything" else that's been reported. Llama 4 development is well underway, Zuckerberg told investors and analysts on an earnings call, with an initial launch expected early next year. "We're training the Llama 4 models on a cluster that is bigger than 100,000 H100s, or bigger than anything that I've seen reported for what others are doing," Zuckerberg said, referring to the Nvidia chips popular for training AI systems. "I expect that the smaller Llama 4 models will be ready first." Increasing the scale of AI training with more computing power and data is widely believed to be key to developing significantly more capable AI models.


Exclusive: New Research Finds Stark Global Divide in Ownership of Powerful AI Chips

TIME - Tech

When we think of the "cloud," we often imagine data floating invisibly in the ether. But the reality is far more tangible: the cloud is located in huge buildings called data centers, filled with powerful, energy-hungry computer chips. Those chips, particularly graphics processing units (GPUs), have become a critical piece of infrastructure for the world of AI, as they are required to build and run powerful chatbots like ChatGPT. As the number of things you can do with AI grows, so does the geopolitical importance of high-end chips--and where they are located in the world. The U.S. and China are competing to amass stockpiles, with Washington enacting sanctions aimed at preventing Beijing from buying the most cutting-edge varieties.


G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Xiao, Youshao, Zhao, Shangchun, Zhou, Zhenglei, Huan, Zhaoxin, Ju, Lin, Zhang, Xiaolu, Wang, Lin, Zhou, Jun

arXiv.org Artificial Intelligence

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta learning based DLRM models and suffer from critical efficiency problems in distributed training on GPU clusters, because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops of meta learning. This paper provides G-Meta, a high-performance framework for large-scale training of optimization-based meta DLRM models over a GPU cluster. First, G-Meta combines data parallelism and model parallelism with careful orchestration of computation and communication to enable high-speed distributed training. Second, it proposes a Meta-IO pipeline for efficient data ingestion that alleviates the I/O bottleneck. Experimental results show that G-Meta achieves notable training speedups without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous delivery of models by a factor of four. It also obtains a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising, benefiting from larger training samples and more tasks.
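To see why meta learning stresses the training pipeline differently from conventional training, the toy sketch below walks through the two task-specific datasets (support and query) and the two update loops of a first-order, optimization-based meta learner. It is an illustrative simplification on a made-up linear model, not G-Meta's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 4
    w_shared = np.array([1.0, -2.0, 0.5, 3.0])   # structure shared across tasks

    def loss_and_grad(w, X, y):
        """Mean squared error of a linear model y ~ X @ w, and its gradient."""
        err = X @ w - y
        return float(err @ err) / len(y), 2.0 * X.T @ err / len(y)

    def make_task(n=32):
        """One task = a support set (inner loop) and a query set (outer loop)."""
        w_task = w_shared + 0.1 * rng.normal(size=dim)   # task-specific variation
        def sample(n):
            X = rng.normal(size=(n, dim))
            return X, X @ w_task + 0.01 * rng.normal(size=n)
        return sample(n), sample(n)

    w_meta = np.zeros(dim)
    inner_lr, outer_lr = 0.1, 0.05

    for step in range(500):
        (Xs, ys), (Xq, yq) = make_task()
        # Inner loop: adapt the meta-parameters to this task using its support set.
        _, g_inner = loss_and_grad(w_meta, Xs, ys)
        w_adapted = w_meta - inner_lr * g_inner
        # Outer loop: update the meta-parameters using the query set
        # (first-order approximation: gradient evaluated at the adapted weights).
        _, g_outer = loss_and_grad(w_adapted, Xq, yq)
        w_meta -= outer_lr * g_outer

    query_loss, _ = loss_and_grad(w_meta, *make_task()[1])
    print("meta-parameters:", np.round(w_meta, 2))
    print("query loss at the meta-initialization:", round(query_loss, 3))

Each step touches two datasets and performs two distinct gradient computations, which is exactly the pattern a conventional single-loop, single-dataset training pipeline is not laid out for.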


Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Yang, Fei, Peng, Shuang, Sun, Ning, Wang, Fangyu, Tan, Ke, Wu, Fu, Qiu, Jiezhong, Pan, Aimin

arXiv.org Artificial Intelligence

Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over a heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments covering various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance close to that of homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding the training efficiency attainable in a pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks in training efficiency under a heterogeneous NIC environment and can be seamlessly integrated with them.
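The core scheduling intuition, keeping communication-heavy groups inside GPUs that share the same kind of NIC and crossing NIC "islands" only with lighter point-to-point pipeline traffic, can be sketched as follows. This is an illustrative simplification with a made-up cluster layout and a trivial grouping rule, not Holmes's actual scheduling method.

    from collections import defaultdict

    # Hypothetical cluster description: device id -> NIC type of its host.
    devices = {
        0: "IB", 1: "IB", 2: "IB", 3: "IB",
        4: "RoCE", 5: "RoCE", 6: "RoCE", 7: "RoCE",
        8: "ETH", 9: "ETH", 10: "ETH", 11: "ETH",
    }

    def plan(devices, group_size=4):
        """Form all-reduce-heavy groups inside one NIC island; cross islands only
        with pipeline stages, whose point-to-point traffic tolerates slower links."""
        islands = defaultdict(list)
        for dev, nic in devices.items():
            islands[nic].append(dev)
        comm_heavy_groups = []
        for nic, devs in islands.items():
            for i in range(0, len(devs), group_size):
                comm_heavy_groups.append((nic, devs[i:i + group_size]))
        pipeline_stages = [group for _, group in comm_heavy_groups]
        return comm_heavy_groups, pipeline_stages

    comm_heavy_groups, pipeline_stages = plan(devices)
    print("all-reduce groups:", comm_heavy_groups)
    print("pipeline stages  :", pipeline_stages)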


Network Contention-Aware Cluster Scheduling with Reinforcement Learning

Ryu, Junyeol, Eo, Jeongyoon

arXiv.org Artificial Intelligence

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. For emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can significantly degrade training throughput. However, widely used scheduling policies often fall short because they are agnostic to network contention between jobs. In this paper, we present a new approach to mitigating network contention in GPU clusters using reinforcement learning. We formulate GPU cluster scheduling as a reinforcement learning problem and learn a network contention-aware scheduling policy that captures contention sensitivities and dynamically adapts scheduling decisions through continuous evaluation and improvement. Compared to widely used scheduling policies, our approach reduces average job completion time by up to 18.2% and cuts tail job completion time by up to 20.7%, while offering a favorable trade-off between average job completion time and resource utilization.
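As a rough illustration of what "formulating GPU cluster scheduling as a reinforcement learning problem" means, the toy sketch below defines a state (current per-node communication load plus the incoming job), an action (the node chosen), and a reward (the negative, contention-inflated job completion time), and compares two hand-written baseline policies in place of a learned one. The job model, contention model, and numbers are all made up; the paper's actual state, action, reward, and learned policy are far richer.

    import random

    NODES = 4

    def new_job(rng):
        """A job is (compute time in seconds, communication intensity in [0, 1])."""
        return (rng.uniform(5, 20), rng.random())

    def step(node_load, job, action):
        """Place `job` on node `action`; reward is the negative contention-inflated JCT."""
        compute, comm = job
        # Jobs sharing a node contend for its network; the slowdown grows with the
        # communication intensity already resident on that node.
        slowdown = 1.0 + comm * node_load[action]
        node_load[action] += comm
        return -compute * slowdown

    def evaluate(policy, episodes=200):
        job_rng = random.Random(0)          # same job arrivals for every policy
        total = 0.0
        for _ in range(episodes):
            node_load = [0.0] * NODES
            for _ in range(8):              # 8 jobs arrive per episode
                job = new_job(job_rng)
                state = (tuple(node_load), job)
                total += step(node_load, job, policy(state))
        return total / episodes

    random_policy = lambda state: random.randrange(NODES)
    # A contention-aware heuristic: pick the least communication-loaded node.
    greedy_policy = lambda state: min(range(NODES), key=lambda n: state[0][n])

    print("random policy, avg return:", round(evaluate(random_policy), 1))
    print("greedy policy, avg return:", round(evaluate(greedy_policy), 1))

A reinforcement learning approach replaces the hand-written policies with one trained to maximize the return, so it can pick up contention sensitivities that a fixed heuristic misses.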


Punica: Multi-Tenant LoRA Serving

Chen, Lequn, Ye, Zihao, Wu, Yongji, Zhuo, Danyang, Ceze, Luis, Krishnamurthy, Arvind

arXiv.org Artificial Intelligence

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. LoRA, which retains the weights of the pretrained model and introduces trainable rank decomposition matrices, is increasingly popular for specializing pre-trained large models. We thus need to enable batching for different LoRA models; since serving is dominated by the decode stage, we only need to focus on decode-stage performance, and we can apply straightforward techniques, e.g., on-demand loading of LoRA model weights.
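The batching idea can be illustrated with a small NumPy sketch. This is conceptual only, with made-up shapes and adapter names, and it is not Punica's CUDA kernel: the dense matmul against the single shared pretrained weight is computed once for the whole batch, while each request gathers only its own small low-rank (A, B) pair.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, rank = 16, 16, 4

    # One shared copy of the pretrained weight ...
    W = rng.normal(size=(d_in, d_out))

    # ... and a small (A, B) pair per LoRA adapter (hypothetical adapters).
    adapters = {name: (rng.normal(size=(d_in, rank)) * 0.01,
                       rng.normal(size=(rank, d_out)) * 0.01)
                for name in ["shop", "news", "chat"]}

    # A batch of decode-step inputs, each routed to a (possibly different) adapter.
    batch_x = rng.normal(size=(5, d_in))
    batch_adapter = ["shop", "chat", "shop", "news", "chat"]

    def batched_lora_forward(x, adapter_ids):
        """y_i = x_i W + x_i A_{a(i)} B_{a(i)}: the base matmul is shared by the
        whole batch; only the small low-rank updates are gathered per request."""
        base = x @ W                                         # one dense GEMM for everyone
        A = np.stack([adapters[a][0] for a in adapter_ids])  # (batch, d_in, rank)
        B = np.stack([adapters[a][1] for a in adapter_ids])  # (batch, rank, d_out)
        delta = np.einsum("bi,bir,bro->bo", x, A, B)         # per-request low-rank update
        return base + delta

    print(batched_lora_forward(batch_x, batch_adapter).shape)  # (5, 16)

Because the low-rank terms are tiny compared with the shared weight, requests for different adapters can ride the same base matmul, which is what makes a single GPU copy of the pretrained model sufficient.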


The Chan Zuckerberg Initiative is building a massive GPU cluster to 'cure, prevent or manage all diseases'

Engadget

The Chan Zuckerberg Initiative (CZI), the philanthropic organization created in 2015 by Priscilla Chan and her husband Mark Zuckerberg, announced a bold new generative AI initiative today. The group is funding and building a high-end GPU cluster that will use AI to create predictive models of healthy and diseased cells; it hopes they'll help researchers better understand the human body's cells and cellular reactions. The group believes the collection of computers will help it achieve its incredibly lofty goal of helping to "cure, prevent, or manage all diseases by the end of this century." "Researchers are gathering more data than ever before about the trillions of cells within our bodies, and it's too complex for our brains to grapple with," Jeff MacGregor, CZI vice president of communications, wrote in an emailed statement to Engadget. He cites the example of imaging a single cell at nanometer resolution, which would produce as much data as 83,000 photos on a smartphone.