AITopics | Hou, Qi

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

Zhang, Shulai, Zheng, Ningxin, Lin, Haibin, Jiang, Ziheng, Bao, Wenlei, Jiang, Chengquan, Hou, Qi, Cui, Weihao, Zheng, Size, Chang, Li-Wen, Chen, Quan, Liu, Xin

arXiv.org Artificial Intelligence, Mar-4-2025

Mixture-of-experts (MoE) has been widely used to scale large language models to trillion-plus parameters while keeping the computational cost fixed. Developing large MoE models in a distributed setting runs into large communication overhead: with popular models and frameworks, the inter-device communication of an MoE layer can take up 47% of total model execution time. Existing methods therefore pipeline the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and their latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and improves its adaptability across scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by 1.96x and delivers a 1.71x end-to-end speedup on average. COMET has been adopted in production clusters with GPUs at the ten-thousand scale, saving millions of GPU hours.
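To put the 47% figure in context, here is a minimal back-of-the-envelope sketch, not taken from the paper. It assumes a toy model in which communication and computation would otherwise run strictly one after the other, and in which hiding communication does not slow the computation down. The function name `overlap_speedup` and its parameters are hypothetical. Under those assumptions, fully hiding a 47% communication share caps the speedup at 1 / (1 - 0.47), or about 1.89x.

```python
def overlap_speedup(comm_fraction: float, hidden_fraction: float) -> float:
    """Speedup of a toy serial model when part of the communication is hidden.

    comm_fraction:   share of the baseline runtime spent in communication.
    hidden_fraction: share of that communication overlapped with computation.
    Both names are illustrative assumptions, not terms from the paper.
    """
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must be in [0, 1]")
    compute_fraction = 1.0 - comm_fraction
    # Communication can only hide behind computation that actually exists.
    hidden_time = min(comm_fraction * hidden_fraction, compute_fraction)
    return 1.0 / (1.0 - hidden_time)


if __name__ == "__main__":
    for hidden in (0.5, 0.9, 1.0):
        speedup = overlap_speedup(0.47, hidden)
        print(f"hidden={hidden:.0%}  speedup={speedup:.2f}x")
```

The bound applies only to a workload whose communication share is exactly 47%; a single MoE layer measured in isolation can have a different share, so this sketch does not contradict the per-layer figure in the abstract.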

communication, machine learning, natural language

arXiv:2502.19811

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Chang, Li-Wen, Bao, Wenlei, Hou, Qi, Jiang, Chengquan, Zheng, Ningxin, Zhong, Yinmin, Zhang, Xuanrun, Song, Zuquan, Jiang, Ziheng, Lin, Haibin, Jin, Xin, Liu, Xin

arXiv.org Artificial Intelligence, Jun-18-2024

Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. These large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor and/or to accelerate computation to meet a latency requirement. However, this kind of parallelism introduces additional communication that can account for a significant portion of overall runtime, which limits the scalability of the technique even within a group of devices with high-speed interconnects, such as GPUs connected by NVLink in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and then fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster of 8 GPUs with various GPU generations and interconnects.
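The trade-off described above, finer decomposition versus kernel efficiency, can be illustrated with a toy two-stage pipeline model. This is a sketch under stated assumptions, not Flux's implementation: the function `pipelined_time`, the time units, and the `per_chunk_overhead` parameter (standing in for the cost of launching many small separate kernels) are all hypothetical.

```python
def pipelined_time(comm: float, compute: float, chunks: int,
                   per_chunk_overhead: float = 0.0) -> float:
    """Completion time of a toy two-stage pipeline (communicate, then compute).

    The work is split into `chunks` equal pieces. Chunk k can be computed once
    it has arrived, while chunk k+1 is still in flight. `per_chunk_overhead`
    is an assumed fixed cost added to every compute chunk.
    """
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    c = comm / chunks                          # transfer time per chunk
    p = compute / chunks + per_chunk_overhead  # compute time per chunk
    # The first chunk must arrive before any compute starts; afterwards the
    # slower stage sets the pace, and the last chunk still has to be computed.
    return c + (chunks - 1) * max(c, p) + p


if __name__ == "__main__":
    comm, compute = 4.0, 6.0  # arbitrary illustrative time units
    for n in (1, 4, 16):
        ideal = pipelined_time(comm, compute, n)
        costly = pipelined_time(comm, compute, n, per_chunk_overhead=0.2)
        print(f"chunks={n:2d}  no overhead={ideal:.2f}  with overhead={costly:.2f}")
```

In this model, splitting more finely keeps shrinking the exposed communication when each chunk is free, but a fixed per-chunk cost eventually outweighs the gain. Fusing the fine-grained pieces into one kernel corresponds loosely to driving that per-chunk cost toward zero.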

large language model, machine learning, natural language

arXiv:2406.06858

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Jiang, Ziheng, Lin, Haibin, Zhong, Yinmin, Huang, Qi, Chen, Yangrui, Zhang, Zhi, Peng, Yanghua, Li, Xiang, Xie, Cong, Nong, Shibiao, Jia, Yulu, He, Sun, Chen, Hongmin, Bai, Zhihao, Hou, Qi, Yan, Shipeng, Zhou, Ding, Sheng, Yiyao, Jiang, Zhuo, Xu, Haohan, Wei, Haoran, Zhang, Zhang, Nie, Pengfei, Zou, Leqi, Zhao, Sida, Xiang, Liang, Liu, Zherui, Li, Zhe, Jia, Xiaoying, Ye, Jianxi, Jin, Xin, Liu, Xin

arXiv.org Artificial Intelligence, Feb-23-2024

We present the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long duration of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to addressing them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope that by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
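As a small worked example of the MFU figures above, the sketch below treats MFU as achieved model FLOP/s divided by the hardware's theoretical peak FLOP/s, and reads "1.34x" as a plain ratio of two MFU values. Both readings are assumptions made here, and the function names are hypothetical. Under them, the implied Megatron-LM baseline is 55.2% / 1.34, or roughly 41%.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over theoretical peak."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


def implied_baseline_mfu(reported_mfu: float, improvement: float) -> float:
    """Baseline MFU implied by a reported MFU and a multiplicative improvement.

    Assumes the improvement is a simple ratio of the two MFU values.
    """
    if improvement <= 0:
        raise ValueError("improvement must be positive")
    return reported_mfu / improvement


if __name__ == "__main__":
    baseline = implied_baseline_mfu(0.552, 1.34)
    print(f"implied baseline MFU: {baseline:.1%}")
```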

large language model, machine learning, natural language

arXiv:2402.15627

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)