Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent

Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui

arXiv.org Artificial Intelligence 

Recent years have witnessed unprecedented achievements by large-scale pre-trained models, especially Transformer models. Many products and services at Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have adopted pre-trained models to enhance their capabilities. In this work, we present Angel-PTM, a production deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can efficiently train extremely large-scale models over hierarchical memory. Its key designs are fine-grained memory management via the Page abstraction and a unified scheduling method that coordinates computation, data movement, and communication. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address the SSD I/O bandwidth bottleneck. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in maximum supported model scale and by up to 88.9% in training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models using hundreds of GPUs verify the strong scalability of Angel-PTM.
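
The abstract names the Page abstraction and the GPU/host/SSD hierarchy but gives no implementation detail. As a loose, hypothetical illustration of what a page-based manager over such a hierarchy could look like, here is a minimal Python sketch; every identifier (Tier, Page, PageManager, prefetch, and so on) is invented for this example and does not come from Angel-PTM.

from dataclasses import dataclass
from enum import Enum
from typing import Dict

class Tier(Enum):
    GPU = 0   # fastest, smallest capacity
    HOST = 1  # CPU DRAM
    SSD = 2   # largest, slowest

@dataclass
class Page:
    page_id: int
    size_bytes: int
    tier: Tier = Tier.SSD  # cold pages start on SSD

class PageManager:
    """Tracks fixed-size pages of model state across a memory hierarchy
    and promotes them toward the GPU ahead of when compute needs them."""

    def __init__(self, gpu_budget: int, host_budget: int) -> None:
        self.budgets = {Tier.GPU: gpu_budget, Tier.HOST: host_budget}
        self.used = {Tier.GPU: 0, Tier.HOST: 0}
        self.pages: Dict[int, Page] = {}

    def register(self, page_id: int, size_bytes: int) -> Page:
        page = Page(page_id, size_bytes)
        self.pages[page_id] = page
        return page

    def prefetch(self, page_id: int, target: Tier) -> None:
        """Move a page to a faster tier, evicting resident pages if the
        tier is full. A real system would issue the copy asynchronously
        so it overlaps with computation."""
        page = self.pages[page_id]
        if page.tier.value <= target.value:
            return  # already at or above the target tier
        while self.used[target] + page.size_bytes > self.budgets[target]:
            self._evict_one(target)
        if page.tier is not Tier.SSD:
            self.used[page.tier] -= page.size_bytes
        self.used[target] += page.size_bytes
        page.tier = target

    def _evict_one(self, tier: Tier) -> None:
        # Deliberately naive policy: demote the first resident page
        # straight to SSD; a production system would pick victims based
        # on the schedule of upcoming computations.
        for page in self.pages.values():
            if page.tier is tier:
                self.used[tier] -= page.size_bytes
                page.tier = Tier.SSD
                return
        raise MemoryError(f"no evictable page in {tier.name}")

# Usage: stage two 1 MiB parameter pages onto the GPU before a forward pass.
mgr = PageManager(gpu_budget=2 << 20, host_budget=8 << 20)
for pid in (0, 1):
    mgr.register(pid, 1 << 20)
    mgr.prefetch(pid, Tier.GPU)

In the actual system, per the abstract, page movements are coordinated with computation and communication by a unified scheduler; the synchronous calls and naive eviction policy above merely stand in for that machinery.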
