GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model

Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang

arXiv.org Artificial Intelligence 

Currently, reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their deployment on a wide range of devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: such applications require applying complex distillation methods to even larger-scale PLMs (over 10B parameters), while being constrained by GPU memory and the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation of larger-scale PLMs with a variety of distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch between and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of PLMs of at least 100B parameters with 25 mainstream distillation methods on 8 NVIDIA A100 (40GB) GPUs.
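
As background for readers less familiar with the technique, the sketch below shows the classic soft-label distillation objective (temperature-scaled KL divergence between teacher and student distributions plus the usual cross-entropy on labels) in PyTorch. It is only a minimal illustration of the kind of objective a framework such as GKD orchestrates alongside other methods, not GKD's actual API; the function name and the hyperparameters temperature and alpha are hypothetical.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Hypothetical helper, not part of GKD: combines the soft-label
    # KL term with ordinary cross-entropy on the ground-truth labels.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # The KL term is scaled by T^2 so its gradient magnitude stays
    # comparable to the cross-entropy term as the temperature grows.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term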
