Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture
Liu, Peiyu; Gao, Ze-Feng; Chen, Yushuo; Zhao, Wayne Xin; Wen, Ji-Rong
arXiv.org Artificial Intelligence
In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to greater depth. Unlike prior work that shares all parameters or uses extra blocks, we design a more capable parameter-sharing architecture based on the matrix product operator (MPO). MPO decomposition reorganizes and factorizes the information of a parameter matrix into two parts: a central tensor, which contains the major information, and auxiliary tensors, which supplement it and hold only a small proportion of the parameters. Based on this decomposition, our architecture shares the central tensor across all layers to reduce the model size, while keeping layer-specific auxiliary tensors (together with adapters) to enhance adaptation flexibility. To improve model training, we further propose a stable initialization algorithm tailored to the MPO-based architecture. Extensive experiments demonstrate the effectiveness of the proposed model in reducing the model size while achieving highly competitive performance.
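The sharing scheme described in the abstract can be illustrated with a minimal NumPy sketch. It rests on several assumptions that do not come from the paper: the decomposition uses only three tensors obtained by sequential SVD without truncation, the matrix size (32 x 32), the factor shapes, the layer count, and the perturbation of the auxiliary tensors are all chosen purely for illustration, and adapters and the proposed initialization algorithm are omitted. The function names `mpo_decompose` and `mpo_reconstruct` are hypothetical.

```python
import numpy as np

def mpo_decompose(W, in_dims, out_dims):
    """Split W into (left auxiliary, central, right auxiliary) tensors.

    Illustrative three-tensor factorization via two exact SVDs; the
    number of tensors and the absence of truncation are assumptions.
    """
    (i1, i2, i3), (j1, j2, j3) = in_dims, out_dims
    # Pair each input index with its matching output index.
    T = W.reshape(i1, i2, i3, j1, j2, j3).transpose(0, 3, 1, 4, 2, 5)

    # First SVD peels off the left auxiliary tensor.
    U, S, Vt = np.linalg.svd(T.reshape(i1 * j1, -1), full_matrices=False)
    d1 = U.shape[1]
    left = U.reshape(1, i1, j1, d1)

    # Second SVD separates the central tensor from the right auxiliary tensor.
    M = (S[:, None] * Vt).reshape(d1 * i2 * j2, i3 * j3)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    d2 = U.shape[1]
    central = U.reshape(d1, i2, j2, d2)
    right = (S[:, None] * Vt).reshape(d2, i3, j3, 1)
    return left, central, right

def mpo_reconstruct(left, central, right):
    """Contract the three tensors back into a 2-D weight matrix."""
    i1, i2, i3 = left.shape[1], central.shape[1], right.shape[1]
    j1, j2, j3 = left.shape[2], central.shape[2], right.shape[2]
    T = np.einsum("aijb,bklc,cmnd->ikmjln", left, central, right)
    return T.reshape(i1 * i2 * i3, j1 * j2 * j3)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))  # stand-in for one weight matrix
left, central, right = mpo_decompose(W, (2, 8, 2), (2, 8, 2))

# One shared central tensor; each layer keeps its own small auxiliary tensors.
# The random perturbation only mimics layer-specific training.
num_layers = 12
aux = [
    (left + 0.01 * rng.standard_normal(left.shape),
     right + 0.01 * rng.standard_normal(right.shape))
    for _ in range(num_layers)
]
layer_weights = [mpo_reconstruct(l, central, r) for l, r in aux]

shared_params = central.size + num_layers * (left.size + right.size)
unshared_params = num_layers * W.size
print("central:", central.size, "auxiliary per layer:", left.size + right.size)
print("shared:", shared_params, "unshared:", unshared_params)
```

In this toy setting the central tensor carries almost all of the parameters, so sharing it across layers means each additional layer costs only the small auxiliary tensors. How closely this mirrors the savings in the actual architecture depends on the tensor shapes and bond dimensions used there.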
Apr-10-2023