Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis

Wei, Xiuying, Moalla, Skander, Pascanu, Razvan, Gulcehre, Caglar

Jul-13-2024–arXiv.org Artificial Intelligence

State-of-the-art LLMs often rely on scale with high computational costs, which has sparked a research agenda to reduce parameter counts and costs without significantly impacting performance. Our study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. In contrast to previous works, (i) we explore low-rank parametrization at scale, up to 1.3B parameters; (ii) within Transformer language models rather than convolutional architectures; and (iii) starting from training from scratch. Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., 2.6$\times$ FFN speed-up with 32\% parameters) and effective during training. Interestingly, these structured FFNs exhibit steeper scaling curves than the original models. Motivated by this finding, we develop the wide and structured networks surpassing the current medium-sized and large-sized Transformer in perplexity and throughput performance.

arxiv preprint arxiv, language model, low-rank parametrization, (8 more...)

arXiv.org Artificial Intelligence

Jul-13-2024

arXiv.org PDF

Add feedback

Country:
- Europe > Italy
  - Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East
  - Jordan (0.05)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found