Review for NeurIPS paper: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Summary and Contributions: This paper proposes to accelerate the training of Transformer networks by progressively dropping Transformer layers during training. First, it compares two BERT architectures, PostLN and PreLN. PostLN applies layer normalization after the element-wise addition in each Transformer block, whereas PreLN moves layer normalization to the input stream of the sublayers. The paper finds that PostLN is more sensitive to the choice of hyperparameters and that training often diverges with more aggressive learning rates, whereas PreLN avoids vanishing gradients and leads to more stable optimization.
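The PostLN/PreLN distinction described above can be sketched as follows. This is an illustrative sketch, not the paper's code: `sublayer` stands in for an attention or feed-forward module and `layer_norm` for a LayerNorm module.

```python
def post_ln_block(x, sublayer, layer_norm):
    # PostLN: layer normalization is applied after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer, layer_norm):
    # PreLN: layer normalization is applied only on the input stream of
    # the sublayer; the residual path itself stays unnormalized.
    return x + sublayer(layer_norm(x))
```

Because the PreLN residual path bypasses normalization, gradients flow through an identity connection across the whole stack, which is the property the review credits with more stable optimization.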
The proposed method for training BERT is practically useful. My main concern with this paper is that its novelty is somewhat limited: it combines two existing techniques. One is PreLN, which has been well studied in the literature on training BERT; the other is stochastically dropping layers, which was first proposed for training CV models. On the other hand, effectively combining these two techniques and tuning them to work for BERT training requires non-trivial effort.
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from prohibitive computational expense. Current methods for accelerating pre-training either rely on massive parallelism with advanced hardware or are not applicable to language models. In this work, we propose a method based on progressive layer dropping that speeds up the training of Transformer-based language models, not at the cost of excessive hardware resources but through efficiency gained from changes to the model architecture and training technique. Extensive experiments on BERT show that the proposed method achieves a 25% reduction in computational cost (FLOPS) and a 24% reduction in end-to-end wall-clock training time. Furthermore, we show that our pre-trained models exhibit strong knowledge transferability, achieving accuracy on downstream tasks similar to or even higher than that of baseline models.
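The progressive layer dropping idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a global keep probability that decays from 1.0 toward a floor `theta_bar` over training, scaled per layer so deeper layers are dropped more often (in the spirit of stochastic depth); the schedule shape and constants here are illustrative assumptions.

```python
import math
import random

def keep_probability(step, layer, num_layers, theta_bar=0.5, gamma=0.001):
    """Illustrative schedule: a global keep rate that decays from 1.0
    toward theta_bar as training progresses, with deeper layers
    (larger `layer` index) assigned lower keep probabilities."""
    theta_t = (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar
    return 1.0 - (layer / num_layers) * (1.0 - theta_t)

def forward_with_layer_drop(x, layers, step, rng=random):
    """During training, skip each layer's transform with probability
    1 - p_l; the residual (identity) path is always kept, so the
    network stays well-defined when a layer is dropped."""
    num_layers = len(layers)
    for i, layer in enumerate(layers, start=1):
        p = keep_probability(step, i, num_layers)
        if rng.random() < p:
            x = x + layer(x)  # residual form: identity plus sublayer
    return x
```

At step 0 every layer is kept with probability 1, so early training sees the full network; the savings in FLOPS accrue as the keep probabilities decay later in training.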