AITopics | incorporating bert

Incorporating BERT into Parallel Sequence Decoding with Adapters

Neural Information Processing Systems, Dec-24-2025, 05:18:54 GMT

While large-scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task-agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, chosen because of the bi-directional and conditionally independent nature of BERT, and can be adapted to traditional autoregressive decoding easily. We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves $36.49$/$33.57$

incorporating bert, name change, parallel sequence decoding, (6 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)

Review for NeurIPS paper: Incorporating BERT into Parallel Sequence Decoding with Adapters

Neural Information Processing Systems, Jan-26-2025, 01:35:19 GMT

Additional Feedback: From the supplementary: 'we only consider tokens that appear in the training and validation set, and manually modify the checkpoint of the multilingual BERT to omit the embeddings of unused tokens'. This is an interesting detail, and I think it should be included in the main paper. In equation 5, you pass the encoder layer output directly into an adapter, and the adapter immediately applies layernorm. The final stage of a BERT transformer layer is a residual connection followed by layernorm, so why do we need to apply *another* layernorm in the adapter? The method of Houlsby et al. 2019 (and others), applying adapters *before* layernorm, is much more intuitive to me. I think you are missing some key baselines. Firstly, large-scale pretraining could be viewed as a form of data augmentation, and so you should compare to normal MT data augmentation, most notably back-translation, which is key to many strong results in MT (e.g.

incorporating bert, layernorm, parallel sequence decoding, (11 more...)

Country: Asia > Myanmar (0.06)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.38)

Incorporating BERT into Parallel Sequence Decoding with Adapters

Neural Information Processing Systems, Oct-10-2024, 14:58:34 GMT

While large-scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task-agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, chosen because of the bi-directional and conditionally independent nature of BERT, and can be adapted to traditional autoregressive decoding easily.
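
The Mask-Predict decoding loop that the abstract builds on can be illustrated with a small sketch. This is not the paper's implementation: it follows the commonly described form of Mask-Predict (predict every masked position in parallel, then re-mask the least confident positions on a linearly decaying schedule), and it assumes the target length is already known. The names `mask_predict`, `toy_predict`, and `MASK` are hypothetical, and `toy_predict` is a stand-in for a real BERT-based decoder.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical placeholder; a real model uses its own mask token

Prediction = Tuple[str, float]  # (predicted token, confidence)


def mask_predict(
    predict: Callable[[List[str]], List[Prediction]],
    length: int,
    iterations: int,
) -> List[str]:
    """Iteratively fill a fully masked sequence of a given (assumed known) length."""
    tokens = [MASK] * length
    confidences = [0.0] * length
    for step in range(iterations):
        # One parallel pass: the model proposes a token for every position.
        proposals = predict(tokens)
        for i in range(length):
            # Only masked positions are updated; kept tokens retain their scores.
            if tokens[i] == MASK:
                tokens[i], confidences[i] = proposals[i]
        # Linearly decaying number of positions to re-mask for the next pass.
        n_mask = int(length * (iterations - (step + 1)) / iterations)
        if n_mask == 0:
            break
        # Re-mask the positions the model was least confident about.
        least_confident = sorted(range(length), key=lambda i: confidences[i])[:n_mask]
        for i in least_confident:
            tokens[i] = MASK
    return tokens


def toy_predict(tokens: List[str]) -> List[Prediction]:
    """Toy stand-in for a decoder: position i is right once i tokens are visible."""
    target = ["the", "cat", "sat", "down"]
    known = sum(tok != MASK for tok in tokens)
    return [(gold, 0.9) if known >= i else ("?", 0.1) for i, gold in enumerate(target)]
```

With the toy model, calling `mask_predict(toy_predict, 4, 4)` recovers the full toy target, while a single iteration leaves the low-confidence guesses in place. This shows the trade-off between the number of parallel passes and output quality that such decoders expose.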

adapter, incorporating bert, parallel sequence decoding, (4 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.38)
