Review for NeurIPS paper: Incorporating BERT into Parallel Sequence Decoding with Adapters

Jan-26-2025, 01:35:19 GMT - Neural Information Processing Systems

Additional Feedback: From the supplementary material: 'we only consider tokens that appear in the training and validation set, and manually modify the checkpoint of the multilingual BERT to omit the embeddings of unused tokens'. This is an interesting detail, and I think it should be included in the main paper.

In Equation 5, you pass the encoder layer output directly into an adapter, and the adapter immediately applies layernorm. The final stage of a BERT transformer layer is already a residual connection followed by layernorm, so why do we need to apply *another* layernorm in the adapter? The method of Houlsby et al. 2019 (and others), which applies adapters *before* layernorm, is much more intuitive to me.

I think you are missing some key baselines. Firstly, large-scale pretraining could be viewed as a form of data augmentation, so you should compare to standard MT data augmentation, most notably back-translation, which is key to many strong results in MT (e.g.
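To make the layernorm question above concrete, here is a minimal pure-Python sketch of what a second layernorm does to an already-normalized vector. It is not the paper's implementation: the `layer_norm` helper, the `eps` value, and the `gamma`/`beta` numbers are all illustrative assumptions. Under those assumptions, the second normalization is nearly a no-op when the first has no learned scale and shift, but it changes the values when the first layernorm carries learned parameters, as a pretrained BERT layer typically would.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-12):
    """Standardize x to zero mean and unit variance, then apply an optional scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gamma is None:
        gamma = [1.0] * n  # identity scale
    if beta is None:
        beta = [0.0] * n  # zero shift
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]


# Hypothetical hidden state of a single token (illustrative values only).
hidden = [0.5, -1.2, 3.0, 0.1]

# Case 1: the first layernorm has no learned scale/shift.
once = layer_norm(hidden)
twice = layer_norm(once)
max_diff_plain = max(abs(a - b) for a, b in zip(once, twice))

# Case 2: the first layernorm has learned scale/shift (hypothetical values),
# standing in for the layernorm at the end of a pretrained encoder layer.
gamma = [2.0, 0.5, 1.0, 1.5]
beta = [0.1, -0.3, 0.0, 0.2]
bert_out = layer_norm(hidden, gamma, beta)
adapter_in = layer_norm(bert_out)  # the adapter's own layernorm, before its scale/shift
max_diff_affine = max(abs(a - b) for a, b in zip(bert_out, adapter_in))

print("no learned scale/shift:", max_diff_plain)
print("with learned scale/shift:", max_diff_affine)
```

This only shows that the extra layernorm is not mathematically redundant once the preceding one has learned parameters; whether it helps in practice is the empirical question the authors would need to answer.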

