Appendix for "R-Drop: Regularized Dropout for Neural Networks"
–Neural Information Processing Systems
We provide more detailed settings for the experiments of each task in this appendix.

A.1 Neural Machine Translation

For all the NMT tasks, we use the public datasets from the IWSLT and WMT competitions. After tokenization, the resulting vocabularies for the IWSLT datasets contain roughly 10k tokens, while for the WMT datasets the vocabulary size is about 32k. To train the Transformer-based NMT models, we use the transformer_iwslt_de_en configuration for the IWSLT translation tasks, which has 6 layers in both the encoder and decoder, embedding size 512, feed-forward size 1,024, 4 attention heads, dropout 0.3, and weight decay 0.0001. Label smoothing [12] is adopted with value 0.1. To evaluate the performance, we use the multi-bleu.perl script.
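For reference, the hyperparameters listed above can be gathered into a small configuration object. This is an illustrative sketch only: the class and field names below are our own and do not correspond to fairseq's actual flag names or to code released with the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NMTConfig:
    """Illustrative container for the transformer_iwslt_de_en settings
    described in the text (names are hypothetical, values are from the paper)."""
    encoder_layers: int = 6
    decoder_layers: int = 6
    embed_dim: int = 512          # embedding size
    ffn_dim: int = 1024           # feed-forward size
    attention_heads: int = 4
    dropout: float = 0.3
    weight_decay: float = 1e-4
    label_smoothing: float = 0.1


# Settings used for the IWSLT translation tasks.
iwslt_cfg = NMTConfig()
```

Keeping the settings in one frozen dataclass makes it easy to log the exact configuration alongside each experiment run.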