Pseudo-Label Training and Model Inertia in Neural Machine Translation
Hsu, Benjamin, Currey, Anna, Niu, Xing, Nădejde, Maria, Dinu, Georgiana
– arXiv.org Artificial Intelligence, May-19-2023
NMT models have been observed to be brittle: their predictions are sensitive to small input changes and can vary significantly across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call model inertia. We study inertia effects under different training settings and identify distribution simplification as a mechanism behind the observed results.

Self-training (Fralick, 1967; Amini et al., 2022) is a popular semi-supervised technique used to boost the performance of neural machine translation (NMT) models. In self-training for NMT, also known as forward-translation, an initial model is used to translate monolingual data; this data is then concatenated with the original training data in a subsequent training step (Zhang & Zong, 2016; Marie et al., 2020; Edunov et al., 2020; Wang et al., 2021). Self-training is believed to be effective by inducing input smoothness and enabling better learning of decision boundaries from the added unlabeled data (Chapelle et al., 2006; He et al., 2020; Wei et al., 2021). It has also been observed to effectively diversify the training distribution (Wang et al., 2021; Nguyen et al., 2020). A closely related technique is knowledge distillation (Hinton et al., 2015; Gou et al., 2021), in particular sequence-level knowledge distillation (SKD), which uses hard targets in training and reduces to pseudo-labeled data augmentation (Kim & Rush, 2016). In NMT, knowledge distillation is effective through knowledge transfer from ensembles or larger-capacity models and as a data augmentation method (Freitag et al., 2017; Gordon & Duh, 2019; Tan et al., 2019; Currey et al., 2020).
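The pseudo-label training loop described above (train an initial model, forward-translate monolingual source data, retrain on the concatenation) can be summarized in a minimal sketch. The `pseudo_label_training` helper and the `train_fn` / `translate_fn` callables below are illustrative placeholders under assumed interfaces, not code from the paper or any particular NMT toolkit.

```python
from typing import Callable, List, Tuple

# (source, target) sentence pairs
ParallelData = List[Tuple[str, str]]


def pseudo_label_training(
    parallel_data: ParallelData,
    monolingual_source: List[str],
    train_fn: Callable[[ParallelData], object],   # trains an NMT model on sentence pairs
    translate_fn: Callable[[object, str], str],   # translates one source sentence with a model
) -> object:
    """Sketch of forward-translation / sequence-level KD style pseudo-label training."""
    # 1) Train an initial (teacher) model on the original parallel data.
    teacher = train_fn(parallel_data)

    # 2) Forward-translate monolingual source sentences to create pseudo-labels.
    pseudo_parallel = [(src, translate_fn(teacher, src)) for src in monolingual_source]

    # 3) Concatenate pseudo-labeled data with the original data and retrain.
    return train_fn(parallel_data + pseudo_parallel)
```

In this sketch the retrained model is what the paper studies for model inertia, i.e. its stability to incremental updates and input perturbations relative to a model trained on the original parallel data alone.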