Machine Translation
Does Joint Training Really Help Cascaded Speech Translation?
Tran, Viet Anh Khoa, Thulke, David, Gao, Yingbo, Herold, Christian, Ney, Hermann
Currently, in speech translation, the straightforward approach - cascading a recognition system with a translation system - delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, recently, people turn their attention to direct data and propose various joint training methods. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion by marginalizing the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained using joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher of the current speech translation landscape, and motivate research in finding more efficient and creative ways to utilize the direct data for speech translation.
Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality?
Zhang, Pei, Yang, Baosong, Wei, Haoran, Liu, Dayiheng, Fan, Kai, Si, Luo, Xie, Jun
Neural machine translation (NMT) is often criticized for failures that happen without awareness. The lack of competency awareness makes NMT untrustworthy. This is in sharp contrast to human translators who give feedback or conduct further investigations whenever they are in doubt about predictions. To fill this gap, we propose a novel competency-aware NMT by extending conventional NMT with a self-estimator, offering abilities to translate a source sentence and estimate its competency. The self-estimator encodes the information of the decoding procedure and then examines whether it can reconstruct the original semantics of the source sentence. Experimental results on four translation tasks demonstrate that the proposed method not only carries out translation tasks intact but also delivers outstanding performance on quality estimation. Without depending on any reference or annotated data typically required by state-of-the-art metric and quality estimation methods, our model yields an even higher correlation with human quality judgments than a variety of aforementioned methods, such as BLEURT, COMET, and BERTScore. Quantitative and qualitative analyses show better robustness of competency awareness in our model.
Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU
Christopoulou, Fenia, Lampouras, Gerasimos, Iacobacci, Ignacio
Curriculum Learning (CL) is a technique of training models via ranking examples in a typically increasing difficulty trend with the aim of accelerating convergence and improving generalisability. Current approaches for Natural Language Understanding (NLU) tasks use CL to improve in-distribution data performance often via heuristic-oriented or task-agnostic difficulties. In this work, instead, we employ CL for NLU by taking advantage of training dynamics as difficulty metrics, i.e., statistics that measure the behavior of the model at hand on specific task-data instances during training and propose modifications of existing CL schedulers based on these statistics. Differently from existing works, we focus on evaluating models on in-distribution (ID), out-of-distribution (OOD) as well as zero-shot (ZS) cross-lingual transfer datasets. We show across several NLU tasks that CL with training dynamics can result in better performance mostly on zero-shot cross-lingual transfer and OOD settings with improvements up by 8.5% in certain cases. Overall, experiments indicate that training dynamics can lead to better performing models with smoother training compared to other difficulty metrics while being 20% faster on average. In addition, through analysis we shed light on the correlations of task-specific versus task-agnostic metrics.
Using Selective Masking as a Bridge between Pre-training and Fine-tuning
Lad, Tanish, Maheshwari, Himanshu, Kottukkal, Shreyas, Mamidi, Radhika
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of-the-art results for various NLP tasks. Pre-training is usually independent of the downstream task, and previous works have shown that this pre-training alone might not be sufficient to capture the task-specific nuances. We propose a way to tailor a pre-trained BERT model for the downstream task via task-specific masking before the standard supervised fine-tuning. For this, a word list is first collected specific to the task. For example, if the task is sentiment classification, we collect a small sample of words representing both positive and negative sentiments. Next, a word's importance for the task, called the word's task score, is measured using the word list. Each word is then assigned a probability of masking based on its task score. We experiment with different masking functions that assign the probability of masking based on the word's task score. The BERT model is further trained on MLM objective, where masking is done using the above strategy. Following this standard supervised fine-tuning is done for different downstream tasks. Results on these tasks show that the selective masking strategy outperforms random masking, indicating its effectiveness.
HaRiM$^+$: Evaluating Summary Quality with Hallucination Risk
Son, Seonil, Park, Junsoo, Hwang, Jeong-in, Lee, Junghwa, Noh, Hyungjong, Lee, Yeonsoo
One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.
Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling
Wang, Zhijun, Liu, Xuebo, Zhang, Min
Existing research generally treats Chinese character as a minimum unit for representation. However, such Chinese character representation will suffer two bottlenecks: 1) Learning bottleneck, the learning cannot benefit from its rich internal features (e.g., radicals and strokes); and 2) Parameter bottleneck, each individual character has to be represented by a unique vector. In this paper, we introduce a novel representation method for Chinese characters to break the bottlenecks, namely StrokeNet, which represents a Chinese character by a Latinized stroke sequence (e.g., "ao1 (concave)" to "ajaie" and "tu1 (convex)" to "aeaqe"). Specifically, StrokeNet maps each stroke to a specific Latin character, thus allowing similar Chinese characters to have similar Latin representations. With the introduction of StrokeNet to neural machine translation (NMT), many powerful but not applicable techniques to non-Latin languages (e.g., shared subword vocabulary learning and ciphertext-based data augmentation) can now be perfectly implemented. Experiments on the widely-used NIST Chinese-English, WMT17 Chinese-English and IWSLT17 Japanese-English NMT tasks show that StrokeNet can provide a significant performance boost over the strong baselines with fewer model parameters, achieving 26.5 BLEU on the WMT17 Chinese-English task which is better than any previously reported results without using monolingual data. Code and scripts are freely available at https://github.com/zjwang21/StrokeNet.
TorchScale: Transformers at Scale
Ma, Shuming, Wang, Hongyu, Huang, Shaohan, Wang, Wenhui, Chi, Zewen, Dong, Li, Benhaim, Alon, Patra, Barun, Chaudhary, Vishrav, Song, Xia, Wei, Furu
Large Transformers have achieved state-of-the-art performance across many tasks. Most open-source libraries on scaling Transformers focus on improving training or inference with better parallelization. In this work, we present TorchScale, an open-source toolkit that allows researchers and developers to scale up Transformers efficiently and effectively. TorchScale has the implementation of several modeling techniques, which can improve modeling generality and capability, as well as training stability and efficiency. Experimental results on language modeling and neural machine translation demonstrate that TorchScale can successfully scale Transformers to different sizes without tears. The library is available at https://aka.ms/torchscale.
Rank-One Editing of Encoder-Decoder Models
Large sequence to sequence models for tasks such as Neural Machine Translation (NMT) are usually trained over hundreds of millions of samples. However, training is just the origin of a model's life-cycle. Real-world deployments of models require further behavioral adaptations as new requirements emerge or shortcomings become known. Typically, in the space of model behaviors, behavior deletion requests are addressed through model retrainings whereas model finetuning is done to address behavior addition requests, both procedures being instances of data-based model intervention. In this work, we present a preliminary study investigating rank-one editing as a direct intervention method for behavior deletion requests in encoder-decoder transformer models. We propose four editing tasks for NMT and show that the proposed editing algorithm achieves high efficacy, while requiring only a single instance of positive example to fix an erroneous (negative) model behavior.
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Liu, Fenglin, Wu, Xian, You, Chenyu, Ge, Shen, Zou, Yuexian, Sun, Xu
Training supervised video captioning model requires coupled video-caption pairs. However, for many targeted languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language. To solve the task, a natural choice is to employ a two-step pipeline system: first utilizing video-to-pivot captioning model to generate captions in pivot language and then utilizing pivot-to-target translation model to translate the pivot captions to the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, generating visual irrelevant target captions; 2) the errors in the generated pivot captions will be propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE can achieve 4% and 7% relative margins on the CIDEr scores to current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
Fast Nearest Neighbor Machine Translation
Meng, Yuxian, Li, Xiaoya, Zheng, Xiayu, Wu, Fei, Sun, Xiaofei, Zhang, Tianwei, Li, Jiwei
Though nearest neighbor Machine Translation ($k$NN-MT) \citep{khandelwal2020nearest} has proved to introduce significant performance boosts over standard neural MT systems, it is prohibitively slow since it uses the entire reference corpus as the datastore for the nearest neighbor search. This means each step for each beam in the beam search has to search over the entire reference corpus. $k$NN-MT is thus two-orders slower than vanilla MT models, making it hard to be applied to real-world applications, especially online services. In this work, we propose Fast $k$NN-MT to address this issue. Fast $k$NN-MT constructs a significantly smaller datastore for the nearest neighbor search: for each word in a source sentence, Fast $k$NN-MT first selects its nearest token-level neighbors, which is limited to tokens that are the same as the query token. Then at each decoding step, in contrast to using the entire corpus as the datastore, the search space is limited to target tokens corresponding to the previously selected reference source tokens. This strategy avoids search through the whole datastore for nearest neighbors and drastically improves decoding efficiency. Without loss of performance, Fast $k$NN-MT is two-orders faster than $k$NN-MT, and is only two times slower than the standard NMT model. Fast $k$NN-MT enables the practical use of $k$NN-MT systems in real-world MT applications. The code is available at \url{https://github.com/ShannonAI/fast-knn-nmt}