Machine Translation
DATScore: Evaluating Translation with Data Augmented Translations
Eddine, Moussa Kamal, Shang, Guokan, Vazirgiannis, Michalis
The rapid development of large pretrained language models has revolutionized not only the field of Natural Language Generation (NLG) but also its evaluation. Inspired by the recent work of BARTScore: a metric leveraging the BART language model to evaluate the quality of generated text from various aspects, we introduce DATScore. DATScore uses data augmentation techniques to improve the evaluation of machine translation. Our main finding is that introducing data augmented translations of the source and reference texts is greatly helpful in evaluating the quality of the generated translation. We also propose two novel score averaging and term weighting strategies to improve the original score computing process of BARTScore. Experimental results on WMT show that DATScore correlates better with human meta-evaluations than the other recent state-of-the-art metrics, especially for low-resource languages. Ablation studies demonstrate the value added by our new scoring strategies. Moreover, we report in our extended experiments the performance of DATScore on 3 NLG tasks other than translation.
Improved Data Augmentation for Translation Suggestion
Zhang, Hongxiao, Lai, Siyu, Zhang, Songming, Huang, Hui, Chen, Yufeng, Xu, Jinan, Liu, Jian
Translation suggestion (TS) models are used to automatically provide alternative suggestions for incorrect spans in sentences generated by machine translation. This paper introduces the system used in our submission to the WMT'22 Translation Suggestion shared task. Our system is based on the ensemble of different translation architectures, including Transformer, SA-Transformer, and DynamicConv. We use three strategies to construct synthetic data from parallel corpora to compensate for the lack of supervised data. In addition, we introduce a multi-phase pre-training strategy, adding an additional pre-training phase with in-domain data. We rank second and third on the English-German and English-Chinese bidirectional tasks, respectively.
Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Recently, astonishing advances have been observed in AMR parsing, as measured by the structural Smatch metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter annotator agreement (IAA). Therefore, it is unclear how well Smatch (still) relates to human estimates of parse quality, as in this situation potentially fine-grained errors of similar weight may impact the AMR's meaning to different degrees. We conduct an analysis of two popular and strong AMR parsers that -- according to Smatch -- reach quality levels on par with human IAA, and assess how human quality ratings relate to Smatch and other AMR metrics. Our main findings are: i) While high Smatch scores indicate otherwise, we find that AMR parsing is far from being solved: we frequently find structurally small, but semantically unacceptable errors that substantially distort sentence meaning. ii) Considering high-performance parsers, better Smatch scores may not necessarily indicate consistently better parsing quality. To obtain a meaningful and comprehensive assessment of quality differences of parse(r)s, we recommend augmenting evaluations with macro statistics, use of additional metrics, and more human analysis.
Linearizing Transformer with Key-Value Memory
Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based Transformers. Despite their unique merits, they usually suffer from a performance drop comparing with the vanilla transformer on many sequence generation tasks, and often fail to obtain computation gain when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving the efficiency even with short generation. It projects the source sequences into lower dimension representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism which renders the computation as light as a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks, including machine translation, abstractive text summarization, and language modeling.
Developing a general-purpose clinical language inference model from a large corpus of clinical notes
Sushil, Madhumita, Ludwig, Dana, Butte, Atul J., Rudrapatna, Vivek A.
Several biomedical language models have already been developed for clinical language inference. However, these models typically utilize general vocabularies and are trained on relatively small clinical corpora. We sought to evaluate the impact of using a domain-specific vocabulary and a large clinical training corpus on the performance of these language models in clinical language inference. We trained a Bidirectional Encoder Decoder from Transformers (BERT) model using a diverse, deidentified corpus of 75 million deidentified clinical notes authored at the University of California, San Francisco (UCSF). We evaluated this model on several clinical language inference benchmark tasks: clinical and temporal concept recognition, relation extraction and medical language inference. We also evaluated our model on two tasks using discharge summaries from UCSF: diagnostic code assignment and therapeutic class inference. Our model performs at par with the best publicly available biomedical language models of comparable sizes on the public benchmark tasks, and is significantly better than these models in a within-system evaluation on the two tasks using UCSF data. The use of in-domain vocabulary appears to improve the encoding of longer documents. The use of large clinical corpora appears to enhance document encoding and inferential accuracy. However, further research is needed to improve abbreviation resolution, and numerical, temporal, and implicitly causal inference.
Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training
Zhang, Taolin, Dong, Junwei, Wang, Jianing, Wang, Chengyu, Wang, Ang, Liu, Yinghui, Huang, Jun, Li, Yong, He, Xiaofeng
Recently, knowledge-enhanced pre-trained language models (KEPLMs) improve context-aware representations via learning from structured relations in knowledge graphs, and/or linguistic knowledge from syntactic or dependency analysis. Unlike English, there is a lack of high-performing open-source Chinese KEPLMs in the natural language processing (NLP) community to support various language understanding applications. In this paper, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT (Chinese knowledge-enhanced BERT).Specifically, both relational and linguistic knowledge is effectively injected into CKBERT based on two novel pre-training tasks, i.e., linguistic-aware masked language modeling and contrastive multi-hop relation modeling. Based on the above two pre-training paradigms and our in-house implemented TorchAccelerator, we have pre-trained base (110M), large (345M) and huge (1.3B) versions of CKBERT efficiently on GPU clusters. Experiments demonstrate that CKBERT outperforms strong baselines for Chinese over various benchmark NLP tasks and in terms of different model sizes.
CTC Alignments Improve Autoregressive Translation
Yan, Brian, Dalmia, Siddharth, Higuchi, Yosuke, Neubig, Graham, Metze, Florian, Black, Alan W, Watanabe, Shinji
Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework wherein CTC's core properties can counteract several key weaknesses of pure-attention models during training and decoding. To validate this conjecture, we modify the Hybrid CTC/Attention model originally proposed for ASR to support text-to-text translation (MT) and speech-to-text translation (ST). Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
Understanding the Failure of Batch Normalization for Transformers in NLP
Wang, Jiaxi, Wu, Ji, Huang, Lei
Batch Normalization (BN) is a core and prevalent technique in accelerating the training of deep neural networks and improving the generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we are trying to answer why BN usually performs worse than LN in NLP tasks with Transformer models. We find that the inconsistency between training and inference of BN is the leading cause that results in the failure of BN in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and reveal that TID can indicate BN's performance, supported by extensive experiments, including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID keeps small through training. To suppress the explosion of TID, we propose Regularized BN (RBN) that adds a simple regularization term to narrow the gap between batch statistics and population statistics of BN.
Multilingual CheckList: Generation and Evaluation
K, Karthikeyan, Bhatt, Shaily, Singh, Pankaj, Aditya, Somak, Dandapat, Sandipan, Sitaram, Sunayana, Choudhury, Monojit
Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We device an algorithm - Template Extraction Algorithm (TEA) for automatically extracting target language CheckList templates from machine translated instances of a source language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more different languages. We find that TEA followed by human verification is ideal for scaling Checklist-based evaluation to multiple languages while TEA gives a good estimates of model performance.
Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Chen, Pinzhen, Heafield, Kenneth
Chinese word segmentation has entered the deep learning era which greatly reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing, but there is a performance gap between the translation-based approach and other methods. This motivates our work, in which we apply the best practices from low-resource neural machine translation to supervised Chinese segmentation. We examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning, and ensembling. Compared to previous works, our low-resource translation-based method maintains the effortless model design, yet achieves the same result as state of the art in the constrained evaluation without using additional data.