Machine Translation
Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP
Jin, Zhijing, von Kügelgen, Julius, Ni, Jingwei, Vaidhya, Tejas, Kaushal, Ayush, Sachan, Mrinmaya, Schölkopf, Bernhard
The principle of independent causal mechanisms (ICM) states that generative processes of real world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices. Code available at https://github.com/zhijing-jin/icm4nlp
Natural Language Processing for Smart Healthcare
Zhou, Binggui, Yang, Guanghua, Shi, Zheng, Ma, Shaodan
Smart healthcare has achieved significant progress in recent years. Emerging artificial intelligence (AI) technologies enable smart applications across a range of healthcare scenarios. As an essential technology powered by AI, natural language processing (NLP) plays a key role in smart healthcare due to its capability of analysing and understanding human language. In this work we review existing studies that concern NLP for smart healthcare from the perspectives of technique and application. From a technical point of view, we focus on feature extraction and modelling for the NLP tasks encountered in smart healthcare. On the application side, the discussion centres on representative smart healthcare scenarios, including clinical practice, hospital management, personal care, public health, and drug development. We further discuss the limitations of current work and identify directions for future work.
Monotonic Simultaneous Translation with Chunk-wise Reordering and Refinement
Han, HyoJung, Ahn, Seokchan, Choi, Yoonjung, Chung, Insoo, Kim, Sangha, Cho, Kyunghyun
Recent simultaneous machine translation models are often trained on conventional full-sentence translation corpora, leading to either excessive latency or the need to anticipate as-yet-unarrived words when dealing with a language pair whose word orders differ significantly. This is unlike human simultaneous interpreters, who produce largely monotonic translations at the expense of the grammaticality of the sentence being translated. In this paper, we thus propose an algorithm to reorder and refine the target side of a full-sentence translation corpus, so that the words/phrases between the source and target sentences are aligned largely monotonically, using word alignment and non-autoregressive neural machine translation. We then train a widely used wait-k simultaneous translation model on this reordered-and-refined corpus. The proposed approach improves BLEU scores, and the resulting translations exhibit enhanced monotonicity with source sentences.
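To make the wait-k policy mentioned above concrete, the following is a minimal sketch of its read/write schedule as it is commonly defined: the decoder first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted. The function name `wait_k_context_sizes` and its parameters are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def wait_k_context_sizes(src_len: int, tgt_len: int, k: int) -> List[int]:
    """Return, for each target position, how many source tokens have been read.

    Hypothetical helper for illustration only; it models the schedule,
    not the translation model itself.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Target position t (0-indexed) is written after reading k + t source
    # tokens, capped at the full source length once the input has ended.
    return [min(k + t, src_len) for t in range(tgt_len)]


if __name__ == "__main__":
    # With k=2, the first target token is written after 2 source tokens.
    print(wait_k_context_sizes(src_len=5, tgt_len=6, k=2))
```

A smaller k lowers latency but gives the model less source context per target token, which is why a target side that follows the source order more closely is easier to translate under this schedule.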
Multilingual Neural Machine Translation: Can Linguistic Hierarchies Help?
Saleh, Fahimeh, Buntine, Wray, Haffari, Gholamreza, Du, Lan
Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages, rather than training separate models for different languages. Learning a single model can enhance low-resource translation by leveraging data from multiple languages. However, the performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades translation performance due to negative transfer. In this paper, we propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer. HKD generates a set of multilingual teacher-assistant models via a selective knowledge distillation mechanism based on the language groups, and then distils the ultimate multilingual model from those assistants in an adaptive way. Experimental results derived from the TED dataset with 53 languages demonstrate the effectiveness of our approach in avoiding the negative transfer effect in MNMT, leading to improved translation performance (about 1 BLEU point on average) compared to strong baselines.
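For readers unfamiliar with knowledge distillation, the sketch below shows the generic temperature-scaled distillation objective (the KL divergence from a teacher's output distribution to a student's). It is not the selective or adaptive mechanism of HKD, whose details the abstract does not give; the function names and the default temperature are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,  # assumed value, not from the paper
) -> float:
    """KL(teacher || student) over one token's vocabulary distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(
        p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0
    )


if __name__ == "__main__":
    # The loss is zero when the student matches the teacher exactly.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

In a hierarchical setup such as the one described, a loss of this general kind would presumably be applied twice: once from per-group teachers to assistants, and once from assistants to the final model.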
GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems
Ding, Bosheng, Hu, Junjie, Bing, Lidong, Aljunied, Sharifah Mahani, Joty, Shafiq, Si, Luo, Miao, Chunyan
Much recent progress in task-oriented dialogue (ToD) systems has been driven by available annotation data across multiple domains for training. Over the last few years, there has been a move towards data curation for multilingual ToD systems that are applicable to serve people speaking different languages. However, existing multilingual ToD datasets either have a limited coverage of languages due to the high cost of data curation, or ignore the fact that dialogue entities barely exist in countries speaking these languages. To tackle these limitations, we introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset globalized from an English ToD dataset for three unexplored use cases. Our method is based on translating dialogue templates and filling them with local entities in the target-language countries. We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
BI-RADS BERT & Using Section Tokenization to Understand Radiology Reports
Kuling, Grey, Curpen, Dr. Belinda, Martel, Anne L.
Radiology reports are the main form of communication between radiologists and other clinicians, and contain important information for patient care. However, in order to use this information for research it is necessary to convert the raw text into structured data suitable for analysis. Domain specific contextual word embeddings have been shown to achieve impressive accuracy at such natural language processing tasks in medicine. In this work we pre-trained a contextual embedding BERT model using breast radiology reports and developed a classifier that incorporated the embedding with auxiliary global textual features in order to perform a section tokenization task. This model achieved a 98% accuracy at segregating free text reports into sections of information outlined in the Breast Imaging Reporting and Data System (BI-RADS) lexicon, a significant improvement over the Classic BERT model without auxiliary information. We then evaluated whether using section tokenization improved the downstream extraction of the following fields: modality/procedure, previous cancer, menopausal status, purpose of exam, breast density and background parenchymal enhancement. Using the BERT model pre-trained on breast radiology reports combined with section tokenization resulted in an overall accuracy of 95.9% in field extraction. This is a 17 percentage point improvement compared to an overall accuracy of 78.9% for field extraction for models without section tokenization and with Classic BERT embeddings. Our work shows the strength of using BERT in radiology report analysis and the advantages of section tokenization in identifying key features of patient factors recorded in breast radiology reports.
Building Chinese Biomedical Language Models via Multi-Level Text Discrimination
Wang, Quan, Dai, Songtai, Xu, Benfeng, Lyu, Yajuan, Zhu, Yong, Wu, Hua, Wang, Haifeng
Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and focused mainly on English. In this work we introduce eHealth, a biomedical PLM in Chinese built with a new pre-training framework. This new framework trains eHealth as a discriminator through both token-level and sequence-level discrimination. The former is to detect input tokens corrupted by a generator and select their original signals from plausible candidates, while the latter is to further distinguish corruptions of the same original sequence from those of the others. As such, eHealth can learn language semantics at both the token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. The pre-trained model is available to the public at https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth and the code will also be released later.
Bandits Don't Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits
Kreutzer, Julia, Vilar, David, Sokolov, Artem
Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g. containing contents from multiple domains or different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly with MT model parameters to relieve system developers from manual schedule design. A multi-armed bandit is trained to dynamically choose between facets in a way that is most beneficial for the MT system. We evaluate it on three different multi-facet applications: balancing translationese and natural training data, or data from multiple domains or multiple language pairs. We find that bandit learning leads to competitive MT systems across tasks, and our analysis provides insights into its learned strategies and the underlying data sets.
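The idea of letting a bandit choose which data facet to sample from next can be sketched as follows. The abstract does not say which bandit algorithm or reward signal is used, so this example substitutes a plain epsilon-greedy bandit; the class name, the facet names, and the reward values (which in practice might be something like a gain in validation quality) are all hypothetical.

```python
import random
from typing import Dict, List


class EpsilonGreedyBandit:
    """Epsilon-greedy stand-in for a facet-selection bandit (illustrative)."""

    def __init__(self, arms: List[str], epsilon: float = 0.1, seed: int = 0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts: Dict[str, int] = {arm: 0 for arm in self.arms}
        self.values: Dict[str, float] = {arm: 0.0 for arm in self.arms}

    def select(self) -> str:
        # Explore a random facet with probability epsilon, else exploit
        # the facet with the highest running mean reward.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm: str, reward: float) -> None:
        # Incremental update of the running mean reward for this facet.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    bandit = EpsilonGreedyBandit(["news", "subtitles", "medical"], epsilon=0.1)
    for _ in range(5):
        facet = bandit.select()
        # Hypothetical reward; a real system would measure training benefit.
        reward = 1.0 if facet == "subtitles" else 0.2
        bandit.update(facet, reward)
    print(bandit.values)
```

In a training loop, each `select` call would pick the facet for the next batch and each `update` call would feed back how much that batch helped, so the sampling balance adapts alongside the model parameters.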
Multilingual AMR Parsing with Noisy Knowledge Distillation
Cai, Deng, Li, Xin, Ho, Jackie Chun-Sing, Bing, Lidong, Lam, Wai
We study multilingual AMR parsing from the perspective of knowledge distillation, where the aim is to learn and improve a multilingual AMR parser by using an existing English parser as its teacher. We constrain our exploration to a strict multilingual setting: there is but one model to parse all different languages including English. We identify that noisy input and precise output are the key to successful distillation. Together with extensive pre-training, we obtain an AMR parser whose performance surpasses all previously published results on four different foreign languages, including German, Spanish, Italian, and Chinese, by large margins (up to 18.8 Smatch points on Chinese and on average 11.3 Smatch points). Our parser also achieves comparable performance on English to the latest state-of-the-art English-only parser.
Unsupervised Neural Machine Translation with Generative Language Models Only
Han, Jesse Michael, Babuschkin, Igor, Edwards, Harrison, Neelakantan, Arvind, Xu, Tao, Polu, Stanislas, Ray, Alex, Shyam, Pranav, Ramesh, Aditya, Radford, Alec, Sutskever, Ilya
We show how to derive state-of-the-art unsupervised neural machine translation systems from generatively pre-trained language models. Our method consists of three steps: few-shot amplification, distillation, and backtranslation. We first use the zero-shot translation ability of large pre-trained language models to generate translations for a small set of unlabeled sentences. We then amplify these zero-shot translations by using them as few-shot demonstrations for sampling a larger synthetic dataset. This dataset is distilled by discarding the few-shot demonstrations and then fine-tuning. During backtranslation, we repeatedly generate translations for a set of inputs and then fine-tune a single language model on both directions of the translation task at once, ensuring cycle-consistency by swapping the roles of gold monotext and generated translations when fine-tuning. By using our method to leverage GPT-3's zero-shot translation capability, we achieve a new state-of-the-art in unsupervised translation on the WMT14 English-French benchmark, attaining a BLEU score of 42.1.