Machine Translation
Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation
Guo, Junliang, Tan, Xu, Xu, Linli, Qin, Tao, Chen, Enhong, Liu, Tie-Yan
Non-autoregressive translation (NAT) models remove the dependence on previous target tokens and generate all target tokens in parallel, resulting in significant inference speedup but at the cost of inferior translation accuracy compared to autoregressive translation (AT) models. Considering that AT models have higher accuracy and are easier to train than NAT models, and that both share the same model configurations, a natural idea to improve the accuracy of NAT models is to transfer a well-trained AT model to an NAT model through fine-tuning. However, since AT and NAT models differ greatly in training strategy, straightforward fine-tuning does not work well. In this work, we introduce curriculum learning into fine-tuning for NAT. Specifically, we design a curriculum in the fine-tuning process to progressively switch the training from autoregressive generation to non-autoregressive generation. Experiments on four benchmark translation datasets show that the proposed method improves translation accuracy by more than $1$ BLEU over previous NAT baselines and speeds up inference by more than $10$ times over AT baselines.
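As a rough illustration of what "progressively switching" from autoregressive to non-autoregressive training could look like, the sketch below uses a linear pacing schedule. The abstract does not specify the pacing function or how the two modes are mixed, so the names `nat_probability` and `choose_mode` and the linear schedule itself are assumptions made for illustration, not the authors' method.

```python
import random


def nat_probability(step, total_steps):
    """Hypothetical linear pacing: 0.0 at the start (fully AT), 1.0 at the end (fully NAT)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps beyond the schedule stay fully non-autoregressive.
    return min(1.0, max(0.0, step / total_steps))


def choose_mode(step, total_steps, rng):
    """Sample the training mode for one update according to the pacing schedule."""
    return "NAT" if rng.random() < nat_probability(step, total_steps) else "AT"


if __name__ == "__main__":
    rng = random.Random(0)
    total = 1000
    for step in (0, 250, 500, 750, 1000):
        print(step, nat_probability(step, total), choose_mode(step, total, rng))
```

Early in fine-tuning almost every update is autoregressive, and the share of non-autoregressive updates grows until the model is trained purely in NAT mode.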
What Do You Mean `Why?': Resolving Sluices in Conversations
Hansen, Victor Petrén Bach, Søgaard, Anders
Victor Petrén Bach Hansen (1, 2), Anders Søgaard (1, 3). 1: Department of Computer Science, University of Copenhagen, Denmark. 2: Topdanmark A/S, Denmark. 3: Google Research, Berlin. victor.petren@di.ku.dk, soegaard@di.ku.dk
Abstract. In conversation, we often ask one-word questions such as 'Why?' or 'Who?'. Such questions are typically easy for humans to answer, but can be hard for computers, because their resolution requires retrieving both the right semantic frames and the right arguments from context. This paper introduces the novel ellipsis resolution task of resolving such one-word questions, referred to as sluices in linguistics. We present a crowd-sourced dataset containing annotations of sluices from over 4,000 dialogues collected from conversational QA datasets, as well as a series of strong baseline architectures.
1 Introduction. Stand-alone wh-word questions, such as When? in Figure 1, are easy for us to understand, but in order to interpret them we need to retrieve implicit information from context. Learning to do so is an instance of sluicing, an ellipsis phenomenon, defined by Ross (1969) as 'the effect of deleting everything but the preposed constituent of an embedded question, under the condition that the remainder of the question is identical to some other part of the sentence, or a preceding sentence.' In the context of conversations, one-word wh-word questions are particularly frequent (Anand and Hardt 2016; Rønning, Hardt, and Søgaard 2018), and because they are often hard to resolve, they seem to be a frequent source of error in conversational question answering (Choi et al. 2018; Reddy, Chen, and Manning 2018) and dialogue understanding (Vlachos and Clark 2014). We refer to this type of sluicing as conversational sluicing.
Unlike previous work where sluice resolution is treated as predicting the span of the antecedent (Anand and Hardt 2016; Rønning, Hardt, and Søgaard 2018), we frame conversational sluice resolution as a Natural Language Generation (NLG) task, in which we seek to automatically generate the full question, given a question-answer context and a one-word question. Q 1: Where was the bombing?
Visualisation of embedding relations (Word2Vec, BERT)
In this story, we will visualise the word embedding vectors to understand the relations between words described by the embeddings. This story focuses on word2vec [1] and BERT [2]. To understand the embeddings, I suggest reading a different introduction (like this) as this story does not aim to describe them. This story is part of my journey to develop Neural Machine Translation (NMT) using BERT contextualised embedding vectors. Word embeddings are models to generate computer-friendly numeric vector representations for words.
Graph Transformer for Graph-to-Sequence Learning
The dominant graph-to-sequence transduction models employ graph neural networks for graph representation learning, where the structural information is reflected by the receptive field of neurons. Unlike graph neural networks, which restrict information exchange to immediate neighbors, we propose a new model, known as Graph Transformer, that uses explicit relation encoding and allows direct communication between two distant nodes. It provides a more efficient way to model global graph structure. Experiments on the applications of text generation from Abstract Meaning Representation (AMR) and syntax-based neural machine translation show the superiority of our proposed model. Specifically, our model achieves 27.4 BLEU on LDC2015E86 and 29.7 BLEU on LDC2017T10 for AMR-to-text generation, outperforming the state-of-the-art results by up to 2.2 points. On the syntax-based translation tasks, our model establishes new single-model state-of-the-art BLEU scores, 21.3 for English-to-German and 14.1 for English-to-Czech, improving over the existing best results, including ensembles, by over 1 BLEU.
Understanding and Improving Layer Normalization
Xu, Jingjing, Sun, Xu, Zhang, Zhiyuan, Zhao, Guangxiang, Lin, Junyang
Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where its effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies attribute the success of LayerNorm to forward normalization. Unlike them, we find that the derivatives of the mean and variance, which re-center and re-scale the backward gradients, are more important than forward normalization. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), which replaces the bias and gain with a new transformation function. Experiments show that AdaNorm achieves better results than LayerNorm on seven out of eight datasets.
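To make the distinction between LayerNorm and LayerNorm-simple concrete, here is a minimal NumPy sketch of the forward pass only. The function name `layer_norm`, the `eps` value, and the choice to model LayerNorm-simple as the same function called without `gain` and `bias` are assumptions for illustration. The abstract does not give the AdaNorm transformation function, so it is not sketched here.

```python
import numpy as np


def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize the last axis of x to zero mean and unit variance.

    With gain and bias this is standard LayerNorm; leaving both as None
    corresponds to the LayerNorm-simple variant described in the abstract.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mean) / np.sqrt(var + eps)
    if gain is not None:
        y = y * gain  # learned per-feature scale
    if bias is not None:
        y = y + bias  # learned per-feature shift
    return y


if __name__ == "__main__":
    x = np.array([[1.0, 2.0, 3.0, 4.0]])
    print(layer_norm(x))                                  # LayerNorm-simple
    print(layer_norm(x, gain=np.ones(4), bias=np.zeros(4)))  # standard LayerNorm
```

With `gain` set to ones and `bias` set to zeros, both calls produce the same output, which shows that the two variants differ only in the learned parameters.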
DataCareer: Your Career Platform for Data Science in the UK and Ireland
Grade: G13/3 (net (basic) monthly salary* for this vacancy: EUR 12 435,12, which may be supplemented by various allowances depending on your personal circumstances) Duration of appointment: 5 years Career path: Managerial Location: Munich Application deadline: 17.11.2019 With almost 7 000 employees, the European Patent Office (EPO) is the second-largest public service institution in Europe. It supports innovation, competitiveness and economic growth across Europe through a commitment to high-quality and efficient services delivered under the European Patent Convention, its founding treaty. It has a yearly budget of EUR 2.3 billion, entirely financed by the fees paid by its users. As set out in its Strategic Plan 2023, the EPO is proud to deliver high-quality patents and efficient services that foster innovation, competitiveness and economic growth.
Embedding Projection for Targeted Cross-lingual Sentiment: Model Comparisons and a Real-World Study
Barnes, Jeremy (University of Oslo) | Klinger, Roman
Sentiment analysis benefits from large, hand-annotated resources in order to train and test machine learning models, which are often data hungry. While some languages, e.g., English, have a vast array of these resources, most under-resourced languages do not, especially for fine-grained sentiment tasks, such as aspect-level or targeted sentiment analysis. To improve this situation, we propose a cross-lingual approach to sentiment analysis that is applicable to under-resourced languages and takes into account target-level information. This model incorporates sentiment information into bilingual distributional representations, by jointly optimizing them for semantics and sentiment, showing state-of-the-art performance at sentence-level when combined with machine translation. The adaptation to targeted sentiment analysis on multiple domains shows that our model outperforms other projection-based bilingual embedding methods on binary targeted sentiment tasks. Our analysis on ten languages demonstrates that the amount of unlabeled monolingual data has surprisingly little effect on the sentiment results. As expected, the choice of an annotated source language for projection to a target leads to better results for source-target language pairs which are similar. Therefore, our results suggest that more effort should be spent on creating resources for languages that are less similar to those which are already resource-rich. Finally, a domain mismatch leads to decreased performance. This suggests resources in any language should ideally cover a variety of domains.
Legal translation tool launching for French
In addition to being designed particularly for the French markets of Canada, the company is trying to lure customers with enterprise-centred options such as customization, review by human translators, and cybersecurity. Kalaci says the technology, which is not affiliated with Amazon's Alexa, is hosted on Canadian servers and the text is destroyed once it is translated. There is also an option for firms to use their data to train a customised tool. Either way, he says, is an improvement over free services offered on the web. "Most web-based tools you use, have a disclosure wherein they say, 'Any content you put in here, we keep.' And that's how they keep improving their tools," says Kalaci.
Human-centric Metric for Accelerating Pathology Reports Annotation
Ma, Ruibin, Chen, Po-Hsuan Cameron, Li, Gang, Weng, Wei-Hung, Lin, Angela, Gadepalli, Krishna, Cai, Yuannan
Pathology reports contain useful information such as the main involved organ, diagnosis, etc. This information can be identified from the free-text reports and used for large-scale statistical analysis or serve as annotation for other modalities such as pathology slide images. However, manual classification of a huge number of reports on multiple tasks is labor-intensive. In this paper, we have developed an automatic text classifier based on BERT and we propose a human-centric metric to evaluate the model. According to the model confidence, we identify low-confidence cases that require further expert annotation and high-confidence cases that are automatically classified. We report the percentage of low-confidence cases and the performance of automatically classified cases. On the high-confidence cases, the model achieves classification accuracy comparable to pathologists. This could reduce the manual annotation workload by 80% to 98%.
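The confidence-based split described above can be sketched as follows. This is a minimal illustration, assuming the classifier outputs one list of class probabilities per report and that confidence is the maximum probability. The function name `triage`, the 0.9 threshold, and the toy inputs are hypothetical and not taken from the paper.

```python
def triage(probabilities, labels, threshold=0.9):
    """Split cases by model confidence and report the two quantities from the abstract.

    Returns (fraction of low-confidence cases, accuracy on auto-classified cases,
    indices of cases that need expert annotation).
    """
    auto, review = [], []
    for i, (probs, label) in enumerate(zip(probabilities, labels)):
        confidence = max(probs)
        predicted = probs.index(confidence)
        if confidence >= threshold:
            auto.append((predicted, label))  # classified automatically
        else:
            review.append(i)                 # sent to a human expert
    total = len(labels)
    low_conf_fraction = len(review) / total if total else 0.0
    auto_accuracy = (
        sum(pred == label for pred, label in auto) / len(auto) if auto else None
    )
    return low_conf_fraction, auto_accuracy, review


if __name__ == "__main__":
    probs = [[0.95, 0.05], [0.6, 0.4], [0.1, 0.9], [0.02, 0.98]]
    labels = [0, 0, 1, 0]
    print(triage(probs, labels))
```

Raising the threshold sends more cases to experts and typically raises accuracy on the remaining automatic cases, so the threshold controls the trade-off between annotation workload and reliability.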