AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice using free translators such as CAPITA, Google Translate, SDL International, and SYSTRAN.

I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews

O'Neill, James, Rozenshtein, Polina, Kiryo, Ryuichi, Kubota, Motoko, Bollegala, Danushka

arXiv.org Artificial Intelligence, Apr-14-2021

Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high-quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced by cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged with them to learn accurate CFD models. Applying machine translation to English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.

counterfactual, dataset, mask 0, (17 more...)

arXiv.org Artificial Intelligence

2104.06893

Country:

North America > United States (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (0.67)
Information Technology > Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
(2 more...)

The Curious Case of Hallucinations in Neural Machine Translation

Raunak, Vikas, Menezes, Arul, Junczys-Dowmunt, Marcin

arXiv.org Artificial Intelligence, Apr-14-2021

In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end of the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman (2020), and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation.

computational linguistic, hallucination, translation, (12 more...)

arXiv.org Artificial Intelligence

2104.06683

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Germany > Berlin (0.04)
(10 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Ali, Felermino D. M. A., Caines, Andrew, Malavi, Jaimito L. A.

arXiv.org Artificial Intelligence, Apr-12-2021

Major advances in the performance of machine translation models have been made possible in part by the availability of large-scale parallel corpora. But for most languages in the world, such corpora are rare. Emakhuwa, a language spoken in Mozambique, is, like most African languages, low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens of Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.

corpus, emakhuwa, translation, (14 more...)

arXiv.org Artificial Intelligence

2104.05753

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
South America > Brazil > São Paulo (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Law (0.56)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Macro-Average: Rare Types Are Important Too

Gowda, Thamme, You, Weiqiu, Lignos, Constantine, May, Jonathan

arXiv.org Artificial Intelligence, Apr-12-2021

While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric, MacroF1, and study its applicability to MT evaluation. We find that MacroF1 is competitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Further, we show that MacroF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods' outputs.
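To make the idea of a type-based macro-averaged metric concrete, here is a minimal Python sketch. It treats every word type as its own class, computes a per-type F1 from clipped match counts, and averages those scores without weighting by frequency, so a rare type counts as much as a frequent one. The function name `macro_f1`, the whitespace tokenization, and the absence of smoothing are assumptions for illustration; the paper's exact definition may differ.

```python
from collections import Counter
from typing import Sequence


def macro_f1(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Unweighted mean of per-type F1 over all word types seen in
    the hypotheses or the references (illustrative sketch only)."""
    match, hyp_total, ref_total = Counter(), Counter(), Counter()
    for hyp, ref in zip(hypotheses, references):
        hyp_counts = Counter(hyp.split())  # assumed whitespace tokenization
        ref_counts = Counter(ref.split())
        hyp_total.update(hyp_counts)
        ref_total.update(ref_counts)
        # Counter intersection keeps the minimum count per type (clipping).
        match.update(hyp_counts & ref_counts)

    types = set(hyp_total) | set(ref_total)
    scores = []
    for t in types:
        precision = match[t] / hyp_total[t] if hyp_total[t] else 0.0
        recall = match[t] / ref_total[t] if ref_total[t] else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because each type contributes equally to the mean, dropping a single rare content word lowers the score as much as dropping a frequent function word would.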

orchestra, translation, untranslation, (13 more...)

arXiv.org Artificial Intelligence

2104.057

Country:

Asia > Middle East > Syria (0.14)
Africa > Democratic Republic of the Congo > North Kivu Province (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(42 more...)

Genre: Research Report (1.00)

Industry:

Media (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Leisure & Entertainment > Sports (0.93)
(4 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Sentiment-based Candidate Selection for NMT

Jones, Alex, Wijaya, Derry Tanti

arXiv.org Artificial Intelligence, Apr-10-2021

The explosion of user-generated content (UGC), e.g. social media posts, comments, and reviews, has motivated the development of NLP applications tailored to these types of informal texts. Prevalent among these applications have been sentiment analysis and machine translation (MT). Grounded in the observation that UGC features highly idiomatic, sentiment-charged language, we propose a decoder-side approach that incorporates automatic sentiment scoring into the MT candidate selection process. We train separate English and Spanish sentiment classifiers. Then, using n-best candidates generated by a baseline MT model with beam search, we select the candidate that minimizes the absolute difference between the sentiment score of the source sentence and that of the translation, and perform a human evaluation to assess the produced translations. Unlike previous work, we select this minimally divergent translation by considering the sentiment scores of the source sentence and translation on a continuous interval, rather than using e.g. binary classification, allowing for more fine-grained selection of translation candidates. The results of human evaluations show that, in comparison to the open-source MT baseline model on top of which our sentiment-based pipeline is built, our pipeline produces more accurate translations of colloquial, sentiment-heavy source texts.
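The selection rule described in this abstract can be sketched in a few lines of Python. The sketch assumes two scoring functions that map a sentence to a continuous sentiment score. Here they are stand-in lexicon lookups named `toy_score`, not the classifiers trained in the paper, and the lexicons, the Spanish-to-English direction, and the function names are all hypothetical.

```python
from typing import Callable, Dict, Sequence


def select_candidate(
    source: str,
    candidates: Sequence[str],
    score_source: Callable[[str], float],
    score_target: Callable[[str], float],
) -> str:
    """Return the n-best candidate whose sentiment score is closest
    to the sentiment score of the source sentence."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    source_score = score_source(source)
    return min(candidates, key=lambda c: abs(score_target(c) - source_score))


def toy_score(text: str, lexicon: Dict[str, float]) -> float:
    """Hypothetical stand-in for a sentiment classifier: the mean
    lexicon value of known words, or a neutral 0.5 if none match."""
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.5


# Illustrative lexicons on a continuous 0..1 scale (assumed values).
ES_LEXICON = {"encanta": 0.95, "gusta": 0.7}
EN_LEXICON = {"love": 0.95, "like": 0.7, "okay": 0.5}

best = select_candidate(
    "me encanta esta película",
    ["I like this movie", "I love this movie", "this movie is okay"],
    score_source=lambda s: toy_score(s, ES_LEXICON),
    score_target=lambda s: toy_score(s, EN_LEXICON),
)
```

Scoring on a continuous interval is what lets the rule tell "like" from "love" here; a binary positive/negative label would rate both candidates the same.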

evaluation, sentiment, translation, (15 more...)

arXiv.org Artificial Intelligence

2104.0484

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.14)
Europe > Finland > Uusimaa > Helsinki (0.05)
(9 more...)

Genre: Research Report (0.82)

Industry: Media (0.30)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

XFORMAL: A Benchmark for Multilingual Formality Style Transfer

Briakou, Eleftheria, Lu, Di, Zhang, Ke, Tetreault, Joel

arXiv.org Artificial Intelligence, Apr-8-2021

We take the first step towards multilingual style transfer by creating and releasing XFORMAL, a benchmark of multiple formal reformulations of informal text in Brazilian Portuguese, French, and Italian. Results on XFORMAL suggest that state-of-the-art style transfer approaches perform close to simple baselines, indicating that style transfer is even more challenging in a multilingual setting.

computational linguistic, proceedings, rewrite, (14 more...)

arXiv.org Artificial Intelligence

2104.04108

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Hong Kong (0.05)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(23 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(2 more...)

Extended Parallel Corpus for Amharic-English Machine Translation

Gezmu, Andargachew Mekonnen, Nürnberger, Andreas, Bati, Tesfaye Bayu

arXiv.org Artificial Intelligence, Apr-8-2021

This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of an under-resourced language, Amharic. The corpus is larger than previously compiled corpora; it is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models.

computational linguistic, machine translation, translation, (12 more...)

arXiv.org Artificial Intelligence

2104.03543

Country:

Europe > Germany > Berlin (0.05)
Europe > Germany > Saxony-Anhalt > Magdeburg (0.04)
Europe > Czechia > Prague (0.04)
(24 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Dynabench: Rethinking Benchmarking in NLP

Kiela, Douwe, Bartolo, Max, Nie, Yixin, Kaushik, Divyansh, Geiger, Atticus, Wu, Zhengxuan, Vidgen, Bertie, Prasad, Grusha, Singh, Amanpreet, Ringshia, Pratik, Ma, Zhiyi, Thrush, Tristan, Riedel, Sebastian, Waseem, Zeerak, Stenetorp, Pontus, Jia, Robin, Bansal, Mohit, Potts, Christopher, Williams, Adina

arXiv.org Artificial Intelligence, Apr-7-2021

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
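The acceptance criterion for human-and-model-in-the-loop data collection (the target model gets the example wrong, another person gets it right) can be written as a small predicate. This is plain Python, not the Dynabench API. The function name `keep_example` and the strict-majority rule over validator labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def keep_example(
    annotator_label: str,
    model_prediction: str,
    validator_labels: Sequence[str],
) -> bool:
    """Keep an example only if the target model misclassifies it
    while human validators confirm the annotator's label."""
    model_fooled = model_prediction != annotator_label
    if not validator_labels:
        return False  # no human check available, so do not accept
    majority_label, votes = Counter(validator_labels).most_common(1)[0]
    # Assumed rule: a strict majority of validators must agree.
    humans_agree = (
        majority_label == annotator_label
        and votes > len(validator_labels) / 2
    )
    return model_fooled and humans_agree
```

For instance, `keep_example("negative", "positive", ["negative", "negative", "positive"])` accepts the example, while an example the model classifies correctly is rejected regardless of the validators.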

computational linguistic, linguistic, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2104.14337

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(18 more...)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Communications > Social Media > Crowdsourcing (0.46)
(2 more...)

Efficient transfer learning for NLP with ELECTRA

Mercier, François

arXiv.org Artificial Intelligence, Apr-6-2021

Scope of Reproducibility: Clark et al. [2020] claim that the ELECTRA approach is highly efficient in NLP performance relative to computation budget. As such, this study focuses on this claim, summarized by the following question: Can we use ELECTRA to achieve close to SOTA performance for NLP in low-resource settings, in terms of compute cost? Methodology: This replication study was conducted by fully reimplementing the small variant of the original ELECTRA model (Clark et al. [2020]). All experiments are performed on single-GPU computers. The GLUE benchmark dev set (Wang et al. [2018]) is used for model evaluation, and results are compared with the original paper.

electra, implementation, original paper, (15 more...)

arXiv.org Artificial Intelligence

2104.02756

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

FixMyPose: Pose Correctional Captioning and Retrieval

Kim, Hyounghun, Zala, Abhay, Burri, Graham, Bansal, Mohit

arXiv.org Artificial Intelligence, Apr-4-2021

Interest in physical therapy and individual exercises such as yoga/dance has increased alongside the well-being trend. However, such exercises are hard to follow without expert guidance (which is impossible to scale for personalized feedback to every trainee remotely). Thus, automated pose correction systems are required more than ever, and we introduce a new captioning dataset named FixMyPose to address this need. We collect descriptions of correcting a "current" pose to look like a "target" pose (in both English and Hindi). The collected descriptions have interesting linguistic properties such as egocentric relations to environment objects, analogous references, etc., requiring an understanding of spatial relations and commonsense knowledge about postures. Further, to avoid ML biases, we maintain a balance across characters with diverse demographics, who perform a variety of movements in several interior environments (e.g., homes, offices). From our dataset, we introduce the pose-correctional-captioning task and its reverse target-pose-retrieval task. During the correctional-captioning task, models must generate descriptions of how to move from the current to target pose image, whereas in the retrieval task, models should select the correct target pose given the initial pose and correctional description. We present strong cross-attention baseline models (uni/multimodal, RL, multilingual) and also show that our baselines are competitive with other models when evaluated on other image-difference datasets. We also propose new task-specific metrics (object-match, body-part-match, direction-match) and conduct human evaluation for more reliable evaluation, and we demonstrate a large human-model performance gap suggesting room for promising future work. To verify the sim-to-real transfer of our FixMyPose dataset, we collect a set of real images and show promising performance on these images.

correctional description, dataset, image pair, (17 more...)

arXiv.org Artificial Intelligence

2104.01703

Country:

North America > United States > North Carolina (0.04)
Asia > India (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (0.48)
Leisure & Entertainment (0.48)
Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
(2 more...)
