AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

Documenting the English Colossal Clean Crawled Corpus

Dodge, Jesse, Sap, Maarten, Marasovic, Ana, Agnew, William, Ilharco, Gabriel, Groeneveld, Dirk, Gardner, Matt

arXiv.org Artificial IntelligenceApr-18-2021

As language models are trained on ever more text, researchers are turning to some of the largest corpora available. Unlike most other types of datasets in NLP, large unlabeled text corpora are often presented with minimal documentation, and best practices for documenting them have not been established. In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin with a high-level summary of the data, including distributions of where the text came from and when it was written. We then give more detailed analysis on salient parts of this data, including the most frequent sources of text (e.g., patents.google.com, which contains a significant percentage of machine translated and/or OCR'd text), the effect that the filters had on the data (they disproportionately remove text in AAE), and evidence that some other benchmark NLP dataset examples are contained in the text. We release a web interface to an interactive, indexed copy of this dataset, encouraging the community to continuously explore and report additional findings.

computational linguistic, dataset, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2104.08758

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia (0.04)
North America > United States > Oregon (0.04)
(15 more...)

Genre: Research Report > New Finding (0.93)

Industry: Government (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels

Zeng, Zhaohao, Sakai, Tetsuya

arXiv.org Artificial IntelligenceApr-18-2021

We introduce a data set called DCH-2, which contains 4,390 real customer-helpdesk dialogues in Chinese and their English translations. DCH-2 also contains dialogue-level annotations and turn-level annotations obtained independently from either 19 or 20 annotators. The data set was built through our effort as organisers of the NTCIR-14 Short Text Conversation and NTCIR-15 Dialogue Evaluation tasks, to help researchers understand what constitutes an effective customer-helpdesk dialogue, and thereby build efficient and helpful helpdesk systems that are available to customers at all times. In addition, DCH-2 may be utilised for other purposes, for example, as a repository for retrieval-based dialogue systems, or as a parallel corpus for machine translation in the helpdesk domain.

annotator, dch-2, dialogue, (12 more...)

arXiv.org Artificial Intelligence

2104.08755

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > District of Columbia > Washington (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.47)

Technology:

Information Technology > Communications (0.95)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.34)

Add feedback

Joint Passage Ranking for Diverse Multi-Answer Retrieval

Min, Sewon, Lee, Kenton, Chang, Ming-Wei, Toutanova, Kristina, Hajishirzi, Hannaneh

arXiv.org Artificial IntelligenceApr-17-2021

We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question. This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer. Prior work focusing on single-answer retrieval is limited as it cannot reason about the set of passages jointly. In this paper, we introduce JPR, a joint passage retrieval model focusing on reranking. To model the joint probability of the retrieved passages, JPR makes use of an autoregressive reranker that selects a sequence of passages, equipped with novel training and decoding algorithms. Compared to prior approaches, JPR achieves significantly better answer coverage on three multi-answer datasets. When combined with downstream question answering, the improved retrieval enables larger answer generation models since they need to consider fewer passages, establishing a new state-of-the-art.

machine learning, natural language, question answering, (20 more...)

arXiv.org Artificial Intelligence

2104.08445

Country: North America > United States > New York (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.47)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.71)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

Serial or Parallel? Plug-able Adapter for multilingual machine translation

Zhu, Yaoming, Feng, Jiangtao, Zhao, Chengqi, Wang, Mingxuan, Li, Lei

arXiv.org Artificial IntelligenceApr-16-2021

Developing a unified multilingual translation model is a key topic in machine translation research. However, existing approaches suffer from performance degradation: multilingual models yield inferior performance compared to the ones trained separately on rich bilingual data. We attribute the performance degradation to two issues: multilingual embedding conflation and multilingual fusion effects. To address the two issues, we propose PAM, a Transformer model augmented with defusion adaptation for multilingual machine translation. Specifically, PAM consists of embedding and layer adapters to shift the word and intermediate representations towards language-specific ones. Extensive experiment results on IWSLT, OPUS-100, and WMT benchmarks show that \method outperforms several strong competitors, including series adapter and multilingual knowledge distillation.

adapter, machine translation, translation, (14 more...)

arXiv.org Artificial Intelligence

2104.08154

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Consistency Training with Virtual Adversarial Discrete Perturbation

Park, Jungsoo, Kim, Gyuwan, Kang, Jaewoo

arXiv.org Artificial IntelligenceApr-15-2021

We propose an effective consistency training framework that enforces a training model's predictions given original and perturbed inputs to be similar by adding a discrete noise that would incur the highest divergence between predictions. This virtual adversarial discrete noise obtained by replacing a small portion of tokens while keeping original semantics as much as possible efficiently pushes a training model's decision boundary. Moreover, we perform an iterative refinement process to alleviate the degraded fluency of the perturbed sentence due to the conditional independence assumption. Experimental results show that our proposed method outperforms other consistency training baselines with text editing, paraphrasing, or a continuous noise on semi-supervised text classification tasks and a robustness benchmark.

arxiv preprint arxiv, dataset, prediction, (15 more...)

arXiv.org Artificial Intelligence

2104.07284

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.50)

Add feedback

On the Robustness of Goal Oriented Dialogue Systems to Real-world Noise

Krone, Jason, Sengupta, Sailik, Mansoor, Saab

arXiv.org Artificial IntelligenceApr-14-2021

Goal oriented dialogue systems, that interact in real-word environments, often encounter noisy data. In this work, we investigate how robust goal oriented dialogue systems are to noisy data. Specifically, our analysis considers intent classification (IC) and slot labeling (SL) models that form the basis of most dialogue systems. We collect a test-suite for six common phenomena found in live human-to-bot conversations (abbreviations, casing, misspellings, morphological variants, paraphrases, and synonyms) and show that these phenomena can degrade the IC/SL performance of state-of-the-art BERT based models. Through the use of synthetic data augmentation, we are improve IC/SL model's robustness to real-world noise by +11.5 for IC and +17.3 points for SL on average across noise types. We make our suite of noisy test data public to enable further research into the robustness of dialog systems.

abbreviation, augmentation, robustness, (15 more...)

arXiv.org Artificial Intelligence

2104.07149

Country:

North America > United States > Nevada > Clark County > Las Vegas (0.05)
North America > United States > Pennsylvania (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)

Add feedback

I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews

O'Neill, James, Rozenshtein, Polina, Kiryo, Ryuichi, Kubota, Motoko, Bollegala, Danushka

arXiv.org Artificial IntelligenceApr-14-2021

Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese languages. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation on English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.

counterfactual, dataset, mask 0, (17 more...)

arXiv.org Artificial Intelligence

2104.06893

Country:

North America > United States (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (0.67)
Information Technology > Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
(2 more...)

Add feedback

The Curious Case of Hallucinations in Neural Machine Translation

Raunak, Vikas, Menezes, Arul, Junczys-Dowmunt, Marcin

arXiv.org Artificial IntelligenceApr-14-2021

In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end on the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman (2020), and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation.

computational linguistic, hallucination, translation, (12 more...)

arXiv.org Artificial Intelligence

2104.06683

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Germany > Berlin (0.04)
(10 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Ali, Felermino D. M. A., Caines, Andrew, Malavi, Jaimito L. A.

arXiv.org Artificial IntelligenceApr-12-2021

Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.

corpus, emakhuwa, translation, (14 more...)

arXiv.org Artificial Intelligence

2104.05753

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
South America > Brazil > São Paulo (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Law (0.56)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Macro-Average: Rare Types Are Important Too

Gowda, Thamme, You, Weiqiu, Lignos, Constantine, May, Jonathan

arXiv.org Artificial IntelligenceApr-12-2021

While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric, MacroF1, and study its applicability to MT evaluation. We find that MacroF1 is competitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Further, we show that MacroF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods' outputs.

orchestra, translation, untranslation, (13 more...)

arXiv.org Artificial Intelligence

2104.057

Country:

Asia > Middle East > Syria (0.14)
Africa > Democratic Republic of the Congo > North Kivu Province (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(42 more...)

Genre: Research Report (1.00)

Industry:

Media (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Leisure & Entertainment > Sports (0.93)
(4 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback