AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

Dataset Inference: Ownership Resolution in Machine Learning

Maini, Pratyush, Yaghini, Mohammad, Papernot, Nicolas

arXiv.org Machine LearningApr-21-2021

With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce $dataset$ $inference$, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model.

adversary, dataset, victim, (16 more...)

arXiv.org Machine Learning

2104.10706

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

Add feedback

Google translation AI botches legal terms

#artificialintelligenceApr-20-2021, 10:25:16 GMT

Translation tools from Google and other companies could be contributing to significant misunderstanding of legal terms with conflicting meanings such as "enjoin," according to research due to be presented at an academic workshop. Google's translation software turns an English sentence about a court enjoining violence, or banning it, into one in the Indian language of Kannada that implies the court ordered violence, according to the new study. "Enjoin" can refer to either promoting or restraining an action. Mistranslations also arise with other contronyms, or words with contradictory meanings depending on context, including "all over," "eventual" and "garnish," the paper said. Google said machine translation is "is still just a complement to specialized professional translation" and that it is "continually researching improvements, from better handling ambiguous language, to mitigating bias, to making large quality gains for under-resourced languages."

software, translation, translation ai botch legal term, (6 more...)

#artificialintelligence

Genre: Research Report (0.61)

Industry: Law (0.79)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.75)

Add feedback

Demystify Optimization Challenges in Multilingual Transformers

Li, Xian, Gong, Hongyu

arXiv.org Artificial IntelligenceApr-20-2021

Multilingual Transformer improves parameter efficiency and crosslingual transfer. How to effectively train multilingual models has not been well studied. Using multilingual machine translation as a testbed, we study optimization challenges from loss landscape and parameter plasticity perspectives. We found that imbalanced training data poses task interference between high and low resource languages, characterized by nearly orthogonal gradients for major parameters and the optimization trajectory being mostly dominated by high resource. We show that local curvature of the loss surface affects the degree of interference, and existing heuristics of data subsampling implicitly reduces the sharpness, although still face a trade-off between high and low resource languages. We propose a principled multi-objective optimization algorithm, Curvature Aware Task Scaling (CATS), which improves both optimization and generalization especially for low resource. Experiments on TED, WMT and OPUS-100 benchmarks demonstrate that CATS advances the Pareto front of accuracy while being efficient to apply to massive multilingual settings at the scale of 100 languages.

arxiv preprint arxiv, demystify optimization challenge, translation, (12 more...)

arXiv.org Artificial Intelligence

2104.07639

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Can Latent Alignments Improve Autoregressive Machine Translation?

Haviv, Adi, Vassertail, Lior, Levy, Omer

arXiv.org Artificial IntelligenceApr-19-2021

Latent alignment objectives such as CTC and AXE significantly improve non-autoregressive machine translation models. Can they improve autoregressive models as well? We explore the possibility of training autoregressive machine translation models with latent alignment objectives, and observe that, in practice, this approach results in degenerate models. We provide a theoretical explanation for these empirical results, and prove that latent alignment objectives are incompatible with teacher forcing.

alignment, prediction, sequence, (15 more...)

arXiv.org Artificial Intelligence

2104.09554

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(4 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval

Kulshreshtha, Devang, Belfer, Robert, Serban, Iulian Vlad, Reddy, Siva

arXiv.org Artificial IntelligenceApr-18-2021

In this paper, we propose a new domain adaptation method called $\textit{back-training}$, a superior alternative to self-training. While self-training results in synthetic training data of the form quality inputs aligned with noisy outputs, back-training results in noisy inputs aligned with quality outputs. Our experimental results on unsupervised domain adaptation of question generation and passage retrieval models from $\textit{Natural Questions}$ domain to the machine learning domain show that back-training outperforms self-training by a large margin: 9.3 BLEU-1 points on generation, and 7.9 accuracy points on top-1 retrieval. We release $\textit{MLQuestions}$, a domain-adaptation dataset for the machine learning domain containing 50K unaligned passages and 35K unaligned questions, and 3K aligned passage and question pairs. Our data and code are available at https://github.com/McGill-NLP/MLQuestions

adaptation, domain adaptation, target domain, (13 more...)

arXiv.org Artificial Intelligence

2104.08801

Country:

North America > United States > Wisconsin (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Quebec > Montreal (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Documenting the English Colossal Clean Crawled Corpus

Dodge, Jesse, Sap, Maarten, Marasovic, Ana, Agnew, William, Ilharco, Gabriel, Groeneveld, Dirk, Gardner, Matt

arXiv.org Artificial IntelligenceApr-18-2021

As language models are trained on ever more text, researchers are turning to some of the largest corpora available. Unlike most other types of datasets in NLP, large unlabeled text corpora are often presented with minimal documentation, and best practices for documenting them have not been established. In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin with a high-level summary of the data, including distributions of where the text came from and when it was written. We then give more detailed analysis on salient parts of this data, including the most frequent sources of text (e.g., patents.google.com, which contains a significant percentage of machine translated and/or OCR'd text), the effect that the filters had on the data (they disproportionately remove text in AAE), and evidence that some other benchmark NLP dataset examples are contained in the text. We release a web interface to an interactive, indexed copy of this dataset, encouraging the community to continuously explore and report additional findings.

computational linguistic, dataset, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2104.08758

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia (0.04)
North America > United States > Oregon (0.04)
(15 more...)

Genre: Research Report > New Finding (0.93)

Industry: Government (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels

Zeng, Zhaohao, Sakai, Tetsuya

arXiv.org Artificial IntelligenceApr-18-2021

We introduce a data set called DCH-2, which contains 4,390 real customer-helpdesk dialogues in Chinese and their English translations. DCH-2 also contains dialogue-level annotations and turn-level annotations obtained independently from either 19 or 20 annotators. The data set was built through our effort as organisers of the NTCIR-14 Short Text Conversation and NTCIR-15 Dialogue Evaluation tasks, to help researchers understand what constitutes an effective customer-helpdesk dialogue, and thereby build efficient and helpful helpdesk systems that are available to customers at all times. In addition, DCH-2 may be utilised for other purposes, for example, as a repository for retrieval-based dialogue systems, or as a parallel corpus for machine translation in the helpdesk domain.

annotator, dch-2, dialogue, (12 more...)

arXiv.org Artificial Intelligence

2104.08755

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > District of Columbia > Washington (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.47)

Technology:

Information Technology > Communications (0.95)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.34)

Add feedback

Joint Passage Ranking for Diverse Multi-Answer Retrieval

Min, Sewon, Lee, Kenton, Chang, Ming-Wei, Toutanova, Kristina, Hajishirzi, Hannaneh

arXiv.org Artificial IntelligenceApr-17-2021

We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question. This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer. Prior work focusing on single-answer retrieval is limited as it cannot reason about the set of passages jointly. In this paper, we introduce JPR, a joint passage retrieval model focusing on reranking. To model the joint probability of the retrieved passages, JPR makes use of an autoregressive reranker that selects a sequence of passages, equipped with novel training and decoding algorithms. Compared to prior approaches, JPR achieves significantly better answer coverage on three multi-answer datasets. When combined with downstream question answering, the improved retrieval enables larger answer generation models since they need to consider fewer passages, establishing a new state-of-the-art.

machine learning, natural language, question answering, (20 more...)

arXiv.org Artificial Intelligence

2104.08445

Country: North America > United States > New York (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.47)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.71)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

Serial or Parallel? Plug-able Adapter for multilingual machine translation

Zhu, Yaoming, Feng, Jiangtao, Zhao, Chengqi, Wang, Mingxuan, Li, Lei

arXiv.org Artificial IntelligenceApr-16-2021

Developing a unified multilingual translation model is a key topic in machine translation research. However, existing approaches suffer from performance degradation: multilingual models yield inferior performance compared to the ones trained separately on rich bilingual data. We attribute the performance degradation to two issues: multilingual embedding conflation and multilingual fusion effects. To address the two issues, we propose PAM, a Transformer model augmented with defusion adaptation for multilingual machine translation. Specifically, PAM consists of embedding and layer adapters to shift the word and intermediate representations towards language-specific ones. Extensive experiment results on IWSLT, OPUS-100, and WMT benchmarks show that \method outperforms several strong competitors, including series adapter and multilingual knowledge distillation.

adapter, machine translation, translation, (14 more...)

arXiv.org Artificial Intelligence

2104.08154

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Consistency Training with Virtual Adversarial Discrete Perturbation

Park, Jungsoo, Kim, Gyuwan, Kang, Jaewoo

arXiv.org Artificial IntelligenceApr-15-2021

We propose an effective consistency training framework that enforces a training model's predictions given original and perturbed inputs to be similar by adding a discrete noise that would incur the highest divergence between predictions. This virtual adversarial discrete noise obtained by replacing a small portion of tokens while keeping original semantics as much as possible efficiently pushes a training model's decision boundary. Moreover, we perform an iterative refinement process to alleviate the degraded fluency of the perturbed sentence due to the conditional independence assumption. Experimental results show that our proposed method outperforms other consistency training baselines with text editing, paraphrasing, or a continuous noise on semi-supervised text classification tasks and a robustness benchmark.

arxiv preprint arxiv, dataset, prediction, (15 more...)

arXiv.org Artificial Intelligence

2104.07284

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.50)

Add feedback