AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

10 Best African Language Datasets for Data Science Projects

#artificialintelligence | Jun-27-2021, 21:40:05 GMT

Africa has over 2000 languages, but these languages are not well-represented in the existing Natural Language Processing ecosystem. One challenge is the lack of useful African language datasets that we can use to solve different social and economic problems. In this article, I have compiled a list of African language datasets from across the web. You can use these datasets in various NLP tasks such as text classification, named entity recognition, machine translation, sentiment analysis, speech recognition, and topic modeling. I've made this collection of datasets public to give you an opportunity to use your skills and help solve different challenges.

african language dataset, dataset, language dataset, (11 more...)

#artificialintelligence

Country:

Africa > South Africa (0.06)
Africa > Senegal (0.06)
Africa > Rwanda (0.05)
(15 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)

Power Law Graph Transformer for Machine Translation and Representation Learning

Gokden, Burc

arXiv.org Artificial Intelligence | Jun-27-2021

We present the Power Law Graph Transformer, a transformer model with well-defined deductive and inductive tasks for prediction and representation learning. The deductive task learns the dataset level (global) and instance level (local) graph structures in terms of learnable power law distribution parameters. The inductive task outputs the prediction probabilities using the deductive task output, similar to a transductive model. We trained our model with Turkish-English and Portuguese-English datasets from TED talk transcripts for machine translation and compared the model performance and characteristics to a transformer model with scaled dot product attention trained on the same experimental setup. We report BLEU scores of 17.79 and 28.33 on the Turkish-English and Portuguese-English translation tasks with our model, respectively. We also show how a duality between a quantization set and N-dimensional manifold representation can be leveraged to transform between local and global deductive-inductive outputs using successive application of linear and non-linear transformations end-to-end.

attention stage, graph transformer model, transformer model, (13 more...)

arXiv.org Artificial Intelligence

2107.02039

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Oregon (0.04)
(8 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Artificial Neural Network is Revolutionizing The Future of the Translation Industry

#artificialintelligence | Jun-22-2021, 19:25:22 GMT

Did you know that a full-time translator can translate approximately 520,000 words per year? It would not be wrong to say that the translation industry has existed for centuries and will grow by double digits in the coming years. Because digital realms continuously push for more shared and globalized experiences, the global translation industry is currently worth $56.1 billion, and the figure is projected to surpass $70 billion by 2023. It has been more than 10 years since Google Translate launched using phrase-based machine translation algorithms.

translation, translation industry, upcoming year, (10 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Phrase-level Active Learning for Neural Machine Translation

Hu, Junjie, Neubig, Graham

arXiv.org Artificial Intelligence | Jun-21-2021

Neural machine translation (NMT) is sensitive to domain shift. In this paper, we address this problem in an active learning setting where we can spend a given budget on translating in-domain data, and gradually fine-tune a pre-trained out-of-domain NMT model on the newly translated data. Existing active learning methods for NMT usually select sentences based on uncertainty scores, but these methods require costly translation of full sentences even when only one or two key phrases within the sentence are informative. To address this limitation, we re-examine previous work from the phrase-based machine translation (PBMT) era that selected not full sentences, but rather individual phrases. However, while incorporating these phrases into PBMT systems was relatively simple, it is less trivial for NMT systems, which need to be trained on full sequences to capture larger structural properties of sentences unique to the new domain. To overcome these hurdles, we propose to select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators. In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods, improving up to 1.2 BLEU score over strong active learning baselines.
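
To make the baseline concrete, here is a minimal sketch of the uncertainty-based sentence selection that the abstract contrasts with; it is not the authors' phrase-level method. The scoring rule (mean negative log-probability of the model's output tokens), the word-count cost model, the function names, and the example pool are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def sentence_uncertainty(token_probs: List[float]) -> float:
    """Mean negative log-probability of the tokens the model produced."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def select_by_uncertainty(
    pool: List[Tuple[str, List[float]]], budget_words: int
) -> List[str]:
    """Greedily pick the most uncertain sentences that fit the word budget."""
    ranked = sorted(
        pool, key=lambda item: sentence_uncertainty(item[1]), reverse=True
    )
    chosen: List[str] = []
    spent = 0
    for sentence, _ in ranked:
        cost = len(sentence.split())  # assumed cost model: one unit per word
        if spent + cost <= budget_words:
            chosen.append(sentence)
            spent += cost
    return chosen


# Hypothetical pool: (source sentence, model probabilities of its output tokens).
pool = [
    ("das ist gut", [0.9, 0.95, 0.9]),
    ("der Patient erhielt eine Dosis", [0.4, 0.3, 0.5, 0.6, 0.2]),
    ("guten Morgen", [0.99, 0.98]),
    ("die Nebenwirkungen sind selten", [0.5, 0.4, 0.6, 0.7]),
]
selected = select_by_uncertainty(pool, budget_words=9)
```

Under this scheme the whole sentence is paid for even when only a phrase or two is informative, which is the cost the abstract says phrase-level selection aims to reduce.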

machine translation, selection strategy, translation, (13 more...)

arXiv.org Artificial Intelligence

2106.11375

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Sweden > Uppsala County > Uppsala (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(17 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Tan, Hao, Lei, Jie, Wolf, Thomas, Bansal, Mohit

arXiv.org Artificial Intelligence | Jun-21-2021

Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We pre-train our model on uncurated videos and show that our pre-trained model can reach state-of-the-art results on several video understanding datasets (e.g., SSV2, Diving48). Lastly, we provide detailed analyses on model scalability and pre-training method design.
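
The block-wise masking idea can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the token grid shape, the block size, the uniform sampling of block corners, and the function names are all assumptions.

```python
import random
from typing import List, Tuple

Grid = List[List[List[bool]]]  # mask[t][h][w] is True where a token is masked


def blockwise_mask(
    shape: Tuple[int, int, int],
    block: Tuple[int, int, int],
    num_blocks: int,
    rng: random.Random,
) -> Grid:
    """Mask `num_blocks` contiguous (time, height, width) blocks of tokens."""
    T, H, W = shape
    bt, bh, bw = block
    mask = [[[False] * W for _ in range(H)] for _ in range(T)]
    for _ in range(num_blocks):
        # Sample the block's corner so the whole block stays inside the grid.
        t0 = rng.randint(0, T - bt)
        h0 = rng.randint(0, H - bh)
        w0 = rng.randint(0, W - bw)
        for t in range(t0, t0 + bt):
            for h in range(h0, h0 + bh):
                for w in range(w0, w0 + bw):
                    mask[t][h][w] = True
    return mask


def masked_count(mask: Grid) -> int:
    """Number of masked token positions in the grid."""
    return sum(cell for frame in mask for row in frame for cell in row)


rng = random.Random(0)
# Hypothetical 4-frame clip with an 8x8 token grid per frame.
mask = blockwise_mask(shape=(4, 8, 8), block=(2, 4, 4), num_blocks=1, rng=rng)
```

Because every masked token's nearest neighbors in space and time are mostly masked too, the model cannot recover a token simply by copying an adjacent one, which is the failure mode of uniform per-token masking that the abstract describes.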

dataset, representation, video, (13 more...)

arXiv.org Artificial Intelligence

2106.11250

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

Gong, Hongyu, Tang, Yun, Pino, Juan, Li, Xian

arXiv.org Artificial Intelligence | Jun-21-2021

Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of +2.0 BLEU over 13 language directions in the multilingual setting and +2.0 BLEU over 3 domains in the multi-domain setting.

attention head, transformer, translation, (15 more...)

arXiv.org Artificial Intelligence

2106.10840

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Leveraging Language to Learn Program Abstractions and Search Heuristics

Wong, Catherine, Ellis, Kevin, Tenenbaum, Joshua B., Andreas, Jacob

arXiv.org Artificial Intelligence | Jun-18-2021

Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains -- string editing, image composition, and abstract reasoning about scenes -- even when no natural language hints are available at test time.

abstraction, leveraging language, library, (15 more...)

arXiv.org Artificial Intelligence

2106.11053

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.69)
(4 more...)

Central Kurdish machine translation: First large scale parallel corpus and experiments

Amini, Zhila, Mohammadamini, Mohammad, Hosseini, Hawre, Mansouri, Mehran, Jaff, Daban

arXiv.org Artificial Intelligence | Jun-17-2021

While the computational processing of Kurdish has experienced a relative increase, the machine translation of this language seems to be lacking a considerable body of scientific work. This is in part due to the lack of resources especially curated for this task. In this paper, we present the first large scale parallel corpus of Central Kurdish-English, Awta, containing 229,222 pairs of manually aligned translations. Our corpus is collected from different text genres and domains in an attempt to build more robust and real-world applications of machine translation. We make a portion of this corpus publicly available in order to foster research in this area. Further, we build several neural machine translation models in order to benchmark the task of Kurdish machine translation. Additionally, we perform extensive experimental analysis of results in order to identify the major challenges that Central Kurdish machine translation faces. These challenges include language-dependent and language-independent ones as categorized in this paper, the first group of which are aware of Central Kurdish linguistic properties on different morphological, syntactic and semantic levels. Our best performing systems achieve 22.72 and 16.81 in BLEU score for Ku→En and En→Ku, respectively.

corpus, machine translation, translation, (15 more...)

arXiv.org Artificial Intelligence

2106.09325

Country:

Asia > Middle East > Iraq > Kurdistan Region (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > Ontario > Toronto (0.04)
(6 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Evaluating Gender Bias in Hindi-English Machine Translation

Gupta, Gauri, Ramesh, Krithika, Singh, Sanjay

arXiv.org Artificial Intelligence | Jun-16-2021

With language models being deployed increasingly in the real world, it is essential to address the issue of the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi poses an additional problem to the quantification and mitigation of bias, owing to the change in the form of the words in the sentence based on the gender of the subject. Additionally, little work has been done on measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.

computational linguistics, gender bias, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2106.08680

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.05)
North America > United States > New York > New York County > New York City (0.05)
(6 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

Mahmud, Junayed, Faisal, Fahim, Arnob, Raihan Islam, Anastasopoulos, Antonios, Moran, Kevin

arXiv.org Artificial Intelligence | Jun-15-2021

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy that can be used to drive future research efforts.
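
Of the reference-based metrics named in this abstract, ROUGE-L is the simplest to show: it scores a candidate by its longest common subsequence (LCS) with the reference. The sketch below assumes whitespace tokenization and a plain F1 combination of LCS precision and recall; the study's exact tokenization and weighting are not given here, and `rouge_l_f1` and the example strings are illustrative only.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated comment versus a ground-truth comment.
score = rouge_l_f1(
    "returns the sum of two numbers",
    "returns sum of the two integers",
)
```

A score like this rewards word overlap in order but cannot tell whether "numbers" versus "integers" is a harmless paraphrase or a real error, which is the kind of gap the qualitative error analysis in the abstract is meant to expose.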

code summarization, information, summarization, (15 more...)

arXiv.org Artificial Intelligence

2106.08415

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(5 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.87)
