AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval

Huang, Kung-Hsiang, Zhai, ChengXiang, Ji, Heng

arXiv.org Artificial IntelligenceSep-5-2022

Fact-checking has gained increasing attention due to the widespread of falsified information. Most fact-checking approaches focus on claims made in English only due to the data scarcity issue in other languages. The lack of fact-checking datasets in low-resource languages calls for an effective cross-lingual transfer technique for fact-checking. Additionally, trustworthy information in different languages can be complementary and helpful in verifying facts. To this end, we present the first fact-checking framework augmented with cross-lingual retrieval that aggregates evidence retrieved from multiple languages through a cross-lingual retriever. Given the absence of cross-lingual information retrieval datasets with claim-like queries, we train the retriever with our proposed Cross-lingual Inverse Cloze Task (X-ICT), a self-supervised algorithm that creates training instances by translating the title of a passage. The goal for X-ICT is to learn cross-lingual retrieval in which the model learns to identify the passage corresponding to a given translated title. On the X-Fact dataset, our approach achieves 2.23% absolute F1 improvement in the zero-shot cross-lingual setup over prior systems. The source code and data are publicly available at https://github.com/khuangaf/CONCRETE.

computational linguistic, proceedings, retrieval, (15 more...)

arXiv.org Artificial Intelligence

2209.02071

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Sweden (0.14)
North America > United States > Washington > King County > Seattle (0.04)
(14 more...)

Genre: Research Report (1.00)

Industry:

Media > News (0.94)
Government (0.94)
Automobiles & Trucks (0.69)
Transportation > Ground > Road (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)

Add feedback

Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Sandhan, Jivnesh, Daksh, Ayush, Paranjay, Om Adideva, Behera, Laxmidhar, Goyal, Pawan

arXiv.org Artificial IntelligenceSep-4-2022

Nowadays, the interest in code-mixing has become ubiquitous in Natural Language Processing (NLP); however, not much attention has been given to address this phenomenon for Speech Translation (ST) task. This can be solely attributed to the lack of code-mixed ST task labelled data. Thus, we introduce Prabhupadavani, which is a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, containing 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. The Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching in the case of quotation from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadvani is the first multi-lingual code-mixed ST dataset available in the ST literature. This data also can be used for a code-mixed machine translation task. All the dataset can be accessed at https://github.com/frozentoad9/CMST.

artificial intelligence, machine translation, natural language, (16 more...)

arXiv.org Artificial Intelligence

2201.11391

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
(6 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Student Surpasses Teacher: Imitation Attack for Black-Box NLP APIs

Xu, Qiongkai, He, Xuanli, Lyu, Lingjuan, Qu, Lizhen, Haffari, Gholamreza

arXiv.org Artificial IntelligenceSep-4-2022

Machine-learning-as-a-service (MLaaS) has attracted millions of users to their splendid large-scale models. Although published as black-box APIs, the valuable models behind these services are still vulnerable to imitation attacks. Recently, a series of works have demonstrated that attackers manage to steal or extract the victim models. Nonetheless, none of the previous stolen models can outperform the original black-box APIs. In this work, we conduct unsupervised domain adaptation and multi-victim ensemble to showing that attackers could potentially surpass victims, which is beyond previous understanding of model extraction. Extensive experiments on both benchmark datasets and real-world APIs validate that the imitators can succeed in outperforming the original black-box models on transferred domains. We consider our work as a milestone in the research of imitation attack, especially on NLP APIs, as the superior performance could influence the defense or even publishing strategy of API providers.

attacker, imitation attack, victim model, (13 more...)

arXiv.org Artificial Intelligence

2108.13873

Country:

North America > United States > Oregon (0.04)
North America > United States > Maryland > Baltimore (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Air (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Cloud Computing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Informative Language Representation Learning for Massively Multilingual Neural Machine Translation

Jin, Renren, Xiong, Deyi

arXiv.org Artificial IntelligenceSep-4-2022

In a multilingual neural machine translation model that fully shares parameters across all languages, an artificial language token is usually used to guide translation into the desired target language. However, recent studies show that prepending language tokens sometimes fails to navigate the multilingual neural machine translation models into right translation directions, especially on zero-shot translation. To mitigate this issue, we propose two methods, language embedding embodiment and language-aware multi-head attention, to learn informative language representations to channel translation into right directions. The former embodies language embeddings into different critical switching points along the information flow from the source to the target, aiming at amplifying translation direction guiding signals. The latter exploits a matrix, instead of a vector, to represent a language in the continuous space. The matrix is chunked into multiple heads so as to learn language representations in multiple subspaces. Experiment results on two datasets for massively multilingual neural machine translation demonstrate that language-aware multi-head attention benefits both supervised and zero-shot translation and significantly alleviates the off-target translation issue. Further linguistic typology prediction experiments show that matrix-based language representations learned by our methods are capable of capturing rich linguistic typology features.

computational linguistic, translation, translation direction, (14 more...)

arXiv.org Artificial Intelligence

2209.0153

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(20 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Multimodal Neural Machine Translation with Search Engine Based Image Retrieval

Tang, ZhenHao, Zhang, XiaoBing, Long, Zi, Fu, XiangHua

arXiv.org Artificial IntelligenceSep-3-2022

Recently, numbers of works shows that the performance of neural machine translation (NMT) can be improved to a certain extent with using visual information. However, most of these conclusions are drawn from the analysis of experimental results based on a limited set of bilingual sentence-image pairs, such as Multi30K. In these kinds of datasets, the content of one bilingual parallel sentence pair must be well represented by a manually annotated image, which is different with the actual translation situation. Some previous works are proposed to addressed the problem by retrieving images from exiting sentence-image pairs with topic model. However, because of the limited collection of sentence-image pairs they used, their image retrieval method is difficult to deal with the out-of-vocabulary words, and can hardly prove that visual information enhance NMT rather than the co-occurrence of images and sentences. In this paper, we propose an open-vocabulary image retrieval methods to collect descriptive images for bilingual parallel corpus using image search engine. Next, we propose text-aware attentive visual encoder to filter incorrectly collected noise images. Experiment results on Multi30K and other two translation datasets show that our proposed method achieves significant improvements over strong baselines.

dataset, machine translation, translation, (14 more...)

arXiv.org Artificial Intelligence

2208.00767

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Machine Learning on Image Captioning Application

#artificialintelligenceSep-1-2022, 20:00:42 GMT

Along with the development of technology, there are new discoveries, especially in the field of data science. One of the machine learning methods applied in data science is image processing, aka image processing. The application of image processing is closely related to everyday life. A simple example in image processing is the face detection feature on our cellphones, object detection to label a product (product detection), motor vehicle number plate detection (text extraction), and others. An example of the application of natural language processing that we usually use is machine translation, such as in Google Translate.

image processing, opération, transmission, (15 more...)

#artificialintelligence

Industry: Health & Medicine (0.31)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Add feedback

Unsupervised Simplification of Legal Texts

Cemri, Mert, Çukur, Tolga, Koç, Aykut

arXiv.org Artificial IntelligenceSep-1-2022

The processing of legal texts has been developing as an emerging field in natural language processing (NLP). Legal texts contain unique jargon and complex linguistic attributes in vocabulary, semantics, syntax, and morphology. Therefore, the development of text simplification (TS) methods specific to the legal domain is of paramount importance for facilitating comprehension of legal text by ordinary people and providing inputs to high-level models for mainstream legal NLP applications. While a recent study proposed a rule-based TS method for legal text, learning-based TS in the legal domain has not been considered previously. Here we introduce an unsupervised simplification method for legal texts (USLT). USLT performs domain-specific TS by replacing complex words and splitting long sentences. To this end, USLT detects complex words in a sentence, generates candidates via a masked-transformer model, and selects a candidate for substitution based on a rank score. Afterward, USLT recursively decomposes long sentences into a hierarchy of shorter core and context sentences while preserving semantic meaning. We demonstrate that USLT outperforms state-of-the-art domain-general TS methods in text simplicity while keeping the semantics intact.

assoc, proc, simplification, (16 more...)

arXiv.org Artificial Intelligence

2209.00557

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > United Kingdom (0.14)
(21 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Alleviating the Inequality of Attention Heads for Neural Machine Translation

Sun, Zewei, Huang, Shujian, Dai, Xin-Yu, Chen, Jiajun

arXiv.org Artificial IntelligenceAug-31-2022

Recent studies show that the attention heads in Transformer are not equal. We relate this phenomenon to the imbalance training of multi-head attention and the model dependence on specific heads. To tackle this problem, we propose a simple masking method: HeadMask, in two specific ways. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.

baseline, machine translation, neural machine translation, (10 more...)

arXiv.org Artificial Intelligence

2009.09672

Country: Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

A Survey on Cross-Lingual Summarization

Wang, Jiaan, Meng, Fandong, Zheng, Duo, Liang, Yunlong, Li, Zhixu, Qu, Jianfeng, Zhou, Jie

arXiv.org Artificial IntelligenceAug-30-2022

Cross-lingual summarization is the task of generating a summary in one language (e.g., English) for the given document(s) in a different language (e.g., Chinese). Under the globalization background, this task has attracted increasing attention of the computational linguistics community. Nevertheless, there still remains a lack of comprehensive review for this task. Therefore, we present the first systematic critical review on the datasets, approaches, and challenges in this field. Specifically, we carefully organize existing datasets and approaches according to different construction methods and solution paradigms, respectively. For each type of datasets or approaches, we thoroughly introduce and summarize previous efforts and further compare them with each other to provide deeper analyses. In the end, we also discuss promising directions and offer our thoughts to facilitate future research. This survey is for both beginners and experts in cross-lingual summarization, and we hope it will serve as a starting point as well as a source of new ideas for researchers and engineers interested in this area.

computational linguistic, dataset, summarization, (15 more...)

arXiv.org Artificial Intelligence

2203.12515

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Shanghai > Shanghai (0.04)
(23 more...)

Genre: Overview (1.00)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications (0.93)

Add feedback

CJaFr-v3 : A Freely Available Filtered Japanese-French Aligned Corpus

Blin, Raoul, Cromières, Fabien

arXiv.org Artificial IntelligenceAug-28-2022

We present a free Japanese-French parallel corpus. It includes 15M aligned segments and is obtained by compiling and filtering several existing resources. In this paper, we describe the existing resources, their quantity and quality, the filtering we applied to improve the quality of the corpus, and the content of the ready-to-use corpus. We also evaluate the usefulness of this corpus and the quality of our filtering by training and evaluating some standard MT systems with it.

bisegment, corpora, corpus, (13 more...)

arXiv.org Artificial Intelligence

2208.1317

Country:

Europe > Portugal > Lisbon > Lisbon (0.05)
Europe > France > Île-de-France > Paris > Paris (0.05)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)

Add feedback