AITopics | Machine Translation


Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.


Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks

Malinin, Andrey, Band, Neil, Ganshin, Alexander, Chesnokov, German, Gal, Yarin, Gales, Mark J. F., Noskov, Alexey, Ploskonosov, Andrey, Prokhorenkova, Liudmila, Provilkov, Ivan, Raina, Vatsal, Raina, Vyas, Roginskiy, Denis, Shmatova, Mariya, Tigas, Panos, Yangel, Boris

arXiv.org Artificial Intelligence | Jul-23-2021

There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, 'in-the-wild' distributional shifts and pose interesting challenges with respect to uncertainty estimation. In this work we provide a description of the dataset and baseline results for all tasks.

dataset, distributional shift, prediction

arXiv:2107.07455

Country:

North America > United States (0.14)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology (0.88)
Transportation > Ground > Road (0.88)
Transportation > Passenger (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


Simultaneous Speech Translation for Live Subtitling: from Delay to Display

Karakanta, Alina, Papi, Sara, Negri, Matteo, Turchi, Marco

arXiv.org Artificial Intelligence | Jul-20-2021

With the increased audiovisualisation of communication, the need for live subtitles in multilingual events is more relevant than ever. In an attempt to automatise the process, we aim at exploring the feasibility of simultaneous speech translation (SimulST) for live subtitling. However, the word-for-word rate of generation of SimulST systems is not optimal for displaying the subtitles in a comprehensible and readable way. In this work, we adapt SimulST systems to predict subtitle breaks along with the translation. We then propose a display mode that exploits the predicted break structure by presenting the subtitles in scrolling lines. We compare our proposed mode with a display 1) word-for-word and 2) in blocks, in terms of reading speed and delay. Experiments on three language pairs (en→it, de, fr) show that scrolling lines is the only mode achieving an acceptable reading speed while keeping delay close to a 4-second threshold. We argue that simultaneous translation for readable live subtitles still faces challenges, the main one being poor translation quality, and propose directions for steering future research.
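To make the scrolling-lines idea in the abstract above more concrete, here is a minimal Python sketch of how a display might consume a translated word stream that carries predicted line breaks. The `<eol>` marker, the `scrolling_frames` function, and the two-line window are assumptions chosen for illustration; the paper's actual symbols and display logic may differ.

```python
from collections import deque

# Hypothetical break marker; the symbol used by the authors may differ.
EOL = "<eol>"

def scrolling_frames(tokens, max_lines=2):
    """Yield the lines visible on screen after each incoming token.

    Words are appended to the bottom line as they arrive. A break marker
    opens a new bottom line, and the oldest line scrolls off screen once
    more than `max_lines` lines would be visible.
    """
    lines = deque([[]], maxlen=max_lines)
    for tok in tokens:
        if tok == EOL:
            lines.append([])  # a full deque drops its oldest line here
        else:
            lines[-1].append(tok)
        yield [" ".join(line) for line in lines]

stream = ["Good", "evening", EOL, "and", "welcome", EOL, "to", "the", "show"]
frames = list(scrolling_frames(stream))
for frame in frames:
    print(frame)
```

Each frame is the list of lines a viewer would see at that moment, so per-frame reading speed or delay measurements could be layered on top of this loop.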

artificial intelligence, machine translation, natural language

arXiv:2107.08807

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.60)


Tea: Program Repair Using Neural Network Based on Program Information Attention Matrix

Wang, Wenshuo, Wu, Chen, Cheng, Liang, Zhang, Yang

arXiv.org Artificial Intelligence | Jul-17-2021

Advances in machine learning (ML)-driven natural language processing (NLP) point to a promising direction for automatic bug fixing in software programs, as fixing a buggy program can be transformed into a translation task. While software programs contain much richer information than one-dimensional natural language documents, pioneering work on using ML-driven NLP techniques for automatic program repair considered only a limited set of such information. We hypothesize that more comprehensive information about software programs, if appropriately utilized, can improve the effectiveness of ML-driven NLP approaches in repairing software programs. As the first step towards proving this hypothesis, we propose a unified representation to capture the syntax, data flow, and control flow aspects of software programs, and devise a method to use such a representation to guide the transformer model from NLP in better understanding and fixing buggy programs. Our preliminary experiment confirms that the more comprehensive the information about software programs used, the better ML-driven NLP techniques perform in fixing bugs in these programs.

information, representation, software program

arXiv:2107.08262

Country:

Asia (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.65)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)


Attackers can elicit 'toxic behavior' from AI translation systems, study finds

#artificialintelligence | Jul-15-2021, 19:17:44 GMT

Neural machine translation (NMT), or AI techniques that can translate between languages, is in widespread use today owing to its robustness and versatility. But it's been shown that NMT systems can be manipulated if provided prompts containing certain words, phrases, or alphanumeric symbols. For example, in 2015, Google fixed a bug that caused Google Translate to offer homophobic slurs like "poof" and "queen" to those translating the word "gay" from English into Spanish, French, or Portuguese. In another glitch, Reddit users discovered that typing repeated words like "dog" into Translate and asking the system to translate into English yielded "doomsday predictions." A new study from researchers at the University of Melbourne, Facebook, Twitter, and Amazon suggests that NMT systems are even more vulnerable than previously believed.

ai translation system, back-translation, nmt system

Genre: Research Report (0.53)

Industry: Media > News (0.53)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


These Headphones Translate Foreign Languages on the Fly

WIRED | Jul-15-2021, 11:00:00 GMT

A few years ago, I spent a day at Suntory's Yamazaki Distillery outside of Kyoto, Japan. There's a bar at the end of the tour, and (pro tip) it's one of the only places in the world you can get Suntory's whiskeys at cost. When I purchased my first glass of whiskey, a pair of Japanese men who'd taken the Shinkansen in from Tokyo waved me over to their table. Through pantomime, one of them offered me a taste of the whiskey in his glass, and we ended up spending hours sampling spirits and talking about Japanese whiskey through the magic of Google Translate on our phones. It was a halting, awkward way to have a conversation, but it was glorious, and it still stands as one of the best experiences of my life.

ambassador, headphone translate foreign language, translation

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.25)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.25)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.35)


Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Xu, Guowei, Ding, Wenbiao, Fu, Weiping, Wu, Zhongqin, Liu, Zitao

arXiv.org Artificial Intelligence | Jul-15-2021

Many real-world applications involve the use of Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors, and inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance in many NLP benchmarks, we prove that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. In order to improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases there are only labelled clean texts. Since there are no handwritten pictures corresponding to the text, it is impossible to directly use the recognition model to obtain noisy labelled data. Human resources can be employed to copy texts and take pictures, but this is extremely expensive considering the size of data for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, 2) iteratively mines the hard examples from a large number of simulated samples for optimal performance, and 3) employs a stability loss to make the model learn noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. We believe that this work can greatly promote the application of NLP models in actual scenarios, although the algorithm we use is simple and straightforward. We make our code and three datasets publicly available at https://github.com/tal-ai/Robust-learning-MSSHEM.
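As a rough illustration of the first step the abstract describes (simulating OCR noise directly from clean text), the Python sketch below swaps visually similar characters at random. The confusion table, the function name `simulate_ocr_noise`, and the `rate` parameter are assumptions made for illustration; they are not taken from the paper or its repository, and the authors' noise simulation may work quite differently.

```python
import random

# Hypothetical confusion table of look-alike characters an OCR engine
# might mix up. These pairs are illustrative, not taken from the paper.
CONFUSIONS = {
    "0": "O", "O": "0",
    "1": "l", "l": "1",
    "5": "S", "S": "5",
    "e": "c", "c": "e",
}

def simulate_ocr_noise(text, rate, rng):
    """Swap each confusable character for its look-alike with probability `rate`."""
    noisy = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            noisy.append(CONFUSIONS[ch])
        else:
            noisy.append(ch)
    return "".join(noisy)

clean = "She scored 105 on level 1"
rng = random.Random(0)  # seeded so runs are repeatable
unchanged = simulate_ocr_noise(clean, 0.0, rng)    # rate 0: nothing is swapped
fully_noisy = simulate_ocr_noise(clean, 1.0, rng)  # rate 1: every confusable character is swapped
print(unchanged)
print(fully_noisy)
```

Because the substitutions are one-to-one, the noisy text keeps the length of the clean text, so any token-level labels attached to the clean text could be reused for the noisy copy.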

ocr transcript, stability loss, transcript

arXiv:2107.07113

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)

Genre: Research Report (0.50)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.48)


A Survey on Data Augmentation for Text Classification

Bayer, Markus, Kaufhold, Marc-André, Reuter, Christian

arXiv.org Artificial Intelligence | Jul-14-2021

Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data, through regularizing the objective, to limiting the amount of data used in order to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).

augmentation, augmentation method, data augmentation

arXiv:2107.03158

Country:

Europe > United Kingdom (0.14)
North America > United States > Texas (0.14)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)

Genre:

Overview (1.00)
Summary/Review (0.92)
Research Report > New Finding (0.46)
Research Report > Promising Solution (0.45)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Landscape Analysis: Neural Machine Translation

#artificialintelligence | Jul-12-2021, 23:40:31 GMT

The Big 3, when it comes to neural machine translation (NMT), are Google, Microsoft, and Amazon. Within this group, Google leads in language coverage, supporting 109 languages compared with Microsoft's 73 and Amazon's 55. Overall, Google is flush with talent, data, and resources, and it leverages those assets to maintain its dominant position. With that said, Google Translate is a tool that businesses like Native can license in order to leverage best-in-class technology. In this sense, Google is currently a key partner and will only become a competitor when Native builds out its own neural translation engine.

google, neural machine translation, translation

Country: Europe > Germany (0.06)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)


Improving Low-resource Reading Comprehension via Cross-lingual Transposition Rethinking

Wu, Gaochen, Xu, Bin, Qin, Yuxin, Kong, Fei, Liu, Bangchang, Zhao, Hongwen, Chang, Dejie

arXiv.org Artificial Intelligence | Jul-11-2021

Extractive Reading Comprehension (ERC) has made tremendous advances enabled by the availability of large-scale high-quality ERC training data. Despite such rapid progress and widespread application, datasets in languages other than high-resource languages such as English remain scarce. To address this issue, we propose a Cross-Lingual Transposition ReThinking (XLTT) model by modelling existing high-quality extractive reading comprehension datasets in a multilingual environment. To be specific, we present multilingual adaptive attention (MAA) to combine intra-attention and inter-attention to learn more generalizable semantic and lexical knowledge from each pair of language families. Furthermore, to make full use of existing datasets, we adopt a new training framework to train our model by calculating task-level similarities between each existing dataset and the target dataset. The experimental results show that our XLTT model surpasses six baselines on two multilingual ERC benchmarks and is especially effective for low-resource languages, with average improvements of 3.9 and 4.1 in F1 and EM, respectively.

dataset, representation, training dataset

arXiv:2107.05002

Country:

Asia > China > Beijing > Beijing (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education > Assessment & Standards > Student Performance (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)


Zoom acquires an AI company building real-time translation

#artificialintelligence | Jul-6-2021, 05:05:30 GMT

Zoom has announced that it's acquiring a company known as Kites (short for Karlsruhe Information Technology Solutions), which has worked on creating real-time translation and transcription software. Zoom says the acquisition is a move to help it make communicating with people who speak different languages easier, and that it's looking to add translation capabilities to its video conferencing app. According to its site, Kites began at the Karlsruhe Institute of Technology, and its technology was originally developed to act as in-classroom translation for students who needed help understanding the English or German their professors were lecturing in. Zoom already has real-time transcription, but it's limited to people who are talking in English. On a support page, Zoom also makes it clear that its current live transcription feature may not meet certain accuracy requirements.

ai company building real-time translation, kite, zoom


Country: Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.57)

Technology:

Information Technology > Architecture > Real Time Systems (0.93)
Information Technology > Communications (0.66)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.40)
