 Machine Translation


Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks

arXiv.org Artificial Intelligence

There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, 'in-the-wild' distributional shifts and pose interesting challenges with respect to uncertainty estimation. In this work we provide a description of the dataset and baseline results for all tasks.


Simultaneous Speech Translation for Live Subtitling: from Delay to Display

arXiv.org Artificial Intelligence

With the increased audiovisualisation of communication, the need for live subtitles in multilingual events is more relevant than ever. In an attempt to automatise the process, we aim at exploring the feasibility of simultaneous speech translation (SimulST) for live subtitling. However, the word-for-word rate of generation of SimulST systems is not optimal for displaying the subtitles in a comprehensible and readable way. In this work, we adapt SimulST systems to predict subtitle breaks along with the translation. We then propose a display mode that exploits the predicted break structure by presenting the subtitles in scrolling lines. We compare our proposed mode with a display 1) word-for-word and 2) in blocks, in terms of reading speed and delay. Experiments on three language pairs (en→it, de, fr) show that scrolling lines is the only mode achieving an acceptable reading speed while keeping delay close to a 4-second threshold. We argue that simultaneous translation for readable live subtitles still faces challenges, the main one being poor translation quality, and propose directions for steering future research.


Tea: Program Repair Using Neural Network Based on Program Information Attention Matrix

arXiv.org Artificial Intelligence

Advances in machine learning (ML)-driven natural language processing (NLP) point to a promising direction for automatic bug fixing in software programs, as fixing a buggy program can be framed as a translation task. While software programs contain much richer information than one-dimensional natural language documents, pioneering work on using ML-driven NLP techniques for automatic program repair considered only a limited set of such information. We hypothesize that more comprehensive information about software programs, if appropriately utilized, can improve the effectiveness of ML-driven NLP approaches in repairing software programs. As the first step towards proving this hypothesis, we propose a unified representation to capture the syntax, data flow, and control flow aspects of software programs, and devise a method to use such a representation to guide the transformer model from NLP in better understanding and fixing buggy programs. Our preliminary experiment confirms that the more comprehensive the information about software programs that is used, the better ML-driven NLP techniques can perform in fixing bugs in these programs.


Attackers can elicit 'toxic behavior' from AI translation systems, study finds

#artificialintelligence

Neural machine translation (NMT), or AI techniques that can translate between languages, is in widespread use today owing to its robustness and versatility. But it's been shown that NMT systems can be manipulated if provided prompts containing certain words, phrases, or alphanumeric symbols. For example, in 2015, Google fixed a bug that caused Google Translate to offer homophobic slurs like "poof" and "queen" to those translating the word "gay" from English into Spanish, French, or Portuguese. In another glitch, Reddit users discovered that typing repeated words like "dog" into Translate and asking the system to translate into English yielded "doomsday predictions." A new study from researchers at the University of Melbourne, Facebook, Twitter, and Amazon suggests that NMT systems are even more vulnerable than previously believed.


These Headphones Translate Foreign Languages on the Fly

WIRED

A few years ago, I spent a day at Suntory's Yamazaki Distillery outside of Kyoto, Japan. There's a bar at the end of the tour, and (pro tip) it's one of the only places in the world you can get Suntory's whiskeys at cost. When I purchased my first glass of whiskey, a pair of Japanese men who'd taken the Shinkansen in from Tokyo waved me over to their table. Through pantomime, one of them offered me a taste of the whisky in his glass, and we ended up spending hours sampling spirits and talking about Japanese whiskey through the magic of Google Translate on our phones. It was a halting, awkward way to have a conversation, but it was glorious, and it still stands as one of the best experiences of my life.


Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

arXiv.org Artificial Intelligence

Many real-world applications involve the use of Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors, and inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance in many NLP benchmarks, we prove that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. In order to improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. However, in most cases there are only labelled clean texts. Since there are no handwritten pictures corresponding to the text, it is impossible to directly use the recognition model to obtain noisy labelled data. Human resources can be employed to copy texts and take pictures, but this is extremely expensive considering the size of data for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, 2) iteratively mines the hard examples from a large number of simulated samples for optimal performance, and 3) employs a stability loss to make our model learn noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. We believe that this work can greatly promote the application of NLP models in actual scenarios, although the algorithm we use is simple and straightforward. We make our code and three datasets publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
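
As a rough illustration of step 1 above (simulating OCR noise from clean text), the following minimal Python sketch corrupts characters at random. It is not the authors' implementation: the function name `simulate_ocr_noise`, the `CONFUSIONS` table, and the three edit operations are assumptions chosen for illustration, and a realistic confusion table would be estimated from actual OCR output.

```python
import random

# Hypothetical look-alike confusions; a real table would be derived
# from the errors of an actual OCR engine.
CONFUSIONS = {
    "l": ["1", "I"],
    "O": ["0"],
    "o": ["0"],
    "m": ["rn"],
    "S": ["5"],
    "e": ["c"],
}


def simulate_ocr_noise(text, rate=0.1, seed=None):
    """Return a noisy copy of `text`, editing each character with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Keep the character unchanged with probability 1 - rate.
        if rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(["substitute", "delete", "duplicate"])
        if op == "substitute" and ch in CONFUSIONS:
            # Replace with a visually similar character sequence.
            out.append(rng.choice(CONFUSIONS[ch]))
        elif op == "delete":
            # Drop the character entirely.
            continue
        elif op == "duplicate":
            # Emit the character twice.
            out.append(ch + ch)
        else:
            # Substitution requested but no confusion is known: keep as-is.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(simulate_ocr_noise("Some clean labelled sentence", rate=0.2, seed=0))
```

Noisy copies produced this way keep the label of the clean text they came from, which is what would make them usable as labelled noisy training data under these assumptions.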


A Survey on Data Augmentation for Text Classification

arXiv.org Artificial Intelligence

Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).


Landscape Analysis: Neural Machine Translation

#artificialintelligence

The Big 3, when it comes to neural machine translation (NMT), are Google, Microsoft, and Amazon. Among this group, Google leads in language coverage, supporting 109 languages compared with Microsoft's 73 and Amazon's 55. Overall, Google is flush with talent, data, and resources, and it leverages those assets to maintain its dominant position. With that said, Google Translate is a tool that businesses like Native can license in order to leverage best-in-class technology. In this sense, Google is currently a key partner and will only become a competitor when Native builds out its own neural translation engine.


Improving Low-resource Reading Comprehension via Cross-lingual Transposition Rethinking

arXiv.org Artificial Intelligence

Extractive Reading Comprehension (ERC) has made tremendous advances enabled by the availability of large-scale high-quality ERC training data. Despite such rapid progress and widespread application, datasets in languages other than high-resource languages such as English remain scarce. To address this issue, we propose a Cross-Lingual Transposition ReThinking (XLTT) model by modelling existing high-quality extractive reading comprehension datasets in a multilingual environment. To be specific, we present multilingual adaptive attention (MAA) to combine intra-attention and inter-attention to learn more generalizable semantic and lexical knowledge from each pair of language families. Furthermore, to make full use of existing datasets, we adopt a new training framework to train our model by calculating task-level similarities between each existing dataset and the target dataset. The experimental results show that our XLTT model surpasses six baselines on two multilingual ERC benchmarks and is especially effective for low-resource languages, with average improvements of 3.9 and 4.1 in F1 and EM, respectively.


Zoom acquires an AI company building real-time translation

#artificialintelligence

Zoom has announced that it's acquiring a company known as Kites (short for Karlsruhe Information Technology Solutions), which has worked on creating real-time translation and transcription software. Zoom says the acquisition is a move to help it make communicating with people who speak different languages easier, and that it's looking to add translation capabilities to its video conferencing app. According to its site, Kites began at the Karlsruhe Institute of Technology, and its technology was originally developed to act as in-classroom translation for students who needed help understanding the English or German their professors were lecturing in. Zoom already has real-time transcription, but it's limited to people who are talking in English. On a support page, Zoom also makes it clear that its current live transcription feature may not meet certain accuracy requirements.