 Machine Translation


Self-Guided Curriculum Learning for Neural Machine Translation

arXiv.org Artificial Intelligence

In the field of machine learning, a well-trained model is assumed to be able to recover the training labels, i.e., the synthetic labels predicted by the model should be as close to the ground-truth labels as possible. Inspired by this, we propose a self-guided curriculum strategy to encourage the learning of neural machine translation (NMT) models to follow the above recovery criterion, where we cast the recovery degree of each training example as its learning difficulty. Specifically, we adopt the sentence-level BLEU score as a proxy for the recovery degree. Different from existing curricula relying on linguistic prior knowledge or third-party language models, our chosen learning difficulty is more suitable for measuring the degree of knowledge mastery of NMT models. Experiments on translation benchmarks, including WMT14 English$\Rightarrow$German and WMT17 Chinese$\Rightarrow$English, demonstrate that our approach consistently improves translation performance over a strong Transformer baseline.
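The recovery criterion described in this abstract can be illustrated with a minimal sketch. It assumes that a higher sentence-level BLEU between the model's own translation and the reference marks an easier example, and it uses a simplified add-one smoothed BLEU rather than the authors' exact scorer. The names `sentence_bleu`, `order_by_difficulty`, and the `translate` callable are hypothetical stand-ins, not part of the paper's code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified add-one smoothed sentence-level BLEU in (0, 1].

    Illustrative only: whitespace tokenization, single reference.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches: Counter intersection takes the minimum count.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps short sentences from collapsing to zero.
        log_precision += math.log((overlap + 1) / (total + 1))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision / max_n)


def order_by_difficulty(examples, translate):
    """Sort (source, reference) pairs from easiest to hardest.

    `translate` is a hypothetical callable standing in for a trained NMT model.
    Assumption: higher BLEU (better recovery) means lower difficulty.
    """
    scored = [(sentence_bleu(translate(src), ref), src, ref) for src, ref in examples]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

A curriculum built this way would present the high-scoring (well-recovered) examples first and introduce the low-scoring ones later; how the schedule is paced is not specified in the abstract and is left out of the sketch.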


Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation

arXiv.org Artificial Intelligence

The data scarcity in low-resource languages has become a bottleneck to building robust neural machine translation systems. Fine-tuning a multilingual pre-trained model (e.g., mBART (Liu et al., 2020)) on the translation task is a good approach for low-resource languages; however, its performance will be greatly limited when there are unseen languages in the translation pairs. In this paper, we present a continual pre-training (CPT) framework on mBART to effectively adapt it to unseen languages. We first construct noisy mixed-language text from the monolingual corpus of the target language in the translation pair to cover both the source and target languages, and then, we continue pre-training mBART to reconstruct the original monolingual text. Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline, as well as other strong baselines, across all tested low-resource translation pairs containing unseen languages. Furthermore, our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training. The code is available at https://github.com/zliucr/cpt-nmt.
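One way to picture the noisy mixed-language text is the sketch below. It assumes the noise is created by swapping some target-language words for source-language words taken from a bilingual lexicon; the abstract does not state the exact noising procedure, so the lexicon, the `replace_prob` parameter, and the function name are all hypothetical. The returned pair is (noisy input, original text), which is the shape a reconstruction objective would need.

```python
import random


def make_mixed_language_pair(sentence, lexicon, replace_prob=0.3, rng=None):
    """Build a (noisy mixed-language input, original sentence) training pair.

    `lexicon` is a hypothetical word-level dictionary mapping target-language
    words to source-language words. Words missing from it are kept unchanged.
    """
    rng = rng or random.Random(0)
    noisy = []
    for token in sentence.split():
        if token in lexicon and rng.random() < replace_prob:
            noisy.append(lexicon[token])  # swap in the other language's word
        else:
            noisy.append(token)
    return " ".join(noisy), sentence
```

For example, with an English-to-German toy lexicon `{"cat": "Katze"}` and `replace_prob=1.0`, the sentence `"the cat sleeps"` yields the noisy input `"the Katze sleeps"` paired with the untouched original.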


How is Artificial Intelligence Challenging the Translation Industry?

#artificialintelligence

Language is perhaps the most defining factor of humankind. What makes humans different from other animals on the planet is our ability to speak out and communicate via framed words and sentences. The language of a population is one of the most defining factors across countries and nationalities, regions, and cultures. It can define the history, sociocultural situation, and even geographic diversity. From ancient times, there has been a trend for people to understand the language of one another. History traces back to Greeks and Romans traveling all across the world to discover, decipher and translate languages to find out the cultural, political, and social situations from one era to another.


Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation

arXiv.org Artificial Intelligence

Understanding voluminous historical records provides clues on the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize the historical records, since most of the documents are not written in a modern language and parts of the contents have been damaged over time. As a result, restoring the damaged or unrecognizable parts as well as translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records that are among the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task over baselines without multi-task learning. In addition, we present an in-depth exploratory analysis of our translated results via topic modeling, uncovering several significant historical events.


Full-Sentence Models Perform Better in Simultaneous Translation Using the Information Enhanced Decoding Strategy

arXiv.org Artificial Intelligence

Simultaneous translation, which starts translating each sentence after receiving only a few words of the source sentence, plays a vital role in many scenarios. Although the previous prefix-to-prefix framework is considered suitable for simultaneous translation and achieves good performance, it still has two inevitable drawbacks: the high computational resource costs caused by the need to train a separate model for each latency $k$ and the insufficient ability to encode information because each target token can only attend to a specific source prefix. We propose a novel framework that adopts a simple but effective decoding strategy designed for full-sentence models. Within this framework, training a single full-sentence model can achieve any given latency and save computational resources. Besides, with the competence of the full-sentence model to encode the whole sentence, our decoding strategy can enhance the information maintained in the decoded states in real time. Experimental results show that our method achieves better translation quality than baselines in four directions: Zh$\rightarrow$En, En$\rightarrow$Ro and En$\leftrightarrow$De.
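To make the role of the latency $k$ concrete, here is a minimal sketch of a wait-k style read/write schedule applied at decoding time to a single model. It assumes the common convention that target step $t$ (1-indexed) may see the first $\min(k + t - 1, |x|)$ source tokens. It does not reproduce the paper's information-enhanced strategy, and `next_token` is a hypothetical callable standing in for one decoding step of a full-sentence model.

```python
def visible_source_length(k, t, source_length):
    """Number of source tokens readable when emitting target token t (1-indexed)."""
    return min(k + t - 1, source_length)


def simultaneous_decode(source_tokens, k, next_token, eos="</s>", max_steps=50):
    """Greedy decoding under a wait-k schedule with one model for any k.

    `next_token(source_prefix, target_so_far)` is a hypothetical stand-in for
    a model step; it returns the next target token or `eos`.
    """
    target = []
    for t in range(1, max_steps + 1):
        # Only the prefix allowed by the schedule is exposed to the model.
        prefix = source_tokens[:visible_source_length(k, t, len(source_tokens))]
        token = next_token(prefix, target)
        if token == eos:
            break
        target.append(token)
    return target
```

Because `k` is only an argument of the decoding loop, changing the latency requires no retraining in this sketch, which mirrors the abstract's point about using one full-sentence model for any given latency.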


Why Ambitious Predictions About A.I. Are Always Wrong

Slate

Since the very beginning of the computer revolution, researchers have dreamed of creating computers that would rival the human brain. Our brains are information machines that use inputs to generate outputs, and so are computers. How hard could it be to build computers that work as well as our brains? In 1954 a Georgetown-IBM team predicted that language translation programs would be perfected in three to five years. In 1965 Herbert Simon said that "machines will be capable, within twenty years, of doing any work a man can do."


The internet is excluding Asian-Americans who don't speak English

MIT Technology Review

And it starts right at the beginning. Instead of the Hmong word for "hello" or "welcome," she says, is "something else that said, like, 'your honor' or 'the queen' or 'the king' instead." Seeing something so simple done incorrectly was frustrating and off-putting. "Not only was it just probably churned through Google Translate, it wasn't even peer edited and reviewed to ensure that there was fluency and coherence," she says. Xiong says this kind of carelessness is common online--and it's one reason she and others in the Hmong community can feel excluded from politics.


Limited English Skills Can Mean Limited Access to the COVID-19 Vaccine

Slate

This story was published in partnership with Type Investigations with support from the Puffin Foundation. In California, non-English speakers were handed COVID-19 vaccination cards without information on what they mean. In Pennsylvania, people who speak Mandarin, Korean, and Japanese were unable to make vaccine appointments due to a lack of interpreters at hospital call centers. These are just a few of the examples captured in a new complaint filed on Friday to the U.S. Department of Health and Human Services' Office for Civil Rights, Federal Emergency Management Agency's Office of Equal Rights, and Department of Homeland Security's Office for Civil Rights and Civil Liberties. The complaint, brought by the National Health Law Program, finds widespread problems across the country that inhibit access to COVID-19 resources for people with limited English proficiency (LEP).


Translate All: Automating multiple file type batch translation with AWS CloudFormation

#artificialintelligence

This is a guest post by Cyrus Wong, an AWS Machine Learning Hero. You can learn more about and connect with AWS Machine Learning Heroes at the community page. On July 29, 2020, AWS announced that Amazon Translate now supports Microsoft Office documents, including .docx, The world is full of bilingual countries and cities like Hong Kong. I find myself always needing to prepare Office documents and presentation slides in both English and Chinese.


Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

arXiv.org Artificial Intelligence

Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
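Error-based scoring of this kind can be sketched as follows. The sketch assumes each annotation is an (error category, severity) pair and that a system's score is the severity-weighted penalty averaged over segments, with lower being better. The severity weights, category names, and function names are illustrative assumptions; the abstract does not specify them.

```python
# Assumed severity weights, chosen only for illustration.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}


def mqm_score(annotations, num_segments):
    """Average severity-weighted penalty per segment (lower is better).

    `annotations` is a list of (category, severity) pairs for one system.
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _category, severity in annotations)
    return penalty / num_segments


def rank_systems(system_annotations, num_segments):
    """Return (system, score) pairs sorted from best to worst."""
    scores = {
        name: mqm_score(annotations, num_segments)
        for name, annotations in system_annotations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

Keeping the category alongside the severity, even though only the severity affects the score here, leaves room for the per-category error analysis the abstract describes.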