Machine Translation
Hallucinating to better text translation
As babies, we babble and imitate our way to learning languages. We don't start off reading raw text, which requires fundamental knowledge and understanding about the world, as well as the advanced ability to interpret and infer descriptions and relationships. Rather, humans begin our language journey slowly, by pointing and interacting with our environment, basing our words and perceiving their meaning through the context of the physical and social world. Eventually, we can craft full sentences to communicate complex ideas. Similarly, when humans begin learning and translating into another language, the incorporation of other sensory information, like multimedia, paired with the new and unfamiliar words, like flashcards with images, improves language acquisition and retention. Then, with enough practice, humans can accurately translate new, unseen sentences in context without the accompanying media; however, imagining a picture based on the original text helps.
Mozilla brings free, offline translation to Firefox - TechCrunch
Mozilla has added an official translation tool to Firefox that doesn't rely on cloud processing to do its work, instead performing the machine learning-based process right on your own computer. It's a huge step forward for a popular service tied strongly to giants like Google and Microsoft. The translation tool, called Firefox Translations, can be added to your browser here. It will need to download some resources the first time it translates a language, and presumably it may download improved models if needed, but the actual translation work is done by your computer, not in a datacenter a couple hundred miles away. This is important not because a lot of people need to translate in their browsers while offline -- like a screen door for a submarine, it's not really a use case that makes sense.
Exploring Diversity in Back Translation for Low-Resource Machine Translation
Burchell, Laurie, Birch, Alexandra, Heafield, Kenneth
Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations. We argue that the definitions and metrics used to quantify 'diversity' in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English$\leftrightarrow$Turkish and mid-resource English$\leftrightarrow$Icelandic. Our findings show that generating back translation using nucleus sampling results in higher final model performance, and that this method of generation has high levels of both lexical and syntactic diversity. We also find evidence that lexical diversity is more important than syntactic for back translation performance.
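To make the generation method named in this abstract concrete, here is a minimal sketch of nucleus (top-p) sampling for a single decoding step. The function names, the toy next-token distribution, and the threshold of 0.7 are illustrative assumptions; the abstract does not say which settings or toolkit the authors used.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalise so the kept mass sums to 1."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Draw one token from the renormalised nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution (not taken from any real model).
next_token_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "that": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.7)
print(sorted(nucleus))  # only "a" and "the" survive the cut
print(nucleus_sample(next_token_probs, top_p=0.7, rng=random.Random(0)))
```

Because the low-probability tail is cut off before sampling, the output varies from run to run without drifting into very unlikely tokens, which is one plausible reading of why sampled back translations are more diverse than beam-search output.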
Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning
Naeem, Sameea, Rahman, Arif ur, Haider, Syed Mujtaba, Mughal, Abdul Basit
Finding similarities between two inter-language news articles is a challenging problem in Natural Language Processing (NLP). Because it is difficult for users to find similar news articles in a language other than their native one, there is a need for an automatic Machine Learning based system that finds the similarity between two inter-language news articles. In this article, we propose a Machine Learning model combined with English-Urdu word transliteration that shows whether an English news article is similar to an Urdu news article. Existing approaches to finding similarities have a major drawback when the archives contain articles in low-resourced languages like Urdu along with English news articles. We used a lexicon to link Urdu and English news articles. Because Urdu language processing applications such as machine translation and text-to-speech are unable to handle English text at the same time, this research proposes a technique to find similarities between English and Urdu news articles based on transliteration.
Grammar Accuracy Evaluation (GAE): Quantifiable Quantitative Evaluation of Machine Translation Models
Park, Dojun, Jang, Youngjin, Kim, Harksoo
Natural Language Generation (NLG) refers to the operation of expressing the calculation results of a system in human language. Since the quality of sentences generated by an NLG model cannot be fully represented using only quantitative evaluation, they are also evaluated qualitatively by humans, who score the meaning or grammar of a sentence according to subjective criteria. However, existing evaluation methods have a problem: scores deviate widely depending on the evaluators' criteria. In this paper, we propose Grammar Accuracy Evaluation (GAE), which provides specific evaluation criteria. Analyzing the quality of machine translation with BLEU and GAE confirmed that the BLEU score does not represent the absolute performance of machine translation models, and that GAE compensates for the shortcomings of BLEU with flexible evaluation of alternative synonyms and changes in sentence structure.
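The point about synonyms can be illustrated with the clipped n-gram precision that BLEU is built on. The sketch below is a simplified stand-in, not full BLEU (no brevity penalty, no averaging over n-gram orders) and not the GAE procedure, which the abstract does not specify. The sentences and function names are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Fraction of hypothesis n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "the child is happy"
exact = "the child is happy"
synonym = "the kid is glad"  # same meaning, different surface words

print(clipped_precision(exact, reference, n=1))    # 1.0
print(clipped_precision(synonym, reference, n=1))  # 0.5
print(clipped_precision(synonym, reference, n=2))  # 0.0
```

A translation that a human would accept loses half of its unigram matches and all of its bigram matches purely through word choice, which is the kind of shortcoming the abstract attributes to BLEU.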
Google Translate Provides Assist for Extra Indian Languages - Channel969
Google Translate has added support for more Indian languages. While Hindi has been supported by Google Translate for a long time, several new regional languages have now been added to the platform by Google. Languages including Assamese, a prominent one in Northeast India; Bhojpuri; Dhivehi (used in the Maldives); Dogri (Northern India); Konkani (central India); Maithili (spoken by about 34 million people in Northern India); Meiteilon or Manipuri, used by about two million people in Northeast India; Mizo; and Sanskrit have been added to the platform. Along with these languages, Google Translate has also added support for several international languages. Google Translate now supports over 133 languages spoken around the world, covering major Indian languages as well.
Semantics-aware Attention Improves Neural Machine Translation
Slobodkin, Aviv, Choshen, Leshem, Abend, Omri
The integration of syntactic structures into Transformer machine translation has shown positive results, but to our knowledge, no work has attempted to do so with semantic structures. In this work we propose two novel parameter-free methods for injecting semantic information into Transformers, both of which rely on semantics-aware masking of (some of) the attention heads. One method operates on the encoder, through a Scene-Aware Self-Attention (SASA) head; the other operates on the decoder, through a Scene-Aware Cross-Attention (SACrA) head. We show a consistent improvement over the vanilla Transformer and syntax-aware models for four language pairs. We further show an additional gain when using both semantic and syntactic structures in some language pairs.
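As a rough illustration of what masking an attention head means, the sketch below applies a binary mask to a matrix of attention scores before the softmax. The mask here is hand-written under the assumption that tokens 0 and 1 belong to one group and token 2 to another; how the authors derive their masks from semantic structures is not described in the abstract, and the function name is hypothetical.

```python
import math


def masked_attention_weights(scores, mask):
    """Row-wise softmax over attention scores, where positions with a
    mask value of 0 are excluded and receive exactly zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        if not allowed:
            raise ValueError("each query must be allowed at least one position")
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy scores for a 3-token sentence (rows are queries, columns are keys).
scores = [[1.0, 1.0, 9.0],
          [0.0, 0.0, 0.0],
          [2.0, 3.0, 4.0]]
# Assumed grouping: tokens 0 and 1 may attend to each other, token 2 only to itself.
mask = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 1]]

for row in masked_attention_weights(scores, mask):
    print(row)
```

The mask adds no learned weights, which is consistent with the abstract's description of the methods as parameter-free: only the pattern of which positions may attend to which is changed.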
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation
Khurana, Sameer, Laurent, Antoine, Glass, James
We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10s) such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder, SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We highlight these applications by performing several cross-lingual text and speech translation retrieval tasks across several datasets.
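Translation retrieval in a shared embedding space reduces to a nearest-neighbour lookup. The sketch below assumes cosine similarity as the ranking measure, which the abstract does not state, and uses made-up three-dimensional vectors in place of real SAMU-XLSR or LaBSE outputs. The function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, candidates):
    """Return the id of the candidate whose embedding is most similar
    to the query embedding."""
    return max(candidates, key=lambda cid: cosine(query_embedding, candidates[cid]))


# Toy utterance-level embedding of a spoken sentence (not real model output).
speech_embedding = [0.9, 0.1, 0.0]
# Toy sentence embeddings of candidate translations in another language.
text_embeddings = {
    "sentence_a": [1.0, 0.0, 0.0],
    "sentence_b": [0.0, 1.0, 0.0],
    "sentence_c": [0.0, 0.0, 1.0],
}

print(retrieve(speech_embedding, text_embeddings))  # sentence_a
```

If the speech and text encoders place semantically equivalent utterances close together regardless of language, the same lookup serves both the speech-to-text and the speech-to-speech retrieval settings mentioned in the abstract.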
Google Translate adds 24 new languages
"For many supported languages, even the largest languages in Africa that we have supported - say like Yoruba, Igbo, the translation is not great. It will definitely get the idea across but often it will lose much of the subtlety of the language," Google Translate research scientist Isaac Caswell told the BBC.