Machine Translation
Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation
Liu, Junhao, Shou, Linjun, Pei, Jian, Gong, Ming, Yang, Min, Jiang, Daxin
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale annotated datasets in low-resource languages, such as Arabic, Hindi, and Vietnamese. Many previous approaches use translation data, translating from a rich-resource language such as English into low-resource languages, as auxiliary supervision. However, how to leverage translation data effectively while reducing the impact of translation noise remains difficult. In this paper, we tackle this challenge and enhance cross-lingual transfer performance with a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC). A language branch is a group of passages in one single language paired with questions in all target languages. Based on LBMRC, we train multiple machine reading comprehension (MRC) models, each proficient in an individual language. Then, we devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages. Combining LBMRC with multilingual distillation makes the model more robust to data noise and therefore improves its cross-lingual ability. Meanwhile, the resulting single multilingual model is applicable to all target languages, which saves the cost of training, inference, and maintenance for multiple models. Extensive experiments on two CLMRC benchmarks clearly show the effectiveness of our proposed method.
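The distillation step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the language branch teachers are combined, so averaging their temperature-softened distributions is an assumption made here for illustration, as are the function names and the temperature value. The logits stand in for scores over candidate answer positions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def average_teacher_targets(teacher_logits, temperature=2.0):
    """Assumed aggregation: mean of the teachers' softened distributions."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    count = len(probs)
    return [sum(p[i] for p in probs) / count for i in range(len(probs[0]))]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the averaged teacher targets to the student."""
    target = average_teacher_targets(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)

# Two hypothetical language branch teachers scoring three answer positions.
teachers = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]]
print(distillation_loss([0.0, 0.0, 0.0], teachers))
```

The loss is zero when the student reproduces the averaged teacher distribution and grows as the student drifts away from it.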
15 Top AI/ML/AR/VR Based App Ideas for Startups and SMEs in 2020–21
Planning to invest in a mobile app? Here are the top 15 AI/ML/VR/AR app development ideas that ensure your success in 2020–21! With around 5 million apps available in the app stores, the trend of developing ordinary mobile apps is fading away. The increasing usage of mobile applications with each passing year also pushes the demand for innovative technologies that can meet future users' needs. Artificial Intelligence and Machine Learning (AI & ML) have become the most influential technologies in mobile app development and are creating a plethora of opportunities for startups in 2021.
A Survey of Embedding Space Alignment Methods for Language and Knowledge Graphs
Kalinowski, Alexander, An, Yuan
The purpose of this survey is to explore the core techniques and categorizations of methods for aligning low-dimensional embedding spaces. Projecting sparse, high-dimensional data sets into compact, lower-dimensional spaces allows not only for a significant reduction in storage space, but also builds dense representations with many applications. These embedding spaces have become a staple in representation learning ever since their heralded application to natural language in a technique called word2vec, and have replaced traditional machine learning features as easy-to-build, high-quality representations of the source objects. There has been a wealth of study around techniques for embedding objects, such as images, natural language and knowledge graphs, and many research agendas focused on mapping one embedding space to another, either for the purpose of aligning and unifying to a common space, applications to joint downstream tasks or ease of transfer learning. In order to fully leverage these dense representations and translate them across domains and problem spaces, techniques for establishing alignments between them must be developed and understood.
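As a small worked example of what aligning two embedding spaces can mean, the sketch below finds the rotation that best maps one set of 2D points onto another in the least-squares sense. The abstract does not name a specific method, so this rotation-only, two-dimensional setting is an assumption chosen to keep the example self-contained. The function names are hypothetical.

```python
import math

def best_rotation_angle(source, target):
    """Angle that minimizes the summed squared distance between
    rotated source points and their paired target points."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)

def rotate(points, angle):
    """Rotate 2D points counter-clockwise by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Toy "source space" and a copy of it rotated by 30 degrees.
source = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
target = rotate(source, math.pi / 6)

angle = best_rotation_angle(source, target)
aligned = rotate(source, angle)
print(math.degrees(angle))
```

Real embedding spaces have hundreds of dimensions and noisy correspondences, so practical methods solve the analogous problem with matrix decompositions or learned mappings.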
A Benchmark Corpus and Neural Approach for Sanskrit Derivative Nouns Analysis
Singh, Arun Kumar, Dave, Sushant, P., Prathosh A., Lall, Brejesh, Mehta, Shresth
This paper presents the first benchmark corpus of Sanskrit Pratyaya (suffixes) and the inflectional words (padas) formed from them, along with neural network based approaches to process the formation and splitting of inflectional words. In the scope of the current work, inflectional words span the primary and secondary derivative nouns. Pratyayas are an important dimension of morphological analysis of Sanskrit texts. Sanskrit Computational Linguistics tools exist for processing and analyzing Sanskrit texts. Unfortunately, there has not been any work to standardize and validate these tools specifically for derivative noun analysis. In this work, we prepared a Sanskrit suffix benchmark corpus called Pratyaya-Kosh to evaluate the performance of such tools. We also present our own neural approach for derivative noun analysis and evaluate the most prominent Sanskrit morphological analysis tools on the same benchmark. This benchmark will be freely available to researchers worldwide, and we hope it will motivate all to improve morphological analysis of the Sanskrit language.
Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models
Naskar, Subhajit, Rooshenas, Amirmohammad, Sun, Simeng, Iyyer, Mohit, McCallum, Andrew
The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and has resulted in alternative training algorithms (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and task measure, we notice that the samples drawn from an MLE-trained NMT model support the desired distribution -- there are samples with a much higher BLEU score compared to the beam decoding output. To benefit from this observation, we train an energy-based model to mimic the behavior of the task measure (i.e., the energy-based model assigns lower energy to samples with higher BLEU score), which results in a re-ranking algorithm based on the samples drawn from NMT: energy-based re-ranking (EBR). Our EBR consistently improves the performance of the Transformer-based NMT: +3 BLEU points on Sinhala-English, +2.0 BLEU points on IWSLT'17 French-English, and +1.7 BLEU points on WMT'19 German-English tasks.
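The re-ranking idea in this abstract can be sketched in a few lines: draw several candidate translations, score each with an energy function, and prefer the lowest energy. The energy values below are made up for illustration, and `rerank` is a hypothetical helper. In the paper's setting the energy would come from a trained model.

```python
def rerank(samples, energy_fn):
    """Return the samples ordered from lowest to highest energy."""
    return sorted(samples, key=energy_fn)

# Hypothetical candidates sampled from an NMT model, with made-up energies.
# Lower energy is meant to correspond to a higher expected BLEU score.
toy_energies = {
    "the cat sat on the mat": -1.2,
    "cat the sat mat on": 0.9,
    "the cat is sitting on the mat": 0.3,
}

ranked = rerank(list(toy_energies), toy_energies.get)
best = ranked[0]
print(best)
```

Because re-ranking only reorders samples the translation model already produced, it can be applied on top of a trained NMT system without retraining it.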
Lost in Translation: How Artificial Intelligence is Breaking the Language Barrier - DefinedCrowd
Human interaction with machines has experienced a great leap forward in recent years, largely driven by artificial intelligence (AI). From smart homes to self-driving cars, AI has become a seamless part of our daily lives. Voice interactions play a key role in many of these technological advances, most notably in language translation. Here, AI enables instant translation across a number of mediums: text, voice, images and even street signs. The technology works by recognizing individual words, then leveraging similarities in how various languages express the relationships between those words.
Machine Learning case study: GOOGLE
Machine learning is a sub-field of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning algorithms are usually categorized as supervised or unsupervised. Artificial Intelligence is a branch of computer science that endeavors to replicate or simulate human intelligence in a machine, so machines can perform tasks that typically require human intelligence. Some programmable functions of AI systems include planning, learning, reasoning, problem-solving, and decision making. My social, promotional, and primary mails might be different than what you have in your mailbox.
UniCase -- Rethinking Casing in Language Models
Powalski, Rafal, Stanislawek, Tomasz
In this paper, we introduce a new approach to dealing with the problem of case sensitivity in Language Modelling (LM). We propose a simple architecture modification to the RoBERTa language model, accompanied by a new tokenization strategy, which we named Unified Case LM (UniCase). We tested our solution on the GLUE benchmark, where it increased performance by 0.42 points. Moreover, we show that the UniCase model works much better on text data in which all tokens are uppercased (+5.88 points).
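One way to picture a case-unifying tokenization is to lowercase every token and carry its casing as a separate feature. The abstract does not describe the actual UniCase tokenizer, so the sketch below is an assumption about the general idea rather than the authors' method. The flag names and whitespace splitting are hypothetical simplifications.

```python
def case_flag(token):
    """Classify the casing pattern of a single token."""
    if not any(ch.isalpha() for ch in token):
        return "none"
    if token.islower():
        return "lower"
    if token.isupper():
        return "upper"
    if token[0].isupper() and token[1:].islower():
        return "title"
    return "mixed"

def split_case(text):
    """Return (lowercased token, case flag) pairs for whitespace tokens."""
    return [(token.lower(), case_flag(token)) for token in text.split()]

print(split_case("NASA launched iPhone Apps"))
```

With this representation, fully uppercased text maps to the same lowercased tokens as its ordinary form and differs only in the flags, which matches the all-uppercase scenario the abstract highlights.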
A Technical Report: BUT Speech Translation Systems
Vydana, Hari Krishna, Burget, Lukas, Cernocky, Jan
The paper describes BUT's speech translation systems. The systems are English-to-German offline speech translation systems, based on our previous work on jointly trained transformers. Though End-to-End and cascade (ASR-MT) spoken language translation (SLT) systems are reaching comparable performance, a large degradation is observed when translating ASR hypotheses compared to the oracle input text. To reduce this performance degradation, we have jointly trained the ASR and MT modules with the ASR objective as an auxiliary loss. Both networks are connected through the neural hidden representations. This model has an End-to-End differentiable path with respect to the final objective function and also utilizes the ASR objective for better optimization. During inference, both modules (i.e., ASR and MT) are connected through the hidden representations corresponding to the n-best hypotheses. Ensembling with independently trained ASR and MT models has further improved the performance of the system.
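Training with an auxiliary loss, as this abstract describes, usually means optimizing a weighted sum of the main and auxiliary objectives. The sketch below shows that combination. The weighting scheme and the `asr_weight` value are assumptions, since the abstract does not state how the two losses are balanced.

```python
def joint_loss(mt_loss, asr_loss, asr_weight=0.3):
    """Main translation loss plus a weighted auxiliary ASR loss.

    asr_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return mt_loss + asr_weight * asr_loss

# Made-up per-batch loss values for illustration.
print(joint_loss(mt_loss=2.0, asr_loss=1.0, asr_weight=0.5))
```

Setting the weight to zero recovers training on the translation objective alone.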
Translating lost languages using machine learning
Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead languages are also considered to be lost, or "undeciphered" -- that is, we don't know enough about their grammar, vocabulary, or syntax to be able to actually understand their texts. Lost languages are more than a mere academic curiosity; without them, we miss an entire body of knowledge about the people who spoke them. Unfortunately, most of them have such minimal records that scientists can't decipher them by using machine-translation algorithms like Google Translate. Some don't have a well-researched "relative" language to be compared to, and often lack traditional dividers like white space and punctuation.