"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).
Otter.ai to Bring AI-Powered Meeting Note Collaboration Service to Japan in partnership with NTT DOCOMO Partnership includes Investment and Customer Trials of Otter's Real-Time Transcription Los Altos, CA, January 23, 2020 –Otter.ai DOCOMO made a strategic investment in Otter through its wholly-owned subsidiary NTT DOCOMO Ventures, Inc. and announced plans for its AI-based translation service subsidiary to integrate Otter's meeting note collaboration into its offering to provide highly accurate English transcripts translated into Japanese. As a part of Otter's customer engagement with DOCOMO the Otter Voice Meeting Notes application is being used on a trial basis in Berlitz Corporation's English language classes in Japan. Students use Otter to transcribe and review the content of lessons, click on sections of text, and initiate voice playback. DOCOMO, Otter.ai and Berlitz are expanding their collaboration in language education to verify Otter's effectiveness in the study of English DOCOMO is featuring Otter during demonstrations at the DOCOMO Open House 2020, taking place in the Tokyo Big Sight exhibition complex January 23 and 24, 2020.
Deep neural machine translation (NMT) can learn representations containing linguistic information. And despite the differences between various models, they all tend to learn similar properties. This phenomena got researchers wondering whether the learned information is fully distributed and embedded to individual neurons. Recent research results confirmed that hypothesis, revealing that simple properties such as coordinating conjunctions and determiners can be attributed to individual neurons, while more complex linguistic properties such as syntax and semantics are distributed across multiple neurons. Following on this, researchers from The Chinese University of Hong Kong, Tencent AI Lab and University of Macau have proposed a new neuron interaction based representation composition for NMT.
Any high school student would guess there is a cosine involved when they see an integral of a sine. Regardless of whether the person understands the thought process behind these functions, it does the job for them. This intuition behind calculus is rarely explored. Though Newton and Leibnitz developed advanced mathematics to solve real-world problems, today most of the schools teach differential equations through semantics. The linguistic appeal of mathematics might get grades in high school, but in the world of research, this is hysterical.
Does anyone know of scientific literature that shows that, even in cases in which we have enough parallel data (English-French), use of monolingual data can be beneficial? To me it seems reasonable that if we, for instance, added monolingual data to the decoder, it would be better at scoring candidate predictions in terms of fluency. That being said, I cannot find peer-reviewed articles that show this.
We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views.
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost.
Neural Machine Translation (NMT) has achieved remarkable progress with the quick evolvement of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level. Specifically, we design a layer-wise attention and mixed attention mechanism, and further share the parameters of each layer between the encoder and decoder to regularize and coordinate the learning. Experiments show that combined with the state-of-the-art Transformer model, layer-wise coordination achieves improvements on three IWSLT and two WMT translation tasks. More specifically, our method achieves 34.43 and 29.01 BLEU score on WMT16 English-Romanian and WMT14 English-German tasks, outperforming the Transformer baseline.
Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet. Natural language processing (NLP) typically sees initialization of only the lowest layer of deep models with pretrained word vectors. In this paper, we use a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors. We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.
Teaching is critical to human society: it is with teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) considering different situations of a student. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as perfect counterparts of the learning objective set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework, the loss function of a machine learning model (we call it student) is defined by another machine learning model (we call it teacher).
Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks. However, NLMs are very computationally demanding largely due to the computational cost of the decoding process, which consists of a softmax layer over a large vocabulary.We observe that in the decoding of many NLP tasks, only the probabilities of the top-K hypotheses need to be calculated preciously and K is often much smaller than the vocabulary size. This paper proposes a novel softmax layer approximation algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a given context, a set of K words that are most likely to occur according to a NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude while attaining close to the full softmax baseline accuracy on neural machine translation and language modeling tasks. We also prove the theoretical guarantee on the softmax approximation quality.