He, Di
Non-Autoregressive Machine Translation with Auxiliary Regularization
Wang, Yiren, Tian, Fei, He, Di, Qin, Tao, Zhai, ChengXiang, Liu, Tie-Yan
As a new neural machine translation approach, Non-Autoregressive machine Translation (NAT) has attracted attention recently due to its high efficiency in inference. However, this high efficiency comes at the cost of not capturing the sequential dependency on the target side of translation, which causes NAT to suffer from two kinds of translation errors: 1) repeated translations (due to indistinguishable adjacent decoder hidden states), and 2) incomplete translations (due to incomplete transfer of source-side information via the decoder hidden states). In this paper, we propose to address these two problems by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model. First, to make the hidden states more distinguishable, we regularize the similarity between consecutive hidden states based on the corresponding target tokens. Second, to force the hidden states to contain all the information in the source sentence, we leverage the dual nature of translation tasks (e.g., English to German and German to English) and minimize a backward reconstruction error to ensure that the hidden states of the NAT decoder are able to recover the source-side sentence. Extensive experiments conducted on several benchmark datasets show that both regularization strategies are effective and can alleviate the issues of repeated translations and incomplete translations in NAT models. The accuracy of NAT models is therefore improved significantly over the state of the art, with even better inference efficiency.
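The two regularizers lend themselves to a short sketch. Below is a minimal PyTorch sketch of the idea behind them, assuming per-sentence decoder hidden states of shape (T, d) and a small hypothetical `backward_decoder` module; the exact loss forms used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def similarity_regularizer(hidden, targets):
    """Penalize consecutive decoder states that remain similar although their
    reference tokens differ (a proxy for the repeated-translation issue).

    hidden:  (T, d) decoder hidden states for one sentence
    targets: (T,)   reference target token ids
    """
    cos = F.cosine_similarity(hidden[:-1], hidden[1:], dim=-1)   # (T-1,)
    differ = (targets[:-1] != targets[1:]).float()               # 1 where adjacent tokens differ
    # push similarity down only where the adjacent reference tokens are different
    return (differ * cos.clamp(min=0.0)).mean()

def reconstruction_regularizer(hidden, source, backward_decoder):
    """Force decoder states to retain source information: a small backward
    (target-to-source) decoder must recover the source sentence from them.

    backward_decoder(hidden) -> (S, V_src) logits over the source vocabulary
    (hypothetical module, assumed for illustration).
    """
    logits = backward_decoder(hidden)
    return F.cross_entropy(logits, source)

# total_loss = nat_loss + a * similarity_regularizer(H, y) \
#            + b * reconstruction_regularizer(H, x, backward_decoder)
```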
Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation
He, Tianyu, Tan, Xu, Xia, Yingce, He, Di, Qin, Tao, Chen, Zhibo, Liu, Tie-Yan
Neural Machine Translation (NMT) has achieved remarkable progress with the rapid evolution of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together, layer by layer, gradually from low level to high level. Specifically, we design a layer-wise attention and mixed attention mechanism, and further share the parameters of each layer between the encoder and decoder to regularize and coordinate the learning. Experiments show that, combined with the state-of-the-art Transformer model, layer-wise coordination achieves improvements on three IWSLT and two WMT translation tasks. More specifically, our method achieves BLEU scores of 34.43 and 29.01 on the WMT16 English-Romanian and WMT14 English-German tasks, outperforming the Transformer baseline.
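A minimal sketch of how mixed attention and per-layer parameter sharing could look, written as a PyTorch Transformer-style layer. This is an illustrative reading of the abstract rather than the authors' implementation; layer norm, masks, and other details are omitted.

```python
import torch
import torch.nn as nn

class CoordinatedLayer(nn.Module):
    """One layer reused at the same depth of both encoder and decoder."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, memory=None):
        # Encoder call: memory is None -> plain self-attention.
        # Decoder call: memory holds the *same layer's* encoder output, and the
        # keys/values mix encoder states with decoder states ("mixed attention").
        # Causal and padding masks are omitted for brevity.
        kv = x if memory is None else torch.cat([memory, x], dim=1)
        h, _ = self.attn(x, kv, kv)
        return x + self.ffn(h)

# Using the same CoordinatedLayer instance at depth i of the encoder and the
# decoder is one way to realize the per-layer parameter sharing in the abstract.
```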
FRAGE: Frequency-Agnostic Word Representation
Gong, Chengyue, He, Di, Tan, Xu, Qin, Tao, Wang, Liwei, Liu, Tie-Yan
Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. To mitigate this issue, we develop a neat, simple yet effective way to learn \emph{FRequency-AGnostic word Embedding} (FRAGE) using adversarial training, which blurs the boundary between the embeddings of high-frequency and low-frequency words. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that with FRAGE, we achieve higher performance than the baselines in all tasks.
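As a rough illustration of the adversarial idea, the sketch below (PyTorch) trains a small discriminator to separate high-frequency from low-frequency embeddings while the embeddings are trained to fool it. Module names, sizes, and the loss weighting are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Embedding(50_000, 300)                       # task model's word embeddings
disc = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 1))

def adversarial_losses(word_ids, is_frequent, lam=0.1):
    """word_ids: (B,) token ids; is_frequent: (B,) floats, 1.0 for high-frequency words."""
    e = embed(word_ids)

    # Discriminator step: learn to tell high- from low-frequency embeddings.
    d_loss = F.binary_cross_entropy_with_logits(disc(e.detach()).squeeze(-1), is_frequent)

    # Embedding step: fool the discriminator so frequency is no longer separable.
    g_loss = lam * F.binary_cross_entropy_with_logits(disc(e).squeeze(-1), 1.0 - is_frequent)
    return d_loss, g_loss

# In training, d_loss updates only `disc`, while g_loss is added to the task
# loss (e.g., language modeling) and back-propagates into `embed`.
```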
When CTC Training Meets Acoustic Landmarks
He, Di, Yang, Xuesong, Lim, Boon Pang, Liang, Yi, Hasegawa-Johnson, Mark, Chen, Deming
The connectionist temporal classification (CTC) training criterion provides an alternative acoustic model (AM) training strategy for automatic speech recognition in an end-to-end fashion. Although the CTC criterion benefits acoustic modeling without requiring time-aligned phonetic transcriptions, it still needs careful tuning to reach convergence, especially in resource-constrained scenarios. In this paper, we propose to improve CTC training by incorporating acoustic landmarks. We tailor a new set of acoustic landmarks to help CTC training converge more quickly while also reducing recognition error rates, and we use new target label sequences that mix phone labels with manner changes to guide CTC training. Experiments on TIMIT demonstrate that CTC-based acoustic models converge significantly faster and more smoothly when augmented with acoustic landmarks. The models pretrained with mixed target labels can be further fine-tuned, reducing the phone error rate by 8.72% on TIMIT. Consistent performance gains are also observed on reduced TIMIT and WSJ, where, to our knowledge, we are the first to test the effectiveness of acoustic landmark theory on mid-sized ASR tasks.
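To make the "mixed target label" idea concrete, here is a small illustrative sketch that inserts a landmark token wherever the manner class of the phone sequence changes. The manner classes and the landmark symbol used here are assumptions, not the paper's inventory.

```python
# Illustrative manner classes for a few TIMIT-style phones (assumed mapping).
MANNER = {"s": "fric", "sh": "fric", "iy": "vowel", "ih": "vowel",
          "t": "stop", "d": "stop", "m": "nasal", "n": "nasal"}

def mixed_targets(phones, landmark="<MC>"):
    """Build a CTC target sequence that interleaves phone labels with a
    manner-change landmark token."""
    out = []
    prev = None
    for p in phones:
        manner = MANNER.get(p, "other")
        if prev is not None and manner != prev:
            out.append(landmark)          # manner-change landmark
        out.append(p)
        prev = manner
    return out

print(mixed_targets(["s", "iy", "t"]))    # ['s', '<MC>', 'iy', '<MC>', 't']
```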
Towards Binary-Valued Gates for Robust LSTM Training
Li, Zhuohan, He, Di, Tian, Fei, Chen, Wei, Qin, Tao, Wang, Liwei, Liu, Tie-Yan
Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It aims to use gates to control information flow (e.g., whether to skip some information or not) in the recurrent computations, although its practical implementation based on soft gates only partially achieves this goal. In this paper, we propose a new way to train LSTMs that pushes the output values of the gates towards 0 or 1. By doing so, we can better control the information flow: the gates are mostly open or closed, instead of in a middle state, which makes the results more interpretable. Empirical studies show that (1) although it seems we restrict the model capacity, there is no performance drop: we achieve better or comparable performance thanks to better generalization; and (2) the outputs of the gates are not sensitive to their inputs, so we can easily compress the LSTM unit in multiple ways, e.g., low-rank approximation and low-precision approximation. The compressed models are even better than the baseline models without compression.
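One way to push gate activations towards 0 or 1, in the spirit of the abstract, is to add logistic noise before a low-temperature sigmoid, as in the PyTorch sketch below; the estimator actually used in the paper may differ.

```python
import torch

def sharp_gate(logits, temperature=0.2, training=True):
    """A gate pushed towards binary values: logistic noise plus a
    low-temperature sigmoid concentrates the output near 0 or 1 while
    remaining differentiable."""
    if training:
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        logits = logits + torch.log(u) - torch.log1p(-u)   # logistic noise
    return torch.sigmoid(logits / temperature)

# Drop-in idea: inside an LSTM cell, replace sigmoid(W x + U h + b) for the
# input/forget/output gates with sharp_gate(W x + U h + b).
```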
Decoding with Value Networks for Neural Machine Translation
He, Di, Lu, Hanqing, Xia, Yingce, Qin, Tao, Wang, Liwei, Liu, Tie-Yan
Neural Machine Translation (NMT) has become a popular technology in recent years, and beam search is its de facto decoding method thanks to its reduced search space and computational complexity. However, since beam search only looks one step ahead and searches for local optima at each time step, it usually cannot output the best target sentence. Inspired by the success and methodology of AlphaGo, in this paper we propose using a prediction network to improve beam search: it takes the source sentence $x$, the currently available decoding output $y_1,\cdots, y_{t-1}$ and a candidate word $w$ at step $t$ as inputs and predicts the long-term value (e.g., BLEU score) of the partial target sentence if it is completed by the NMT model. Following the practice in reinforcement learning, we call this prediction network a \emph{value network}. Specifically, we propose a recurrent structure for the value network and train its parameters from bilingual data. At test time, when choosing a word $w$ for decoding, we consider both its conditional probability given by the NMT model and its long-term value predicted by the value network. Experiments show that this approach significantly improves translation accuracy on several translation tasks.
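A minimal sketch of how such a value network could be folded into beam-search scoring, assuming a hypothetical `value_net(src, prefix, word)` callable; the interpolation between log-probability and predicted value is an assumption for illustration, not the paper's exact scoring rule.

```python
import math

def rescore(candidates, value_net, src, alpha=0.85):
    """Rank beam candidates by a mix of NMT log-probability and predicted value.

    candidates: list of (prefix_tokens, next_word, log_prob) tuples
    value_net:  hypothetical callable value_net(src, prefix, word) -> value in (0, 1]
    """
    scored = []
    for prefix, word, logp in candidates:
        v = value_net(src, prefix, word)                     # predicted long-term value, e.g. BLEU
        score = alpha * logp + (1.0 - alpha) * math.log(max(v, 1e-9))
        scored.append((score, prefix + [word]))
    return sorted(scored, key=lambda t: t[0], reverse=True)
```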
Dual Learning for Machine Translation
He, Di, Xia, Yingce, Qin, Tao, Wang, Liwei, Yu, Nenghai, Liu, Tie-Yan, Ma, Wei-Ying
While neural machine translation (NMT) has made good progress over the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training-data bottleneck, we develop a dual-learning mechanism, which enables an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and another agent to represent the model for the dual task, and then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using policy gradient methods). We call the corresponding approach to neural machine translation \emph{dual-NMT}. Experiments show that dual-NMT works very well on English$\leftrightarrow$French translation; in particular, by learning from monolingual data (with 10\% bilingual data for warm start), it achieves accuracy comparable to NMT trained on the full bilingual data for the French-to-English translation task.
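The closed loop described above can be sketched as one training round. All model interfaces here (`primal`, `dual`, `lm_tgt`) are assumed purely for illustration; the actual dual-NMT training procedure is defined in the paper.

```python
def dual_learning_step(x, primal, dual, lm_tgt, alpha=0.5):
    """One round of the dual-learning game for a monolingual sentence x
    (pseudo-interfaces assumed for illustration)."""
    y_mid, logp = primal.sample(x)                 # e.g. English -> French translation
    r_lm = lm_tgt.log_prob(y_mid)                  # fluency of the intermediate output
    r_rec = dual.log_prob(x, given=y_mid)          # how well can x be reconstructed?
    reward = alpha * r_lm + (1.0 - alpha) * r_rec

    primal.policy_gradient_update(logp, reward)    # REINFORCE-style update of the primal agent
    dual.maximize_log_prob(x, given=y_mid)         # improve the reconstruction (dual) agent
    return reward
```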