Deep Learning
On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
Cho, Kyunghyun, van Merrienboer, Bart, Bahdanau, Dzmitry, Bengio, Yoshua
Kyunghyun Cho, Bart van Merriënboer (Université de Montréal); Dzmitry Bahdanau (Jacobs University, Germany); Yoshua Bengio (Université de Montréal, CIFAR Senior Fellow). Abstract: Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks. Neural machine translation models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation. In this paper, we focus on analyzing the properties of neural machine translation using two models: the RNN Encoder-Decoder and a newly proposed gated recursive convolutional neural network. We show that neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically. 1 Introduction: A new approach for statistical machine translation based purely on neural networks has recently been proposed (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014). This new approach, which we refer to as neural machine translation, is inspired by the recent trend of deep representational learning. All the neural network models used in (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) consist of an encoder and a decoder.
Learning Topology and Dynamics of Large Recurrent Neural Networks
She, Yiyuan, He, Yuejia, Wu, Dapeng
Large-scale recurrent networks have drawn increasing attention recently because of their capabilities in modeling a large variety of real-world phenomena and physical mechanisms. This paper studies how to identify all authentic connections and estimate system parameters of a recurrent network, given a sequence of node observations. This task becomes extremely challenging in modern network applications, because the available observations are usually very noisy and limited, and the associated dynamical system is strongly nonlinear. By formulating the problem as multivariate sparse sigmoidal regression, we develop simple-to-implement network learning algorithms, with rigorous theoretical convergence guarantees, for a variety of sparsity-promoting penalty forms. A quantile variant of progressive recurrent network screening is proposed for efficient computation; it allows for direct cardinality control of network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov's sense, and integrate such stability constraints into sparse network learning. Experiments show excellent performance of the proposed algorithms in network topology identification and forecasting.
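To make "sparse sigmoidal regression" concrete, the following is a minimal sketch for a single node, not the authors' algorithm. It assumes a squared-error loss, an L1 penalty (one possible sparsity-promoting penalty), and plain proximal gradient descent with soft thresholding. The function names, step size, and toy data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_sigmoid(inputs, targets, penalty, step=0.5, iterations=500):
    """Fit y ~ sigmoid(w . x) with an L1 penalty on w (assumed setup)."""
    n_features = len(inputs[0])
    weights = [0.0] * n_features
    for _ in range(iterations):
        grad = [0.0] * n_features
        for x, y in zip(inputs, targets):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            # Gradient of 0.5 * (pred - y)^2 through the sigmoid.
            scale = (pred - y) * pred * (1.0 - pred)
            for j, xi in enumerate(x):
                grad[j] += scale * xi / len(inputs)
        # Gradient step followed by soft thresholding (proximal step).
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, grad)
        ]
    return weights


# Toy data: the target depends only on the first input (illustrative).
inputs = [[1.0, 0.0, 1.0], [-1.0, 1.0, 0.0], [1.0, 1.0, -1.0], [-1.0, 0.0, -1.0]]
targets = [sigmoid(2.0 * x[0]) for x in inputs]
weights = fit_sparse_sigmoid(inputs, targets, penalty=0.01)
```

A weight that the thresholding step leaves at exactly zero corresponds to a connection pruned from the estimated topology. Raising `penalty` prunes more connections.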
Deep Directed Generative Autoencoders
Ozair, Sherjil, Bengio, Yoshua
For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
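The factorization $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ can be checked numerically on a toy example. The sketch below is an illustration only: the four-symbol distribution and the encoder `f` are made up, and the decoder is the ideal one that puts no mass on any $x'$ with $f(x') \neq h$.

```python
from collections import defaultdict

# Toy discrete distribution over four symbols (illustrative values).
p_x = {"a": 0.1, "b": 0.3, "c": 0.2, "d": 0.4}


def f(x):
    """Hypothetical deterministic encoder: maps symbols to two codes."""
    return 0 if x in ("a", "b") else 1


# P(H = h) is the total mass of all x that encode to h.
p_h = defaultdict(float)
for x, p in p_x.items():
    p_h[f(x)] += p


def p_x_given_h(x, h):
    """Ideal decoder: zero mass on any x whose code differs from h."""
    return p_x[x] / p_h[h] if f(x) == h else 0.0


def factorized(x):
    """P(X = x | H = f(x)) * P(H = f(x))."""
    h = f(x)
    return p_x_given_h(x, h) * p_h[h]


for x in p_x:
    # The product recovers the original probability of x.
    assert abs(factorized(x) - p_x[x]) < 1e-12
```

Here $P(H)$ has only two outcomes while $X$ has four, a very small instance of the encoder mapping $X$ to a code with a simpler distribution.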
AI in MedTech: Risks and Opportunities of Innovative Technologies in Medical Applications
An increasing number of medical devices incorporate artificial intelligence (AI) capabilities to support therapeutic and diagnostic applications. In spite of the risks connected with this innovative technology, the applicable regulatory framework does not specify any requirements for this class of medical devices. To make matters even more complicated for manufacturers, there are no standards, guidance documents or common specifications for medical devices on how to demonstrate conformity with the essential requirements. The term artificial intelligence (AI) describes the capability of algorithms to take over tasks and decisions by mimicking human intelligence [1]. Many experts believe that machine learning, a subset of artificial intelligence, will play a significant role in the medtech sector [2,3]. "Machine learning" is the term used to describe algorithms capable of learning directly from a large volume of "training data". The algorithm builds a model based on training data and applies the experience it has gained from the training to make predictions and decisions on new, unknown data. Artificial neural networks are a subset of machine learning methods that evolved from the idea of simulating the human brain [22]. Neural networks are information-processing systems used for machine learning and comprise multiple layers of neurons. Between the input layer, which receives information, and the output layer, there are numerous hidden layers of neurons. In simple terms, neural networks comprise neurons (also known as nodes), which receive external information or information from other connected nodes, modify this information, and pass it on, either to the next neuron layer or to the output layer as the final result [5]. Deep learning is a variation of artificial neural networks that uses multiple hidden layers between the input and output layers. The inner layers are designed to extract higher-level features from the raw external data.
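The layer-by-layer flow described above can be sketched in a few lines. This is a minimal illustration, not code from the article. It assumes a weighted sum per neuron and a ReLU activation in the hidden layer, and the weights are arbitrary example values.

```python
def relu(value):
    """Common hidden-layer activation: pass positives, zero out negatives."""
    return max(0.0, value)


def identity(value):
    return value


def layer(inputs, weights, biases, activation):
    """Each neuron sums its weighted inputs, adds a bias, and applies the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(total))
    return outputs


def forward(inputs):
    # Hidden layer: two neurons, each connected to both inputs.
    hidden = layer(inputs, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu)
    # Output layer: one neuron reading both hidden neurons.
    return layer(hidden, [[2.0, 1.0]], [0.5], identity)


result = forward([1.0, 2.0])  # hidden = [0.0, 1.5], output = [2.0]
```

A deep network repeats the `layer` call many times, so later layers operate on features computed by earlier ones.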
Deep Tempering
Desjardins, Guillaume, Luo, Heng, Courville, Aaron, Bengio, Yoshua
Restricted Boltzmann Machines (RBMs) are one of the fundamental building blocks of deep learning. Approximate maximum likelihood training of RBMs typically necessitates sampling from these models. In many training scenarios, computationally efficient Gibbs sampling procedures are crippled by poor mixing. In this work we propose a novel method of sampling from Boltzmann machines that demonstrates a computationally efficient way to promote mixing. Our approach leverages an under-appreciated property of deep generative models such as the Deep Belief Network (DBN), where Gibbs sampling from deeper levels of the latent variable hierarchy results in dramatically increased ergodicity. Our approach is thus to train an auxiliary latent hierarchical model, based on the DBN. When used in conjunction with parallel-tempering, the method is asymptotically guaranteed to simulate samples from the target RBM. Experimental results confirm the effectiveness of this sampling strategy in the context of RBM training.
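For context, the sketch below shows plain block Gibbs sampling in a binary RBM, the baseline procedure whose poor mixing motivates the paper. It is not the proposed deep tempering method. The weights, biases, and function names are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_hidden(visible, weights, hidden_bias, rng):
    """Sample every hidden unit given the visible units."""
    hidden = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * weights[i][j] for i, v in enumerate(visible))
        hidden.append(1 if rng.random() < sigmoid(activation) else 0)
    return hidden


def sample_visible(hidden, weights, visible_bias, rng):
    """Sample every visible unit given the hidden units."""
    visible = []
    for i, bias in enumerate(visible_bias):
        activation = bias + sum(h * weights[i][j] for j, h in enumerate(hidden))
        visible.append(1 if rng.random() < sigmoid(activation) else 0)
    return visible


def gibbs_chain(visible, weights, visible_bias, hidden_bias, steps, rng):
    """Alternate hidden and visible sampling for a fixed number of steps."""
    for _ in range(steps):
        hidden = sample_hidden(visible, weights, hidden_bias, rng)
        visible = sample_visible(hidden, weights, visible_bias, rng)
    return visible


rng = random.Random(0)
# weights[i][j] connects visible unit i to hidden unit j (3 visible, 2 hidden).
weights = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
sample = gibbs_chain([1, 0, 1], weights, [0.0, 0.0, 0.0], [0.0, 0.0], steps=10, rng=rng)
```

When the weights are large, successive states of such a chain tend to stay near one mode. This is the mixing problem the abstract refers to.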
Studying the Effect of Metre Perception on Rhythm and Melody Modelling with LSTMs
Lambert, Andrew John (City University London) | Weyde, Tillman (City University London) | Armstrong, Newton (City University London)
In this paper we take a connectionist machine learning approach to the problem of metre perception and melody learning in musical signals. We present a two-layered network consisting of a nonlinear oscillator network and a recurrent neural network. The oscillator network acts as an entrained resonant filter to the musical signal. It 'perceives' metre by resonating nonlinearly to the inherent periodicities within the signal, creating a hierarchy of strong and weak periods. The neural network learns the long-term temporal structures present in this signal. We show that this network outperforms our previous approach of a single layer recurrent neural network in a melody and rhythm prediction task. We hypothesise that our system is enabled to make use of the relatively long temporal resonance in the oscillator network output, and therefore model more coherent long-term structures. A system such as this could be used in a multitude of analytic and generative scenarios, including live performance applications.
Deep Learning-Based Goal Recognition in Open-Ended Digital Games
Min, Wookhee (North Carolina State University) | Ha, Eun Young (North Carolina State University) | Rowe, Jonathan (North Carolina State University) | Mott, Bradford (North Carolina State University) | Lester, James (North Carolina State University)
While many open-ended digital games feature non-linear storylines and multiple solution paths, it is challenging for game developers to create effective game experiences in these settings due to the freedom given to the player. To address these challenges, goal recognition, a computational player-modeling task, has been investigated to enable digital games to dynamically predict players' goals. This paper presents a goal recognition framework based on stacked denoising autoencoders, a variant of deep learning. The learned goal recognition models, which are trained from a corpus of player interactions, not only offer improved performance, but also offer the substantial advantage of eliminating the need for labor-intensive feature engineering. An evaluation demonstrates that the deep learning-based goal recognition framework significantly outperforms the previous state-of-the-art goal recognition approach based on Markov logic networks.
Domain Adaptive Neural Networks for Object Recognition
Ghifary, Muhammad, Kleijn, W. Bastiaan, Zhang, Mengjie
We propose a simple neural network model to deal with the domain adaptation problem in object recognition. Our model incorporates the Maximum Mean Discrepancy (MMD) measure as a regularization term in supervised learning to reduce the distribution mismatch between the source and target domains in the latent space. From experiments, we demonstrate that the MMD regularization is an effective tool to provide good domain adaptation models on both SURF features and raw image pixels of a particular image data set. We also show that our proposed model, preceded by denoising auto-encoder pretraining, achieves better performance than recent benchmark models on the same data sets. This work represents the first study of the MMD measure in the context of neural networks.
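As an illustration of the quantity used as a regularizer, the sketch below computes a biased empirical estimate of the squared MMD between two small samples. The Gaussian (RBF) kernel, its bandwidth, and the sample values are assumptions made here; the abstract does not state which kernel the authors use. In the paper's setting the two samples would be hidden-layer representations of source and target inputs.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples."""
    return (
        mean_kernel(source, source, bandwidth)
        + mean_kernel(target, target, bandwidth)
        - 2.0 * mean_kernel(source, target, bandwidth)
    )


# Illustrative 2-D samples; the second target set is shifted away from the source.
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
near_target = [[0.1, 0.0], [0.0, 0.2], [-0.2, -0.1]]
far_target = [[3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]

near_gap = mmd_squared(source, near_target)
far_gap = mmd_squared(source, far_target)
```

Adding a term of this kind to the supervised loss penalizes hidden representations whose source and target distributions differ, which is the role the abstract assigns to the MMD regularization.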
When Does a Mixture of Products Contain a Product of Mixtures?
Montufar, Guido F., Morton, Jason
We derive relations between theoretical properties of restricted Boltzmann machines (RBMs), popular machine learning models which form the building blocks of deep learning models, and several natural notions from discrete mathematics and convex geometry. We give implications and equivalences relating RBM-representable probability distributions, perfectly reconstructible inputs, Hamming modes, zonotopes and zonosets, point configurations in hyperplane arrangements, linear threshold codes, and multi-covering numbers of hypercubes. As a motivating application, we prove results on the relative representational power of mixtures of product distributions and products of mixtures of pairs of product distributions (RBMs) that formally justify widely held intuitions about distributed representations. In particular, we show that a mixture of products requiring an exponentially larger number of parameters is needed to represent the probability distributions which can be obtained as products of mixtures.
Raiders of the Lost Architecture: Kernels for Bayesian Optimization in Conditional Parameter Spaces
Swersky, Kevin, Duvenaud, David, Snoek, Jasper, Hutter, Frank, Osborne, Michael A.
In practical Bayesian optimization, we must often search over structures with differing numbers of parameters. For instance, we may wish to search over neural network architectures with an unknown number of layers. To relate performance data gathered for different architectures, we define a new kernel for conditional parameter spaces that explicitly includes information about which parameters are relevant in a given structure. We show that this kernel improves model quality and Bayesian optimization results over several simpler baseline kernels.