Deep Learning
Question/Answer Matching for CQA System via Combining Lexical and Sequential Information
Shen, Yikang (Beihang University) | Rong, Wenge (Beihang University) | Sun, Zhiwei (Beihang University) | Ouyang, Yuanxin (Beihang University) | Xiong, Zhang (Beihang University)
Community-based Question Answering (CQA) has become popular in knowledge sharing sites since it allows users to get answers to complex, detailed, and personal questions directly from other users. Large archives of historical questions and associated answers have been accumulated. Retrieving relevant historical answers that best match a question is an essential component of a CQA service. Most state of the art approaches are based on bag-of-words models, which have been proven successful in a range of text matching tasks, but are insufficient for capturing the important word sequence information in short text matching. In this paper, a new architecture is proposed to more effectively model the complicated matching relations between questions and answers. It utilises a similarity matrix which contains both lexical and sequential information. Afterwards the information is put into a deep architecture to find potentially suitable answers. The experimental study shows its potential in improving matching accuracy of question and answer.
To Drop or Not to Drop: Robustness, Consistency and Differential Privacy Properties of Dropout
Jain, Prateek, Kulkarni, Vivek, Thakurta, Abhradeep, Williams, Oliver
Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (upto an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus, preserves privacy of each of the individual training points while providing accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.
Learning Stochastic Recurrent Networks
Bayer, Justin, Osendorfer, Christian
A BSTRACT Leveraging advances in variational inference, we propose to enhance recurrent neural networks with latent variables, resulting in Stochastic Recurrent Networks (STORNs). The model i) can be trained with stochastic gradient methods, ii) allows structured and multi-modal conditionals at each time step, iii) features a reliable estimator of the marginal likelihood and iv) is a generalisation of deterministic recurrent neural networks. We evaluate the method on four polyphonic musical data sets and motion capture data. 1 I NTRODUCTION Recurrent Neural Networks (RNNs) are flexible and powerful tools for modeling sequences. While only bearing marginal existence in the 1990's, recent successes in real world applications (Graves, 2013; Graves et al., 2013; Sutskever et al., 2014; Graves et al., 2008; Cho et al., 2014) have resurged interest. This is partially due to architectural enhancements (Hochreiter & Schmidhuber, 1997), new optimisation findings (Martens & Sutskever, 2011; Sutskever et al., 2013; Bengio et al., 2012) and the increased computional power available to researchers.
Toxicity Prediction using Deep Learning
Unterthiner, Thomas, Mayr, Andreas, Klambauer, Gรผnter, Hochreiter, Sepp
Everyday we are exposed to various chemicals via food additives, cleaning and cosmetic products and medicines -- and some of them might be toxic. However testing the toxicity of all existing compounds by biological experiments is neither financially nor logistically feasible. Therefore the government agencies NIH, EPA and FDA launched the Tox21 Data Challenge within the "Toxicology in the 21st Century" (Tox21) initiative. The goal of this challenge was to assess the performance of computational methods in predicting the toxicity of chemical compounds. State of the art toxicity prediction methods build upon specifically-designed chemical descriptors developed over decades. Though Deep Learning is new to the field and was never applied to toxicity prediction before, it clearly outperformed all other participating methods. In this application paper we show that deep nets automatically learn features resembling well-established toxicophores. In total, our Deep Learning approach won both of the panel-challenges (nuclear receptors and stress response) as well as the overall Grand Challenge, and thereby sets a new standard in tox prediction.
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Goodfellow, Ian J., Mirza, Mehdi, Xiao, Da, Courville, Aaron, Bengio, Yoshua
Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.
Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks
Wang, Zhiguang (University of Maryland Baltimore County) | Oates, Tim (University of Maryland Baltimore County)
Inspired by recent successes of deep learning in computer vision and speech recognition, we propose a novel framework to encode time series data as different types of images, namely, Gramian Angular Fields (GAF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for classification. Using a polar coordinate system, GAF images are represented as a Gramian matrix where each element is the trigonometric sum (i.e., superposition of directions) between different time intervals. MTF images represent the first order Markov transition probability along one dimension and temporal dependency along the other. We used Tiled Convolutional Neural Networks (tiled CNNs) on 12 standard datasets to learn high-level features from individual GAF, MTF, and GAF-MTF images that resulted from combining GAF and MTF representations into a single image. The classification results of our approach are competitive with five stateof-the-art approaches. An analysis of the features and weights learned via tiled CNNs explains why the approach works.
Deep Apprenticeship Learning for Playing Video Games
Bogdanovic, Miroslav (University of Oxford) | Markovikj, Dejan (University of Oxford) | Denil, Misha (University of Oxford) | Freitas, Nando de (University of Oxford)
Recently it has been shown that deep neural networks can learn to play Atari games by directly observing raw pixels of the playing area. We show how apprenticeship learning can be applied in this setting so that an agent can learn to perform a task (i.e. play a game) by observing the expert, without any explicitly provided knowledge of the gameโs internal state or objectives.
Why does Deep Learning work? - A perspective from Group Theory
Paul, Arnab, Venkatasubramanian, Suresh
Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\em shadow} groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.
Random Walk Initialization for Training Very Deep Feedforward Networks
Sussillo, David, Abbott, L. F.
Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
Unsupervised Domain Adaptation by Backpropagation
Ganin, Yaroslav, Lempitsky, Victor
Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on large amount of labeled data from the source domain and large amount of unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of "deep" features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation. Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.