Learning Graphical Models
Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction--where the probability of class membership increases monotonically with the MF's value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic "20 Newsgroups" data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data.
Translated Learning: Transfer Learning across Different Feature Spaces
Dai, Wenyuan, Chen, Yuqiang, Xue, Gui-rong, Yang, Qiang, Yu, Yong
This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a "bridge" to link one feature space (known as the "source space") to another space (known as the "target space") through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods.
Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
Cohen, Shay B., Gimpel, Kevin, Smith, Noah A.
We explore a new Bayesian model for probabilistic grammars, a family of distributions over discrete structures that includes hidden Markov models and probabilistic context-free grammars. Our model extends the correlated topic model framework to probabilistic grammars, exploiting the logistic normal distribution as a prior over the grammar parameters. We derive a variational EM algorithm for that model, and then experiment with the task of unsupervised grammar induction for natural language dependency parsing. We show that our model achieves superior results over previous models that use different priors.
Differentiable Sparse Coding
Bagnell, J. A., Bradley, David M.
We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance.
The Infinite Factorial Hidden Markov Model
Gael, Jurgen V., Teh, Yee W., Ghahramani, Zoubin
We introduces a new probability distribution over a potentially infinite number of binary Markov chains which we call the Markov Indian buffet process. This process extends the IBP to allow temporal dependencies in the hidden variables. We use this stochastic process to build a nonparametric extension of the factorial hidden Markov model. After working out an inference scheme which combines slice sampling and dynamic programming we demonstrate how the infinite factorial hidden Markov model can be used for blind source separation.
Bayesian Model of Behaviour in Economic Games
Ray, Debajyoti, King-casas, Brooks, Montague, P. R., Dayan, Peter
Classical Game Theoretic approaches that make strong rationality assumptions have difficulty modeling observed behaviour in Economic games of human subjects. We investigate the role of finite levels of iterated reasoning and non-selfish utility functions in a Partially Observable Markov Decision Process model that incorporates Game Theoretic notions of interactivity. Our generative model captures a broad class of characteristic behaviours in a multi-round Investment game. We invert the generative process for a recognition model that is used to classify 200 subjects playing an Investor-Trustee game against randomly matched opponents.
A general framework for investigating how far the decoding process in the brain can be simplified
Oizumi, Masafumi, Ishii, Toshiyuki, Ishibashi, Kazuya, Hosoya, Toshihiko, Okada, Masato
``How is information decoded in the brain?'' is one of the most difficult and important questions in neuroscience. Whether neural correlation is important or not in decoding neural activities is of special interest. We have developed a general framework for investigating how far the decoding process in the brain can be simplified. First, we hierarchically construct simplified probabilistic models of neural responses that ignore more than $K$th-order correlations by using a maximum entropy principle. Then, we compute how much information is lost when information is decoded using the simplified models, i.e., ``mismatched decoders''. We introduce an information theoretically correct quantity for evaluating the information obtained by mismatched decoders. We applied our proposed framework to spike data for vertebrate retina. We used 100-ms natural movies as stimuli and computed the information contained in neural activities about these movies. We found that the information loss is negligibly small in population activities of ganglion cells even if all orders of correlation are ignored in decoding. We also found that if we assume stationarity for long durations in the information analysis of dynamically changing stimuli like natural movies, pseudo correlations seem to carry a large portion of the information.
Implicit Mixtures of Restricted Boltzmann Machines
Nair, Vinod, Hinton, Geoffrey E.
We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures three-way interactions among visible units, hidden units, and a single hidden multinomial unit that represents the cluster labels. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data.
DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
Lacoste-Julien, Simon, Sha, Fei, Jordan, Michael I.
Probabilistic topic models (and their extensions) have become popular as models of latent structures in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood estimation, an approach which may be suboptimal in the context of an overall classification problem. In this paper, we describe DiscLDA, a discriminative learning framework for such models as Latent Dirichlet Allocation (LDA) in the setting of dimensionality reduction with supervised side information. In DiscLDA, a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood using Monte Carlo EM. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroup ocument classification task.
Translated Learning: Transfer Learning across Different Feature Spaces
Dai, Wenyuan, Chen, Yuqiang, Xue, Gui-rong, Yang, Qiang, Yu, Yong
This paper investigates a new machine learning strategy called translated learning. Unlikemany previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. Animportant aspect of translated learning is to build a "bridge" to link one feature space (known as the "source space") to another space (known as the "target space")through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features inthe target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods.