Learning Graphical Models
Modeling the Dynamics of Online Learning Activity
Mavroforakis, Charalampos, Valera, Isabel, Rodriguez, Manuel Gomez
Learning has become an online activity - people routinely use a wide variety of online learning platforms, ranging from wikis and question answering (Q&A) sites to online communities and blogs, to learn about a large range of topics. In this context, people find solutions to their problems by looking for closely related pieces of information, executing a sequence of queries or, more generally, performing a series of online actions. For example, a high school student may study several closely related wiki pages to prepare an essay about a historical event; a software developer may read several answers within a Q&A site to solve a specific programming problem; and, a researcher may check a specialized blog written by one of her peers to learn about a new concept or technique. All the above are examples of learning patterns, in which people perform a series of online actions - reading a wiki page, an answer, or a blog - to achieve a predefined goal - writing an essay, solving a programming problem, or learning about a new concept or technique. In this context, one may expect that people with similar goals undertake similar sequences of online actions and thus adopt similar learning patterns. Therefore, one could leverage the vast availability of online traces of users' learning activity to disambiguate among interleaved learning patterns adopted by individuals over time, as well as to automatically identify and track those people's interests and goals over time. In this work, we introduce a novel probabilistic model, the Hierarchical Dirichlet Hawkes Process (HDHP), for clustering continuous-time grouped streaming data, which we use to uncover the dynamics of learning activity on the web. 
The HDHP leverages the properties of the Hierarchical Dirichlet Process (HDP) [18], a popular Bayesian nonparametric model for clustering problems involving multiple groups of data, combined with the Hawkes process [13], a temporal point process particularly well fitted to model social activity [11, 19, 20]. In particular, the former is used to account for an infinite number of learning patterns, which are shared across users (groups) of an online learning platform.
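To make the Hawkes process ingredient concrete, here is a minimal sketch of a self-exciting conditional intensity. It assumes an exponential decay kernel and illustrative parameter values (`mu`, `alpha`, `beta`); the abstract above does not specify either, and this is not the HDHP itself, only the basic building block it draws on.

```python
import math


def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """Conditional intensity of a univariate Hawkes process.

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))

    mu is the base rate; alpha and beta (assumed values) set the size and
    decay speed of the excitation each past event contributes.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in history if t_i < t
    )
    return mu + excitation


# Hypothetical timestamps of one user's past actions.
events = [1.0, 1.5, 4.0]

before_any = hawkes_intensity(0.5, events)   # no past events yet: base rate only
just_after = hawkes_intensity(1.6, events)   # shortly after two events: elevated
long_after = hawkes_intensity(20.0, events)  # excitation has decayed away
```

Each past event temporarily raises the rate of future events, which is why this family of processes is a common choice for bursty online activity.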
Modeling community structure and topics in dynamic text networks
Henry, Teague, Banks, David, Chai, Christine, Owens-Oas, Derek
Dynamic text networks have been widely studied in recent years, primarily because the Internet stores textual data in a way that allows links between different documents. Articles on Wikipedia (Hoffman et al., 2010), citation networks in journal articles (Moody, 2004), and linked blog posts (Latouche et al., 2011) are examples of dynamic text networks, or networks of documents that are generated over time. But each application has idiosyncratic features, such as the structure of the links and the nature of the time-varying documents, so analysis typically requires bespoke models that directly address those aspects.
Generalization error minimization: a new approach to model evaluation and selection with an application to penalized regression
Xu, Ning, Hong, Jian, Fisher, Timothy C. G.
We study model evaluation and model selection from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. We believe that GA is one way formally to address concerns about the external validity of a model. The GA of a model estimated on a sample can be measured by its empirical out-of-sample errors, called the generalization errors (GE). We derive upper bounds for the GE, which depend on sample sizes, model complexity and the distribution of the loss function. The upper bounds can be used to evaluate the GA of a model, ex ante. We propose using generalization error minimization (GEM) as a framework for model selection. Using GEM, we are able to unify a large class of penalized regression estimators, including lasso, ridge and bridge, under the same set of assumptions. We establish finite-sample and asymptotic properties (including $\mathcal{L}_2$-consistency) of the GEM estimator for both the $n \geqslant p$ and the $n < p$ cases. We also derive the $\mathcal{L}_2$-distance between the penalized and corresponding unpenalized regression estimates. In practice, GEM can be implemented by validation or cross-validation. We show that the GE bounds can be used for selecting the optimal number of folds in $K$-fold cross-validation. We propose a variant of $R^2$, the $GR^2$, as a measure of GA, which considers both in-sample and out-of-sample goodness of fit. Simulations are used to demonstrate our key results.
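As a rough illustration of estimating out-of-sample error by $K$-fold cross-validation and using it to pick a penalty, the sketch below fits a one-predictor ridge regression without an intercept. The data, the penalty grid and the helper names (`ridge_slope`, `kfold_error`) are assumptions made for the example; this is generic cross-validation, not the authors' GEM procedure or their GE bounds.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for a no-intercept, one-predictor model:
    theta = sum(x * y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def kfold_error(xs, ys, lam, k):
    """Average held-out mean squared error over k folds.

    Fold j holds out every observation whose index i satisfies i % k == j.
    """
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        theta = ridge_slope(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx], lam
        )
        mse = sum((ys[i] - theta * xs[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Made-up data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

penalties = [0.0, 0.1, 1.0, 10.0]
cv_errors = {lam: kfold_error(xs, ys, lam, k=3) for lam in penalties}
best_lam = min(cv_errors, key=cv_errors.get)  # penalty with the lowest CV error
```

On data this clean, heavy shrinkage pulls the slope well below 2 and the held-out error rises, so the largest penalty is not selected.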
Fast Sampling for Bayesian Max-Margin Models
Hu, Wenbo, Zhu, Jun, Zhang, Bo
Bayesian max-margin models have shown superiority in various practical applications, such as text categorization, collaborative prediction, social network link prediction and crowdsourcing, and they conjoin the flexibility of Bayesian modeling and predictive strengths of max-margin learning. However, Monte Carlo sampling for these models still remains challenging, especially for applications that involve large-scale datasets. In this paper, we present the stochastic subgradient Hamiltonian Monte Carlo (HMC) methods, which are easy to implement and computationally efficient. We show the approximate detailed balance property of subgradient HMC which reveals a natural and validated generalization of the ordinary HMC. Furthermore, we investigate the variants that use stochastic subsampling and thermostats for better scalability and mixing. Using stochastic subgradient Markov Chain Monte Carlo (MCMC), we efficiently solve the posterior inference task of various Bayesian max-margin models and extensive experimental results demonstrate the effectiveness of our approach.
Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Dighe, Pranay, Asaei, Afsaneh, Bourlard, Herve
Conventional deep neural networks (DNN) for speech acoustic modeling rely on Gaussian mixture models (GMM) and hidden Markov models (HMM) to obtain binary class labels as the targets for DNN training. Subword classes in speech recognition systems correspond to context-dependent tied states or senones. The present work addresses some limitations of GMM-HMM senone alignments for DNN training. We hypothesize that the senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high-dimensional unstructured noise, whereas the informative components are structured and low-dimensional. We exploit principal component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft targets for DNN acoustic modeling, which also enables training with untranscribed data. Experiments conducted on the AMI corpus show a 4.6% relative reduction in word error rate.
A Bayesian Group Sparse Multi-Task Regression Model for Imaging Genetics
Greenlaw, Keelin, Szefer, Elena, Graham, Jinko, Lesperance, Mary, Nathoo, Farouk S.
Motivation: Recent advances in technology for brain imaging and high-throughput genotyping have motivated studies examining the influence of genetic variation on brain structure. Wang et al. (Bioinformatics, 2012) have developed an approach for the analysis of imaging genomic studies using penalized multi-task regression with regularization based on a novel group $l_{2,1}$-norm penalty which encourages structured sparsity at both the gene level and SNP level. While incorporating a number of useful features, the proposed method only furnishes a point estimate of the regression coefficients; techniques for conducting statistical inference are not provided. A new Bayesian method is proposed here to overcome this limitation. Results: We develop a Bayesian hierarchical modeling formulation where the posterior mode corresponds to the estimator proposed by Wang et al. (Bioinformatics, 2012), and an approach that allows for full posterior inference including the construction of interval estimates for the regression parameters. We show that the proposed hierarchical model can be expressed as a three-level Gaussian scale mixture and this representation facilitates the use of a Gibbs sampling algorithm for posterior simulation. Simulation studies demonstrate that the interval estimates obtained using our approach achieve adequate coverage probabilities that outperform those obtained from the nonparametric bootstrap. Our proposed methodology is applied to the analysis of neuroimaging and genetic data collected as part of the Alzheimer's Disease Neuroimaging Initiative (ADNI), and this analysis of the ADNI cohort demonstrates clearly the value added of incorporating interval estimation beyond only point estimation when relating SNPs to brain imaging endophenotypes.
X-CNN: Cross-modal Convolutional Neural Networks for Sparse Datasets
Veličković, Petar, Wang, Duo, Lane, Nicholas D., Liò, Pietro
In this paper we propose cross-modal convolutional neural networks (X-CNNs), a novel biologically inspired type of CNN architectures, treating gradient descent-specialised CNNs as individual units of processing in a larger-scale network topology, while allowing for unconstrained information flow and/or weight sharing between analogous hidden layers of the network---thus generalising the already well-established concept of neural network ensembles (where information typically may flow only between the output layers of the individual networks). The constituent networks are individually designed to learn the output function on their own subset of the input data, after which cross-connections between them are introduced after each pooling operation to periodically allow for information exchange between them. This injection of knowledge into a model (by prior partition of the input data through domain knowledge or unsupervised methods) is expected to yield greatest returns in sparse data environments, which are typically less suitable for training CNNs. For evaluation purposes, we have compared a standard four-layer CNN as well as a sophisticated FitNet4 architecture against their cross-modal variants on the CIFAR-10 and CIFAR-100 datasets with differing percentages of the training data being removed, and find that at lower levels of data availability, the X-CNNs significantly outperform their baselines (typically providing a 2--6% benefit, depending on the dataset size and whether data augmentation is used), while still maintaining an edge on all of the full dataset tests.
From both sides now: the math of linear regression
Linear regression is the most basic and the most widely used technique in machine learning; yet for all its simplicity, studying it can unlock some of the most important concepts in statistics. If you have a basic understanding of linear regression expressed as $\hat{Y} = \theta_0 + \theta_1 X$, but don't have a background in statistics and find statements like "ridge regression is equivalent to the maximum a posteriori (MAP) estimate with a zero-mean Gaussian prior" bewildering, then this post is for you. With a superficial goal of understanding that somewhat obtuse statement, its main objective is to explore the topic, starting from the standard formulation of linear regression, moving on to the probabilistic approach (maximum likelihood formulation) and from there to Bayesian linear regression. I'll use the $\theta$ character throughout to refer to the coefficients (weights) of a regression model, either explicitly broken out as $\theta_0$ and $\theta_1$ for intercept and slope respectively, or just $\theta$ referring to the vector of coefficients. I'll usually use the expression $\theta^T x_i$ for the prediction a model gives at $x_i$, the assumption being that a 1 has been added to the vector of values at $x_i$. In the single predictor case, we know that the least squares fit is the line that minimizes the sum of the squared distances between observed data and predicted values, i.e. it minimizes the Residual Sum of Squares (RSS): $\mathrm{RSS} = \sum_{i} (y_i - \theta^T x_i)^2$. These residuals are pretty important in how we reason about our model.
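The single-predictor least squares fit described above can be sketched in a few lines of plain Python. The data points and the function names (`fit_line`, `rss`) are made up for illustration; the slope and intercept use the standard closed-form solution for simple linear regression.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor.

    theta_1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
    theta_0 = y_bar - theta_1 * x_bar
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta_1 = sxy / sxx
    theta_0 = y_bar - theta_1 * x_bar
    return theta_0, theta_1


def rss(xs, ys, theta_0, theta_1):
    """Residual Sum of Squares: sum of squared gaps between y and the line."""
    return sum((y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up observations scattered around a line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.2]

theta_0, theta_1 = fit_line(xs, ys)
best_rss = rss(xs, ys, theta_0, theta_1)

# Nudging the intercept away from the fitted value can only increase the RSS.
perturbed_rss = rss(xs, ys, theta_0 + 0.1, theta_1)
```

Comparing `best_rss` with `perturbed_rss` is a quick sanity check that the fitted line really is the minimizer.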
Introduction to Machine Learning & Face Detection in Python
This course is about the fundamental concepts of machine learning, focusing on neural networks, SVM and decision trees. These topics are getting very hot nowadays because these learning algorithms can be used in several fields, from software engineering to investment banking. Learning algorithms can recognize patterns that help detect cancer, for example, or we may construct algorithms that make a very good guess about stock price movements in the market. In each section we will talk about the theoretical background for all of these algorithms, then we are going to implement them together. The first chapter is about regression: a very easy yet very powerful and widely used machine learning technique.
Probabilistic Dimensionality Reduction via Structure Learning
We propose a novel probabilistic dimensionality reduction framework that can naturally integrate the generative model and the locality information of data. Based on this framework, we present a new model, which is able to learn a smooth skeleton of embedding points in a low-dimensional space from high-dimensional noisy data. The formulation of the new model can be equivalently interpreted as two coupled learning problems, i.e., structure learning and the learning of a projection matrix. This interpretation motivates the learning of the embedding points that can directly form an explicit graph structure. We develop a new method to learn the embedding points that form a spanning tree, which is further extended to obtain a discriminative and compact feature representation for clustering problems. Unlike traditional clustering methods, we assume that centers of clusters should be close to each other if they are connected in a learned graph, and other cluster centers should be distant. This can greatly facilitate data visualization and scientific discovery in downstream analysis. Extensive experiments are performed that demonstrate that the proposed framework is able to obtain discriminative feature representations, and correctly recover the intrinsic structures of various real-world datasets.