Statistical Learning
Scalable MCMC for Mixed Membership Stochastic Blockmodels
Li, Wenzhe, Ahn, Sungjin, Welling, Max
We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in mixed-membership stochastic blockmodels (MMSB). Our algorithm is based on the stochastic gradient Riemannian Langevin sampler and achieves both faster speed and higher accuracy at every iteration than the current state-of-the-art algorithm based on stochastic variational inference. In addition we develop an approximation that can handle models that entertain a very large number of communities. The experimental results show that SG-MCMC strictly dominates competing algorithms in all cases.
Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions
Tokuda, Tomoki, Yoshimoto, Junichiro, Shimizu, Yu, Toki, Shigeru, Okada, Go, Takamura, Masahiro, Yamamoto, Tetsuya, Yoshimura, Shinpei, Okamoto, Yasumasa, Yamawaki, Shigeto, Doya, Kenji
We propose a novel method for multiple clustering that assumes a co-clustering structure (partitions in both rows and columns of the data matrix) in each view. The new method is applicable to high-dimensional data. It is based on a nonparametric Bayesian approach in which the number of views and the number of feature-/subject clusters are inferred in a data-driven manner. We simultaneously model different distribution families, such as Gaussian, Poisson, and multinomial distributions in each cluster block. This makes our method applicable to datasets consisting of both numerical and categorical variables, which biomedical data typically do. Clustering solutions are based on variational inference with mean field approximation. We apply the proposed method to synthetic and real data, and show that our method outperforms other multiple clustering methods both in recovering true cluster structures and in computation time. Finally, we apply our method to a depression dataset with no true cluster structure available, from which useful inferences are drawn about possible clustering structures of the data.
Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix
Cai, T. Tony, Liang, Tengyuan, Rakhlin, Alexander
The interplay between computational efficiency and statistical accuracy in high-dimensional inference has drawn increasing attention in the literature. In this paper, we study computational and statistical boundaries for submatrix localization. Given one observation of (one or multiple non-overlapping) signal submatrix (of magnitude $\lambda$ and size $k_m \times k_n$) contaminated with a noise matrix (of size $m \times n$), we establish two transition thresholds for the signal to noise $\lambda/\sigma$ ratio in terms of $m$, $n$, $k_m$, and $k_n$. The first threshold, $\sf SNR_c$, corresponds to the computational boundary. Below this threshold, it is shown that no polynomial time algorithm can succeed in identifying the submatrix, under the \textit{hidden clique hypothesis}. We introduce adaptive linear time spectral algorithms that identify the submatrix with high probability when the signal strength is above the threshold $\sf SNR_c$. The second threshold, $\sf SNR_s$, captures the statistical boundary, below which no method can succeed with probability going to one in the minimax sense. The exhaustive search method successfully finds the submatrix above this threshold. The results show an interesting phenomenon that $\sf SNR_c$ is always significantly larger than $\sf SNR_s$, which implies an essential gap between statistical optimality and computational efficiency for submatrix localization.
Similarity Learning for High-Dimensional Sparse Data
Liu, Kuan, Bellet, Aurélien, Sha, Fei
In many applications, such as text processing, computer vision or biology, data is represented as very highdimensional but sparse vectors. The ability to compute meaningful similarity scores between these objects is crucial to many tasks, such as classification, clustering or ranking. However, handcrafting a relevant similarity measure for such data is challenging because it is usually the case that only a small, often unknown subset of features is actually relevant to the task at hand. For instance, in drug discovery, chemical compounds can be represented as sparse features describing their 3D properties, and only a few of them play an role in determining whether the compound will bind to a target receptor (Guyon et al., 2004). In text classification, where each document is represented as a sparse bag of words, only a small subset of the words is generally sufficient to discriminate among documents of different topics. A principled way to obtain a similarity measure tailored to the problem of interest is to learn it from data. This line of research, known as similarity and distance metric learning, has been successfully applied to many application domains (see Kulis, 2012; Bellet et al., 2013, for recent surveys). The basic idea is to learn the parameters of a similarity (or distance) function such that it satisfies proximity-based constraints, requiring for instance that some data instance x be more similar to y than to z according to the learned function.
Constructing Dynamic Treatment Regimes in Infinite-Horizon Settings
The application of existing methods for constructing optimal dynamic treatment regimes is limited to cases where investigators are interested in optimizing a utility function over a fixed period of time (finite horizon). In this manuscript, we develop an inferential procedure based on temporal difference residuals for optimal dynamic treatment regimes in infinite-horizon settings, where there is no a priori fixed end of follow-up point. The proposed method can be used to determine the optimal regime in chronic diseases where patients are monitored and treated throughout their life. We derive large sample results necessary for conducting inference. We also simulate a cohort of patients with diabetes to mimic the third wave of the National Health and Nutrition Examination Survey, and we examine the performance of the proposed method in controlling the level of hemoglobin A1c. Supplementary materials for this article are available online.
Fast and Guaranteed Tensor Decomposition via Sketching
Wang, Yining, Tung, Hsiao-Yu, Smola, Alexander, Anandkumar, Animashree
Tensor CANDECOMP/PARAFAC (CP) decomposition has wide applications in statistical learning of latent variable models and in data mining. In this paper, we propose fast and randomized tensor CP decomposition algorithms based on sketching. We build on the idea of count sketches, but introduce many novel ideas which are unique to tensors. We develop novel methods for randomized computation of tensor contractions via FFTs, without explicitly forming the tensors. Such tensor contractions are encountered in decomposition methods such as tensor power iterations and alternating least squares. We also design novel colliding hashes for symmetric tensors to further save time in computing the sketches. We then combine these sketching ideas with existing whitening and tensor power iterative techniques to obtain the fastest algorithm on both sparse and dense tensors. The quality of approximation under our method does not depend on properties such as sparsity, uniformity of elements, etc. We apply the method for topic modeling and obtain competitive results.
Partition MCMC for inference on acyclic digraphs
Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. MCMC methods, notably structure MCMC, to sample graphs from the posterior distribution given the data are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain's convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and non ergodic edge reversal moves. Here we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally the method can be combined with edge reversal moves to improve the sampler further.
Dimensionality Reduction for Binary Data through the Projection of Natural Parameters
Landgraf, Andrew J., Lee, Yoonkyung
Principal component analysis (PCA) for binary data, known as logistic PCA, has become a popular alternative to dimensionality reduction of binary data. It is motivated as an extension of ordinary PCA by means of a matrix factorization, akin to the singular value decomposition, that maximizes the Bernoulli log-likelihood. We propose a new formulation of logistic PCA which extends Pearson's formulation of a low dimensional data representation with minimum error to binary data. Our formulation does not require a matrix factorization, as previous methods do, but instead looks for projections of the natural parameters from the saturated model. Due to this difference, the number of parameters does not grow with the number of observations and the principal component scores on new data can be computed with simple matrix multiplication. We derive explicit solutions for data matrices of special structure and provide computationally efficient algorithms for solving for the principal component loadings. Through simulation experiments and an analysis of medical diagnoses data, we compare our formulation of logistic PCA to the previous formulation as well as ordinary PCA to demonstrate its benefits.
Accelerometer based Activity Classification with Variational Inference on Sticky HDP-SLDS
Basbug, Mehmet Emin, Ozcan, Koray, Velipasalar, Senem
As part of daily monitoring of human activities, wearable sensors and devices are becoming increasingly popular sources of data. With the advent of smartphones equipped with acceloremeter, gyroscope and camera; it is now possible to develop activity classification platforms everyone can use conveniently. In this paper, we propose a fast inference method for an unsupervised non-parametric time series model namely variational inference for sticky HDP-SLDS(Hierarchical Dirichlet Process Switching Linear Dynamical System). We show that the proposed algorithm can differentiate various indoor activities such as sitting, walking, turning, going up/down the stairs and taking the elevator using only the acceloremeter of an Android smartphone Samsung Galaxy S4. We used the front camera of the smartphone to annotate activity types precisely. We compared the proposed method with Hidden Markov Models with Gaussian emission probabilities on a dataset of 10 subjects. We showed that the efficacy of the stickiness property. We further compared the variational inference to the Gibbs sampler on the same model and show that variational inference is faster in one order of magnitude.
Piecewise-Linear Approximation for Feature Subset Selection in a Sequential Logit Model
Sato, Toshiki, Takano, Yuichi, Miyashiro, Ryuhei
This paper concerns a method of selecting a subset of features for a sequential logit model. Tanaka and Nakagawa (2014) proposed a mixed integer quadratic optimization formulation for solving the problem based on a quadratic approximation of the logistic loss function. However, since there is a significant gap between the logistic loss function and its quadratic approximation, their formulation may fail to find a good subset of features. To overcome this drawback, we apply a piecewise-linear approximation to the logistic loss function. Accordingly, we frame the feature subset selection problem of minimizing an information criterion as a mixed integer linear optimization problem. The computational results demonstrate that our piecewise-linear approximation approach found a better subset of features than the quadratic approximation approach.