Statistical Learning
Alternating Direction Methods for Latent Variable Gaussian Graphical Model Selection
Ma, Shiqian, Xue, Lingzhou, Zou, Hui
Chandrasekaran, Parrilo and Willsky (2010) proposed a convex optimization problem to characterize graphical model selection in the presence of unobserved variables. This convex optimization problem aims to estimate an inverse covariance matrix that can be decomposed into a sparse matrix minus a low-rank matrix from sample data. Solving this convex optimization problem is very challenging, especially for large problems. In this paper, we propose two alternating direction methods for solving this problem. The first method is to apply the classical alternating direction method of multipliers to solve the problem as a consensus problem. The second method is a proximal gradient based alternating direction method of multipliers. Our methods exploit and take advantage of the special structure of the problem and thus can solve large problems very efficiently. Global convergence result is established for the proposed methods. Numerical results on both synthetic data and gene expression data show that our methods usually solve problems with one million variables in one to two minutes, and are usually five to thirty five times faster than a state-of-the-art Newton-CG proximal point algorithm.
The Issue-Adjusted Ideal Point Model
Gerrish, Sean M., Blei, David M.
Legislative behavior centers around the votes made by lawmakers. These votes are captured in roll call data, a matrix with lawmakers in the rows and proposed legislation in the columns. We illustrate a sample of roll call votes for the United States Senate in Figure 1. The seminal work of Poole and Rosenthal (1985) introduced the ideal point model, using roll call data to infer the latent political positions of the lawmakers. The ideal point model is a latent factor model of binary data and an application of item-response theory (Lord 1980) to roll call data. It gives each lawmaker a latent political position along a single dimension and then uses these points (called the ideal points) in a model of the votes.
Bayesian Mixture Models for Frequent Itemset Discovery
In binary-transaction data-mining, traditional frequent itemset mining often produces results which are not straightforward to interpret. To overcome this problem, probability models are often used to produce more compact and conclusive results, albeit with some loss of accuracy. Bayesian statistics have been widely used in the development of probability models in machine learning in recent years and these methods have many advantages, including their abilities to avoid overfitting. In this paper, we develop two Bayesian mixture models with the Dirichlet distribution prior and the Dirichlet process (DP) prior to improve the previous non-Bayesian mixture model developed for transaction dataset mining. We implement the inference of both mixture models using two methods: a collapsed Gibbs sampling scheme and a variational approximation algorithm. Experiments in several benchmark problems have shown that both mixture models achieve better performance than a non-Bayesian mixture model. The variational algorithm is the faster of the two approaches while the Gibbs sampling method achieves a more accurate results. The Dirichlet process mixture model can automatically grow to a proper complexity for a better approximation. Once the model is built, it can be very fast to query and run analysis on (typically 10 times faster than Eclat, as we will show in the experiment section). However, these approaches also show that mixture models underestimate the probabilities of frequent itemsets. Consequently, these models have a higher sensitivity but a lower specificity.
Optimal Weighting of Multi-View Data with Low Dimensional Hidden States
In areas like Natural Language Processing, data often have multi-view and high dimension. Recently, CCA [8] has been applied to the multi-view setting as a unsupervised dimension reduction method in [7][10][3] with performance guarantee if the data is generated under certain structure. In [7], they assume the high dimensional multi-view data is generated independently conditioning on a low dimensional hidden state (the model structure will be illustrated later in detail). Under this assumption, the low dimensional features provided by CCA won't lose any useful information compared with the original high dimensional features when applied to linear regression. Also, [6] has applied this CCA method to generate a low dimensional vector representation of words which works well in a lot of NLP tasks. The reason for CCA to work well is that the low dimensional hidden state (throughout the paper we'll use k to denote the dimension of hidden state) 1 contains most information for the supervised tasks and by doing CCA, we are able to generate k dimensional estimate of the hidden state from each view as mentioned by [4], or more precisely, we can find all k directions in the high dimensional space of each view that have nonzero correlation with the hidden state via CCA. Only two views are enough to implement the CCA algorithms above (see [7] for detailed introduction about CCA). Despite it's power in dimension reduction, CCA with two views is still not optimal in the sense that it ends up with a hidden state estimator from each view but it's impossible to tell which view is better by only looking at the two views.
Towards a learning-theoretic analysis of spike-timing dependent plasticity
Balduzzi, David, Besserve, Michel
This paper suggests a learning-theoretic perspective on how synaptic plasticity benefits global brain functioning. We introduce a model, the selectron, that (i) arises as the fast time constant limit of leaky integrate-and-fire neurons equipped with spiking timing dependent plasticity (STDP) and (ii) is amenable to theoretical analysis. We show that the selectron encodes reward estimates into spikes and that an error bound on spikes is controlled by a spiking margin and the sum of synaptic weights. Moreover, the efficacy of spikes (their usefulness to other reward maximizing selectrons) also depends on total synaptic strength. Finally, based on our analysis, we propose a regularized version of STDP, and show the regularization improves the robustness of neuronal learning when faced with multiple stimuli.
High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity
Loh, Po-Ling, Wainwright, Martin J.
Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings.
Mining Social Data to Extract Intellectual Knowledge
Abstract-- Social data mining is an interesting phenomenon which colligates different sources of social data to extract information. This information can be used in relationship prediction, decision making, pattern recognition, social mapping, responsibility distribution and many other applications. This paper presents a systematical data mining architecture to mine intellectual knowledge from social data. In this research, we use social networking site facebook as primary data source. We collect different attributes such as about me, comments, wall post and age from facebook as raw data and use advanced data mining approaches to excavate intellectual knowledge. We also analyze our mined knowledge with comparison for possible usages like as human behavior prediction, pattern recognition, job responsibility distribution, decision making and product promoting.
On Move Pattern Trends in a Large Go Games Corpus
Baudiลก, Petr, Moudลรญk, Josef
We process a large corpus of game records of the board game of Go and propose a way of extracting summary information on played moves. We then apply several basic data-mining methods on the summary information to identify the most differentiating features within the summary information, and discuss their correspondence with traditional Go knowledge. We show statistically significant mappings of the features to player attributes such as playing strength or informally perceived "playing style" (e.g. territoriality or aggressivity), describe accurate classifiers for these attributes, and propose applications including seeding real-work ranks of internet players, aiding in Go study and tuning of Go-playing programs, or contribution to Go-theoretical discussion on the scope of "playing style".
A Bayesian Nonparametric Approach to Image Super-resolution
Polatkan, Gungor, Zhou, Mingyuan, Carin, Lawrence, Blei, David, Daubechies, Ingrid
Super-resolution methods form high-resolution images from low-resolution images. In this paper, we develop a new Bayesian nonparametric model for super-resolution. Our method uses a beta-Bernoulli process to learn a set of recurring visual patterns, called dictionary elements, from the data. Because it is nonparametric, the number of elements found is also determined from the data. We test the results on both benchmark and natural images, comparing with several other models from the research literature. We perform large-scale human evaluation experiments to assess the visual quality of the results. In a first implementation, we use Gibbs sampling to approximate the posterior. However, this algorithm is not feasible for large-scale data. To circumvent this, we then develop an online variational Bayes (VB) algorithm. This algorithm finds high quality dictionaries in a fraction of the time needed by the Gibbs sampler.
Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
Fard, Mahdi Milani, Grinberg, Yuri, Farahmand, Amir-massoud, Pineau, Joelle, Precup, Doina
We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.